lm-zoo commands

lm-zoo

lm-zoo provides black-box access to computation with state-of-the-art language models.

lm-zoo [OPTIONS] COMMAND [ARGS]...

Options

--backend <backend>

Specify a backend (containerization platform) with which to run the requested model.

Options

docker|singularity

-v, --verbose

Enable verbose output.

get-predictions

Compute token-level predictive distributions from a language model for the given natural language sentences.

INFILE should be a raw natural language text file with one sentence per line.

This command writes an HDF5 file to the given OUTFILE, with the following structure:

/sentence/<i>/predictions: N_tokens_i * N_vocabulary array of
    log-probabilities (rows are log-probability distributions)
/sentence/<i>/tokens: sequence of integer token IDs corresponding to
    indices in ``/vocabulary``
/vocabulary: byte-encoded string array of vocabulary items (decode with
    ``numpy.char.decode(vocabulary, "utf-8")``)
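
The resulting file can be read with standard HDF5 tools. Below is a minimal sketch using h5py and numpy; the model name, file paths, and the sentence index base are illustrative assumptions, not part of the interface:

import h5py
import numpy as np

# Illustrative invocation that would produce this file:
#   lm-zoo get-predictions gpt2 sentences.txt predictions.h5
with h5py.File("predictions.h5", "r") as f:
    # Decode the byte-encoded vocabulary, as described above.
    vocabulary = np.char.decode(f["/vocabulary"][()], "utf-8")

    # Sentence groups live under /sentence/<i>; the index base (0 or 1)
    # is an assumption here; inspect list(f["sentence"]) to confirm.
    tokens = f["/sentence/0/tokens"][()]            # integer token IDs
    predictions = f["/sentence/0/predictions"][()]  # N_tokens x N_vocabulary

    for position, token_id in enumerate(tokens):
        # Assuming row i holds the predictive distribution for token i, look
        # up the log-probability the model assigned to the observed token.
        print(vocabulary[token_id], predictions[position, token_id])
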
lm-zoo get-predictions [OPTIONS] MODEL INFILE OUTFILE

Arguments

MODEL

Required argument

INFILE

Required argument

OUTFILE

Required argument

get-surprisals

Get token-level surprisals from a language model for the given natural language text. Tab-separated results are written to standard output, following the format:

sentence_id       token_id        token   surprisal
1                 1               This    0.000
1                 2               is      1.000
1                 3               a       1.000
1                 4               <unk>   0.500
1                 5               line    1.000
1                 6               .       0.250
1                 7               <eos>   0.100

The surprisal of a token \(w_i\) is the negative base-2 logarithm of that token’s probability under a language model’s predictive distribution:

\[S(w_i) = -\log_2 p(w_i \mid w_1, w_2, \ldots, w_{i-1})\]
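
For example, a token assigned probability \(p = 0.25\) has surprisal \(-\log_2 0.25 = 2\) bits, while a token predicted with certainty (\(p = 1\)) has surprisal 0.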

Note that surprisals are computed at the level of tokens, not words. Models that insert extra tokens (e.g., the end-of-sentence token above) or that tokenize at the sub-word level (e.g., GPT-2) will not yield a one-to-one mapping between the rows output by this command and the words of the input.

There is, however, guaranteed to be a one-to-one mapping between the rows of this output and the tokens produced by lm-zoo tokenize.
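
Because the output is plain tab-separated text with a header row, it is easy to consume downstream. A minimal sketch that totals surprisal per sentence, assuming the output was redirected to a hypothetical file surprisals.tsv:

import csv
from collections import defaultdict

# Illustrative invocation that would produce this file:
#   lm-zoo get-surprisals gpt2 sentences.txt > surprisals.tsv
totals = defaultdict(float)
with open("surprisals.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        # Summing token surprisals gives each sentence's total surprisal,
        # i.e. its negative log2-probability under the model.
        totals[row["sentence_id"]] += float(row["surprisal"])

for sentence_id, total in sorted(totals.items(), key=lambda kv: int(kv[0])):
    print(sentence_id, total)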

lm-zoo get-surprisals [OPTIONS] MODEL FILE

Arguments

MODEL

Required argument

FILE

Required argument

list

List language models available in the central repository.

lm-zoo list [OPTIONS]

Options

--short

Output just a list of model shortnames rather than a pretty-printed listing.

tokenize

Tokenize natural-language text according to a model’s preprocessing standards.

FILE should be a raw natural language text file with one sentence per line.

This command returns a text file with one tokenized sentence per line, with tokens separated by single spaces. For each sentence, there is a one-to-one mapping between the tokens output by this command and the tokens used by the get-surprisals command.
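
This correspondence makes it easy to align tokenized text with surprisal rows. A minimal sketch, with hypothetical file names and model:

import csv

# Illustrative invocations:
#   lm-zoo tokenize gpt2 sentences.txt > tokens.txt
#   lm-zoo get-surprisals gpt2 sentences.txt > surprisals.tsv
with open("tokens.txt") as f:
    sentences = [line.split() for line in f]

with open("surprisals.tsv", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

row_iter = iter(rows)
for sentence in sentences:
    for token in sentence:
        row = next(row_iter)
        # Positions correspond one-to-one; the token string itself may be
        # rendered differently (e.g. as <unk>) in the surprisal output.
        print(token, row["surprisal"])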

lm-zoo tokenize [OPTIONS] MODEL FILE

Arguments

MODEL

Required argument

FILE

Required argument

unkify

Detect words in the given natural language text that are unknown to a language model.

FILE should be a raw natural language text file with one sentence per line.

This command returns a text file with one sentence per line, where each sentence is represented as a sequence of 0 and 1 values. These values correspond one-to-one with the model’s tokenization of the sentence (as returned by lm-zoo tokenize). The value 0 indicates that the corresponding token is in the model’s vocabulary; the value 1 indicates that the corresponding token is an unknown word for the model.
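
For example, the mask can be zipped against the output of lm-zoo tokenize to recover the unknown tokens themselves. A minimal sketch, with hypothetical file names and model:

# Illustrative invocations:
#   lm-zoo tokenize gpt2 sentences.txt > tokens.txt
#   lm-zoo unkify gpt2 sentences.txt > unks.txt
with open("tokens.txt") as tf, open("unks.txt") as uf:
    for token_line, mask_line in zip(tf, uf):
        tokens = token_line.split()
        mask = [int(v) for v in mask_line.split()]
        # A value of 1 marks a token that is unknown to the model.
        unknown = [t for t, m in zip(tokens, mask) if m == 1]
        print(unknown)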

lm-zoo unkify [OPTIONS] MODEL FILE

Arguments

MODEL

Required argument

FILE

Required argument