lm-zoo commands

lm-zoo
lm-zoo provides black-box access to computing with state-of-the-art language models.
lm-zoo [OPTIONS] COMMAND [ARGS]...
Options

--backend <backend>
    Specify a backend (containerization platform) to run the specified model.
    Options: docker | singularity

-v, --verbose
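For example, to select the Singularity backend (the option precedes the subcommand, per the usage line above):

lm-zoo --backend singularity list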
get-predictions

Compute token-level predictive distributions from a language model for the given natural language sentences.

INFILE should be a raw natural language text file with one sentence per line.

This command writes an HDF5 file to the given OUTFILE, with the following structure:
/sentence/<i>/predictions: N_tokens_i * N_vocabulary array of
    log-probabilities (rows are log-probability distributions)
/sentence/<i>/tokens: sequence of integer token IDs corresponding to
    indices in ``/vocabulary``
/vocabulary: byte-encoded string array of vocabulary items (decode with
    ``numpy.char.decode(vocabulary, "utf-8")``)
lm-zoo get-predictions [OPTIONS] MODEL INFILE OUTFILE
Arguments

MODEL
    Required argument

INFILE
    Required argument

OUTFILE
    Required argument
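As an illustration, the resulting file can be inspected with h5py. This is a minimal sketch, not part of the lm-zoo API: the filename predictions.h5 and the 0-based sentence index are assumptions for the example.

    import h5py
    import numpy as np

    # Read the HDF5 structure documented above. "predictions.h5" and the
    # sentence index 0 are illustrative assumptions.
    with h5py.File("predictions.h5", "r") as f:
        # Decode the byte-encoded vocabulary array.
        vocabulary = np.char.decode(f["/vocabulary"][()], "utf-8")

        tokens = f["/sentence/0/tokens"][()]            # integer token IDs
        predictions = f["/sentence/0/predictions"][()]  # N_tokens x N_vocabulary log-probs

        # Map token IDs back to vocabulary strings.
        print([vocabulary[t] for t in tokens])

        # Each row of `predictions` is a log-probability distribution over the
        # vocabulary; recover the highest-probability item at each position.
        print([vocabulary[i] for i in predictions.argmax(axis=1)])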
get-surprisals

Get token-level surprisals from a language model for the given natural language text. Tab-separated results will be sent to standard output, following the format:
sentence_id       token_id        token   surprisal
1                 1               This    0.000
1                 2               is      1.000
1                 3               a       1.000
1                 4               <unk>   0.500
1                 5               line    1.000
1                 6               .       0.250
1                 7               <eos>   0.100
The surprisal of a token \(w_i\) is the negative logarithm of that token’s probability under a language model’s predictive distribution:

\[
s(w_i) = -\log_2 p(w_i \mid w_1, \ldots, w_{i-1})
\]

Note that surprisals are computed on the level of tokens, not words. Models that insert extra tokens (e.g., an end-of-sentence token as above) or which tokenize on the sub-word level (e.g. GPT2) will not have a one-to-one mapping between rows of surprisal output from this command and words.

There is guaranteed to be a one-to-one mapping, however, between the rows of this file and the tokens produced by lm-zoo tokenize.
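For instance, taking the logarithm base 2 (so that surprisal is measured in bits), a token assigned probability 0.5 has surprisal \(-\log_2 0.5 = 1\) bit, as for the token "is" in the sample output above.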
lm-zoo get-surprisals [OPTIONS] MODEL FILE
Arguments

MODEL
    Required argument

FILE
    Required argument
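For downstream analysis, the tab-separated output can be loaded directly into a dataframe. A minimal sketch, assuming the command’s standard output was redirected to a hypothetical file surprisals.tsv:

    import pandas as pd

    # Load the tab-separated table documented above; the first row is the
    # header (sentence_id, token_id, token, surprisal).
    df = pd.read_csv("surprisals.tsv", sep="\t")

    # Total surprisal of each sentence: sum over its tokens.
    print(df.groupby("sentence_id")["surprisal"].sum())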
list

List language models available in the central repository.

lm-zoo list [OPTIONS]

Options

--short
    Output just a list of shortnames rather than a pretty list
tokenize

Tokenize natural-language text according to a model’s preprocessing standards.

FILE should be a raw natural language text file with one sentence per line.

This command returns a text file with one tokenized sentence per line, with tokens separated by single spaces. For each sentence, there is a one-to-one mapping between the tokens output by this command and the tokens used by the get-surprisals command.
lm-zoo tokenize [OPTIONS] MODEL FILE
Arguments

MODEL
    Required argument

FILE
    Required argument
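As an illustration, the command can also be driven programmatically. A minimal sketch, assuming the model shortname gpt2 (any entry from lm-zoo list works), a hypothetical input file sentences.txt, and that the tokenized text is written to standard output:

    import subprocess

    # Invoke the CLI and capture its tokenized output.
    result = subprocess.run(
        ["lm-zoo", "tokenize", "gpt2", "sentences.txt"],
        capture_output=True, text=True, check=True)

    # One tokenized sentence per line, tokens separated by single spaces.
    tokenized = [line.split(" ") for line in result.stdout.splitlines()]
    print(tokenized)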
unkify

Detect unknown words for a language model for the given natural language text.

FILE should be a raw natural language text file with one sentence per line.

This command returns a text file with one sentence per line, where each sentence is represented as a sequence of 0 and 1 values. These values correspond one-to-one with the model’s tokenization of the sentence (as returned by lm-zoo tokenize). The value 0 indicates that the corresponding token is in the model’s vocabulary; the value 1 indicates that the corresponding token is an unknown word for the model.
lm-zoo unkify [OPTIONS] MODEL FILE
Arguments

MODEL
    Required argument

FILE
    Required argument
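Because the unkify mask aligns one-to-one with the tokenize output, the two can be zipped together to flag out-of-vocabulary tokens. A minimal sketch, assuming the outputs of both commands for the same input were saved to hypothetical files tokens.txt and unks.txt, with whitespace-separated values:

    # Pair each token with its unknown-word flag.
    with open("tokens.txt") as tf, open("unks.txt") as uf:
        for token_line, unk_line in zip(tf, uf):
            tokens = token_line.split()
            flags = [int(x) for x in unk_line.split()]
            # A flag of 1 marks a token outside the model's vocabulary.
            print([t for t, flag in zip(tokens, flags) if flag == 1])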
