lm-zoo commands

lm-zoo

lm-zoo provides black-box access to computing with state-of-the-art language models.

lm-zoo [OPTIONS] COMMAND [ARGS]...
Options

--backend <backend>
    Specify a backend (containerization platform) to run the specified model.
    Options: docker | singularity

-v, --verbose
get-predictions
Compute token-level predictive distributions from a language model for the given natural language sentences.
INFILE should be a raw natural language text file with one sentence per line.
This command writes a HDF5 file to the given OUTFILE, with the following structure:
/sentence/<i>/predictions: N_tokens_i * N_vocabulary array of
log-probabilities (rows are log-probability distributions)
/sentence/<i>/tokens: sequence of integer token IDs corresponding to
indices in ``/vocabulary``
/vocabulary: byte-encoded string array of vocabulary items (decode with
``numpy.char.decode(vocabulary, "utf-8")``)
lm-zoo get-predictions [OPTIONS] MODEL INFILE OUTFILE
Arguments

MODEL
    Required argument

INFILE
    Required argument

OUTFILE
    Required argument
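The HDF5 layout described above can be explored with h5py. The following sketch is illustrative only: it writes a tiny synthetic file in the documented structure (the data and the filename `predictions.h5` are invented here, not produced by lm-zoo) and then reads it back the way the documentation suggests, including the `numpy.char.decode` step for the vocabulary:

```python
import h5py
import numpy as np

# Write a synthetic file mimicking the documented layout
# (3 tokens in sentence 0, vocabulary of 5 items).
vocabulary = np.char.encode(np.array(["<unk>", "this", "is", "a", "test"]), "utf-8")
predictions = np.log(np.full((3, 5), 0.2))  # rows are log-probability distributions
tokens = np.array([1, 2, 4])                # indices into /vocabulary

with h5py.File("predictions.h5", "w") as f:
    f["/sentence/0/predictions"] = predictions
    f["/sentence/0/tokens"] = tokens
    f["/vocabulary"] = vocabulary

# Read the file back and map token IDs to vocabulary strings.
with h5py.File("predictions.h5", "r") as f:
    vocab = np.char.decode(f["/vocabulary"][()], "utf-8")
    preds = f["/sentence/0/predictions"][()]
    toks = f["/sentence/0/tokens"][()]

words = [vocab[i] for i in toks]
print(words)  # the sentence's tokens as strings
```

Each row of `preds` is a log-probability distribution over the whole vocabulary for the corresponding token position.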
get-surprisals
Get word-level surprisals from a language model for the given natural language text. Tab-separated results will be sent to standard output, following the format:
sentence_id token_id token surprisal
1 1 This 0.000
1 2 is 1.000
1 3 a 1.000
1 4 <unk> 0.500
1 5 line 1.000
1 6 . 0.250
1 7 <eos> 0.100
The surprisal of a token \(w_i\) is the negative logarithm of that token's probability under a language model's predictive distribution:

\[S(w_i) = -\log_2 p(w_i \mid w_1, \ldots, w_{i-1})\]
Note that surprisals are computed on the level of tokens, not words. Models that insert extra tokens (e.g., an end-of-sentence token as above) or which tokenize on the sub-word level (e.g. GPT2) will not have a one-to-one mapping between rows of surprisal output from this command and words.
There is guaranteed to be a one-to-one mapping, however, between the rows of this file and the tokens produced by lm-zoo tokenize.
lm-zoo get-surprisals [OPTIONS] MODEL FILE
Arguments

MODEL
    Required argument

FILE
    Required argument
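Because the output is plain tab-separated text with a header row, it can be consumed with nothing beyond the Python standard library. This sketch parses the example output shown above (embedded here as a string) and sums per-sentence surprisals; the sum of a sentence's token surprisals equals that sentence's total negative log-probability under the model:

```python
import csv
import io
from collections import defaultdict

# The example output from the documentation, as tab-separated text.
tsv = """sentence_id\ttoken_id\ttoken\tsurprisal
1\t1\tThis\t0.000
1\t2\tis\t1.000
1\t3\ta\t1.000
1\t4\t<unk>\t0.500
1\t5\tline\t1.000
1\t6\t.\t0.250
1\t7\t<eos>\t0.100
"""

# Accumulate total surprisal per sentence.
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    totals[row["sentence_id"]] += float(row["surprisal"])

print(dict(totals))  # per-sentence surprisal totals
```

In practice you would replace the embedded string with the file captured from `lm-zoo get-surprisals`' standard output.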
list
List language models available in the central repository.
lm-zoo list [OPTIONS]
Options

--short
    Output just a list of shortnames rather than a pretty list
tokenize
Tokenize natural-language text according to a model’s preprocessing standards.
FILE should be a raw natural language text file with one sentence per line.
This command returns a text file with one tokenized sentence per line, with tokens separated by single spaces. For each sentence, there is a one-to-one mapping between the tokens output by this command and the tokens used by the get-surprisals command.
lm-zoo tokenize [OPTIONS] MODEL FILE
Arguments

MODEL
    Required argument

FILE
    Required argument
unkify
Detect unknown words for a language model for the given natural language text.
FILE should be a raw natural language text file with one sentence per line.
This command returns a text file with one sentence per line, where each sentence is represented as a sequence of 0 and 1 values. These values correspond one-to-one with the model's tokenization of the sentence (as returned by lm-zoo tokenize). The value 0 indicates that the corresponding token is in the model's vocabulary; the value 1 indicates that the corresponding token is an unknown word for the model.
lm-zoo unkify [OPTIONS] MODEL FILE
Arguments

MODEL
    Required argument

FILE
    Required argument
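Since unkify output aligns one-to-one with tokenize output, the two streams can simply be zipped together to recover which tokens are out-of-vocabulary. A minimal sketch, using one hypothetical line of tokenize output and its matching unkify mask (the example sentence is invented for illustration):

```python
# One line of `lm-zoo tokenize` output and the matching line of
# `lm-zoo unkify` output (hypothetical example data).
tokens = "this is a gallimaufry .".split()
mask = [int(x) for x in "0 0 0 1 0".split()]

# A value of 1 marks a token that is unknown to the model.
unknown = [tok for tok, is_unk in zip(tokens, mask) if is_unk == 1]
print(unknown)  # ['gallimaufry']
```

The same pairing works file-to-file: read both output files line by line and zip each sentence's tokens against its 0/1 sequence.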