Quickstart
Requirements
syntaxgym is supported on Windows, OS X, and Linux systems. It wraps the LM Zoo standard, which requires Docker to run language model images.
Installation
You can install syntaxgym using the Python package manager pip:
pip install -U syntaxgym
Define your first test suite
Next, we’ll define a simple test suite: an experiment that tests a language model’s knowledge of some grammatical phenomenon. The test suite below targets English subject–verb number agreement.
For more information on the structure of SyntaxGym test suites, see The SyntaxGym architecture. The JSON standard for test suites is documented in Test suite JSON representation.
{
"meta": {"name": "Sample subject--verb suite", "metric": "sum"},
"predictions": [{"type": "formula", "formula": "(2;%mismatch%) > (2;%match%)"}],
"region_meta": {"1": "Subject NP", "2": "Verb", "3": "Continuation"},
"items": [
{
"item_number": 1,
"conditions": [
{"condition_name": "match",
"regions": [{"region_number": 1, "content": "The woman"},
{"region_number": 2, "content": "plays"},
{"region_number": 3, "content": "the guitar"}]},
{"condition_name": "mismatch",
"regions": [{"region_number": 1, "content": "The woman"},
{"region_number": 2, "content": "play"},
{"region_number": 3, "content": "the guitar"}]}
]
}
]
}
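Save this suite as my_suite.json, the filename used in the examples below. As a quick sanity check before running an evaluation, you can confirm the file is well-formed JSON and that every condition defines one region per region_meta entry. This is just a local check with the standard library, not SyntaxGym’s own validation:
import json

# Load the suite and confirm it parses as JSON.
with open("my_suite.json") as f:
    suite = json.load(f)

# Every condition of every item should define one region per entry
# in region_meta.
for item in suite["items"]:
    for condition in item["conditions"]:
        assert len(condition["regions"]) == len(suite["region_meta"])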
Run evaluations
Now we’ll evaluate the performance of a language model on our new test suite.
Pick a language model
SyntaxGym interfaces with LM Zoo language models, so we can pick any model from the LM Zoo registry. If you’re interested in evaluating your own language model in SyntaxGym, you’ll first need to prepare a model image under the LM Zoo standard. For more information, see the LM Zoo documentation.
We’ll use the model gpt2 as an example here.
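If you want to confirm from Python that a model is available, you can look it up in the registry by name; this uses the same get_registry call that appears in the full API example below:
from lm_zoo import get_registry

# Look up a model in the LM Zoo registry by name.
registry = get_registry()
model = registry["gpt2"]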
Command line usage
You can run evaluations using the syntaxgym command-line tool. The evaluation will return per-item prediction results given a particular language model.
$ syntaxgym run gpt2 my_suite.json
...
suite prediction_id item_number result
Sample subject--verb suite 0 1 True
The run command outputs a tab-separated list of per-item results on the test suite.
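Because the output is tab-separated, it is straightforward to post-process. A minimal sketch, assuming you have redirected the command’s output to a hypothetical file results.tsv:
import pandas as pd

# Hypothetical: results.tsv was produced by redirecting the CLI output, e.g.
#   syntaxgym run gpt2 my_suite.json > results.tsv
results = pd.read_csv("results.tsv", sep="\t")

# Fraction of items on which the suite's prediction held.
print(results["result"].mean())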
Python API usage
We can also trigger evaluations from Python scripts. For example, to replicate the above evaluation:
from lm_zoo import get_registry
from syntaxgym import compute_surprisals, evaluate
# Retrieve an LM Zoo ``Model`` instance for GPT2
model = get_registry()["gpt2"]
# Compute region-level surprisal data for our suite.
suite = compute_surprisals(model, "my_suite.json")
# Check predictions given the suite containing surprisals. This returns a
# Pandas data frame by default.
results = evaluate(suite)
print(results.to_csv(sep="\t"))
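The returned data frame mirrors the CLI output above. Assuming it carries the same prediction_id and boolean result columns (an assumption based on the tab-separated dump, not a documented schema), you could summarize per-prediction accuracy like this:
# Assumes ``prediction_id`` and boolean ``result`` columns, mirroring the
# tab-separated CLI output above.
print(results.groupby("prediction_id")["result"].mean())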