syntaxgym

syntaxgym is a Python package which provides easy, standardized, reproducible access to targeted syntactic evaluations of language models. It replicates the core behavior of the SyntaxGym website.

Quick example

You can define targeted syntactic evaluations using our standard JSON format. Here’s a simple one-item evaluation which tests language models’ knowledge of subject–verb number agreement:

  {
    "meta": {"name": "Sample subject--verb suite", "metric": "sum"},
    "predictions": [{"type": "formula", "formula": "(2;%mismatch%) > (2;%match%)"}],
    "region_meta": {"1": "Subject NP", "2": "Verb", "3": "Continuation"},
    "items": [
      {
        "item_number": 1,
        "conditions": [
          {"condition_name": "match",
           "regions": [{"region_number": 1, "content": "The woman"},
                       {"region_number": 2, "content": "plays"},
                       {"region_number": 3, "content": "the guitar"}]},
          {"condition_name": "mismatch",
           "regions": [{"region_number": 1, "content": "The woman"},
                       {"region_number": 2, "content": "play"},
                       {"region_number": 3, "content": "the guitar"}]}
        ]
      }
    ]
  }
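The prediction formula `(2;%mismatch%) > (2;%match%)` asserts that total surprisal at region 2 (the verb) should be higher in the mismatch condition than in the match condition. Here is a minimal plain-Python sketch of that interpretation; the surprisal values and the `prediction_holds` helper are hypothetical, for illustration only, not part of the syntaxgym API:

```python
# Hypothetical per-region surprisal values, keyed by (condition, region_number).
# Real values would come from a language model's word-level surprisals,
# summed within each region.
surprisals = {
    ("match", 2): 4.1,      # surprisal of "plays" after "The woman" (hypothetical)
    ("mismatch", 2): 7.8,   # surprisal of "play" after "The woman" (hypothetical)
}

def prediction_holds(surprisals):
    """(2;%mismatch%) > (2;%match%): the agreement-violating verb should be
    more surprising than the grammatical one."""
    return surprisals[("mismatch", 2)] > surprisals[("match", 2)]

print(prediction_holds(surprisals))  # True for these hypothetical values
```

A model "passes" an item when the formula holds; suite-level accuracy is then the fraction of items for which each prediction holds.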

You can then use syntaxgym to evaluate a language model’s performance on this test. Our tool is integrated with the LM Zoo, so you can instantly use any of the models available in the Zoo.

Below, we evaluate GPT-2’s performance on the test suite:

$ syntaxgym run gpt2 my_suite.json
...
suite                           prediction_id   item_number     result
Sample subject--verb suite      0               1               True

We can do the same thing using a Python API:

from lm_zoo import get_registry
from syntaxgym import compute_surprisals, evaluate

# Retrieve the GPT-2 model from the LM Zoo registry.
model = get_registry()["gpt2"]

# Compute per-region surprisals for every item in the suite.
suite = compute_surprisals(model, "my_suite.json")

# Check each of the suite's predictions against the computed surprisals.
results = evaluate(suite)
print(results.to_csv(sep="\t"))
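The evaluation produces one row per prediction and item, mirroring the command-line output above. Assuming the results behave like a pandas DataFrame (as the `to_csv` call suggests), a suite-level accuracy can be summarized in one line; the table below is a hypothetical stand-in for real output, not data produced by syntaxgym:

```python
import pandas as pd

# Hypothetical results table, shaped like the command-line output above.
results = pd.DataFrame({
    "suite": ["Sample subject--verb suite"],
    "prediction_id": [0],
    "item_number": [1],
    "result": [True],
})

# Accuracy: the fraction of (prediction, item) pairs where the prediction held.
accuracy = results["result"].mean()
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 1.00
```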

Next steps

For more information on getting started, please see our Quickstart guide.


Acknowledgements

SyntaxGym and LM Zoo are maintained by the MIT Computational Psycholinguistics Laboratory.

If you use the website or command-line tools in your research, we ask that you please cite the ACL 2020 system demonstration paper:

@inproceedings{gauthier-etal-2020-syntaxgym,
    title = "{S}yntax{G}ym: An Online Platform for Targeted Evaluation of Language Models",
    author = "Gauthier, Jon and Hu, Jennifer and Wilcox, Ethan and Qian, Peng and Levy, Roger",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.10",
    pages = "70--76",
}

If you use the original test suites, models, or results presented on the website, please cite the ACL 2020 long paper:

@inproceedings{hu-etal-2020-systematic,
    title = "A Systematic Assessment of Syntactic Generalization in Neural Language Models",
    author = "Hu, Jennifer and Gauthier, Jon and Qian, Peng and Wilcox, Ethan and Levy, Roger",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.158",
    pages = "1725--1744",
}