- Saffron ACL data: includes the top 15 topics for each publication (based on the Saffron score).
- Sample domain models: contains 3 domain models for Computer Science, Food and Agriculture, and the Biomedical domain.
- Sample topical hierarchies: includes 3 topical hierarchies automatically constructed for Computational Linguistics, Finance, and Semantic Web.
- Expert search evaluation: evaluation dataset for domain-specific expert search based on workshop program committees.
BitterCorpus – Bilingual IT Terminology Annotated Corpus
- The dataset (generated in collaboration with HLT FBK) contains two annotated corpora produced to evaluate monolingual and bilingual domain-specific term extractors. To download the dataset, please visit this webpage.
PE²rr – PostEdited and ERRor annotated corpus
- The PE²rr corpus contains machine translations, their post-edited versions, and error annotations for the edit operations performed. In addition, particular language-related issues are identified for each sentence where possible. The examples below illustrate the corpus for the English-Slovene translation direction. To download the dataset, please visit this webpage.
Polylingual WordNet
- Polylingual WordNet is an extension of Princeton WordNet developed by Mihael Arcan, John P. McCrae and Paul Buitelaar at the Insight Centre for Data Analytics at the National University of Ireland Galway. It extends WordNet to 23 languages through automatic translation and is released both as OntoLex JSON-LD and in the Global WordNet LMF format. This resource is available for re-use under the Creative Commons Attribution 4.0 License. To download the dataset, please visit the Polylingual WordNet webpage.