- Saffron ACL data: includes the top 15 topics for each publication (based on the Saffron score).
- Sample domain models: contains 3 domain models for Computer Science, Food and Agriculture, and the Biomedical domain.
- Sample topical hierarchies: includes 3 topical hierarchies automatically constructed for Computational Linguistics, Finance, and Semantic Web.
- Expert search evaluation: evaluation dataset for domain-specific expert search based on workshop program committees.
BitterCorpus – Bilingual IT Terminology Annotated Corpus
- The dataset (generated in collaboration with HLT FBK) contains two annotated corpora produced to evaluate monolingual and bilingual domain-specific term extractors. To download the dataset, please visit this webpage.
PE²rr – PostEdited and ERRor annotated corpus
- The PE²rr corpus contains machine translations, their post-edited versions and error annotations of the performed edit-operations. In addition, particular language-related issues are defined for each sentence where possible. The examples below illustrate the corpus for the English-Slovene translation direction. To download the dataset, please visit this webpage.
Polylingual WordNet
- Polylingual WordNet is an extension of Princeton WordNet developed by Mihael Arcan, John P. McCrae and Paul Buitelaar at the Insight Centre for Data Analytics at the National University of Ireland Galway. It extends WordNet to 23 languages by automatic translation and is released in both OntoLex JSON-LD and the Global WordNet LMF format. This resource is available for re-use under the Creative Commons Attribution 4.0 License. To download the dataset, please visit the Polylingual WordNet webpage.
Subjunctive Mood Dataset
Suggestion Mining Datasets
- Sentences from different domains, each tagged as suggestion or non-suggestion. Published at the following two conferences.
Emotion Annotated Tweets
- Tweet data annotated on each of four emotion dimensions: Valence, Arousal, Dominance and Surprise. The resource contains 2019 tweets annotated in two ways: individually on a 5-point ordinal scale, and as tweet pairs through pairwise comparisons.
- Ekman annotated tweets: A set of 360 tweets containing common emoji annotated by many annotators on the presence or absence of each of the 6 emotion categories identified by Ekman: Joy, Sadness, Surprise, Anger, Fear and Disgust. Annotations were conducted with the emoji removed from the tweets. [Reference]