Saffron data

  • Saffron ACL data: includes the top 15 topics for each publication (ranked by Saffron score).
  • Sample domain models: contains 3 domain models for Computer Science, Food and Agriculture, and the Biomedical domain.
  • Sample topical hierarchies: includes 3 topical hierarchies automatically constructed for Computational Linguistics, Finance, and Semantic Web.
  • Expert search evaluation: evaluation dataset for domain-specific expert search based on workshop program committees.

BitterCorpus – Bilingual IT Terminology Annotated Corpus

  • The dataset (generated in collaboration with HLT FBK) contains two annotated corpora produced to evaluate monolingual and bilingual domain-specific term extractors. To download the dataset, please visit this webpage.

PE²rr – PostEdited and ERRor annotated corpus

  • The PE²rr corpus contains machine translations, their post-edited versions, and error annotations of the edit operations performed. In addition, particular language-related issues are defined for each sentence where possible. The examples below illustrate the corpus for the English-Slovene translation direction. To download the dataset, please visit this webpage.

Polylingual WordNet

Subjunctive Mood Dataset

Example Sentences for Subjunctive Mood: sentences that contain the subjunctive mood. [Reference]

Suggestion Mining Datasets

Sentences from different domains tagged as suggestion or non-suggestion. Published at the following two conferences:

EMNLP 2015


Emotion Annotated Tweets

  • Tweet data annotated on each of four emotion dimensions: Valence, Arousal, Dominance and Surprise. The resource contains 2019 tweets, annotated both individually on a 5-point ordinal scale and in pairs through pairwise comparisons.
  • Ekman annotated tweets: A set of 360 tweets containing common emoji annotated by many annotators on the presence or absence of each of the 6 emotion categories identified by Ekman: Joy, Sadness, Surprise, Anger, Fear and Disgust. Annotations were conducted with the emoji removed from the tweets. [Reference]