Dataset:

tweet_eval

Tasks:

Text Classification

Sub-tasks:

intent-classification multi-class-classification sentiment-classification

Languages:

English

Multilinguality:

monolingual

Size:

100K<n<1M 10K<n<100K 1K<n<10K

Language creators:

found

Annotations creators:

found

Source datasets:

extended|other-tweet-datasets

ArXiv:

arxiv:2010.12421

License:

license:unknown


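Dataset Card for tweet_eval

Dataset Summary

tweet_eval consists of seven heterogeneous tasks on Twitter, all framed as multi-class tweet classification. The tasks are irony, hate, offensive, stance, emoji, emotion, and sentiment. All tasks have been unified into a single benchmark: each dataset is provided in the same format, with fixed training, validation, and test splits.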
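
As a minimal sketch of how one configuration might be loaded: the snippet below assumes the Hugging Face `datasets` library is installed, that network access is available, and that the Hub identifier is `tweet_eval` as this card names it. The helper `load_config` and the constants `CONFIGS` and `SPLITS` are hypothetical names introduced for illustration. The seven tasks map to eleven configurations because stance is split by target.

```python
# Configuration names as they appear in this card (stance has five targets).
CONFIGS = [
    "emoji", "emotion", "hate", "irony", "offensive", "sentiment",
    "stance_abortion", "stance_atheism", "stance_climate",
    "stance_feminist", "stance_hillary",
]

# Every configuration ships the same three fixed splits.
SPLITS = ("train", "validation", "test")


def load_config(name):
    """Load one configuration as a DatasetDict (needs `datasets` and network access)."""
    if name not in CONFIGS:
        raise ValueError(f"unknown configuration: {name!r}")
    # Imported lazily so the name check above works without the library installed.
    from datasets import load_dataset

    # Assumption: the Hub identifier is "tweet_eval", as named in this card.
    return load_dataset("tweet_eval", name)
```

With the library available, `load_config("irony")["train"][0]` would be expected to return a dict with `text` and `label` keys, matching the instances shown in this card.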

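Supported Tasks and Leaderboards

text_classification: the dataset can be used to train a sequence classification model from HuggingFace transformers (for example, one loaded with AutoModelForSequenceClassification).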

Languages

The text in the dataset is English, as written by Twitter users.

Dataset Structure

Data Instances

An instance from the emoji config:

{'label': 12, 'text': 'Sunday afternoon walking through Venice in the sun with @user ️ ️ ️ @ Abbot Kinney, Venice'}

An instance from the emotion config:

{'label': 2, 'text': "“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer.  #motivation #leadership #worry"}

An instance from the hate config:

{'label': 0, 'text': '@user nice new signage. Are you not concerned by Beatlemania -style hysterical crowds crongregating on you…'}

An instance from the irony config:

{'label': 1, 'text': 'seeing ppl walking w/ crutches makes me really excited for the next 3 weeks of my life'}

An instance from the offensive config:

{'label': 0, 'text': '@user Bono... who cares. Soon people will understand that they gain nothing from following a phony celebrity. Become a Leader of your people instead or help and support your fellow countrymen.'}

An instance from the sentiment config:

{'label': 2, 'text': '"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"'}

An instance from the stance_abortion config:

{'label': 1, 'text': 'we remind ourselves that love means to be willing to give until it hurts - Mother Teresa'}

An instance from the stance_atheism config:

{'label': 1, 'text': '@user Bless Almighty God, Almighty Holy Spirit and the Messiah. #SemST'}

An instance from the stance_climate config:

{'label': 0, 'text': 'Why Is The Pope Upset?  via @user #UnzippedTruth #PopeFrancis #SemST'}

An instance from the stance_feminist config:

{'label': 1, 'text': "@user @user is the UK's answer to @user and @user  #GamerGate #SemST"}

An instance from the stance_hillary config:

{'label': 1, 'text': "If a man demanded staff to get him an ice tea he'd be called a sexists elitist pig.. Oink oink #Hillary #SemST"}

Data Fields

For the emoji config:

text : a string feature containing the tweet.
label : an integer classification label with the following mapping:

0 : ❤

1 : 😍

2 : 😂

3 : 💕

4 : 🔥

5 : 😊

6 : 😎

7 : ✨

8 : 💙

9 : 😘

10 : 📷

11 : 🇺🇸

12 : ☀

13 : 💜

14 : 😉

15 : 💯

16 : 😁

17 : 🎄

18 : 📸

19 : 😜
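
The mapping above can be kept as a simple lookup table. The following is a minimal sketch, not part of the dataset's own tooling; `EMOJI_LABELS` and `decode_emoji` are hypothetical names, and the list order is copied from the mapping in this card.

```python
# Emoji label mapping copied from this card; the list index is the label id.
EMOJI_LABELS = [
    "❤", "😍", "😂", "💕", "🔥", "😊", "😎", "✨", "💙", "😘",
    "📷", "🇺🇸", "☀", "💜", "😉", "💯", "😁", "🎄", "📸", "😜",
]


def decode_emoji(label):
    """Return the emoji for an integer label, rejecting ids outside 0..19."""
    if not 0 <= label < len(EMOJI_LABELS):
        raise ValueError(f"emoji label out of range: {label}")
    return EMOJI_LABELS[label]


# The emoji instance shown in this card carries label 12.
decoded = decode_emoji(12)
```

For the emoji instance shown in this card (label 12), the lookup yields the sun emoji, which fits the tweet's "in the sun".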

For the emotion config:

text : a string feature containing the tweet.
label : an integer classification label with the following mapping:

0 : anger

1 : joy

2 : optimism

3 : sadness

For the hate config:

text : a string feature containing the tweet.
label : an integer classification label with the following mapping:

0 : non-hate

1 : hate

For the irony config:

text : a string feature containing the tweet.
label : an integer classification label with the following mapping:

0 : non-irony

1 : irony

For the offensive config:

text : a string feature containing the tweet.
label : an integer classification label with the following mapping:

0 : non-offensive

1 : offensive

For the sentiment config:

text : a string feature containing the tweet.
label : an integer classification label with the following mapping:

0 : negative

1 : neutral

2 : positive

For the stance_abortion config:

text : a string feature containing the tweet.
label : an integer classification label with the following mapping:

0 : none

1 : against

2 : favor

For the stance_atheism config:

text : a string feature containing the tweet.
label : an integer classification label with the following mapping:

0 : none

1 : against

2 : favor

For the stance_climate config:

text : a string feature containing the tweet.
label : an integer classification label with the following mapping:

0 : none

1 : against

2 : favor

For the stance_feminist config:

text : a string feature containing the tweet.
label : an integer classification label with the following mapping:

0 : none

1 : against

2 : favor

For the stance_hillary config:

text : a string feature containing the tweet.
label : an integer classification label with the following mapping:

0 : none

1 : against

2 : favor
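
The label mappings for the non-emoji configurations can be collected into one table, which also gives the number of classes a classifier head would need per configuration. This is a sketch under assumptions: `LABEL_NAMES`, `label_name`, and `NUM_LABELS` are hypothetical names, and the strings are English renderings of the labels in this card, so the spelling used by the dataset's own features may differ.

```python
# The five stance configurations share a single three-way mapping.
STANCE_NAMES = ["none", "against", "favor"]

# Label names per configuration; the list index is the integer label.
# Assumption: English renderings of this card's labels, not verified
# against the dataset's own feature metadata.
LABEL_NAMES = {
    "emotion": ["anger", "joy", "optimism", "sadness"],
    "hate": ["non-hate", "hate"],
    "irony": ["non-irony", "irony"],
    "offensive": ["non-offensive", "offensive"],
    "sentiment": ["negative", "neutral", "positive"],
}
for target in ("abortion", "atheism", "climate", "feminist", "hillary"):
    LABEL_NAMES[f"stance_{target}"] = STANCE_NAMES

# Number of classes per configuration, e.g. for sizing a classification head.
NUM_LABELS = {config: len(names) for config, names in LABEL_NAMES.items()}


def label_name(config, label):
    """Translate an integer label into its name for the given configuration."""
    names = LABEL_NAMES[config]
    if not 0 <= label < len(names):
        raise ValueError(f"label {label} out of range for {config!r}")
    return names[label]
```

Applied to the instances shown in this card, the emotion example (label 2) decodes to `optimism` and the stance_hillary example (label 1) decodes to `against`.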

Data Splits

name	train	validation	test
emoji	45000	5000	50000
emotion	3257	374	1421
hate	9000	1000	2970
irony	2862	955	784
offensive	11916	1324	860
sentiment	45615	2000	12284
stance_abortion	587	66	280
stance_atheism	461	52	220
stance_climate	355	40	169
stance_feminist	597	67	285
stance_hillary	620	69	295
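
To compare how the examples are distributed across splits, the table above can be turned into per-configuration totals and shares. This is an illustrative sketch; `SPLIT_SIZES`, `total_examples`, and `split_shares` are hypothetical names, and the numbers are copied from the table.

```python
# (train, validation, test) sizes copied from the table in this card.
SPLIT_SIZES = {
    "emoji": (45000, 5000, 50000),
    "emotion": (3257, 374, 1421),
    "hate": (9000, 1000, 2970),
    "irony": (2862, 955, 784),
    "offensive": (11916, 1324, 860),
    "sentiment": (45615, 2000, 12284),
    "stance_abortion": (587, 66, 280),
    "stance_atheism": (461, 52, 220),
    "stance_climate": (355, 40, 169),
    "stance_feminist": (597, 67, 285),
    "stance_hillary": (620, 69, 295),
}


def total_examples(config):
    """Sum the three split sizes of one configuration."""
    return sum(SPLIT_SIZES[config])


def split_shares(config):
    """Return the fraction of a configuration's examples in each split."""
    train, validation, test = SPLIT_SIZES[config]
    total = train + validation + test
    return {
        "train": train / total,
        "validation": validation / total,
        "test": test / total,
    }
```

One thing this makes visible: the emoji configuration has 100,000 examples in total and its test split holds half of them, so it is larger than the training split.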

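Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa-Anke and Leonardo Neves, of Cardiff NLP.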

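Licensing Information

This is not a single dataset, so each subset has its own license (the collection itself adds no further restrictions).

All of the datasets require complying with the Twitter Terms of Service and the Twitter API Terms of Service.

Additionally, the licenses are:

emoji: Undefined
emotion (EmoInt): Undefined
hate (HateEval): Permission required
irony: Undefined
Offensive: Undefined
Sentiment: Creative Commons Attribution 3.0 Unported License
Stance: Undefined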

Citation Information

@inproceedings{barbieri2020tweeteval,
title={{TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification}},
author={Barbieri, Francesco and Camacho-Collados, Jose and Espinosa-Anke, Luis and Neves, Leonardo},
booktitle={Proceedings of Findings of EMNLP},
year={2020}
}

If you use any of the TweetEval datasets, please also cite their original publications:

Emotion Recognition:

@inproceedings{mohammad2018semeval,
  title={Semeval-2018 task 1: Affect in tweets},
  author={Mohammad, Saif and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
  booktitle={Proceedings of the 12th international workshop on semantic evaluation},
  pages={1--17},
  year={2018}
}

Emoji Prediction:

@inproceedings{barbieri2018semeval,
  title={Semeval 2018 task 2: Multilingual emoji prediction},
  author={Barbieri, Francesco and Camacho-Collados, Jose and Ronzano, Francesco and Espinosa-Anke, Luis and
    Ballesteros, Miguel and Basile, Valerio and Patti, Viviana and Saggion, Horacio},
  booktitle={Proceedings of The 12th International Workshop on Semantic Evaluation},
  pages={24--33},
  year={2018}
}

Irony Detection:

@inproceedings{van2018semeval,
  title={Semeval-2018 task 3: Irony detection in english tweets},
  author={Van Hee, Cynthia and Lefever, Els and Hoste, V{\'e}ronique},
  booktitle={Proceedings of The 12th International Workshop on Semantic Evaluation},
  pages={39--50},
  year={2018}
}

Hate Speech Detection:

@inproceedings{basile-etal-2019-semeval,
    title = "{S}em{E}val-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in {T}witter",
    author = "Basile, Valerio  and Bosco, Cristina  and Fersini, Elisabetta  and Nozza, Debora and Patti, Viviana and
      Rangel Pardo, Francisco Manuel  and Rosso, Paolo  and Sanguinetti, Manuela",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/S19-2007",
    doi = "10.18653/v1/S19-2007",
    pages = "54--63"
}

Offensive Language Identification:

@inproceedings{zampieri2019semeval,
  title={SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)},
  author={Zampieri, Marcos and Malmasi, Shervin and Nakov, Preslav and Rosenthal, Sara and Farra, Noura and Kumar, Ritesh},
  booktitle={Proceedings of the 13th International Workshop on Semantic Evaluation},
  pages={75--86},
  year={2019}
}

Sentiment Analysis:

@inproceedings{rosenthal2017semeval,
  title={SemEval-2017 task 4: Sentiment analysis in Twitter},
  author={Rosenthal, Sara and Farra, Noura and Nakov, Preslav},
  booktitle={Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017)},
  pages={502--518},
  year={2017}
}

Stance Detection:

@inproceedings{mohammad2016semeval,
  title={Semeval-2016 task 6: Detecting stance in tweets},
  author={Mohammad, Saif and Kiritchenko, Svetlana and Sobhani, Parinaz and Zhu, Xiaodan and Cherry, Colin},
  booktitle={Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)},
  pages={31--41},
  year={2016}
}

Contributions

Thanks to @gchhablani and @abhishekkrthakur for adding this dataset.

Author:

Anonymous

Dataset size:

61.68 KB