Dataset: dutch_social
Tasks:
Multilinguality: multilingual
Size: 100K<n<1M
Language creators: crowdsourced
Annotation creators: machine-generated
Source datasets: original
License:
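The dataset consists of 10 files containing roughly 271,342 tweets. The tweets were filtered through the official Twitter API to keep those written in Dutch or posted by users who specified a location within the geographical boundaries of the Netherlands. Using natural language processing, we classified the tweets by HISCO code. Where users provided a location within Dutch boundaries, we also assigned them to their respective provinces. The objective of this dataset is to make research data publicly available in a FAIR (Findable, Accessible, Interoperable, Reusable) way. The data is subject to Twitter's Terms of Service and is licensed under Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) (2020-10-27).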
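Supported tasks: sentiment analysis, multi-label classification, entity extraction.
The text is primarily in Dutch, with some tweets in English and other languages. The BCP 47 codes are nl and en.
An example of a data instance: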
{
"full_text": "@pflegearzt @Friedelkorn @LAguja44 Pardon, wollte eigentlich das zitieren: \nhttps://t.co/ejO7bIMyj8\nMeine mentions sind inzw komplett undurchschaubar weil da Leute ihren supporterclub zwecks Likes zusammengerufen haben.",
"text_translation": "@pflegearzt @Friedelkorn @ LAguja44 Pardon wollte zitieren eigentlich das:\nhttps://t.co/ejO7bIMyj8\nMeine mentions inzw sind komplett undurchschaubar weil da Leute ihren supporter club Zwecks Likes zusammengerufen haben.",
"created_at": 1583756789000,
"screen_name": "TheoRettich",
"description": "I ❤️science, therefore a Commie. ☭ FALGSC: Part of a conspiracy which wants to achieve world domination. Tankie-Cornucopian. Ecology is a myth",
"desc_translation": "I ❤️science, Therefore a Commie. ☭ FALGSC: Part of a conspiracy How many followers wants to Achieve World Domination. Tankie-Cornucopian. Ecology is a myth",
"weekofyear": 11,
"weekday": 0,
"day": 9,
"month": 3,
"year": 2020,
"location": "Netherlands",
"point_info": "Nederland",
"point": "(52.5001698, 5.7480821, 0.0)",
"latitude": 52.5001698,
"longitude": 5.7480821,
"altitude": 0,
"province": "Flevoland",
"hisco_standard": null,
"hisco_code": null,
"industry": false,
"sentiment_pattern": 0,
"subjective_pattern": 0
}
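
The calendar fields in the record above can be cross-checked against `created_at`. The following is a minimal sketch, assuming that `created_at` is a Unix timestamp in milliseconds (UTC) and that `weekofyear` is the ISO 8601 week number; the card states neither, but both assumptions agree with this example record.

```python
from datetime import datetime, timezone

# Value copied from the example record.
# Assumption: epoch milliseconds, interpreted in UTC.
created_at_ms = 1583756789000

created = datetime.fromtimestamp(created_at_ms / 1000, tz=timezone.utc)
iso_year, iso_week, _ = created.isocalendar()

derived = {
    "weekofyear": iso_week,        # assumption: ISO 8601 week number
    "weekday": created.weekday(),  # Monday=0 ... Sunday=6
    "day": created.day,
    "month": created.month,
    "year": created.year,
}
print(created.isoformat(), derived)
```

If the derived values ever disagree with the stored columns, the timestamp was most likely localized to a different time zone before the calendar fields were computed.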
| Column Name | Description |
|---|---|
| full_text | Original text in the tweet |
| text_translation | English translation of the full text |
| created_at | Tweet creation time (a Unix timestamp in milliseconds in the example record) |
| screen_name | Username of the tweet author |
| description | Description as provided in the user's bio |
| desc_translation | English translation of the user's bio/description |
| location | Location information as provided in the user's bio |
| weekofyear | Week of the year of tweet creation |
| weekday | Day of the week of tweet creation; Monday = 0 ... Sunday = 6 |
| month | Month of tweet creation |
| year | Year of tweet creation |
| day | Day of the month of tweet creation |
| point_info | Point information derived from the location column |
| point | Tuple giving latitude, longitude, and altitude |
| latitude | geo-referencing information derived from location data |
| longitude | geo-referencing information derived from location data |
| altitude | geo-referencing information derived from location data |
| province | Province given location data of user |
| hisco_standard | HISCO standard key word; if available in tweet |
| hisco_code | HISCO standard code as derived from hisco_standard |
| industry | Whether the tweet talks about industry (True/False) |
| sentiment_pattern | Sentiment score, from -1.0 to 1.0 |
| subjective_pattern | Subjectivity score, from 0 to 1 |
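
In the example record the `point` column is stored as a string rather than as a native tuple. A minimal sketch for turning it back into numbers follows; the helper name `parse_point` is hypothetical, and treating an empty string as a missing value is an assumption.

```python
import ast
from typing import Optional, Tuple

def parse_point(point: str) -> Optional[Tuple[float, float, float]]:
    """Parse a "(lat, lon, alt)" string into a tuple of floats.

    Returns None for an empty string (assumed here to mean "missing").
    """
    if not point:
        return None
    # literal_eval only evaluates Python literals, so it is safe on untrusted text.
    lat, lon, alt = ast.literal_eval(point)
    return float(lat), float(lon), float(alt)

print(parse_point("(52.5001698, 5.7480821, 0.0)"))
```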
Missing values are replaced with empty strings or -1 (-100 for missing sentiment scores).
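
Because -1 is itself a valid sentiment score, the sentiment column needs its own sentinel. The sketch below maps the sentinels back to `None` before analysis. The helper `normalize_missing` is hypothetical, the sentiment column name `sentiment_pattern` is taken from the example record, and the mapping of sentinels to individual columns is an assumption, since the card does not spell it out.

```python
from typing import Any, Dict

# Assumption: only this column uses the -100 sentinel.
SENTIMENT_FIELDS = {"sentiment_pattern"}

def normalize_missing(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with sentinel values replaced by None."""
    cleaned = {}
    for key, value in record.items():
        if key in SENTIMENT_FIELDS:
            # -1 is a legitimate sentiment score, so only -100 counts as missing.
            missing = value == -100
        else:
            missing = value == "" or value == -1
        cleaned[key] = None if missing else value
    return cleaned

print(normalize_missing({"province": "", "hisco_code": -1, "sentiment_pattern": -100}))
```

Columns where -1 can be a real value (for example, a longitude outside the Netherlands) would need to be excluded from the generic rule.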
The data is split into a training set (60%), a validation set (20%), and a test set (20%).
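
For a rough sense of scale, the stated fractions can be applied to the approximate tweet count given in the description. This is back-of-the-envelope arithmetic only; the actual split sizes are not stated in the card and may differ by a few rows.

```python
TOTAL_TWEETS = 271_342  # approximate total stated in the card
FRACTIONS = {"train": 0.60, "validation": 0.20, "test": 0.20}

# Rounded per-split estimates; rounding means they need not sum to the total.
approx_sizes = {name: round(TOTAL_TWEETS * frac) for name, frac in FRACTIONS.items()}
print(approx_sizes)
```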
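[More Information Needed]
The tweets were identified using Twitter's API and filtered to those written in Dutch and/or posted by users who stated that they were located within the geographical boundaries of the Netherlands.
Who are the source language producers? The authors are Twitter users who specified a location within the geographical boundaries of the Netherlands, or who tweeted in Dutch.
Using natural language processing, we classified the tweets by industry and by HSN HISCO code. Based on the user's location, we also added their province. See the files/columns for details.
The tweets were also scored for sentiment and subjectivity. Sentiment scores range from -1 to +1, and subjectivity scores range from 0 to 1.
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
At the time of writing this data card, no anonymization has been applied to the tweets or the user data. Any personal or sensitive information that Twitter users shared may therefore be present in this dataset.
[More Information Needed]
[More Information Needed]
The dataset is provided for research purposes only. Please see the dataset license for additional information.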
Aakash Gupta, Th!nkEvolve Consulting and researcher at CoronaWhy
CC BY-NC 4.0
@data{FK2/MTPTL7_2020,
  author = {Gupta, Aakash},
  publisher = {COVID-19 Data Hub},
  title = {{Dutch social media collection}},
  year = {2020},
  version = {DRAFT VERSION},
  doi = {10.5072/FK2/MTPTL7},
  url = {https://doi.org/10.5072/FK2/MTPTL7}
}
Thanks to @skyprince999 for adding this dataset.