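Dataset Card for "cardiffnlp/tweet_topic_single"

Dataset Summary

This is the official repository of TweetTopic ("Twitter Topic Classification", COLING main conference 2022), a topic classification dataset on Twitter with 6 labels. Each TweetTopic instance carries a timestamp, ranging from September 2019 to August 2021. For the multi-label version of TweetTopic, see cardiffnlp/tweet_topic_multi. The tweet collection used in TweetTopic is the same as that used in TweetNER7. The dataset is also integrated in TweetNLP.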

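Preprocessing

We pre-process tweets before annotation to normalize some artifacts: URLs are converted into the special token {{URL}} and non-verified usernames into {{USERNAME}}. For verified usernames, we wrap the display name (or account name) in the symbols {@ and @}. For example, the tweet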

Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from @herbiehancock
via @bluenoterecords link below: 
http://bluenote.lnk.to/AlbumOfTheWeek

is transformed into the following text.

Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from {@herbiehancock@}
via {@bluenoterecords@} link below: {{URL}}

A simple function to format a tweet follows.

import re
from urlextract import URLExtract

extractor = URLExtract()

def format_tweet(tweet):
    # mask web urls
    urls = extractor.find_urls(tweet)
    for url in urls:
        tweet = tweet.replace(url, "{{URL}}")
    # format twitter account: wrap each @mention as {@mention@}
    tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r'\1{\2@}', tweet)
    return tweet

target = """Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"""
target_format = format_tweet(target)
print(target_format)
# Output:
# Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}
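
The function above depends on the urlextract package, and its mention pattern requires a word boundary before the mention, so a mention at the very start of a tweet is left unchanged. Where that matters, a standard-library-only variant could look like the sketch below. The name format_tweet_simple and both regular expressions are illustrative assumptions, not part of the original annotation pipeline, and the plain http(s) pattern is likely to miss URLs that urlextract would catch.

```python
import re

# Assumption: only http(s) links need masking; bare domains are not detected.
URL_PATTERN = re.compile(r"https?://\S+")
# Assumption: a mention is "@" plus word characters, not preceded by a word
# character (so e-mail addresses such as a@b.com are left alone).
MENTION_PATTERN = re.compile(r"(?<!\w)@(\w+)")


def format_tweet_simple(tweet: str) -> str:
    """Hypothetical stdlib-only variant of format_tweet."""
    # Mask URLs first so that nothing inside a link is treated as a mention.
    tweet = URL_PATTERN.sub("{{URL}}", tweet)
    # Wrap every mention as {@name@}, including one at the start of the text.
    return MENTION_PATTERN.sub(r"{@\1@}", tweet)


print(format_tweet_simple("@herbiehancock new album: http://bluenote.lnk.to/AlbumOfTheWeek"))
```

Like format_tweet, this sketch wraps every mention in {@ @}; deciding which accounts are verified (and which should become {{USERNAME}}) needs account metadata that plain text does not carry.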

Data Splits

split	number of texts	description
test_2020	376	test dataset from September 2019 to August 2020
test_2021	1693	test dataset from September 2020 to August 2021
train_2020	2858	training dataset from September 2019 to August 2020
train_2021	1516	training dataset from September 2020 to August 2021
train_all	4374	combined training dataset of train_2020 and train_2021
validation_2020	352	validation dataset from September 2019 to August 2020
validation_2021	189	validation dataset from September 2020 to August 2021
train_random	2830	randomly sampled training dataset with the same size as train_2020 from train_all
validation_random	354	randomly sampled validation dataset with the same size as validation_2020 from validation_all
test_coling2022_random	3399	random split used in the COLING 2022 paper
train_coling2022_random	3598	random split used in the COLING 2022 paper
test_coling2022	3399	temporal split used in the COLING 2022 paper
train_coling2022	3598	temporal split used in the COLING 2022 paper
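
The counts in the table can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the table by hand; the dictionary name split_sizes is only illustrative.

```python
# Counts copied from the split table above.
split_sizes = {
    "train_2020": 2858, "validation_2020": 352, "test_2020": 376,
    "train_2021": 1516, "validation_2021": 189, "test_2021": 1693,
    "train_all": 4374,
}

# train_all is described as the combination of train_2020 and train_2021.
combined_train = split_sizes["train_2020"] + split_sizes["train_2021"]
print(combined_train == split_sizes["train_all"])

# Total number of texts per collection period.
for period in ("2020", "2021"):
    total = sum(split_sizes[f"{role}_{period}"] for role in ("train", "validation", "test"))
    print(period, total)
```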

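For the temporal-shift setting, the model should be trained on train_2020, validated on validation_2020, and evaluated on test_2021. In the general setting, the model should be trained on train_all, the most representative training set, validated on validation_2021, and evaluated on test_2021.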

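IMPORTANT NOTE: To obtain results comparable with those of the COLING 2022 Tweet Topic paper, use train_coling2022 and test_coling2022 for the temporal-shift setting, and train_coling2022_random and test_coling2022_random for the random split (the coling2022 splits have no validation set).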
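
The recommended split combinations can be collected in one place, as in the sketch below. The split names come from the table; the dictionary keys, the helper name load_protocol, and the use of validation_2021 as the validation split for the general setting are assumptions made for illustration. Loading relies on the Hugging Face datasets package and network access.

```python
# Split names are taken from the table above; the protocol keys are illustrative.
SPLIT_PROTOCOLS = {
    "temporal_shift": {
        "train": "train_2020", "validation": "validation_2020", "test": "test_2021",
    },
    "general": {
        "train": "train_all", "validation": "validation_2021", "test": "test_2021",
    },
    # The coling2022 splits have no validation set.
    "coling2022_temporal": {
        "train": "train_coling2022", "validation": None, "test": "test_coling2022",
    },
    "coling2022_random": {
        "train": "train_coling2022_random", "validation": None,
        "test": "test_coling2022_random",
    },
}


def load_protocol(name):
    """Hypothetical helper: load every split that the chosen protocol defines."""
    from datasets import load_dataset  # imported lazily; requires `pip install datasets`

    return {
        role: load_dataset("cardiffnlp/tweet_topic_single", split=split)
        for role, split in SPLIT_PROTOCOLS[name].items()
        if split is not None
    }
```

For example, load_protocol("temporal_shift") would return a dictionary with "train", "validation", and "test" entries, while the two coling2022 protocols return only "train" and "test".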

Models

model	training data	F1	F1 (macro)	Accuracy
12310321	all (2020 + 2021)	0.896043	0.800061	0.896043
12311321	all (2020 + 2021)	0.887773	0.79793	0.887773
12312321	all (2020 + 2021)	0.892499	0.774494	0.892499
12313321	all (2020 + 2021)	0.890136	0.776025	0.890136
12314321	all (2020 + 2021)	0.894861	0.800952	0.894861
12315321	2020 only	0.878913	0.70565	0.878913
12316321	2020 only	0.868281	0.729667	0.868281
12317321	2020 only	0.882457	0.740187	0.882457
12318321	2020 only	0.87596	0.746275	0.87596
12319321	2020 only	0.877732	0.746119	0.877732
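
The F1 and Accuracy columns are identical in every row. That is what one would expect if the F1 column is micro-averaged, because in single-label classification every error counts as exactly one false positive and one false negative; the table does not state the averaging, so this reading is an assumption. The sketch below computes micro F1, macro F1, and accuracy from scratch on toy labels (the function name f1_scores and the toy data are illustrative, not the evaluation script behind the table).

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1, accuracy) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return micro, macro, accuracy


# Toy data: class 0 is frequent and easy, class 2 is rare and always missed.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 0, 0]
micro, macro, accuracy = f1_scores(y_true, y_pred)
print(f"micro={micro:.4f} macro={macro:.4f} accuracy={accuracy:.4f}")
```

Macro F1 weights all classes equally, so it drops when rare classes are predicted poorly, which is consistent with the macro scores in the table sitting below the other two columns.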

The model fine-tuning script can be found here.

Dataset Structure

Data Instances

An example from train looks as follows.

{
    "text": "Game day for {{USERNAME}} U18\u2019s against {{USERNAME}} U18\u2019s. Even though it\u2019s a \u2018home\u2019 game for the people that have settled in Mid Wales it\u2019s still a 4 hour round trip for us up to Colwyn Bay. Still enjoy it though!",
    "date": "2019-09-08",
    "label": 4,
    "id": "1170606779568463874",
    "label_name": "sports_&_gaming"
}
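
A minimal sketch of reading such an instance with the standard library follows. The record is an abbreviated copy of the example above (the text field is shortened), and the variable names are illustrative.

```python
import json
from datetime import date

# Abbreviated copy of the instance above; a raw string keeps the \u escapes
# intact so that json.loads decodes them.
raw = r'''{
    "text": "Game day for {{USERNAME}} U18\u2019s against {{USERNAME}} U18\u2019s.",
    "date": "2019-09-08",
    "label": 4,
    "id": "1170606779568463874",
    "label_name": "sports_&_gaming"
}'''

record = json.loads(raw)

# "date" is an ISO-formatted string; parse it to compare against the
# collection window stated in the summary (September 2019 to August 2021).
tweet_date = date.fromisoformat(record["date"])
in_window = date(2019, 9, 1) <= tweet_date <= date(2021, 8, 31)

# "id" is stored as a string; Python integers can hold it without loss.
tweet_id = int(record["id"])

# Count the masked non-verified usernames in the text.
n_masked_users = record["text"].count("{{USERNAME}}")

print(tweet_date, in_window, tweet_id, n_masked_users)
```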

Label ID

The label2id dictionary can be found here.

{
    "arts_&_culture": 0,
    "business_&_entrepreneurs": 1,
    "pop_culture": 2,
    "daily_life": 3,
    "sports_&_gaming": 4,
    "science_&_technology": 5
}
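
To turn a predicted label ID back into its name, the mapping can be inverted. The sketch below copies label2id from above; id2label is simply the name chosen here for the inverse.

```python
label2id = {
    "arts_&_culture": 0,
    "business_&_entrepreneurs": 1,
    "pop_culture": 2,
    "daily_life": 3,
    "sports_&_gaming": 4,
    "science_&_technology": 5,
}

# Invert the mapping; IDs are unique, so no entries are lost.
id2label = {idx: name for name, idx in label2id.items()}

# The data instance shown earlier has label 4.
print(id2label[4])
```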

Citation Information

@inproceedings{dimosthenis-etal-2022-twitter,
    title = "{T}witter {T}opic {C}lassification",
    author = "Antypas, Dimosthenis  and
    Ushio, Asahi  and
    Camacho-Collados, Jose  and
    Neves, Leonardo  and
    Silva, Vitor  and
    Barbieri, Francesco",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics"
}

Author:

cardiffnlp

Dataset size:

6.38 MB