Dataset:
cardiffnlp/tweet_topic_multi
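This is the official repository of TweetTopic ("Twitter Topic Classification", COLING main conference 2022), a topic classification dataset on Twitter with 19 labels. Each TweetTopic instance comes with a timestamp ranging from September 2019 to August 2021. For the single-label version of TweetTopic, see cardiffnlp/tweet_topic_single. The tweet collection used in TweetTopic is the same as the one used in TweetNER7. The dataset is also integrated into TweetNLP.
Before annotation, we pre-process the tweets to normalize some artifacts: URLs are converted into the special token {{URL}}, and non-verified usernames are replaced with {{USERNAME}}. For verified usernames, we wrap the display name (or account name) in the symbols {@ and @}. For example, a tweet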
Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek
is transformed into the following text.
Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from {@herbiehancock@}
via {@bluenoterecords@} link below: {{URL}}
A simple function to format a tweet follows below.
import re
from urlextract import URLExtract
extractor = URLExtract()
def format_tweet(tweet):
    # mask web urls
    urls = extractor.find_urls(tweet)
    for url in urls:
        tweet = tweet.replace(url, "{{URL}}")
    # format twitter account
    tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r'\1{\2@}', tweet)
    return tweet
target = """Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"""
target_format = format_tweet(target)
print(target_format)
Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}
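The function above wraps every mention as if the account were verified. The rule described earlier also masks non-verified usernames as {{USERNAME}}, which requires knowing which accounts are verified. The following is a minimal sketch of that distinction, assuming the caller supplies the verified handles as a set. The name `mask_mentions` and the `verified_handles` parameter are hypothetical and are not part of the original preprocessing code.

```python
import re

# Hypothetical helper, not the authors' pipeline.
# A mention is "@" followed by word characters; the lookbehind skips e-mail addresses.
MENTION = re.compile(r"(?<!\w)@(\w+)")


def mask_mentions(tweet, verified_handles):
    """Wrap verified handles as {@handle@}; mask all other mentions as {{USERNAME}}."""
    verified = {handle.lower() for handle in verified_handles}

    def _replace(match):
        handle = match.group(1)
        if handle.lower() in verified:
            return "{@" + handle + "@}"
        return "{{USERNAME}}"

    return MENTION.sub(_replace, tweet)


example = mask_mentions(
    "Thanks @some_fan for sharing @herbiehancock's album",
    verified_handles={"herbiehancock"},  # assumed list of verified accounts
)
print(example)
# prints: Thanks {{USERNAME}} for sharing {@herbiehancock@}'s album
```

Handle matching is case-insensitive here because Twitter handles are case-insensitive; the original casing is kept in the output.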
| split | number of texts | description |
|---|---|---|
| test_2020 | 573 | test dataset from September 2019 to August 2020 |
| test_2021 | 1679 | test dataset from September 2020 to August 2021 |
| train_2020 | 4585 | training dataset from September 2019 to August 2020 |
| train_2021 | 1505 | training dataset from September 2020 to August 2021 |
| train_all | 6090 | combined training dataset of train_2020 and train_2021 |
| validation_2020 | 573 | validation dataset from September 2019 to August 2020 |
| validation_2021 | 188 | validation dataset from September 2020 to August 2021 |
| train_random | 4564 | randomly sampled training dataset with the same size as train_2020 from train_all |
| validation_random | 573 | randomly sampled validation dataset with the same size as validation_2020 from validation_all |
| test_coling2022_random | 5536 | random split used in the COLING 2022 paper |
| train_coling2022_random | 5731 | random split used in the COLING 2022 paper |
| test_coling2022 | 5536 | temporal split used in the COLING 2022 paper |
| train_coling2022 | 5731 | temporal split used in the COLING 2022 paper |
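For the temporal-shift setting, models should be trained on train_2020 with validation_2020 as the validation set and evaluated on test_2021. In general, models would be trained on train_all, the most representative training set, with validation_2021 as the validation set and evaluated on test_2021.
IMPORTANT NOTE: To get results comparable with those of the COLING 2022 Tweet Topic paper, use train_coling2022 and test_coling2022 for the temporal-shift setting, and train_coling2022_random and test_coling2022_random for the random split (the coling2022 splits do not have a validation set).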
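The split choices described above can be collected in one lookup table. The sketch below is illustrative only: the setting keys (`temporal_shift`, `general`, and so on) and the helper `load_setting` are hypothetical names, while the split names come from the table of splits. The loader assumes the Hugging Face `datasets` package is installed and the Hub is reachable; depending on the installed version, `load_dataset` may need additional arguments.

```python
# Split names are taken from the table of splits; the setting keys are illustrative.
SPLITS_BY_SETTING = {
    "temporal_shift": {
        "train": "train_2020",
        "validation": "validation_2020",
        "test": "test_2021",
    },
    "general": {
        "train": "train_all",
        "validation": "validation_2021",
        "test": "test_2021",
    },
    # The coling2022 splits have no validation set.
    "coling2022_temporal": {
        "train": "train_coling2022",
        "validation": None,
        "test": "test_coling2022",
    },
    "coling2022_random": {
        "train": "train_coling2022_random",
        "validation": None,
        "test": "test_coling2022_random",
    },
}


def load_setting(setting, dataset_name="cardiffnlp/tweet_topic_multi"):
    """Load the available train/validation/test splits for one setting."""
    # Imported lazily so the lookup table can be used without the dependency.
    from datasets import load_dataset

    return {
        role: load_dataset(dataset_name, split=split)
        for role, split in SPLITS_BY_SETTING[setting].items()
        if split is not None
    }
```

For example, `load_setting("temporal_shift")` would return a dictionary with the keys `train`, `validation`, and `test`.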
| model | training data | F1 | F1 (macro) | Accuracy |
|---|---|---|---|---|
| 12310321 | all (2020 + 2021) | 0.763104 | 0.620257 | 0.536629 |
| 12311321 | all (2020 + 2021) | 0.751814 | 0.600782 | 0.531864 |
| 12312321 | all (2020 + 2021) | 0.762513 | 0.603533 | 0.547945 |
| 12313321 | all (2020 + 2021) | 0.759917 | 0.59901 | 0.536033 |
| 12314321 | all (2020 + 2021) | 0.764767 | 0.618702 | 0.548541 |
| 12315321 | 2020 only | 0.732366 | 0.579456 | 0.493746 |
| 12316321 | 2020 only | 0.725229 | 0.561261 | 0.499107 |
| 12317321 | 2020 only | 0.73671 | 0.565624 | 0.513401 |
| 12318321 | 2020 only | 0.729446 | 0.534799 | 0.50268 |
| 12319321 | 2020 only | 0.731106 | 0.532141 | 0.509827 |
The model fine-tuning script can be found here.
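The table does not state how its scores were computed. As a minimal sketch, assuming "F1" is micro-averaged over all labels and "Accuracy" means an exact match of the full multi-hot label vector, the three metrics could be computed as follows. The function name `multilabel_scores` is hypothetical, and this is not the authors' evaluation code.

```python
def multilabel_scores(y_true, y_pred):
    """Return micro F1, macro F1 and exact-match accuracy for multi-hot label lists."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels  # true positives per label
    fp = [0] * n_labels  # false positives per label
    fn = [0] * n_labels  # false negatives per label
    exact = 0            # instances whose whole label vector is correct

    for truth, pred in zip(y_true, y_pred):
        exact += int(list(truth) == list(pred))
        for i, (t, p) in enumerate(zip(truth, pred)):
            if t and p:
                tp[i] += 1
            elif p:
                fp[i] += 1
            elif t:
                fn[i] += 1

    def f1(tp_, fp_, fn_):
        # Assumption: a label with no true or predicted positives scores 0.0.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(tp[i], fp[i], fn[i]) for i in range(n_labels)) / n_labels
    return {"f1_micro": micro, "f1_macro": macro, "accuracy": exact / len(y_true)}


# Toy example with 3 labels and 3 instances (made-up values, not dataset results).
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
scores = multilabel_scores(y_true, y_pred)
print(scores)
```

Under these assumptions, exact-match accuracy is the strictest of the three, which would be consistent with the Accuracy column being lower than both F1 columns.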
An example instance from the train split looks as follows.
{
    "date": "2021-03-07",
    "text": "The latest The Movie theater Daily! {{URL}} Thanks to {{USERNAME}} {{USERNAME}} {{USERNAME}} #lunchtimeread #amc1000",
    "id": "1368464923370676231",
    "label": [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "label_name": ["film_tv_&_video"]
}
The label2id dictionary can be found here.
{
    "arts_&_culture": 0,
    "business_&_entrepreneurs": 1,
    "celebrity_&_pop_culture": 2,
    "diaries_&_daily_life": 3,
    "family": 4,
    "fashion_&_style": 5,
    "film_tv_&_video": 6,
    "fitness_&_health": 7,
    "food_&_dining": 8,
    "gaming": 9,
    "learning_&_educational": 10,
    "music": 11,
    "news_&_social_concern": 12,
    "other_hobbies": 13,
    "relationships": 14,
    "science_&_technology": 15,
    "sports": 16,
    "travel_&_adventure": 17,
    "youth_&_student_life": 18
}
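The `label` field of an instance is a multi-hot vector indexed by the ids in this dictionary. As a small sketch of converting between the vector and the label names, the helpers below invert the mapping. The names `decode_labels` and `encode_labels` are hypothetical and are not part of the dataset's tooling.

```python
LABEL2ID = {
    "arts_&_culture": 0,
    "business_&_entrepreneurs": 1,
    "celebrity_&_pop_culture": 2,
    "diaries_&_daily_life": 3,
    "family": 4,
    "fashion_&_style": 5,
    "film_tv_&_video": 6,
    "fitness_&_health": 7,
    "food_&_dining": 8,
    "gaming": 9,
    "learning_&_educational": 10,
    "music": 11,
    "news_&_social_concern": 12,
    "other_hobbies": 13,
    "relationships": 14,
    "science_&_technology": 15,
    "sports": 16,
    "travel_&_adventure": 17,
    "youth_&_student_life": 18,
}
ID2LABEL = {index: name for name, index in LABEL2ID.items()}


def decode_labels(multi_hot):
    """Turn a 19-dimensional multi-hot vector into a list of label names."""
    if len(multi_hot) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} entries, got {len(multi_hot)}")
    return [ID2LABEL[i] for i, flag in enumerate(multi_hot) if flag]


def encode_labels(names):
    """Turn a list of label names into a 19-dimensional multi-hot vector."""
    vector = [0] * len(LABEL2ID)
    for name in names:
        vector[LABEL2ID[name]] = 1
    return vector


# The label vector of the example train instance has a single 1 at index 6.
example_label = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(decode_labels(example_label))
# prints: ['film_tv_&_video']
```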
@inproceedings{dimosthenis-etal-2022-twitter,
    title = "{T}witter {T}opic {C}lassification",
    author = "Antypas, Dimosthenis and
      Ushio, Asahi and
      Camacho-Collados, Jose and
      Neves, Leonardo and
      Silva, Vitor and
      Barbieri, Francesco",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics"
}