Dataset:
cardiffnlp/tweet_topic_single
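This is the official repository of TweetTopic ("Twitter Topic Classification, COLING main conference 2022"), a topic classification dataset on Twitter with 6 labels. Each TweetTopic instance comes with a timestamp, and the timestamps range from September 2019 to August 2021. For the multi-label version of TweetTopic, see cardiffnlp/tweet_topic_multi. The tweet collection used in TweetTopic is the same as the one used in TweetNER7. The dataset is also integrated in TweetNLP.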
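We pre-process tweets before annotation to normalize some artifacts, converting URLs into the special token {{URL}} and non-verified usernames into {{USERNAME}}. For verified usernames, we mark the display name (or account name) with the symbols {@ and @}, as in {@herbiehancock@}. For example, the tweet
Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek
is transformed into the following text.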
Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from {@herbiehancock@}
via {@bluenoterecords@} link below: {{URL}}
A simple function to format tweets follows.
import re
from urlextract import URLExtract
extractor = URLExtract()
def format_tweet(tweet):
    # mask web urls
    urls = extractor.find_urls(tweet)
    for url in urls:
        tweet = tweet.replace(url, "{{URL}}")
    # format twitter account
    tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r'\1{\2@}', tweet)
    return tweet
target = """Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"""
target_format = format_tweet(target)
print(target_format)
'Get the all-analog Classic Vinyl Edition of "Takin\' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}'
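For readers who cannot install `urlextract`, the same formatting can be approximated with the standard library alone. The following is a minimal sketch, not the function used to build the dataset: the name `format_tweet_stdlib` and both patterns are assumptions made for illustration. The URL pattern only catches links that start with `http://` or `https://` and will swallow trailing punctuation, and, like the function above, it wraps every mention in `{@ ... @}` because verification status cannot be read from the text.

```python
import re

# Assumption: a plain regex stands in for urlextract. It only matches
# URLs that begin with http:// or https://.
URL_PATTERN = re.compile(r"https?://\S+")

# A mention is "@" followed by word characters, and must not be preceded
# by a word character (so e-mail addresses are left untouched).
MENTION_PATTERN = re.compile(r"(?<!\w)@(\w+)")


def format_tweet_stdlib(tweet: str) -> str:
    """Mask URLs as {{URL}} and wrap account mentions as {@name@}."""
    # Mask URLs first so that an "@" inside a URL is never treated as a mention.
    tweet = URL_PATTERN.sub("{{URL}}", tweet)
    return MENTION_PATTERN.sub(r"{@\1@}", tweet)


target = (
    'Get the all-analog Classic Vinyl Edition of "Takin\' Off" Album '
    "from @herbiehancock via @bluenoterecords link below: "
    "http://bluenote.lnk.to/AlbumOfTheWeek"
)
print(format_tweet_stdlib(target))
```

On the example tweet this prints the same formatted text as shown above. Unlike the `\b`-based pattern, the lookbehind also handles a mention at the very start of a tweet.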
| split | number of texts | description |
|---|---|---|
| test_2020 | 376 | test dataset from September 2019 to August 2020 | 
| test_2021 | 1693 | test dataset from September 2020 to August 2021 | 
| train_2020 | 2858 | training dataset from September 2019 to August 2020 | 
| train_2021 | 1516 | training dataset from September 2020 to August 2021 | 
| train_all | 4374 | combined training dataset of train_2020 and train_2021 | 
| validation_2020 | 352 | validation dataset from September 2019 to August 2020 | 
| validation_2021 | 189 | validation dataset from September 2020 to August 2021 | 
| train_random | 2830 | randomly sampled training dataset with the same size as train_2020 from train_all | 
| validation_random | 354 | randomly sampled validation dataset with the same size as validation_2020 from validation_all | 
| test_coling2022_random | 3399 | random split used in the COLING 2022 paper | 
| train_coling2022_random | 3598 | random split used in the COLING 2022 paper | 
| test_coling2022 | 3399 | temporal split used in the COLING 2022 paper | 
| train_coling2022 | 3598 | temporal split used in the COLING 2022 paper | 
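For the temporal-shift setting, a model should be trained on train_2020, validated on validation_2020, and evaluated on test_2021. In the general setting, a model should be trained on train_all, the most representative training set, validated on validation_2021, and evaluated on test_2021.
IMPORTANT NOTE: To get results that are comparable with those of the COLING 2022 Tweet Topic paper, use train_coling2022 and test_coling2022 for the temporal-shift setting, and train_coling2022_random and test_coling2022_random for the random split (the coling2022 splits have no validation set).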
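The split recommendations can be written down as a small lookup table. This is only a sketch: the split names come from the table above, while the setting keys (`temporal_shift`, `general`, and so on) and the helpers `splits_for` and `load_setting` are hypothetical names chosen for illustration. Using validation_2021 as the validation split of the general setting is this sketch's reading of the recommendation. `load_setting` assumes the Hugging Face `datasets` library is installed and that the dataset is reachable under the ID `cardiffnlp/tweet_topic_single`.

```python
# Split names are taken from the table above. The setting keys are
# illustrative labels for this sketch, not names defined by the dataset.
SPLITS_BY_SETTING = {
    "temporal_shift": {
        "train": "train_2020",
        "validation": "validation_2020",
        "test": "test_2021",
    },
    "general": {
        "train": "train_all",
        "validation": "validation_2021",
        "test": "test_2021",
    },
    # The coling2022 splits come without a validation set.
    "coling2022_temporal": {
        "train": "train_coling2022",
        "test": "test_coling2022",
    },
    "coling2022_random": {
        "train": "train_coling2022_random",
        "test": "test_coling2022_random",
    },
}


def splits_for(setting):
    """Return the role -> split-name mapping for one evaluation setting."""
    if setting not in SPLITS_BY_SETTING:
        known = ", ".join(sorted(SPLITS_BY_SETTING))
        raise ValueError(f"unknown setting {setting!r}; choose from: {known}")
    return SPLITS_BY_SETTING[setting]


def load_setting(setting):
    """Load every split of a setting (needs `datasets` and network access)."""
    # Imported lazily so that splits_for() works without the dependency.
    from datasets import load_dataset

    return {
        role: load_dataset("cardiffnlp/tweet_topic_single", split=name)
        for role, name in splits_for(setting).items()
    }
```

For example, `splits_for("temporal_shift")["test"]` gives `"test_2021"`, and `load_setting("coling2022_random")` would return a dictionary with only `train` and `test` entries.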
| model | training data | F1 | F1 (macro) | Accuracy | 
|---|---|---|---|---|
| 12310321 | all (2020 + 2021) | 0.896043 | 0.800061 | 0.896043 | 
| 12311321 | all (2020 + 2021) | 0.887773 | 0.79793 | 0.887773 | 
| 12312321 | all (2020 + 2021) | 0.892499 | 0.774494 | 0.892499 | 
| 12313321 | all (2020 + 2021) | 0.890136 | 0.776025 | 0.890136 | 
| 12314321 | all (2020 + 2021) | 0.894861 | 0.800952 | 0.894861 | 
| 12315321 | 2020 only | 0.878913 | 0.70565 | 0.878913 | 
| 12316321 | 2020 only | 0.868281 | 0.729667 | 0.868281 | 
| 12317321 | 2020 only | 0.882457 | 0.740187 | 0.882457 | 
| 12318321 | 2020 only | 0.87596 | 0.746275 | 0.87596 | 
| 12319321 | 2020 only | 0.877732 | 0.746119 | 0.877732 | 
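In every row of the table the F1 column equals the Accuracy column. That is what one would expect if F1 is micro-averaged, because micro-F1 and accuracy coincide when each tweet has exactly one label; the table itself does not state the averaging, so treat this as an assumption. The sketch below computes the three metrics in plain Python on made-up toy labels. The function name `micro_macro_f1` is hypothetical and this is not the evaluation script used for the reported numbers.

```python
from collections import Counter


def _f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1, accuracy) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    # Micro: pool the counts over all classes, then compute one F1.
    micro = _f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(_f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return micro, macro, accuracy


# Toy data (invented): label ids follow the label2id convention of the dataset.
y_true = [4, 4, 4, 4, 2, 2, 3, 3]
y_pred = [4, 4, 4, 4, 2, 2, 3, 2]
micro, macro, accuracy = micro_macro_f1(y_true, y_pred)
print(micro, macro, accuracy)
```

On the toy data, micro-F1 and accuracy are both 0.875, while macro-F1 is lower because the single error weighs heavily on the two small classes. The same effect would explain why the macro column in the table sits well below the other two.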
The model fine-tuning script can be found here.
An example instance from train looks as follows.
{
    "text": "Game day for {{USERNAME}} U18\u2019s against {{USERNAME}} U18\u2019s. Even though it\u2019s a \u2018home\u2019 game for the people that have settled in Mid Wales it\u2019s still a 4 hour round trip for us up to Colwyn Bay. Still enjoy it though!",
    "date": "2019-09-08",
    "label": 4,
    "id": "1170606779568463874",
    "label_name": "sports_&_gaming"
}
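The `date` field is what ties an instance to one of the two periods used by the temporal splits. A minimal sketch of that assignment follows; the helper name `period_of` is hypothetical, and the exact boundaries (whole calendar months, first to last day) are an assumption derived from the "September 2019 to August 2020" and "September 2020 to August 2021" descriptions in the split table.

```python
from datetime import date

# Assumed period boundaries: whole calendar months, inclusive on both ends.
PERIODS = {
    "2020": (date(2019, 9, 1), date(2020, 8, 31)),
    "2021": (date(2020, 9, 1), date(2021, 8, 31)),
}


def period_of(date_str):
    """Map an ISO date string such as "2019-09-08" to "2020", "2021" or None."""
    day = date.fromisoformat(date_str)
    for name, (start, end) in PERIODS.items():
        if start <= day <= end:
            return name
    return None  # outside the range covered by the dataset


example = {"date": "2019-09-08", "label": 4, "label_name": "sports_&_gaming"}
print(period_of(example["date"]))
```

The example instance above, dated 2019-09-08, falls into the 2020 period.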
The label2id dictionary can be found here.
{
    "arts_&_culture": 0,
    "business_&_entrepreneurs": 1,
    "pop_culture": 2,
    "daily_life": 3,
    "sports_&_gaming": 4,
    "science_&_technology": 5
}
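Model outputs are integer ids, so in practice the inverse mapping is usually needed as well. The sketch below copies the dictionary above and derives an `id2label` lookup from it; the names `id2label` and `label_name` are illustrative choices, not identifiers defined by the dataset.

```python
# Copied from the label2id dictionary shown above.
label2id = {
    "arts_&_culture": 0,
    "business_&_entrepreneurs": 1,
    "pop_culture": 2,
    "daily_life": 3,
    "sports_&_gaming": 4,
    "science_&_technology": 5,
}

# Invert the mapping so that an integer id can be turned back into a name.
id2label = {idx: name for name, idx in label2id.items()}


def label_name(label_id):
    """Return the topic name for an integer label id."""
    return id2label[label_id]


print(label_name(4))
```

For the train example shown earlier, `label_name(4)` returns `"sports_&_gaming"`, which matches its `label_name` field.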
@inproceedings{dimosthenis-etal-2022-twitter,
    title = "{T}witter {T}opic {C}lassification",
    author = "Antypas, Dimosthenis  and
    Ushio, Asahi  and
    Camacho-Collados, Jose  and
    Neves, Leonardo  and
    Silva, Vitor  and
    Barbieri, Francesco",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics"
}