Dataset:
tner/tweetner7
This is the official repository of TweetNER7 ("Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts", AACL 2022 main conference), an NER dataset on Twitter with 7 entity labels. Each instance of TweetNER7 comes with a timestamp, and the timestamps range from September 2019 to August 2021. The tweet collection used in TweetNER7 is the same as the one used in TweetTopic. The dataset is also integrated into TweetNLP.
We pre-process tweets before annotation to normalize some artifacts, converting URLs into the special token {{URL}} and non-verified usernames into {{USERNAME}}. For verified usernames, we enclose the display name (or account name) in the symbols {@ and @}. For example, the tweet
Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek
is transformed into the following text.
Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from {@herbiehancock@}
via {@bluenoterecords@} link below: {{URL}}
A simple function to format a tweet is shown below.

```python
import re

from urlextract import URLExtract

extractor = URLExtract()


def format_tweet(tweet):
    # mask web urls
    urls = extractor.find_urls(tweet)
    for url in urls:
        tweet = tweet.replace(url, "{{URL}}")
    # format twitter account
    tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r'\1{\2@}', tweet)
    return tweet


target = """Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"""
target_format = format_tweet(target)
print(target_format)
# Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}
```
We ask annotators to ignore those special tokens but label the verified users' mentions.
| split | number of instances | description |
|---|---|---|
| train_2020 | 4616 | training dataset from September 2019 to August 2020 |
| train_2021 | 2495 | training dataset from September 2020 to August 2021 |
| train_all | 7111 | combined training dataset of train_2020 and train_2021 |
| validation_2020 | 576 | validation dataset from September 2019 to August 2020 |
| validation_2021 | 310 | validation dataset from September 2020 to August 2021 |
| test_2020 | 576 | test dataset from September 2019 to August 2020 |
| test_2021 | 2807 | test dataset from September 2020 to August 2021 |
| train_random | 4616 | randomly sampled training dataset with the same size as train_2020 from train_all |
| validation_random | 576 | randomly sampled validation dataset with the same size as validation_2020 from validation_all |
| extra_2020 | 87880 | extra tweets without annotations from September 2019 to August 2020 |
| extra_2021 | 93594 | extra tweets without annotations from September 2020 to August 2021 |
For the temporal-shift setting, the model should be trained on train_2020 with validation_2020 and evaluated on test_2021. In the general setting, the model would be trained on train_all, the most representative training set, with validation_2021 and evaluated on test_2021.
An example from the training set looks as follows.
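The two settings can be written down as a small mapping from setting to split names. The sketch below is only an illustration: the setting names `temporal_shift` and `general` and the helper functions are hypothetical, and `load_setting` assumes the dataset can be loaded from the Hub under the ID `tner/tweetner7` with the `datasets` library.

```python
# Split names come from the table above; the setting names are hypothetical labels.
SETTINGS = {
    "temporal_shift": {
        "train": "train_2020",
        "validation": "validation_2020",
        "test": "test_2021",
    },
    "general": {
        "train": "train_all",
        "validation": "validation_2021",
        "test": "test_2021",
    },
}


def splits_for(setting):
    """Return the train/validation/test split names for a setting."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting!r}")
    return dict(SETTINGS[setting])


def load_setting(setting, dataset_name="tner/tweetner7"):
    """Load the three splits of a setting (assumes `datasets` is installed and the Hub is reachable)."""
    from datasets import load_dataset  # imported lazily so splits_for works offline

    return {
        role: load_dataset(dataset_name, split=split_name)
        for role, split_name in splits_for(setting).items()
    }


print(splits_for("temporal_shift"))
```

Keeping the mapping separate from the loading step makes it easy to check which splits a run will touch before anything is downloaded.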
{
'tokens': ['Morning', '5km', 'run', 'with', '{{USERNAME}}', 'for', 'breast', 'cancer', 'awareness', '#', 'pinkoctober', '#', 'breastcancerawareness', '#', 'zalorafit', '#', 'zalorafitxbnwrc', '@', 'The', 'Central', 'Park', ',', 'Desa', 'Parkcity', '{{URL}}'],
'tags': [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 2, 14, 2, 14, 14, 14, 14, 14, 14, 4, 11, 11, 11, 11, 14],
'id': '1183344337016381440',
'date': '2019-10-13'
}
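As a quick sanity check on an instance, one might verify that `tokens` and `tags` have the same length and map the `date` field to the period used in the split names. The sketch below is a hypothetical helper, not part of the dataset tooling. It assumes that each period runs from 1 September to 31 August, which is an interpretation of "from September 2019 to August 2020" in the split table, and it uses a truncated copy of the instance above.

```python
from datetime import date

# Assumed day-level boundaries; the split table only gives months.
PERIODS = {
    "2020": (date(2019, 9, 1), date(2020, 8, 31)),
    "2021": (date(2020, 9, 1), date(2021, 8, 31)),
}


def period_of(instance):
    """Return '2020' or '2021' depending on the instance's ISO date string."""
    day = date.fromisoformat(instance["date"])
    for name, (start, end) in PERIODS.items():
        if start <= day <= end:
            return name
    raise ValueError(f"date outside the covered range: {instance['date']}")


def is_aligned(instance):
    """Check that every token has exactly one tag."""
    return len(instance["tokens"]) == len(instance["tags"])


# Truncated copy of the instance shown above (first three tokens only).
example = {
    "tokens": ["Morning", "5km", "run"],
    "tags": [14, 14, 14],
    "id": "1183344337016381440",
    "date": "2019-10-13",
}

print(period_of(example), is_aligned(example))
```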
The label2id dictionary can be found here.
{
"B-corporation": 0,
"B-creative_work": 1,
"B-event": 2,
"B-group": 3,
"B-location": 4,
"B-person": 5,
"B-product": 6,
"I-corporation": 7,
"I-creative_work": 8,
"I-event": 9,
"I-group": 10,
"I-location": 11,
"I-person": 12,
"I-product": 13,
"O": 14
}
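With this mapping, the integer `tags` of an instance can be turned back into entity spans. The following is a minimal sketch of BIO decoding rather than the decoding used by any particular library: the function name `decode_entities` is hypothetical, and an `I-` tag that does not continue an entity of the same type is simply treated as outside any entity.

```python
ENTITY_TYPES = ["corporation", "creative_work", "event", "group", "location", "person", "product"]

# Rebuild the label2id mapping shown above: B- tags are 0-6, I- tags are 7-13, O is 14.
LABEL2ID = {f"B-{name}": index for index, name in enumerate(ENTITY_TYPES)}
LABEL2ID.update({f"I-{name}": index + len(ENTITY_TYPES) for index, name in enumerate(ENTITY_TYPES)})
LABEL2ID["O"] = 2 * len(ENTITY_TYPES)
ID2LABEL = {index: label for label, index in LABEL2ID.items()}


def decode_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current = None  # (entity_type, list_of_tokens) for the entity being built
    for token, tag in zip(tokens, tags):
        label = ID2LABEL[tag]
        if label.startswith("B-"):
            if current is not None:
                entities.append(current)
            current = (label[2:], [token])
        elif label.startswith("I-") and current is not None and current[0] == label[2:]:
            current[1].append(token)
        else:
            # "O", or an I- tag that does not continue the current entity
            if current is not None:
                entities.append(current)
            current = None
    if current is not None:
        entities.append(current)
    return [(entity_type, " ".join(words)) for entity_type, words in entities]


tokens = ['Morning', '5km', 'run', 'with', '{{USERNAME}}', 'for', 'breast', 'cancer', 'awareness', '#', 'pinkoctober', '#', 'breastcancerawareness', '#', 'zalorafit', '#', 'zalorafitxbnwrc', '@', 'The', 'Central', 'Park', ',', 'Desa', 'Parkcity', '{{URL}}']
tags = [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 2, 14, 2, 14, 14, 14, 14, 14, 14, 4, 11, 11, 11, 11, 14]

print(decode_entities(tokens, tags))
# [('event', 'pinkoctober'), ('event', 'breastcancerawareness'), ('location', 'Central Park , Desa Parkcity')]
```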
See the full evaluation metrics here.
Model description follows below.
| Model (link) | Data | Language Model | Micro F1 (2021) | Macro F1 (2021) |
|---|---|---|---|---|
| tner/roberta-large-tweetner7-random | tweetner7 | roberta-large | 66.33 | 60.96 |
| tner/twitter-roberta-base-2019-90m-tweetner7-random | tweetner7 | cardiffnlp/twitter-roberta-base-2019-90m | 63.29 | 58.5 |
| tner/roberta-base-tweetner7-random | tweetner7 | roberta-base | 64.04 | 59.23 |
| tner/twitter-roberta-base-dec2020-tweetner7-random | tweetner7 | cardiffnlp/twitter-roberta-base-dec2020 | 64.72 | 59.97 |
| tner/bertweet-large-tweetner7-random | tweetner7 | vinai/bertweet-large | 64.86 | 60.49 |
| tner/bertweet-base-tweetner7-random | tweetner7 | vinai/bertweet-base | 65.55 | 59.58 |
| tner/bert-large-tweetner7-random | tweetner7 | bert-large | 62.39 | 57.54 |
| tner/bert-base-tweetner7-random | tweetner7 | bert-base | 60.91 | 55.92 |
| Model (link) | Data | Language Model | Micro F1 (2021) | Macro F1 (2021) |
|---|---|---|---|---|
| tner/roberta-large-tweetner7-selflabel2020 | tweetner7 | roberta-large | 64.56 | 59.63 |
| tner/roberta-large-tweetner7-selflabel2021 | tweetner7 | roberta-large | 64.6 | 59.45 |
| tner/roberta-large-tweetner7-2020-selflabel2020-all | tweetner7 | roberta-large | 65.46 | 60.39 |
| tner/roberta-large-tweetner7-2020-selflabel2021-all | tweetner7 | roberta-large | 64.52 | 59.45 |
| tner/roberta-large-tweetner7-selflabel2020-continuous | tweetner7 | roberta-large | 65.15 | 60.23 |
| tner/roberta-large-tweetner7-selflabel2021-continuous | tweetner7 | roberta-large | 64.48 | 59.41 |
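
The tables report both micro and macro F1. As a reminder of how the two differ, the sketch below applies the standard definitions to per-type counts of true positives, false positives, and false negatives. The counts are invented for illustration and are not taken from the paper, and the assumption that scores are aggregated over entity types in this way is mine, not a statement about the official evaluation script.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(counts):
    """counts maps entity type -> (tp, fp, fn). Returns (micro_f1, macro_f1)."""
    total_tp = sum(tp for tp, _, _ in counts.values())
    total_fp = sum(fp for _, fp, _ in counts.values())
    total_fn = sum(fn for _, _, fn in counts.values())
    micro = f1(total_tp, total_fp, total_fn)  # pool counts, then score once
    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # score per type, then average
    return micro, macro


# Invented toy counts: a frequent, easy type and a rare, hard type.
toy_counts = {"person": (90, 10, 10), "event": (10, 20, 20)}
micro, macro = micro_macro_f1(toy_counts)
print(f"micro={micro:.4f} macro={macro:.4f}")
```

Micro F1 is dominated by frequent entity types, while macro F1 weights every type equally, which is one reason the macro scores in the tables are lower than the micro scores.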
To reproduce the experimental results in our AACL paper, please see the repository https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper.
@inproceedings{ushio-etal-2022-tweet,
title = "{N}amed {E}ntity {R}ecognition in {T}witter: {A} {D}ataset and {A}nalysis on {S}hort-{T}erm {T}emporal {S}hifts",
author = "Ushio, Asahi and
Neves, Leonardo and
Silva, Vitor and
Barbieri, Francesco and
Camacho-Collados, Jose",
booktitle = "The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing",
month = nov,
year = "2022",
address = "Online",
publisher = "Association for Computational Linguistics",
}