Dataset: ruanchaves/hashset_manual
Multilinguality: multilingual
Language creators: machine-generated
Annotation creators: expert-generated
Source datasets: original
ArXiv: arxiv:2201.06741
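HashSet is a new dataset of 1.9k manually annotated and 3.3M loosely supervised tweets, built for testing the efficiency of hashtag segmentation models. We compare state-of-the-art hashtag segmentation models on HashSet and on other benchmark datasets (STAN and BOUN), and analyze the results across datasets to argue that HashSet can serve as a good benchmark for the hashtag segmentation task.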
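HashSet Manual: contains 1.9k manually annotated hashtags. Each row holds the hashtag, the segmented hashtag, named entity annotations, a flag for whether the hashtag mixes Hindi and English tokens, and a flag for whether it contains non-English tokens.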
The hashtags are mainly in Hindi and English.
{
    "index": 10,
    "hashtag": "goodnewsmegan",
    "segmentation": "good news megan",
    "spans": {
        "start": [8],
        "end": [13],
        "text": ["megan"]
    },
    "source": "roman",
    "gold_position": null,
    "mix": false,
    "other": false,
    "ner": true,
    "annotator_id": 1,
    "annotation_id": 2088,
    "created_at": "2021-12-30 17:10:33.800607",
    "updated_at": "2021-12-30 17:10:59.714840",
    "lead_time": 3896.182,
    "rank": {
        "position": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "candidate": [
            "goodnewsmegan",
            "goodnewsmeg an",
            "goodnews megan",
            "goodnewsmega n",
            "go odnewsmegan",
            "good news megan",
            "good newsmegan",
            "g oodnewsmegan",
            "goodnewsme gan",
            "goodnewsm egan"
        ]
    }
}
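The fields in the sample record above can be read programmatically. The sketch below is illustrative only and uses the Python standard library. It assumes, based on this single record, that the `spans` offsets are 0-based, end-exclusive character positions in the unsegmented `hashtag`, and that `segmentation` differs from `hashtag` only by inserted spaces. The helper names (`entity_spans`, `segmentation_is_consistent`, `rank_of_segmentation`) are hypothetical and not part of the dataset or of any library.

```python
# A trimmed copy of the sample record shown above (only the fields used here).
record = {
    "hashtag": "goodnewsmegan",
    "segmentation": "good news megan",
    "spans": {"start": [8], "end": [13], "text": ["megan"]},
    "rank": {
        "position": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "candidate": [
            "goodnewsmegan", "goodnewsmeg an", "goodnews megan",
            "goodnewsmega n", "go odnewsmegan", "good news megan",
            "good newsmegan", "g oodnewsmegan", "goodnewsme gan",
            "goodnewsm egan",
        ],
    },
}


def entity_spans(rec):
    """Slice each annotated span out of the unsegmented hashtag.

    Assumption: offsets are 0-based and end-exclusive.
    """
    spans = rec["spans"]
    return [rec["hashtag"][s:e] for s, e in zip(spans["start"], spans["end"])]


def segmentation_is_consistent(rec):
    """Check that removing spaces from the segmentation gives back the hashtag."""
    return rec["segmentation"].replace(" ", "") == rec["hashtag"]


def rank_of_segmentation(rec):
    """Return the position at which the annotated segmentation appears among
    the ranked candidates, or None if it is not listed."""
    rank = rec["rank"]
    for position, candidate in zip(rank["position"], rank["candidate"]):
        if candidate == rec["segmentation"]:
            return position
    return None


print(entity_spans(record))                # slices taken from "hashtag"
print(segmentation_is_consistent(record))
print(rank_of_segmentation(record))
```

For this record the sliced span matches the `text` entry stored under `spans`, which is what motivates the offset assumption. Other rows, especially mixed Hindi and English ones, may not follow it, so checking a few of them first would be prudent.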
@article{kodali2022hashset,
title={HashSet--A Dataset For Hashtag Segmentation},
author={Kodali, Prashant and Bhatnagar, Akshala and Ahuja, Naman and Shrivastava, Manish and Kumaraguru, Ponnurangam},
journal={arXiv preprint arXiv:2201.06741},
year={2022}
}
This dataset was added by @ruanchaves while developing the hashformers library.