Dataset: guardian_authorship

- Multilinguality: monolingual
- Size: 1K<n<10K
- Language creators: found
- Annotation creators: found
- Source datasets: original
This is a cross-topic authorship-attribution dataset, provided by Stamatatos 2013.

1. The cross-topic scenarios are based on Table 4 of Stamatatos 2017 (e.g., cross_topic_1 => row 1: P S U&W).
2. The cross-genre scenarios are based on Table 5 of the same paper (e.g., cross_genre_1 => row 1: B P S&U&W).
3. The same-topic/same-genre scenarios are created by regrouping the whole dataset. For example, to work with same_topic and split the data 60-40, use:

```python
train_ds = load_dataset('guardian_authorship', name="cross_topic_<<#>>",
                        split='train[:60%]+validation[:60%]+test[:60%]')
tests_ds = load_dataset('guardian_authorship', name="cross_topic_<<#>>",
                        split='train[-40%:]+validation[-40%:]+test[-40%:]')
```

IMPORTANT: `train+validation+test[:60%]` produces the wrong split because the data are imbalanced.
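For quick experimentation, a minimal sketch along these lines loads one of the cross-topic configurations and inspects a single example (cross_topic_1 is used here purely as an illustration; any config listed in the splits table below works the same way):

```python
from datasets import load_dataset

# Load one cross-topic configuration with its published
# train/validation/test splits (cross_topic_1 chosen for illustration).
ds = load_dataset("guardian_authorship", name="cross_topic_1")

print(ds)              # number of examples in each split
print(ds["train"][0])  # one example: {"article": ..., "author": ..., "topic": ...}
```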
An example from the 'train' split looks as follows.
{
"article": "File 1a\n",
"author": 0,
"topic": 4
}
An example from the cross_genre_2 'validation' split looks as follows.
{
"article": "File 1a\n",
"author": 0,
"topic": 1
}
An example from the cross_genre_3 'validation' split looks as follows.
{
"article": "File 1a\n",
"author": 0,
"topic": 2
}
An example from the cross_genre_4 'validation' split looks as follows.
{
"article": "File 1a\n",
"author": 0,
"topic": 3
}
An example from the cross_topic_1 'validation' split looks as follows.
{
"article": "File 1a\n",
"author": 0,
"topic": 1
}
The data fields are the same among all splits.
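To confirm this programmatically, a minimal sketch (again assuming the cross_topic_1 config) prints the feature schema of each split:

```python
from datasets import load_dataset

ds = load_dataset("guardian_authorship", name="cross_topic_1")

# The article/author/topic schema is shared by all three splits.
for split in ("train", "validation", "test"):
    print(split, ds[split].features)
```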
| name | train | validation | test |
|---|---|---|---|
| cross_genre_1 | 63 | 112 | 269 |
| cross_genre_2 | 63 | 62 | 319 |
| cross_genre_3 | 63 | 90 | 291 |
| cross_genre_4 | 63 | 117 | 264 |
| cross_topic_1 | 112 | 62 | 207 |
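The counts above can be reproduced with a sketch along these lines (config names taken from the table):

```python
from datasets import load_dataset

configs = ["cross_genre_1", "cross_genre_2", "cross_genre_3",
           "cross_genre_4", "cross_topic_1"]

# Print the number of examples per split for each configuration;
# the numbers should match the table above.
for name in configs:
    ds = load_dataset("guardian_authorship", name=name)
    print(name, {split: ds[split].num_rows for split in ds})
```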
@article{article,
author = {Stamatatos, Efstathios},
year = {2013},
month = {01},
pages = {421-439},
title = {On the robustness of authorship attribution based on character n-gram features},
volume = {21},
journal = {Journal of Law and Policy}
}
@inproceedings{stamatatos2017authorship,
title={Authorship attribution using text distortion},
author={Stamatatos, Efstathios},
booktitle={Proc. of the 15th Conf. of the European Chapter of the Association for Computational Linguistics},
volume={1},
pages={1138--1149},
year={2017}
}
Thanks to @thomwolf, @eltoto1219, and @malikaltakrori for adding this dataset.