Dataset: guardian_authorship

- Tasks:
- Languages:
- Multilinguality: monolingual
- Size: 1K<n<10K
- Language creators: found
- Annotation creators: found
- Source datasets: original
- License:
This is a cross-topic authorship attribution dataset, provided by Stamatatos 2013.

1. The cross-topic scenarios are based on Table-4 of Stamatatos 2017 (e.g. cross_topic_1 => row 1: P S U&W).
2. The cross-genre scenarios are based on Table-5 of the same paper (e.g. cross_genre_1 => row 1: B P S&U&W).
3. The same-topic/same-genre scenarios are created by grouping all of the splits together. For example, to use same_topic with a 60-40 split:

```python
train_ds = load_dataset('guardian_authorship', name="cross_topic_<<#>>",
                        split='train[:60%]+validation[:60%]+test[:60%]')
tests_ds = load_dataset('guardian_authorship', name="cross_topic_<<#>>",
                        split='train[-40%:]+validation[-40%:]+test[-40%:]')
```

IMPORTANT: `train+validation+test[:60%]` produces a wrong split, because the data is unbalanced.
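For a single predefined scenario, one of the configurations can be loaded directly. The snippet below is a minimal sketch assuming the 🤗 `datasets` library; `cross_topic_1` stands in for any of the configuration names listed in the splits table further down.

```python
from datasets import load_dataset

# Minimal sketch: load one predefined cross-topic scenario.
# "cross_topic_1" stands in for any configuration name on this card.
ds = load_dataset("guardian_authorship", name="cross_topic_1")

# Each configuration ships with 'train', 'validation' and 'test' splits;
# print the number of examples in each one.
for split_name, split in ds.items():
    print(split_name, split.num_rows)
```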
An example of 'train' from cross_genre_1 looks as follows.
{
    "article": "File 1a\n",
    "author": 0,
    "topic": 4
}
An example of 'validation' from cross_genre_2 looks as follows.
{
    "article": "File 1a\n",
    "author": 0,
    "topic": 1
}
An example of 'validation' from cross_genre_3 looks as follows.
{
    "article": "File 1a\n",
    "author": 0,
    "topic": 2
}
An example of 'validation' from cross_genre_4 looks as follows.
{
    "article": "File 1a\n",
    "author": 0,
    "topic": 3
}
An example of 'validation' from cross_topic_1 looks as follows.
{
    "article": "File 1a\n",
    "author": 0,
    "topic": 1
}
The data fields are the same among all splits.
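As a quick check of those fields, the sketch below prints one training example. It is a minimal sketch assuming the 🤗 `datasets` library; the field names `article`, `author`, and `topic` come from the instances shown above.

```python
from datasets import load_dataset

# Minimal sketch: inspect the fields of a single example.
# "cross_topic_1" stands in for any configuration name on this card.
ds = load_dataset("guardian_authorship", name="cross_topic_1", split="train")

example = ds[0]
print(example["article"][:100])  # article text (string)
print(example["author"])         # author id (integer, as in the instances above)
print(example["topic"])          # topic id (integer, as in the instances above)
```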
| name | train | validation | test |
|---|---|---|---|
| cross_genre_1 | 63 | 112 | 269 | 
| cross_genre_2 | 63 | 62 | 319 | 
| cross_genre_3 | 63 | 90 | 291 | 
| cross_genre_4 | 63 | 117 | 264 | 
| cross_topic_1 | 112 | 62 | 207 | 
@article{stamatatos2013robustness,
    author = {Stamatatos, Efstathios},
    title = {On the robustness of authorship attribution based on character n-gram features},
    journal = {Journal of Law and Policy},
    volume = {21},
    pages = {421--439},
    year = {2013},
    month = {01}
}
@inproceedings{stamatatos2017authorship,
    title={Authorship attribution using text distortion},
    author={Stamatatos, Efstathios},
    booktitle={Proc. of the 15th Conf. of the European Chapter of the Association for Computational Linguistics},
    volume={1},
    pages={1138--1149},
    year={2017}
}
Thanks to @thomwolf, @eltoto1219, and @malikaltakrori for adding this dataset.