Dataset: blimp
Tasks:
Languages:
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: machine-generated
Annotation creators: crowdsourced
Source datasets: original
License:
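BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1,000 minimal pairs that isolate a specific contrast in syntax, morphology, or semantics. The data is generated automatically from expert-crafted grammars.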
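A minimal sketch of how a minimal-pair benchmark like this is typically scored: a model is counted as correct on a pair when it assigns a higher score to the acceptable sentence than to the unacceptable one. The keys `sentence_good` and `sentence_bad` match the example records in this card; `minimal_pair_accuracy` and `toy_score` are hypothetical names introduced for illustration, and `toy_score` is a deterministic stand-in, not a real language model.

```python
from typing import Callable, Iterable, Mapping


def minimal_pair_accuracy(
    pairs: Iterable[Mapping[str, object]],
    score: Callable[[str], float],
) -> float:
    """Return the fraction of pairs where `score` prefers the acceptable sentence."""
    total = 0
    correct = 0
    for pair in pairs:
        total += 1
        # A pair counts as correct only if the good sentence scores strictly higher.
        if score(str(pair["sentence_good"])) > score(str(pair["sentence_bad"])):
            correct += 1
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return correct / total


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for a language model's sentence log-probability."""
    # Penalise one hard-coded word so the demo is deterministic; a real
    # scorer would sum token log-probabilities from an actual model.
    return -10.0 if "certain" in sentence.split() else -5.0


example_pairs = [
    {
        "sentence_good": "Benjamin's tutor was easy to boast about.",
        "sentence_bad": "Benjamin's tutor was certain to boast about.",
    },
]

accuracy = minimal_pair_accuracy(example_pairs, toy_score)
print(accuracy)
```

In practice `score` would wrap a real model, and the accuracy would be reported separately for each sub-dataset.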
An example of "train" looks as follows.
{
"UID": "tough_vs_raising_1",
"field": "syntax_semantics",
"lexically_identical": false,
"linguistics_term": "control_raising",
"one_prefix_method": false,
"pair_id": 2,
"sentence_bad": "Benjamin's tutor was certain to boast about.",
"sentence_good": "Benjamin's tutor was easy to boast about.",
"simple_LM_method": true,
"two_prefix_method": false
}
anaphor_gender_agreement
An example of "train" looks as follows.
{
"UID": "tough_vs_raising_1",
"field": "syntax_semantics",
"lexically_identical": false,
"linguistics_term": "control_raising",
"one_prefix_method": false,
"pair_id": 2,
"sentence_bad": "Benjamin's tutor was certain to boast about.",
"sentence_good": "Benjamin's tutor was easy to boast about.",
"simple_LM_method": true,
"two_prefix_method": false
}
anaphor_number_agreement
An example of "train" looks as follows.
{
"UID": "tough_vs_raising_1",
"field": "syntax_semantics",
"lexically_identical": false,
"linguistics_term": "control_raising",
"one_prefix_method": false,
"pair_id": 2,
"sentence_bad": "Benjamin's tutor was certain to boast about.",
"sentence_good": "Benjamin's tutor was easy to boast about.",
"simple_LM_method": true,
"two_prefix_method": false
}
animate_subject_passive
An example of "train" looks as follows.
{
"UID": "tough_vs_raising_1",
"field": "syntax_semantics",
"lexically_identical": false,
"linguistics_term": "control_raising",
"one_prefix_method": false,
"pair_id": 2,
"sentence_bad": "Benjamin's tutor was certain to boast about.",
"sentence_good": "Benjamin's tutor was easy to boast about.",
"simple_LM_method": true,
"two_prefix_method": false
}
animate_subject_trans
An example of "train" looks as follows.
{
"UID": "tough_vs_raising_1",
"field": "syntax_semantics",
"lexically_identical": false,
"linguistics_term": "control_raising",
"one_prefix_method": false,
"pair_id": 2,
"sentence_bad": "Benjamin's tutor was certain to boast about.",
"sentence_good": "Benjamin's tutor was easy to boast about.",
"simple_LM_method": true,
"two_prefix_method": false
}
The data fields are the same among all splits.
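Because every record shares the same fields, a record can be modelled as a small typed structure. The sketch below is one possible representation, with field names and types inferred from the example records in this card; `BlimpRecord` and `parse_record` are hypothetical names, not part of any library.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BlimpRecord:
    """One minimal pair; field types are inferred from the example records."""
    UID: str
    field: str
    lexically_identical: bool
    linguistics_term: str
    one_prefix_method: bool
    pair_id: int
    sentence_bad: str
    sentence_good: str
    simple_LM_method: bool
    two_prefix_method: bool


def parse_record(raw: str) -> BlimpRecord:
    """Parse a JSON object and check that it has exactly the expected keys."""
    data = json.loads(raw)
    expected = {f.name for f in fields(BlimpRecord)}
    if set(data) != expected:
        raise ValueError(f"unexpected keys: {sorted(set(data) ^ expected)}")
    # Note: dataclasses do not validate value types; only the key set is checked.
    return BlimpRecord(**data)


raw = '''{
  "UID": "tough_vs_raising_1",
  "field": "syntax_semantics",
  "lexically_identical": false,
  "linguistics_term": "control_raising",
  "one_prefix_method": false,
  "pair_id": 2,
  "sentence_bad": "Benjamin's tutor was certain to boast about.",
  "sentence_good": "Benjamin's tutor was easy to boast about.",
  "simple_LM_method": true,
  "two_prefix_method": false
}'''

record = parse_record(raw)
print(record.UID, record.pair_id)
```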
| name | train |
|---|---|
| adjunct_island | 1000 |
| anaphor_gender_agreement | 1000 |
| anaphor_number_agreement | 1000 |
| animate_subject_passive | 1000 |
| animate_subject_trans | 1000 |
@article{warstadt2019blimp,
title={BLiMP: A Benchmark of Linguistic Minimal Pairs for English},
author={Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei and Wang, Sheng-Fu and Bowman, Samuel R},
journal={arXiv preprint arXiv:1912.00582},
year={2019}
}
Thanks to @lhoestq, @patrickvonplaten, and @thomwolf for adding this dataset.