Dataset:

blimp

Tasks:

text-classification

Sub-tasks:

acceptability-classification

Languages:

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

machine-generated

Annotation creators:

crowdsourced

Source datasets:

original

License:

cc-by-4.0

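Dataset Card for "blimp"

Dataset Summary

BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs that isolate a specific contrast in syntax, morphology, or semantics. The data is generated automatically from expert-crafted grammars.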
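One common way to use minimal pairs like these is to check whether a model scores the acceptable sentence above the unacceptable one. The sketch below illustrates that idea under stated assumptions: `minimal_pair_accuracy` and `toy_score` are hypothetical names introduced here, and `toy_score` is only a placeholder, not a real language model.

```python
from typing import Callable, Iterable, Mapping


def minimal_pair_accuracy(
    pairs: Iterable[Mapping[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of pairs where `score` rates sentence_good above sentence_bad."""
    total = 0
    correct = 0
    for pair in pairs:
        total += 1
        if score(pair["sentence_good"]) > score(pair["sentence_bad"]):
            correct += 1
    if total == 0:
        raise ValueError("no pairs given")
    return correct / total


def toy_score(sentence: str) -> float:
    # Placeholder scorer (assumption): shorter sentences get higher scores.
    # A real evaluation would plug in a language model's sentence score here.
    return -float(len(sentence))


pairs = [
    {
        "sentence_good": "Benjamin's tutor was easy to boast about.",
        "sentence_bad": "Benjamin's tutor was certain to boast about.",
    },
]
accuracy = minimal_pair_accuracy(pairs, toy_score)
```

Passing the scorer in as a function keeps the accuracy computation independent of any particular model.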

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

adjunct_island

Size of downloaded dataset files: 0.36 MB
Size of the generated dataset: 0.17 MB
Total amount of disk used: 0.52 MB

An example of "train" looks as follows.

{
    "UID": "tough_vs_raising_1",
    "field": "syntax_semantics",
    "lexically_identical": false,
    "linguistics_term": "control_raising",
    "one_prefix_method": false,
    "pair_id": 2,
    "sentence_bad": "Benjamin's tutor was certain to boast about.",
    "sentence_good": "Benjamin's tutor was easy to boast about.",
    "simple_LM_method": true,
    "two_prefix_method": false
}
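To make the notion of a minimal pair concrete, the sketch below locates the words that differ between the two sentences of the record above. `differing_words` is a hypothetical helper introduced here, and it assumes both sentences have the same number of whitespace-separated words, which may not hold for every pair in the dataset.

```python
from typing import List, Tuple


def differing_words(good: str, bad: str) -> List[Tuple[str, str]]:
    """Return (good_word, bad_word) at each position where the sentences differ.

    Assumption: both sentences split into the same number of words.
    """
    good_words = good.split()
    bad_words = bad.split()
    if len(good_words) != len(bad_words):
        raise ValueError("sentences differ in word count; positional comparison does not apply")
    return [(g, b) for g, b in zip(good_words, bad_words) if g != b]


record = {
    "sentence_good": "Benjamin's tutor was easy to boast about.",
    "sentence_bad": "Benjamin's tutor was certain to boast about.",
}
contrast = differing_words(record["sentence_good"], record["sentence_bad"])
print(contrast)  # [('easy', 'certain')]
```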

anaphor_gender_agreement

Size of downloaded dataset files: 0.44 MB
Size of the generated dataset: 0.14 MB
Total amount of disk used: 0.57 MB

An example of "train" looks as follows.

{
    "UID": "tough_vs_raising_1",
    "field": "syntax_semantics",
    "lexically_identical": false,
    "linguistics_term": "control_raising",
    "one_prefix_method": false,
    "pair_id": 2,
    "sentence_bad": "Benjamin's tutor was certain to boast about.",
    "sentence_good": "Benjamin's tutor was easy to boast about.",
    "simple_LM_method": true,
    "two_prefix_method": false
}

anaphor_number_agreement

Size of downloaded dataset files: 0.45 MB
Size of the generated dataset: 0.14 MB
Total amount of disk used: 0.59 MB

An example of "train" looks as follows.

{
    "UID": "tough_vs_raising_1",
    "field": "syntax_semantics",
    "lexically_identical": false,
    "linguistics_term": "control_raising",
    "one_prefix_method": false,
    "pair_id": 2,
    "sentence_bad": "Benjamin's tutor was certain to boast about.",
    "sentence_good": "Benjamin's tutor was easy to boast about.",
    "simple_LM_method": true,
    "two_prefix_method": false
}

animate_subject_passive

Size of downloaded dataset files: 0.46 MB
Size of the generated dataset: 0.15 MB
Total amount of disk used: 0.61 MB

An example of "train" looks as follows.

{
    "UID": "tough_vs_raising_1",
    "field": "syntax_semantics",
    "lexically_identical": false,
    "linguistics_term": "control_raising",
    "one_prefix_method": false,
    "pair_id": 2,
    "sentence_bad": "Benjamin's tutor was certain to boast about.",
    "sentence_good": "Benjamin's tutor was easy to boast about.",
    "simple_LM_method": true,
    "two_prefix_method": false
}

animate_subject_trans

Size of downloaded dataset files: 0.43 MB
Size of the generated dataset: 0.13 MB
Total amount of disk used: 0.57 MB

An example of "train" looks as follows.

{
    "UID": "tough_vs_raising_1",
    "field": "syntax_semantics",
    "lexically_identical": false,
    "linguistics_term": "control_raising",
    "one_prefix_method": false,
    "pair_id": 2,
    "sentence_bad": "Benjamin's tutor was certain to boast about.",
    "sentence_good": "Benjamin's tutor was easy to boast about.",
    "simple_LM_method": true,
    "two_prefix_method": false
}

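Data Fields

The data fields are the same among all splits.

adjunct_island

sentence_good: a string feature.
sentence_bad: a string feature.
field: a string feature.
linguistics_term: a string feature.
UID: a string feature.
simple_LM_method: a bool feature.
one_prefix_method: a bool feature.
two_prefix_method: a bool feature.
lexically_identical: a bool feature.
pair_id: an int32 feature.

anaphor_gender_agreement

sentence_good: a string feature.
sentence_bad: a string feature.
field: a string feature.
linguistics_term: a string feature.
UID: a string feature.
simple_LM_method: a bool feature.
one_prefix_method: a bool feature.
two_prefix_method: a bool feature.
lexically_identical: a bool feature.
pair_id: an int32 feature.

anaphor_number_agreement

sentence_good: a string feature.
sentence_bad: a string feature.
field: a string feature.
linguistics_term: a string feature.
UID: a string feature.
simple_LM_method: a bool feature.
one_prefix_method: a bool feature.
two_prefix_method: a bool feature.
lexically_identical: a bool feature.
pair_id: an int32 feature.

animate_subject_passive

sentence_good: a string feature.
sentence_bad: a string feature.
field: a string feature.
linguistics_term: a string feature.
UID: a string feature.
simple_LM_method: a bool feature.
one_prefix_method: a bool feature.
two_prefix_method: a bool feature.
lexically_identical: a bool feature.
pair_id: an int32 feature.

animate_subject_trans

sentence_good: a string feature.
sentence_bad: a string feature.
field: a string feature.
linguistics_term: a string feature.
UID: a string feature.
simple_LM_method: a bool feature.
one_prefix_method: a bool feature.
two_prefix_method: a bool feature.
lexically_identical: a bool feature.
pair_id: an int32 feature.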
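The field list above can be turned into a simple record check. The following is a minimal sketch, not part of the dataset's tooling: `EXPECTED_TYPES` and `validate_record` are hypothetical names, and the mapping of "string", "bool", and "int32" to Python's `str`, `bool`, and a range-checked `int` is an assumption made for illustration.

```python
from typing import Any, Dict, List

# Field names and types as listed in the card (Python type mapping is an assumption).
EXPECTED_TYPES = {
    "sentence_good": str,
    "sentence_bad": str,
    "field": str,
    "linguistics_term": str,
    "UID": str,
    "simple_LM_method": bool,
    "one_prefix_method": bool,
    "two_prefix_method": bool,
    "lexically_identical": bool,
    "pair_id": int,
}
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in `record`; an empty list means it passes."""
    problems: List[str] = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        if expected is int and isinstance(value, bool):
            problems.append(f"{name}: expected int, got bool")
        elif not isinstance(value, expected):
            problems.append(f"{name}: expected {expected.__name__}, got {type(value).__name__}")
        elif expected is int and not INT32_MIN <= value <= INT32_MAX:
            problems.append(f"{name}: {value} does not fit in int32")
    return problems


example = {
    "UID": "tough_vs_raising_1",
    "field": "syntax_semantics",
    "lexically_identical": False,
    "linguistics_term": "control_raising",
    "one_prefix_method": False,
    "pair_id": 2,
    "sentence_bad": "Benjamin's tutor was certain to boast about.",
    "sentence_good": "Benjamin's tutor was easy to boast about.",
    "simple_LM_method": True,
    "two_prefix_method": False,
}
problems = validate_record(example)
```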

Data Splits

name	train
adjunct_island	1000
anaphor_gender_agreement	1000
anaphor_number_agreement	1000
animate_subject_passive	1000
animate_subject_trans	1000
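As a quick sanity check on these counts, the sketch below parses the five rows listed above and compares the totals with the figures stated elsewhere in this card (67 sub-datasets of 1000 pairs each, size category 10K<n<100K). The variable names are introduced here for illustration only.

```python
import csv
import io

# The split table as shown in this card (tab-separated); only five configs are listed.
SPLITS_TSV = (
    "name\ttrain\n"
    "adjunct_island\t1000\n"
    "anaphor_gender_agreement\t1000\n"
    "anaphor_number_agreement\t1000\n"
    "animate_subject_passive\t1000\n"
    "animate_subject_trans\t1000\n"
)

rows = list(csv.DictReader(io.StringIO(SPLITS_TSV), delimiter="\t"))
listed_total = sum(int(row["train"]) for row in rows)

# Total implied by the summary: 67 sub-datasets with 1000 pairs each.
full_total = 67 * 1000
in_size_bucket = 10_000 < full_total < 100_000

print(listed_total, full_total, in_size_bucket)  # 5000 67000 True
```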

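Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information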

@article{warstadt2019blimp,
  title={BLiMP: A Benchmark of Linguistic Minimal Pairs for English},
  author={Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei and Wang, Sheng-Fu and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1912.00582},
  year={2019}
}

Contributions

Thanks to @lhoestq, @patrickvonplaten, and @thomwolf for adding this dataset.

Author:

Anonymous

Dataset size:

194.21 KB
