
Dataset Card for X-CSR

Dataset Summary

To evaluate multilingual language models (ML-LMs) for commonsense reasoning in a cross-lingual zero-shot transfer setting (X-CSR), i.e., training in English and testing in other languages, we created two benchmark datasets, namely X-CSQA and X-CODAH. Specifically, we automatically translated the original English-only CSQA and CODAH datasets into 15 other languages, forming development and test sets for studying X-CSR. As our goal is to evaluate different ML-LMs under a unified evaluation protocol, we argue that these translated examples, though possibly noisy, can serve as a starting point for meaningful analysis, until more human-translated datasets become available in the future.

Supported Tasks and Leaderboards

https://inklab.usc.edu//XCSR/leaderboard

Languages

X-CSR covers 16 languages in total: {English, Chinese, German, Spanish, French, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Arabic, Vietnamese, Hindi, Swahili, Urdu}.
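For quick reference, the list above can be kept as a mapping from language codes to language names. The two-letter codes below are our assumption of the standard ISO 639-1 codes for these 16 languages; the dataset's `lang` field (shown in the examples further down) carries one such code per record.

```python
# Assumed ISO 639-1 codes for the 16 X-CSR languages; each record's
# "lang" field holds one of these two-letter codes.
XCSR_LANGUAGES = {
    "en": "English", "zh": "Chinese", "de": "German", "es": "Spanish",
    "fr": "French", "it": "Italian", "ja": "Japanese", "nl": "Dutch",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "ar": "Arabic",
    "vi": "Vietnamese", "hi": "Hindi", "sw": "Swahili", "ur": "Urdu",
}
assert len(XCSR_LANGUAGES) == 16  # one entry per supported language
```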

Dataset Structure

Data Instances

An example from the X-CSQA dataset:

{
  "id": "be1920f7ba5454ad",  # an id shared by all languages
  "lang": "en", # one of the 16 language codes.
  "question": { 
    "stem": "What will happen to your knowledge with more learning?",   # question text
    "choices": [
      {"label": "A",  "text": "headaches" },
      {"label": "B",  "text": "bigger brain" },
      {"label": "C",  "text": "education" },
      {"label": "D",  "text": "growth" },
      {"label": "E",  "text": "knowing more" }
    ] },
  "answerKey": "D"    # hidden for test data.
}
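The inline `#` comments make the snippet above invalid JSON; with the comments stripped, a record parses with the standard library alone. A minimal sketch follows; the `answer_text` helper is ours for illustration, not part of the dataset tooling.

```python
import json

# The X-CSQA record above, with the explanatory comments removed
# so that it is valid JSON.
record_json = """
{
  "id": "be1920f7ba5454ad",
  "lang": "en",
  "question": {
    "stem": "What will happen to your knowledge with more learning?",
    "choices": [
      {"label": "A", "text": "headaches"},
      {"label": "B", "text": "bigger brain"},
      {"label": "C", "text": "education"},
      {"label": "D", "text": "growth"},
      {"label": "E", "text": "knowing more"}
    ]
  },
  "answerKey": "D"
}
"""

def answer_text(record: dict) -> str:
    # Look up the choice whose label matches answerKey
    # (note: answerKey is hidden for test data).
    key = record["answerKey"]
    for choice in record["question"]["choices"]:
        if choice["label"] == key:
            return choice["text"]
    raise KeyError(f"answerKey {key!r} not found among choice labels")

record = json.loads(record_json)
print(answer_text(record))  # → growth
```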

An example from the X-CODAH dataset:

{
  "id": "b8eeef4a823fcd4b",   # an id shared by all languages
  "lang": "en", # one of the 16 language codes.
  "question_tag": "o",  # one of 6 question types
  "question": {
    "stem": " ", # always a blank as a dummy question
    "choices": [
      {"label": "A",
        "text": "Jennifer loves her school very much, she plans to drop every courses."},
      {"label": "B",
        "text": "Jennifer loves her school very much, she is never absent even when she's sick."},
      {"label": "C",
        "text": "Jennifer loves her school very much, she wants to get a part-time job."},
      {"label": "D",
        "text": "Jennifer loves her school very much, she quits school happily."}
    ]
  },
  "answerKey": "B"  # hidden for test data.
}
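Both X-CSQA and X-CODAH are multiple-choice tasks, so model predictions are naturally scored by accuracy against the gold `answerKey` labels. A minimal scoring sketch, assuming predictions keyed by record id (the function and variable names are ours):

```python
def accuracy(predictions: dict, gold: dict) -> float:
    # predictions and gold both map record id -> choice label, e.g. "B".
    correct = sum(predictions.get(rid) == label for rid, label in gold.items())
    return correct / len(gold)

# Toy example using the two record ids shown above.
gold = {"be1920f7ba5454ad": "D", "b8eeef4a823fcd4b": "B"}
preds = {"be1920f7ba5454ad": "D", "b8eeef4a823fcd4b": "A"}
print(accuracy(preds, gold))  # → 0.5
```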

Data Fields

  • id: an id shared across all languages
  • lang: one of the 16 language codes
  • question_tag: one of the 6 question types (X-CODAH only)
  • stem: the question text (for X-CODAH, always a blank placeholder serving as a dummy question)
  • choices: a list of answer options, each with:
    • label: a string identifier for the option
    • text: the text of the option

Data Splits

  • X-CSQA: an English training set of 8,888 examples, plus a development set of 1,000 examples and a test set of 1,074 examples for each language.
  • X-CODAH: an English training set of 8,476 examples, plus a development set of 300 examples and a test set of 1,000 examples for each language.
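As a sanity check, the split sizes above imply the following total example counts per benchmark (the English training set plus per-language dev and test sets across all 16 languages):

```python
# Split sizes as listed above (examples per split).
SPLITS = {
    "X-CSQA": {"train_en": 8888, "dev_per_lang": 1000, "test_per_lang": 1074},
    "X-CODAH": {"train_en": 8476, "dev_per_lang": 300, "test_per_lang": 1000},
}
N_LANGS = 16  # dev and test sets exist for each of the 16 languages

def total_examples(s: dict) -> int:
    # English train set plus dev+test sets for every language.
    return s["train_en"] + N_LANGS * (s["dev_per_lang"] + s["test_per_lang"])

print(total_examples(SPLITS["X-CSQA"]))   # 8888 + 16*(1000+1074) → 42072
print(total_examples(SPLITS["X-CODAH"]))  # 8476 + 16*(300+1000)  → 29276
```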

Dataset Creation

Curation Rationale

To evaluate multilingual language models (ML-LMs) for commonsense reasoning in a cross-lingual zero-shot transfer setting (X-CSR), i.e., training in English and testing in other languages, we created two benchmark datasets, namely X-CSQA and X-CODAH.

Details of the dataset construction, in particular the translation process, can be found in Appendix A of the paper.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation Process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

# X-CSR
@inproceedings{lin-etal-2021-common,
    title = "Common Sense Beyond {E}nglish: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning",
    author = "Lin, Bill Yuchen  and
      Lee, Seyeon  and
      Qiao, Xiaoyang  and
      Ren, Xiang",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.102",
    doi = "10.18653/v1/2021.acl-long.102",
    pages = "1274--1287",
    abstract = "Commonsense reasoning research has so far been limited to English. We aim to evaluate and improve popular multilingual language models (ML-LMs) to help advance commonsense reasoning (CSR) beyond English. We collect the Mickey corpus, consisting of 561k sentences in 11 different languages, which can be used for analyzing and improving ML-LMs. We propose Mickey Probe, a language-general probing task for fairly evaluating the common sense of popular ML-LMs across different languages. In addition, we also create two new datasets, X-CSQA and X-CODAH, by translating their English versions to 14 other languages, so that we can evaluate popular ML-LMs for cross-lingual commonsense reasoning. To improve the performance beyond English, we propose a simple yet effective method {---} multilingual contrastive pretraining (MCP). It significantly enhances sentence representations, yielding a large performance gain on both benchmarks (e.g., +2.7{\%} accuracy for X-CSQA over XLM-R{\_}L).",
}

# CSQA
@inproceedings{Talmor2019commonsenseqaaq,
    address = {Minneapolis, Minnesota},
    author = {Talmor, Alon  and Herzig, Jonathan  and Lourie, Nicholas and Berant, Jonathan},
    booktitle = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
    doi = {10.18653/v1/N19-1421},
    pages = {4149--4158},
    publisher = {Association for Computational Linguistics},
    title = {CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge},
    url = {https://www.aclweb.org/anthology/N19-1421},
    year = {2019}
}

# CODAH
@inproceedings{Chen2019CODAHAA,
    address = {Minneapolis, USA},
    author = {Chen, Michael  and D{'}Arcy, Mike  and Liu, Alisa  and Fernandez, Jared  and Downey, Doug},
    booktitle = {Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for {NLP}},
    doi = {10.18653/v1/W19-2008},
    pages = {63--69},
    publisher = {Association for Computational Linguistics},
    title = {CODAH: An Adversarially-Authored Question Answering Dataset for Common Sense},
    url = {https://www.aclweb.org/anthology/W19-2008},
    year = {2019}
}

Contributions

Thanks to Bill Yuchen Lin, Seyeon Lee, Xiaoyang Qiao, and Xiang Ren for adding this dataset.