Dataset: adversarial_qa
Task: Question Answering
Language: en
Multilinguality: monolingual
Size: 10K<n<100K
Language Creators: found
Annotation Creators: crowdsourced
Source Datasets: original

Dataset Card for adversarialQA

Dataset Summary

We have created three new reading comprehension datasets constructed using an adversarial model-in-the-loop.

We use three different models in the annotation loop: BiDAF (Seo et al., 2016), BERT-Large (Devlin et al., 2018), and RoBERTa-Large (Liu et al., 2019), and construct three datasets: D(BiDAF), D(BERT), and D(RoBERTa), each with 10,000 training examples, 1,000 validation examples, and 1,000 test examples.

The adversarial human-annotation paradigm ensures that these datasets consist of questions that current state-of-the-art models (at least the ones used as adversaries in the annotation loop) find challenging. The three AdversarialQA round 1 datasets provide a training and evaluation resource for such methods.

Supported Tasks and Leaderboards

Extractive question answering (extractive-qa): the dataset can be used to train a model for extractive question answering, which involves selecting the answer to a question from a passage. Success on this task is typically measured by achieving a high word-overlap F1 score. A model trained on the full dataset currently achieves an F1 score of 64.35%. The task has an active leaderboard, available as round 1 of the QA task on Dynabench, which ranks models by F1 score.
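For concreteness, here is a minimal sketch in Python of the word-overlap F1 metric mentioned above. It keeps only lowercasing and whitespace tokenization, omitting the punctuation and article normalization applied by the official SQuAD evaluation script, so treat it as illustrative rather than as the official scorer.

from collections import Counter

def word_overlap_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection: tokens shared by prediction and gold.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(word_overlap_f1("organic compounds", "important organic compounds"))  # 0.8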

Languages

The text in the dataset is in English. The associated BCP-47 code is en.

Dataset Structure

Data Instances

Data is provided in the same format as SQuAD 1.1. An example is shown below:

{
  "data": [
    {
      "title": "Oxygen",
      "paragraphs": [
        {
          "context": "Among the most important classes of organic compounds that contain oxygen are (where \"R\" is an organic group): alcohols (R-OH); ethers (R-O-R); ketones (R-CO-R); aldehydes (R-CO-H); carboxylic acids (R-COOH); esters (R-COO-R); acid anhydrides (R-CO-O-CO-R); and amides (R-C(O)-NR2). There are many important organic solvents that contain oxygen, including: acetone, methanol, ethanol, isopropanol, furan, THF, diethyl ether, dioxane, ethyl acetate, DMF, DMSO, acetic acid, and formic acid. Acetone ((CH3)2CO) and phenol (C6H5OH) are used as feeder materials in the synthesis of many different substances. Other important organic compounds that contain oxygen are: glycerol, formaldehyde, glutaraldehyde, citric acid, acetic anhydride, and acetamide. Epoxides are ethers in which the oxygen atom is part of a ring of three atoms.",
          "qas": [
            {
              "id": "22bbe104aa72aa9b511dd53237deb11afa14d6e3",
              "question": "In addition to having oxygen, what do alcohols, ethers and esters have in common, according to the article?",
              "answers": [
                {
                  "answer_start": 36,
                  "text": "organic compounds"
                }
              ]
            },
            {
              "id": "4240a8e708c703796347a3702cf1463eed05584a",
              "question": "What letter does the abbreviation for acid anhydrides both begin and end in?",
              "answers": [
                {
                  "answer_start": 244,
                  "text": "R"
                }
              ]
            },
            {
              "id": "0681a0a5ec852ec6920d6a30f7ef65dced493366",
              "question": "Which of the organic compounds, in the article, contains nitrogen?",
              "answers": [
                {
                  "answer_start": 262,
                  "text": "amides"
                }
              ]
            },
            {
              "id": "2990efe1a56ccf81938fa5e18104f7d3803069fb",
              "question": "Which of the important classes of organic compounds, in the article, has a number in its abbreviation?",
              "answers": [
                {
                  "answer_start": 262,
                  "text": "amides"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
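The subsets can typically be loaded with the Hugging Face datasets library, as in the sketch below. The configuration names ("dbidaf" here; "dbert", "droberta", and the combined "adversarialQA" are the other common ones) reflect the usual Hub naming for this dataset and should be checked against the current Hub listing.

from datasets import load_dataset

# Load the D(BiDAF) subset; swap the config name for the other subsets.
dbidaf = load_dataset("adversarial_qa", "dbidaf")

print(dbidaf)                  # DatasetDict with train/validation/test splits
example = dbidaf["train"][0]
print(example["question"])
print(example["answers"])      # in the loaded form, a dict of text/answer_start lists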

Data Fields

  • title: the title of the Wikipedia page from which the context is sourced
  • context: the context/passage
  • id: a string identifier for each question
  • answers: a list of all provided answers (one per question in our case, but multiple may exist in SQuAD), with an answer_start field (the starting character index of the answer span) and a text field (the answer text).

Note that no answers are provided in the test set. Indeed, this dataset is part of the DynaBench benchmark, for which you can submit your predictions on the website.
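Since answer_start is a character offset into context, the answer text can be recovered by slicing, as this minimal sketch (using a truncated version of the example context above) shows:

# The answer span is addressed by character offsets into the context string.
context = ("Among the most important classes of organic compounds "
           "that contain oxygen are ...")  # truncated for brevity
answer = {"answer_start": 36, "text": "organic compounds"}

start = answer["answer_start"]
end = start + len(answer["text"])
assert context[start:end] == answer["text"]  # "organic compounds"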

Data Splits

The dataset is composed of three distinct subsets constructed using different models in the loop: BiDAF, BERT-Large, and RoBERTa-Large. Each of these has 10,000 training examples, 1,000 validation examples, and 1,000 test examples, for a total of 30,000/3,000/3,000 train/validation/test examples.

Dataset Creation

Curation Rationale

This dataset was collected to provide a more challenging and diverse reading comprehension dataset for state-of-the-art models.

Source Data

Initial Data Collection and Normalization

The source passages are from Wikipedia and are the same as those used in SQuAD v1.1.

Who are the source language producers?

The source passages were produced by Wikipedia editors; the questions were written by human annotators on Mechanical Turk.

Annotations

Annotation process

The dataset was collected through an adversarial human annotation process that pairs a human annotator with a reading comprehension model in an interactive setting. The annotator is presented with a passage, for which they write a question and highlight the correct answer. The model then attempts to answer the question; if it fails to answer correctly, the annotator wins. Otherwise, the annotator revises or rewrites the question until the model is successfully fooled.
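The loop can be summarized with the following illustrative pseudocode. Here model and annotator are hypothetical interfaces standing in for the model-in-the-loop and the human, not part of any released API, and the equality check is a simplification of the word-overlap criterion actually used to decide whether the model was fooled.

def adversarial_annotation(passage, model, annotator, max_attempts=10):
    """One round of the model-in-the-loop annotation game (illustrative only)."""
    for _ in range(max_attempts):
        question, answer = annotator.write_question(passage)  # human writes a question, highlights an answer
        prediction = model.answer(passage, question)          # model attempts to answer
        if prediction != answer:                              # simplified "model fooled" check
            return {"context": passage, "question": question, "answers": [answer]}
        # Otherwise the annotator revises or rewrites the question and tries again.
    return None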

Who are the annotators?

The annotators are crowdworkers from Amazon Mechanical Turk, geographically restricted to the US, UK, and Canada, who had previously completed at least 1,000 HITs successfully and had a HIT approval rate above 98%. Crowdworkers underwent intensive training and qualification prior to annotation.

Personal and Sensitive Information

No identifying details of the annotators are provided.

Considerations for Using the Data

Social Impact of Dataset

The purpose of this dataset is to help develop better question answering systems.

A system that succeeds at the supported task would be able to provide accurate extractive answers from a short passage. This dataset should be seen as a test bed for questions that current state-of-the-art models struggle to answer correctly, and which therefore often require comprehension abilities more complex than simply detecting phrases with high overlap with the question.

It should be noted, however, that the source passages are both domain-restricted and linguistically specific, and that the provided questions and answers do not constitute any particular social application.

Discussion of Biases

The dataset may exhibit various biases, in terms of the selection of source passages, the annotated questions and answers, and algorithmic biases resulting from the adversarial annotation protocol.

Other Known Limitations

N/A

Additional Information

Dataset Curators

This dataset was initially created by Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp, during work carried out at University College London (UCL).

Licensing Information

This dataset is distributed under the CC BY-SA 3.0 license.

Citation Information

@article{bartolo2020beat,
    author = {Bartolo, Max and Roberts, Alastair and Welbl, Johannes and Riedel, Sebastian and Stenetorp, Pontus},
    title = {Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension},
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {8},
    number = {},
    pages = {662-678},
    year = {2020},
    doi = {10.1162/tacl\_a\_00338},
    URL = { https://doi.org/10.1162/tacl_a_00338 },
    eprint = { https://doi.org/10.1162/tacl_a_00338 },
    abstract = { Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: Humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop. This allows us to explore questions such as the reproducibility of the adversarial effect, transfer from data collected with varying model-in-the-loop strengths, and generalization to data collected without a model. We find that training on adversarially collected samples leads to strong generalization to non-adversarially collected datasets, yet with progressive performance deterioration with increasingly stronger models-in-the-loop. Furthermore, we find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop. When trained on data collected with a BiDAF model in the loop, RoBERTa achieves 39.9F1 on questions that it cannot answer when trained on SQuAD—only marginally lower than when trained on data collected using RoBERTa itself (41.0F1). }
}

Contributions

Thanks to @maxbartolo for adding this dataset.