Dataset:

clips/mfaq

Tasks:

Question Answering

Multilinguality:

multilingual

Language creators:

other

Annotations creators:

no-annotation

Source datasets:

original

ArXiv:

arxiv:2109.12870

License:

cc0-1.0

MFAQ

See MQA or MFAQ Light for an updated version of the dataset.

MFAQ is a multilingual corpus of Frequently Asked Questions parsed from Common Crawl.

from datasets import load_dataset
load_dataset("clips/mfaq", "en")
{
  "qa_pairs": [
    {
      "question": "Do  I need a rental Car in Cork?",
      "answer": "If you plan on travelling outside of Cork City, for instance to  Kinsale [...]"
    },
    ...
  ]
}

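Languages

We collected around 6M pairs of questions and answers in 21 different languages. To download a language-specific subset, specify the language key as the configuration name. An example follows.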

load_dataset("clips/mfaq", "en") # replace "en" by any language listed below
| Language   | Key | Pairs     | Pages     |
|------------|-----|-----------|-----------|
| All        | all | 6,346,693 | 1,035,649 |
| English    | en  | 3,719,484 | 608,796   |
| German     | de  | 829,098   | 111,618   |
| Spanish    | es  | 482,818   | 75,489    |
| French     | fr  | 351,458   | 56,317    |
| Italian    | it  | 155,296   | 24,562    |
| Dutch      | nl  | 150,819   | 32,574    |
| Portuguese | pt  | 138,778   | 26,169    |
| Turkish    | tr  | 102,373   | 19,002    |
| Russian    | ru  | 91,771    | 22,643    |
| Polish     | pl  | 65,182    | 10,695    |
| Indonesian | id  | 45,839    | 7,910     |
| Norwegian  | no  | 37,711    | 5,143     |
| Swedish    | sv  | 37,003    | 5,270     |
| Danish     | da  | 32,655    | 5,279     |
| Vietnamese | vi  | 27,157    | 5,261     |
| Finnish    | fi  | 20,485    | 2,795     |
| Romanian   | ro  | 17,066    | 3,554     |
| Czech      | cs  | 16,675    | 2,568     |
| Hebrew     | he  | 11,212    | 1,921     |
| Hungarian  | hu  | 8,598     | 1,264     |
| Croatian   | hr  | 5,215     | 819       |

Data Fields

Nested (per page, default)

The data is organized by page. Each page contains a list of questions and answers.

  • id
  • language
  • num_pairs: total number of FAQs on the page
  • domain: source web domain of the FAQs
  • qa_pairs: a list of questions and answers
    • question
    • answer
    • language
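
To make the nested layout concrete, here is a minimal sketch of walking one page record. The sample record is made up for illustration (the id and domain values are placeholders, and the answer is abbreviated as in the example above), the helper names are hypothetical, and the check that num_pairs equals the length of qa_pairs is an assumption based on the field description rather than a documented guarantee.

```python
def iter_pairs(page):
    """Yield (question, answer) tuples from one nested page record."""
    for pair in page["qa_pairs"]:
        yield pair["question"], pair["answer"]


def pair_count_matches(page):
    """Assumption: num_pairs should equal the number of entries in qa_pairs."""
    return page["num_pairs"] == len(page["qa_pairs"])


# Hypothetical record following the field list above; the values are placeholders.
sample_page = {
    "id": "example-id",
    "language": "en",
    "num_pairs": 1,
    "domain": "example.com",
    "qa_pairs": [
        {
            "question": "Do I need a rental car in Cork?",
            "answer": "If you plan on travelling outside of Cork City [...]",
            "language": "en",
        },
    ],
}

pairs = list(iter_pairs(sample_page))
```

The same helper could be applied to each row returned by a nested configuration such as "en".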
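
Flattened

The data is organized by pair (i.e. pages are flattened). You can access the flat version of any language by appending _flat to the configuration name (e.g. en_flat). The data is then returned pair by pair instead of page by page.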

  • domain_id
  • pair_id
  • language
  • domain: source web domain of the FAQs
  • question
  • answer
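
The naming rule for configurations (a language key, optionally followed by _flat) can be wrapped in a small helper. This is only a sketch: mfaq_config and load_mfaq are hypothetical names, the key set is copied from the language table, and whether "all_flat" is a valid configuration is an assumption that should be checked against the dataset itself.

```python
# Language keys copied from the table above ("all" plus the 21 languages).
LANGUAGE_KEYS = {
    "all", "en", "de", "es", "fr", "it", "nl", "pt", "tr", "ru", "pl",
    "id", "no", "sv", "da", "vi", "fi", "ro", "cs", "he", "hu", "hr",
}


def mfaq_config(language="en", flat=False):
    """Build a configuration name such as "en" or "en_flat"."""
    if language not in LANGUAGE_KEYS:
        raise ValueError(f"Unknown MFAQ language key: {language!r}")
    # Assumption: the _flat suffix applies uniformly, including to "all".
    return f"{language}_flat" if flat else language


def load_mfaq(language="en", flat=False):
    """Load one MFAQ configuration (requires the datasets package and network access)."""
    from datasets import load_dataset  # imported lazily so the helper above works offline

    return load_dataset("clips/mfaq", mfaq_config(language, flat))
```

For example, load_mfaq("nl", flat=True) would request the "nl_flat" configuration and return rows with the flattened fields listed above.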

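Source Data

This section was adapted from the source data description of OSCAR.

Common Crawl is a non-profit foundation that produces and maintains an open repository of web crawl data that is both accessible and analysable. Common Crawl's complete web archive consists of petabytes of data collected over 8 years of web crawling. The organisation's crawlers have always respected nofollow and robots.txt policies.

To construct MFAQ, we used the WARC files of Common Crawl. We looked for FAQPage markup in the HTML and parsed the FAQItem entries from each page.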
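
The extraction step described above can be illustrated with a rough sketch. This is not the authors' parser: it assumes the FAQPage markup is schema.org JSON-LD inside a script tag, with each mainEntity entry carrying a name and an acceptedAnswer.text. Microdata or RDFa markup and the reading of WARC files are out of scope, and JsonLdCollector and extract_faq_pairs are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_faq_pairs(html):
    """Return question/answer dicts found in FAQPage JSON-LD blocks (assumed layout)."""
    collector = JsonLdCollector()
    collector.feed(html)
    pairs = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict) or item.get("@type") != "FAQPage":
                continue
            entities = item.get("mainEntity", [])
            if isinstance(entities, dict):
                entities = [entities]
            for entity in entities:
                if not isinstance(entity, dict):
                    continue
                answer = entity.get("acceptedAnswer")
                question_text = entity.get("name")
                answer_text = answer.get("text") if isinstance(answer, dict) else None
                if question_text and answer_text:
                    pairs.append({"question": question_text, "answer": answer_text})
    return pairs
```

Calling extract_faq_pairs on the HTML of a single page returns a list shaped like the qa_pairs field, minus the language tag, which this sketch does not attempt to detect.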

People

This dataset was developed by Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann and Walter Daelemans.

Licensing Information

These data are released under this licensing scheme.
We do not own any of the text from which these data have been extracted.
We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Citation Information

@misc{debruyn2021mfaq,
      title={MFAQ: a Multilingual FAQ Dataset}, 
      author={Maxime {De Bruyn} and Ehsan Lotfi and Jeska Buhmann and Walter Daelemans},
      year={2021},
      eprint={2109.12870},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}