Datasets:

csebuetnlp/xlsum

Multilinguality:

multilingual

Size Categories:

1M<n<10M

Language Creators:

found

Annotations Creators:

found

Source Datasets:

original

ArXiv:

arxiv:1607.01759

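Dataset Card for "XL-Sum"

Dataset Summary

We present XLSum, a dataset of 1.35 million professionally annotated article-summary pairs from the BBC, extracted using a set of carefully designed heuristics. The dataset covers 45 languages, ranging from low-resource to high-resource, and for many of them no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.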
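One common intrinsic proxy for how abstractive a summary is measures the share of summary n-grams that never occur in the article. The sketch below illustrates that idea only; whether it matches the exact metric used in the authors' evaluation is an assumption, and the function names `ngrams` and `novel_ngram_ratio` are hypothetical. Whitespace tokenization is also a simplification that does not suit languages written without spaces, such as Chinese, Japanese, or Thai.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(article: str, summary: str, n: int = 1) -> float:
    """Fraction of distinct summary n-grams that do not appear in the article.

    0.0 means every summary n-gram is copied from the article; values near
    1.0 suggest a more abstractive summary. Tokenization is a plain
    lowercase whitespace split (an assumption made for illustration).
    """
    article_ngrams = ngrams(article.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - article_ngrams
    return len(novel) / len(summary_ngrams)


if __name__ == "__main__":
    article = "the cat sat on the mat"
    summary = "the cat slept"
    print(novel_ngram_ratio(article, summary, n=1))  # one of three unigrams is new
    print(novel_ngram_ratio(article, summary, n=2))  # one of two bigrams is new
```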

Supported Tasks and Leaderboards

More information needed

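Languages

  • Amharic
  • Arabic
  • Azerbaijani
  • Bengali
  • Burmese
  • Chinese (Simplified)
  • Chinese (Traditional)
  • English
  • French
  • Gujarati
  • Hausa
  • Hindi
  • Igbo
  • Indonesian
  • Japanese
  • Kirundi
  • Korean
  • Kyrgyz
  • Marathi
  • Nepali
  • Oromo
  • Pashto
  • Persian
  • Pidgin
  • Portuguese
  • Punjabi
  • Russian
  • Scottish Gaelic
  • Serbian (Cyrillic)
  • Serbian (Latin)
  • Sinhala
  • Somali
  • Spanish
  • Swahili
  • Tamil
  • Telugu
  • Thai
  • Tigrinya
  • Turkish
  • Ukrainian
  • Urdu
  • Uzbek
  • Vietnamese
  • Welsh
  • Yoruba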
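A minimal sketch of how one language subset might be loaded, assuming the Hugging Face `datasets` library is installed and that the Hub repository `csebuetnlp/xlsum` exposes one configuration per language, named in lowercase with underscores (for example `chinese_simplified`). That naming scheme is an assumption to check against the Hub page, and the helper names `config_name` and `load_language` are hypothetical.

```python
import re


def config_name(language: str) -> str:
    """Turn a display name such as 'Chinese (Simplified)' into 'chinese_simplified'.

    The lowercase-with-underscores naming is an assumption about the Hub
    configurations, not something stated in this card.
    """
    words = re.findall(r"[A-Za-z]+", language)
    return "_".join(word.lower() for word in words)


def load_language(language: str):
    """Load all splits for one language (needs network access and `datasets`)."""
    # Imported lazily so config_name() works without the library installed.
    from datasets import load_dataset

    return load_dataset("csebuetnlp/xlsum", config_name(language))


if __name__ == "__main__":
    print(config_name("Scottish Gaelic"))  # scottish_gaelic
```

Depending on the installed `datasets` version, loading may need extra options or may not be supported for script-based repositories, so treat `load_language` as a starting point rather than a guaranteed call.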

Dataset Structure

Data Instances

One example from the English dataset is given below in JSON format.

{
  "id": "technology-17657859",
  "url": "https://www.bbc.com/news/technology-17657859",
  "title": "Yahoo files e-book advert system patent applications",
  "summary": "Yahoo has signalled it is investigating e-book adverts as a way to stimulate its earnings.",
  "text": "Yahoo's patents suggest users could weigh the type of ads against the sizes of discount before purchase. It says in two US patent applications that ads for digital book readers have been \"less than optimal\" to date. The filings suggest that users could be offered titles at a variety of prices depending on the ads' prominence They add that the products shown could be determined by the type of book being read, or even the contents of a specific chapter, phrase or word. The paperwork was published by the US Patent and Trademark Office late last week and relates to work carried out at the firm's headquarters in Sunnyvale, California. \"Greater levels of advertising, which may be more valuable to an advertiser and potentially more distracting to an e-book reader, may warrant higher discounts,\" it states. Free books It suggests users could be offered ads as hyperlinks based within the book's text, in-laid text or even \"dynamic content\" such as video. Another idea suggests boxes at the bottom of a page could trail later chapters or quotes saying \"brought to you by Company A\". It adds that the more willing the customer is to see the ads, the greater the potential discount. \"Higher frequencies... may even be great enough to allow the e-book to be obtained for free,\" it states. The authors write that the type of ad could influence the value of the discount, with \"lower class advertising... such as teeth whitener advertisements\" offering a cheaper price than \"high\" or \"middle class\" adverts, for things like pizza. The inventors also suggest that ads could be linked to the mood or emotional state the reader is in as a they progress through a title. For example, they say if characters fall in love or show affection during a chapter, then ads for flowers or entertainment could be triggered. The patents also suggest this could applied to children's books - giving the Tom Hanks animated film Polar Express as an example. 
It says a scene showing a waiter giving the protagonists hot drinks \"may be an excellent opportunity to show an advertisement for hot cocoa, or a branded chocolate bar\". Another example states: \"If the setting includes young characters, a Coke advertisement could be provided, inviting the reader to enjoy a glass of Coke with his book, and providing a graphic of a cool glass.\" It adds that such targeting could be further enhanced by taking account of previous titles the owner has bought. 'Advertising-free zone' At present, several Amazon and Kobo e-book readers offer full-screen adverts when the device is switched off and show smaller ads on their menu screens, but the main text of the titles remains free of marketing. Yahoo does not currently provide ads to these devices, and a move into the area could boost its shrinking revenues. However, Philip Jones, deputy editor of the Bookseller magazine, said that the internet firm might struggle to get some of its ideas adopted. \"This has been mooted before and was fairly well decried,\" he said. \"Perhaps in a limited context it could work if the merchandise was strongly related to the title and was kept away from the text. \"But readers - particularly parents - like the fact that reading is an advertising-free zone. Authors would also want something to say about ads interrupting their narrative flow.\""
}

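Data Fields

  • 'id': A string representing the article ID.
  • 'url': A string representing the article URL.
  • 'title': A string containing the article title.
  • 'summary': A string containing the article summary.
  • 'text': A string containing the article text.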
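The field list above can be turned into a small sanity check for records before they are used for training. This is only an illustrative sketch: the function name `find_problems` is hypothetical, and the sample record is a shortened copy of the English example with the article text truncated.

```python
import json

# Field names taken from the data-fields list; all five hold strings.
EXPECTED_FIELDS = ("id", "url", "title", "summary", "text")


def find_problems(record: dict) -> list:
    """Return human-readable problems; an empty list means the record looks fine."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"field {field!r} is not a string")
    return problems


# Shortened stand-in for the English example (article text truncated here).
sample = json.loads("""
{
  "id": "technology-17657859",
  "url": "https://www.bbc.com/news/technology-17657859",
  "title": "Yahoo files e-book advert system patent applications",
  "summary": "Yahoo has signalled it is investigating e-book adverts as a way to stimulate its earnings.",
  "text": "Yahoo's patents suggest users could weigh the type of ads against the sizes of discount before purchase."
}
""")

print(find_problems(sample))            # []
print(find_problems({"id": 17657859}))  # one type problem plus four missing fields
```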

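Data Splits

We used an 80%-10%-10% split for all languages, with a few exceptions. English was split 93%-3.5%-3.5% so that its evaluation set sizes resemble those of CNN/DM and XSum. Scottish Gaelic, Kyrgyz, and Sinhala had relatively few samples, so their evaluation sets were increased to 500 samples each for more reliable evaluation. To prevent data leakage in multilingual training, the same articles were used for evaluation in the two variants of Chinese and in the two variants of Serbian. The train-dev-test example counts for each language are given below: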

Language                ISO 639-1 Code  Train   Dev    Test   Total
Amharic                 am              5761    719    719    7199
Arabic                  ar              37519   4689   4689   46897
Azerbaijani             az              6478    809    809    8096
Bengali                 bn              8102    1012   1012   10126
Burmese                 my              4569    570    570    5709
Chinese (Simplified)    zh-CN           37362   4670   4670   46702
Chinese (Traditional)   zh-TW           37373   4670   4670   46713
English *               en              306522  11535  11535  329592
French                  fr              8697    1086   1086   10869
Gujarati                gu              9119    1139   1139   11397
Hausa                   ha              6418    802    802    8022
Hindi                   hi              70778   8847   8847   88472
Igbo                    ig              4183    522    522    5227
Indonesian              id              38242   4780   4780   47802
Japanese                ja              7113    889    889    8891
Kirundi                 rn              5746    718    718    7182
Korean                  ko              4407    550    550    5507
Kyrgyz                  ky              2266    500    500    3266
Marathi                 mr              10903   1362   1362   13627
Nepali                  ne              5808    725    725    7258
Oromo                   om              6063    757    757    7577
Pashto                  ps              14353   1794   1794   17941
Persian                 fa              47251   5906   5906   59063
Pidgin **               n/a             9208    1151   1151   11510
Portuguese              pt              57402   7175   7175   71752
Punjabi                 pa              8215    1026   1026   10267
Russian *               ru              62243   7780   7780   77803
Scottish Gaelic         gd              1313    500    500    2313
Serbian (Cyrillic)      sr              7275    909    909    9093
Serbian (Latin)         sr              7276    909    909    9094
Sinhala                 si              3249    500    500    4249
Somali                  so              5962    745    745    7452
Spanish                 es              38110   4763   4763   47636
Swahili                 sw              7898    987    987    9872
Tamil                   ta              16222   2027   2027   20276
Telugu                  te              10421   1302   1302   13025
Thai                    th              6616    826    826    8268
Tigrinya                ti              5451    681    681    6813
Turkish                 tr              27176   3397   3397   33970
Ukrainian               uk              43201   5399   5399   53999
Urdu                    ur              67665   8458   8458   84581
Uzbek                   uz              4728    590    590    5908
Vietnamese              vi              32111   4013   4013   40137
Welsh                   cy              9732    1216   1216   12164
Yoruba                  yo              6350    793    793    7936

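* Many articles published on BBC Sinhala and BBC Ukrainian were written in English and Russian, respectively. They were identified using fastText and moved to the matching language.

** West African Pidgin English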
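The split rule described in this section can be approximated with a few lines of integer arithmetic. The sketch below is not the authors' script: the function name `plan_split`, the per-mille parameter, and the rounding-down behaviour are assumptions chosen because they reproduce the counts listed for the rows checked here.

```python
def plan_split(total: int, eval_per_mille: int = 100, min_eval: int = 500):
    """Return (train, dev, test) sizes for a language with `total` pairs.

    eval_per_mille is the share of each evaluation set in thousandths
    (100 = 10%, 35 = 3.5%); integer arithmetic avoids float rounding.
    Dev and test are raised to `min_eval` when the share would be smaller,
    and training takes whatever remains.
    """
    eval_size = max(total * eval_per_mille // 1000, min_eval)
    train_size = total - 2 * eval_size
    if train_size <= 0:
        raise ValueError("total is too small for two evaluation sets")
    return train_size, eval_size, eval_size


# Totals taken from the table above.
print(plan_split(7199))                       # Amharic: (5761, 719, 719)
print(plan_split(3266))                       # Kyrgyz: (2266, 500, 500)
print(plan_split(329592, eval_per_mille=35))  # English: (306522, 11535, 11535)
```

Under this rule Chinese (Traditional), with 46713 pairs, would get 4671 evaluation samples rather than the listed 4670, which fits the statement that both Chinese variants share the same evaluation articles.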

Dataset Creation

Curation Rationale

More information needed

Source Data

BBC News

Initial Data Collection and Normalization

Detailed in the paper

Who are the source language producers?

Detailed in the paper

Annotations

Detailed in the paper

Annotation process

Detailed in the paper

Who are the annotators?

Detailed in the paper

Personal and Sensitive Information

More information needed

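Considerations for Using the Data

Social Impact of Dataset

More information needed

Discussion of Biases

More information needed

Other Known Limitations

More information needed

Additional Information

Dataset Curators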

More information needed

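Licensing Information

The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

If you use any of the datasets, models, or code modules, please cite the following paper: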

@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid  and
      Bhattacharjee, Abhik  and
      Islam, Md. Saiful  and
      Mubasshir, Kazi  and
      Li, Yuan-Fang  and
      Kang, Yong-Bin  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.413",
    pages = "4693--4703",
}

Contributions

Thanks to @abhik1505040 and @Tahmid for adding this dataset.