
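Dataset Card for XGLUE

Dataset Summary

XGLUE is a new benchmark dataset for evaluating the performance of cross-lingual pre-trained models on cross-lingual natural language understanding and generation.

XGLUE consists of 11 tasks and covers 19 languages. For each task, the training data is available only in English. To succeed on XGLUE, a model must therefore have strong zero-shot cross-lingual transfer capability: it learns from the English data of a specific task and transfers what it learned to other languages. Compared with its concurrent work XTREME, XGLUE has two characteristics. First, it includes both cross-lingual natural language understanding (NLU) and cross-lingual natural language generation (NLG) tasks. Second, besides 5 existing cross-lingual tasks (NER, POS, MLQA, PAWS-X, and XNLI), XGLUE selects 6 new tasks from Bing scenarios: News Classification (NC), Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM), Question Generation (QG), and News Title Generation (NTG). This diversity of languages, tasks, and task origins provides a comprehensive benchmark for evaluating the quality of pre-trained models on cross-lingual natural language understanding and generation.

The training data of each config is in English, while the validation and test data is available in multiple languages. The Languages section below lists which languages are present in the validation and test data of each config.

Therefore, for each config, a cross-lingual pre-trained model should be fine-tuned on the English training data and evaluated on all languages.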

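Supported Tasks and Leaderboards

The XGLUE leaderboard can be found on the homepage. It consists of an XGLUE-Understanding score (the average of the scores on the tasks ner, pos, mlqa, nc, xnli, paws-x, qadsm, wpr, qam) and an XGLUE-Generation score (the average of the scores on the tasks qg, ntg).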
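
As a rough illustration of the two leaderboard scores described above, the sketch below averages per-task scores into an understanding score and a generation score. The function name `xglue_scores` and the example numbers are hypothetical. How each per-task score is obtained (for example, how languages are aggregated within a task) is not specified in this card and is left to the caller.

```python
from statistics import mean

# Task groupings as listed for the leaderboard in this card.
UNDERSTANDING_TASKS = ["ner", "pos", "mlqa", "nc", "xnli", "paws-x", "qadsm", "wpr", "qam"]
GENERATION_TASKS = ["qg", "ntg"]


def xglue_scores(task_scores):
    """Return (understanding, generation) as plain averages of per-task scores.

    `task_scores` maps a task name to one number. How that number is
    computed for each task is an assumption left to the caller.
    """
    missing = [t for t in UNDERSTANDING_TASKS + GENERATION_TASKS if t not in task_scores]
    if missing:
        raise KeyError(f"missing task scores: {missing}")
    understanding = mean(task_scores[t] for t in UNDERSTANDING_TASKS)
    generation = mean(task_scores[t] for t in GENERATION_TASKS)
    return understanding, generation


# Hypothetical per-task scores, for illustration only.
example_scores = {task: 80.0 for task in UNDERSTANDING_TASKS}
example_scores.update({"qg": 10.0, "ntg": 12.0})
understanding, generation = xglue_scores(example_scores)
print(understanding, generation)
```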

Languages

For all tasks (configurations), the "train" split is in English (en).

For each task, the "validation" and "test" splits are available in the following languages:

  • ner: en, de, es, nl
  • pos: en, de, es, nl, bg, el, fr, pl, tr, vi, zh, ur, hi, it, ar, ru, th
  • mlqa: en, de, ar, es, hi, vi, zh
  • nc: en, de, es, fr, ru
  • xnli: en, ar, bg, de, el, es, fr, hi, ru, sw, th, tr, ur, vi, zh
  • paws-x: en, de, es, fr
  • qadsm: en, de, fr
  • wpr: en, de, es, fr, it, pt, zh
  • qam: en, de, fr
  • qg: en, de, es, fr, it, pt
  • ntg: en, de, es, fr, ru
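
The language list above can be turned into split names of the form `validation.<lang>` and `test.<lang>`, which is the naming used by the examples and split tables later in this card. The following is a minimal sketch: the names `TASK_LANGUAGES` and `split_names` are hypothetical helpers, not part of any library.

```python
# Languages of the "validation" and "test" splits, copied from the list above.
TASK_LANGUAGES = {
    "ner": ["en", "de", "es", "nl"],
    "pos": ["en", "de", "es", "nl", "bg", "el", "fr", "pl", "tr",
            "vi", "zh", "ur", "hi", "it", "ar", "ru", "th"],
    "mlqa": ["en", "de", "ar", "es", "hi", "vi", "zh"],
    "nc": ["en", "de", "es", "fr", "ru"],
    "xnli": ["en", "ar", "bg", "de", "el", "es", "fr", "hi", "ru",
             "sw", "th", "tr", "ur", "vi", "zh"],
    "paws-x": ["en", "de", "es", "fr"],
    "qadsm": ["en", "de", "fr"],
    "wpr": ["en", "de", "es", "fr", "it", "pt", "zh"],
    "qam": ["en", "de", "fr"],
    "qg": ["en", "de", "es", "fr", "it", "pt"],
    "ntg": ["en", "de", "es", "fr", "ru"],
}


def split_names(task):
    """Return "train" followed by validation.<lang> and test.<lang> for a task."""
    langs = TASK_LANGUAGES[task]
    return (["train"]
            + [f"validation.{lang}" for lang in langs]
            + [f"test.{lang}" for lang in langs])


# Distinct languages across all tasks.
all_languages = sorted({lang for langs in TASK_LANGUAGES.values() for lang in langs})

print(split_names("ner"))
print(len(TASK_LANGUAGES), "tasks,", len(all_languages), "languages")
```

Counting the distinct language codes this way gives a quick consistency check against the task and language counts stated in the summary.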

Dataset Structure

Data Instances

ner

An example of "test.nl" looks as follows.

{
  "ner": [
    "O",
    "O",
    "O",
    "B-LOC",
    "O",
    "B-LOC",
    "O",
    "B-LOC",
    "O",
    "O",
    "O",
    "O",
    "O",
    "O",
    "O",
    "B-PER",
    "I-PER",
    "O",
    "O",
    "B-LOC",
    "O",
    "O"
  ],
  "words": [
    "Dat",
    "is",
    "in",
    "Itali\u00eb",
    ",",
    "Spanje",
    "of",
    "Engeland",
    "misschien",
    "geen",
    "probleem",
    ",",
    "maar",
    "volgens",
    "'",
    "Der",
    "Kaiser",
    "'",
    "in",
    "Duitsland",
    "wel",
    "."
  ]
}
pos

An example of "test.fr" looks as follows.

{
  "pos": [
    "PRON",
    "VERB",
    "SCONJ",
    "ADP",
    "PRON",
    "CCONJ",
    "DET",
    "NOUN",
    "ADP",
    "NOUN",
    "CCONJ",
    "NOUN",
    "ADJ",
    "PRON",
    "PRON",
    "AUX",
    "ADV",
    "VERB",
    "PUNCT",
    "PRON",
    "VERB",
    "VERB",
    "DET",
    "ADJ",
    "NOUN",
    "ADP",
    "DET",
    "NOUN",
    "PUNCT"
  ],
  "words": [
    "Je",
    "sens",
    "qu'",
    "entre",
    "\u00e7a",
    "et",
    "les",
    "films",
    "de",
    "m\u00e9decins",
    "et",
    "scientifiques",
    "fous",
    "que",
    "nous",
    "avons",
    "d\u00e9j\u00e0",
    "vus",
    ",",
    "nous",
    "pourrions",
    "emprunter",
    "un",
    "autre",
    "chemin",
    "pour",
    "l'",
    "origine",
    "."
  ]
}
mlqa

An example of "test.hi" looks as follows.

{
  "answers": {
    "answer_start": [
      378
    ],
    "text": [
      "\u0909\u0924\u094d\u0924\u0930 \u092a\u0942\u0930\u094d\u0935"
    ]
  },
  "context": "\u0909\u0938\u0940 \"\u090f\u0930\u093f\u092f\u093e XX \" \u0928\u093e\u092e\u0915\u0930\u0923 \u092a\u094d\u0930\u0923\u093e\u0932\u0940 \u0915\u093e \u092a\u094d\u0930\u092f\u094b\u0917 \u0928\u0947\u0935\u093e\u0926\u093e \u092a\u0930\u0940\u0915\u094d\u0937\u0923 \u0938\u094d\u0925\u0932 \u0915\u0947 \u0905\u0928\u094d\u092f \u092d\u093e\u0917\u094b\u0902 \u0915\u0947 \u0932\u093f\u090f \u0915\u093f\u092f\u093e \u0917\u092f\u093e \u0939\u0948\u0964\u092e\u0942\u0932 \u0930\u0942\u092a \u092e\u0947\u0902 6 \u092c\u091f\u0947 10 \u092e\u0940\u0932 \u0915\u093e \u092f\u0939 \u0906\u092f\u0924\u093e\u0915\u093e\u0930 \u0905\u0921\u094d\u0921\u093e \u0905\u092c \u0924\u0925\u093e\u0915\u0925\u093f\u0924 '\u0917\u094d\u0930\u0942\u092e \u092c\u0949\u0915\u094d\u0938 \" \u0915\u093e \u090f\u0915 \u092d\u093e\u0917 \u0939\u0948, \u091c\u094b \u0915\u093f 23 \u092c\u091f\u0947 25.3 \u092e\u0940\u0932 \u0915\u093e \u090f\u0915 \u092a\u094d\u0930\u0924\u093f\u092c\u0902\u0927\u093f\u0924 \u0939\u0935\u093e\u0908 \u0915\u094d\u0937\u0947\u0924\u094d\u0930 \u0939\u0948\u0964 \u092f\u0939 \u0915\u094d\u0937\u0947\u0924\u094d\u0930 NTS \u0915\u0947 \u0906\u0902\u0924\u0930\u093f\u0915 \u0938\u0921\u093c\u0915 \u092a\u094d\u0930\u092c\u0902\u0927\u0928 \u0938\u0947 \u091c\u0941\u0921\u093c\u093e \u0939\u0948, \u091c\u093f\u0938\u0915\u0940 \u092a\u0915\u094d\u0915\u0940 \u0938\u0921\u093c\u0915\u0947\u0902 \u0926\u0915\u094d\u0937\u093f\u0923 \u092e\u0947\u0902 \u092e\u0930\u0915\u0930\u0940 \u0915\u0940 \u0913\u0930 \u0914\u0930 \u092a\u0936\u094d\u091a\u093f\u092e \u092e\u0947\u0902 \u092f\u0941\u0915\u094d\u0915\u093e \u092b\u094d\u0932\u0948\u091f \u0915\u0940 \u0913\u0930 \u091c\u093e\u0924\u0940 \u0939\u0948\u0902\u0964 \u091d\u0940\u0932 \u0938\u0947 \u0909\u0924\u094d\u0924\u0930 \u092a\u0942\u0930\u094d\u0935 \u0915\u0940 \u0913\u0930 \u092c\u0922\u093c\u0924\u0947 \u0939\u0941\u090f \u0935\u094d\u092f\u093e\u092a\u0915 \u0914\u0930 \u0914\u0930 
\u0938\u0941\u0935\u094d\u092f\u0935\u0938\u094d\u0925\u093f\u0924 \u0917\u094d\u0930\u0942\u092e \u091d\u0940\u0932 \u0915\u0940 \u0938\u0921\u093c\u0915\u0947\u0902 \u090f\u0915 \u0926\u0930\u094d\u0930\u0947 \u0915\u0947 \u091c\u0930\u093f\u092f\u0947 \u092a\u0947\u091a\u0940\u0926\u093e \u092a\u0939\u093e\u0921\u093c\u093f\u092f\u094b\u0902 \u0938\u0947 \u0939\u094b\u0915\u0930 \u0917\u0941\u091c\u0930\u0924\u0940 \u0939\u0948\u0902\u0964 \u092a\u0939\u0932\u0947 \u0938\u0921\u093c\u0915\u0947\u0902 \u0917\u094d\u0930\u0942\u092e \u0918\u093e\u091f\u0940",
  "question": "\u091d\u0940\u0932 \u0915\u0947 \u0938\u093e\u092a\u0947\u0915\u094d\u0937 \u0917\u094d\u0930\u0942\u092e \u0932\u0947\u0915 \u0930\u094b\u0921 \u0915\u0939\u093e\u0901 \u091c\u093e\u0924\u0940 \u0925\u0940?"
}
nc

An example of "test.es" looks as follows.

{
  "news_body": "El bizcocho es seguramente el producto m\u00e1s b\u00e1sico y sencillo de toda la reposter\u00eda : consiste en poco m\u00e1s que mezclar unos cuantos ingredientes, meterlos al horno y esperar a que se hagan. Por obra y gracia del impulsor qu\u00edmico, tambi\u00e9n conocido como \"levadura de tipo Royal\", despu\u00e9s de un rato de calorcito esta combinaci\u00f3n de harina, az\u00facar, huevo, grasa -aceite o mantequilla- y l\u00e1cteo se transforma en uno de los productos m\u00e1s deliciosos que existen para desayunar o merendar . Por muy manazas que seas, es m\u00e1s que probable que tu bizcocho casero supere en calidad a cualquier infamia industrial envasada. Para lograr un bizcocho digno de admiraci\u00f3n s\u00f3lo tienes que respetar unas pocas normas que afectan a los ingredientes, proporciones, mezclado, horneado y desmoldado. Todas las tienes resumidas en unos dos minutos el v\u00eddeo de arriba, en el que adem \u00e1s aprender\u00e1s alg\u00fan truquillo para que tu bizcochaco quede m\u00e1s fino, jugoso, esponjoso y amoroso. M\u00e1s en MSN:",
  "news_category": "foodanddrink",
  "news_title": "Cocina para lerdos: las leyes del bizcocho"
}
xnli

An example of "validation.th" looks as follows.

{
  "hypothesis": "\u0e40\u0e02\u0e32\u0e42\u0e17\u0e23\u0e2b\u0e32\u0e40\u0e40\u0e21\u0e48\u0e02\u0e2d\u0e07\u0e40\u0e02\u0e32\u0e2d\u0e22\u0e48\u0e32\u0e07\u0e23\u0e27\u0e14\u0e40\u0e23\u0e47\u0e27\u0e2b\u0e25\u0e31\u0e07\u0e08\u0e32\u0e01\u0e17\u0e35\u0e48\u0e23\u0e16\u0e42\u0e23\u0e07\u0e40\u0e23\u0e35\u0e22\u0e19\u0e2a\u0e48\u0e07\u0e40\u0e02\u0e32\u0e40\u0e40\u0e25\u0e49\u0e27",
  "label": 1,
  "premise": "\u0e41\u0e25\u0e30\u0e40\u0e02\u0e32\u0e1e\u0e39\u0e14\u0e27\u0e48\u0e32, \u0e21\u0e48\u0e32\u0e21\u0e4a\u0e32 \u0e1c\u0e21\u0e2d\u0e22\u0e39\u0e48\u0e1a\u0e49\u0e32\u0e19"
}
paws-x

An example of "test.es" looks as follows.

{
  "label": 1,
  "sentence1": "La excepci\u00f3n fue entre fines de 2005 y 2009 cuando jug\u00f3 en Suecia con Carlstad United BK, Serbia con FK Borac \u010ca\u010dak y el FC Terek Grozny de Rusia.",
  "sentence2": "La excepci\u00f3n se dio entre fines del 2005 y 2009, cuando jug\u00f3 con Suecia en el Carlstad United BK, Serbia con el FK Borac \u010ca\u010dak y el FC Terek Grozny de Rusia."
}
qadsm

An example of "train" looks as follows.

{
  "ad_description": "Your New England Cruise Awaits! Holland America Line Official Site.",
  "ad_title": "New England Cruises",
  "query": "cruise portland maine",
  "relevance_label": 1
}
wpr

An example of "test.zh" looks as follows.

{
  "query": "maxpro\u5b98\u7f51",
  "relavance_label": 0,
  "web_page_snippet": "\u5728\u7ebf\u8d2d\u4e70\uff0c\u552e\u540e\u670d\u52a1\u3002vivo\u667a\u80fd\u624b\u673a\u5f53\u5b63\u660e\u661f\u673a\u578b\u6709NEX\uff0cvivo X21\uff0cvivo X20\uff0c\uff0cvivo X23\u7b49\uff0c\u5728vivo\u5b98\u7f51\u8d2d\u4e70\u624b\u673a\u53ef\u4ee5\u4eab\u53d712 \u671f\u514d\u606f\u4ed8\u6b3e\u3002 \u54c1\u724c Funtouch OS \u4f53\u9a8c\u5e97 | ...",
  "web_page_title": "vivo\u667a\u80fd\u624b\u673a\u5b98\u65b9\u7f51\u7ad9-AI\u975e\u51e1\u6444\u5f71X23"
}
qam

An example of "validation.en" looks as follows.

{
  "answer": "Erikson has stated that after the last novel of the Malazan Book of the Fallen was finished, he and Esslemont would write a comprehensive guide tentatively named The Encyclopaedia Malazica.",
  "label": 0,
  "question": "main character of malazan book of the fallen"
}
qg

An example of "test.de" looks as follows.

{
  "answer_passage": "Medien bei WhatsApp automatisch speichern. Tippen Sie oben rechts unter WhatsApp auf die drei Punkte oder auf die Men\u00fc-Taste Ihres Smartphones. Dort wechseln Sie in die \"Einstellungen\" und von hier aus weiter zu den \"Chat-Einstellungen\". Unter dem Punkt \"Medien Auto-Download\" k\u00f6nnen Sie festlegen, wann die WhatsApp-Bilder heruntergeladen werden sollen.",
  "question": "speichenn von whats app bilder unterbinden"
}
ntg

An example of "test.en" looks as follows.

{
  "news_body": "Check out this vintage Willys Pickup! As they say, the devil is in the details, and it's not every day you see such attention paid to every last area of a restoration like with this 1961 Willys Pickup . Already the Pickup has a unique look that shares some styling with the Jeep, plus some original touches you don't get anywhere else. It's a classy way to show up to any event, all thanks to Hollywood Motors . A burgundy paint job contrasts with white lower panels and the roof. Plenty of tasteful chrome details grace the exterior, including the bumpers, headlight bezels, crossmembers on the grille, hood latches, taillight bezels, exhaust finisher, tailgate hinges, etc. Steel wheels painted white and chrome hubs are a tasteful addition. Beautiful oak side steps and bed strips add a touch of craftsmanship to this ride. This truck is of real showroom quality, thanks to the astoundingly detailed restoration work performed on it, making this Willys Pickup a fierce contender for best of show. Under that beautiful hood is a 225 Buick V6 engine mated to a three-speed manual transmission, so you enjoy an ideal level of control. Four wheel drive is functional, making it that much more utilitarian and downright cool. The tires are new, so you can enjoy a lot of life out of them, while the wheels and hubs are in great condition. Just in case, a fifth wheel with a tire and a side mount are included. Just as important, this Pickup runs smoothly, so you can go cruising or even hit the open road if you're interested in participating in some classic rallies. You might associate Willys with the famous Jeep CJ, but the automaker did produce a fair amount of trucks. The Pickup is quite the unique example, thanks to distinct styling that really turns heads, making it a favorite at quite a few shows. 
Source: Hollywood Motors Check These Rides Out Too: Fear No Trails With These Off-Roaders 1965 Pontiac GTO: American Icon For Sale In Canada Low-Mileage 1955 Chevy 3100 Represents Turn In Pickup Market",
  "news_title": "This 1961 Willys Pickup Will Let You Cruise In Style"
}

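Data Fields

ner

Each data field of ner is explained below. The data fields are the same among all splits.

  • words: a list of the words composing the sentence.
  • ner: a list of entity classes, one for each word.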
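
Because words and ner are parallel lists, entity mentions can be recovered by walking both lists together. The sketch below assumes the tags are strings in the B-/I-/O form shown in the "test.nl" example (a loaded dataset may instead store integer class ids, which would first need mapping back to names). The function name `extract_entities` is hypothetical.

```python
def extract_entities(words, tags):
    """Group B-/I-/O tags into (entity_type, text) pairs.

    An I- tag that does not continue an open entity of the same type is
    treated like O here, which is a simplifying assumption.
    """
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts: flush the open one, if any.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            current_words.append(word)
        else:
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


# Tail of the "test.nl" example shown under Data Instances.
words = ["volgens", "'", "Der", "Kaiser", "'", "in", "Duitsland", "wel", "."]
tags = ["O", "O", "B-PER", "I-PER", "O", "O", "B-LOC", "O", "O"]
print(extract_entities(words, tags))
```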
pos

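Each data field of pos is explained below. The data fields are the same among all splits.

  • words: a list of the words composing the sentence.
  • pos: a list of part-of-speech classes, one for each word.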
mlqa

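Each data field of mlqa is explained below. The data fields are the same among all splits.

  • context: a string, the context containing the answer.
  • question: a string, the question to be answered.
  • answers: the answer to the question, given as a list of answer strings (text) and a list of their start positions in context (answer_start).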
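
A quick sanity check on an mlqa record is to confirm that each answer string actually occurs in the context at its stated start position. The sketch below assumes answer_start is a character offset into context (SQuAD-style), which this card does not state explicitly. The function name `answer_spans_match` and the short English record are hypothetical and not taken from the dataset.

```python
def answer_spans_match(example):
    """Return True if every answer text is found in the context at its start offset.

    Assumes `answer_start` holds character offsets into `context`.
    """
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start:start + len(text)] == text
        for start, text in zip(answers["answer_start"], answers["text"])
    )


# Hypothetical record with the same fields as an mlqa example.
example = {
    "context": "Groom Lake Road heads northeast from the lake.",
    "question": "In which direction does Groom Lake Road head?",
    "answers": {"answer_start": [22], "text": ["northeast"]},
}
print(answer_spans_match(example))
```

If the check fails on real data, the offsets may be counted in a different unit (for example bytes or tokens), in which case the assumption above does not hold.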
nc

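Each data field of nc is explained below. The data fields are the same among all splits.

  • news_title: a string, the title of the news report.
  • news_body: a string, the actual news report.
  • news_category: a string, the category of the news report, e.g. foodanddrink.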
xnli

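Each data field of xnli is explained below. The data fields are the same among all splits.

  • premise: a string, the context/premise, i.e. the first sentence for natural language inference.
  • hypothesis: a string, a sentence whose relation to premise is to be classified, i.e. the second sentence for natural language inference.
  • label: a class category (int), the natural language inference relation between hypothesis and premise. One of 0: entailment, 1: contradiction, 2: neutral.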
paws-x

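Each data field of paws-x is explained below. The data fields are the same among all splits.

  • sentence1: a string, a sentence.
  • sentence2: a string, a sentence that either is or is not a paraphrase of sentence1.
  • label: a class label (int), whether sentence2 is a paraphrase of sentence1. One of 0: different, 1: same.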
qadsm

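Each data field of qadsm is explained below. The data fields are the same among all splits.

  • query: a string, the search query one would enter into a search engine.
  • ad_title: a string, the title of the advertisement.
  • ad_description: a string, the content of the advertisement, i.e. the main body.
  • relevance_label: a class label (int), how relevant the advertisement ad_title + ad_description is to the search query query. One of 0: Bad, 1: Good.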
wpr

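Each data field of wpr is explained below. The data fields are the same among all splits.

  • query: a string, the search query one would enter into a search engine.
  • web_page_title: a string, the title of a web page.
  • web_page_snippet: a string, the content of a web page, i.e. the main body.
  • relavance_label: a class label (int), how relevant the web page web_page_title + web_page_snippet is to the search query query. One of 0: Bad, 1: Fair, 2: Good, 3: Excellent, 4: Perfect.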
qam

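Each data field of qam is explained below. The data fields are the same among all splits.

  • question: a string, a question.
  • answer: a string, a possible answer to question.
  • label: a class label (int), whether the answer is relevant to the question. One of 0: False, 1: True.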
qg

Each data field of qg is explained below. The data fields are the same among all splits.

  • answer_passage: a string, a detailed answer to the question.
  • question: a string, a question.
ntg

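Each data field of ntg is explained below. The data fields are the same among all splits.

  • news_body: a string, the content of a news article.
  • news_title: a string, the title corresponding to the news article news_body.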

Data Splits

ner

The following table shows the number of data samples (rows) for each split in ner.

train (en): 14042

lang  validation.<lang>  test.<lang>
en    3252               3454
de    2874               3007
es    1923               1523
nl    2895               5202
pos

The following table shows the number of data samples (rows) for each split in pos.

train (en): 25376

lang  validation.<lang>  test.<lang>
en    2001               2076
de    798                976
es    1399               425
nl    717                595
bg    1114               1115
el    402                455
fr    1475               415
pl    2214               2214
tr    987                982
vi    799                799
zh    499                499
ur    551                534
hi    1658               1683
it    563                481
ar    908                679
ru    578                600
th    497                497
mlqa

The following table shows the number of data samples (rows) for each split in mlqa.

train (en): 87599

lang  validation.<lang>  test.<lang>
en    1148               11590
de    512                4517
ar    517                5335
es    500                5253
hi    507                4918
vi    511                5495
zh    504                5137
nc

The following table shows the number of data samples (rows) for each split in nc.

train (en): 100000

lang  validation.<lang>  test.<lang>
en    10000              10000
de    10000              10000
es    10000              10000
fr    10000              10000
ru    10000              10000
xnli

The following table shows the number of data samples (rows) for each split in xnli.

train (en): 392702

lang  validation.<lang>  test.<lang>
en    2490               5010
ar    2490               5010
bg    2490               5010
de    2490               5010
el    2490               5010
es    2490               5010
fr    2490               5010
hi    2490               5010
ru    2490               5010
sw    2490               5010
th    2490               5010
tr    2490               5010
ur    2490               5010
vi    2490               5010
zh    2490               5010
paws-x

The following table shows the number of data samples (rows) for each split in paws-x.

train (en): 49401

lang  validation.<lang>  test.<lang>
en    2000               2000
de    2000               2000
es    2000               2000
fr    2000               2000
qadsm

The following table shows the number of data samples (rows) for each split in qadsm.

train (en): 100000

lang  validation.<lang>  test.<lang>
en    10000              10000
de    10000              10000
fr    10000              10000
wpr

The following table shows the number of data samples (rows) for each split in wpr.

train (en): 99997

lang  validation.<lang>  test.<lang>
en    10008              10004
de    10004              9997
es    10004              10006
fr    10005              10020
it    10003              10001
pt    10001              10015
zh    10002              9999
qam

The following table shows the number of data samples (rows) for each split in qam.

train (en): 100000

lang  validation.<lang>  test.<lang>
en    10000              10000
de    10000              10000
fr    10000              10000
qg

The following table shows the number of data samples (rows) for each split in qg.

train (en): 100000

lang  validation.<lang>  test.<lang>
en    10000              10000
de    10000              10000
es    10000              10000
fr    10000              10000
it    10000              10000
pt    10000              10000
ntg

The following table shows the number of data samples (rows) for each split in ntg.

train (en): 300000

lang  validation.<lang>  test.<lang>
en    10000              10000
de    10000              10000
es    10000              10000
fr    10000              10000
ru    10000              10000

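Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

The dataset is maintained mainly by Yaobo Liang, Yeyun Gong, Nan Duan, Ming Gong, Linjun Shou, and Daniel Campos from Microsoft Research.

Licensing Information

The XGLUE datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available without extending any license or other intellectual property rights. The dataset is provided "as is" without warranty, and usage of the data has risks since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. Upon violation of any of these terms, your rights to use the dataset will end automatically.

If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.

Citation Information

If you use this dataset, please cite it. Additionally, since XGLUE is also built out of 5 existing datasets, please make sure to cite all of them.

An example: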

We evaluate our model using the XGLUE benchmark \cite{Liang2020XGLUEAN}, a cross-lingual evaluation benchmark
consisting of Named Entity Recognition (NER) \cite{Sang2002IntroductionTT} \cite{Sang2003IntroductionTT},
Part of Speech Tagging (POS) \cite{11234/1-3105}, News Classification (NC), MLQA \cite{Lewis2019MLQAEC},
XNLI \cite{Conneau2018XNLIEC}, PAWS-X \cite{Yang2019PAWSXAC}, Query-Ad Matching (QADSM), Web Page Ranking (WPR),
QA Matching (QAM), Question Generation (QG) and News Title Generation (NTG).
@article{Liang2020XGLUEAN,
  title={XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation},
  author={Yaobo Liang and Nan Duan and Yeyun Gong and Ning Wu and Fenfei Guo and Weizhen Qi and Ming Gong and Linjun Shou and Daxin Jiang and Guihong Cao and Xiaodong Fan and Ruofei Zhang and Rahul Agrawal and Edward Cui and Sining Wei and Taroon Bharti and Ying Qiao and Jiun-Hung Chen and Winnie Wu and Shuguang Liu and Fan Yang and Daniel Campos and Rangan Majumder and Ming Zhou},
  journal={arXiv},
  year={2020},
  volume={abs/2004.01401}
}

@misc{11234/1-3105,
  title={Universal Dependencies 2.5},
  author={Zeman, Daniel and Nivre, Joakim and Abrams, Mitchell and Aepli, No{\"e}mi and Agi{\'c}, {\v Z}eljko and Ahrenberg, Lars and Aleksandravi{\v c}i{\=u}t{\.e}, Gabriel{\.e} and Antonsen, Lene and Aplonova, Katya and Aranzabe, Maria Jesus and Arutie, Gashaw and Asahara, Masayuki and Ateyah, Luma and Attia, Mohammed and Atutxa, Aitziber and Augustinus, Liesbeth and Badmaeva, Elena and Ballesteros, Miguel and Banerjee, Esha and Bank, Sebastian and Barbu Mititelu, Verginica and Basmov, Victoria and Batchelor, Colin and Bauer, John and Bellato, Sandra and Bengoetxea, Kepa and Berzak, Yevgeni and Bhat, Irshad Ahmad and Bhat, Riyaz Ahmad and Biagetti, Erica and Bick, Eckhard and Bielinskien{\.e}, Agn{\.e} and Blokland, Rogier and Bobicev, Victoria and Boizou, Lo{\"{\i}}c and Borges V{\"o}lker, Emanuel and B{\"o}rstell, Carl and Bosco, Cristina and Bouma, Gosse and Bowman, Sam and Boyd, Adriane and Brokait{\.e}, Kristina and Burchardt, Aljoscha and Candito, Marie and Caron, Bernard and Caron, Gauthier and Cavalcanti, Tatiana and Cebiro{\u g}lu Eryi{\u g}it, G{\"u}l{\c s}en and Cecchini, Flavio Massimiliano and Celano, Giuseppe G. A. and {\v C}{\'e}pl{\"o}, Slavom{\'{\i}}r and Cetin, Savas and Chalub, Fabricio and Choi, Jinho and Cho, Yongseok and Chun, Jayeol and Cignarella, Alessandra T. 
and Cinkov{\'a}, Silvie and Collomb, Aur{\'e}lie and {\c C}{\"o}ltekin, {\c C}a{\u g}r{\i} and Connor, Miriam and Courtin, Marine and Davidson, Elizabeth and de Marneffe, Marie-Catherine and de Paiva, Valeria and de Souza, Elvis and Diaz de Ilarraza, Arantza and Dickerson, Carly and Dione, Bamba and Dirix, Peter and Dobrovoljc, Kaja and Dozat, Timothy and Droganova, Kira and Dwivedi, Puneet and Eckhoff, Hanne and Eli, Marhaba and Elkahky, Ali and Ephrem, Binyam and Erina, Olga and Erjavec, Toma{\v z} and Etienne, Aline and Evelyn, Wograine and Farkas, Rich{\'a}rd and Fernandez Alcalde, Hector and Foster, Jennifer and Freitas, Cl{\'a}udia and Fujita, Kazunori and Gajdo{\v s}ov{\'a}, Katar{\'{\i}}na and Galbraith, Daniel and Garcia, Marcos and G{\"a}rdenfors, Moa and Garza, Sebastian and Gerdes, Kim and Ginter, Filip and Goenaga, Iakes and Gojenola, Koldo and G{\"o}k{\i}rmak, Memduh and Goldberg, Yoav and G{\'o}mez Guinovart, Xavier and Gonz{\'a}lez Saavedra, Berta and Grici{\=u}t{\.e}, Bernadeta and Grioni, Matias and Gr{\=u}z{\={\i}}tis, Normunds and Guillaume, Bruno and Guillot-Barbance, C{\'e}line and Habash, Nizar and Haji{\v c}, Jan and Haji{\v c} jr., Jan and H{\"a}m{\"a}l{\"a}inen, Mika and H{\`a} M{\~y}, Linh and Han, Na-Rae and Harris, Kim and Haug, Dag and Heinecke, Johannes and Hennig, Felix and Hladk{\'a}, Barbora and Hlav{\'a}{\v c}ov{\'a}, Jaroslava and Hociung, Florinel and Hohle, Petter and Hwang, Jena and Ikeda, Takumi and Ion, Radu and Irimia, Elena and Ishola, {\d O}l{\'a}j{\'{\i}}d{\'e} and Jel{\'{\i}}nek, Tom{\'a}{\v s} and Johannsen, Anders and J{\o}rgensen, Fredrik and Juutinen, Markus and Ka{\c s}{\i}kara, H{\"u}ner and Kaasen, Andre and Kabaeva, Nadezhda and Kahane, Sylvain and Kanayama, Hiroshi and Kanerva, Jenna and Katz, Boris and Kayadelen, Tolga and Kenney, Jessica and Kettnerov{\'a}, V{\'a}clava and Kirchner, Jesse and Klementieva, Elena and K{\"o}hn, Arne and Kopacewicz, Kamil and Kotsyba, Natalia and Kovalevskait{\.e}, Jolanta and 
Krek, Simon and Kwak, Sookyoung and Laippala, Veronika and Lambertino, Lorenzo and Lam, Lucia and Lando, Tatiana and Larasati, Septina Dian and Lavrentiev, Alexei and Lee, John and L{\^e} H{\`{\^o}}ng, Phương and Lenci, Alessandro and Lertpradit, Saran and Leung, Herman and Li, Cheuk Ying and Li, Josie and Li, Keying and Lim, {KyungTae} and Liovina, Maria and Li, Yuan and Ljube{\v s}i{\'c}, Nikola and Loginova, Olga and Lyashevskaya, Olga and Lynn, Teresa and Macketanz, Vivien and Makazhanov, Aibek and Mandl, Michael and Manning, Christopher and Manurung, Ruli and M{\u a}r{\u a}nduc, C{\u a}t{\u a}lina and Mare{\v c}ek, David and Marheinecke, Katrin and Mart{\'{\i}}nez Alonso, H{\'e}ctor and Martins, Andr{\'e} and Ma{\v s}ek, Jan and Matsumoto, Yuji and {McDonald}, Ryan and {McGuinness}, Sarah and Mendon{\c c}a, Gustavo and Miekka, Niko and Misirpashayeva, Margarita and Missil{\"a}, Anna and Mititelu, C{\u a}t{\u a}lin and Mitrofan, Maria and Miyao, Yusuke and Montemagni, Simonetta and More, Amir and Moreno Romero, Laura and Mori, Keiko Sophie and Morioka, Tomohiko and Mori, Shinsuke and Moro, Shigeki and Mortensen, Bjartur and Moskalevskyi, Bohdan and Muischnek, Kadri and Munro, Robert and Murawaki, Yugo and M{\"u}{\"u}risep, Kaili and Nainwani, Pinkey and Navarro Hor{\~n}iacek, Juan Ignacio and Nedoluzhko, Anna and Ne{\v s}pore-B{\=e}rzkalne, Gunta and Nguy{\~{\^e}}n Th{\d i}, Lương and Nguy{\~{\^e}}n Th{\d i} Minh, Huy{\`{\^e}}n and Nikaido, Yoshihiro and Nikolaev, Vitaly and Nitisaroj, Rattima and Nurmi, Hanna and Ojala, Stina and Ojha, Atul Kr. 
and Ol{\'u}{\`o}kun, Ad{\'e}day{\d o}̀ and Omura, Mai and Osenova, Petya and {\"O}stling, Robert and {\O}vrelid, Lilja and Partanen, Niko and Pascual, Elena and Passarotti, Marco and Patejuk, Agnieszka and Paulino-Passos, Guilherme and Peljak-{\L}api{\'n}ska, Angelika and Peng, Siyao and Perez, Cenel-Augusto and Perrier, Guy and Petrova, Daria and Petrov, Slav and Phelan, Jason and Piitulainen, Jussi and Pirinen, Tommi A and Pitler, Emily and Plank, Barbara and Poibeau, Thierry and Ponomareva, Larisa and Popel, Martin and Pretkalni{\c n}a, Lauma and Pr{\'e}vost, Sophie and Prokopidis, Prokopis and Przepi{\'o}rkowski, Adam and Puolakainen, Tiina and Pyysalo, Sampo and Qi, Peng and R{\"a}{\"a}bis, Andriela and Rademaker, Alexandre and Ramasamy, Loganathan and Rama, Taraka and Ramisch, Carlos and Ravishankar, Vinit and Real, Livy and Reddy, Siva and Rehm, Georg and Riabov, Ivan and Rie{\ss}ler, Michael and Rimkut{\.e}, Erika and Rinaldi, Larissa and Rituma, Laura and Rocha, Luisa and Romanenko, Mykhailo and Rosa, Rudolf and Rovati, Davide and Roșca, Valentin and Rudina, Olga and Rueter, Jack and Sadde, Shoval and Sagot, Beno{\^{\i}}t and Saleh, Shadi and Salomoni, Alessio and Samard{\v z}i{\'c}, Tanja and Samson, Stephanie and Sanguinetti, Manuela and S{\"a}rg, Dage and Saul{\={\i}}te, Baiba and Sawanakunanon, Yanin and Schneider, Nathan and Schuster, Sebastian and Seddah, Djam{\'e} and Seeker, Wolfgang and Seraji, Mojgan and Shen, Mo and Shimada, Atsuko and Shirasu, Hiroyuki and Shohibussirri, Muh and Sichinava, Dmitry and Silveira, Aline and Silveira, Natalia and Simi, Maria and Simionescu, Radu and Simk{\'o}, Katalin and {\v S}imkov{\'a}, M{\'a}ria and Simov, Kiril and Smith, Aaron and Soares-Bastos, Isabela and Spadine, Carolyn and Stella, Antonio and Straka, Milan and Strnadov{\'a}, Jana and Suhr, Alane and Sulubacak, Umut and Suzuki, Shingo and Sz{\'a}nt{\'o}, Zsolt and Taji, Dima and Takahashi, Yuta and Tamburini, Fabio and Tanaka, Takaaki and Tellier, Isabelle 
and Thomas, Guillaume and Torga, Liisi and Trosterud, Trond and Trukhina, Anna and Tsarfaty, Reut and Tyers, Francis and Uematsu, Sumire and Ure{\v s}ov{\'a}, Zde{\v n}ka and Uria, Larraitz and Uszkoreit, Hans and Utka, Andrius and Vajjala, Sowmya and van Niekerk, Daniel and van Noord, Gertjan and Varga, Viktor and Villemonte de la Clergerie, Eric and Vincze, Veronika and Wallin, Lars and Walsh, Abigail and Wang, Jing Xian and Washington, Jonathan North and Wendt, Maximilan and Williams, Seyi and Wir{\'e}n, Mats and Wittern, Christian and Woldemariam, Tsegay and Wong, Tak-sum and Wr{\'o}blewska, Alina and Yako, Mary and Yamazaki, Naoki and Yan, Chunxiao and Yasuoka, Koichi and Yavrumyan, Marat M. and Yu, Zhuoran and {\v Z}abokrtsk{\'y}, Zden{\v e}k and Zeldes, Amir and Zhang, Manying and Zhu, Hanzhi},
  url={http://hdl.handle.net/11234/1-3105},
  note={{LINDAT}/{CLARIAH}-{CZ} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
  copyright={Licence Universal Dependencies v2.5},
  year={2019}
}

@article{Sang2003IntroductionTT,
  title={Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition},
  author={Erik F. Tjong Kim Sang and Fien De Meulder},
  journal={ArXiv},
  year={2003},
  volume={cs.CL/0306050}
}

@article{Sang2002IntroductionTT,
  title={Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition},
  author={Erik F. Tjong Kim Sang},
  journal={ArXiv},
  year={2002},
  volume={cs.CL/0209010}
}

@inproceedings{Conneau2018XNLIEC,
  title={XNLI: Evaluating Cross-lingual Sentence Representations},
  author={Alexis Conneau and Guillaume Lample and Ruty Rinott and Adina Williams and Samuel R. Bowman and Holger Schwenk and Veselin Stoyanov},
  booktitle={EMNLP},
  year={2018}
}

@article{Lewis2019MLQAEC,
  title={MLQA: Evaluating Cross-lingual Extractive Question Answering},
  author={Patrick Lewis and Barlas Oguz and Ruty Rinott and Sebastian Riedel and Holger Schwenk},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.07475}
}

@article{Yang2019PAWSXAC,
  title={PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification},
  author={Yinfei Yang and Yuan Zhang and Chris Tar and Jason Baldridge},
  journal={ArXiv},
  year={2019},
  volume={abs/1908.11828}
}

Contributions

Thanks to @patrickvonplaten for contributing this dataset.