Dataset: ubuntu_dialogs_corpus
Tasks:
Sub-tasks: dialogue-generation
Languages:
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: found
Annotation creators: found
Source datasets: original
Preprint: arxiv:1506.08909
License:
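The Ubuntu Dialogue Corpus is a dataset of almost 1 million multi-turn dialogues, totaling over 7 million utterances and 100 million words. It is a unique resource for research on building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset combines the multi-turn property of the conversations in the Dialog State Tracking Challenge datasets with the unstructured nature of interactions on microblog services such as Twitter.
An example of "train" looks as follows.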
This example was too long and was cropped:
{
"Context": "\"i think we could import the old comment via rsync , but from there we need to go via email . i think it be easier than cach the...",
"Label": 1,
"Utterance": "basic each xfree86 upload will not forc user to upgrad 100mb of font for noth __eou__ no someth i do in my spare time . __eou__"
}
The data fields are the same among all splits.
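
The sample record above appears to separate utterances with the token `__eou__`. As a minimal sketch, assuming `__eou__` is an end-of-utterance marker (the card itself does not define it) and using a hypothetical helper named `split_utterances`, a field can be split into individual utterances like this:

```python
from typing import List

# Assumption: "__eou__" marks the end of an utterance, as suggested by the sample record.
EOU = "__eou__"


def split_utterances(text: str, marker: str = EOU) -> List[str]:
    """Split a raw text field on the marker and drop empty pieces."""
    return [part.strip() for part in text.split(marker) if part.strip()]


# The "Utterance" field copied from the sample record in this card.
record = {
    "Label": 1,
    "Utterance": (
        "basic each xfree86 upload will not forc user to upgrad 100mb of font "
        "for noth __eou__ no someth i do in my spare time . __eou__"
    ),
}

utterances = split_utterances(record["Utterance"])
for index, utterance in enumerate(utterances):
    print(index, utterance)
```

The same helper could be applied to the `Context` field, although the cropped sample does not show whether that field uses additional markers.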
Training set

| name | train |
|---|---|
| train | 127422 |
@article{DBLP:journals/corr/LowePSP15,
author = {Ryan Lowe and
Nissan Pow and
Iulian Serban and
Joelle Pineau},
title = {The Ubuntu Dialogue Corpus: {A} Large Dataset for Research in Unstructured
Multi-Turn Dialogue Systems},
journal = {CoRR},
volume = {abs/1506.08909},
year = {2015},
url = {http://arxiv.org/abs/1506.08909},
archivePrefix = {arXiv},
eprint = {1506.08909},
timestamp = {Mon, 13 Aug 2018 16:48:23 +0200},
biburl = {https://dblp.org/rec/journals/corr/LowePSP15.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Thanks to @thomwolf, @patrickvonplaten, @lewtun for adding this dataset.