Dataset: break_data
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: crowdsourced
Annotation creators: crowdsourced
Source datasets: original
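Break is a human-annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). It contains 83,978 examples drawn from 10 question-answering datasets over text, images, and databases. This repository contains the Break dataset along with information on the exact data format.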
A QDMR "validation" example looks as follows.
{
"decomposition": "return flights ;return #1 from denver ;return #2 to philadelphia ;return #3 if available",
"operators": "['select', 'filter', 'filter', 'filter']",
"question_id": "ATIS_dev_0",
"question_text": "what flights are available tomorrow from denver to philadelphia ",
"split": "dev"
}
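In the record above, `decomposition` is a single string whose steps appear to be separated by `;`, and `operators` is the string form of a Python list. A minimal sketch of turning both into aligned pairs follows; the helper name `parse_qdmr` is hypothetical, and the sketch assumes that `;` never occurs inside a step.

```python
import ast


def parse_qdmr(record):
    """Return a list of (operator, step) pairs for one QDMR record."""
    # Assumption: steps are separated by ';' and never contain ';' themselves.
    steps = [s.strip() for s in record["decomposition"].split(";") if s.strip()]
    # `operators` is stored as text, e.g. "['select', 'filter']".
    operators = ast.literal_eval(record["operators"])
    if len(steps) != len(operators):
        raise ValueError("step count and operator count differ")
    return list(zip(operators, steps))


example = {
    "decomposition": "return flights ;return #1 from denver ;return #2 to philadelphia ;return #3 if available",
    "operators": "['select', 'filter', 'filter', 'filter']",
}

for number, (operator, step) in enumerate(parse_qdmr(example), start=1):
    print(f"#{number} [{operator}] {step}")
```

`ast.literal_eval` is used instead of `eval` so that only Python literals are accepted from the field.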
A QDMR-high-level "train" example looks as follows.
{
"decomposition": "return ground transportation ;return #1 which is available ;return #2 from the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
"operators": "['select', 'filter', 'filter', 'filter', 'project']",
"question_id": "ATIS_dev_102",
"question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
"split": "dev"
}
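The `#1`, `#2`, ... tokens in a decomposition appear to refer back to the result of an earlier step. The sketch below extracts those references to show which steps each step depends on. The function name `step_dependencies` is hypothetical, and the rule that a step may only reference earlier steps is an assumption inferred from the examples, not a documented guarantee.

```python
import re

# Matches references such as "#1" or "#12".
REFERENCE = re.compile(r"#(\d+)")


def step_dependencies(decomposition):
    """Map each 1-based step number to the step numbers it references."""
    steps = [s.strip() for s in decomposition.split(";") if s.strip()]
    dependencies = {}
    for number, step in enumerate(steps, start=1):
        references = [int(n) for n in REFERENCE.findall(step)]
        # Assumption: a step may only point at steps that come before it.
        invalid = [r for r in references if not 1 <= r < number]
        if invalid:
            raise ValueError(f"step {number} has invalid references: {invalid}")
        dependencies[number] = references
    return dependencies


decomposition = (
    "return ground transportation ;return #1 which is available ;"
    "return #2 from the pittsburgh airport ;return #3 to downtown ;"
    "return the cost of #4"
)
print(step_dependencies(decomposition))
# → {1: [], 2: [1], 3: [2], 4: [3], 5: [4]}
```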
A QDMR-high-level-lexicon "train" example looks as follows.
This example was too long and was cropped:
{
"allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'he', 'distinct', 'House', 'two', 'at least', 'or ', 'date', 'o...",
"source": "What office, also held by a member of the Maine House of Representatives, did James K. Polk hold before he was president?"
}
A QDMR-lexicon "validation" example looks as follows.
This example was too long and was cropped:
{
"allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'distinct', 'two', 'at least', 'or ', 'date', 'on ', '@@14@@', ...",
"source": "what flights are available tomorrow from denver to philadelphia "
}
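The cropped `allowed_tokens` values above begin with an escaped double quote followed by a Python-style list, which suggests the list is stored as text wrapped in an extra pair of quotes. Since the examples are cropped, this cannot be confirmed from the card; the sketch below assumes that shape and uses a short, made-up `sample` value rather than real data. The helper name `decode_allowed_tokens` is hypothetical.

```python
import ast


def decode_allowed_tokens(raw):
    """Decode an `allowed_tokens` string into a list of token strings."""
    text = raw.strip()
    # Assumption: a complete value is wrapped in one extra pair of double quotes.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    tokens = ast.literal_eval(text)
    if not isinstance(tokens, list):
        raise ValueError("expected a list of tokens")
    return tokens


# Hypothetical, shortened value; the real field is much longer.
sample = "\"['higher than', 'same as', 'what ', 'and ']\""
tokens = decode_allowed_tokens(sample)
print(tokens)
```

Note that some tokens carry a trailing space (for example `'what '`), so the decoded strings should not be stripped if exact matching matters.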
A logical-forms "train" example looks as follows.
{
"decomposition": "return ground transportation ;return #1 which is available ;return #2 from the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
"operators": "['select', 'filter', 'filter', 'filter', 'project']",
"program": "some program",
"question_id": "ATIS_dev_102",
"question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
"split": "dev"
}
The data fields are the same across all splits.
| name | train | validation | test |
|---|---|---|---|
| QDMR | 44321 | 7760 | 8069 |
| QDMR-high-level | 17503 | 3130 | 3195 |
| QDMR-high-level-lexicon | 17503 | 3130 | 3195 |
| QDMR-lexicon | 44321 | 7760 | 8069 |
| logical-forms | 44098 | 7719 | 8006 |
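
As a quick consistency check on the table, the 83,978 examples quoted in the description match the combined size of the QDMR and QDMR-high-level configurations. The sketch below only re-adds the numbers from the table; reading the total as the sum of those two configurations is an inference from the arithmetic, not something the card states.

```python
# (train, validation, test) counts copied from the table above.
SPLITS = {
    "QDMR": (44321, 7760, 8069),
    "QDMR-high-level": (17503, 3130, 3195),
    "QDMR-high-level-lexicon": (17503, 3130, 3195),
    "QDMR-lexicon": (44321, 7760, 8069),
    "logical-forms": (44098, 7719, 8006),
}

totals = {name: sum(counts) for name, counts in SPLITS.items()}
for name, total in totals.items():
    print(f"{name}: {total}")

combined = totals["QDMR"] + totals["QDMR-high-level"]
print("QDMR + QDMR-high-level:", combined)  # → 83978
```

The lexicon configurations have the same sizes as their QDMR counterparts, and logical-forms is slightly smaller than QDMR in every split.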
@article{Wolfson2020Break,
title={Break It Down: A Question Understanding Benchmark},
author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
journal={Transactions of the Association for Computational Linguistics},
year={2020},
}
Thanks to @patrickvonplaten, @lewtun, and @thomwolf for adding this dataset.