Dataset:
Salesforce/rose
This repo contains the RoSE benchmark from our paper "Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation".
Please visit here for a demo page of this project.
The RoSE benchmark contains system outputs annotated with our ACU protocol. It has four parts, whose statistics are summarized in the table below.
| Dataset | Split | #Doc. | #Sys. | #Total Summ. | HF Name | 
|---|---|---|---|---|---|
| CNNDM | Test | 500 | 12 | 6000 | cnndm_test | 
| CNNDM | Validation | 1000 | 8 | 8000 | cnndm_validation | 
| XSum | Test | 500 | 8 | 4000 | xsum | 
| SamSum | Test | 500 | 8 | 4000 | samsum | 
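As a rough sketch of how the table above might be used, the snippet below records the four rows, checks that each `#Total Summ.` value equals `#Doc.` times `#Sys.`, and wraps loading in a small helper. It assumes that the `HF Name` column corresponds to configuration names accepted by `datasets.load_dataset` for `Salesforce/rose`; the helper names `check_totals` and `load_rose_subset` are hypothetical and not part of the benchmark.

```python
# Rows copied from the statistics table:
# (dataset, split, #docs, #systems, #total summaries, HF name).
ROSE_SUBSETS = [
    ("CNNDM", "Test", 500, 12, 6000, "cnndm_test"),
    ("CNNDM", "Validation", 1000, 8, 8000, "cnndm_validation"),
    ("XSum", "Test", 500, 8, 4000, "xsum"),
    ("SamSum", "Test", 500, 8, 4000, "samsum"),
]

HF_NAMES = [row[5] for row in ROSE_SUBSETS]


def check_totals(rows=ROSE_SUBSETS):
    """Return True if #Total Summ. equals #Doc. * #Sys. for every row."""
    return all(docs * systems == total for _, _, docs, systems, total, _ in rows)


def load_rose_subset(hf_name):
    """Load one subset by HF name (assumed here to be a configuration name)."""
    if hf_name not in HF_NAMES:
        raise ValueError(f"unknown subset {hf_name!r}; expected one of {HF_NAMES}")
    # Imported lazily so the table helpers work without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("Salesforce/rose", hf_name)


if __name__ == "__main__":
    print(check_totals())  # True
    print(sum(row[4] for row in ROSE_SUBSETS))  # 22000
```

Calling `load_rose_subset("xsum")` requires network access and the `datasets` package; the structure of the returned object is not described here and should be inspected after loading.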
In total, we have system outputs annotated with four different human evaluation protocols. We summarize them below.
| Protocol | w/ Input Document | w/ Reference Summary | Fine-grained | 
|---|---|---|---|
| Prior | ✗ | ✗ | ✗ | 
| Ref-free | ✓ | ✗ | ✗ | 
| Ref-based | ✗ | ✓ | ✗ | 
| ACU | ✗ | ✓ | ✓ | 
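The protocol table can also be expressed as data, which makes it easy to select protocols by property. This is only an illustrative sketch: the `Protocol` class, its field names, and the `select` helper are hypothetical, and the flags are transcribed from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    name: str
    uses_document: bool   # "w/ Input Document" column
    uses_reference: bool  # "w/ Reference Summary" column
    fine_grained: bool    # "Fine-grained" column


# One entry per row of the protocol table.
PROTOCOLS = [
    Protocol("Prior", False, False, False),
    Protocol("Ref-free", True, False, False),
    Protocol("Ref-based", False, True, False),
    Protocol("ACU", False, True, True),
]


def select(**flags):
    """Return names of protocols whose attributes match every given flag."""
    return [
        p.name
        for p in PROTOCOLS
        if all(getattr(p, key) == value for key, value in flags.items())
    ]


if __name__ == "__main__":
    print(select(uses_reference=True))  # ['Ref-based', 'ACU']
    print(select(fine_grained=True))    # ['ACU']
```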
We annotated two sets of system summaries.