ROSE 🌹

This repo contains the RoSE benchmark from our paper "Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation".

Please visit here for a demo page of this project.

ACU Annotations

The RoSE benchmark contains system outputs annotated with our ACU (Atomic Content Unit) protocol. It consists of four parts:

CNNDM, test set annotations
CNNDM, validation set annotations
XSum, test set annotations
SamSum, test set annotations

We summarize the statistics below.

Dataset	Split	#Doc.	#Sys.	#Total Summ.	HF Name
CNNDM	Test	500	12	6000	cnndm_test
CNNDM	Validation	1000	8	8000	cnndm_validation
XSum	Test	500	8	4000	xsum
SamSum	Test	500	8	4000	samsum
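
As a quick sanity check on the table above, the sketch below copies its rows into a dictionary and confirms that each total equals #Doc. multiplied by #Sys. The names `ACU_STATS` and `inconsistent_rows` are illustrative only and are not part of the benchmark.

```python
# Rows copied from the statistics table: HF name -> (#Doc., #Sys., #Total Summ.)
ACU_STATS = {
    "cnndm_test": (500, 12, 6000),
    "cnndm_validation": (1000, 8, 8000),
    "xsum": (500, 8, 4000),
    "samsum": (500, 8, 4000),
}


def inconsistent_rows(stats):
    """Return the HF names whose total differs from #Doc. * #Sys."""
    return [
        name
        for name, (docs, systems, total) in stats.items()
        if docs * systems != total
    ]


# Total number of ACU-annotated summaries across the four parts.
total_summaries = sum(total for _, _, total in ACU_STATS.values())

print(inconsistent_rows(ACU_STATS))  # [] (every row is consistent)
print(total_summaries)               # 22000
```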

Human Annotations with Different Evaluation Protocols

In total, we have system outputs annotated with four different human evaluation protocols, summarized below.

Protocol	w/ Input Document	w/ Reference Summary	Fine-grained
Prior	✗	✗	✗
Ref-free	✓	✗	✗
Ref-based	✗	✓	✗
ACU	✗	✓	✓
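
The protocol table can also be kept as a small lookup structure, which may help when selecting annotations by what the annotators were shown. This is a minimal sketch; `Protocol`, `PROTOCOLS`, and `protocols_with` are hypothetical names, and only the flag values come from the table.

```python
from typing import NamedTuple


class Protocol(NamedTuple):
    input_document: bool      # annotators see the input document
    reference_summary: bool   # annotators see the reference summary
    fine_grained: bool        # annotation is fine-grained


# Flag values copied from the protocol table.
PROTOCOLS = {
    "Prior": Protocol(False, False, False),
    "Ref-free": Protocol(True, False, False),
    "Ref-based": Protocol(False, True, False),
    "ACU": Protocol(False, True, True),
}


def protocols_with(**flags):
    """Return protocol names whose properties match every given flag."""
    return [
        name
        for name, protocol in PROTOCOLS.items()
        if all(getattr(protocol, key) == value for key, value in flags.items())
    ]


print(protocols_with(reference_summary=True))  # ['Ref-based', 'ACU']
print(protocols_with(fine_grained=True))       # ['ACU']
```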

We annotated two sets of system summaries.

Summaries of 12 fine-tuned systems. The Hugging Face data split name is cnndm_protocol.

Zero-shot summaries from large language models (GPT3, T0), together with summaries from BRIO and BART. The Hugging Face data split name is cnndm_protocol_gpt3.
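
The Hugging Face names listed in this README can be loaded with the `datasets` library along the following lines. This is a sketch under assumptions: the repository ID `Salesforce/rose` and the choice to pass the name as the configuration argument of `load_dataset` are not stated in this README, and `load_rose` is a hypothetical helper. Depending on the installed `datasets` version, extra arguments may be needed.

```python
# Names taken from this README (ACU subsets and protocol-comparison subsets).
KNOWN_NAMES = (
    "cnndm_test",
    "cnndm_validation",
    "xsum",
    "samsum",
    "cnndm_protocol",
    "cnndm_protocol_gpt3",
)


def load_rose(name, repo_id="Salesforce/rose"):
    """Load one RoSE subset by its Hugging Face name.

    `repo_id` is an assumption; adjust it if the data is hosted elsewhere.
    """
    if name not in KNOWN_NAMES:
        raise ValueError(f"unknown RoSE subset {name!r}; expected one of {KNOWN_NAMES}")
    # Imported lazily so the name check works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, name)
```

For example, `load_rose("cnndm_protocol_gpt3")` would request the zero-shot LLM summaries, while an unrecognized name raises `ValueError` before any download is attempted.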

Author:

Salesforce

Dataset size:

12.93 KB