Dataset: hlgd
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: expert-generated
Annotation creators: crowdsourced
Source datasets: original
HLGD is a binary classification dataset consisting of 20,056 labeled pairs of news headlines, where each label indicates whether the two headlines describe the same underlying world event. The dataset comes with an existing split into train, validation and test sets (60-20-20).
The paper introducing HLGD (NAACL 2021) proposes three challenges that make use of varying amounts of data.
The dataset is in English.
A typical example consists of a timeline_id and two headlines (A and B), each associated with a URL and a date. A label indicates whether the two headlines describe the same underlying event (1) or not (0). Below is an example from the training set:
{'timeline_id': 4, 'headline_a': 'France fines Google nearly $57 million for first major violation of new European privacy regime', 'headline_b': "France hits Google with record EUR50mn fine over 'forced consent' data collection", 'date_a': '2019-01-21', 'date_b': '2019-01-21', 'url_a': 'https://www.chicagotribune.com/business/ct-biz-france-fines-google-privacy-20190121-story.html', 'url_b': 'https://www.rt.com/news/449369-france-hits-google-with-record-fine/', 'label': 1}
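As a minimal sketch of how a record with this layout might be sanity-checked before use, the snippet below verifies the field types, the binary label, and the ISO date format. The helper name `check_example` and the `EXPECTED_FIELDS` mapping are hypothetical and not part of the dataset or any library; the field names and the sample record are taken from the example above.

```python
from datetime import date

# Field names and types as they appear in the example record above.
EXPECTED_FIELDS = {
    "timeline_id": int,
    "headline_a": str,
    "headline_b": str,
    "date_a": str,
    "date_b": str,
    "url_a": str,
    "url_b": str,
    "label": int,
}


def check_example(record):
    """Validate one HLGD-style record (hypothetical helper).

    Returns the absolute gap in days between the two publication dates.
    Raises KeyError for a missing field, TypeError for a wrong type,
    and ValueError for a non-binary label or a malformed date.
    """
    for name, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(record[name], expected_type):
            raise TypeError(f"{name} should be {expected_type.__name__}")
    if record["label"] not in (0, 1):
        raise ValueError("label must be 0 or 1")
    # Dates in the example use the ISO format YYYY-MM-DD.
    gap = date.fromisoformat(record["date_b"]) - date.fromisoformat(record["date_a"])
    return abs(gap.days)


example = {
    "timeline_id": 4,
    "headline_a": "France fines Google nearly $57 million for first major "
                  "violation of new European privacy regime",
    "headline_b": "France hits Google with record EUR50mn fine over "
                  "'forced consent' data collection",
    "date_a": "2019-01-21",
    "date_b": "2019-01-21",
    "url_a": "https://www.chicagotribune.com/business/ct-biz-france-fines-google-privacy-20190121-story.html",
    "url_b": "https://www.rt.com/news/449369-france-hits-google-with-record-fine/",
    "label": 1,
}

days_apart = check_example(example)
```

Here both headlines were published on the same day, so the returned gap is zero days.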
| | Train | Dev | Test |
|---|---|---|---|
| Number of examples | 15,492 | 2,069 | 2,495 |
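The split sizes in the table can be cross-checked against the total of 20,056 pairs given in the dataset summary. The short sketch below does that arithmetic and also computes each split's share of the pairs; the variable names are illustrative only.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 15_492, "dev": 2_069, "test": 2_495}

# Total number of labeled pairs across all splits.
total_pairs = sum(split_sizes.values())

# Share of pairs in each split, as a fraction of the total.
split_shares = {name: size / total_pairs for name, size in split_sizes.items()}

for name, share in split_shares.items():
    print(f"{name}: {split_sizes[name]:,} pairs ({share:.1%})")
```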
The task of grouping headlines from diverse news sources that discuss the same underlying event is important for building interfaces that present the diversity of coverage of unfolding news events. Many news aggregators (such as Google News or Yahoo News) present several sources for a given event, with the objective of highlighting coverage diversity. Automatic grouping of news headlines and articles remains challenging because headlines are short, heavily stylized texts. The HeadLine Grouping Dataset introduces the first benchmark to evaluate the ability of NLU models to group headlines according to the underlying event they describe.
The data was obtained by collecting 10 news timelines from the NewsLens project. The timelines were selected to be diverse in topic, and each contained between 80 and 300 news articles.
Who are the source language producers?
The source language producers are journalists or newsroom members at the 34 news organizations listed in the paper.
Each timeline was annotated for group IDs by 5 independent annotators. The 5 annotations were merged into a single annotation, named the global groups. The global group IDs were then used to generate all pairs of headlines within each timeline with binary labels: 1 if the two headlines are part of the same global group, and 0 otherwise. A heuristic was used to remove negative examples, yielding a final dataset with a class imbalance of 1 positive example to 5 negative examples.
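The pair-generation step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `build_pairs` and its parameters are hypothetical, and uniform random subsampling stands in for the negative-removal heuristic, which this card does not specify.

```python
import itertools
import random


def build_pairs(headlines, neg_per_pos=5, seed=0):
    """Turn one timeline's grouped headlines into labeled pairs.

    headlines: list of (headline_text, global_group_id) tuples
               belonging to a single timeline.
    Returns a list of dicts with headline_a, headline_b and label.
    """
    positives, negatives = [], []
    # Enumerate every unordered pair of headlines within the timeline.
    for (text_a, group_a), (text_b, group_b) in itertools.combinations(headlines, 2):
        label = int(group_a == group_b)  # 1 if same global group, else 0
        pair = {"headline_a": text_a, "headline_b": text_b, "label": label}
        (positives if label == 1 else negatives).append(pair)

    # ASSUMPTION: random subsampling replaces the unspecified heuristic.
    # Keep at most neg_per_pos negatives per positive (1:5 by default).
    max_negatives = neg_per_pos * len(positives)
    if len(negatives) > max_negatives:
        negatives = random.Random(seed).sample(negatives, max_negatives)
    return positives + negatives


# Toy timeline: two groups of two headlines plus eight singleton groups.
toy_timeline = [("a1", 0), ("a2", 0), ("b1", 1), ("b2", 1)]
toy_timeline += [(f"s{i}", 2 + i) for i in range(8)]
toy_pairs = build_pairs(toy_timeline)
```

In the toy timeline, 12 headlines give 66 candidate pairs, of which only 2 are positive; the subsampling step keeps 10 of the 64 negatives to reach the 1-to-5 ratio. Note that with this sketch a timeline with no positive pairs contributes no examples at all.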
Who are the annotators?
Annotators were the authors of the paper and 8 crowd-workers on the Upwork platform. The crowd-workers were native English speakers with experience in either proofreading or data entry.
Annotator identities have been anonymized. Because news headlines are public, they are not expected to contain sensitive personal information.
The purpose of this dataset is to facilitate applications that present diverse news coverage.
By simplifying the process of developing models that can group headlines that describe a common event, we hope the community can build applications that show news readers diverse sources covering similar events.
We note, however, that most of the annotations were performed by crowd-workers and that inter-annotator agreement, although high, was not perfect. Annotator bias therefore remains in the dataset.
There are several sources of bias in the dataset:
For the task of headline grouping, inter-annotator agreement is high (0.814) but not perfect. Some grouping decisions are subjective and depend on the reader's interpretation.
The dataset was initially created by Philippe Laban, Lucas Bandarkar and Marti Hearst at UC Berkeley.
The licensing status of the dataset depends on the legal status of news headlines. It is commonly held that news headlines fall under "fair use" (American Bar blog post). The dataset distributes only headlines, URLs and publication dates. Users of the dataset can retrieve additional information (such as the body content, author, etc.) directly by querying the URL.
@inproceedings{Laban2021NewsHG, title={News Headline Grouping as a Challenging NLU Task}, author={Laban, Philippe and Bandarkar, Lucas and Hearst, Marti A}, booktitle={NAACL 2021}, publisher = {Association for Computational Linguistics}, year={2021} }
Thanks to @tingofurro for adding this dataset.