The corpus contains 10,472 sentences in total, each assigned to a discourse mode category.
Hindi
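{'Story_no': 15, 'Sentence': 'From this you can see it cost three rupees, and now it does not even make a sound! It is your doing! “What does the master have to do with this?”', 'Discourse Mode': 'Dialogue'}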
Sentence number, story number, sentence, and discourse mode
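As a minimal sketch of working with these fields, assuming each record is a Python dict keyed as in the example instance (`Story_no`, `Sentence`, `Discourse Mode`): the sample records, their label values, and the helper names `count_modes` and `sentences_for_story` are hypothetical and serve only as illustration.

```python
from collections import Counter

# Hypothetical sample records, shaped like the example instance above.
# The sentences and labels are placeholders, not real corpus data.
records = [
    {"Story_no": 15, "Sentence": "placeholder sentence 1", "Discourse Mode": "Dialogue"},
    {"Story_no": 15, "Sentence": "placeholder sentence 2", "Discourse Mode": "Narrative"},
    {"Story_no": 16, "Sentence": "placeholder sentence 3", "Discourse Mode": "Dialogue"},
]


def count_modes(rows):
    """Count how many sentences carry each discourse mode label."""
    return Counter(row["Discourse Mode"] for row in rows)


def sentences_for_story(rows, story_no):
    """Return the sentences that belong to one story, in input order."""
    return [row["Sentence"] for row in rows if row["Story_no"] == story_no]


mode_counts = count_modes(records)
for mode, count in mode_counts.most_common():
    print(f"{mode}: {count}")
```

Run over the full corpus, the same tally would give the label distribution; grouping by `Story_no` keeps sentences from one story together, which can matter when splitting data for training and evaluation.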
For details, see the paper: https://www.aclweb.org/anthology/2020.lrec-1.149/
[More Information Needed]
[More Information Needed]
See this link: https://github.com/midas-research/hindi-discourse
If you use this dataset, please cite the following publication: https://aclanthology.org/2020.lrec-1.149/
@inproceedings{dhanwal-etal-2020-annotated,
title = "An Annotated Dataset of Discourse Modes in {H}indi Stories",
author = "Dhanwal, Swapnil and
Dutta, Hritwik and
Nankani, Hitesh and
Shrivastava, Nilay and
Kumar, Yaman and
Li, Junyi Jessy and
Mahata, Debanjan and
Gosangi, Rakesh and
Zhang, Haimin and
Shah, Rajiv Ratn and
Stent, Amanda",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.149",
pages = "1191--1196",
abstract = "In this paper, we present a new corpus consisting of sentences from Hindi short stories annotated for five different discourse modes argumentative, narrative, descriptive, dialogic and informative. We present a detailed account of the entire data collection and annotation processes. The annotations have a very high inter-annotator agreement (0.87 k-alpha). We analyze the data in terms of label distributions, part of speech tags, and sentence lengths. We characterize the performance of various classification algorithms on this dataset and perform ablation studies to understand the nature of the linguistic models suitable for capturing the nuances of the embedded discourse structures in the presented corpus.",
language = "English",
ISBN = "979-10-95546-34-4",
}
Thanks to @duttahritwik for adding this dataset.