CSL Dataset Card

Dataset Description

CSL is the Chinese Scientific Literature dataset.

Paper: https://aclanthology.org/2022.coling-1.344
Repository: https://github.com/ydli-ai/CSL

Dataset Summary

The dataset contains the titles, abstracts, and keywords of Chinese-language papers from several academic fields.

Languages

Chinese
English (translation)

Dataset Structure

Data Instances

Split	Documents
csl	396k
en_translation	396k

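Data Fields

doc_id: unique identifier for the document
title: title of the paper
abstract: abstract of the paper
keywords: keywords associated with the paper
category: broad category of the paper
category_eng: English translation of the broad category (e.g., Engineering)
discipline: academic discipline of the paper
discipline_eng: English translation of the academic discipline (e.g., Agricultural Engineering)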

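The en_translation split contains the documents as translated by the Google Translate service. Because all of its text is already in English, the category_eng and discipline_eng fields are omitted from it.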
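The field layout above can be written down as a small check, which may help when validating records before further processing. This is only a sketch: the helper names `expected_fields` and `missing_fields` are hypothetical and not part of the dataset or of any library, and the code checks field names only, since this card does not state the value types.

```python
from __future__ import annotations

# Field names shared by both splits, as listed in this card.
BASE_FIELDS = ("doc_id", "title", "abstract", "keywords", "category", "discipline")
# Fields that this card says are omitted from en_translation.
ENGLISH_LABEL_FIELDS = ("category_eng", "discipline_eng")


def expected_fields(split: str) -> tuple[str, ...]:
    """Return the field names this card describes for the given split."""
    if split == "csl":
        return BASE_FIELDS + ENGLISH_LABEL_FIELDS
    if split == "en_translation":
        return BASE_FIELDS
    raise ValueError(f"unknown split: {split!r}")


def missing_fields(record: dict, split: str) -> list[str]:
    """List the expected field names that are absent from a record."""
    return [name for name in expected_fields(split) if name not in record]
```

An empty list from `missing_fields(record, "csl")` means the record carries every field named in this card for that split.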

Dataset Usage

Using 🤗 Datasets:

from datasets import load_dataset

dataset = load_dataset('neuclir/csl')['csl']
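As a possible next step, the sketch below wraps the call above and previews a few records. The helper names `load_split` and `preview` are hypothetical, and selecting en_translation with the same indexing as csl is an assumption, since this card only shows the csl case.

```python
from itertools import islice


def load_split(split="csl"):
    """Load one split of neuclir/csl (downloads the data on first use)."""
    # Imported lazily so that preview() works without the library installed.
    from datasets import load_dataset

    # Assumption: 'en_translation' can be selected the same way as 'csl'.
    return load_dataset("neuclir/csl")[split]


def preview(records, n=3):
    """Return (doc_id, title) pairs for the first n records.

    Accepts any iterable of dict-like records, including a loaded split.
    """
    return [(record["doc_id"], record["title"]) for record in islice(records, n)]
```

For example, `preview(load_split("csl"), n=5)` would return the identifiers and titles of the first five documents.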

License & Citation

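This dataset is based on the Chinese Scientific Literature Dataset, which is released under Apache 2.0. The main changes are the addition of doc_id values, English translations of the category and discipline descriptions (by a native speaker), and basic de-duplication. The code that performed this modification is available in this repository.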

If you use this data, please cite:

@inproceedings{li-etal-2022-csl,
    title = "{CSL}: A Large-scale {C}hinese Scientific Literature Dataset",
    author = "Li, Yudong  and
      Zhang, Yuqing  and
      Zhao, Zhe  and
      Shen, Linlin  and
      Liu, Weijie  and
      Mao, Weiquan  and
      Zhang, Hui",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.344",
    pages = "3917--3923",
}

Authors:

neuclir

Dataset size:

223.73 MB