Dataset: machelreid/m2d2
arXiv: 2210.07370
From the paper "M2D2: A Massively Multi-domain Language Modeling Dataset" (Reid et al., EMNLP 2022).
Load the dataset as follows:
import datasets
dataset = datasets.load_dataset("machelreid/m2d2", "cs.CL") # replace cs.CL with the domain of your choice
print(dataset['train'][0]['text'])
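Documents in a domain can be long, so printing one in full is not always convenient. The following is a minimal sketch, not part of the original dataset card, that previews the first few training documents of a domain. The helper names `truncate` and `preview_domain` are hypothetical; the `train` split and `text` field are taken from the snippet above.

```python
def truncate(text: str, max_chars: int = 200) -> str:
    """Shorten a document to at most max_chars characters for display."""
    text = text.strip()
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."


def preview_domain(domain: str, n: int = 3, max_chars: int = 200) -> list:
    """Load one M2D2 domain and return short previews of its first n training documents.

    Hypothetical helper: assumes the 'train' split and 'text' field shown in the
    loading snippet above.
    """
    # Imported lazily so that truncate() can be used without the library installed.
    import datasets

    dataset = datasets.load_dataset("machelreid/m2d2", domain)
    train = dataset["train"]
    return [truncate(train[i]["text"], max_chars) for i in range(min(n, len(train)))]


# Example usage (downloads data on first call):
#     for preview in preview_domain("cs.CL"):
#         print(preview)
```

To see which domain names are available, `datasets.get_dataset_config_names("machelreid/m2d2")` should return the list of configurations, although the exact behaviour may depend on the installed version of the `datasets` library.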
Please cite this work if you find this data useful.
@article{reid2022m2d2,
title = {M2D2: A Massively Multi-domain Language Modeling Dataset},
author = {Machel Reid and Victor Zhong and Suchin Gururangan and Luke Zettlemoyer},
year = {2022},
journal = {arXiv preprint arXiv:2210.07370}
}