数据集:
HeNLP/HeDC4
A Hebrew Deduplicated and Cleaned Common Crawl Corpus. A thoroughly cleaned and approximately deduplicated dataset for unsupervised learning.
If you use HeDC4 in your research, please cite HeRo: RoBERTa and Longformer Hebrew Language Models .
@article{shalumov2023hero,
title={HeRo: RoBERTa and Longformer Hebrew Language Models},
author={Vitaly Shalumov and Harel Haskey},
year={2023},
journal={arXiv:2304.11077},
}