Dataset: Bingsu/KcBERT_Pre-Training_Corpus
Language: Korean
Multilinguality: monolingual
Size: 10M < n < 100M
Language creators: crowdsourced
Annotation creators: no-annotation
Source datasets: original
License:
GitHub KcBERT Repo: https://github.com/Beomi/KcBERT

KcBERT is a Korean Comments BERT model pretrained on this corpus. (You can use it via Hugging Face's Transformers library!)
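As a quick sketch of that, the released KcBERT base checkpoint on the Hugging Face Hub is `beomi/kcbert-base`; loading it with Transformers looks like this (a minimal illustration, not taken from the card itself):

```python
from transformers import AutoModel, AutoTokenizer

# "beomi/kcbert-base" is the released KcBERT base checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")
model = AutoModel.from_pretrained("beomi/kcbert-base")

# Encode a Korean comment and run it through the model
inputs = tokenizer("한국어 댓글을 인코딩해 봅시다.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```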
This dataset contains the CLEANED corpus, preprocessed with the code below.
```python
import re

import emoji
from soynlp.normalizer import repeat_normalize

# Note: emoji.UNICODE_EMOJI is a flat {emoji: name} dict only in older versions
# of the emoji package; newer releases expose emoji.EMOJI_DATA instead.
emojis = ''.join(emoji.UNICODE_EMOJI.keys())

# Keep spaces, basic punctuation, ASCII, Hangul, and emoji; everything else is dropped.
pattern = re.compile(f'[^ .,?!/@$%~%·∼()\x00-\x7Fㄱ-힣{emojis}]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')

def clean(x):
    x = pattern.sub(' ', x)       # replace disallowed characters with a space
    x = url_pattern.sub('', x)    # strip URLs
    x = x.strip()
    x = repeat_normalize(x, num_repeats=2)  # collapse repeats like ㅋㅋㅋㅋ -> ㅋㅋ
    return x
```
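For illustration, here is the cleaner applied to a made-up comment (the exact result can vary with the emoji package version):

```python
raw = "이 영화 진짜 재밌다ㅋㅋㅋㅋㅋ 👍 https://example.com/review?id=1"
print(clean(raw))
# -> roughly "이 영화 진짜 재밌다ㅋㅋ 👍": the URL is removed and
#    the repeated ㅋ is collapsed to at most two characters
```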
```python
>>> from datasets import load_dataset
>>> dataset = load_dataset("Bingsu/KcBERT_Pre-Training_Corpus")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 86246285
    })
})
```
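The full download is large, so if you only want to peek at a few rows, the standard `datasets` streaming API (not specific to this card) avoids materializing the whole corpus on disk:

```python
from datasets import load_dataset

# Stream rows instead of downloading the full archive up front
stream = load_dataset("Bingsu/KcBERT_Pre-Training_Corpus", split="train", streaming=True)
for i, row in enumerate(stream):
    print(row["text"])
    if i == 2:  # show only the first three comments
        break
```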
download: 7.90 GiB · generated: 11.86 GiB · total: 19.76 GiB
※ You can also download this dataset from Kaggle; that archive is 5 GiB (12.48 GiB when uncompressed).
| | train |
|---|---|
| # of texts | 86246285 |