stas/openwebtext-10k | ATYUN.COM official site: a full-service platform for AI tutorials and news

Dataset:

stas/openwebtext-10k

10K slice of OpenWebText - An open-source replication of the WebText dataset from OpenAI.

This is a small subset representing the first 10K records from the original dataset - created for testing.

The full 8M-record dataset is here.

$ python -c "from datasets import load_dataset; ds=load_dataset('stas/openwebtext-10k'); print(ds)"
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 10000
    })
})

Records: 10,000
Compressed size: ~15 MB
Uncompressed size: 50 MB

To convert the dataset to JSON Lines:

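from datasets import load_dataset

# Derive the output file name from the last component of the dataset ID.
dataset_name = "stas/openwebtext-10k"
name = dataset_name.split('/')[-1]
# Load the only split ("train") and write one JSON object per line.
ds = load_dataset(dataset_name, split='train')
ds.to_json(f"{name}.jsonl", orient="records", lines=True)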
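
Once the file has been written, it can be read back without the datasets library. The following is a minimal sketch using only the Python standard library. The helper name `read_jsonl` and the two sample records are hypothetical and exist only for illustration. The sketch assumes each line holds one JSON object with a `text` field, matching the features shown above.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Stand-in file with two made-up records so the sketch runs offline.
# In practice, point read_jsonl at the openwebtext-10k.jsonl file
# produced by the conversion step.
sample_path = Path(tempfile.mkdtemp()) / "sample.jsonl"
sample_path.write_text(
    '{"text": "first document"}\n{"text": "second document"}\n',
    encoding="utf-8",
)

records = list(read_jsonl(sample_path))
print(len(records))
print(records[0]["text"])
```

Reading line by line keeps memory use low, which matters little for this 10K subset but helps if the same approach is applied to a larger export.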

To see how this subset was created, see the instructions file.

Author:

stas

Dataset size:

7.9 KB