摘要:这是来自俄罗斯博客——IT、计算机科学和与互联网相关的博客 habr.com 的帖子和评论的数据集。
脚本: create_habr.py
联系人:Ilya Gusev
语言:俄语、英语、一些编程代码。
先决条件:
pip install datasets zstandard jsonlines pysimdjson
数据集迭代:
from datasets import load_dataset
dataset = load_dataset('IlyaGusev/habr', split="train", streaming=True)
for example in dataset:
print(example["text_markdown"])
{
"id": 12730,
"language": "ru",
"url": "https://habr.com/ru/post/12730/",
"text_markdown": "...",
"text_html": "...",
"lead_markdown": "...",
"lead_html": "...",
"type": "article",
"labels": [],
"original_author": null,
"original_url": null,
"time_published": 1185962380,
"author": "...",
"title": "Хочешь в университет — сделай презентацию",
"statistics": {
"commentsCount": 23,
"favoritesCount": 1,
"readingCount": 1542,
"score": 7,
"votesCount": 15,
"votesCountPlus": 11,
"votesCountMinus": 4
},
"hubs": [
"itcompanies"
],
"flows": [
"popsci"
],
"tags": [
"PowerPoint",
"презентация",
"абитуриенты",
],
"reading_time": 1,
"format": null,
"complexity": null,
"comments": {
"id": [11653537, 11653541],
"parent_id": [null, 11653537],
"level": [0, 1],
"time_published": [1185963192, 1185967886],
"score": [-1, 0],
"votes": [1, 0],
"message_html": ["...", "..."],
"author": ["...", "..."],
"children": [[11653541], []]
}
}
您可以使用这个小工具将序列转为嵌套形式:
def revert_flattening(records):
fixed_records = []
for key, values in records.items():
if not fixed_records:
fixed_records = [{} for _ in range(len(values))]
for i, value in enumerate(values):
fixed_records[i][key] = value
return fixed_records
原始的JSONL已经是嵌套形式的。
数据集未经匿名处理,因此数据集中可能包含个人姓名。在可能的情况下,原始作者的信息已包含在数据集中。