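Habr dataset

Description

Summary: A dataset of posts and comments from habr.com, a Russian blog about IT, computer science, and Internet-related topics.

Script: create_habr.py

Point of contact: Ilya Gusev

Languages: Russian, English, and some programming code.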

Usage

Prerequisites:

pip install datasets zstandard jsonlines pysimdjson

Dataset iteration:

from datasets import load_dataset
dataset = load_dataset('IlyaGusev/habr', split="train", streaming=True)
for example in dataset:
    print(example["text_markdown"])
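As a rough sketch of how the stream can be narrowed down, the generator below keeps only examples in a given language whose score reaches a threshold. The function name `filter_examples` and its `language` and `min_score` parameters are hypothetical. The field names (`language`, `statistics`, `score`) are taken from the data instance in this card. The two extra sample records are made up for illustration.

```python
def filter_examples(examples, language="ru", min_score=0):
    """Yield examples in `language` whose statistics score is at least `min_score`."""
    for example in examples:
        if example["language"] != language:
            continue
        score = example["statistics"]["score"]
        # Skip records with a missing score rather than guessing a value.
        if score is None or score < min_score:
            continue
        yield example


# Tiny in-memory stand-in for the streamed dataset.
# Only id 12730 comes from the card; the other two records are invented.
sample = [
    {"id": 12730, "language": "ru", "statistics": {"score": 7}},
    {"id": 1, "language": "en", "statistics": {"score": 10}},
    {"id": 2, "language": "ru", "statistics": {"score": 2}},
]

kept = list(filter_examples(sample, language="ru", min_score=5))
print([example["id"] for example in kept])
```

The same generator should accept the streaming `dataset` object from the snippet above, for example `filter_examples(dataset, language="ru", min_score=5)`, assuming each streamed example is a dict with the fields listed in the data instance.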

Data instances

{
  "id": 12730,
  "language": "ru",
  "url": "https://habr.com/ru/post/12730/",
  "text_markdown": "...",
  "text_html": "...",
  "lead_markdown": "...",
  "lead_html": "...",
  "type": "article",
  "labels": [],
  "original_author": null,
  "original_url": null,
  "time_published": 1185962380,
  "author": "...",
  "title": "Хочешь в университет — сделай презентацию",
  "statistics": {
    "commentsCount": 23,
    "favoritesCount": 1,
    "readingCount": 1542,
    "score": 7,
    "votesCount": 15,
    "votesCountPlus": 11,
    "votesCountMinus": 4
  },
  "hubs": [
    "itcompanies"
  ],
  "flows": [
    "popsci"
  ],
  "tags": [
    "PowerPoint",
    "презентация",
    "абитуриенты"
  ],
  "reading_time": 1,
  "format": null,
  "complexity": null,
  "comments": {
    "id": [11653537, 11653541],
    "parent_id": [null, 11653537],
    "level": [0, 1],
    "time_published": [1185963192, 1185967886],
    "score": [-1, 0],
    "votes": [1, 0],
    "message_html": ["...", "..."],
    "author": ["...", "..."],
    "children": [[11653541], []]
  }
}
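The `time_published` values in the instance above look like Unix timestamps in seconds. That reading is an assumption, not something the card states. Under that assumption, a minimal sketch for turning them into timezone-aware datetimes could look like this (the helper name `to_utc_datetime` is hypothetical):

```python
from datetime import datetime, timezone


def to_utc_datetime(timestamp):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    # Assumption: the value counts seconds since 1970-01-01 00:00:00 UTC.
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# Value taken from the "time_published" field of the sample post.
published = to_utc_datetime(1185962380)
print(published.isoformat())
```

Passing `tz=timezone.utc` keeps the result independent of the local timezone of the machine running the code.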

You can use this small helper to unflatten the sequences:

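def revert_flattening(records):
    # records maps each field name to a list of values, one entry per item.
    fixed_records = []
    for key, values in records.items():
        # Allocate one empty dict per item when the first field is seen.
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        # Copy the i-th value of this field into the i-th item.
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records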
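As an illustration, here is the helper applied to a trimmed copy of the `comments` field from the data instance, followed by a small grouping step that collects replies under their parent. The helper is repeated so the block runs on its own. The `replies` mapping is a hypothetical addition for this sketch and is not part of the dataset tooling.

```python
def revert_flattening(records):
    # Same helper as in this card, repeated here to keep the block self-contained.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records


# Subset of the "comments" field from the sample post (JSON null becomes None).
comments = {
    "id": [11653537, 11653541],
    "parent_id": [None, 11653537],
    "level": [0, 1],
    "score": [-1, 0],
}

nested = revert_flattening(comments)
for comment in nested:
    print(comment)

# Hypothetical follow-up: group comment ids by parent id.
# Top-level comments end up under the key None.
replies = {}
for comment in nested:
    replies.setdefault(comment["parent_id"], []).append(comment["id"])
print(replies)
```

Each element of `nested` is one comment as a plain dict, which is usually easier to work with than the column-wise lists stored in the dataset.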

The original JSONL is already unflattened.

Source data

The data source is the Habr website.
Example of an API call: post 709430.
The processing script is here.

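Personal and sensitive information

The dataset is not anonymized, so it may contain the names of individuals. Information about the original author is included in the dataset where possible.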

Author:

IlyaGusev

Dataset size:

3.25 GB