Summary: Dataset of posts and comments from pikabu.ru , a website that is Russian Reddit/9gag.
Script: convert_pikabu.py
Point of Contact: Ilya Gusev
Languages: Mostly Russian.
Prerequisites:
pip install datasets zstandard jsonlines pysimdjson
Dataset iteration:
from datasets import load_dataset
dataset = load_dataset('IlyaGusev/pikabu', split="train", streaming=True)
for example in dataset:
print(example["text_markdown"])
{
"id": 69911642,
"title": "Что можно купить в Китае за цену нового iPhone 11 Pro",
"text_markdown": "...",
"timestamp": 1571221527,
"author_id": 2900955,
"username": "chinatoday.ru",
"rating": -4,
"pluses": 9,
"minuses": 13,
"url": "...",
"tags": ["Китай", "AliExpress", "Бизнес"],
"blocks": {"data": ["...", "..."], "type": ["text", "text"]},
"comments": {
"id": [152116588, 152116426],
"text_markdown": ["...", "..."],
"text_html": ["...", "..."],
"images": [[], []],
"rating": [2, 0],
"pluses": [2, 0],
"minuses": [0, 0],
"author_id": [2104711, 2900955],
"username": ["FlyZombieFly", "chinatoday.ru"]
}
}
You can use this little helper to unflatten sequences:
def revert_flattening(records):
fixed_records = []
for key, values in records.items():
if not fixed_records:
fixed_records = [{} for _ in range(len(values))]
for i, value in enumerate(values):
fixed_records[i][key] = value
return fixed_records
The dataset is not anonymized, so individuals' names can be found in the dataset. Information about the original authors is included in the dataset where possible.