Dataset:

musabg/wikipedia-tr-summarization

Task:

Summarization

Language:

Turkish

Size:

100K<n<1M

Wikipedia Turkish Summarization Dataset

Dataset Description

This is a Turkish summarization dataset 🇹🇷 prepared from the 2023 Wikipedia dump. The articles were cleaned with the Hugging Face Wikipedia dataset cleaning script and custom cleaning scripts, filtered by token count, and summarized with OpenAI's gpt-3.5-turbo API.

Data Source

Wikipedia's latest Turkish dump (2023 version) 🌐

Features

text: string (The original text extracted from Wikipedia articles 📖)
summary: string (The generated summary of the original text 📝)

Data Splits

Split	Num Bytes	Num Examples
train	324,460,408.048	119,110
validation	17,077,006.952	6,269
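
As a quick sanity check on the table above, the split sizes can be added up and the validation share computed. This is only a sketch using the numbers copied from the table; the variable names are illustrative.

```python
# Split statistics copied from the table above.
splits = {
    "train": {"num_bytes": 324_460_408.048, "num_examples": 119_110},
    "validation": {"num_bytes": 17_077_006.952, "num_examples": 6_269},
}

total_examples = sum(s["num_examples"] for s in splits.values())
total_bytes = sum(s["num_bytes"] for s in splits.values())

# Fraction of all examples that fall in the validation split.
validation_share = splits["validation"]["num_examples"] / total_examples

print(f"total examples: {total_examples:,}")
print(f"validation share: {validation_share:.1%}")
print(f"total bytes: {round(total_bytes):,}")
```

The byte total agrees with the figure listed under Dataset Size, and the validation split holds roughly 5% of the examples.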

Download Size

216,029,002 bytes

Dataset Size

341,537,415 bytes

Data Preparation

Data Collection

The latest Turkish Wikipedia dump was downloaded 📥.

The Hugging Face Wikipedia dataset cleaning script was used to clean the text 🧹.

A custom script was used to clean the text further, removing sections such as "Kaynakça" (References) and other irrelevant information 🛠️.
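
The custom cleaning script itself is not published here, so the following is only a minimal sketch of the idea. It assumes that section headings appear on their own lines in the cleaned text, and `strip_trailing_sections` and `TRAILING_SECTIONS` are hypothetical names chosen for illustration.

```python
# Hypothetical list of headings to cut at. Only "Kaynakça" is named in the
# dataset description; any further headings would be your own choice.
TRAILING_SECTIONS = ("Kaynakça",)


def strip_trailing_sections(text: str, headings=TRAILING_SECTIONS) -> str:
    """Drop everything from the first line that is exactly one of `headings`."""
    kept = []
    for line in text.splitlines():
        if line.strip() in headings:
            break  # assumption: everything after this heading is discarded
        kept.append(line)
    return "\n".join(kept).rstrip()


article = "Ankara, Türkiye'nin başkentidir.\n\nKaynakça\n\n1. Örnek kaynak"
print(strip_trailing_sections(article))
```

Text that contains no matching heading passes through unchanged apart from trailing whitespace.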

Tokenization

Token counts were computed with Google's mT5 tokenizer, and the following criteria were applied:

Articles with a token count between 300 and 900 were selected ✔️.
Articles with fewer than 300 tokens were ignored ❌.
For articles with more than 900 tokens, only the leading portion of at most 900 tokens that ends at a paragraph boundary was kept 🔍.
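
The three rules above can be sketched as a single function. This is an illustration under stated assumptions, not the author's script: `select_article` is a hypothetical name, paragraphs are assumed to be separated by blank lines, and the token counter is passed in as a callable so that a whitespace counter can stand in for the mT5 tokenizer in the demo.

```python
from typing import Callable, Optional


def select_article(
    text: str,
    count_tokens: Callable[[str], int],
    min_tokens: int = 300,
    max_tokens: int = 900,
) -> Optional[str]:
    """Return the text to keep for one article, or None if it is dropped."""
    n_tokens = count_tokens(text)
    if n_tokens < min_tokens:
        return None  # fewer than min_tokens: ignored
    if n_tokens <= max_tokens:
        return text  # within the range: kept whole

    # Longer than max_tokens: keep whole paragraphs while the budget allows.
    kept = []
    for paragraph in text.split("\n\n"):
        candidate = "\n\n".join(kept + [paragraph])
        if count_tokens(candidate) > max_tokens:
            break
        kept.append(paragraph)
    # Assumption: an article whose first paragraph alone exceeds the budget
    # is dropped; the dataset description does not cover this case.
    return "\n\n".join(kept) if kept else None


def whitespace_count(s: str) -> int:
    """Stand-in token counter for the demo (not the mT5 tokenizer)."""
    return len(s.split())


doc = "\n\n".join(["a b c", "d e f", "g h i"])
print(select_article(doc, whitespace_count, min_tokens=2, max_tokens=7))
```

To approximate the original setup, `count_tokens` could wrap an mT5 tokenizer loaded through the transformers library; which mT5 checkpoint was used is not stated in this description.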

Summarization

The resulting raw texts were summarized using OpenAI's gpt-3.5-turbo API 🤖.

Dataset Usage

This dataset can be used for various natural language processing tasks 👩‍💻, such as text summarization, machine translation, and language modeling in Turkish.

Example usage:

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("musabg/wikipedia-tr-summarization")

# Access the data
train_data = dataset["train"]
validation_data = dataset["validation"]

# Iterate through the data
for example in train_data:
    text = example["text"]
    summary = example["summary"]
    # Process the data as needed
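
As one example of processing the records, the sketch below measures how much shorter the summaries are than the source texts. `mean_compression_ratio` is a hypothetical helper, not part of the dataset or of the datasets library; it accepts any iterable of records with `text` and `summary` fields, so the demo runs on toy data without a download.

```python
from statistics import mean


def mean_compression_ratio(examples) -> float:
    """Average of len(summary) / len(text), measured in characters."""
    return mean(
        len(ex["summary"]) / len(ex["text"])
        for ex in examples
        if ex["text"]  # skip empty texts to avoid division by zero
    )


# Toy records with the same fields as the dataset.
toy = [
    {"text": "a" * 100, "summary": "a" * 20},
    {"text": "b" * 200, "summary": "b" * 20},
]
print(mean_compression_ratio(toy))
```

With the real data, the same function could be called on a subset such as `train_data.select(range(1000))`, where `train_data` is the split loaded in the usage example.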

Please make sure to cite the dataset as follows 📝:

@misc{musabg2023wikipediatrsummarization,
  author = {Musab Gultekin},
  title = {Wikipedia Turkish Summarization Dataset},
  year = {2023},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/datasets/musabg/wikipedia-tr-summarization}},
}

Author:

musabg

Dataset size:

206.03 MB