Dataset for training models to classify human-written vs. GPT/ChatGPT-generated text. This dataset contains Wikipedia introductions and GPT (Curie) generated introductions for 150k topics.
Prompt used for generating text
200 word wikipedia style introduction on '{title}'
{starter_text}
where title is the title of the Wikipedia page, and starter_text is the first seven words of the Wikipedia introduction. Here's an example of the prompt used to generate the introduction paragraph for 'Secretory protein':
'200 word wikipedia style introduction on Secretory protein
A secretory protein is any protein, whether'
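The prompt construction described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the function name `build_prompt` and the whitespace-based word splitting are assumptions, and the title is quoted as in the template (the worked example above shows it without quotes).

```python
def build_prompt(title: str, wiki_intro: str, n_starter_words: int = 7) -> str:
    """Build a generation prompt from a page title and its Wikipedia intro.

    Hypothetical helper: words are assumed to be whitespace-separated.
    """
    # starter_text is the first seven words of the Wikipedia introduction
    starter_text = " ".join(wiki_intro.split()[:n_starter_words])
    # First line follows the template; second line is the starter text
    return f"200 word wikipedia style introduction on '{title}'\n{starter_text}"


if __name__ == "__main__":
    intro = "A secretory protein is any protein, whether it be endocrine or exocrine"
    print(build_prompt("Secretory protein", intro))
```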
Configuration used for GPT model
model="text-curie-001", prompt=prompt, temperature=0.7, max_tokens=300, top_p=1, frequency_penalty=0.4, presence_penalty=0.1
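One way to keep these settings together is to bundle them into keyword arguments for whichever completions-style client is in use. The sketch below only assembles the parameters listed above and does not call any API; the helper name `completion_kwargs` is hypothetical, and how the result is passed to a client is left as an assumption.

```python
def completion_kwargs(prompt: str) -> dict:
    """Return the generation settings listed above as a keyword-argument dict.

    Hypothetical helper: it only collects parameters and makes no network call.
    """
    return {
        "model": "text-curie-001",
        "prompt": prompt,
        "temperature": 0.7,
        "max_tokens": 300,
        "top_p": 1,
        "frequency_penalty": 0.4,
        "presence_penalty": 0.1,
    }


if __name__ == "__main__":
    kwargs = completion_kwargs("200 word wikipedia style introduction on 'Example'\nAn example is")
    for name, value in kwargs.items():
        print(f"{name} = {value!r}")
```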
Schema for the dataset
| Column | Datatype | Description |
|---|---|---|
| id | int64 | ID |
| url | string | Wikipedia URL |
| title | string | Title of the Wikipedia page |
| wiki_intro | string | Introduction paragraph from Wikipedia |
| generated_intro | string | Introduction generated by GPT (Curie) model |
| title_len | int64 | Number of words in title |
| wiki_intro_len | int64 | Number of words in wiki_intro |
| generated_intro_len | int64 | Number of words in generated_intro |
| prompt | string | Prompt used to generate intro |
| generated_text | string | Text continued after the prompt |
| prompt_tokens | int64 | Number of tokens in the prompt |
| generated_text_tokens | int64 | Number of tokens in generated text |
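As a rough illustration of the schema, the following sketch checks that a single record (as a plain Python dict) has every column with a compatible Python type. The names `EXPECTED_COLUMNS` and `validate_row` are hypothetical, and mapping `int64` to `int` and `string` to `str` is an assumption about how rows are loaded.

```python
from __future__ import annotations

# Column name -> expected Python type (int64 -> int, string -> str; an assumption)
EXPECTED_COLUMNS = {
    "id": int,
    "url": str,
    "title": str,
    "wiki_intro": str,
    "generated_intro": str,
    "title_len": int,
    "wiki_intro_len": int,
    "generated_intro_len": int,
    "prompt": str,
    "generated_text": str,
    "prompt_tokens": int,
    "generated_text_tokens": int,
}


def validate_row(row: dict) -> list[str]:
    """Return a list of schema problems for one record; empty if none found."""
    problems = []
    for name, expected_type in EXPECTED_COLUMNS.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(f"wrong type for {name}: {type(row[name]).__name__}")
    return problems
```

A row that passes returns an empty list, so the check can be used as a simple filter before training.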
Code to create this dataset can be found on GitHub
@misc {aaditya_bhat_2023,
author = { {Aaditya Bhat} },
title = { GPT-wiki-intro (Revision 0e458f5) },
year = 2023,
url = { https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro },
doi = { 10.57967/hf/0326 },
publisher = { Hugging Face }
}