CodeParrot 🦜

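CodeParrot 🦜 is a GPT-2 model (1.5B parameters) trained to generate Python code. After the initial training and the release of v1.0, we trained the model further and released v1.1 (see below for details).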

Usage

You can load the CodeParrot model and tokenizer directly in transformers:

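from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the causal language model from the Hub
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot")

# Tokenize a prompt and run one forward pass; outputs.logits holds the next-token scores
inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs)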

Or with a pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot")
outputs = pipe("def hello_world():")

Training

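The model was trained on the cleaned CodeParrot 🦜 dataset in two steps. After the initial training (v1.0), the model was trained for another 30k steps, resulting in v1.1. The settings for both runs are listed in the following table: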

Config	v1.0	v1.1
Batch size	512	512
Context size	1024	1024
Training steps	50'000	30'000
Gradient accumulation	16	16
Gradient checkpointing	True	True
Learning rate	2e-4	5e-5
Weight decay	0.1	0.1
Warmup steps	750	750
Schedule	Cosine	Cosine

Training was executed on 16 x A100 (40GB) GPUs. This setup amounts to roughly 26 + 15 billion tokens.
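
The token counts follow from the table if each optimizer step consumes a full batch of 512 sequences of 1024 tokens. The sketch below checks that arithmetic. It assumes the batch size of 512 is the global batch across all 16 GPUs and all 16 accumulation steps, which the table does not state explicitly.

```python
BATCH_SIZE = 512        # sequences per optimizer step (assumed global)
CONTEXT_SIZE = 1024     # tokens per sequence
STEPS = {"v1.0": 50_000, "v1.1": 30_000}
NUM_GPUS = 16
GRAD_ACCUMULATION = 16


def tokens_seen(steps: int, batch_size: int = BATCH_SIZE, context_size: int = CONTEXT_SIZE) -> int:
    """Tokens processed over a run, assuming every sequence is filled to the context size."""
    return steps * batch_size * context_size


tokens = {version: tokens_seen(steps) for version, steps in STEPS.items()}
for version, count in tokens.items():
    print(f"{version}: {count / 1e9:.1f}B tokens")

# Per-GPU micro-batch implied by the assumption above
micro_batch = BATCH_SIZE // (NUM_GPUS * GRAD_ACCUMULATION)
print(f"sequences per GPU per forward pass: {micro_batch}")
```

This gives about 26.2 billion tokens for v1.0 and 15.7 billion for v1.1, matching the rounded figures above.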

Performance

We evaluated the model on OpenAI's HumanEval benchmark, which consists of programming challenges:

Metric	v1.0	v1.1
pass@1	3.58%	3.99%
pass@10	8.03%	8.69%
pass@100	14.96%	17.88%

The pass@k metric is the probability that at least one of k generated samples passes the tests.
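
As a minimal sketch of how such a number can be computed, the unbiased estimator introduced alongside HumanEval draws n samples per problem, counts the c samples that pass, and evaluates 1 - C(n-c, k) / C(n, k). This card does not say which estimator or sample count produced the table above, so the function name `pass_at_k` and the example values are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a passing one
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 200 samples generated, 8 of them pass
for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(200, 8, k):.4f}")
```

For k = 1 the estimate reduces to c / n. The benchmark score is the mean of this per-problem value over all problems.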

Resources

Dataset: full, train, valid
Code: repository
Spaces: generation, highlighting

Author:

CodeParrot

Dataset size:

7.64 GB