英文

这个模型是在我们的 ChatGPT paraphrase dataset 上进行训练的。

这个数据集基于 Quora paraphrase question ,其中包括来自 SQUAD 2.0 CNN news dataset 的文本。

这个模型是基于T5-base模型的。我们使用了"迁移学习"来让我们的模型生成同义词,同时也可以进行ChatGPT。现在我们可以说这是Hugging Face最好的同义词之一。

Kaggle 链接

部署示例

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")

model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

def paraphrase(
    question,
    num_beams=5,
    num_beam_groups=5,
    num_return_sequences=5,
    repetition_penalty=10.0,
    diversity_penalty=3.0,
    no_repeat_ngram_size=2,
    temperature=0.7,
    max_length=128
):
    input_ids = tokenizer(
        f'paraphrase: {question}',
        return_tensors="pt", padding="longest",
        max_length=max_length,
        truncation=True,
    ).input_ids
    
    outputs = model.generate(
        input_ids, temperature=temperature, repetition_penalty=repetition_penalty,
        num_return_sequences=num_return_sequences, no_repeat_ngram_size=no_repeat_ngram_size,
        num_beams=num_beams, num_beam_groups=num_beam_groups,
        max_length=max_length, diversity_penalty=diversity_penalty
    )

    res = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return res

使用示例

输入:

text = 'What are the best places to see in New York?'
paraphrase(text)

输出:

['What are some must-see places in New York?',
 'Can you suggest some must-see spots in New York?',
 'Where should one go to experience the best NYC has to offer?',
 'Which places should I visit in New York?',
 'What are the top destinations to explore in New York?']

输入:

text = "Rammstein's album Mutter was recorded in the south of France in May and June 2000, and mixed in Stockholm in October of that year."
paraphrase(text)

输出:

['In May and June 2000, Rammstein travelled to the south of France to record his album Mutter, which was mixed in Stockholm in October of that year.',
 'The album Mutter by Rammstein was recorded in the south of France during May and June 2000, with mixing taking place in Stockholm in October of that year.',
 'The album Mutter by Rammstein was recorded in the south of France during May and June 2000, with mixing taking place in Stockholm in October of that year. It',
 'Mutter, the album released by Rammstein, was recorded in southern France during May and June 2000, with mixing taking place between October and September.',
 'In May and June 2000, Rammstein recorded his album Mutter in the south of France, with the mix being made at Stockholm during October.']

训练参数

epochs = 5
batch_size = 64
max_length = 128
lr = 5e-5
batches_qty = 196465
betas = (0.9, 0.999)
eps = 1e-08

BibTeX条目和引用信息

@inproceedings{chatgpt_paraphraser,
  author={Vladimir Vorobev, Maxim Kuznetsov},
  title={A paraphrasing model based on ChatGPT paraphrases},
  year={2023}
}