
CodeBERT-base-mlm

Pretrained weights for CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

Training Data

The model was trained on the code corpus of CodeSearchNet.
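For a quick look at that corpus, it can be loaded with the 🤗 datasets library. The sketch below is illustrative only; the dataset id "code_search_net", the "python" config, and the func_code_string column are assumptions about the Hub mirror, not part of the original release.

from datasets import load_dataset

# Load one of the six CodeSearchNet languages (assumed dataset id / config name).
ds = load_dataset("code_search_net", "python", split="train")
print(ds[0]["func_code_string"][:200])  # raw function text of the kind used for pre-training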

Training Objective

This model is initialized with RoBERTa-base and trained with a simple MLM (masked language modeling) objective.
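As a rough illustration of that objective (not the original training script), the transformers data collator can randomly mask about 15% of the tokens and the model is trained to reconstruct them. The snippet starts from roberta-base, as CodeBERT does; the example function is made up.

from transformers import RobertaTokenizer, RobertaForMaskedLM, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')  # starting point before training on code

# Randomly mask 15% of the tokens; labels are set only at the masked positions.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator([tokenizer("def add(a, b): return a + b")])  # made-up training example

outputs = model(**batch)
print(outputs.loss)  # cross-entropy loss over the masked tokens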

Usage

from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained('microsoft/codebert-base-mlm')
tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base-mlm')

# Mask the boolean operator joining the two conditions.
code_example = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(code_example)
print(outputs)

Expected results:

{'sequence': '<s> if (x is not None) and (x>1)</s>', 'score': 0.6049249172210693, 'token': 8}
{'sequence': '<s> if (x is not None) or (x>1)</s>', 'score': 0.30680200457572937, 'token': 50}
{'sequence': '<s> if (x is not None) if (x>1)</s>', 'score': 0.02133703976869583, 'token': 114}
{'sequence': '<s> if (x is not None) then (x>1)</s>', 'score': 0.018607674166560173, 'token': 172}
{'sequence': '<s> if (x is not None) AND (x>1)</s>', 'score': 0.007619690150022507, 'token': 4248}
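If only the most likely completion is needed, the candidate list can be reduced directly (a small illustrative follow-up, not part of the original card):

# Each candidate carries a 'score'; pick the highest-scoring one.
best = max(outputs, key=lambda candidate: candidate['score'])
print(best['sequence'])  # "<s> if (x is not None) and (x>1)</s>"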

Reference

  • Bimodal CodeBERT trained with MLM+RTD objective (suitable for code search and documentation generation)
  • 🤗 Hugging Face's CodeBERTa (smaller, 6 layers)
  • Citation

    @misc{feng2020codebert,
        title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
        author={Zhangyin Feng and Daya Guo and Duyu Tang and Nan Duan and Xiaocheng Feng and Ming Gong and Linjun Shou and Bing Qin and Ting Liu and Daxin Jiang and Ming Zhou},
        year={2020},
        eprint={2002.08155},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
    }