CodeBERT: A Pre-Trained Model for Programming and Natural Languages

Abstract

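We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search and code documentation generation. We develop CodeBERT with a Transformer-based neural architecture and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This lets us use both bimodal data of NL-PL pairs and unimodal data: the former provides input tokens for model training, while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation. Furthermore, to investigate what type of knowledge CodeBERT learns, we construct a dataset for NL-PL probing and evaluate in a zero-shot setting where the parameters of the pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.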
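The replaced token detection task mentioned in the abstract can be illustrated with a minimal sketch. It shows only how training labels might be derived once a generator has proposed alternatives for some positions; it is not the authors' implementation. The function name `corrupt_and_label`, the lookup-table `toy_generator`, the example NL-PL pair, and the label convention (1 = original, 0 = replaced) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def corrupt_and_label(
    tokens: Sequence[str],
    masked_positions: Sequence[int],
    sample: Callable[[Sequence[str], int], str],
) -> Tuple[List[str], List[int]]:
    """Swap in generator samples at the masked positions and label every token.

    Label 1 means the token equals the original, 0 means it was replaced.
    A position where the generator happens to reproduce the original token
    is therefore still labeled 1.
    """
    corrupted = list(tokens)
    for i in masked_positions:
        corrupted[i] = sample(tokens, i)
    labels = [int(c == t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels


# Hypothetical NL-PL pair: a short description followed by code tokens.
nl = ["return", "the", "maximum", "value"]
pl = ["def", "max", "(", "a", ",", "b", ")"]
tokens = nl + pl

# Toy stand-in for a learned generator: a fixed lookup table (assumption).
TOY_SAMPLES = {2: "minimum", 5: "max"}


def toy_generator(context: Sequence[str], position: int) -> str:
    return TOY_SAMPLES[position]


corrupted, labels = corrupt_and_label(tokens, [2, 5], toy_generator)
print(corrupted)
print(labels)
```

In this toy run, position 2 ("maximum" replaced by "minimum") is labeled 0, while position 5 is labeled 1 because the sampled token matches the original. A discriminator trained on such labels has to judge every input token, not only the masked ones.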

Code repositories

zhangzwwww/dietcode (PyTorch; mentioned on GitHub)
microsoft/CodeBERT (official; PyTorch; mentioned on GitHub)
zfj1998/CodeBert-Code2Text (PyTorch; mentioned on GitHub)
salesforce/codet5 (PyTorch; mentioned on GitHub)
aminatadjer/test (PyTorch; mentioned on GitHub)
sakirinn/llm4cvd (PyTorch; mentioned on GitHub)
zhangzwwww/dietcodebert (PyTorch; mentioned on GitHub)
graykode/commit-autosuggestions (PyTorch; mentioned on GitHub)

Benchmarks

Benchmark                           Method                  Smoothed BLEU-4
code-documentation-generation-on    Transformer             13.44
code-documentation-generation-on    CodeBERT (MLM)          15.48
code-documentation-generation-on    pre-train w/ code only  15.12
code-documentation-generation-on    seq2seq                 13.04
code-documentation-generation-on    CodeBERT (MLM+RTD)      15.41
code-documentation-generation-on    RoBERTa                 14.92
code-documentation-generation-on-1  pre-train w/ code only  13.07
code-documentation-generation-on-1  CodeBERT (MLM+RTD)      14.56
code-documentation-generation-on-1  RoBERTa                 13.2
code-documentation-generation-on-1  Transformer             12.57
code-documentation-generation-on-1  CodeBERT (MLM)          13.59
code-documentation-generation-on-1  CodeBERT (RTD)          12.72
code-documentation-generation-on-1  seq2seq                 11.42
code-documentation-generation-on-2  RoBERTa                 26.09
code-documentation-generation-on-2  pre-train w/ code only  26.39
code-documentation-generation-on-2  CodeBERT (RTD)          26.02
code-documentation-generation-on-2  CodeBERT (MLM)          26.79
code-documentation-generation-on-2  seq2seq                 23.48
code-documentation-generation-on-2  CodeBERT (MLM+RTD)      26.66
code-documentation-generation-on-3  pre-train w/ code only  20.71
code-documentation-generation-on-3  CodeBERT (RTD)          20.25
code-documentation-generation-on-3  CodeBERT (MLM+RTD)      21.32
code-documentation-generation-on-3  seq2seq                 18.4
code-documentation-generation-on-3  RoBERTa                 19.9
code-documentation-generation-on-3  Transformer             18.25
code-documentation-generation-on-3  CodeBERT (MLM)          21
code-documentation-generation-on-4  CodeBERT (MLM+RTD)      8.46
code-documentation-generation-on-4  RoBERTa                 7.26
code-documentation-generation-on-4  CodeBERT (MLM)          7.95
code-documentation-generation-on-4  Transformer             7.87
code-documentation-generation-on-4  seq2seq                 6.96
code-documentation-generation-on-4  pre-train w/ code only  7.36
code-documentation-generation-on-5  seq2seq                 6.88
code-documentation-generation-on-5  Transformer             25.61
code-documentation-generation-on-5  pre-train w/ code only  8.3
code-documentation-generation-on-5  CodeBERT (MLM+RTD)      9.54
code-documentation-generation-on-5  CodeBERT (RTD)          8.73
code-documentation-generation-on-5  CodeBERT (MLM)          8.51
code-documentation-generation-on-5  RoBERTa                 5.72
code-documentation-generation-on-6  CodeBERT (MLM+RTD)      15.99
code-documentation-generation-on-6  CodeBERT (RTD)          15.03
code-documentation-generation-on-6  RoBERTa                 14.52
code-documentation-generation-on-6  Transformer             14.31
code-documentation-generation-on-6  pre-train w/ code only  15.15
code-documentation-generation-on-6  seq2seq                 13.36
code-documentation-generation-on-6  CodeBERT (MLM)          15.55

Benchmark                     Method    Go    JS    Java  PHP   Python  Ruby  Overall
code-search-on-codesearchnet  CodeBERT  69.3  74.8  86.8  70.6  84.0    70.6  76.0

Benchmark                                Method    Average Accuracy  Average F1  Average Precision  Average Recall
type-prediction-on-manytypes4typescript  CodeBERT  61.72             59.57       59.34              59.80
