CodeBERT: A Pre-Trained Model for Programming and Natural Languages

Abstract

We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with a Transformer-based neural architecture and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing and evaluate in a zero-shot setting where the parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.
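The replaced-token-detection objective reduces to per-token binary classification: a small generator proposes plausible substitutes at masked positions, and the discriminator (CodeBERT) must flag which tokens were swapped in. A minimal PyTorch sketch of that loss, with illustrative tensor names rather than the authors' training code:

```python
# Replaced-token-detection (RTD) loss: binary cross-entropy over every
# token position. Illustrative sketch; not the paper's training code.
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits: torch.Tensor, replaced: torch.Tensor) -> torch.Tensor:
    """disc_logits: (batch, seq_len) one score per token from the discriminator.
    replaced:     (batch, seq_len) 1 where a generator-sampled token was
                  swapped in, 0 where the original token survives."""
    return F.binary_cross_entropy_with_logits(disc_logits, replaced.float())

# Toy usage: two sequences of five tokens, one corrupted position each.
logits = torch.randn(2, 5)
replaced = torch.tensor([[0, 1, 0, 0, 0],
                         [0, 0, 0, 1, 0]])
print(rtd_loss(logits, replaced))
```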

Code Repositories

zhangzwwww/dietcode (PyTorch)
JasonZhu-WHU/CoderAssistant
microsoft/CodeBERT (official, PyTorch)
zfj1998/CodeBert-Code2Text (PyTorch)
salesforce/codet5 (PyTorch)
aminatadjer/test (PyTorch)
sakirinn/llm4cvd (PyTorch)
zhangzwwww/dietcodebert (PyTorch)
graykode/commit-autosuggestions (PyTorch)
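The official microsoft/CodeBERT repository distributes the pre-trained encoder on the Hugging Face hub as `microsoft/codebert-base`. A minimal sketch of loading it and embedding an NL-PL pair; the query, snippet, and first-token pooling are illustrative choices:

```python
# Load the released CodeBERT checkpoint and embed an NL-PL pair.
# Assumes `torch` and `transformers` are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

nl = "return the maximum value in a list"        # illustrative query
code = "def find_max(xs):\n    return max(xs)"   # illustrative snippet

# Passing a text pair lays the input out in the bimodal
# <s> NL </s></s> code </s> format CodeBERT was pre-trained on.
inputs = tokenizer(nl, code, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Use the first-token ([CLS]-style) vector as the aggregate NL-PL
# representation, e.g. as input to a fine-tuned relevance classifier.
cls_vec = outputs.last_hidden_state[:, 0, :]
print(cls_vec.shape)  # torch.Size([1, 768])
```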

Benchmarks

Code documentation generation (metric: Smoothed BLEU-4). Columns abbreviate the benchmark slugs code-documentation-generation-on and code-documentation-generation-on-1 through -on-6; "–" marks splits with no reported score for that method.

| Methodology | on | on-1 | on-2 | on-3 | on-4 | on-5 | on-6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| seq2seq | 13.04 | 11.42 | 23.48 | 18.4 | 6.96 | 6.88 | 13.36 |
| Transformer | 13.44 | 12.57 | – | 18.25 | 7.87 | 25.61 | 14.31 |
| RoBERTa | 14.92 | 13.2 | 26.09 | 19.9 | 7.26 | 5.72 | 14.52 |
| pre-train w/ code only | 15.12 | 13.07 | 26.39 | 20.71 | 7.36 | 8.3 | 15.15 |
| CodeBERT (RTD) | – | 12.72 | 26.02 | 20.25 | – | 8.73 | 15.03 |
| CodeBERT (MLM) | 15.48 | 13.59 | 26.79 | 21 | 7.95 | 8.51 | 15.55 |
| CodeBERT (MLM+RTD) | 15.41 | 14.56 | 26.66 | 21.32 | 8.46 | 9.54 | 15.99 |
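All documentation-generation scores above are smoothed sentence-level BLEU-4. A minimal sketch of computing such a score with NLTK, assuming `nltk` is installed; `method4` is one common smoothing choice and may differ from the paper's exact smoothing variant:

```python
# Smoothed sentence-level BLEU-4 between a generated docstring and a
# reference. The smoothing variant here is an assumption, not
# necessarily the one used in the paper's evaluation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["returns the maximum value in the list".split()]
hypothesis = "return the max value of the list".split()

score = sentence_bleu(
    reference,
    hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),  # uniform up to 4-grams -> BLEU-4
    smoothing_function=SmoothingFunction().method4,
)
print(f"Smoothed BLEU-4: {100 * score:.2f}")  # scaled to match the table
```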
Code search on CodeSearchNet (code-search-on-codesearchnet), CodeBERT:

| Go | JS | Java | PHP | Python | Ruby | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| 69.3 | 74.8 | 86.8 | 70.6 | 84.0 | 70.6 | 76.0 |
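The CodeSearchNet results above are retrieval scores; the CodeBERT paper reports this task with mean reciprocal rank (MRR), so a minimal MRR sketch is shown here under that assumption:

```python
# Mean reciprocal rank over a batch of queries. `ranks` holds the
# 1-based rank of the correct code snippet in each query's candidate
# list; higher MRR means the right snippet is ranked nearer the top.
def mean_reciprocal_rank(ranks: list[int]) -> float:
    return sum(1.0 / r for r in ranks) / len(ranks)

# Toy usage: correct snippet ranked 1st, 2nd, and 4th for three queries.
print(mean_reciprocal_rank([1, 2, 4]))  # 0.5833...
```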
Type prediction on ManyTypes4TypeScript (type-prediction-on-manytypes4typescript), CodeBERT:

| Average Accuracy | Average F1 | Average Precision | Average Recall |
| --- | --- | --- | --- |
| 61.72 | 59.57 | 59.34 | 59.80 |
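ManyTypes4TypeScript frames type prediction as token-level classification. A minimal sketch of setting CodeBERT up for that task with Hugging Face `transformers`; the `NUM_TYPES` vocabulary size and the input snippet are placeholders, not values from the benchmark:

```python
# Token-classification head on top of the CodeBERT encoder, the usual
# setup for token-level type prediction. NUM_TYPES is a hypothetical
# placeholder for the dataset's type-vocabulary size.
from transformers import AutoModelForTokenClassification, AutoTokenizer

NUM_TYPES = 100  # placeholder; use the dataset's actual type vocabulary
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=NUM_TYPES
)

# Each subtoken gets one logit per candidate type; during preprocessing,
# labels are typically aligned to the first subtoken of each identifier.
inputs = tokenizer("function add(a, b) { return a + b; }",
                   return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, seq_len, NUM_TYPES)
```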
