CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Wang Yue; Wang Weishi; Joty Shafiq; Hoi Steven C. H.

Abstract

Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5.
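
The identifier-aware objective described in the abstract (telling identifiers apart from other tokens, then recovering them once masked) can be illustrated with a small sketch. This is not the authors' implementation: it assumes Python source as input, uses the standard-library tokenizer in place of a real subword tokenizer, and the function name mask_identifiers and the sentinel format <MASK0>, <MASK1>, ... are hypothetical choices made for illustration only.

```python
import io
import keyword
import tokenize

# Token types that carry no code content for this illustration.
_SKIP = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def mask_identifiers(source: str):
    """Tag identifier tokens and replace each distinct identifier with a sentinel.

    Returns (tags, masked_source, target):
      tags          - (token, 0/1) pairs, 1 meaning "this token is an identifier"
      masked_source - tokens joined by spaces, identifiers replaced by sentinels
      target        - sentinel/name pairs a model would be asked to recover
    Assumption: any non-keyword NAME token counts as an identifier, so
    built-ins such as print are masked too.
    """
    sentinels = {}  # identifier -> sentinel, shared by all of its occurrences
    tags = []
    masked = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        is_ident = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        tags.append((tok.string, int(is_ident)))
        if is_ident:
            # Hypothetical sentinel format; one sentinel per distinct name.
            sentinel = sentinels.setdefault(tok.string, f"<MASK{len(sentinels)}>")
            masked.append(sentinel)
        else:
            masked.append(tok.string)
    target = " ".join(f"{s} {name}" for name, s in sentinels.items())
    return tags, " ".join(masked), target


tags, masked, target = mask_identifiers("def add(a, b):\n    return a + b\n")
print(masked)  # def <MASK0> ( <MASK1> , <MASK2> ) : return <MASK1> + <MASK2>
print(target)  # <MASK0> add <MASK1> a <MASK2> b
```

In this sketch the tags list corresponds to the "distinguish which code tokens are identifiers" part, and the masked/target pair corresponds to the "recover them when they are masked" part. Reusing one sentinel for every occurrence of the same name is a design choice of the sketch, so that the target lists each identifier once.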

Code Repositories

awsm-research/vulrepair: jax, mentioned in GitHub
salesforce/codet5: official, pytorch, mentioned in GitHub
salesforce/coderl: jax, mentioned in GitHub
fewshotcdcs/cdcs: pytorch, mentioned in GitHub

Benchmarks

Benchmark                                     Methodology  Metrics
code-generation-on-concode                    CodeT5       BLEU: 41.48; CodeBLEU: 44.10; Exact Match: 22.70
code-translation-on-codexglue-codetrans       CodeT5       Accuracy (C#→Java): 66.90; Accuracy (Java→C#): 65.90; BLEU (C#→Java): 79.87; BLEU (Java→C#): 84.03
defect-detection-on-codexglue-devign          CodeT5       Accuracy: 65.78
text-to-code-generation-on-codexglue-concode  CodeT5       BLEU: 41.48; CodeBLEU: 44.10; EM: 22.70
