CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
Yue Wang, Weishi Wang, Shafiq Joty, Steven C. H. Hoi

Abstract
Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5.
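The identifier-aware objective described in the abstract has two parts: tagging which tokens are identifiers, and recovering identifiers after they have been masked. The following is a minimal sketch of how training examples for those two parts might be built, not the authors' implementation. It assumes Python source and uses the standard-library `tokenize` module to find identifiers. The function names `tag_identifiers` and `mask_identifiers`, the `<MASKn>` placeholder format, and the choice to reuse one placeholder for every occurrence of the same name are all assumptions made for illustration.

```python
import io
import keyword
import tokenize

# Token types that carry no code content and are dropped from the sequence.
_SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def tag_identifiers(source):
    """Return (token_text, label) pairs; label is 1 for identifiers, else 0.

    Assumption: an identifier is any NAME token that is not a hard keyword,
    so built-in names such as `print` are also labeled 1.
    """
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        is_ident = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        pairs.append((tok.string, int(is_ident)))
    return pairs


def mask_identifiers(pairs):
    """Build a (masked_input, target) pair from tagged tokens.

    Each unique identifier gets one hypothetical placeholder, reused at
    every occurrence; the target lists each placeholder with its name.
    """
    placeholders = {}
    masked = []
    for text, is_ident in pairs:
        if is_ident:
            if text not in placeholders:
                placeholders[text] = f"<MASK{len(placeholders)}>"
            masked.append(placeholders[text])
        else:
            masked.append(text)
    target = " ".join(f"{ph} {name}" for name, ph in placeholders.items())
    return " ".join(masked), target


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    tagged = tag_identifiers(snippet)
    masked_input, target = mask_identifiers(tagged)
    print(tagged)
    print(masked_input)
    print(target)
```

In this sketch the 0/1 labels would serve as targets for a token-level tagging objective, while the masked input and target string form a sequence-to-sequence pair for the recovery objective. A multi-language setup would need a real parser in place of `tokenize`.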
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| code-generation-on-concode | CodeT5 | BLEU: 41.48; CodeBLEU: 44.10; Exact Match: 22.70 |
| code-translation-on-codexglue-codetrans | CodeT5 | Accuracy (C#→Java): 66.90; Accuracy (Java→C#): 65.90; BLEU (C#→Java): 79.87; BLEU (Java→C#): 84.03 |
| defect-detection-on-codexglue-devign | CodeT5 | Accuracy: 65.78 |
| text-to-code-generation-on-codexglue-concode | CodeT5 | BLEU: 41.48; CodeBLEU: 44.10; Exact Match: 22.70 |