Unifying Vision-and-Language Tasks via Text Generation
Jaemin Cho Jie Lei Hao Tan Mohit Bansal

Abstract
Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, in this work we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers. We also show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
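The sketch below illustrates the unified text-generation formulation described in the abstract: image region features and a task-specific text prompt are fed to a single encoder-decoder, and every task is supervised with the same cross-entropy language-modeling loss over target text. It is not the authors' implementation (VL-T5 extends a pretrained T5/BART backbone); the toy Transformer, dimensions, vocabulary size, and example prompts are illustrative assumptions.

```python
# Minimal sketch, not the released VL-T5 code: each task becomes
# "region features + text prompt -> target text", trained with one
# language-modeling objective.
import torch
import torch.nn as nn

class UnifiedVLGenerator(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, region_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Project Faster R-CNN-style region features into the text embedding space.
        self.region_proj = nn.Linear(region_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, input_ids, target_ids):
        # Encoder input = [projected image regions ; embedded text prompt].
        src = torch.cat([self.region_proj(region_feats),
                         self.token_emb(input_ids)], dim=1)
        # Teacher forcing: decoder sees the target shifted right by one token.
        dec_in = self.token_emb(target_ids[:, :-1])
        causal = nn.Transformer.generate_square_subsequent_mask(dec_in.size(1))
        hidden = self.transformer(src, dec_in, tgt_mask=causal)
        return self.lm_head(hidden)  # (batch, target_len - 1, vocab)

# The same interface serves every task; only prompt and target text change, e.g.:
#   VQA:        "vqa: question: what is the man eating?" -> "hot dog"
#   RefCOCOg:   "visual grounding: man in the red hat"   -> a region-id token
#   Captioning: "caption:"                               -> "a man eating a hot dog"
model = UnifiedVLGenerator()
regions = torch.randn(2, 36, 2048)         # 36 detected regions per image
prompts = torch.randint(0, 1000, (2, 12))  # tokenized task prompt + input text
targets = torch.randint(0, 1000, (2, 6))   # tokenized answer/caption text
logits = model(regions, prompts, targets)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), targets[:, 1:].reshape(-1))
```

Because labels are generated as text rather than predicted by task-specific heads, a single set of parameters can be shared across tasks, which is what enables the multi-task result mentioned in the abstract.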
Code Repositories
https://github.com/j-min/VL-T5
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| image-captioning-on-flickr30k-captions-test | VL-T5 | CIDEr: 2.6, SPICE: 2.0 |
| image-captioning-on-nocaps-val | VL-T5 | CIDEr: 4.4, SPICE: 5.3 |
| visual-question-answering-on-vcr-q-a-test | VL-T5 | Accuracy: 75.3 |
| visual-question-answering-on-vcr-q-ar-test | VL-T5 | Accuracy: 58.9 |
| visual-question-answering-on-vcr-qa-r-test | VL-T5 | Accuracy: 77.8 |