Yufan Zhou; Ruiyi Zhang; Changyou Chen; Chunyuan Li; Chris Tensmeyer; Tong Yu; Jiuxiang Gu; Jinhui Xu; Tong Sun

Abstract
One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time-consuming and costly. In this paper, we propose the first method to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the need for text conditioning is removed by generating text features from image features. Extensive experiments illustrate the effectiveness of the proposed method. We obtain state-of-the-art results on the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied to fine-tuning pre-trained models, which reduces both the time and the cost of training text-to-image generation models. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, with only around 1% of the model size and training data size of the recently proposed large DALL-E model.
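The idea of generating text features from image features can be illustrated with a small sketch. The abstract does not specify the mechanism, so everything below is an assumption made for illustration: the perturbation scheme (Gaussian noise rescaled relative to the feature norm, followed by L2 normalization), the function names `pseudo_text_feature` and `cosine_similarity`, the `noise_level` parameter, and the random vector standing in for a real CLIP image embedding. It shows only why a point close to an image feature in a well-aligned joint space could plausibly serve as a stand-in for a caption feature.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def pseudo_text_feature(image_feature, noise_level=0.1, rng=None):
    """Return a unit-length pseudo text feature near an image feature.

    Assumed scheme (hypothetical, not taken from the abstract): add Gaussian
    noise whose norm is noise_level times the norm of the image feature,
    then L2-normalize the result.
    """
    rng = rng or random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in image_feature]
    noise_norm = math.sqrt(sum(n * n for n in noise))
    if noise_norm == 0.0:
        # Degenerate draw: fall back to the normalized image feature.
        return l2_normalize(image_feature)
    feat_norm = math.sqrt(sum(v * v for v in image_feature))
    scale = noise_level * feat_norm / noise_norm
    perturbed = [v + scale * n for v, n in zip(image_feature, noise)]
    return l2_normalize(perturbed)


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for a CLIP image embedding; a real one would come from the
    # pre-trained image encoder.
    image_feature = [rng.gauss(0.0, 1.0) for _ in range(512)]
    text_like = pseudo_text_feature(image_feature, noise_level=0.1, rng=rng)
    print(cosine_similarity(image_feature, text_like))
```

With this assumed scheme and `noise_level` below 1, the cosine similarity between the image feature and its pseudo text feature can never fall below `sqrt(1 - noise_level**2)`, so `noise_level` directly controls how far the conditioning signal may drift from the image.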
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| text-to-image-generation-on-coco | Lafite | FID: 8.12; Inception score: 32.34; SOA-C: 61.09 |
| text-to-image-generation-on-coco | Lafite (zero-shot) | FID: 26.94; FID-1: 22.97; FID-2: 18.70; FID-4: 15.72; FID-8: 14.79; Inception score: 26.02 |
| text-to-image-generation-on-cub | Lafite | FID: 10.48; Inception score: 5.97 |
| text-to-image-generation-on-multi-modal | Lafite | FID: 12.54 |