Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee

Abstract
Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin.
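The gated-injection idea described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the class name, dimensions, and the use of `nn.MultiheadAttention` are assumptions made for clarity. The key property it demonstrates is that the learnable gate is initialized to zero, so the new trainable layer is an exact identity at the start of training and the frozen pre-trained model's behavior is preserved.

```python
import torch
import torch.nn as nn

class GatedGroundingLayer(nn.Module):
    """Illustrative sketch of a gated trainable layer (hypothetical names).

    Attends jointly over visual and grounding tokens, then blends the
    result back via a zero-initialized learnable gate, so the frozen
    pre-trained pathway is untouched at initialization.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts at 0, so tanh(gate) = 0 and the layer is an
        # identity at init, preserving the pre-trained model's output.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        # Concatenate visual tokens with grounding tokens, attend over
        # both, keep only the visual positions, and gate the residual.
        tokens = torch.cat([x, grounding], dim=1)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        return x + torch.tanh(self.gate) * attn_out[:, : x.size(1)]

layer = GatedGroundingLayer(dim=64)
x = torch.randn(2, 16, 64)   # visual tokens
g = torch.randn(2, 4, 64)    # grounding tokens (e.g. box embeddings)
out = layer(x, g)            # identical to x at initialization
```

During training, only parameters like `gate` and the new attention weights would be updated, while every weight of the pre-trained diffusion model stays frozen.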
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| conditional-text-to-image-synthesis-on-coco-1 | GLIGEN (zero-shot) | instance success rate: 0.30; mIoU: 0.27 |
| layout-to-image-generation-on-layoutbench-1 | GLIGEN | AP: 30.7 |
| layout-to-image-generation-on-layoutbench-2 | GLIGEN | AP: 38.9 |
| layout-to-image-generation-on-layoutbench-3 | GLIGEN | AP: 33.3 |
| layout-to-image-generation-on-layoutbench-4 | GLIGEN | AP: 36.3 |
| text-to-image-generation-on-coco | GLIGEN (fine-tuned, Detection data only) | FID: 5.82 |
| text-to-image-generation-on-coco | GLIGEN (fine-tuned, Grounding data) | FID: 6.38 |
| text-to-image-generation-on-coco | GLIGEN (fine-tuned, Detection + Caption data) | FID: 5.61 |