Shuang Li, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Igor Mordatch

Abstract
Large pre-trained models exhibit distinct and complementary capabilities dependent on the data they are trained on. Language models such as GPT-3 are capable of textual reasoning but cannot understand visual information, while vision models such as DALL-E can generate photorealistic photos but fail to understand complex language descriptions. In this work, we propose a unified framework for composing ensembles of different pre-trained models -- combining the strengths of each individual model to solve various multimodal problems in a zero-shot manner. We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization. The generator constructs proposals and the scorers iteratively provide feedback to refine the generated result. Such closed-loop communication enables models to correct errors caused by other models, significantly boosting performance on downstream tasks, e.g. improving accuracy on grade school math problems by 7.5%, without requiring any model finetuning. We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer, by leveraging the strengths of each expert model. Results show that the proposed method can be used as a general purpose framework for a wide range of zero-shot multimodal tasks, such as image generation, video question answering, mathematical reasoning, and robotic manipulation. Project page: https://energy-based-model.github.io/composing-pretrained-models.
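The closed loop described in the abstract, where a generator proposes candidates and an ensemble of scorers provides feedback over several rounds, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a sampling-based refinement in which the best-scoring candidate seeds the next round, and it stands in toy functions for the pre-trained models. The names `consensus_score`, `iterative_consensus`, `perturb`, `scorer_x`, and `scorer_y` are hypothetical, as are the equal scorer weights.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Proposal = List[float]
Scorer = Callable[[Proposal], float]
Generator = Callable[[Proposal, random.Random], Proposal]


def consensus_score(
    proposal: Proposal,
    scorers: Sequence[Scorer],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine the feedback of all scorers into one weighted sum."""
    if weights is None:
        weights = [1.0] * len(scorers)  # assumption: equal weighting
    return sum(w * s(proposal) for w, s in zip(weights, scorers))


def iterative_consensus(
    generate: Generator,
    scorers: Sequence[Scorer],
    init: Proposal,
    steps: int = 200,
    num_candidates: int = 16,
    seed: int = 0,
) -> Tuple[Proposal, float]:
    """Refine a proposal by repeatedly generating candidates and
    keeping the one the scorer ensemble rates highest."""
    rng = random.Random(seed)
    best = init
    best_score = consensus_score(best, scorers)
    for _ in range(steps):
        # The generator proposes new candidates near the current best.
        for _ in range(num_candidates):
            candidate = generate(best, rng)
            score = consensus_score(candidate, scorers)
            # Scorer feedback decides whether the candidate replaces the best.
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


# Toy stand-ins for pre-trained models (hypothetical, for illustration only).
def perturb(current: Proposal, rng: random.Random) -> Proposal:
    """Toy generator: add Gaussian noise to the current proposal."""
    return [v + rng.gauss(0.0, 0.3) for v in current]


def scorer_x(p: Proposal) -> float:
    """Toy scorer that only judges the first coordinate (prefers 3)."""
    return -((p[0] - 3.0) ** 2)


def scorer_y(p: Proposal) -> float:
    """Toy scorer that only judges the second coordinate (prefers -1)."""
    return -((p[1] + 1.0) ** 2)


start = [0.0, 0.0]
initial_score = consensus_score(start, [scorer_x, scorer_y])
best, best_score = iterative_consensus(perturb, [scorer_x, scorer_y], start)
print(best, best_score)
```

In this toy setup each scorer sees only part of the problem, so neither alone can steer the proposal to the target; the summed feedback can, which mirrors the abstract's claim that a consensus of scorers outperforms a single scorer.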
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | GPT-2-Medium 355M + question-solution classifier (BS=1) | Accuracy: 16.8; Parameters (Billion): 0.355 |
| arithmetic-reasoning-on-gsm8k | GPT-2-Medium 355M (fine-tuned, BS=5) | Accuracy: 18.3; Parameters (Billion): 0.355 |
| arithmetic-reasoning-on-gsm8k | GPT-2-Medium 355M (BS=5) | Accuracy: 12.2; Parameters (Billion): 0.355 |
| arithmetic-reasoning-on-gsm8k | GPT-2-Medium 355M + question-solution classifier (BS=5) | Accuracy: 20.8; Parameters (Billion): 0.355 |
| image-generation-on-imagenet-64x64 | GLIDE + CLS-FREE | FID: 29.219; Inception Score: 25.926; KID: 5.325 |
| image-generation-on-imagenet-64x64 | GLIDE + CLS | FID: 30.871; Inception Score: 22.077; KID: 7.952 |
| image-generation-on-imagenet-64x64 | GLIDE + CLIP | FID: 30.462; Inception Score: 25.017; KID: 6.174 |
| image-generation-on-imagenet-64x64 | GLIDE + CLIP + CLS + CLS-FREE | FID: 29.184; Inception Score: 34.952; KID: 3.766 |
| video-question-answering-on-activitynet-qa | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | Accuracy: 61.2 |
| video-question-answering-on-activitynet-qa | GPT-2 + CLIP-32 (Zero-Shot) | Accuracy: 58.4 |