SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Abstract
We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
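The weight mix strategy described in the abstract can be illustrated with a minimal sketch. The abstract only states that weights from the two domains are integrated directly; the linear interpolation form, the coefficient `beta`, the function name `mix_weights`, and the toy parameter names below are all assumptions made for illustration, not the released implementation.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def mix_weights(real: Weights, synthetic: Weights, beta: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints that share names and shapes.

    Assumed form: mixed = beta * real + (1 - beta) * synthetic.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("checkpoints must share the same parameter names")

    mixed: Weights = {}
    for name, real_values in real.items():
        syn_values = synthetic[name]
        if len(real_values) != len(syn_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter tensors.
        mixed[name] = [
            beta * r + (1.0 - beta) * s for r, s in zip(real_values, syn_values)
        ]
    return mixed


# Toy checkpoints (illustrative values only).
llm_real = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
llm_synthetic = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

mixed = mix_weights(llm_real, llm_synthetic, beta=0.5)
```

With `beta=1.0` the result equals the real-world checkpoint and with `beta=0.0` the synthetic one, so the coefficient controls how much each domain contributes. In practice the same loop would run over the tensors of two model state dictionaries rather than Python lists.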
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| described-object-detection-on-description | SPHINX-7B | Intra-scenario ABS mAP: 7.9; Intra-scenario FULL mAP: 10.6; Intra-scenario PRES mAP: 11.4 |
| visual-question-answering-on-benchlmm | Sphinx-V2-1K | GPT-3.5 score: 57.43 |
| visual-question-answering-on-mm-vet | SPHINX-2k | GPT-4 score: 40.2 |
| visual-question-answering-vqa-on-core-mm | SPHINX v2 | Abductive: 49.85; Analogical: 20.69; Deductive: 42.17; Overall score: 39.48; Params: 16B |