8 months ago

Ziyi Lin¹,²*, Chris Liu¹*, Renrui Zhang¹,²*, Peng Gao¹†‡, Longtian Qiu¹,³, Han Xiao¹, Han Qiu¹, Chen Lin¹, Wenqi Shao¹, Keqin Chen¹, and 6 more

Abstract

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
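The abstract says the weights of two LLMs are integrated directly but does not give the formula. One common way to do this is element-wise linear interpolation between two checkpoints with the same architecture. The sketch below illustrates that idea on toy data; the function name `mix_weights`, the coefficient `beta`, and the parameter names are hypothetical and are not taken from the released code.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flattened parameter values.
Weights = Dict[str, List[float]]


def mix_weights(real: Weights, synthetic: Weights, beta: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints with identical parameter names and sizes.

    `beta` is an assumed mixing coefficient: 1.0 keeps only `real`,
    0.0 keeps only `synthetic`.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("checkpoints must share the same parameter names")

    mixed: Weights = {}
    for name, real_values in real.items():
        syn_values = synthetic[name]
        if len(real_values) != len(syn_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter tensors.
        mixed[name] = [
            beta * r + (1.0 - beta) * s for r, s in zip(real_values, syn_values)
        ]
    return mixed


# Toy checkpoints standing in for LLMs tuned on real-world and synthetic data.
llm_real = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
llm_synthetic = {"layer0.weight": [3.0, 4.0], "layer0.bias": [1.0]}
llm_mixed = mix_weights(llm_real, llm_synthetic, beta=0.5)
```

With `beta=0.5` each mixed parameter is the plain average of the two checkpoints. In practice the coefficient would be a tuning choice, and real checkpoints would be tensors rather than Python lists.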

Multi-Task Learning

Multimodal Representation

Document Understanding

Method/Architecture

Natural Language Processing

Multimodality

Task/Problem

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
