AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

Abstract
We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e., text, image, video, audio, and IMU motion sensor data) and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs, including LLaMA-2 (70B), and converts modality-specific signals into the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a manually collected multimodal instruction set covering diverse topics and tasks beyond simple QA. We conduct a comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
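The abstract describes a pre-trained aligner module that maps each modality's encoder output into the LLM's joint textual embedding space. As a rough illustration only, and not the authors' released code, the sketch below shows one common way such an aligner can be realized: a learned projection that turns a frozen modality encoder's features into a fixed number of soft prefix tokens consumed by the LLM alongside text embeddings. All names, dimensions, and the token count here are assumptions for illustration.

```python
# Hypothetical sketch of a modality-to-text-space aligner (not AnyMAL's actual code).
# It projects a pooled feature vector from a frozen modality encoder into a fixed
# number of pseudo-tokens living in the LLM's embedding space.
import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Maps modality features (e.g., image-encoder outputs) to soft prefix tokens."""
    def __init__(self, feat_dim: int, llm_dim: int, num_tokens: int = 32):
        super().__init__()
        self.num_tokens = num_tokens
        # Single learned projection; real systems may use deeper or attention-based modules.
        self.proj = nn.Linear(feat_dim, llm_dim * num_tokens)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feat_dim) pooled modality embedding
        batch = features.shape[0]
        tokens = self.proj(features).view(batch, self.num_tokens, -1)
        # (batch, num_tokens, llm_dim): prepended to the text token embeddings
        return tokens

# Usage with assumed sizes: a 1024-d image embedding aligned to a 4096-d LLM space.
aligner = ModalityAligner(feat_dim=1024, llm_dim=4096)
image_feat = torch.randn(2, 1024)
prefix_tokens = aligner(image_feat)  # shape: (2, 32, 4096)
```

In this kind of setup, only the aligner (and optionally lightweight adapters) is trained, while the modality encoder and the LLM remain frozen, which keeps the approach efficient and scalable across modalities.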
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| video-question-answering-on-situated | AnyMAL-70B (0-shot) | Average Accuracy: 48.2 |