Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng

Abstract

The emergence of GPT-4o-like large multimodal models (LMMs) has spurred the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns vision and speech to text based on their relationships. For vision, which is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech, which is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
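
To make the two alignment strategies concrete, below is a minimal PyTorch sketch of sequence-dimension concatenation for the vision-text side. All module names, dimensions, and the `VisionTextConcat` wrapper are illustrative assumptions, not Stream-Omni's actual code:

```python
import torch
import torch.nn as nn

class VisionTextConcat(nn.Module):
    """Hypothetical sketch: project vision features into the LLM embedding
    space and concatenate them with text embeddings along the sequence dim."""

    def __init__(self, vision_dim=1024, llm_dim=4096, vocab_size=32000):
        super().__init__()
        self.projector = nn.Linear(vision_dim, llm_dim)        # vision -> LLM space
        self.embed_tokens = nn.Embedding(vocab_size, llm_dim)  # LLM token embeddings

    def forward(self, vision_feats, text_ids):
        # vision_feats: (batch, num_patches, vision_dim) from a vision encoder
        # text_ids:     (batch, text_len) tokenized prompt
        vision_embeds = self.projector(vision_feats)  # (batch, num_patches, llm_dim)
        text_embeds = self.embed_tokens(text_ids)     # (batch, text_len, llm_dim)
        # Concatenate along the sequence dimension; the LLM backbone then
        # attends jointly over vision and text positions.
        return torch.cat([vision_embeds, text_embeds], dim=1)
```

For the speech-text side, the abstract describes a CTC-based layer-dimension mapping. One plausible reading is a CTC objective applied to an intermediate layer's speech representations against the text transcript; the sketch below illustrates that idea under assumed shapes and names (`CTCSpeechTextAligner` and its head are hypothetical, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTCSpeechTextAligner(nn.Module):
    """Hypothetical sketch: map per-frame speech states to token logits and
    train them against the transcript with a CTC loss."""

    def __init__(self, hidden_dim=4096, vocab_size=32000, blank_id=0):
        super().__init__()
        self.ctc_head = nn.Linear(hidden_dim, vocab_size)  # frame-level token logits
        self.blank_id = blank_id

    def forward(self, speech_hidden, text_ids, text_lens):
        # speech_hidden: (batch, T, hidden) intermediate-layer speech states
        # text_ids:      (batch, max_text_len) transcript token ids (padded)
        # text_lens:     (batch,) true transcript lengths
        log_probs = F.log_softmax(self.ctc_head(speech_hidden), dim=-1)
        log_probs = log_probs.transpose(0, 1)  # ctc_loss expects (T, batch, vocab)
        input_lens = torch.full((speech_hidden.size(0),),
                                speech_hidden.size(1), dtype=torch.long)
        # CTC marginalizes over all monotonic frame-to-token alignments,
        # pulling the speech representations toward the text sequence
        # without frame-level supervision.
        return F.ctc_loss(log_probs, text_ids, input_lens, text_lens,
                          blank=self.blank_id, zero_infinity=True)
```

In this sketch, a greedy CTC decode of the per-frame logits would also yield an intermediate transcription, consistent with the abstract's note that the layer-dimension mapping exposes ASR transcriptions during speech interaction.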

Code Repositories

ictnlp/stream-omni (official, PyTorch)
ictnlp/streamspeech (PyTorch)
