a month ago

Qwen3-Omni Technical Report

Abstract

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
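
The abstract's claim that a causal ConvNet enables streaming from the first codec frame rests on a general property of causal convolutions: each output depends only on the current and earlier inputs, so nothing has to wait for a future block. The following is a toy sketch of that property only, not the report's network. It assumes scalar frames and a fixed hand-written kernel, and the names `causal_conv1d` and `StreamingCausalConv` are hypothetical.

```python
from typing import List, Sequence


def causal_conv1d(frames: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Offline causal convolution: output[t] uses frames[t-k+1 .. t] only."""
    k = len(kernel)
    # Left-pad with zeros so the first output needs no future frames.
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


class StreamingCausalConv:
    """Same filter applied one frame at a time with a rolling history buffer."""

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        # History of the previous k-1 frames, zero-initialised (cold start).
        self.buffer = [0.0] * (len(self.kernel) - 1)

    def push(self, frame: float) -> float:
        """Consume one frame and immediately return one output."""
        window = self.buffer + [frame]
        out = sum(w * x for w, x in zip(self.kernel, window))
        self.buffer = window[1:]  # drop the oldest frame, keep the newest k-1
        return out


# Toy inputs (assumed values, chosen only for illustration).
frames = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 2.0, 3.0]

stream = StreamingCausalConv(kernel)
streamed = [stream.push(f) for f in frames]

# Frame-by-frame output matches the offline result, and a prefix of the
# input yields the same prefix of the output.
assert streamed == causal_conv1d(frames, kernel)
assert causal_conv1d(frames[:2], kernel) == causal_conv1d(frames, kernel)[:2]
```

Because the first call to `push` already returns a valid output, a decoder built this way can emit audio as soon as the first frame arrives. A block-wise method, by contrast, must wait for a whole block of frames before producing anything.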

Benchmarks

Benchmark                                             Methodology                  Metrics
optical-character-recognition-on-ocrbench-v2-chinese  Qwen3-Omni-30B-A3B-Instruct  Accuracy: 60.0
optical-character-recognition-on-ocrbench-v2-english  Qwen3-Omni-30B-A3B-Instruct  Accuracy: 61.3
