MiMo-Audio-7B-Instruct: Xiaomi's Open Source end-to-end Voice Model
1. Tutorial Introduction

MiMo-Audio is an end-to-end speech model released by Xiaomi in September 2025. Its pre-training data was scaled to over 100 million hours, and the researchers observed that the model shows few-shot learning capabilities across a variety of audio tasks. The team systematically evaluated these capabilities and found that MiMo-Audio-7B-Base achieved state-of-the-art (SOTA) results among open-source models on speech intelligence and audio understanding benchmarks. Beyond standard metrics, the model also generalizes to tasks not covered in the training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also has strong speech continuation capabilities and can generate highly realistic talk shows, recitations, live broadcasts, and debates. In the post-training phase, the researchers compiled a diverse instruction fine-tuning corpus and introduced a thinking mechanism into both audio understanding and generation. The resulting MiMo-Audio-7B-Instruct achieved SOTA results among open-source models on audio understanding benchmarks, spoken dialogue benchmarks, and instruction-based speech synthesis (instruct-TTS), approaching or surpassing closed-source models in some scenarios. The results are described in the "MiMo-Audio-Technical-Report".
This tutorial uses a single RTX 5090 graphics card as the computing resource.
2. Effect Examples
1. 🔊 Audio Understanding

2. 🎵 Audio Generation (Text-to-Speech)

3. 🎤 Spoken Dialogue

4. 💬 S2T Dialogue

5. 📝 Text-to-Text Dialogue

3. Operation Steps
1. Start the container

2. Initialize weight parameters
If the page displays "Bad Gateway", the model is still initializing. Because the model is large, wait about 2-3 minutes and then refresh the page.
In Safari, the audio may not play directly in the browser; download the file and play it locally.
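Instead of refreshing by hand while "Bad Gateway" is shown, the wait can be scripted. The following is a minimal sketch using only the Python standard library. It assumes the web interface answers with HTTP 200 once the model has loaded; the function names, the timing defaults, and the URL in the usage comment are hypothetical and are not part of the tutorial environment.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error responses such as 502 Bad Gateway arrive as HTTPError.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def wait_until_ready(url, max_wait=300.0, interval=10.0,
                     fetch=fetch_status, sleep=time.sleep):
    """Poll url until it answers 200. Return True on success, False on timeout.

    Assumption: a 200 response means the page is usable. The fetch and sleep
    parameters exist only so the loop can be exercised without a network.
    """
    waited = 0.0
    while True:
        if fetch(url) == 200:
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Example usage (hypothetical address; replace with the URL your container exposes):
# if wait_until_ready("http://localhost:8080"):
#     print("Model page is ready")
# else:
#     print("Still not ready; check the container")
```

The default of 300 seconds leaves some margin over the 2-3 minutes mentioned above; adjust it if your instance initializes more slowly.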

3. Audio Understanding

4. Audio Generation

5. Voice Conversation

6. Voice-to-text conversation

7. Text-to-text conversation

Citation Information
@misc{coreteam2025mimoaudio,
      title={MiMo-Audio: Audio Language Models are Few-Shot Learners}, 
      author={LLM-Core-Team Xiaomi},
      year={2025},
      url={https://github.com/XiaomiMiMo/MiMo-Audio}, 
}