3 months ago

Step-Audio 2 Technical Report

Abstract

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
