
Abstract
As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verifiers, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark, a threshold previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verification system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.
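For readers unfamiliar with the GRPO algorithm mentioned above, its core idea is to replace a learned value baseline with a group-relative one: several responses are sampled per prompt, and each response's advantage is its reward centered and scaled by the group's statistics. The sketch below illustrates only that baseline computation, not the paper's improved variant, whose modifications are not described in this abstract; the exact normalization (sample vs. population standard deviation, clipping) varies across implementations.

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages for one prompt's sampled responses.

    Each advantage is (reward - group mean) / group std, so the group
    itself acts as the baseline instead of a learned value function.
    This is a minimal sketch of standard GRPO, not Baichuan-M2's
    improved variant.
    """
    mean = statistics.mean(group_rewards)
    # Guard against zero std when all sampled responses score identically.
    std = statistics.pstdev(group_rewards) or 1.0
    return [(r - mean) / std for r in group_rewards]
```

A usage note: advantages within a group always sum to zero, so responses scoring above the group mean are reinforced and those below it are penalized, with no separate critic network required.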