Lijun Liu, Ruiyang Li, Zhaocheng Liu, Chenglin Zhu, Chong Li, Jiehan Cheng, Qiang Ju, Jian Xie

Abstract
Visual Information Extraction (VIE) converts unstructured document images into structured formats like JSON, which is critical for medical applications such as report analysis and online consultations. Traditional methods rely on OCR and language models, while end-to-end multimodal models offer direct JSON generation. However, domain-specific schemas and high annotation costs limit their effectiveness in medical VIE. We base our approach on the Reinforcement Learning with Verifiable Rewards (RLVR) framework to address these challenges using only 100 annotated samples. Our approach combines a diverse dataset, a balanced precision-recall reward mechanism that reduces hallucinations and improves field coverage, and innovative sampling strategies that enhance reasoning capabilities. Fine-tuning Qwen2.5-VL-7B with our RLVR method, we achieve state-of-the-art performance on medical VIE tasks, significantly improving F1, precision, and recall. While our models excel on tasks similar to the medical datasets, performance drops on dissimilar tasks, highlighting the need for domain-specific optimization. Case studies further demonstrate the value of reasoning during training and inference for VIE.
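To make the idea of a balanced precision-recall reward concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the reward used in this work: the abstract does not specify how fields are matched or how precision and recall are weighted. The sketch assumes exact string matching on flattened JSON fields and an unweighted F1 score, and the helper names `flatten` and `f1_reward` are hypothetical.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON into a set of (path, value) pairs (hypothetical helper)."""
    items = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            items |= flatten(value, f"{prefix}/{key}")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items |= flatten(value, f"{prefix}/{index}")
    else:
        # Leaf value: compare as a stripped string (assumed matching rule).
        items.add((prefix, str(obj).strip()))
    return items


def f1_reward(pred_text, gold):
    """Return a reward in [0, 1] for a model's JSON output against a gold record.

    Assumption: reward = F1 over exactly matching fields. Precision penalizes
    hallucinated fields; recall penalizes missing fields.
    """
    try:
        pred = json.loads(pred_text)
    except (json.JSONDecodeError, TypeError):
        return 0.0  # Unparseable output earns no reward.

    pred_items = flatten(pred)
    gold_items = flatten(gold)
    true_positives = len(pred_items & gold_items)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(pred_items)
    recall = true_positives / len(gold_items)
    return 2 * precision * recall / (precision + recall)


# Made-up example record, for illustration only.
gold = {"name": "A", "tests": [{"item": "WBC", "value": "5.2"}]}
pred = '{"name": "A", "tests": [{"item": "WBC", "value": "9.9"}], "extra": "x"}'
reward = f1_reward(pred, gold)  # one wrong value and one extra field lower the score
```

Because the score can be computed automatically from the annotated JSON, it fits the "verifiable reward" setting: an output with extra fields loses precision, and an output with omitted fields loses recall, so neither over-generation nor under-generation is favored.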