8 months ago

Document Understanding

Natural Language Processing

Lijun Liu Ruiyang Li Zhaocheng Liu Chenglin Zhu Chong Li Jiehan Cheng Qiang Ju Jian Xie

Abstract

Visual Information Extraction (VIE) converts unstructured document images into structured formats like JSON, which is critical for medical applications such as report analysis and online consultations. Traditional methods rely on OCR and language models, while end-to-end multimodal models offer direct JSON generation. However, domain-specific schemas and high annotation costs limit their effectiveness in medical VIE. We base our approach on the Reinforcement Learning with Verifiable Rewards (RLVR) framework to address these challenges using only 100 annotated samples. Our approach combines a diverse dataset, a balanced precision-recall reward mechanism that reduces hallucinations and improves field coverage, and innovative sampling strategies that enhance reasoning capabilities. Fine-tuning Qwen2.5-VL-7B with our RLVR method, we achieve state-of-the-art performance on medical VIE tasks, significantly improving F1, precision, and recall. While our models excel on tasks similar to medical datasets, performance drops on dissimilar tasks, highlighting the need for domain-specific optimization. Case studies further demonstrate the value of reasoning during training and inference for VIE.
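To make the idea of a balanced precision-recall reward concrete, here is a minimal sketch of a verifiable reward for JSON extraction. The abstract does not specify the reward formula, so everything here is an assumption for illustration: the helper names `flatten` and `field_reward`, exact string matching per field, and the use of an F-beta score to balance precision (which penalizes hallucinated fields) against recall (which penalizes missing fields).

```python
import json
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "") -> Dict[str, str]:
    """Flatten nested JSON into {dotted.path: value-as-string} pairs."""
    items: Dict[str, str] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        # Leaf value: compare as a stripped string (an illustrative choice).
        items[prefix.rstrip(".")] = str(obj).strip()
    return items


def field_reward(prediction_text: str, reference: Dict[str, Any],
                 beta: float = 1.0) -> float:
    """Hypothetical field-level F-beta reward in [0, 1].

    Precision drops when the model emits fields that are absent or wrong
    (hallucinations); recall drops when reference fields are missing.
    """
    try:
        predicted = json.loads(prediction_text)
    except json.JSONDecodeError:
        return 0.0  # Unparseable output earns no reward.
    if not isinstance(predicted, dict):
        return 0.0  # The sketch assumes a JSON object at the top level.

    pred_fields = flatten(predicted)
    ref_fields = flatten(reference)
    correct = sum(1 for key, value in pred_fields.items()
                  if ref_fields.get(key) == value)
    if correct == 0:
        return 0.0

    precision = correct / len(pred_fields)
    recall = correct / len(ref_fields)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

With `beta = 1.0` the reward is the plain F1 over fields; a value above 1 would weight recall (field coverage) more heavily, and a value below 1 would weight precision more heavily. Because the score is computed directly from the annotated JSON, it can be checked automatically, which is the property RLVR relies on.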

