HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Abstract
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM, HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
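
The core data construction step described above is to have a vision-capable MLLM look at each PubMed image together with its caption and in-text context ("unblinded") and rewrite the pair into VQA samples. Below is a minimal illustrative sketch of that idea in Python, assuming the OpenAI chat completions API; the prompt wording, model name, output schema, and the function `image_text_to_vqa` are assumptions for illustration, not the paper's actual pipeline or prompts.

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical reformatting prompt; the paper's real prompt is not reproduced here.
REFORMAT_PROMPT = (
    "You are given a medical figure, its caption, and surrounding in-text context. "
    "Write question-answer pairs that can be answered from the image. "
    "Return only a JSON list of objects with 'question' and 'answer' fields.\n\n"
    "Caption: {caption}\nContext: {context}"
)

def image_text_to_vqa(image_url: str, caption: str, context: str) -> list[dict]:
    """Convert one PubMed image-text pair into synthetic VQA samples (illustrative)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the paper used GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": REFORMAT_PROMPT.format(caption=caption, context=context)},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    # A production pipeline would validate/repair the JSON before keeping the sample.
    return json.loads(response.choices[0].message.content)
```

Because the model sees the image as well as the text, it can drop noisy or mismatched caption content and produce questions grounded in what is actually visible, which is the "denoise and reformat" behavior the abstract refers to.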