HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Abstract
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM, HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
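
The core data construction step described above is to have a vision-capable MLLM look at each PubMed image together with its caption and in-text context ("unblinded") and rewrite the pair into VQA samples. Below is a minimal illustrative sketch of that idea in Python, assuming the OpenAI chat completions API; the prompt wording, model name, output schema, and the function `image_text_to_vqa` are assumptions for illustration, not the paper's actual pipeline or prompts.

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical reformatting prompt; the paper's real prompt is not reproduced here.
REFORMAT_PROMPT = (
    "You are given a medical figure, its caption, and surrounding in-text context. "
    "Write question-answer pairs that can be answered from the image. "
    "Return only a JSON list of objects with 'question' and 'answer' fields.\n\n"
    "Caption: {caption}\nContext: {context}"
)

def image_text_to_vqa(image_url: str, caption: str, context: str) -> list[dict]:
    """Convert one PubMed image-text pair into synthetic VQA samples (illustrative)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the paper used GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": REFORMAT_PROMPT.format(caption=caption, context=context)},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    # A production pipeline would validate/repair the JSON before keeping the sample.
    return json.loads(response.choices[0].message.content)
```

Because the model sees the image as well as the text, it can drop noisy or mismatched caption content and produce questions grounded in what is actually visible, which is the "denoise and reformat" behavior the abstract refers to.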