8 months ago

Abstract

Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery by leveraging artificial intelligence to interpret and answer questions based on medical images. In this study, we reframe the problem of MedVQA as a generation task that naturally follows the human-machine interaction and propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model. We establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs of 149k images that cover various modalities or diseases. We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef-2019, significantly outperforming existing MedVQA models in generating relevant, accurate free-form answers. In addition, we propose a test set that has undergone manual verification, which is significantly more challenging, serving to better monitor the development of generative MedVQA methods. To facilitate comprehensive evaluation and comparison, we have maintained a leaderboard at https://paperswithcode.com/paper/pmc-vqa-visual-instruction-tuning-for-medical, offering a centralized resource for tracking progress and benchmarking state-of-the-art approaches. The PMC-VQA dataset emerges as a vital resource for the field of research, and the MedVInT presents a significant breakthrough in the area of MedVQA.
