Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data
Yucheng Shi Quanzheng Li Jin Sun Xiang Li Ninghao Liu

Abstract
Large multimodal models (LMMs) have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives and provide justifiable explanations for their predictions. To address this, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts, carefully selected based on their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of data synthesis and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate the effectiveness of our method in improving both the accuracy and explainability of specialized visual classification tasks.
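The reward model-free filtering step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Candidate` class, the `concept_coverage` and `select_answer` functions, and the string-matching score are all hypothetical stand-ins. It assumes a candidate answer is kept only if its stated label matches the ground truth, and that among the surviving candidates the one mentioning the most image-aligned concepts is preferred.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str             # candidate explanation sampled from the model
    predicted_label: str  # class label stated in that candidate


def concept_coverage(text: str, concepts: List[str]) -> float:
    """Fraction of image-aligned concepts mentioned in the text.

    Assumption: plain substring matching stands in for whatever
    alignment measure the paper actually uses.
    """
    if not concepts:
        return 0.0
    lowered = text.lower()
    return sum(c.lower() in lowered for c in concepts) / len(concepts)


def select_answer(candidates: List[Candidate],
                  true_label: str,
                  concepts: List[str]) -> Optional[Candidate]:
    """Rejection-sampling filter that needs no learned reward model.

    Candidates with the wrong label are rejected; the remaining one
    with the highest concept coverage is returned (None if all fail).
    """
    correct = [c for c in candidates if c.predicted_label == true_label]
    if not correct:
        return None
    return max(correct, key=lambda c: concept_coverage(c.text, concepts))


# Hypothetical example data, for illustration only.
concepts = ["red crown", "black wings"]
candidates = [
    Candidate("It has a red crown.", "cardinal"),
    Candidate("It has a red crown and black wings.", "cardinal"),
    Candidate("Red crown and black wings.", "sparrow"),
]
best = select_answer(candidates, "cardinal", concepts)
```

In an iterative setup of the kind the abstract describes, the selected answers would form the fine-tuning set for the next round, and the sampling and filtering would then be repeated with the updated model.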
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| fine-grained-visual-recognition-on-cub-200-1 | Selfsynthx | Accuracy (%): 85.02 |
| fine-grained-visual-recognition-on-fgvc-2 | Selfsynthx | Accuracy (%): 91.99 |
| fine-grained-visual-recognition-on-new-plant | Selfsynthx | Accuracy (%): 97.16 |
| fine-grained-visual-recognition-on-stanford-2 | Selfsynthx | Accuracy (%): 86.91 |
| pneumonia-detection-on-chest-x-ray-images-1 | Selfsynthx | Accuracy (%): 98.72 |