Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy
Simon Ging; María A. Bravo; Thomas Brox

Abstract
The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We perform a human evaluation study upon which we base our decision on the final metric. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.
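The follow-up mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy hierarchy, the label names, and the question template are all assumptions made for the example.

```python
from typing import Optional

# Toy label hierarchy: fine-grained label -> parent (coarser) category.
# Illustrative only; the benchmark derives its hierarchy from the
# classification datasets' label spaces.
HIERARCHY = {
    "beagle": "dog",
    "siamese cat": "cat",
    "dog": "animal",
    "cat": "animal",
}

def follow_up_question(prediction: str, ground_truth: str) -> Optional[str]:
    """If the model answered with an ancestor of the ground-truth label,
    return a more specific follow-up question instead of scoring it wrong."""
    # Walk up from the ground truth and collect its ancestors.
    ancestors = []
    node = ground_truth
    while node in HIERARCHY:
        node = HIERARCHY[node]
        ancestors.append(node)
    if prediction in ancestors:
        # Coarse but correct: ask the model to be more specific.
        return f"What kind of {prediction} is it?"
    return None

print(follow_up_question("dog", "beagle"))  # coarse-but-correct -> follow-up
print(follow_up_question("cat", "beagle"))  # wrong branch -> None
```

The key design point is that a coarse answer ("dog" for a beagle) is not simply marked wrong; it triggers an automatically generated follow-up, which is why the table below reports separate "Follow-up" variants of each metric.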
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| visual-question-answering-vqa-on-activitynet-1 | BLIP-2 T5 | ClipMatch@1: 53.39; ClipMatch@5: 74.71; Contains: 15.70; ExactMatch: 7.07; Follow-up ClipMatch@1: 62.02; Follow-up ClipMatch@5: 75.13; Follow-up Contains: 18.09; Follow-up ExactMatch: 8.84 |
| visual-question-answering-vqa-on-coco | InstructBLIP Vicuna | ClipMatch@1: 59.58; ClipMatch@5: 73.32; Contains: 27.52; ExactMatch: 26.50 |
| visual-question-answering-vqa-on-imagenet | BLIP-2 OPT | ClipMatch@1: 57.10; ClipMatch@5: 77.24; Contains: 35.49; ExactMatch: 0.87; Follow-up ClipMatch@1: 67.22; Follow-up ClipMatch@5: 83.54; Follow-up Contains: 40.31; Follow-up ExactMatch: 2.54 |
| visual-question-answering-vqa-on-ovad | BLIP | Contains w. Synonyms: 45.70; ExactMatch w. Synonyms: 36.99 |
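The two simplest metrics in the table, ExactMatch and Contains, can be sketched as below. The normalization rules (lowercasing, stripping punctuation and articles) are assumptions for the sake of a runnable example; the benchmark's exact text normalization may differ.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.
    Illustrative normalization, not the benchmark's exact rules."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)          # drop punctuation
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> bool:
    """Normalized prediction equals the normalized ground-truth label."""
    return normalize(prediction) == normalize(ground_truth)

def contains(prediction: str, ground_truth: str) -> bool:
    """Ground-truth label appears anywhere in the (often verbose) prediction."""
    return normalize(ground_truth) in normalize(prediction)

print(exact_match("A beagle.", "beagle"))                   # True
print(contains("It looks like a beagle puppy", "beagle"))   # True
print(exact_match("It looks like a beagle puppy", "beagle"))  # False
```

The gap between the two columns is visible in the table: generative models often produce full sentences, so Contains scores well above ExactMatch (e.g., 35.49 vs. 0.87 for BLIP-2 OPT on ImageNet), which is one motivation for the paper's comparison of metrics.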