Visual Question Answering on DocVQA Test

Evaluation Metric

ANLS
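
ANLS (Average Normalized Levenshtein Similarity) scores each predicted answer by its edit-distance similarity to the closest accepted answer, zeroes out any score whose normalized distance reaches a threshold, and averages over all questions. The sketch below illustrates that computation in Python. It is not the official evaluation script: the function names, the lowercase and whitespace normalization, and the default threshold of 0.5 are assumptions chosen to match the commonly described definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Illustrative ANLS over a list of questions.

    predictions: list of predicted answer strings
    references:  list of lists of accepted answers, one list per question
    threshold:   assumed cutoff; distances at or above it score 0
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)  # keep the closest accepted answer
        total += best
    return total / len(predictions)
```

Under these assumptions, an exact match scores 1.0, a one-character slip in a seven-character answer scores 6/7, and an unrelated answer scores 0, so the metric tolerates small OCR-style errors while still penalizing wrong answers.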

Evaluation Results

Performance of each model on this benchmark

| Model | ANLS | Paper Title | Repository |
| --- | --- | --- | --- |
| Human | 0.9436 | DocVQA: A Dataset for VQA on Document Images | |
| MLCD-Embodied-7B | 0.916 | Multi-label Cluster Discrimination for Visual Representation Learning | |
| SMoLA-PaLI-X Specialist | 0.908 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - |
| SMoLA-PaLI-X Generalist | 0.906 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - |
| Qwen-VL-Plus | 0.9024 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | |
| ScreenAI 5B (4.62 B params, w/OCR) | 0.8988 | ScreenAI: A Vision-Language Model for UI and Infographics Understanding | |
| PaLI-3 (w/ OCR) | 0.886 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | |
| ERNIE-Layout large (ensemble) | 0.8841 | ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | |
| GPT-4 | 0.884 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | |
| DocFormerv2-large | 0.8784 | DocFormerv2: Local Features for Document Understanding | |
| UDOP (aux) | 0.878 | Unifying Vision, Text, and Layout for Universal Document Processing | |
| PaLI-3 | 0.876 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | |
| TILT-Large | 0.8705 | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | |
| PaLI-X (Single-task FT w/ OCR) | 0.868 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | |
| LayoutLMv2 LARGE | 0.8672 | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | |
| ERNIE-Layout large | 0.8486 | ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | |
| UDOP | 0.847 | Unifying Vision, Text, and Layout for Universal Document Processing | |
| TILT-Base | 0.8392 | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | |
| Claude + LATIN-Prompt | 0.8336 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | |
| GPT-3.5 + LATIN-Prompt | 0.8255 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | |