5 months ago

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Abstract

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
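The abstract does not spell out the "flexible patching strategy of pix2struct", so the following is only a minimal sketch of the general idea: choosing a patch grid that roughly preserves the image's aspect ratio while staying within a fixed patch budget. The function name `flexible_patch_grid` and the default values for `patch_size` and `max_patches` are assumptions made for illustration, not ScreenAI's actual configuration.

```python
import math


def flexible_patch_grid(height, width, patch_size=16, max_patches=1024):
    """Return (rows, cols) for an aspect-ratio-preserving patch grid.

    Hypothetical helper: the image would be resized to
    (rows * patch_size, cols * patch_size) before being cut into patches,
    so that rows * cols never exceeds max_patches.
    """
    # Scale factor that makes the resized image fill the patch budget.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    # Round down so the budget is respected; keep at least one row and column.
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols


if __name__ == "__main__":
    # A portrait phone screenshot gets more rows than columns.
    print(flexible_patch_grid(1920, 1080))
    # A wide infographic gets more columns than rows.
    print(flexible_patch_grid(600, 2400))
```

Unlike a fixed square grid, this keeps tall phone screens and wide infographics from being distorted into the same shape, which is the motivation usually given for this kind of patching.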

Code Repositories

google-research-datasets/screen_qa
Official
Mentioned in GitHub

Benchmarks

Benchmark | Methodology | Metrics
chart-question-answering-on-chartqa | ScreenAI 5B (4.62 B params, w/ OCR) | 1:1 Accuracy: 76.7
visual-question-answering-on-docvqa-test | ScreenAI 5B (4.62 B params, w/ OCR) | ANLS: 0.8988
visual-question-answering-vqa-on | ScreenAI 5B (4.62 B params, w/ OCR) | ANLS: 65.90
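For readers unfamiliar with the ANLS metric reported above, here is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined for document VQA, with the usual threshold of 0.5. The lowercasing and whitespace stripping are assumptions; the exact normalization used by each leaderboard may differ.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best per-reference similarity.

    predictions: list of predicted answer strings.
    references: list of lists of acceptable answers, one list per question.
    A similarity is zeroed when the normalized distance reaches the threshold.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)
```

This function returns a value in [0, 1]. The two ANLS entries in the table appear to be on different scales (0.8988 and 65.90), which would be consistent with one of them being reported as a percentage.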