8 months ago

Visual Question Answering

Document Understanding

Natural Language Processing

Gilles Baechler Srinivas Sunkara Maria Wang Fedir Zubach Hassan Mansoor Vincent Etter Victor Carbune Jason Lin Jindong Chen Abhanshu Sharma

Abstract

Screen user interfaces (UIs) and infographics, sharing similar visuallanguage and design principles, play important roles in human communication andhuman-machine interaction. We introduce ScreenAI, a vision-language model thatspecializes in UI and infographics understanding. Our model improves upon thePaLI architecture with the flexible patching strategy of pix2struct and istrained on a unique mixture of datasets. At the heart of this mixture is anovel screen annotation task in which the model has to identify the type andlocation of UI elements. We use these text annotations to describe screens toLarge Language Models and automatically generate question-answering (QA), UInavigation, and summarization training datasets at scale. We run ablationstudies to demonstrate the impact of these design choices. At only 5Bparameters, ScreenAI achieves new state-of-the-artresults on UI- andinfographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and WidgetCaptioning), and new best-in-class performance on others (Chart QA, DocVQA, andInfographicVQA) compared to models of similar size. Finally, we release threenew datasets: one focused on the screen annotation task and two others focusedon question answering.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Powered by MailChimp

8 months ago

Visual Question Answering

Document Understanding

Natural Language Processing

Gilles Baechler Srinivas Sunkara Maria Wang Fedir Zubach Hassan Mansoor Vincent Etter Victor Carbune Jason Lin Jindong Chen Abhanshu Sharma

Abstract

Screen user interfaces (UIs) and infographics, sharing similar visuallanguage and design principles, play important roles in human communication andhuman-machine interaction. We introduce ScreenAI, a vision-language model thatspecializes in UI and infographics understanding. Our model improves upon thePaLI architecture with the flexible patching strategy of pix2struct and istrained on a unique mixture of datasets. At the heart of this mixture is anovel screen annotation task in which the model has to identify the type andlocation of UI elements. We use these text annotations to describe screens toLarge Language Models and automatically generate question-answering (QA), UInavigation, and summarization training datasets at scale. We run ablationstudies to demonstrate the impact of these design choices. At only 5Bparameters, ScreenAI achieves new state-of-the-artresults on UI- andinfographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and WidgetCaptioning), and new best-in-class performance on others (Chart QA, DocVQA, andInfographicVQA) compared to models of similar size. Finally, we release threenew datasets: one focused on the screen annotation task and two others focusedon question answering.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Powered by MailChimp