4 months ago

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

Abstract

Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility tree), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph, adaptively identifying redundant relationships among its nodes and using them as the criteria for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data type imbalances. With the above components, ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further reduces redundant visual tokens by 33% during training and yields a 1.4x speedup. Navigation experiments across the web (Mind2Web), mobile (AITW), and online (MiniWob) environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.
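The idea behind UI-guided visual token selection can be illustrated with a small sketch. It is not the authors' implementation: it assumes that each screenshot patch is summarized by a mean RGB color, that adjacent patches with near-identical color count as redundant and are linked into one connected component, and that a fixed fraction of tokens is then sampled at random from each component. The function names `build_patch_components` and `select_tokens`, and the `threshold` and `keep_ratio` parameters, are hypothetical.

```python
import random
from collections import defaultdict


def build_patch_components(patches, threshold=0):
    """Group a grid of patches into connected components (union-find).

    patches: list of rows, each row a list of (r, g, b) mean colors.
    Assumption: 4-neighbor patches whose channels differ by at most
    `threshold` are treated as redundant and merged.
    Returns a flat list with one component id per patch (row-major).
    """
    rows, cols = len(patches), len(patches[0])
    parent = list(range(rows * cols))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    def similar(p, q):
        return max(abs(x - y) for x, y in zip(p, q)) <= threshold

    for r in range(rows):
        for c in range(cols):
            idx = r * cols + c
            if c + 1 < cols and similar(patches[r][c], patches[r][c + 1]):
                union(idx, idx + 1)
            if r + 1 < rows and similar(patches[r][c], patches[r + 1][c]):
                union(idx, idx + cols)
    return [find(i) for i in range(rows * cols)]


def select_tokens(component_ids, keep_ratio=0.5, seed=0):
    """Randomly keep a fraction of patch indices inside each component.

    Every component keeps at least one token, so small, visually
    distinct regions (icons, text) are never dropped entirely.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, cid in enumerate(component_ids):
        groups[cid].append(idx)
    kept = []
    for members in groups.values():
        k = max(1, round(len(members) * keep_ratio))
        kept.extend(rng.sample(members, k))
    return sorted(kept)


# Toy 2x3 "screenshot": a white background with one blue patch.
WHITE, BLUE = (255, 255, 255), (0, 0, 255)
grid = [[WHITE, WHITE, WHITE],
        [WHITE, BLUE, WHITE]]
components = build_patch_components(grid)
kept_tokens = select_tokens(components, keep_ratio=0.5)
```

In this toy grid the five white patches fall into one component and the blue patch forms its own, so the uniform background is thinned out while the distinct patch is always retained. How the selection is applied inside the self-attention blocks is not covered by this sketch.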

Code Repositories

showlab/showui

Official

pytorch

Benchmarks

Benchmark	Methodology	Metrics
natural-language-visual-grounding-on	ShowUI	Accuracy (%): 75.1
natural-language-visual-grounding-on	ShowUI-G	Accuracy (%): 75.0
