ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

Abstract
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility tree), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection to reduce computational costs by formulating screenshots as a UI connected graph, adaptively identifying their redundant relationships and using them as the criteria for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming that flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets built by careful data curation and a resampling strategy to address significant data type imbalances. With the above components, ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further reduces 33% of redundant visual tokens during training and speeds up the performance by 1.4x. Navigation experiments across web Mind2Web, mobile AITW, and online MiniWob environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.
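To make the idea of UI-guided token selection more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how patches are connected or how tokens are chosen within a group, so this sketch assumes that (a) each screenshot patch is summarized by one RGB color, (b) adjacent patches whose colors differ by at most `tol` per channel belong to the same connected component, and (c) a fixed fraction of tokens is sampled at random from each component. The names `build_patch_components`, `select_tokens`, `tol`, and `keep_ratio` are hypothetical.

```python
import math
import random
from collections import defaultdict


def build_patch_components(patches, rows, cols, tol=0):
    """Group a row-major grid of patch colors into connected components.

    Assumption: two neighboring patches are "redundant" when every RGB
    channel differs by at most `tol`. Returns one component id per patch.
    """
    parent = list(range(rows * cols))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller index as the root so ids are deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    def similar(p, q):
        return all(abs(x - y) <= tol for x, y in zip(p, q))

    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols and similar(patches[i], patches[i + 1]):
                union(i, i + 1)        # right neighbor
            if r + 1 < rows and similar(patches[i], patches[i + cols]):
                union(i, i + cols)     # bottom neighbor
    return [find(i) for i in range(rows * cols)]


def select_tokens(component_ids, keep_ratio=0.5, seed=0):
    """Randomly keep a fraction of token indices from each component.

    Assumption: every component keeps at least one token, so small but
    distinctive UI elements (icons, text) are never dropped entirely.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, comp in enumerate(component_ids):
        groups[comp].append(idx)
    kept = []
    for members in groups.values():
        k = max(1, math.ceil(len(members) * keep_ratio))
        kept.extend(rng.sample(members, k))
    return sorted(kept)


# Toy 2x3 "screenshot": a white background with one dark patch (index 4).
WHITE, DARK = (255, 255, 255), (20, 20, 20)
demo_patches = [WHITE, WHITE, WHITE, WHITE, DARK, WHITE]
demo_ids = build_patch_components(demo_patches, rows=2, cols=3)
demo_kept = select_tokens(demo_ids, keep_ratio=0.5)
```

In this toy grid the five white patches form one component and the dark patch forms its own, so the dark patch is always kept while the uniform background is subsampled. Large uniform regions contribute the most redundant tokens, which is the intuition behind the token reduction reported in the abstract.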
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| natural-language-visual-grounding-on | ShowUI | Accuracy (%): 75.1 |
| natural-language-visual-grounding-on | ShowUI-G | Accuracy (%): 75.0 |