
Abstract
Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and Gemini-ProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.
Code Repositories
Benchmarks
| Benchmark | Model | Accuracy (%) |
|---|---|---|
| natural-language-visual-grounding-on | OS-Atlas-Base-7B | 82.47 |
| natural-language-visual-grounding-on | OS-Atlas-Base-4B | 68.0 |
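Grounding accuracy of the kind reported above is commonly scored by checking whether a model's predicted click point falls inside the ground-truth element's bounding box. A minimal sketch, assuming a hypothetical record schema with normalized coordinates (the field names are illustrative, not the released corpus format):

```python
# Hypothetical cross-platform GUI grounding record; field names and the
# screenshot path are illustrative only, not the released dataset schema.
grounding_record = {
    "platform": "web",                         # e.g. windows, linux, macos, android, web
    "screenshot": "screens/example_0001.png",
    "instruction": "Click the 'Submit' button",
    "element_bbox": [0.62, 0.81, 0.74, 0.86],  # normalized [x1, y1, x2, y2]
}

def bbox_center(bbox):
    """Return the normalized center point of a bounding box."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def point_in_bbox(point, bbox):
    """True if a predicted (x, y) click lands inside the ground-truth box."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

# A prediction counts as a grounding hit if it falls inside the target element.
hit = point_in_bbox((0.70, 0.83), grounding_record["element_bbox"])
```

Benchmark accuracy is then the fraction of instructions for which `point_in_bbox` (or an equivalent overlap criterion) holds.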