ACTIONENGINE: From Reactive to Programmatic GUI Agents via State Machine Memory

Hongbin Zhong Fazole Faisalal Luis França Tanakorn Leesatapornwongsa Adriana Szekeres Kexin Rong Suman Nath

Abstract

Existing Graphical User Interface (GUI) agents operate through step-by-step calls to vision-language models: taking a screenshot, reasoning about the next action, executing it, then repeating on the new page. This results in high costs and latency that scale with the number of reasoning steps, and in limited accuracy because no memory of previously visited pages persists. We propose ACTIONENGINE, a training-free framework that transitions from reactive execution to programmatic planning through a novel two-agent architecture: a Crawling Agent that constructs an updatable state-machine memory of the GUIs through offline exploration, and an Execution Agent that leverages this memory to synthesize complete, executable Python programs for online task execution. To ensure robustness against evolving interfaces, execution failures trigger a vision-based re-grounding fallback that repairs the failed action and updates the memory. This design drastically improves both efficiency and accuracy: on Reddit tasks from the WebArena benchmark, our agent achieves 95% task success with on average a single LLM call, compared to 66% for the strongest vision-only baseline, while reducing cost by 11.8× and end-to-end latency by 2×.

One-sentence Summary

ACTIONENGINE is a training-free framework that moves GUI agents from reactive execution to programmatic planning: a Crawling Agent builds a state-machine memory and an Execution Agent uses it to synthesize Python programs, achieving 95% success on Reddit tasks in the WebArena benchmark while reducing cost by 11.8× and latency by 2× compared to vision-only baselines.

Key Contributions

  • The paper introduces ACTIONENGINE, a training-free framework that shifts GUI agents from reactive step-by-step execution to a global programmatic planning paradigm using a two-agent architecture.
  • This work presents a novel state-machine memory constructed by a Crawling Agent through offline exploration, which allows an Execution Agent to synthesize complete, executable Python programs for task execution in a single inference step.
  • The framework incorporates a vision-based re-grounding fallback mechanism that repairs failed actions and dynamically updates the state-machine memory to ensure robustness against evolving interfaces.
  • Empirical evaluations on the Reddit tasks within the WebArena benchmark demonstrate that the method achieves a 95% success rate, reducing end-to-end latency by 2× and costs by 11.8× compared to strong vision-only baselines.

Introduction

Modern Graphical User Interface (GUI) agents are essential for automating complex digital tasks, yet most current systems rely on a reactive paradigm where a model observes a screenshot and predicts a single next action at every step. This iterative approach suffers from high latency and computational costs that scale linearly with task length, while also being prone to error accumulation where a single visual hallucination can derail the entire process. The authors leverage a novel two-agent architecture called ACTIONENGINE to transition from this reactive execution to programmatic planning. By using a Crawling Agent to build an updatable state-machine memory of the interface and an Execution Agent to synthesize complete Python programs for task completion, the framework reduces reasoning complexity from $O(N)$ to $O(1)$ and provides a robust mechanism for self-correction through vision-based re-grounding.

Method

The authors introduce a novel two-agent architecture that separates application operation learning from task planning and execution. This approach shifts the computational burden from expensive runtime visual processing to amortized offline preprocessing, moving from a reactive paradigm to a programmatic one.

[Figure: ACTIONENGINE framework diagram]

The system lifecycle consists of three primary phases involving a Crawling Agent and an Execution Agent. The Crawling Agent operates offline to systematically explore the target GUI application and construct a State Machine Graph (SMG). This SMG is a directed graph where nodes correspond to symbolic application states and edges represent GUI operations, such as clicks or text entry, that trigger state transitions. The agent identifies core atoms to generate unique State IDs, effectively capturing the application's topology independent of specific task requirements.

The SMG is formally defined as a state machine $M = (S, O, T)$, where $S$ is a set of discrete states, $O$ is a set of executable operations, and $T : S \times O \rightarrow S$ is the transition function.
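
The definition above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: the class names `Operation` and `StateMachineGraph`, the field names, and the example states are all hypothetical, and the transition function is stored as a partial map because a crawler only records the edges it has actually observed.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Operation:
    """A GUI operation labelling an edge (hypothetical representation)."""
    kind: str    # assumed values such as "click" or "type"
    target: str  # assumed identifier of the UI element acted upon


@dataclass
class StateMachineGraph:
    """M = (S, O, T): states, operations, and a partial transition map."""
    states: Set[str] = field(default_factory=set)
    operations: Set[Operation] = field(default_factory=set)
    transitions: Dict[Tuple[str, Operation], str] = field(default_factory=dict)

    def add_transition(self, src: str, op: Operation, dst: str) -> None:
        # Recording an edge also registers its endpoints and its operation.
        self.states.update((src, dst))
        self.operations.add(op)
        self.transitions[(src, op)] = dst

    def step(self, state: str, op: Operation) -> Optional[str]:
        """Apply T(state, op); None means this edge was never recorded."""
        return self.transitions.get((state, op))


smg = StateMachineGraph()
open_forums = Operation("click", "forums_link")
smg.add_transition("home", open_forums, "forum_list")
next_state = smg.step("home", open_forums)
```

Keying the map on `(state, operation)` mirrors the signature $T : S \times O \rightarrow S$ directly, which keeps lookups constant-time during linking.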

[Figure: illustration of the State-Machine Graph]

To prevent state explosion, the authors decouple the static topology from dynamic data content. A state is defined by its atom signature, where atoms represent sets of related UI elements that appear atomically. While static atoms encode invariant interface elements, dynamic atoms represent elements with data-dependent content. This allows multiple data instances to map to a single state template, ensuring the graph size scales with the number of distinct templates rather than the number of data items.
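
One way to picture the atom signature is as a hash over which atoms are present, ignoring what the dynamic atoms currently display. The sketch below assumes this interpretation; the summary does not say how State IDs are actually computed, so the `Atom` type, the SHA-256 hashing, and the 12-character truncation are illustrative choices only.

```python
import hashlib
from typing import Iterable, NamedTuple


class Atom(NamedTuple):
    name: str          # hypothetical atom label, e.g. "post_body"
    dynamic: bool      # True if the atom's content is data-dependent
    content: str = ""  # rendered content; deliberately excluded from the ID


def state_id(atoms: Iterable[Atom]) -> str:
    """Hash the atom signature: atom names and kinds, never their content."""
    signature = sorted((a.name, a.dynamic) for a in atoms)
    digest = hashlib.sha256(repr(signature).encode("utf-8")).hexdigest()
    return digest[:12]


# Two post pages with different data, and one structurally different page.
page_a = [Atom("nav_bar", False), Atom("post_body", True, "First post")]
page_b = [Atom("post_body", True, "Another post"), Atom("nav_bar", False)]
page_c = [Atom("nav_bar", False), Atom("login_form", False)]
```

Here `page_a` and `page_b` map to the same state template because they differ only in dynamic content, while `page_c` receives a different ID because its set of atoms differs.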

[Figure omitted]

The Execution Agent performs online compilation by querying the SMG to solve user tasks. Given a goal, a code-generating LLM (the Planner) produces a Sketch Program in an intermediate representation (IR). This Python-based sketch captures the logical control flow using loops and conditionals while using placeholders for concrete UI interactions and symbolic variables, denoted by an @ prefix, for runtime data.
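
To illustrate the idea of symbolic variables, the sketch below treats a sketch program as text and binds its `@`-prefixed names at runtime. The IR syntax shown in `SKETCH` is invented for illustration, since this summary does not reproduce the paper's actual format; `goto`, `list`, and `do` are hypothetical placeholder calls, not a real API.

```python
import re
from typing import Dict, List

# Hypothetical sketch program: placeholders stand in for UI interactions,
# and @names mark values that are only known at runtime.
SKETCH = '''
for post in goto("forum page for @forum_name").list("posts"):
    if post.author == "@target_user":
        post.do("upvote")
'''

SYMBOL = re.compile(r"@([A-Za-z_][A-Za-z0-9_]*)")


def symbolic_variables(sketch: str) -> List[str]:
    """Return @-prefixed variable names in order of first appearance."""
    seen: List[str] = []
    for name in SYMBOL.findall(sketch):
        if name not in seen:
            seen.append(name)
    return seen


def bind(sketch: str, values: Dict[str, str]) -> str:
    """Substitute runtime values; names without a value are left untouched."""
    return SYMBOL.sub(lambda m: values.get(m.group(1), m.group(0)), sketch)


names = symbolic_variables(SKETCH)
partially_bound = bind(SKETCH, {"forum_name": "books"})
```

Leaving unbound names in place, rather than raising, lets a later stage fill in values that only become available after earlier steps have run.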

The linking phase then resolves these abstract placeholders into concrete execution paths by searching the SMG. The agent employs strategies such as breadth-first search (BFS) or loop-aware linking to find valid sequences of graph edges. Once linked, a compiler expands these operations into a fully specified program composed of UI Nodes for primitive browser actions, Python Nodes for local computation, and Control Flow Nodes to preserve the original nesting structure.
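
The BFS strategy mentioned above can be sketched as a shortest-path search over recorded edges. This is a generic breadth-first search under assumed data shapes, not the paper's linker: the adjacency-list layout, the operation labels, and the state names are hypothetical, and loop-aware linking is not covered.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical SMG fragment: state -> list of (operation label, next state).
Edges = Dict[str, List[Tuple[str, str]]]


def link_path(edges: Edges, start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search for a shortest operation sequence start -> goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, ops = queue.popleft()
        if state == goal:
            return ops
        for op, nxt in edges.get(state, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, ops + [op]))
    return None  # the graph records no path between the two states


edges: Edges = {
    "home": [("click:forums", "forum_list"), ("click:profile", "profile")],
    "forum_list": [("click:forum", "forum_page")],
    "forum_page": [("click:new_post", "submit_form")],
}
path = link_path(edges, "home", "submit_form")
```

Because BFS explores states in order of distance, the first path that reaches the goal uses the fewest operations, which is a reasonable default when every GUI action has similar cost.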

During runtime, if the environment deviates from the SMG, a feedback loop is triggered. An MLLM-based mechanism performs vision-based recovery to identify new interaction points, and these successful recoveries are committed back to the SMG to update the system's memory.
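
The feedback loop can be pictured as a try, repair, commit cycle. The sketch below is an assumption-laden illustration: `run_action`, the selector-keyed `memory` dictionary, and the `execute` and `reground` callbacks are hypothetical, and the vision model is replaced by a stub so the example runs without any external service.

```python
from typing import Callable, Dict, List, Optional

Memory = Dict[str, str]  # hypothetical: action name -> element selector


def run_action(
    action: str,
    memory: Memory,
    execute: Callable[[str], bool],
    reground: Callable[[str], Optional[str]],
) -> bool:
    """Try the remembered selector; on failure, re-ground and commit the fix."""
    selector = memory.get(action)
    if selector is not None and execute(selector):
        return True
    repaired = reground(action)  # stands in for the MLLM vision call
    if repaired is not None and execute(repaired):
        memory[action] = repaired  # commit the successful recovery
        return True
    return False


# Toy environment: the upvote button's selector changed after the crawl.
live_selectors = {"#vote-up-v2"}
memory: Memory = {"upvote": "#vote-up"}
reground_calls: List[str] = []


def fake_reground(action: str) -> Optional[str]:
    reground_calls.append(action)
    return "#vote-up-v2"


ok_first = run_action("upvote", memory, live_selectors.__contains__, fake_reground)
ok_second = run_action("upvote", memory, live_selectors.__contains__, fake_reground)
```

The second call succeeds without invoking the fallback because the first repair was written back to memory, which is the property that keeps later runs cheap.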

Experiment

The study evaluates ACTIONENGINE, a programmatic GUI agent framework, against reactive baseline agents using the WebArena benchmark, specifically focusing on complex, long-horizon tasks within the Reddit domain. By replacing iterative visual reasoning with a single-step planning phase that generates executable Python code based on a structured state-machine memory, the approach demonstrates superior reliability and efficiency. The results show that transitioning from stochastic, step-by-step interactions to deterministic programmatic execution significantly improves task success rates while drastically reducing latency, computational costs, and error accumulation.

The authors compare their programmatic agent, AEngine, against the reactive baseline AOccam across task groups in the WebArena Reddit subset. AEngine achieves a higher overall success rate while reducing average latency, the number of LLM calls per task, and both input and output token usage. The headline figures reported in the abstract are 95% task success with, on average, a single LLM call, against 66% for the strongest vision-only baseline, with cost reduced by 11.8× and end-to-end latency by 2×.

