HyperAIHyperAI

Command Palette

Search for a command to run...

14 hours ago
LLM
Agent

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Abstract

LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task-solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful persistent harness updates from execution evidence; (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. We trace low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at here.

One-sentence Summary

This analysis disentangles harness-updating from harness-benefit in self-evolving LLM agents to reveal that harness-updating yields similar gains across tiers, with Qwen3.5-9B updates yielding gains comparable to those of Claude Opus 4.6, whereas harness-benefit is non-monotonic with mid-tier models benefiting most, suggesting capability budgets should prioritize task-solving agents over evolvers.

Key Contributions

  • This work introduces a controlled capability analysis framework that separates harness-updating from harness-benefit to measure these capabilities separately. The methodology varies agents and evolvers independently to isolate specific capabilities rather than conflating end-to-end gains.
  • Experiments demonstrate that harness-updating capability remains flat across base model tiers, producing similar gains regardless of underlying model size. Results show that updates from smaller models like Qwen3.5-9B yield performance improvements comparable to those from stronger models like Claude Opus 4.6.
  • The analysis reveals that harness-benefit capability follows a non-monotonic pattern where mid-tier models benefit most while weak-tier models fail to activate or follow harness artifacts faithfully. These findings suggest prioritizing the task-solving agent over the evolver and targeting harness invocation and long-horizon instruction following in agent training.

Introduction

Large language model agents increasingly rely on editable external harnesses such as prompts, tools, and memory to shape behavior without updating model weights. While harness self-evolution allows these systems to adapt by refining artifacts based on execution evidence, prior evaluations often conflate the agent's base performance with the quality of updates and the agent's ability to utilize them. The authors disentangle these factors by distinguishing between harness-updating, the ability to produce useful artifacts, and harness-benefit, the ability to leverage them during task solving. Their analysis reveals that updating capability remains consistent across model tiers while benefit capability peaks at mid-tier levels, suggesting developers should prioritize investing in the task-solving agent rather than the evolver.

Dataset

  • Dataset Composition: The authors evaluate on three benchmarks: SWE-bench Verified for software engineering, MCP-Atlas for tool use over real servers, and SkillsBench for skill-based execution.
  • Subset Details: SWE-bench Verified contains 500 human-validated tasks from 12 Python repositories where patches must pass hidden tests. MCP-Atlas includes 500 tasks requiring orchestration across 36 servers using a claims-based scoring rubric. SkillsBench offers 86 tasks across 11 domains with deterministic verifiers.
  • Evaluation Protocol: The team employs an in-situ setting where tasks are scored under the previous harness before evidence updates the current one. Pass rate is the primary metric aggregated over the task stream.
  • Processing and Constraints: Harness editing is benchmark-specific. SWE-bench and SkillsBench restrict edits to the skills directory while MCP-Atlas allows changes to prompts and memory files. All models use fixed prompt templates and evolution budgets to maintain fair comparison.

Method

The authors leverage a harness self-evolution protocol designed to adapt an LLM agent by updating the external harness surrounding a fixed model during task execution. In this framework, the agent attempts a stream of tasks, and the harness is iteratively updated based on the agent's execution evidence. This approach distinguishes between the parametric model backbone and the non-parametric context, allowing for continuous improvement without retraining the underlying LLM.

Refer to the framework diagram to visualize the overall architecture. The system consists of a Frozen LLM connected to a dynamic Harness containing Memory, Tools, Prompts, and Skills. The process operates as a closed loop: execution experiences are gathered and subjected to Diagnosis, which feeds into an Evolver Model. The Evolver Model then generates updates that modify the Harness components, such as injecting new skills or refining prompts, to better handle future tasks.

Formally, at evolution step ttt, the agent is defined as At=(f,Ht)A_t = (f, H_t)At=(f,Ht), where fff represents the fixed model backbone and HtH_tHt denotes the harness state. The system adheres to a protocol where fff remains constant while editable components of HtH_tHt (e.g., prompts, skills, memories) are updated. The evolver eee functions as the update procedure that converts execution evidence into harness modifications. Given the previous harness Ht1H_{t-1}Ht1 and accumulated execution evidence Dt\mathcal{D}_tDt, the evolver proposes an update ΔHt\Delta H_tΔHt and applies it to obtain the next harness state:

ΔHt=e(Ht1,Dt),Ht=Apply(Ht1,ΔHt).\begin{array} { r } { \Delta H _ { t } = e ( H _ { t - 1 } , \mathcal { D } _ { t } ) , } \\ { H _ { t } = \mathrm { A p p l y } ( H _ { t - 1 } , \Delta H _ { t } ) . } \end{array}ΔHt=e(Ht1,Dt),Ht=Apply(Ht1,ΔHt).

The evolution protocol follows an iterative loop over TTT steps. At each step, the agent At1A_{t-1}At1 attempts to solve a batch of tasks Xt\mathcal{X}_tXt, outputting execution trajectories τt,x\tau_{t,x}τt,x and final outputs yt,xy_{t,x}yt,x. The evidence Dt\mathcal{D}_tDt is collected from these interactions, and the evolver produces the updated harness HtH_tHt for the subsequent step.

As shown in the figure below, the practical application of this method involves the evolver diagnosing failures and injecting specific procedural skills into the harness. The comparison illustrates a scenario where an agent without an evolver fails to complete a Flink job task due to missing steps. In contrast, agents utilizing a Qwen3.5-9B or Opus4.6 Evolver successfully pass the task. This success is attributed to the evolver auto-injecting a flink-query skill into the loaded skills list. The figure highlights that while different evolver models may produce procedurally isomorphic recipes with similar logic, the specific details and phrasing of the injected skills differ, yet both lead to a successful outcome where the base agent would have failed.

Experiment

This study evaluates harness self-evolution across seven LLMs and three agentic benchmarks by decoupling harness-updating, where models generate improvements from execution evidence, from harness-benefit, where models utilize those updates during task solving. Experiments reveal that harness-updating capability is independent of base model strength, while harness-benefit follows a non-monotonic pattern where mid-tier models gain the most and weaker models struggle due to failures in activating artifacts or adhering to guidance. These findings suggest that system design should prioritize investing capability in the task-solving agent and training it to reliably invoke and follow external harness instructions.

The authors evaluate harness-updating capability by pairing various evolver models with fixed task-solving agents across three benchmarks. Results indicate that the ability to generate useful harness updates is relatively consistent across different model tiers, with no single evolver dominating all tasks. Furthermore, the final system performance is driven more by the base capability of the task-solving agent than by the specific evolver model used. Harness-updating gains remain stable across evolver capability tiers, showing that smaller models can produce updates comparable to larger ones. No evolver model consistently outperforms others across all benchmarks, indicating a reshuffling of effectiveness depending on the task domain. The base capability of the task-solving agent is the dominant factor in post-evolution performance, while the evolver identity contributes less variance.

The authors analyze per-phase adherence scores to diagnose why weak-tier models derive low benefit from harness updates. The results show that while strong models maintain stable adherence throughout task execution, weaker models suffer from significant drift as the trajectory unfolds. This indicates a long-horizon instruction-following bottleneck where adherence decays much more steeply for weaker models compared to their stronger counterparts. Strong models maintain the highest adherence scores across all trajectory phases. Weak models exhibit the largest negative drift in adherence from loading to the final turn. Mid-tier models show moderate adherence decay, falling between weak and strong model performance.

The authors analyze agent-side capabilities in harness self-evolution by measuring activation, adherence, and success rates across models of varying capabilities. The data reveals that weaker models suffer from both low skill-load rates and poor adherence to loaded instructions, whereas stronger models consistently activate and follow harness artifacts. Notably, mid-capability models show high activation rates but struggle with adherence, indicating that loading a skill does not guarantee effective utilization. Weak-tier models exhibit significantly lower skill-load rates compared to strong-tier models, failing to activate harness artifacts in the majority of trajectories. Strong-tier models demonstrate superior adherence to loaded skills, maintaining high following rates while weaker models often deviate from instructions. A disparity exists between activation and adherence for mid-tier models, where high skill-load rates do not necessarily translate to high instruction-following performance.

The authors analyze harness-benefit capability across seven LLMs and three benchmarks to determine which models benefit most from updated agent harnesses. Results indicate that the improvement from harness evolution is non-monotonic with respect to base model capability, where mid-tier models show the largest gains while both weaker and stronger models benefit less. This pattern suggests that weak models struggle to activate or adhere to harness instructions, whereas strong models face diminishing returns due to performance ceilings. Mid-tier models demonstrate the highest improvement from harness evolution compared to weaker or stronger counterparts. Stronger models experience limited gains as they approach performance ceilings on the benchmarks. Weaker models fail to leverage harness updates effectively due to difficulties in activating and following procedural guidance.

The authors analyze extreme pairings where the strongest task-solving agent uses its worst evolver and the weakest agent uses its best evolver. Results indicate that the strong agent consistently outperforms the weak agent by a significant margin across all benchmarks, demonstrating that the agent's base capability is the primary determinant of post-evolution performance. Strong agents maintain a substantial performance lead over weak agents even when paired with their least effective evolvers. The performance gap favors the strong agent across all tested benchmarks, highlighting the dominance of agent capability over evolver quality. Results suggest that optimizing the task-solving agent yields greater returns than optimizing the model responsible for generating harness updates.

The authors evaluate harness-updating capabilities by pairing various evolver models with fixed task-solving agents across multiple benchmarks. Findings reveal that the base capability of the task-solving agent is the primary determinant of performance, outweighing the specific evolver model used. Although mid-tier models benefit most from harness evolution, weaker models struggle with instruction adherence while stronger models face diminishing returns, indicating that optimizing the agent yields greater returns than refining the evolver.


Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing

HyperAI Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp