HyperAI

NVIDIA has released a significant update to the Dynamo inference engine to better support complex multi-turn agentic workflows. The update focuses on correcting parser logic, improving streaming performance, and ensuring full compatibility with leading agent frameworks such as Claude Code and OpenClaw. The primary goal is to let custom serving stacks replicate the nuanced behavior of native agent environments, so that reasoning steps and tool calls are handled with precision.

A major performance gain comes from fixing KV-cache inefficiencies caused by unstable prompt prefixes. Previously, session-specific billing headers included in requests prevented the caching of stable prompt instructions, leading to a roughly fivefold increase in time to first token for new sessions. By adding a flag that strips these unstable headers before tokenization, NVIDIA restored prompt stability. Benchmarks on NVIDIA B200 hardware show that this change reduced time to first token from 912 ms to approximately 169 ms, turning what was a cold prefill into a cache hit.

The update also addresses the complexity of reasoning and tool-call parsing. In agentic workflows, reasoning spans must remain attached across turns to the specific tool calls they explain. Older parsing methods often grouped all reasoning separately or dropped it aggressively, breaking the logical flow required for future turns. NVIDIA Dynamo now supports an interleaved format in which reasoning and tool calls are preserved as distinct, ordered segments, so that subsequent model turns receive the context they need to follow the earlier logic.

Streaming has also been reworked to improve responsiveness. Previously, tool calls were buffered until the end of a response, delaying execution.
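The prefix-stability fix described above can be illustrated with a small sketch. It is not Dynamo's implementation: the header name `x-billing-session`, the helper names, and the whitespace "tokenizer" are all assumptions made for illustration, since the article does not name the actual flag or header format. The point is only that a per-session line at the top of the prompt makes two otherwise identical prompts diverge almost immediately, so a prefix cache has little to reuse.

```python
import re

# Hypothetical header format; the real header and flag are not given in the article.
BILLING_HEADER = re.compile(r"^x-billing-session: .*\n", re.MULTILINE)

SYSTEM = "You are a coding agent. Use tools carefully."


def build_prompt(session_id: str, user_msg: str) -> str:
    """Build a prompt whose first line changes with every session."""
    return f"x-billing-session: {session_id}\n{SYSTEM}\nUser: {user_msg}"


def strip_unstable_headers(prompt: str) -> str:
    """Remove the session-specific line before tokenization."""
    return BILLING_HEADER.sub("", prompt)


def toy_tokenize(text: str) -> list:
    """Stand-in for a real tokenizer: split on whitespace."""
    return text.split()


def shared_prefix_len(a: list, b: list) -> int:
    """Count leading tokens two sequences have in common (what a prefix cache can reuse)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


a = build_prompt("sess-aaa", "list files")
b = build_prompt("sess-bbb", "run tests")

raw_shared = shared_prefix_len(toy_tokenize(a), toy_tokenize(b))
clean_shared = shared_prefix_len(
    toy_tokenize(strip_unstable_headers(a)),
    toy_tokenize(strip_unstable_headers(b)),
)
print(raw_shared, clean_shared)
```

With the header in place, the two prompts share only the header key itself; once it is stripped, the whole system instruction becomes a common prefix that a new session could hit in cache.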
The new system streams reasoning tokens immediately and uses a typed side-channel event, known as tool call dispatch, to signal tool readiness as soon as the model decides to call a function. This allows harnesses to execute tools in real time without waiting for the full response stream to finish, which noticeably improves the perceived speed of the application.

To achieve this, NVIDIA has separated reasoning and tool parsing into dedicated components. A single parser now owns reasoning interpretation, preventing conflicting layers from misinterpreting token boundaries. The system also respects model-specific chat templates, so that reasoning history is either preserved or truncated according to the specific model's policy and the nature of the current turn. This prevents the accidental loss of critical context in multi-step agent tasks.

Compatibility with Claude Code and OpenClaw has been refined through the Anthropic Messages API. Fixes ensure that input token counts are reported correctly before a stream begins and that model details are served accurately during initialization. NVIDIA also addressed Codex fidelity issues on the Responses API. A critical finding was that model catalog metadata heavily influences agent behavior, including tool-output truncation limits and reasoning settings. When a custom endpoint lacked the correct catalog profile, agents defaulted to generic behaviors that reduced tool usage and altered system prompts. Aligning custom endpoints with the proper model catalog restored parity with native models, as demonstrated in SWE-Bench tests where tool call frequency matched the reference implementation.

Looking ahead, NVIDIA Dynamo is introducing new harness hints, such as latency sensitivity and priority flags, to help systems manage turn types more effectively. The protocol, parser, and tokenizer layers are being released as standalone, reusable crates.
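The early-dispatch streaming behavior described above can be sketched as a generator that yields typed events. This is a simplified illustration under stated assumptions, not Dynamo's parser: the `<tool_call>` markers, the event class names, and the assumption that each marker arrives as its own chunk are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class ReasoningDelta:
    text: str


@dataclass
class ToolCallDispatch:
    name: str
    arguments: dict


Event = Union[ReasoningDelta, ToolCallDispatch]

# Hypothetical delimiters; real models use their own template-specific markers.
TOOL_OPEN, TOOL_CLOSE = "<tool_call>", "</tool_call>"


def stream_events(chunks: Iterable[str]) -> Iterator[Event]:
    """Yield reasoning text as it arrives and a dispatch event as soon as a call is complete."""
    in_tool = False
    buf = []
    for chunk in chunks:
        if chunk == TOOL_OPEN:
            in_tool = True
            buf = []
        elif chunk == TOOL_CLOSE:
            call = json.loads("".join(buf))
            in_tool = False
            # Emitted immediately, without waiting for the rest of the stream.
            yield ToolCallDispatch(call["name"], call["arguments"])
        elif in_tool:
            buf.append(chunk)  # only the call payload is buffered
        else:
            yield ReasoningDelta(chunk)


chunks = [
    "I should ", "list the files.",
    "<tool_call>", '{"name": "ls", ', '"arguments": {"path": "."}}', "</tool_call>",
    "Then read ", "the README.",
]
events = list(stream_events(chunks))
```

Because the generator is lazy, a harness iterating over it receives the `ToolCallDispatch` event before the trailing reasoning chunks have been read, which is the property that lets tool execution start while the response is still streaming.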
This modular approach allows developers to build custom agent stacks without duplicating internal engine code, facilitating the deployment of efficient, long-running agentic systems on diverse infrastructure.
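To make the interleaved reasoning format discussed earlier more concrete, here is a minimal sketch contrasting it with the older grouped and dropped styles. The segment types and function names are hypothetical and chosen for illustration; they are not taken from Dynamo's crates.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Reasoning:
    text: str


@dataclass
class ToolCall:
    name: str
    arguments: dict


Segment = Union[Reasoning, ToolCall]


def grouped(segments: List[Segment]) -> List[Segment]:
    """Older style: all reasoning first, then all tool calls, so the pairing is lost."""
    reasoning = [s for s in segments if isinstance(s, Reasoning)]
    calls = [s for s in segments if isinstance(s, ToolCall)]
    return reasoning + calls


def drop_reasoning(segments: List[Segment]) -> List[Segment]:
    """Aggressive truncation: keep only the tool calls."""
    return [s for s in segments if isinstance(s, ToolCall)]


# Interleaved turn: each reasoning span sits directly before the call it explains.
turn = [
    Reasoning("Need the file list first."),
    ToolCall("ls", {"path": "."}),
    Reasoning("README looks relevant."),
    ToolCall("cat", {"path": "README.md"}),
]

for seg in turn:
    print(type(seg).__name__, seg)
```

In the interleaved list, the explanation for `cat` immediately precedes it; after `grouped`, both reasoning spans precede both calls, and after `drop_reasoning` the explanations are gone entirely. Which of these a later turn should see depends, as the article notes, on the model's chat template and policy.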

NVIDIA Dynamo adds multi-turn agentic support
