a month ago

Table of Contents

Abstract

Large Language Models (LLMs) have demonstrated remarkable potential as autonomous agents, approaching human-expert performance through advanced reasoning and tool orchestration. However, decision-making in fully dynamic and live environments remains highly challenging, requiring real-time information integration and adaptive responses. While existing efforts have explored live evaluation mechanisms in structured tasks, a critical gap remains in systematic benchmarking for real-world applications, particularly in finance where stringent requirements exist for live strategic responsiveness. To address this gap, we introduce AI-Trader, the first fully-automated, live, and data-uncontaminated evaluation benchmark for LLM agents in financial decision-making. AI-Trader spans three major financial markets: U.S. stocks, A-shares, and cryptocurrencies, with multiple trading granularities to simulate live financial environments. Our benchmark implements a revolutionary fully autonomous minimal information paradigm where agents receive only essential context and must independently search, verify, and synthesize live market information without human intervention. We evaluate six mainstream LLMs across three markets and multiple trading frequencies. Our analysis reveals striking findings: general intelligence does not automatically translate to effective trading capability, with most agents exhibiting poor returns and weak risk management. We demonstrate that risk control capability determines cross-market robustness, and that AI trading strategies achieve excess returns more readily in highly liquid markets than policy-driven environments. These findings expose critical limitations in current autonomous agents and provide clear directions for future improvements.

One-sentence Summary

The authors from the University of Hong Kong propose AI-Trader, a fully automated, live benchmark for evaluating LLM agents in financial decision-making across U.S. stocks, A-shares, and cryptocurrencies, introducing a minimal-information paradigm that forces autonomous real-time data synthesis; their evaluation reveals that general intelligence does not ensure trading success, with risk control emerging as key to cross-market robustness, especially in liquid markets, and underscores the need for improved agent design in dynamic financial environments.

Key Contributions

We introduce AI-Trader, the first fully autonomous, live, and data-uncontaminated benchmark for evaluating LLM agents in real-world financial decision-making across three major markets: U.S. stocks, A-shares, and cryptocurrencies, with multiple trading granularities to simulate dynamic, real-time market conditions.
The benchmark enforces a minimal information paradigm where agents receive only essential context and must independently search, verify, and synthesize live market data through autonomous tool use, eliminating human intervention and enabling rigorous assessment of real-time reasoning and adaptability.
Evaluation of six mainstream LLMs reveals that general intelligence does not ensure trading effectiveness, with most agents showing poor returns and weak risk management, while risk control emerges as a key determinant of cross-market robustness, and excess returns are more achievable in highly liquid markets.

Introduction

Financial markets present a high-stakes, real-time environment where autonomous agents must integrate live information, reason under uncertainty, and make time-critical decisions—challenges that static benchmarks fail to capture. Prior evaluation frameworks often rely on fixed data, pre-defined workflows, or human-in-the-loop interventions, creating a disconnect between lab performance and real-world capability. These limitations hinder meaningful assessment of true autonomous decision-making, especially in dynamic domains like trading.

The authors introduce AI-Trader, the first fully autonomous, live, and data-uncontaminated benchmark for evaluating LLM agents across U.S. stocks, A-shares, and cryptocurrencies. Agents operate with minimal context—only current holdings, real-time prices, and access to tools—requiring them to independently search, verify, and synthesize live market data without any human guidance. This minimal information paradigm enforces rigorous demonstration of long-horizon reasoning, information retrieval, and adaptive strategy execution.

Their evaluation of six mainstream LLMs reveals that general intelligence does not imply trading competence: most agents deliver poor returns and exhibit weak risk management, with performance heavily dependent on market liquidity and structure. The study underscores that risk control is key to cross-market robustness and highlights the need for improved autonomous planning and adaptation in LLM agents. The open-sourced framework enables reproducible, high-fidelity assessment of agent capabilities in real financial environments.

Dataset

The dataset spans three distinct financial markets: the U.S. stock market, China's A-share market, and the cryptocurrency market, enabling evaluation of agent generalization across different regulatory environments, investor behaviors, and market dynamics.
Trading frequencies are supported at both hourly and daily intervals to capture diverse market behaviors and test agent responsiveness across time horizons.
For the U.S. stock market, the dataset includes all 100 constituents of the Nasdaq-100 Index, representing large-cap, non-financial firms in technology, semiconductors, biotech, internet services, and consumer discretionary sectors. Key companies include Apple, Microsoft, NVIDIA, Amazon, Alphabet, Tesla, and Meta.
The U.S. portfolio includes a cash asset with zero return to allow agents to practice risk-free capital allocation and demonstrate full portfolio management skills.
The A-share market subset comprises the 50 stocks in the SSE-50 Index, selected from the Shanghai Stock Exchange. These are major blue-chip firms across finance, consumer goods, industrials, IT, energy, and healthcare, including Ping An Insurance, Kweichow Moutai, and China Merchants Bank.
Both market environments are designed to reflect real-world conditions: the U.S. market emphasizes sensitivity to macroeconomic factors and technological innovation, while the A-share market highlights macro-driven volatility, sector rotation, and non-stationary behavior.
The authors use the data in a training and evaluation framework where agents are trained on a mixture of market subsets with dynamically adjusted ratios to simulate cross-market adaptability.
Data is processed to ensure consistent time alignment across markets, with no external data augmentation.
No cropping is applied; full historical price and volume data are used for each asset.
Metadata includes asset identifiers, sector classifications, market capitalization tiers, and index membership to support structured analysis and model interpretation.

Method

The authors leverage a modular architecture for AI-Trader, designed to support autonomous, adaptive trading agents within a unified and extensible framework. The system operates as a closed-loop process, where agents continuously observe market conditions, reason about potential actions, and execute trades, all while interacting with a live environment through a standardized toolkit. This architecture is structured around three primary components: the agent's observation and action spaces, the toolkit of available tools, and the live environment that simulates real-world trading constraints.

As shown in the figure below, the agent's observation space encompasses key market data such as current asset prices $\mathbf{p}$ and the agent's portfolio holdings $\mathbf{s}$ , forming the initial observation $o_0$ . This base information is augmented dynamically through tool invocations, which provide additional data such as detailed stock indicators $\pi_i$ and general market news $i$ , resulting in a comprehensive observation $o_t$ at each time step. The agent's reasoning process, guided by the ReAct paradigm, operates autonomously, generating natural language traces that articulate its decision-making logic. These traces are recorded to ensure transparency and reproducibility, enabling researchers to analyze the agent's behavior in complex financial contexts.

The action space is constrained to three discrete actions per asset: buy, sell, or hold. The agent's policy function $a_t = f(o_t, r_t)$ maps the current observation $o_t$ and reasoning $r_t$ to an executable action, ensuring decisions are both autonomous and compliant with real-world constraints such as liquidity and regulatory rules. If an action would violate these constraints, the system triggers a self-correction mechanism, requiring the agent to re-evaluate and generate a new, feasible decision.

The toolkit, which includes tools for checking prices, performing web searches, retrieving stock news, executing trades, and performing calculations, is designed to be modular and extensible. Each tool is built upon the Model Context Protocol (MCP), enabling seamless integration and adaptation across different asset classes and trading frequencies. The Trade tool, for instance, enforces market-specific rules such as lot-size requirements and updates portfolio holdings and cash balances in real time, ensuring accurate and auditable trade execution. The interaction between the agent and the live environment is bidirectional: the agent receives updated observations from the environment after each action, and the environment is updated with the agent's decisions, maintaining the integrity of the closed-loop system.

Experiment

Evaluated AI trading agents across U.S. equities, A-shares, and cryptocurrencies using daily (A-shares, crypto) and hourly (U.S. equities) strategies, assessing performance on cumulative return, Sortino ratio, volatility, and maximum drawdown.
MiniMax-M2 achieved the highest cumulative return of 9.56% in the U.S. market (vs. QQQ’s 1.87%) with a Sortino ratio of 4.42 and maximum drawdown of -4.92%, demonstrating superior risk control and robustness across markets.
DeepSeek-v3.1 outperformed the CD5 Index in the crypto market (-12.18% vs. -14.30%) by maintaining a high cash position (up to 41%) and executing strategic buy-the-dip trades, highlighting its adaptability in high-volatility environments.
GPT-5, Qwen3-Max, and Gemini-2.5-Flash underperformed across all markets, with GPT-5 recording a cumulative return of 1.56% in the U.S. market and -16.41% in crypto, underscoring the gap between general language capabilities and effective trading.
Model generalization is limited: DeepSeek-v3.1 excelled in the U.S. market (8.39% CR) but failed in A-shares (-1.23% CR), while MiniMax-M2 maintained consistent performance across all markets, indicating context-aware adaptability is critical.
Case studies revealed that agents can emulate human-like behavior—DeepSeek-v3.1 successfully avoided a major U.S. market crash by diversifying into defensive sectors and increasing cash, but later underperformed due to unverified news, exposing weaknesses in information verification.

The authors use a multi-market, multi-frequency experimental design to evaluate AI trading agents across U.S. equities, A-shares, and cryptocurrencies, with performance measured using cumulative return, Sortino ratio, volatility, and maximum drawdown. Results show that MiniMax-M2 achieves the highest cumulative return and Sortino ratio in the U.S. market and remains the only consistently profitable agent in the A-share market, while DeepSeek-v3.1 outperforms the baseline in cryptocurrencies due to effective cash management and strategic trading during downturns.

The authors use the table to compare the trading behavior of different AI models across three markets, showing that MiniMax-M2 executes trades less frequently than most other models, with the lowest proportion of no-trade executions in the U.S. and A-share markets. DeepSeek-v3.1 and Claude-3.7-Sonnet exhibit higher average trade counts in the U.S. and A-share markets, indicating more active trading strategies, while Gemini-2.5-Flash shows the highest average trades in cryptocurrencies, reflecting its more aggressive approach in that volatile environment.

Source PDF View Code

Table of Contents

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

a month ago

Tianyu Fan Yuhao Yang Yangqin Jiang Yifei Zhang Yuxuan Chen Chao Huang

Table of Contents

Abstract

One-sentence Summary

Key Contributions

We introduce AI-Trader, the first fully autonomous, live, and data-uncontaminated benchmark for evaluating LLM agents in real-world financial decision-making across three major markets: U.S. stocks, A-shares, and cryptocurrencies, with multiple trading granularities to simulate dynamic, real-time market conditions.
The benchmark enforces a minimal information paradigm where agents receive only essential context and must independently search, verify, and synthesize live market data through autonomous tool use, eliminating human intervention and enabling rigorous assessment of real-time reasoning and adaptability.
Evaluation of six mainstream LLMs reveals that general intelligence does not ensure trading effectiveness, with most agents showing poor returns and weak risk management, while risk control emerges as a key determinant of cross-market robustness, and excess returns are more achievable in highly liquid markets.

Introduction

Dataset

The dataset spans three distinct financial markets: the U.S. stock market, China's A-share market, and the cryptocurrency market, enabling evaluation of agent generalization across different regulatory environments, investor behaviors, and market dynamics.
Trading frequencies are supported at both hourly and daily intervals to capture diverse market behaviors and test agent responsiveness across time horizons.
For the U.S. stock market, the dataset includes all 100 constituents of the Nasdaq-100 Index, representing large-cap, non-financial firms in technology, semiconductors, biotech, internet services, and consumer discretionary sectors. Key companies include Apple, Microsoft, NVIDIA, Amazon, Alphabet, Tesla, and Meta.
The U.S. portfolio includes a cash asset with zero return to allow agents to practice risk-free capital allocation and demonstrate full portfolio management skills.
The A-share market subset comprises the 50 stocks in the SSE-50 Index, selected from the Shanghai Stock Exchange. These are major blue-chip firms across finance, consumer goods, industrials, IT, energy, and healthcare, including Ping An Insurance, Kweichow Moutai, and China Merchants Bank.
Both market environments are designed to reflect real-world conditions: the U.S. market emphasizes sensitivity to macroeconomic factors and technological innovation, while the A-share market highlights macro-driven volatility, sector rotation, and non-stationary behavior.
The authors use the data in a training and evaluation framework where agents are trained on a mixture of market subsets with dynamically adjusted ratios to simulate cross-market adaptability.
Data is processed to ensure consistent time alignment across markets, with no external data augmentation.
No cropping is applied; full historical price and volume data are used for each asset.
Metadata includes asset identifiers, sector classifications, market capitalization tiers, and index membership to support structured analysis and model interpretation.

Method

Experiment

Evaluated AI trading agents across U.S. equities, A-shares, and cryptocurrencies using daily (A-shares, crypto) and hourly (U.S. equities) strategies, assessing performance on cumulative return, Sortino ratio, volatility, and maximum drawdown.
MiniMax-M2 achieved the highest cumulative return of 9.56% in the U.S. market (vs. QQQ’s 1.87%) with a Sortino ratio of 4.42 and maximum drawdown of -4.92%, demonstrating superior risk control and robustness across markets.
DeepSeek-v3.1 outperformed the CD5 Index in the crypto market (-12.18% vs. -14.30%) by maintaining a high cash position (up to 41%) and executing strategic buy-the-dip trades, highlighting its adaptability in high-volatility environments.
GPT-5, Qwen3-Max, and Gemini-2.5-Flash underperformed across all markets, with GPT-5 recording a cumulative return of 1.56% in the U.S. market and -16.41% in crypto, underscoring the gap between general language capabilities and effective trading.
Model generalization is limited: DeepSeek-v3.1 excelled in the U.S. market (8.39% CR) but failed in A-shares (-1.23% CR), while MiniMax-M2 maintained consistent performance across all markets, indicating context-aware adaptability is critical.
Case studies revealed that agents can emulate human-like behavior—DeepSeek-v3.1 successfully avoided a major U.S. market crash by diversifying into defensive sectors and increasing cash, but later underperformed due to unverified news, exposing weaknesses in information verification.

Source PDF View Code

Table of Contents

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Command Palette

AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets

Tianyu Fan Yuhao Yang Yangqin Jiang Yifei Zhang Yuxuan Chen Chao Huang

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Build AI with AI

HyperAI Newsletters

Command Palette

AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets

Tianyu Fan Yuhao Yang Yangqin Jiang Yifei Zhang Yuxuan Chen Chao Huang

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Build AI with AI

HyperAI Newsletters

Command Palette

AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets

Tianyu Fan Yuhao Yang Yangqin Jiang Yifei Zhang Yuxuan Chen Chao Huang

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Build AI with AI

HyperAI Newsletters