5 months ago

Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

Shulin Tian Ruiqi Wang Hongming Guo Penghao Wu Yuhao Dong Xiuying Wang Jingkang Yang Hao Zhang Hongyuan Zhu Ziwei Liu

Abstract

We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions tackling such tasks as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model using CoTT data and RL to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.
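
The one-tool-per-step loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `run_cott`, the `Step` record, the tool names `temporal_retrieval` and `visual_qa`, and the scripted `toy_policy` (a stand-in for the RL-trained agent) are all hypothetical, and the tools return canned strings instead of processing video.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# All names below are hypothetical illustrations, not the Ego-R1 API.
Tool = Callable[[str], str]


@dataclass
class Step:
    thought: str      # the agent's reasoning for this step
    tool: str         # the single tool invoked at this step
    query: str        # argument passed to the tool
    observation: str  # what the tool returned


# A policy maps (question, trace so far) to (thought, tool name, argument).
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def run_cott(question: str, policy: Policy, tools: Dict[str, Tool],
             max_steps: int = 5) -> Tuple[Optional[str], List[Step]]:
    """Call one tool per step until the policy emits an answer."""
    trace: List[Step] = []
    for _ in range(max_steps):
        thought, tool_name, arg = policy(question, trace)
        if tool_name == "answer":  # special action that ends the episode
            return arg, trace
        observation = tools[tool_name](arg)
        trace.append(Step(thought, tool_name, arg, observation))
    return None, trace  # step budget exhausted without an answer


def toy_policy(question: str, trace: List[Step]) -> Tuple[str, str, str]:
    """Scripted stand-in for a trained agent: retrieve, inspect, answer."""
    if len(trace) == 0:
        return ("Locate the relevant time span first.",
                "temporal_retrieval", question)
    if len(trace) == 1:
        return ("Inspect the retrieved clip.",
                "visual_qa", trace[-1].observation)
    return ("Enough evidence gathered.", "answer", trace[-1].observation)


# Stub tools with canned outputs (assumptions, for illustration only).
toy_tools: Dict[str, Tool] = {
    "temporal_retrieval": lambda q: "day 3, 14:00-14:05",
    "visual_qa": lambda clip: f"a red mug is visible in {clip}",
}

answer, trace = run_cott("Where did I leave my mug?", toy_policy, toy_tools)
print(answer)
for step in trace:
    print(step.tool, "->", step.observation)
```

In a real system the scripted policy would be replaced by a language model that generates the thought and tool call at each step, and the trace of earlier observations would be fed back into its context; the loop structure is the only part this sketch is meant to convey.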
