HyperAIHyperAI

Command Palette

Search for a command to run...

3 months ago

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Lin Long Yichen He Wentao Ye Yiyuan Pan Yuan Lin Hang Li Junbo Zhao Wei Li

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with
  Long-Term Memory

Abstract

We introduce M3-Agent, a novel multimodal agent framework equipped withlong-term memory. Like humans, M3-Agent can process real-time visual andauditory inputs to build and update its long-term memory. Beyond episodicmemory, it also develops semantic memory, enabling it to accumulate worldknowledge over time. Its memory is organized in an entity-centric, multimodalformat, allowing deeper and more consistent understanding of the environment.Given an instruction, M3-Agent autonomously performs multi-turn, iterativereasoning and retrieves relevant information from memory to accomplish thetask. To evaluate memory effectiveness and memory-based reasoning in multimodalagents, we develop M3-Bench, a new long-video question answering benchmark.M3-Bench comprises 100 newly recorded real-world videos captured from a robot'sperspective (M3-Bench-robot) and 929 web-sourced videos across diversescenarios (M3-Bench-web). We annotate question-answer pairs designed to testkey capabilities essential for agent applications, such as human understanding,general knowledge extraction, and cross-modal reasoning. Experimental resultsshow that M3-Agent, trained via reinforcement learning, outperforms thestrongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o,achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-weband VideoMME-long, respectively. Our work advances the multimodal agents towardmore human-like long-term memory and provides insights into their practicaldesign. Model, code and data are available athttps://github.com/bytedance-seed/m3-agent

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing
Get Started

Hyper Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp