5 months ago

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren; Linli Yao; Shicheng Li; Xu Sun; Lu Hou

Abstract

This work proposes TimeChat, a time-sensitive multimodal large language model designed specifically for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying length to accommodate videos of various durations. Additionally, we construct an instruction-tuning dataset, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat's instruction-following performance. Experimental results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example, compared to state-of-the-art video large language models, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA. These results suggest that TimeChat has the potential to serve as a versatile video assistant for long-form video comprehension tasks and to satisfy realistic user requirements.
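To make the temporal grounding metric quoted above more concrete, the following is a minimal sketch of how R@1 at an IoU threshold of 0.5 is commonly computed. It is not taken from the TimeChat code base. The function names, the `(start, end)` tuple representation in seconds, and the use of `>=` at the threshold are assumptions made for illustration; evaluation scripts for specific benchmarks may differ in these details.

```python
from typing import List, Tuple

# Hypothetical representation: a segment is (start_sec, end_sec).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    # Overlap is zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    preds: List[Segment], gts: List[Segment], iou_threshold: float = 0.5
) -> float:
    """Percentage of queries whose top-1 predicted segment reaches the IoU threshold.

    Assumption: a prediction counts as a hit when IoU >= threshold.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


# Toy data (invented for illustration, not results from the paper).
example_preds = [(10.0, 20.0), (0.0, 5.0), (30.0, 40.0)]
example_gts = [(12.0, 22.0), (10.0, 15.0), (30.0, 40.0)]
example_score = recall_at_1(example_preds, example_gts)
```

In the toy data, the first and third predictions overlap their ground-truth segments with IoU of about 0.67 and 1.0, while the second does not overlap at all, so two of three queries count as hits. Under this reading, a reported gain of +27.5 R@1 (IoU=0.5) means 27.5 more percentage points of queries are localized with at least half overlap.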

Code Repositories

renshuhuai-andy/timechat (Official; PyTorch; mentioned in GitHub)
lntzm/cvpr24track-longvideo (PyTorch; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
video-question-answering-on-mvbench | TimeChat | Avg.: 38.5
video-text-retrieval-on-test-of-time | Time-Chat | 2-Class Accuracy: 76.67
zero-shot-video-question-answer-on-egoschema-1 | TimeChat (7B) | Accuracy: 33.0
