Han Wang; Yanjie Wang; Yongjie Ye; Yuxiang Nie; Can Huang

Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application to video-related tasks, such as object tracking, remains understudied. This lack of exploration is primarily due to two key challenges. First, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Second, processing a large number of frames within the context window of Large Language Models (LLMs) can impose a significant computational burden. To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset that supports three tasks: Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG). ElysiumTrack-1M contains 1.27 million annotated video frames with corresponding object boxes and descriptions. Leveraging this dataset, we train MLLMs and propose a token-compression model, T-Selector, to tackle the second challenge. Our proposed approach, Elysium: Exploring Object-level Perception in Videos via MLLM, is an end-to-end trainable MLLM that attempts to conduct object-level tasks in videos without requiring any additional plug-in or expert models. All code and datasets are available at https://github.com/Hon-Wong/Elysium.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| zero-shot-single-object-tracking-on-lasot | Elysium | AUC: 56.1; Normalized Precision: 61.0; Precision: 50.1 |
| zeroshot-video-question-answer-on-activitynet | Elysium | Accuracy: 43.4; Confidence Score: 2.9 |
| zeroshot-video-question-answer-on-msrvtt-qa | Elysium | Accuracy: 67.5; Confidence Score: 3.2 |
| zeroshot-video-question-answer-on-msvd-qa | Elysium | Accuracy: 75.8; Confidence Score: 3.7 |
| zeroshot-video-question-answer-on-tgif-qa | Elysium | Accuracy: 66.6; Confidence Score: 3.6 |
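
For readers unfamiliar with the tracking metric in the first row, the AUC reported for single object tracking is conventionally the area under the success plot: the fraction of frames whose predicted box overlaps the ground truth by more than an IoU threshold, averaged over a sweep of thresholds. The sketch below illustrates that common definition only. The function names, the `(x1, y1, x2, y2)` box format, and the choice of 21 evenly spaced thresholds are assumptions made for illustration; this page does not state the exact evaluation protocol used for Elysium.

```python
def iou_xyxy(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """Mean success rate over evenly spaced IoU thresholds in [0, 1].

    Assumption: a frame counts as a success when its IoU is strictly
    greater than the threshold; toolkits may differ on this detail.
    """
    ious = [iou_xyxy(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(iou > t for iou in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical two-frame sequence: one exact match, one partial overlap.
    gt = [(0, 0, 2, 2), (0, 0, 2, 2)]
    pred = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(f"success AUC: {success_auc(pred, gt):.3f}")
```

Benchmark tables usually report this value multiplied by 100, which is why the scores above appear on a 0 to 100 scale.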