KunChang Li; Yinan He; Yi Wang; Yizhuo Li; Wenhai Wang; Ping Luo; Yali Wang; Limin Wang; Yu Qiao

Abstract
In this paper, we make an initial attempt at developing an end-to-end, chat-centric video understanding system, which we call VideoChat. It integrates video foundation models and large language models via a learnable neural interface and excels in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, making it a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the system's potential across a broad spectrum of video applications, and it could serve as a simple prototype for future research on chat-centric video understanding. Code and data are available at https://github.com/OpenGVLab/Ask-Anything
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| question-answering-on-next-qa-open-ended | VideoChat | Accuracy: 56.6; Confidence Score: 3.2 |
| video-based-generative-performance | VideoChat | Consistency: 2.24; Contextual Understanding: 2.53; Correctness of Information: 2.23; Detail Orientation: 2.50; Temporal Understanding: 1.94; Mean: 2.29 |
| video-based-generative-performance-1 | VideoChat | GPT Score: 2.32 |
| video-based-generative-performance-2 | VideoChat | GPT Score: 2.24 |
| video-based-generative-performance-3 | VideoChat | GPT Score: 2.53 |
| video-based-generative-performance-4 | VideoChat | GPT Score: 2.50 |
| video-based-generative-performance-5 | VideoChat | GPT Score: 1.94 |
| video-question-answering-on-activitynet-qa | VideoChat | Accuracy: 26.5; Confidence Score: 2.2 |
| video-question-answering-on-mvbench | VideoChat | Avg.: 35.5 |
| zeroshot-video-question-answer-on-activitynet | VideoChat | Accuracy: 26.5; Confidence Score: 2.2 |
| zeroshot-video-question-answer-on-msrvtt-qa | VideoChat-7B | Accuracy: 45.0; Confidence Score: 2.5 |
| zeroshot-video-question-answer-on-msvd-qa | VideoChat-7B | Accuracy: 56.3; Confidence Score: 2.8 |
| zeroshot-video-question-answer-on-tgif-qa | VideoChat-7B | Accuracy: 34.4; Confidence Score: 2.3 |
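
The mean reported in the video-based-generative-performance row appears to be the unweighted average of the five per-dimension scores in that same row. The short sketch below recomputes it under that assumption; the dictionary layout and the `mean_score` helper are illustrative names, not part of any benchmark tooling.

```python
# Per-dimension scores copied from the video-based-generative-performance row.
scores = {
    "Consistency": 2.24,
    "Contextual Understanding": 2.53,
    "Correctness of Information": 2.23,
    "Detail Orientation": 2.50,
    "Temporal Understanding": 1.94,
}


def mean_score(values):
    """Unweighted arithmetic mean, rounded to two decimals (assumed aggregation)."""
    values = list(values)
    return round(sum(values) / len(values), 2)


reported_mean = 2.29  # value listed in the table
computed_mean = mean_score(scores.values())

print(f"computed: {computed_mean:.2f}, reported: {reported_mean:.2f}")
```

The recomputed value matches the reported mean to two decimal places, which is consistent with a simple average over the five dimensions.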