Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz; Hanoona Rasheed; Salman Khan; Fahad Shahbaz Khan

Abstract
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the under-explored field of video-based conversation by introducing Video-ChatGPT, a multimodal model that merges a video-adapted visual encoder with an LLM. The resulting model can understand and generate detailed conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs, used to train Video-ChatGPT and acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework to objectively analyze the strengths and weaknesses of video-based dialogue models. Code: https://github.com/mbzuai-oryx/Video-ChatGPT.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| question-answering-on-next-qa-open-ended | Video-ChatGPT | Accuracy: 54.6; Confidence Score: 3.2 |
| vcgbench-diverse-on-videoinstruct | Video-ChatGPT | Consistency: 2.06; Contextual Understanding: 2.46; Correctness of Information: 2.07; Dense Captioning: 0.89; Detail Orientation: 2.42; Reasoning: 3.60; Spatial Understanding: 2.25; Temporal Understanding: 1.39; mean: 2.08 |
| video-based-generative-performance | Video-ChatGPT | Consistency: 2.37; Contextual Understanding: 2.62; Correctness of Information: 2.40; Detail Orientation: 2.52; Temporal Understanding: 1.98; mean: 2.38 |
| video-based-generative-performance-1 | Video-ChatGPT | gpt-score: 2.40 |
| video-based-generative-performance-2 | Video-ChatGPT | gpt-score: 2.37 |
| video-based-generative-performance-3 | Video-ChatGPT | gpt-score: 2.62 |
| video-based-generative-performance-4 | Video-ChatGPT | gpt-score: 2.52 |
| video-based-generative-performance-5 | Video-ChatGPT | gpt-score: 1.98 |
| video-question-answering-on-activitynet-qa | Video-ChatGPT | Accuracy: 35.2; Confidence Score: 2.7 |
| video-question-answering-on-mvbench | Video-ChatGPT | Avg.: 32.7 |
| zeroshot-video-question-answer-on-activitynet | Video-ChatGPT | Accuracy: 35.2; Confidence Score: 2.7 |
| zeroshot-video-question-answer-on-msrvtt-qa | Video-ChatGPT-7B | Accuracy: 49.3; Confidence Score: 2.8 |
| zeroshot-video-question-answer-on-msvd-qa | Video-ChatGPT-7B | Accuracy: 64.9; Confidence Score: 3.3 |
| zeroshot-video-question-answer-on-tgif-qa | Video-ChatGPT-7B | Accuracy: 51.4; Confidence Score: 3.0 |
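
The table does not say how the two "mean" values are derived. The sketch below is a sanity check, not the authors' evaluation code: it assumes each mean is the unweighted average of the five criteria that both rows share (Correctness of Information, Detail Orientation, Contextual Understanding, Temporal Understanding, Consistency). The names `CORE_CRITERIA` and `core_mean` are hypothetical, introduced only for this illustration; the scores are copied from the table.

```python
# Scores copied from the benchmark table. Which criteria enter each "mean"
# is an assumption for this check; the listing does not state it.
CORE_CRITERIA = (
    "Correctness of Information",
    "Detail Orientation",
    "Contextual Understanding",
    "Temporal Understanding",
    "Consistency",
)

generative_performance = {
    "Consistency": 2.37,
    "Contextual Understanding": 2.62,
    "Correctness of Information": 2.40,
    "Detail Orientation": 2.52,
    "Temporal Understanding": 1.98,
}

vcgbench_diverse = {
    "Consistency": 2.06,
    "Contextual Understanding": 2.46,
    "Correctness of Information": 2.07,
    "Dense Captioning": 0.89,
    "Detail Orientation": 2.42,
    "Reasoning": 3.60,
    "Spatial Understanding": 2.25,
    "Temporal Understanding": 1.39,
}


def core_mean(scores):
    """Unweighted average over the five assumed core criteria, to 2 decimals."""
    total = sum(scores[name] for name in CORE_CRITERIA)
    return round(total / len(CORE_CRITERIA), 2)


def full_mean(scores):
    """Unweighted average over every criterion in the row, to 2 decimals."""
    return round(sum(scores.values()) / len(scores), 2)


print("generative performance, core mean:", core_mean(generative_performance))
print("VCGBench-Diverse, core mean:", core_mean(vcgbench_diverse))
print("VCGBench-Diverse, all-criteria mean:", full_mean(vcgbench_diverse))
```

Under this assumption the core averages agree with the reported means of 2.38 and 2.08. Averaging all eight VCGBench-Diverse criteria gives roughly 2.14 instead, which suggests that Dense Captioning, Reasoning, and Spatial Understanding are reported separately from that row's mean.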