5 months ago

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz; Hanoona Rasheed; Salman Khan; Fahad Shahbaz Khan

Abstract

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the under-explored field of video-based conversation by introducing Video-ChatGPT, a multimodal model that merges a video-adapted visual encoder with an LLM. The resulting model is capable of understanding and generating detailed conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyze their strengths and weaknesses. Code: https://github.com/mbzuai-oryx/Video-ChatGPT.

Code Repositories

mbzuai-oryx/video-chatgpt (Official; pytorch; mentioned in GitHub)
qiujihao19/artemis (pytorch; mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| question-answering-on-next-qa-open-ended | Video-ChatGPT | Accuracy: 54.6; Confidence Score: 3.2 |
| vcgbench-diverse-on-videoinstruct | Video-ChatGPT | Consistency: 2.06; Contextual Understanding: 2.46; Correctness of Information: 2.07; Dense Captioning: 0.89; Detail Orientation: 2.42; Reasoning: 3.60; Spatial Understanding: 2.25; Temporal Understanding: 1.39; mean: 2.08 |
| video-based-generative-performance | Video-ChatGPT | Consistency: 2.37; Contextual Understanding: 2.62; Correctness of Information: 2.4; Detail Orientation: 2.52; Temporal Understanding: 1.98; mean: 2.38 |
| video-based-generative-performance-1 | Video-ChatGPT | gpt-score: 2.40 |
| video-based-generative-performance-2 | Video-ChatGPT | gpt-score: 2.37 |
| video-based-generative-performance-3 | Video-ChatGPT | gpt-score: 2.62 |
| video-based-generative-performance-4 | Video-ChatGPT | gpt-score: 2.52 |
| video-based-generative-performance-5 | Video-ChatGPT | gpt-score: 1.98 |
| video-question-answering-on-activitynet-qa | Video-ChatGPT | Accuracy: 35.2; Confidence Score: 2.7 |
| video-question-answering-on-mvbench | Video-ChatGPT | Avg.: 32.7 |
| zeroshot-video-question-answer-on-activitynet | Video-ChatGPT | Accuracy: 35.2; Confidence Score: 2.7 |
| zeroshot-video-question-answer-on-msrvtt-qa | Video-ChatGPT-7B | Accuracy: 49.3; Confidence Score: 2.8 |
| zeroshot-video-question-answer-on-msvd-qa | Video-ChatGPT-7B | Accuracy: 64.9; Confidence Score: 3.3 |
| zeroshot-video-question-answer-on-tgif-qa | Video-ChatGPT-7B | Accuracy: 51.4; Confidence Score: 3.0 |
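
The listing does not say how the `mean` values are derived. The sketch below checks one plausible reading: that each `mean` is the arithmetic mean of the same five dimensions (Correctness of Information, Detail Orientation, Contextual Understanding, Temporal Understanding, Consistency). That choice of dimensions is an assumption, not something the page states; the scores are copied from the listing, and the names `CORE_DIMENSIONS` and `core_mean` are illustrative only.

```python
from statistics import mean

# Scores copied from the benchmark listing for Video-ChatGPT.
generative = {
    "Correctness of Information": 2.40,
    "Detail Orientation": 2.52,
    "Contextual Understanding": 2.62,
    "Temporal Understanding": 1.98,
    "Consistency": 2.37,
}
diverse = {
    "Correctness of Information": 2.07,
    "Detail Orientation": 2.42,
    "Contextual Understanding": 2.46,
    "Temporal Understanding": 1.39,
    "Consistency": 2.06,
    "Dense Captioning": 0.89,
    "Spatial Understanding": 2.25,
    "Reasoning": 3.60,
}

# Assumption: the reported mean covers only these five shared dimensions.
CORE_DIMENSIONS = tuple(generative)


def core_mean(scores, dims=CORE_DIMENSIONS):
    """Arithmetic mean over the chosen dimensions, rounded to two decimals."""
    return round(mean(scores[d] for d in dims), 2)


print(core_mean(generative))                 # compare with the listed mean of 2.38
print(core_mean(diverse))                    # compare with the listed mean of 2.08
print(core_mean(diverse, dims=tuple(diverse)))  # mean over all eight dimensions
```

Under this assumption both listed means are reproduced, whereas averaging all eight vcgbench-diverse dimensions gives a different value, which suggests Dense Captioning, Spatial Understanding, and Reasoning are reported separately from the mean.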
