Tarsier: Recipes for Training and Evaluating Large Video Description Models

Jiawei Wang; Liping Yuan; Yuchen Zhang; Haomiao Sun

Abstract

Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a $+51.4\%$ advantage in human side-by-side evaluation over the strongest open-source model. They are also comparable to state-of-the-art proprietary models, with a $+12.3\%$ advantage against GPT-4V and a $-6.7\%$ disadvantage against Gemini 1.5 Pro. When upgraded to Tarsier2, built upon SigLIP and Qwen2-7B, the model improves further still, showing a $+4.8\%$ advantage against GPT-4o. Beyond video description, Tarsier proves to be a versatile generalist model, achieving new state-of-the-art results across nine public benchmarks, including multi-choice VQA, open-ended VQA, and zero-shot video captioning. Our second contribution is a new benchmark, DREAM-1K (https://tarsier-vlm.github.io/), for evaluating video description models: a challenging dataset of videos drawn from diverse sources and of varying complexity, paired with an automatic method specifically designed to assess the quality of fine-grained video descriptions. We make our models and evaluation benchmark publicly available at https://github.com/bytedance/tarsier.
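
The architecture described in the abstract is deliberately simple: each frame is encoded independently by a CLIP-ViT vision tower, projected into the LLM's embedding space, and concatenated with the text tokens so that the LLM itself models temporal relations across frames. The sketch below illustrates this frame-wise encode-then-concatenate design in PyTorch; the class name, linear projector, and base-model checkpoints are illustrative assumptions, not the official implementation (see the repository linked above for that).

```python
# Minimal sketch of a Tarsier-style video-language model: frames encoded
# separately by CLIP-ViT, projected, and prepended to text tokens for an LLM.
# Checkpoint names and the projector design are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM

class VideoDescriptionModel(nn.Module):
    def __init__(self,
                 vision_name="openai/clip-vit-large-patch14",   # assumed vision tower
                 llm_name="lmsys/vicuna-7b-v1.5"):              # assumed language model
        super().__init__()
        self.vision_tower = CLIPVisionModel.from_pretrained(vision_name)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Linear projection from vision hidden size to LLM hidden size
        # (the released model may use a different projector).
        self.projector = nn.Linear(
            self.vision_tower.config.hidden_size, self.llm.config.hidden_size
        )

    def forward(self, pixel_values, input_ids):
        # pixel_values: (batch, num_frames, 3, H, W) -- each frame encoded separately
        b, t = pixel_values.shape[:2]
        frames = pixel_values.flatten(0, 1)                         # (b*t, 3, H, W)
        patch_feats = self.vision_tower(frames).last_hidden_state   # (b*t, patches, d_vis)
        vis_tokens = self.projector(patch_feats)                    # (b*t, patches, d_llm)
        vis_tokens = vis_tokens.view(b, -1, vis_tokens.size(-1))    # frames concatenated in sequence order
        txt_tokens = self.llm.get_input_embeddings()(input_ids)     # (b, seq, d_llm)
        inputs_embeds = torch.cat([vis_tokens, txt_tokens], dim=1)  # LLM sees all frames, models temporal relations
        return self.llm(inputs_embeds=inputs_embeds)
```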

Code Repositories

bytedance/tarsier (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
video-question-answering-on-mvbench | Tarsier (34B) | Avg.: 67.6
video-question-answering-on-tvbench | Tarsier-7B | Average Accuracy: 46.9
video-question-answering-on-tvbench | Tarsier-34B | Average Accuracy: 55.5
zero-shot-video-question-answer-on-egoschema | Tarsier (34B) | Accuracy: 68.6
zero-shot-video-question-answer-on-egoschema-1 | Tarsier (34B) | Accuracy: 61.7
zero-shot-video-question-answer-on-next-qa | Tarsier (34B) | Accuracy: 79.2
zeroshot-video-question-answer-on-activitynet | Tarsier (34B) | Accuracy: 61.6; Confidence Score: 3.7
zeroshot-video-question-answer-on-msrvtt-qa | Tarsier (34B) | Accuracy: 66.4; Confidence Score: 3.7
zeroshot-video-question-answer-on-msvd-qa | Tarsier (34B) | Accuracy: 80.3; Confidence Score: 4.2
zeroshot-video-question-answer-on-tgif-qa | Tarsier (34B) | Accuracy: 82.5; Confidence Score: 4.4
