Tarsier: Recipes for Training and Evaluating Large Video Description Models
Jiawei Wang; Liping Yuan; Yuchen Zhang; Haomiao Sun

Abstract
Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier uses CLIP-ViT to encode frames separately and an LLM to model temporal relationships. Despite this simple architecture, we demonstrate that, with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a $+51.4\%$ advantage in human side-by-side evaluation over the strongest of them. They are also comparable to state-of-the-art proprietary models, with a $+12.3\%$ advantage against GPT-4V and a $-6.7\%$ disadvantage against Gemini 1.5 Pro. Upgraded to Tarsier2, built on SigLIP and Qwen2-7B, the model improves further still, showing a $+4.8\%$ advantage against GPT-4o. Beyond video description, Tarsier proves to be a versatile generalist model, achieving new state-of-the-art results across nine public benchmarks, including multi-choice VQA, open-ended VQA, and zero-shot video captioning. Our second contribution is DREAM-1K (https://tarsier-vlm.github.io/), a new benchmark for evaluating video description models: a challenging dataset of videos from diverse sources and of varying complexity, paired with an automatic method designed specifically to assess the quality of fine-grained video descriptions. We make our models and evaluation benchmark publicly available at https://github.com/bytedance/tarsier.
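The architecture named in the abstract is straightforward to sketch. Below is a minimal, hypothetical PyTorch rendition of the frame-wise design; the class name, dimensions, and the encoder/LLM interfaces are assumptions for illustration, not the released implementation. Each frame is encoded independently by a CLIP-ViT, the visual tokens are projected into the LLM's embedding space, and the LLM's self-attention over the concatenated per-frame tokens is what models temporal relationships.

```python
import torch
import torch.nn as nn


class FrameWiseVideoLM(nn.Module):
    """Hypothetical sketch of a Tarsier-style model: frame-wise vision
    encoding plus an LLM for temporal modeling. Not the released code."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a CLIP-ViT (assumed interface)
        self.projector = nn.Linear(vision_dim, llm_dim)  # visual tokens -> LLM embedding space
        self.llm = llm                                   # decoder-only LM (assumed HF-style API)

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor):
        # frames: (batch, num_frames, 3, H, W); each frame is encoded separately,
        # so the vision encoder never sees temporal context.
        b, t = frames.shape[:2]
        frame_feats = self.vision_encoder(frames.flatten(0, 1))   # (b*t, n_tok, vision_dim)
        visual_tokens = self.projector(frame_feats).unflatten(0, (b, t))
        # Concatenate per-frame tokens in temporal order; the LLM's self-attention
        # across this sequence is what models cross-frame relationships.
        visual_tokens = visual_tokens.flatten(1, 2)                # (b, t*n_tok, llm_dim)
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)    # prepend video to text
        return self.llm(inputs_embeds=inputs)
```

The notable design choice is that all temporal modeling is deferred to the LLM: the vision encoder only ever sees one frame at a time.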
Code Repositories
https://github.com/bytedance/tarsier
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Video Question Answering on MVBench | Tarsier-34B | Average accuracy: 67.6 |
| Video Question Answering on TVBench | Tarsier-7B | Average accuracy: 46.9 |
| Video Question Answering on TVBench | Tarsier-34B | Average accuracy: 55.5 |
| Zero-Shot Video Question Answering on EgoSchema | Tarsier-34B | Accuracy: 68.6 |
| Zero-Shot Video Question Answering on EgoSchema-1 | Tarsier-34B | Accuracy: 61.7 |
| Zero-Shot Video Question Answering on NExT-QA | Tarsier-34B | Accuracy: 79.2 |
| Zero-Shot Video Question Answering on ActivityNet-QA | Tarsier-34B | Accuracy: 61.6; Confidence score: 3.7 |
| Zero-Shot Video Question Answering on MSRVTT-QA | Tarsier-34B | Accuracy: 66.4; Confidence score: 3.7 |
| Zero-Shot Video Question Answering on MSVD-QA | Tarsier-34B | Accuracy: 80.3; Confidence score: 4.2 |
| Zero-Shot Video Question Answering on TGIF-QA | Tarsier-34B | Accuracy: 82.5; Confidence score: 4.4 |
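As background on how the description-quality side of the work is evaluated: the automatic method mentioned in the abstract scores fine-grained descriptions rather than relying on n-gram overlap. The snippet below is a hypothetical sketch of event-level scoring in that spirit; it assumes events have already been extracted from the reference and the generated description and judged for entailment (e.g., by an LLM, not shown here), and the function name and inputs are illustrative, not the benchmark's actual API.

```python
# Hypothetical sketch of event-level precision/recall/F1 for fine-grained
# video descriptions, in the spirit of DREAM-1K's automatic evaluation.
# Each flag says whether an extracted event from one description is
# entailed by the description on the other side (judgments assumed given).
def event_f1(ref_events_entailed: list[bool], gen_events_entailed: list[bool]) -> dict:
    # Recall: fraction of reference events the generated description covers.
    recall = sum(ref_events_entailed) / max(len(ref_events_entailed), 1)
    # Precision: fraction of generated events grounded in the reference.
    precision = sum(gen_events_entailed) / max(len(gen_events_entailed), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Example: 3 of 4 reference events recovered; 3 of 5 generated events grounded.
print(event_f1([True, True, True, False], [True, False, True, False, True]))
```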