4 months ago

Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

Liping Yuan Jiawei Wang Haomiao Sun Yuchen Zhang Yuan Lin

Abstract

We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.
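
Upgrade (3) in the abstract refers to DPO (Direct Preference Optimization) training on automatically constructed preference pairs. As a rough illustration only, the following is a minimal sketch of the standard per-pair DPO loss, not the authors' training code. The function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions made for this example; the abstract does not state the hyperparameters or how the preference pairs are scored.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) description pair.

    Each argument is the summed token log-probability of a full response
    under either the policy being trained or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Hypothetical log-probabilities for one preference pair.
    print(dpo_loss(-42.0, -55.0, -45.0, -50.0))
```

When the policy and the reference model agree exactly, the margin is zero and the loss equals log 2; the loss decreases as the policy assigns relatively more probability to the preferred description than the reference does.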

Code Repositories

bytedance/tarsier (Official, pytorch, mentioned in GitHub)

Benchmarks

Benchmark: video-question-answering-on-tvbench
Methodology: Tarsier2-7B
Metrics: Average Accuracy: 54.7
