
Abstract

We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.
