3 months ago

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, Jing Liu

Abstract

Due to the limited scale and quality of video-text training corpora, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations. To address this issue, we propose COSA, a COncatenated SAmple pretrained vision-language foundation model. COSA jointly models visual contents and event-level temporal cues using only image-text corpora. We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining. This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus, enabling richer scene transformations and explicit event-description correspondence. Extensive experiments demonstrate that COSA consistently improves performance across a broad range of downstream tasks, including long-form/short-form video-text tasks and image-text tasks such as retrieval, captioning, and question answering. Notably, COSA achieves state-of-the-art results on various competitive benchmarks. Code and model are released at https://github.com/TXH-mercury/COSA.
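The concatenation step described in the abstract can be illustrated with a minimal sketch. This is not the code from the COSA repository: the function name `build_concatenated_samples`, the `num_concat` and `seed` parameters, the random grouping strategy, and the output dictionary layout are all assumptions made for illustration. Images are represented by file names so the example runs with the standard library only.

```python
import random
from typing import Any, Dict, List, Sequence, Tuple


def build_concatenated_samples(
    pairs: Sequence[Tuple[Any, str]],
    num_concat: int = 4,
    seed: int = 0,
) -> List[Dict[str, Any]]:
    """Group image-text pairs into pseudo video-paragraph samples.

    Hypothetical helper for illustration; not the COSA repository's API.
    Each output sample holds `num_concat` images (treated as frames) and
    the matching captions joined, in the same order, into one paragraph.
    """
    if num_concat < 1:
        raise ValueError("num_concat must be at least 1")

    rng = random.Random(seed)
    order = list(range(len(pairs)))
    rng.shuffle(order)  # assumption: pairs are grouped at random

    samples: List[Dict[str, Any]] = []
    # Walk the shuffled indices in chunks of num_concat; an incomplete
    # final chunk is dropped so every sample has the same length.
    for start in range(0, len(order) - num_concat + 1, num_concat):
        group = [pairs[i] for i in order[start:start + num_concat]]
        frames = [image for image, _ in group]
        paragraph = " ".join(caption.strip() for _, caption in group)
        samples.append({"frames": frames, "paragraph": paragraph})
    return samples


if __name__ == "__main__":
    toy_pairs = [(f"img_{i}.jpg", f"Caption number {i}.") for i in range(8)]
    for sample in build_concatenated_samples(toy_pairs, num_concat=4):
        print(sample["frames"])
        print(sample["paragraph"])
```

Because the frame order and the caption order are kept aligned, each sentence of the paragraph corresponds to one frame, which is one way to read the "explicit event-description correspondence" the abstract mentions. A real pipeline would load and stack image tensors instead of file names.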

Code Repositories

txh-mercury/cosa (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| video-captioning-on-msr-vtt-1 | COSA | BLEU-4: 53.7, CIDEr: 74.7 |
| video-captioning-on-msvd-1 | COSA | BLEU-4: 76.5, CIDEr: 178.5 |
| video-captioning-on-tvc | COSA | BLEU-4: 18.8, CIDEr: 70.7 |
| video-captioning-on-vatex-1 | COSA | BLEU-4: 43.7, CIDEr: 96.5 |
| video-captioning-on-youcook2 | COSA | BLEU-4: 10.1, CIDEr: 1.31 |
| video-question-answering-on-activitynet-qa | COSA | Accuracy: 49.9 |
| video-question-answering-on-msrvtt-qa | COSA | Accuracy: 49.2 |
| video-retrieval-on-activitynet | COSA | text-to-video R@1: 67.3 |
| video-retrieval-on-didemo | COSA | text-to-video R@1: 70.5 |
| video-retrieval-on-lsmdc | COSA | text-to-video R@1: 39.4 |
| video-retrieval-on-msr-vtt | COSA | text-to-video R@1: 57.9 |
| visual-question-answering-on-msvd-qa-1 | COSA | Accuracy: 0.60 |
