AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

Abstract

Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
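
The abstract describes ACM as a refiner that consumes a frozen base model's caption together with a user instruction and modality features. The following is a minimal Python sketch of that data flow under stated assumptions; all class and method names here are hypothetical placeholders, not the released AnyCap API.

```python
# Hypothetical sketch of ACM's plug-and-play refinement flow (names and
# interfaces are illustrative assumptions, not the actual AnyCap code).
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionRequest:
    modality: str            # "image", "video", or "audio"
    features: List[float]    # modality features extracted for the input
    instruction: str         # user instruction, e.g. "one brief sentence"


class FrozenBaseCaptioner:
    """Stands in for an existing foundation model; it is never retrained."""

    def caption(self, req: CaptionRequest) -> str:
        # A real system would call the base model here; stubbed for the sketch.
        return "a person walking a dog in a park"


class AnyCapModelSketch:
    """Lightweight refiner conditioned on the base caption, the user
    instruction, and the modality features, as the abstract describes."""

    def __init__(self, base: FrozenBaseCaptioner) -> None:
        self.base = base

    def refine(self, req: CaptionRequest) -> str:
        draft = self.base.caption(req)  # reuse the base model's caption
        # The real ACM is a trained model; this placeholder only shows
        # which inputs the refiner consumes.
        return f"({req.instruction}) {draft}"


if __name__ == "__main__":
    acm = AnyCapModelSketch(FrozenBaseCaptioner())
    req = CaptionRequest("image", [0.12, 0.87], "focus on the foreground")
    print(acm.refine(req))
```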
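Likewise, the decoupling idea behind AnyCapEval can be illustrated as two independent scoring axes. The toy metrics below are invented for illustration only and stand in for the benchmark's actual protocol.

```python
# Toy illustration of decoupled evaluation in the spirit of AnyCapEval:
# content accuracy and stylistic fidelity are scored on separate axes
# rather than blended into one number. Both metrics are placeholders.
def content_score(caption: str, reference_facts: set) -> float:
    """Fraction of reference facts mentioned in the caption (placeholder)."""
    words = set(caption.lower().replace(".", "").split())
    return len(words & reference_facts) / max(len(reference_facts), 1)


def style_score(caption: str, instruction: str) -> float:
    """Toy instruction-compliance check, e.g. a one-sentence constraint."""
    if "one sentence" in instruction:
        return 1.0 if caption.count(".") <= 1 else 0.0
    return 1.0


caption = "A dog chases a red ball in the park."
print(content_score(caption, {"dog", "ball", "park"}))   # content axis -> 1.0
print(style_score(caption, "describe in one sentence"))  # style axis -> 1.0
```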
