VL3-Syn7M Multimodal Image-Text Dataset
The VL3-Syn7M dataset is a high-quality image-text dataset released by Alibaba DAMO Academy in 2025. It was built to support VideoLLaMA3, a frontier multimodal foundation model for image and video understanding, and is described in the paper "VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding". The dataset contains multi-dimensional, fine-grained annotations, including detailed image captions, short captions, and image source information, and covers multiple data types such as scene images, document images, and text images, providing rich material for learning multimodal representations. This high-quality data supports in-depth research on image semantic understanding and the optimization of multimodal interaction systems, and advances applications such as intelligent visual assistants, document understanding tools, and image-guided robot interaction.
Main Features
- Large data scale: The dataset contains 7 million images with corresponding annotations, providing the massive sample volume that large models require for training and helping improve a model's ability to understand diverse visual scenes and semantics.
- Wide range of data sources: Scene images are drawn from multiple datasets such as Objects365 and SA-1B; scene-text images come from BLIP3-OCR; and document images are selected from pdfa-eng-wds and idl-wds, among others. This breadth of sources yields rich, diverse visual content and improves the model's generalization across different image types.
- High-quality annotation: Short captions are generated by InternVL2-8B and detailed captions by InternVL2-26B; the dataset also includes a large amount of plain-text data. High-quality caption annotations give the model accurate supervision for associating images with text, while the plain-text data helps improve its ability to follow instructions that involve both visual and textual inputs.
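To illustrate how records combining these annotation fields might be consumed, here is a minimal Python sketch. The field names (`image_path`, `short_caption`, `detailed_caption`, `source`) and the sample values are assumptions for illustration only; consult the official VL3-Syn7M release for the actual schema.

```python
# Hypothetical VL3-Syn7M-style records; field names are assumed, not official.
sample_records = [
    {
        "image_path": "images/000001.jpg",          # scene image
        "short_caption": "A dog running on a beach.",
        "detailed_caption": "A golden retriever sprints across wet sand "
                            "under an overcast sky, kicking up spray.",
        "source": "SA-1B",
    },
    {
        "image_path": "docs/000002.png",            # document image
        "short_caption": "A scanned invoice page.",
        "detailed_caption": "A single-page invoice with a header, an itemized "
                            "table, and a total at the bottom right.",
        "source": "pdfa-eng-wds",
    },
]


def filter_by_source(records, source):
    """Return only the records originating from the given source dataset."""
    return [r for r in records if r["source"] == source]


# Select document images for a document-understanding training subset.
doc_records = filter_by_source(sample_records, "pdfa-eng-wds")
print(len(doc_records), doc_records[0]["short_caption"])
```

A per-record `source` field like this is what makes it easy to build type-specific training mixtures (e.g. scene-only or document-only subsets) from a single unified dataset.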