2 months ago

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Abstract

We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers end-to-end conversion that accurately captures the content, structure, and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms, significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for chart, table, equation, and code recognition. Experimental results demonstrate that SmolDocling competes with other vision-language models that are up to 27 times larger in size, while substantially reducing computational requirements. The model is currently available; the datasets will be made publicly available soon.

Code Repositories

docling-project/docling (mentioned in GitHub)
DS4SD/docling (PyTorch; mentioned in GitHub)
