DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
Zhiyuan Zhao, Hengrui Kang, Bin Wang, Conghui He

Abstract
Document Layout Analysis is crucial for real-world document understanding systems, but it encounters a challenging trade-off between speed and accuracy: multimodal methods leveraging both text and visual features achieve higher accuracy but suffer from significant latency, whereas unimodal methods relying solely on visual features offer faster processing speeds at the expense of accuracy. To address this dilemma, we introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages through document-specific optimizations in both pre-training and model design. For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm, which frames document synthesis as a two-dimensional bin packing problem, generating the large-scale, diverse DocSynth-300K dataset. Pre-training on the resulting DocSynth-300K dataset significantly improves fine-tuning performance across various document types. In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module that is capable of better handling multi-scale variations of document elements. Furthermore, to validate performance across different document types, we introduce a complex and challenging benchmark named DocStructBench. Extensive experiments on downstream datasets demonstrate that DocLayout-YOLO excels in both speed and accuracy. Code, data, and models are available at https://github.com/opendatalab/DocLayout-YOLO.
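
To make the bin-packing framing concrete, the following is a minimal sketch of a generic greedy best-fit heuristic for placing rectangular elements on a page. It is not the authors' Mesh-candidate BestFit implementation, which the abstract does not specify. The names `Rect` and `best_fit_layout`, the fill-ratio criterion, and the guillotine-style splitting of leftover space are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; (x, y) is the top-left corner."""
    x: float
    y: float
    w: float
    h: float

    @property
    def area(self) -> float:
        return self.w * self.h


def best_fit_layout(page_w, page_h, elements, min_fill=0.0):
    """Greedily place (w, h) elements with positive sizes on a page.

    Hypothetical helper: at each step, pick the (element, free cell) pair
    with the highest fill ratio, place the element in the cell's top-left
    corner, and split the leftover space into new free cells.
    """
    free = [Rect(0, 0, page_w, page_h)]  # free cells still available
    remaining = list(elements)           # candidate elements not yet placed
    placed = []

    while remaining and free:
        best = None  # (fill_ratio, element_index, free_index)
        for ei, (w, h) in enumerate(remaining):
            for fi, cell in enumerate(free):
                if w <= cell.w and h <= cell.h:
                    ratio = (w * h) / cell.area
                    if best is None or ratio > best[0]:
                        best = (ratio, ei, fi)
        if best is None or best[0] < min_fill:
            break  # nothing fits, or the best match fills too little

        _, ei, fi = best
        w, h = remaining.pop(ei)
        cell = free.pop(fi)
        placed.append(Rect(cell.x, cell.y, w, h))

        # Guillotine split: a strip to the right of the element and a
        # strip below it that spans the full width of the cell.
        if cell.w - w > 0:
            free.append(Rect(cell.x + w, cell.y, cell.w - w, h))
        if cell.h - h > 0:
            free.append(Rect(cell.x, cell.y + h, cell.w, cell.h - h))

    return placed


if __name__ == "__main__":
    layout = best_fit_layout(100, 100, [(100, 40), (50, 60), (50, 60), (30, 30)])
    for box in layout:
        print(box)
```

In this toy run, the three larger elements tile the 100 x 100 page exactly, so the 30 x 30 element is left unplaced. A real synthesis pipeline would presumably also account for element categories, margins, and alignment, none of which this sketch models.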
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| document-layout-analysis-on-d4la | DocLayout-YOLO | mAP: 70.3 |