Dolphin Multimodal Document Image Parsing

1. Tutorial Introduction

Dolphin is a multimodal document image parsing model released by the ByteDance team in May 2025. It takes a two-stage approach that parses structure first and content second: the first stage generates a sequence of document layout elements, and the second stage uses those elements as anchors to parse their content in parallel. Dolphin performs well across a range of document parsing tasks and is reported to surpass models such as GPT-4.1 and Mistral-OCR. The accompanying paper, "Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting", has been accepted by ACL 2025.

This tutorial runs on a single RTX 4090 GPU.
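The two-stage flow described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `LayoutElement`, `analyze_layout`, `parse_element`, `parse_page`, and the prompt strings are hypothetical stand-ins, not Dolphin's actual API or prompts, and a thread pool stands in for whatever parallel decoding the model really uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutElement:
    """One anchor produced by stage 1 (hypothetical structure)."""
    order: int    # position in the generated element sequence
    kind: str     # element type, e.g. "paragraph", "table", "formula"
    bbox: tuple   # (x1, y1, x2, y2) region on the page image


# Hypothetical type-specific prompts; the real model's prompts may differ.
PROMPTS = {
    "paragraph": "Read the text in this region.",
    "table": "Parse the table in this region.",
    "formula": "Transcribe the formula in this region.",
}


def analyze_layout(page_image):
    """Stage 1 stand-in: return the layout element sequence for a page."""
    # A real implementation would run the model on the full page image.
    return [
        LayoutElement(0, "paragraph", (50, 40, 550, 120)),
        LayoutElement(1, "table", (50, 140, 550, 400)),
        LayoutElement(2, "formula", (50, 420, 550, 470)),
    ]


def parse_element(page_image, element):
    """Stage 2 stand-in: parse one element using a prompt chosen by its type."""
    prompt = PROMPTS.get(element.kind, PROMPTS["paragraph"])
    # A real implementation would crop element.bbox and decode with the model.
    return f"<{element.kind} #{element.order}: {prompt}>"


def parse_page(page_image):
    """Run stage 1, then parse every anchored element in parallel."""
    elements = analyze_layout(page_image)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves input order, so results stay in sequence order.
        contents = list(pool.map(lambda e: parse_element(page_image, e), elements))
    return list(zip(elements, contents))


for element, content in parse_page("page.png"):
    print(element.kind, "->", content)
```

The point of the sketch is the shape of the pipeline: stage 2 calls are independent once the anchors exist, which is what makes parallel parsing possible.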

2. Project Examples

3. Operation Steps

1. After starting the container, click the API address to open the web interface

If "Bad Gateway" is displayed, the model is still initializing. Because the model is large, wait about 1-2 minutes and then refresh the page.
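If you prefer not to refresh by hand, the wait can be scripted. The sketch below is one possible approach using only the Python standard library: it polls a URL until it returns HTTP 200 or a timeout expires. The function names `probe` and `wait_until_ready`, the 180-second timeout, and the 10-second interval are assumptions chosen to match the 1-2 minute estimate above; the URL in the usage comment is a placeholder for the API address shown in your container.

```python
import time
import urllib.error
import urllib.request


def probe(url):
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error responses such as 502 Bad Gateway arrive here.
        return err.code
    except OSError:
        # Connection refused, DNS failure, socket timeout, and similar.
        return 0


def wait_until_ready(url, timeout=180.0, interval=10.0,
                     probe_fn=probe, sleep_fn=time.sleep):
    """Poll url until it answers 200; return False if timeout seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if probe_fn(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep_fn(interval)


# Usage (placeholder address; substitute the API address of your container):
# if wait_until_ready("http://localhost:8080"):
#     print("Web interface is ready")
```

`probe_fn` and `sleep_fn` are injectable so the loop can be exercised without a network connection or real delays.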

2. Usage Examples

Document Recognition


Element Recognition


4. Discussion

🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial discussion group. Scan the QR code and add the note [SD Tutorial] to join the group, discuss technical issues, and share your results.

Citation Information

The citation information for this project is as follows:

@inproceedings{dolphin2025,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and Tang, Jingqun and Liu, Hao and Huang, Can},
  year={2025},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)}
}
