3 months ago

Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

Sankalp Nagaonkar, Augustya Sharma, Ashish Choithani, Ashutosh Trivedi


Abstract

This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset of 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs (Claude-3, Gemini-1.5, and GPT-4o) are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks and demonstrate their potential to outperform conventional OCR models in many scenarios. However, challenges remain, including hallucinations, content security policies, and sensitivity to occluded or stylized text. The dataset and benchmarking framework are publicly available to foster further research.

Code Repositories

video-db/ocr-benchmark (official)

Benchmarks

Benchmark: optical-character-recognition-ocr-on-videodb

Methodology       Average Accuracy (%)   Character Error Rate (CER)   Word Error Rate (WER)
Gemini-1.5 Pro    76.13                  0.2387                       0.2385
GPT-4o            76.22                  0.2378                       0.5117
Claude-3 Sonnet   67.71                  0.3229                       0.4663
RapidOCR          56.98                  0.4302                       0.7620
EasyOCR           49.30                  0.5070                       0.8262
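
For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how CER and WER are commonly computed from a Levenshtein (edit) distance. It is not taken from the benchmark's repository; the function names are hypothetical, and the definition of accuracy as (1 - CER) x 100 is an assumption made here for illustration, not a definition confirmed by the source.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def accuracy_percent(reference, hypothesis):
    """ASSUMPTION: accuracy taken as (1 - CER) * 100, floored at zero."""
    return max(0.0, 1.0 - cer(reference, hypothesis)) * 100


if __name__ == "__main__":
    truth = "hello world"
    predicted = "hallo world"  # one wrong character, so one wrong word
    print(cer(truth, predicted))  # 1 edit over 11 characters
    print(wer(truth, predicted))  # 1 edit over 2 words -> 0.5
```

Because a single wrong character invalidates a whole word, WER is usually at least as harsh as CER on the same output, which is one reason the two columns can diverge sharply for a given system.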
