LaTeX OCR Mathematical Formula Recognition Dataset
The LaTeX OCR dataset targets the recognition of complex mathematical formulas, a problem in the field of optical character recognition (OCR). It contains multiple configurations, each with different characteristics and data splits. For example, the "full" configuration contains about 100k printed formula samples, while the "synthetic_handwrite" configuration contains about 100k handwritten-style samples synthesized by rendering the printed formulas in handwriting fonts.
This repository has 5 datasets:
small: A small dataset with 110 samples, used for testing.
full: The complete dataset of about 100k printed formulas. The actual number of samples is slightly less than 100k, because many LaTeX formulas that cannot be rendered were removed using the LaTeX abstract syntax tree.
synthetic_handwrite: The complete dataset of about 100k handwritten-style formulas. It is based on the formulas in full, synthesized using handwriting fonts, and can be regarded as human handwriting on paper. The actual number of samples is slightly less than 100k, for the same reason as above.
human_handwrite: A smaller handwriting dataset that is closer to human handwriting on electronic screens. It comes mainly from CROHME and has been verified using the LaTeX abstract syntax tree.
human_handwrite_print: The printed counterpart of human_handwrite. The formulas are the same as in human_handwrite, and the images are rendered from those formulas using LaTeX.
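The filtering step mentioned for full and synthetic_handwrite (dropping LaTeX that cannot be rendered) can be illustrated with a rough sketch. The dataset itself relies on a LaTeX abstract syntax tree, which is not reproduced here; the code below is a much simpler stand-in that only checks that braces and \begin/\end environments are balanced. The names is_plausibly_renderable and filter_formulas are hypothetical and chosen for illustration only.

```python
import re

# Matches \begin{name} and \end{name} tokens (simplified; not a full LaTeX parser).
_ENV_TOKEN = re.compile(r"\\(begin|end)\{([^{}]*)\}")


def is_plausibly_renderable(formula: str) -> bool:
    """Hypothetical check: balanced braces and matching begin/end environments.

    This is a simplified stand-in for AST-based validation, not the
    method used to build the dataset.
    """
    if not formula.strip():
        return False

    # Pass 1: brace balance, ignoring escaped characters such as \{ and \}.
    depth = 0
    i = 0
    while i < len(formula):
        ch = formula[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
        i += 1
    if depth != 0:
        return False

    # Pass 2: every \end{name} must close the most recent open \begin{name}.
    env_stack = []
    for kind, name in _ENV_TOKEN.findall(formula):
        if kind == "begin":
            env_stack.append(name)
        elif not env_stack or env_stack.pop() != name:
            return False
    return not env_stack


def filter_formulas(formulas):
    """Keep only the formulas that pass the simplified check above."""
    return [f for f in formulas if is_plausibly_renderable(f)]
```

A check like this only catches structural problems; a formula can pass it and still fail to render, for instance because of an undefined command, which is one reason a real parser or an actual render attempt is the more reliable filter.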
The LaTeX OCR dataset comes from multiple sources, including data collected from https://zenodo.org/record/56198#.V2p0KTXT6eA and https://www.isical.ac.in/~crohme/ as well as self-constructed data. It can be used to train and evaluate OCR models, especially for complex mathematical symbols and formulas, and has applications in academic document digitization, online education, scientific research assistants, and personal learning.