MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering

Fangyu Liu Francesco Piccinno Syrine Krichene Chenxi Pang Kenton Lee Mandar Joshi Yasemin Altun Nigel Collier Julian Martin Eisenschlos

Abstract

Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models' capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning, which are key capabilities in visual language modeling. We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.
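
To make the image-to-text framing concrete, below is a minimal sketch (not the authors' released code) of how MatCha-style pretraining pairs could be fed to a Pix2Struct-style model using the Hugging Face transformers Pix2Struct classes. The checkpoint name, file paths, target formats, and optimizer settings are illustrative assumptions; only the general setup of pairing a chart or rendered math image with a text target follows the abstract.

```python
# A minimal sketch, assuming the Hugging Face `transformers` Pix2Struct classes.
# Checkpoint name, file names, targets, and hyperparameters below are
# illustrative assumptions, not the paper's actual pretraining pipeline.
import torch
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-base")

# Two kinds of pretraining targets described in the abstract:
#   1) chart derendering: predict the chart's underlying data table
#   2) math reasoning: predict the answer to a rendered math problem
examples = [
    ("bar_chart.png", "year | sales\n2020 | 14\n2021 | 23\n2022 | 31"),  # hypothetical files
    ("math_problem.png", "42"),
]

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for image_path, target_text in examples:
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")            # flattened image patches
    labels = processor(text=target_text, return_tensors="pt").input_ids
    loss = model(
        flattened_patches=inputs.flattened_patches,
        attention_mask=inputs.attention_mask,
        labels=labels,
    ).loss                                                            # standard seq2seq loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Both pretraining tasks reduce to the same sequence-to-sequence objective, which is why they can be mixed with the downstream chart question-answering format without changing the model architecture.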

