8 months ago

Visual Question Answering

Method/Architecture

Fangyu Liu‡‡∗§ Julian Martin Eisenschlos∗∗ Francesco Piccinno∗ Syrine Krichene∗ Chenxi Pang∗ Kenton Lee∗ Mandar Joshi∗ Wenhu Chen∗ Nigel Collier∗ Yasemin Altun∗

Abstract

Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples, and their reasoning capabilities are still very limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key to this method is a modality conversion module, named DePlot, which translates the image of a plot or chart into a linearized table. The output of DePlot can then be used directly to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than 28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over the finetuned SOTA on human-written queries from the chart QA task.
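
The two-step decomposition described in the abstract can be illustrated with a minimal sketch. It only shows how a table, once extracted from a chart, might be linearized and placed into a one-shot prompt. The abstract does not specify the linearization format or the prompt template, so the `" | "` cell separator, the `Table:/Question:/Answer:` layout, and the helper names `linearize_table` and `build_one_shot_prompt` are all assumptions made for illustration. The hand-written tables stand in for the output of the plot-to-table model, and no LLM is actually called.

```python
from typing import Sequence


def linearize_table(header: Sequence[object], rows: Sequence[Sequence[object]]) -> str:
    """Flatten a table into text: one row per line, cells joined by ' | '.

    The separator is an assumption; the abstract only says "linearized table".
    """
    lines = [" | ".join(str(cell) for cell in header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)


def build_one_shot_prompt(
    demo_table: str,
    demo_question: str,
    demo_answer: str,
    table: str,
    question: str,
) -> str:
    """Build a prompt with one solved demonstration followed by the new query.

    The template is hypothetical, not the one used in the paper.
    """
    demo = f"Table:\n{demo_table}\nQuestion: {demo_question}\nAnswer: {demo_answer}"
    query = f"Table:\n{table}\nQuestion: {question}\nAnswer:"
    return demo + "\n\n" + query


# Step 1 stand-in: in the real pipeline these tables would come from the
# plot-to-table model; here they are written by hand.
demo_table = linearize_table(["Year", "Sales"], [[2020, 10], [2021, 15]])
table = linearize_table(["Country", "Share"], [["A", 40], ["B", 60]])

# Step 2: the resulting string would be sent to a pretrained LLM.
prompt = build_one_shot_prompt(
    demo_table,
    "Which year had higher sales?",
    "2021",
    table,
    "Which country has the larger share?",
)
print(prompt)
```

Because the prompt is plain text, the table extractor and the language model stay independent, which is the plug-and-play property the abstract points to: either component could be swapped without retraining the other.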
