Image and Text fusion for UPMC Food-101 using BERT and CNNs

Ignazio Gallo, Gianmarco Ria, Nicola Landro, and Riccardo La Grassa

Abstract

The modern digital world is becoming increasingly multimodal. On the internet, images are often accompanied by text, so classification problems involving these two modalities are very common. In this paper, we examine multimodal classification using textual information and visual representations of the same concept. We investigate two basic methods for multimodal fusion and adapt them with stacking techniques to better handle this type of problem. We use UPMC Food-101, a difficult and noisy multimodal dataset that is representative of this category of multimodal problems. Our results show that the proposed early fusion technique, combined with a stacking-based approach, exceeds the state of the art on this dataset.
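To make the distinction between the two fusion families concrete, the following is a minimal sketch and not the authors' architecture. It assumes each modality has already been reduced to a fixed-length feature vector, and it substitutes random arrays for the BERT and CNN outputs. The feature sizes, the linear classifiers, the function names, and the equal weighting in the late-fusion step are all illustrative assumptions. The stacking step mentioned in the abstract is not reproduced here.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def early_fusion(text_feats, image_feats, weights, bias):
    """Early fusion: concatenate features, then classify the joint vector."""
    fused = np.concatenate([text_feats, image_feats], axis=1)
    return softmax(fused @ weights + bias)


def late_fusion(text_probs, image_probs, text_weight=0.5):
    """Late fusion: combine the class probabilities of separate classifiers."""
    return text_weight * text_probs + (1.0 - text_weight) * image_probs


# Toy stand-ins for real encoder outputs (assumed sizes, random values).
rng = np.random.default_rng(0)
n_samples, text_dim, image_dim, n_classes = 4, 8, 6, 101
text_feats = rng.normal(size=(n_samples, text_dim))
image_feats = rng.normal(size=(n_samples, image_dim))

# Early fusion: one classifier over the concatenated features.
joint_weights = rng.normal(size=(text_dim + image_dim, n_classes))
joint_bias = np.zeros(n_classes)
early_probs = early_fusion(text_feats, image_feats, joint_weights, joint_bias)

# Late fusion: one classifier per modality, merged at the probability level.
text_probs = softmax(text_feats @ rng.normal(size=(text_dim, n_classes)))
image_probs = softmax(image_feats @ rng.normal(size=(image_dim, n_classes)))
late_probs = late_fusion(text_probs, image_probs)

print(early_probs.shape, late_probs.shape)
```

In this sketch the early-fusion classifier can learn interactions between text and image features because it sees both at once, whereas the late-fusion rule only sees each modality's final prediction.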

Benchmarks

Benchmark	Methodology	Metrics
image-classification-on-food-101-1	Inception V3	Accuracy (%): 71.67
multimodal-text-and-image-classification-on-1	Early Fusion (Bert + InceptionV3)	Accuracy (%): 92.5
multimodal-text-and-image-classification-on-1	Late Fusion (Bert + InceptionV3)	Accuracy (%): 84.59
