Image and Text fusion for UPMC Food-101 using BERT and CNNs

Ignazio Gallo, Gianmarco Ria, Nicola Landro, and Riccardo La Grassa

Abstract

The modern digital world is becoming increasingly multimodal. On the internet, images are often accompanied by text, so classification problems involving these two modalities are very common. In this paper, we examine multimodal classification using textual information and visual representations of the same concept. We investigate two basic methods for performing multimodal fusion and adapt them with stacking techniques to better handle this type of problem. Here, we use UPMC Food-101, a difficult and noisy multimodal dataset that is representative of this category of multimodal problems. Our results show that the proposed early fusion technique combined with a stacking-based approach exceeds the state of the art on this dataset.
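The difference between early fusion and late fusion (the two strategies that also appear in the benchmark listing) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the random stand-in features, the single linear classifier, and the equal-weight averaging are all assumptions made for illustration, and the stacking step mentioned in the abstract is not reproduced. The feature sizes 768 and 2048 are likewise only assumed as typical BERT-base and InceptionV3 embedding widths.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def early_fusion(text_feat, image_feat, W, b):
    """Early fusion: concatenate the per-modality feature vectors,
    then apply one shared linear classifier to the joint vector."""
    fused = np.concatenate([text_feat, image_feat], axis=-1)
    return softmax(fused @ W + b)


def late_fusion(text_logits, image_logits, text_weight=0.5):
    """Late fusion: classify each modality on its own, then take a
    weighted average of the two class-probability distributions."""
    p_text = softmax(text_logits)
    p_image = softmax(image_logits)
    return text_weight * p_text + (1.0 - text_weight) * p_image


# Hypothetical stand-ins for real encoder outputs (assumed sizes).
rng = np.random.default_rng(0)
batch, d_text, d_image, n_classes = 4, 768, 2048, 101

text_feat = rng.normal(size=(batch, d_text))
image_feat = rng.normal(size=(batch, d_image))

# Untrained, randomly initialised classifier weights (illustration only).
W = rng.normal(scale=0.01, size=(d_text + d_image, n_classes))
b = np.zeros(n_classes)

p_early = early_fusion(text_feat, image_feat, W, b)

# Random logits stand in for the outputs of two separate classifiers.
text_logits = rng.normal(size=(batch, n_classes))
image_logits = rng.normal(size=(batch, n_classes))
p_late = late_fusion(text_logits, image_logits)

print(p_early.shape, p_late.shape)
```

In early fusion the classifier sees both modalities at once and can learn interactions between them, whereas in late fusion each modality is classified independently and only the predictions are merged.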

Benchmarks

Benchmark | Methodology | Metrics
image-classification-on-food-101-1 | Inception V3 | Accuracy (%): 71.67
multimodal-text-and-image-classification-on-1 | Early Fusion (Bert + InceptionV3) | Accuracy (%): 92.5
multimodal-text-and-image-classification-on-1 | Late Fusion (Bert + InceptionV3) | Accuracy (%): 84.59
