Riccardo La Grassa, Nicola Landro, Gianmarco Ria, Ignazio Gallo

Abstract
The modern digital world is becoming increasingly multimodal. On the internet, images are often accompanied by text, so classification problems involving these two modalities are very common. In this paper, we examine multimodal classification using textual information and visual representations of the same concept. We investigate two basic methods for multimodal fusion and adapt them with stacking techniques to better handle this type of problem. We use UPMC Food-101, a difficult and noisy multimodal dataset that is representative of this category of multimodal problems. Our results show that the proposed early fusion technique, combined with a stacking-based approach, exceeds the state of the art on this dataset.
Benchmarks
| Benchmark | Methodology | Accuracy (%) |
|---|---|---|
| image-classification-on-food-101-1 | Inception V3 | 71.67 |
| multimodal-text-and-image-classification-on-1 | Early Fusion (Bert + InceptionV3) | 92.5 |
| multimodal-text-and-image-classification-on-1 | Late Fusion (Bert + InceptionV3) | 84.59 |
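
The difference between the early and late fusion rows above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature sizes, the random linear heads, and the helper names (`make_linear_head`, `early_fusion`, `late_fusion`) are all hypothetical stand-ins, probability averaging is only one common way to do late fusion, and the stacking step described in the abstract is not reproduced. In a real pipeline the feature vectors would presumably come from a text encoder such as BERT and an image encoder such as InceptionV3.

```python
import math
import random

random.seed(0)

# Toy sizes chosen for illustration only (assumptions, not values from the paper).
NUM_CLASSES = 3   # UPMC Food-101 itself has 101 classes
TEXT_DIM = 4      # stand-in for the text encoder's feature size
IMAGE_DIM = 6     # stand-in for the image encoder's feature size


def softmax(logits):
    """Turn raw scores into probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear_head(in_dim, out_dim):
    """Hypothetical classifier head: random weights stand in for trained ones."""
    weights = [[random.gauss(0.0, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def head(features):
        return [sum(w * x for w, x in zip(row, features)) for row in weights]

    return head


def early_fusion(text_feats, image_feats, joint_head):
    """Early fusion: concatenate the features, then classify once."""
    fused = text_feats + image_feats  # list concatenation
    return softmax(joint_head(fused))


def late_fusion(text_feats, image_feats, text_head, image_head):
    """Late fusion: classify each modality separately, then average the probabilities."""
    text_probs = softmax(text_head(text_feats))
    image_probs = softmax(image_head(image_feats))
    return [(t + i) / 2.0 for t, i in zip(text_probs, image_probs)]


# Random vectors stand in for encoder outputs for a single example.
text_feats = [random.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]
image_feats = [random.gauss(0.0, 1.0) for _ in range(IMAGE_DIM)]

early_probs = early_fusion(
    text_feats, image_feats, make_linear_head(TEXT_DIM + IMAGE_DIM, NUM_CLASSES)
)
late_probs = late_fusion(
    text_feats,
    image_feats,
    make_linear_head(TEXT_DIM, NUM_CLASSES),
    make_linear_head(IMAGE_DIM, NUM_CLASSES),
)

print("early fusion:", early_probs)
print("late fusion: ", late_probs)
```

The structural difference is where the modalities meet. In early fusion, a single head sees both feature vectors and can learn interactions between them. In late fusion, each head sees only its own modality and the combination happens at the level of predictions.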