Vaclav Kosar; Antonín Hoskovec; Milan Šulc; Radek Bartyzal

Abstract
We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in one of 13 languages. The categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset poses a challenging fine-grained classification problem: the best-scoring EmbraceNet model, using both visual and textual features, achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text. The dataset, source code and model checkpoints are published at https://github.com/glami/glami-1m
Benchmarks
| Benchmark | Methodology | Top-1 accuracy (%) | Top-5 accuracy (%) |
|---|---|---|---|
| multi-lingual-image-text-classification-on | EmbraceNet (image+text) | 69.7 | 94.0 |
| multi-lingual-image-text-classification-on | CLIP (zero-shot image+text) | 32.3 | 74.5 |
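The top-1 and top-5 accuracy figures in the table can be read as the percentage of test items whose true class is among the 1 or 5 highest-scoring predicted classes. The following is a minimal sketch of that metric, not the authors' evaluation code: the function name `top_k_accuracy`, the toy scores, and the toy labels are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Return the percentage of samples whose true label is among
    the k highest-scoring classes (hypothetical helper)."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy data (illustrative only): 4 samples, 4 classes.
toy_scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1: ranked first
    [0.50, 0.30, 0.15, 0.05],  # true class 1: ranked second
    [0.40, 0.30, 0.20, 0.10],  # true class 3: ranked last
    [0.05, 0.05, 0.20, 0.70],  # true class 3: ranked first
]
toy_labels = [1, 1, 3, 3]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(f"top-1: {top1:.1f}%  top-2: {top2:.1f}%")
```

With 191 classes, as in GLAMI-1M, the same function would be called with `k=1` and `k=5` on per-class scores for each test item.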