Ang Li; Allan Jabri; Armand Joulin; Laurens van der Maaten

Abstract
Real-world image recognition systems need to recognize tens of thousands of classes that constitute a plethora of visual concepts. The traditional approach of annotating thousands of images per class for training is infeasible in such a scenario, prompting the use of webly supervised data. This paper explores the training of image-recognition systems on large numbers of images and associated user comments. In particular, we develop visual n-gram models that can predict arbitrary phrases that are relevant to the content of an image. Our visual n-gram models are feed-forward convolutional networks trained using new loss functions that are inspired by n-gram models commonly used in language modeling. We demonstrate the merits of our models in phrase prediction, phrase-based image retrieval, relating images and captions, and zero-shot transfer.
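To make the idea of an n-gram-inspired loss concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `extract_ngrams` and `ngram_softmax_loss`, the limit `max_n`, and the use of a plain softmax over a fixed n-gram dictionary are all hypothetical choices made here. In the setting the abstract describes, the per-n-gram scores would come from a convolutional network applied to the image; in this sketch they are simply passed in as a dictionary.

```python
import math
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def extract_ngrams(tokens: Sequence[str], max_n: int = 2) -> List[NGram]:
    """Return all contiguous n-grams of length 1..max_n.

    Results are ordered by n first, then by position in the token sequence.
    """
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def ngram_softmax_loss(scores: Dict[NGram, float],
                       observed: Sequence[NGram]) -> float:
    """Negative log-likelihood of the observed n-grams.

    `scores` maps every n-gram in a fixed dictionary to a real-valued score
    (assumed here to be produced by an image model). A softmax over the whole
    dictionary turns scores into probabilities. Observed n-grams that are not
    in the dictionary are skipped (an assumption of this sketch).
    """
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_score = max(scores.values())
    log_z = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores.values())
    )
    return -sum(scores[g] - log_z for g in observed if g in scores)
```

As a usage example, a comment tokenized as `["golden", "gate", "bridge"]` yields three unigrams and two bigrams with `max_n=2`; passing those to `ngram_softmax_loss` together with a score dictionary penalizes the model when the n-grams that actually appear in the comment receive low probability.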
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| zero-shot-transfer-image-classification-on | Visual N-Grams | Accuracy: 72.4 |
| zero-shot-transfer-image-classification-on-2 | Visual N-Grams | Accuracy: 23.0 |