8 months ago

Hexiang Hu Yi Luan Yang Chen Urvashi Khandelwal Mandar Joshi Kenton Lee Kristina Toutanova Ming-Wei Chang

Abstract

Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong generalization on various visual domains and tasks. However, existing image classification benchmarks often evaluate recognition on a specific domain (e.g., outdoor images) or a specific task (e.g., classifying plant species), which falls short of evaluating whether pre-trained foundational models are universal visual recognizers. To address this, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model needs to link an image onto a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets with all labels grounded onto one single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels. Our study on state-of-the-art pre-trained models reveals large headroom in generalizing to the massive-scale label space. We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning. We also find that existing pre-trained models yield different strengths: while PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities.

Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
