DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning
Zhuo Chen, Yufeng Huang, Jiaoyan Chen, Yuxia Geng, Wen Zhang, Yin Fang, Jeff Z. Pan, Huajun Chen

Abstract
Zero-shot learning (ZSL) aims to predict unseen classes whose samples never appear during training. Attributes, which are class-level annotations of visual characteristics, are among the most effective and widely used forms of semantic information for zero-shot image classification. However, current methods often fail to discriminate subtle visual distinctions between images, owing not only to the shortage of fine-grained annotations but also to attribute imbalance and co-occurrence. In this paper, we present DUET, a transformer-based end-to-end ZSL method that integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from images; (2) apply an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics against attribute co-occurrence and imbalance; and (3) propose a multi-task learning policy for considering multi-model objectives. DUET achieves state-of-the-art performance on three standard ZSL benchmarks and on a knowledge-graph-equipped ZSL benchmark. Its components are effective and its predictions are interpretable.
Benchmarks
| Benchmark | Methodology | Accuracy Seen | Accuracy Unseen | H | Average top-1 classification accuracy |
|---|---|---|---|---|---|
| zero-shot-learning-on-awa2 | DUET (Ours) | 84.7 | 63.7 | 72.7 | 69.9 |
| zero-shot-learning-on-cub-200-2011 | DUET | 72.8 | 62.9 | 67.5 | 72.3 |
| zero-shot-learning-on-sun-attribute | DUET (Ours) | 45.8 | 45.7 | 45.8 | 64.4 |
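
In generalized ZSL evaluation, H conventionally denotes the harmonic mean of the seen-class and unseen-class accuracies. The page itself does not define H, so the sketch below assumes that convention; the function name `harmonic_mean` and the `reported` dictionary are illustrative and not taken from the DUET code. It recomputes H from the rounded table values, so a difference of up to about 0.1 from the reported figure is expected.

```python
def harmonic_mean(seen: float, unseen: float) -> float:
    """Harmonic mean of seen and unseen accuracy (assumed definition of H)."""
    if seen + unseen == 0:
        return 0.0
    return 2 * seen * unseen / (seen + unseen)


# (Accuracy Seen, Accuracy Unseen, reported H) copied from the table above.
reported = {
    "AWA2": (84.7, 63.7, 72.7),
    "CUB-200-2011": (72.8, 62.9, 67.5),
    "SUN Attribute": (45.8, 45.7, 45.8),
}

for name, (seen, unseen, h) in reported.items():
    recomputed = harmonic_mean(seen, unseen)
    # Inputs are already rounded to one decimal, so small gaps are expected.
    print(f"{name}: recomputed H = {recomputed:.2f}, reported H = {h}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot reach a high H by doing well on seen classes alone.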