3 months ago

Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis

Abstract

Deep network models are often purely inductive during both training and inference on unseen data. When these models are used for prediction, they may fail to capture important semantic information and implicit dependencies within datasets. Recent advances have shown that combining multiple modalities in large-scale vision and language settings can improve understanding and generalization performance. However, as model size increases, fine-tuning and deployment become computationally expensive, even for a small number of downstream tasks. Moreover, it is still unclear how domain or prior modal knowledge can be specified in a backpropagation-friendly manner, especially in large-scale and noisy settings. To address these challenges, we propose a simplified alternative that combines features from pretrained deep networks with freely available explicit semantic knowledge. To remove irrelevant explicit knowledge that does not correspond well to the images, we introduce an implicit Differentiable Out-of-Distribution (OOD) detection layer. This layer performs outlier detection by solving for the fixed points of a differentiable function and backpropagating through the last iterate of the fixed-point solver. In practice, we apply our model to several vision and language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval, on different datasets. Our experiments show that it is possible to design models that perform similarly to state-of-the-art results but with significantly fewer samples and less training time. Our models and code are available here: https://github.com/ellenzhuwang/implicit_vkood
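
To make the fixed-point idea concrete, here is a minimal scalar sketch, not the authors' implementation. It assumes a hypothetical contractive update `f(z, x, w) = tanh(w * z + x)` and reads "backpropagating through the last iterate" as differentiating a single final solver step while treating the converged value as a constant. The function names, the choice of `tanh`, and the parameter values are all assumptions made for illustration; the abstract does not specify the actual function or solver. The exact implicit-function-theorem gradient is computed alongside for comparison.

```python
import math

def f(z, x, w):
    """Hypothetical update; |w| < 1 keeps it a contraction in z."""
    return math.tanh(w * z + x)

def solve_fixed_point(x, w, tol=1e-10, max_iter=500):
    """Plain forward iteration z <- f(z, x, w); no gradients are tracked."""
    z = 0.0
    for _ in range(max_iter):
        z_next = f(z, x, w)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z

def last_iterate_grad_w(z_star, x, w):
    """d/dw of one extra step f(z_star, x, w), holding z_star constant."""
    pre = w * z_star + x
    return (1.0 - math.tanh(pre) ** 2) * z_star

def implicit_grad_w(z_star, x, w):
    """Exact dz*/dw from the implicit function theorem, for comparison."""
    pre = w * z_star + x
    s = 1.0 - math.tanh(pre) ** 2   # derivative of tanh at pre
    df_dw = s * z_star              # partial of f with respect to w
    df_dz = s * w                   # partial of f with respect to z
    return df_dw / (1.0 - df_dz)

x, w = 0.3, 0.5                     # assumed toy input and parameter
z_star = solve_fixed_point(x, w)
g_last = last_iterate_grad_w(z_star, x, w)
g_exact = implicit_grad_w(z_star, x, w)
print(z_star, g_last, g_exact)
```

In this toy case the last-iterate gradient has the same sign as the exact one but a smaller magnitude, since it drops the `1 / (1 - df_dz)` factor. The appeal of the approximation is that it avoids solving a linear system, or unrolling the whole solver, in the backward pass.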

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| cross-modal-retrieval-on-coco-2014 | VK-OOD | Image-to-text R@1: 80.7, R@5: 95.1, R@10: 96.8; Text-to-image R@1: 62.9, R@5: 84.8, R@10: 92.8 |
| visual-question-answering-on-ok-vqa | VK-OOD | Accuracy: 52.4 |
| visual-question-answering-on-vqa-v2-test-dev | VK-OOD | Accuracy: 77.9 |
| visual-reasoning-on-nlvr2-dev | VK-OOD | Accuracy: 84.6 |
| zero-shot-cross-modal-retrieval-on-flickr30k | VK-OOD | Image-to-text R@1: 89.0, R@5: 99.2, R@10: 99.8; Text-to-image R@1: 77.2, R@5: 94.3, R@10: 98.2 |
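
The R@K figures in the benchmarks are recall-at-K scores. This page does not define them, so the sketch below assumes the usual retrieval convention: a query counts as a hit if at least one of its ground-truth matches appears among the top K ranked results, and R@K is the percentage of queries that are hits. The function names and the toy item identifiers are hypothetical.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant item is among the top-k results, else 0.0."""
    return 1.0 if any(r in relevant_ids for r in ranked_ids[:k]) else 0.0

def recall_at_k(rankings, relevants, k):
    """Percentage of queries with at least one relevant item in the top k."""
    hits = [hit_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return 100.0 * sum(hits) / len(hits)

# Toy data: three queries, each with a ranked result list and a ground-truth set.
rankings = [["c1", "c7", "c3"], ["c9", "c2", "c4"], ["c5", "c6", "c8"]]
relevants = [{"c1"}, {"c4"}, {"c0"}]

r_at_1 = recall_at_k(rankings, relevants, 1)
r_at_3 = recall_at_k(rankings, relevants, 3)
print(round(r_at_1, 1), round(r_at_3, 1))
```

On this toy data only the first query is a hit at K=1, and the first two are hits at K=3, so R@K can only grow as K increases.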
