MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion

Abstract

Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at https://github.com/ashkamath/mdetr.
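The abstract says the two modalities are fused at an early stage so that one transformer can reason over both. As a minimal sketch of that idea, assuming fusion means projecting image and text features to a shared width and concatenating them into one token sequence (the abstract does not spell out the mechanism), the step could look like the following. The names `linear` and `fuse_early`, the toy weights, and the feature sizes are all hypothetical and are not taken from the official repository.

```python
from typing import List

Vector = List[float]


def linear(x: Vector, weight: List[Vector]) -> Vector:
    """Apply a bias-free linear map; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def fuse_early(
    image_feats: List[Vector],
    text_feats: List[Vector],
    w_image: List[Vector],
    w_text: List[Vector],
) -> List[Vector]:
    """Project both modalities to a shared width, then concatenate them.

    Assumption: a joint encoder would consume the returned sequence, so
    every later layer can attend across image and text tokens.
    """
    image_tokens = [linear(f, w_image) for f in image_feats]
    text_tokens = [linear(f, w_text) for f in text_feats]
    return image_tokens + text_tokens


# Toy inputs: two image tokens of width 3 and one text token of width 2.
image_feats = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
text_feats = [[3.0, 4.0]]
w_image = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # maps width 3 to width 2
w_text = [[1.0, 0.0], [0.0, 1.0]]             # identity on width 2

fused = fuse_early(image_feats, text_feats, w_image, w_text)
print(len(fused), fused)
```

The point of the sketch is only the ordering: the modalities are merged before the shared encoder rather than after two separate encoders have finished.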

Code Repositories

b-faye/lightmdetr (PyTorch)
ashkamath/mdetr (Official, PyTorch)
thunlp/pevl (PyTorch)
AleDella/mdter_eval (PyTorch)
facebookresearch/multimodal (PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
generalized-referring-expression | MDETR | N-acc.: 36.1; Precision@(F1=1, IoU≥0.5): 41.5
phrase-grounding-on-flickr30k-entities-test | MDETR-ENB5 | R@1: 84.3; R@5: 93.9; R@10: 95.8
referring-expression-segmentation-on | MDETR ENB3 | Mean IoU: 53.7; Pr@0.5: 57.5; Pr@0.7: 39.9; Pr@0.9: 11.9
referring-image-matting-expression-based-on | MDETR (ResNet-101) | MAD: 0.0482; MAD(E): 0.0515; MSE: 0.0434; MSE(E): 0.0463; SAD: 84.70; SAD(E): 90.45
referring-image-matting-keyword-based-on | MDETR (ResNet-101) | MAD: 0.0183; MAD(E): 0.0190; MSE: 0.0137; MSE(E): 0.0141; SAD: 32.27; SAD(E): 33.52
referring-image-matting-refmatte-rw100-on | MDETR (ResNet-101) | MAD: 0.0751; MAD(E): 0.0779; MSE: 0.0675; MSE(E): 0.0700; SAD: 131.58; SAD(E): 136.59
visual-question-answering-on-clevr | MDETR | Accuracy: 99.7
visual-question-answering-on-clevr-humans | MDETR | Accuracy: 81.7
visual-question-answering-on-gqa-test-std | MDETR-ENB5 | Accuracy: 62.45
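
Several of the metrics above (Mean IoU, Pr@0.5, R@1) are built on intersection over union. The sketch below shows one common way such numbers are computed, for readers unfamiliar with the notation. It is an illustration and not the evaluation code behind the reported scores. It makes several assumptions: boxes are axis-aligned `(x_min, y_min, x_max, y_max)` tuples, a sample counts as a hit when its IoU is at least the threshold, and recall at k accepts a phrase if any of its top-k boxes reaches IoU 0.5. The function names are hypothetical, and segmentation benchmarks compute IoU over masks rather than boxes.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(ious: Sequence[float]) -> float:
    """Average per-sample IoU, as a percentage."""
    return 100.0 * sum(ious) / len(ious)


def precision_at_iou(ious: Sequence[float], threshold: float) -> float:
    """Percentage of samples whose IoU is at least `threshold` (Pr@threshold)."""
    return 100.0 * sum(iou >= threshold for iou in ious) / len(ious)


def recall_at_k(
    ranked_preds: Sequence[List[Box]],
    targets: Sequence[Box],
    k: int,
    iou_threshold: float = 0.5,
) -> float:
    """Percentage of phrases with a matching box among the top-k predictions."""
    hits = sum(
        any(box_iou(p, gt) >= iou_threshold for p in preds[:k])
        for preds, gt in zip(ranked_preds, targets)
    )
    return 100.0 * hits / len(targets)


# Toy data: one phrase, two ranked candidate boxes, one ground-truth box.
preds = [[(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 2.0, 2.0)]]
gts = [(0.0, 0.0, 2.0, 2.0)]
print(recall_at_k(preds, gts, k=1), recall_at_k(preds, gts, k=2))
```

In the toy data the top-ranked box overlaps the target with IoU 0.25, so it misses at k=1, while the second box matches exactly and is counted at k=2.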
