Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion

Abstract
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at https://github.com/ashkamath/mdetr.
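To make the early-fusion idea in the abstract concrete, below is a minimal PyTorch sketch of a DETR-style detector whose transformer attends over a single sequence of image and text tokens, with object queries predicting boxes plus an alignment score against the text tokens. This is an illustrative simplification, not the authors' implementation: the module names and sizes are assumptions, a plain `nn.Embedding` stands in for the pre-trained text encoder used by MDETR, and positional encodings and the matching/contrastive losses are omitted for brevity.

```python
# Sketch of early fusion: image tokens and text tokens share one transformer.
import torch
import torch.nn as nn
import torchvision


class EarlyFusionDetector(nn.Module):
    def __init__(self, d_model=256, num_queries=100, vocab_size=30522):
        super().__init__()
        # CNN backbone; its final feature map is flattened into visual tokens.
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)

        # Placeholder text embedding (MDETR uses a pre-trained language model).
        self.text_embed = nn.Embedding(vocab_size, d_model)

        # One transformer jointly reasons over both modalities.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=6,
            num_decoder_layers=6, batch_first=True,
        )
        self.query_embed = nn.Embedding(num_queries, d_model)

        # Box head; instead of a fixed label set, each query is scored
        # against the text tokens (no closed vocabulary).
        self.bbox_head = nn.Linear(d_model, 4)

    def forward(self, images, text_ids):
        # images: (B, 3, H, W); text_ids: (B, L)
        feat = self.input_proj(self.backbone(images))      # (B, d, h, w)
        img_tokens = feat.flatten(2).transpose(1, 2)        # (B, h*w, d)
        txt_tokens = self.text_embed(text_ids)              # (B, L, d)

        # Early fusion: concatenate the two modalities into one sequence.
        # (Positional encodings omitted in this sketch.)
        src = torch.cat([img_tokens, txt_tokens], dim=1)

        queries = self.query_embed.weight.unsqueeze(0).expand(
            images.size(0), -1, -1)
        hs = self.transformer(src, queries)                  # (B, Q, d)

        boxes = self.bbox_head(hs).sigmoid()                 # normalized cxcywh
        align = hs @ txt_tokens.transpose(1, 2)              # (B, Q, L) alignment
        return boxes, align


if __name__ == "__main__":
    model = EarlyFusionDetector()
    imgs = torch.randn(2, 3, 224, 224)
    text = torch.randint(0, 30522, (2, 12))
    boxes, align = model(imgs, text)
    print(boxes.shape, align.shape)  # torch.Size([2, 100, 4]) torch.Size([2, 100, 12])
```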
Code Repositories
https://github.com/ashkamath/mdetr
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Generalized Referring Expression Comprehension | MDETR | N-acc.: 36.1; Precision@(F1=1, IoU≥0.5): 41.5 |
| Phrase Grounding on Flickr30k Entities (test) | MDETR-ENB5 | R@1: 84.3; R@5: 93.9; R@10: 95.8 |
| Referring Expression Segmentation | MDETR-ENB3 | Mean IoU: 53.7; Pr@0.5: 57.5; Pr@0.7: 39.9; Pr@0.9: 11.9 |
| Referring Image Matting (expression-based) | MDETR (ResNet-101) | MAD: 0.0482; MAD(E): 0.0515; MSE: 0.0434; MSE(E): 0.0463; SAD: 84.70; SAD(E): 90.45 |
| Referring Image Matting (keyword-based) | MDETR (ResNet-101) | MAD: 0.0183; MAD(E): 0.0190; MSE: 0.0137; MSE(E): 0.0141; SAD: 32.27; SAD(E): 33.52 |
| Referring Image Matting (RefMatte-RW100) | MDETR (ResNet-101) | MAD: 0.0751; MAD(E): 0.0779; MSE: 0.0675; MSE(E): 0.0700; SAD: 131.58; SAD(E): 136.59 |
| Visual Question Answering on CLEVR | MDETR | Accuracy: 99.7 |
| Visual Question Answering on CLEVR-Humans | MDETR | Accuracy: 81.7 |
| Visual Question Answering on GQA (test-std) | MDETR-ENB5 | Accuracy: 62.45 |
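For reference, the grounding metrics in the table follow standard definitions: R@k counts a sample as correct when any of the top-k predicted boxes reaches IoU ≥ 0.5 with the ground truth, and Pr@t for segmentation is the analogous precision at mask-IoU threshold t. The sketch below is an illustrative reimplementation of box IoU and R@k with function names of our own choosing, not the evaluation code behind the leaderboard numbers.

```python
# Standard box IoU and Recall@k, for illustration only.
import torch


def box_iou(boxes1, boxes2):
    """Pairwise IoU between two sets of boxes in (x1, y1, x2, y2) format."""
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])   # (N, M, 2)
    rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])   # (N, M, 2)
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area1[:, None] + area2[None, :] - inter)


def recall_at_k(pred_boxes, scores, gt_box, k, iou_thresh=0.5):
    """R@k: True if any of the k highest-scoring predictions overlaps the
    ground-truth box with IoU >= iou_thresh."""
    topk = scores.topk(min(k, scores.numel())).indices
    ious = box_iou(pred_boxes[topk], gt_box[None])
    return bool((ious >= iou_thresh).any())
```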