Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion

Abstract
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at https://github.com/ashkamath/mdetr.
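To make the early-fusion idea in the abstract concrete, below is a minimal PyTorch sketch of a DETR-style detector whose transformer attends over a single sequence of image and text tokens, with object queries predicting boxes plus an alignment score against the text tokens. This is an illustrative simplification, not the authors' implementation: the module names and sizes are assumptions, a plain `nn.Embedding` stands in for the pre-trained text encoder used by MDETR, and positional encodings and the matching/contrastive losses are omitted for brevity.

```python
# Sketch of early fusion: image tokens and text tokens share one transformer.
import torch
import torch.nn as nn
import torchvision


class EarlyFusionDetector(nn.Module):
    def __init__(self, d_model=256, num_queries=100, vocab_size=30522):
        super().__init__()
        # CNN backbone; its final feature map is flattened into visual tokens.
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)

        # Placeholder text embedding (MDETR uses a pre-trained language model).
        self.text_embed = nn.Embedding(vocab_size, d_model)

        # One transformer jointly reasons over both modalities.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=6,
            num_decoder_layers=6, batch_first=True,
        )
        self.query_embed = nn.Embedding(num_queries, d_model)

        # Box head; instead of a fixed label set, each query is scored
        # against the text tokens (no closed vocabulary).
        self.bbox_head = nn.Linear(d_model, 4)

    def forward(self, images, text_ids):
        # images: (B, 3, H, W); text_ids: (B, L)
        feat = self.input_proj(self.backbone(images))      # (B, d, h, w)
        img_tokens = feat.flatten(2).transpose(1, 2)        # (B, h*w, d)
        txt_tokens = self.text_embed(text_ids)              # (B, L, d)

        # Early fusion: concatenate the two modalities into one sequence.
        # (Positional encodings omitted in this sketch.)
        src = torch.cat([img_tokens, txt_tokens], dim=1)

        queries = self.query_embed.weight.unsqueeze(0).expand(
            images.size(0), -1, -1)
        hs = self.transformer(src, queries)                  # (B, Q, d)

        boxes = self.bbox_head(hs).sigmoid()                 # normalized cxcywh
        align = hs @ txt_tokens.transpose(1, 2)              # (B, Q, L) alignment
        return boxes, align


if __name__ == "__main__":
    model = EarlyFusionDetector()
    imgs = torch.randn(2, 3, 224, 224)
    text = torch.randint(0, 30522, (2, 12))
    boxes, align = model(imgs, text)
    print(boxes.shape, align.shape)  # torch.Size([2, 100, 4]) torch.Size([2, 100, 12])
```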
Code Repositories
https://github.com/ashkamath/mdetr
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Generalized Referring Expression Comprehension | MDETR | N-acc.: 36.1; Precision@(F1=1, IoU≥0.5): 41.5 |
| Phrase Grounding on Flickr30k Entities (test) | MDETR-ENB5 | R@1: 84.3; R@5: 93.9; R@10: 95.8 |
| Referring Expression Segmentation | MDETR-ENB3 | Mean IoU: 53.7; Pr@0.5: 57.5; Pr@0.7: 39.9; Pr@0.9: 11.9 |
| Referring Image Matting (expression-based) | MDETR (ResNet-101) | MAD: 0.0482; MAD(E): 0.0515; MSE: 0.0434; MSE(E): 0.0463; SAD: 84.70; SAD(E): 90.45 |
| Referring Image Matting (keyword-based) | MDETR (ResNet-101) | MAD: 0.0183; MAD(E): 0.0190; MSE: 0.0137; MSE(E): 0.0141; SAD: 32.27; SAD(E): 33.52 |
| Referring Image Matting (RefMatte-RW100) | MDETR (ResNet-101) | MAD: 0.0751; MAD(E): 0.0779; MSE: 0.0675; MSE(E): 0.0700; SAD: 131.58; SAD(E): 136.59 |
| Visual Question Answering on CLEVR | MDETR | Accuracy: 99.7 |
| Visual Question Answering on CLEVR-Humans | MDETR | Accuracy: 81.7 |
| Visual Question Answering on GQA (test-std) | MDETR-ENB5 | Accuracy: 62.45 |
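For reference, the grounding metrics in the table follow standard definitions: R@k counts a sample as correct when any of the top-k predicted boxes reaches IoU ≥ 0.5 with the ground truth, and Pr@t for segmentation is the analogous precision at mask-IoU threshold t. The sketch below is an illustrative reimplementation of box IoU and R@k with function names of our own choosing, not the evaluation code behind the leaderboard numbers.

```python
# Standard box IoU and Recall@k, for illustration only.
import torch


def box_iou(boxes1, boxes2):
    """Pairwise IoU between two sets of boxes in (x1, y1, x2, y2) format."""
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])   # (N, M, 2)
    rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])   # (N, M, 2)
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area1[:, None] + area2[None, :] - inter)


def recall_at_k(pred_boxes, scores, gt_box, k, iou_thresh=0.5):
    """R@k: True if any of the k highest-scoring predictions overlaps the
    ground-truth box with IoU >= iou_thresh."""
    topk = scores.topk(min(k, scores.numel())).indices
    ious = box_iou(pred_boxes[topk], gt_box[None])
    return bool((ious >= iou_thresh).any())
```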