8 months ago

Zi-Yi Dou Aishwarya Kamath Zhe Gan Pengchuan Zhang Jianfeng Wang Linjie Li Zicheng Liu Ce Liu Yann LeCun Nanyun Peng

Abstract

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is available at https://github.com/microsoft/FIBER.
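To make the idea of "inserting cross-attention into the image and text backbones" concrete, the following is a minimal NumPy sketch, not the FIBER implementation. It assumes single-head attention with no layer normalization or MLP, and the names `fused_block`, `init_params`, and the scalar gate `alpha` are hypothetical choices for illustration; the abstract does not specify any of them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values, w_q, w_k, w_v):
    # Single-head scaled dot-product attention.
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def fused_block(x, other, params, alpha):
    # One backbone block (hypothetical layout): self-attention over the
    # block's own modality, then cross-attention to the other modality.
    # `alpha` is an assumed scalar gate on the cross-modal residual;
    # alpha = 0 reduces the block to a purely uni-modal one.
    x = x + attention(x, x, *params["self"])
    x = x + alpha * attention(x, other, *params["cross"])
    return x

def init_params(rng, dim):
    # Random projection matrices (query, key, value) for each attention.
    def make():
        return tuple(rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
    return {"self": make(), "cross": make()}

rng = np.random.default_rng(0)
dim = 16
image_tokens = rng.normal(size=(49, dim))  # e.g. a 7x7 grid of patch features
text_tokens = rng.normal(size=(12, dim))   # e.g. 12 word-piece embeddings
img_params = init_params(rng, dim)
txt_params = init_params(rng, dim)

# Each backbone attends to the other inside the block, so no separate
# fusion layers are stacked after the uni-modal encoders.
fused_image = fused_block(image_tokens, text_tokens, img_params, alpha=0.5)
fused_text = fused_block(text_tokens, image_tokens, txt_params, alpha=0.5)
```

In this sketch each modality keeps its own sequence length, and the only coupling between the two backbones is the cross-attention residual, which is the contrast the abstract draws with designs that add dedicated fusion layers on top of the uni-modal backbones.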

Visual Question Answering

Multimodal Representation

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
