8 months ago

Multimodal Representation

Multi-Task Learning

Method/Architecture

Siddharth Srivastava, Gaurav Sharma

Abstract

Most research on learning-based methods has focused on designing and training networks for specific tasks. However, many learning-based tasks, across modalities, share commonalities and could potentially be tackled in a joint framework. We present an approach in this direction, learning multiple tasks, in multiple modalities, with a unified architecture. The proposed network is composed of task-specific encoders, a common trunk in the middle, followed by task-specific prediction heads. We first pre-train it with self-supervised masked training, followed by sequential training for the different tasks. We train the network on all major modalities, e.g. visual, audio, text and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, and this allows us to achieve state-of-the-art results on most of the benchmarks. We also show generalization of the trained network on cross-modal tasks as well as unseen datasets and tasks.
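The layout described in the abstract (task-specific encoders, a shared trunk, task-specific heads) can be illustrated with a toy sketch. This is not the authors' implementation: the class name `UnifiedNetwork`, the task names, the dimensions, and the use of plain linear layers with ReLU are all assumptions made for illustration, and the abstract does not specify the actual layer types. The sketch only shows how inputs of different shapes can be routed through one common set of trunk weights.

```python
import random


def linear(in_dim, out_dim, rng):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, x, relu=True):
    """Matrix-vector product, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, v) for v in out] if relu else out


class UnifiedNetwork:
    """Hypothetical toy model: per-task encoders, one shared trunk, per-task heads."""

    def __init__(self, input_dims, output_dims, trunk_dim=8, seed=0):
        rng = random.Random(seed)
        # One encoder per task maps that task's input size to the trunk width.
        self.encoders = {t: linear(d, trunk_dim, rng) for t, d in input_dims.items()}
        # A single trunk is shared by every task (assumed to be one layer here).
        self.trunk = linear(trunk_dim, trunk_dim, rng)
        # One head per task maps trunk features to that task's output size.
        self.heads = {t: linear(trunk_dim, d, rng) for t, d in output_dims.items()}

    def forward(self, task, x):
        h = apply(self.encoders[task], x)   # task-specific encoding
        h = apply(self.trunk, h)            # shared computation
        return apply(self.heads[task], h, relu=False)  # task-specific prediction


# Illustrative task names and sizes (assumptions, not from the paper).
net = UnifiedNetwork(
    input_dims={"image_cls": 12, "audio_cls": 6},
    output_dims={"image_cls": 3, "audio_cls": 2},
)
image_out = net.forward("image_cls", [0.5] * 12)
audio_out = net.forward("audio_cls", [0.1] * 6)
```

Both calls pass through the same `net.trunk` weights, which is the part of such a design where information sharing across modalities could occur. The pre-training and sequential training stages mentioned in the abstract are not modeled here.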

