Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training

Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, Jianfeng Gao

Abstract

Learning to navigate in a visual environment following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable and the training data for a new task is often limited. In this paper, we present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large number of image-text-action triplets in a self-supervised manner, the pre-trained model provides generic representations of visual environments and language instructions. It can be used as a drop-in component for existing VLN frameworks, leading to the proposed agent, called Prevalent. The agent learns more effectively in new tasks and generalizes better to previously unseen environments. The performance is validated on three VLN tasks. On the Room-to-Room benchmark, our model improves the state of the art from 47% to 51% on success rate weighted by path length (SPL). Further, the learned representation is transferable to other VLN tasks. On two recent tasks, vision-and-dialog navigation and "Help, Anna!", the proposed Prevalent leads to significant improvement over existing methods, achieving a new state of the art.
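
The abstract reports results as success rate weighted by path length but does not define the metric. The sketch below assumes the commonly used definition: each successful episode contributes the ratio of the shortest path length to the longer of the agent's path and the shortest path, failed episodes contribute zero, and the result is averaged over all episodes. The `Episode` class, its field names, and the sample values are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical container for illustration)."""
    success: bool          # True if the agent stopped close enough to the goal
    shortest_path: float   # length of the shortest start-to-goal path (assumed > 0)
    agent_path: float      # length of the path the agent actually travelled


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by path length, averaged over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if ep.success:
            # A successful agent is penalised only when its path is longer
            # than the shortest path; the ratio is capped at 1.
            total += ep.shortest_path / max(ep.agent_path, ep.shortest_path)
    return total / len(episodes)


# Made-up episodes: two optimal successes, one success with a path twice
# as long as necessary, and one failure.
episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),
    Episode(success=True, shortest_path=10.0, agent_path=20.0),
    Episode(success=False, shortest_path=10.0, agent_path=5.0),
    Episode(success=True, shortest_path=8.0, agent_path=8.0),
]
score = spl(episodes)
print(f"SPL = {score:.3f}")
```

Under this definition the score is a fraction between 0 and 1, so a value of 0.51 and a value of 51% describe the same result on different scales.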

Code Repositories

weituo12321/PREVALENT (official)

Benchmarks

Benchmark | Methodology | Metrics
visual-navigation-on-help-anna-hanna-1 | Prevalent | SPL: 28.72
visual-navigation-on-room-to-room-1 | Prevalent | SPL: 0.51
