3 months ago

NVILA: Efficient Frontier Visual Language Models

Abstract

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.
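The "scale-then-compress" idea can be illustrated with a small token-count sketch. The abstract does not say which compression operator NVILA uses, so the code below assumes simple average pooling over non-overlapping windows of a patch-token grid. The function names, the patch size of 14, and the image sizes are hypothetical values chosen for illustration; they are not taken from the paper or its code.

```python
from typing import List, Tuple

Vector = List[float]
Grid = List[List[Vector]]  # rows x cols grid of token feature vectors


def patch_grid_size(height: int, width: int, patch: int) -> Tuple[int, int]:
    """Rows and columns of patch tokens for an image (assumes exact tiling)."""
    return height // patch, width // patch


def compress_tokens(grid: Grid, window: int) -> Grid:
    """Average each window x window block of tokens into one token.

    Average pooling is an assumption made for this sketch, not the
    operator described in the paper.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % window or cols % window:
        raise ValueError("grid dimensions must be divisible by window")
    dim = len(grid[0][0])
    out: Grid = []
    for r in range(0, rows, window):
        out_row = []
        for c in range(0, cols, window):
            acc = [0.0] * dim
            for dr in range(window):
                for dc in range(window):
                    for k, value in enumerate(grid[r + dr][c + dc]):
                        acc[k] += value
            out_row.append([a / (window * window) for a in acc])
        out.append(out_row)
    return out


def token_count(grid: Grid) -> int:
    """Total number of tokens in the grid."""
    return sum(len(row) for row in grid)


if __name__ == "__main__":
    # Hypothetical sizes: doubling each side of the image quadruples the
    # token count; 2x2 pooling then brings it back down by a factor of 4.
    base_rows, base_cols = patch_grid_size(448, 448, patch=14)
    big_rows, big_cols = patch_grid_size(896, 896, patch=14)
    print("base tokens:", base_rows * base_cols)
    print("scaled tokens:", big_rows * big_cols)
    print("after 2x2 compression:", (big_rows // 2) * (big_cols // 2))
```

Under these assumptions, scaling up the resolution first lets the encoder see finer detail, while the pooling step keeps the number of tokens passed to the language model close to the low-resolution budget.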

Code Repositories

efficient-large-model/vila (PyTorch, mentioned in GitHub)
nvlabs/vila (PyTorch, mentioned in GitHub)

Benchmarks

Benchmark: video-question-answering-on-next-qa
Methodology: NVILA (8B)
Metrics: Accuracy: 82.2
