ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

Lei Weixian ; Ge Yixiao ; Zhang Jianfeng ; Sun Dylan ; Yi Kun ; Shan Ying ; Shou Mike Zheng

Abstract

Despite the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio, etc.) is limited by the need for large-scale data, which is expensive or even unavailable for rare modalities. In this paper, we present ViT-Lens, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, a modality-specific lens is tuned to project multimodal signals into a shared embedding space; the projected signals are then processed by a strong ViT that carries pre-trained image knowledge. The encoded multimodal representations are optimized toward alignment with the modality-independent space, pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning over an increasing number of modalities, with two appealing benefits: (i) it exploits the pretrained ViT effectively across tasks and domains in an efficient data regime; (ii) novel modalities gain emergent downstream capabilities thanks to the modality alignment space. We evaluate ViT-Lens in the context of 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over the previous state of the art, showing 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore, we enable zero-shot 3D question-answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release the results of ViT-Lens on more modalities in the near future.
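
To illustrate what zero-shot classification in a shared embedding space can look like, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the three-dimensional vectors, and the class labels are all made up for illustration, and it assumes cosine similarity as the matching score, as is common for CLIP-style models. In practice the query embedding would come from the lens plus ViT encoder, and the class embeddings from a foundation model's text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(query_embedding, class_embeddings):
    """Return the best-matching label and the similarity score of every label.

    class_embeddings maps label -> embedding in the shared space
    (for example, text embeddings of class names; assumed here).
    """
    scores = {label: cosine(query_embedding, emb)
              for label, emb in class_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy, made-up embeddings; real ones would be produced by the models.
class_embeddings = {
    "chair": [0.9, 0.1, 0.0],
    "table": [0.1, 0.9, 0.0],
    "lamp": [0.0, 0.1, 0.9],
}
point_cloud_embedding = [0.8, 0.3, 0.1]  # hypothetical output of lens + ViT

label, scores = zero_shot_classify(point_cloud_embedding, class_embeddings)
print(label)
```

Because the labels are matched purely by similarity in the shared space, no 3D-specific classifier has to be trained; this is the property that makes a pre-defined alignment space attractive for new modalities.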

Code Repositories

TencentARC/ViT-Lens (official, PyTorch)

Benchmarks

Benchmark	Methodology	Metrics
zero-shot-transfer-3d-point-cloud	ViT-Lens	Accuracy (%): 87.6
zero-shot-transfer-3d-point-cloud-2	ViT-Lens	OBJ_ONLY Accuracy(%): 60.1
