
Abstract

Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLM requires expensive training costs, typically in hundreds of GPU-hours on A100, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which can leverage the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, which is up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12 increase on GPT-4 evaluation score for the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800. We are the first to explore the efficient 3D-LLM, offering new insights to the community. Code and weights are available at https://github.com/TangYuan96/MiniGPT-3D.
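
The parameter savings quoted above can be made concrete with a small back-of-the-envelope sketch. It is not taken from the MiniGPT-3D code: the layer size and LoRA rank are hypothetical values chosen for illustration, and the helper names are ours. The only figures taken from the abstract are 47.8M learnable parameters and the 260x factor. The sketch relies on the standard LoRA construction, in which a frozen weight of shape d_in x d_out is adapted by two trainable low-rank factors of shapes d_in x r and r x d_out.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter: factors of shape
    (d_in x rank) and (rank x d_out); the base weight stays frozen."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank (assumptions, not from the paper).
D_IN, D_OUT, RANK = 4096, 4096, 16

full = full_finetune_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)
print(f"full fine-tuning: {full:,} trainable parameters per layer")
print(f"LoRA (rank {RANK}): {lora:,} trainable parameters per layer")

# Figures quoted in the abstract.
learnable_params = 47.8e6   # MiniGPT-3D learnable parameters
reduction_factor = 260      # "up to 260x fewer than existing methods"

# Size of the largest baseline implied by those two numbers.
implied_baseline_params = learnable_params * reduction_factor
print(f"implied largest baseline: {implied_baseline_params / 1e9:.1f}B parameters")
```

The implied baseline of roughly 12.4B learnable parameters is an inference from the two quoted figures, not a number stated in the abstract.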
