
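Abstract
This paper presents ShapeLLM, the first 3D multimodal large language model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built on an improved 3D encoder, obtained by extending ReCon to ReCon++, which uses multi-view image distillation to strengthen geometry understanding. With ReCon++ as the 3D point cloud input encoder for the LLM, ShapeLLM is trained on constructed instruction-following data and evaluated on 3D MM-Vet, our new human-curated benchmark. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and in language-unified 3D interaction tasks such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/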
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| 3d-object-captioning-on-objaverse-1 | ShapeLLM-13B | Sentence-BERT: 48.52; GPT-4: 48.94; SimCSE: 49.98 |
| 3d-object-captioning-on-objaverse-1 | ShapeLLM-7B | Sentence-BERT: 48.20; GPT-4: 46.92; SimCSE: 49.23 |
| 3d-point-cloud-classification-on-modelnet40 | ReCon++ | Overall Accuracy: 95.0 |
| 3d-point-cloud-classification-on-scanobjectnn | ReCon++ | OBJ-BG (OA): 98.80; OBJ-ONLY (OA): 97.59; Overall Accuracy: 95.25 |
| 3d-point-cloud-linear-classification-on | ReCon++ | Overall Accuracy: 93.6 |
| 3d-question-answering-3d-qa-on-3d-mm-vet | ShapeLLM-13B | Overall Accuracy: 53.1 |
| 3d-question-answering-3d-qa-on-3d-mm-vet | ShapeLLM-7B | Overall Accuracy: 47.4 |
| few-shot-3d-point-cloud-classification-on-1 | ReCon++ | Overall Accuracy: 98.0; Standard Deviation: 2.3 |
| few-shot-3d-point-cloud-classification-on-2 | ReCon++ | Overall Accuracy: 99.5; Standard Deviation: 0.8 |
| few-shot-3d-point-cloud-classification-on-3 | ReCon++ | Overall Accuracy: 94.5; Standard Deviation: 4.1 |
| few-shot-3d-point-cloud-classification-on-4 | ReCon++ | Overall Accuracy: 96.5; Standard Deviation: 3.0 |
| generative-3d-object-classification-on-1 | ShapeLLM-13B | Objaverse (Average): 54.00 |
| generative-3d-object-classification-on-1 | ShapeLLM-7B | Objaverse (Average): 54.50 |
| generative-3d-object-classification-on-2 | ShapeLLM-13B | ModelNet40 (Average): 52.96 |
| generative-3d-object-classification-on-2 | ShapeLLM-7B | ModelNet40 (Average): 53.08 |
| zero-shot-transfer-3d-point-cloud | ReCon++ | Accuracy (%): 87.3 |
| zero-shot-transfer-3d-point-cloud-2 | ReCon++ | OBJ_ONLY Accuracy(%): 65.4 |