Ding Runyu; Yang Jihan; Xue Chuhui; Zhang Wenqing; Bai Song; Qi Xiaojuan

Abstract
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough of 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios due to the inaccessibility of large-scale 3D-text pairs. To this end, we propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D, which allows explicitly associating 3D and semantic-rich captions. Further, to foster coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only remarkably outperforms baseline methods by 25.8% $\sim$ 44.7% hIoU and 14.5% $\sim$ 50.4% hAP$_{50}$ in open-vocabulary semantic and instance segmentation, but also shows robust transferability on challenging zero-shot domain transfer tasks. See the project website at https://dingry.github.io/projects/PLA.
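The abstract describes aligning 3D features with caption text embeddings via contrastive learning. The sketch below illustrates one plausible form of such a point-caption contrastive objective; the function name, pooling scheme, tensor shapes, and temperature are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of an InfoNCE-style point-caption contrastive loss.
# Assumes per-point features from a 3D backbone and caption embeddings from a
# frozen vision-language text encoder; the point-to-caption assignment would
# come from the multi-view images used to generate each caption.
import torch
import torch.nn.functional as F


def point_caption_contrastive_loss(point_feats, point_to_caption, caption_embeds,
                                   temperature=0.07):
    """Align pooled 3D region features with their caption text embeddings.

    point_feats:      (N, D) per-point embeddings.
    point_to_caption: (N,) index of the caption each point is associated with.
    caption_embeds:   (C, D) text embeddings of the C captions.
    """
    C, D = caption_embeds.shape

    # Average-pool point features over the 3D region associated with each caption.
    pooled = torch.zeros(C, D, device=point_feats.device)
    pooled.index_add_(0, point_to_caption, point_feats)
    counts = torch.bincount(point_to_caption, minlength=C).clamp(min=1).unsqueeze(1)
    pooled = pooled / counts

    # Cosine-similarity logits between 3D region features and caption embeddings.
    pooled = F.normalize(pooled, dim=-1)
    captions = F.normalize(caption_embeds, dim=-1)
    logits = pooled @ captions.t() / temperature

    # Symmetric cross-entropy: each region matches its own caption and vice versa.
    targets = torch.arange(C, device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy usage with random tensors: 1000 points, 8 captions, 512-dim embeddings.
    feats = torch.randn(1000, 512)
    assignment = torch.randint(0, 8, (1000,))
    captions = torch.randn(8, 512)
    print(point_caption_contrastive_loss(feats, assignment, captions).item())
```

In the paper, such pairs are formed hierarchically (scene-, view-, and entity-level), so a loss of this kind would be applied at multiple granularities rather than once as shown here.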
Code Repositories
Benchmarks
| Benchmark | Methodology | AP50 Base (B6/N6) | AP50 Base (B8/N4) | AP50 Novel (B6/N6) | AP50 Novel (B8/N4) |
|---|---|---|---|---|---|
| 3d-open-vocabulary-instance-segmentation-on-2 | PLA | 46.9 | 59.0 | 9.8 | 8.6 |