
Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, Baining Guo

Abstract

The use of pretrained backbones with fine-tuning has been successful for 2D vision and natural language processing tasks, showing advantages over task-specific networks. In this work, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets, but also outperforms state-of-the-art methods on downstream tasks with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validate the scalability, generality, and superior performance enabled by our approach. The code and models are available at https://github.com/microsoft/Swin3D.
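To make the idea of window-based self-attention on sparse voxels more concrete, the sketch below groups occupied voxel coordinates into non-overlapping cubic windows, optionally after a shift, so that attention could then be restricted to voxels sharing a window. This is an illustrative NumPy sketch only, not the authors' implementation: the function name `window_partition`, its parameters, and the dictionary output are assumptions made for this example, and it does not reproduce the memory-efficient attention or the contextual relative positional embedding described in the abstract.

```python
import numpy as np


def window_partition(coords, window_size, shift=0):
    """Group sparse voxel coordinates into non-overlapping cubic windows.

    Hypothetical helper for illustration; not part of the Swin3D codebase.

    coords:      (N, 3) integer coordinates of occupied voxels only.
    window_size: edge length of each cubic window, in voxels.
    shift:       offset added before partitioning (a shifted-window layer
                 would typically use window_size // 2).
    Returns a dict mapping window index (tuple of 3 ints) to an array of
    row indices into coords.
    """
    coords = np.asarray(coords, dtype=np.int64)
    # Floor division assigns each voxel to exactly one window,
    # and also behaves correctly for negative coordinates.
    win = np.floor_divide(coords + shift, window_size)
    # One row per occupied window; empty windows are never materialized.
    uniq, inverse = np.unique(win, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    groups = {}
    for w in range(len(uniq)):
        key = tuple(int(v) for v in uniq[w])
        groups[key] = np.flatnonzero(inverse == w)
    return groups


if __name__ == "__main__":
    voxels = [[0, 0, 0], [1, 1, 1], [4, 0, 0], [5, 5, 5]]
    for key, idx in window_partition(voxels, window_size=4).items():
        print(key, idx.tolist())
```

Because only occupied windows are stored, the bookkeeping grows with the number of voxels rather than with the volume of the scene; the per-window attention itself is left out of this sketch.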

Code Repositories

microsoft/swin3d (Official, PyTorch)

Pointcept/Pointcept (PyTorch, mentioned in GitHub)

Benchmarks

Benchmark	Method	Metrics
3d-object-detection-on-s3dis	Swin3D-L+FCAF3D	mAP@0.25: 72.1; mAP@0.5: 54.0
3d-object-detection-on-scannetv2	Swin3D-L+CAGroup3D	mAP@0.25: 76.4; mAP@0.5: 63.2
semantic-segmentation-on-s3dis	Swin3D-L	mIoU: 79.8; mAcc: 88.0; oAcc: 92.4; Number of params: N/A
semantic-segmentation-on-s3dis-area5	Swin3D-L	mIoU: 74.5; mAcc: 80.5; oAcc: 92.7; Number of params: N/A
semantic-segmentation-on-scannet	Swin3D-L	test mIoU: 77.9; val mIoU: 77.5
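For readers unfamiliar with the segmentation metric reported above, mean IoU averages the per-class intersection-over-union between predicted and ground-truth labels. The following is a minimal sketch of that standard definition, assuming integer class labels in the range 0 to num_classes - 1 and skipping classes that appear in neither prediction nor ground truth. The function name `mean_iou` is hypothetical, and the official benchmark scripts may differ in details such as ignored labels or per-scene aggregation.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union from per-point class labels.

    Hypothetical helper for illustration; assumes every label is an
    integer in [0, num_classes).
    """
    pred = np.asarray(pred, dtype=np.int64).ravel()
    target = np.asarray(target, dtype=np.int64).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    # Skip classes absent from both prediction and ground truth.
    valid = union > 0
    return float(np.mean(intersection[valid] / union[valid]))


if __name__ == "__main__":
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In the small example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.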
