Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang

Abstract
Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources such as images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet it may also dilute the inner-modal attention weights and thereby undermine the final result. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To fuse multiple modalities effectively, TokenFusion dynamically detects uninformative tokens and substitutes them with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit use of the inter-modal alignments after fusion. This design allows the transformer to learn correlations among multimodal features while leaving the single-modal transformer architecture largely intact. Extensive experiments on a variety of homogeneous and heterogeneous modalities demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point clouds and images. Our code is available at https://github.com/yikaiw/TokenFusion.
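The substitution step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes two modalities whose tokens are already aligned position by position, takes the per-token importance scores as given inputs rather than learning them, and uses a hypothetical `threshold` value and hypothetical projection callables (`project_b_to_a`, `project_a_to_b`) chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]
Projection = Callable[[Sequence[float]], Token]


def fuse_tokens(
    tokens_a: Sequence[Sequence[float]],
    tokens_b: Sequence[Sequence[float]],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    project_b_to_a: Projection,
    project_a_to_b: Projection,
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> Tuple[List[Token], List[Token]]:
    """Replace low-scoring tokens with the projected token of the other modality.

    Assumes tokens_a[i] and tokens_b[i] describe the same spatial position.
    """
    if not (len(tokens_a) == len(tokens_b) == len(scores_a) == len(scores_b)):
        raise ValueError("tokens and scores must all have the same length")

    fused_a: List[Token] = []
    fused_b: List[Token] = []
    for tok_a, tok_b, s_a, s_b in zip(tokens_a, tokens_b, scores_a, scores_b):
        # Keep an informative token; otherwise substitute the projected
        # token from the other modality at the same position.
        fused_a.append(list(tok_a) if s_a >= threshold else project_b_to_a(tok_b))
        fused_b.append(list(tok_b) if s_b >= threshold else project_a_to_b(tok_a))
    return fused_a, fused_b


# Toy usage: identity projections stand in for learned projection layers.
identity: Projection = lambda tok: list(tok)
rgb = [[1.0, 2.0], [3.0, 4.0]]
depth = [[10.0, 20.0], [30.0, 40.0]]
fused_rgb, fused_depth = fuse_tokens(
    rgb, depth,
    scores_a=[0.9, 0.1],   # second RGB token is treated as uninformative
    scores_b=[0.2, 0.8],   # first depth token is treated as uninformative
    project_b_to_a=identity,
    project_a_to_b=identity,
)
```

In this toy run the second RGB token is replaced by the depth token at the same position, and the first depth token is replaced by the corresponding RGB token. Both decisions are made against the original, unfused tokens, so the result does not depend on the order in which the modalities are processed.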
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| 3d-object-detection-on-scannetv2 | TokenFusion | mAP@0.25: 70.8; mAP@0.5: 54.2 |
| 3d-object-detection-on-sun-rgbd-val | TokenFusion | mAP@0.25: 64.9; mAP@0.5: 48.3 |
| semantic-segmentation-on-deliver | TokenFusion (RGB-Depth) | mIoU: 60.25 |
| semantic-segmentation-on-deliver | TokenFusion (RGB-Event) | mIoU: 45.63 |
| semantic-segmentation-on-deliver | TokenFusion (RGB-LiDAR) | mIoU: 53.01 |
| semantic-segmentation-on-kitti-360 | TokenFusion (RGB-LiDAR) | mIoU: 54.55 |
| semantic-segmentation-on-kitti-360 | TokenFusion (RGB-Depth) | mIoU: 57.44 |
| semantic-segmentation-on-llrgbd-synthetic | TokenFusion (SegFormer-B2) | mIoU: 64.75 |
| semantic-segmentation-on-nyu-depth-v2 | TokenFusion (Ti) | Mean IoU: 53.3% |
| semantic-segmentation-on-nyu-depth-v2 | TokenFusion (S) | Mean IoU: 54.2% |
| semantic-segmentation-on-sun-rgbd | TokenFusion (S) | Mean IoU: 53.0% |
| semantic-segmentation-on-sun-rgbd | TokenFusion (Ti) | Mean IoU: 51.4% |