BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
Li Zhiqi ; Wang Wenhai ; Li Hongyang ; Xie Enze ; Sima Chonghao ; Lu Tong ; Yu Qiao ; Dai Jifeng

Abstract
3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves a new state-of-the-art 56.9% NDS on the nuScenes test set, which is 9.0 points higher than the previous best methods and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
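To make the data flow in the abstract concrete, here is a minimal NumPy sketch of grid-shaped BEV queries that first attend to the previous BEV map (the temporal step) and then to multi-camera features (the spatial step). It is an illustration under loud assumptions, not the authors' implementation: plain dense scaled dot-product attention without learned projections stands in for the paper's attention modules, the order of the two steps is assumed, camera features are random placeholders, and the names `attend` and `bev_step` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    # Single-head scaled dot-product attention with no learned weights
    # (an assumption made to keep the sketch short).
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values

def bev_step(bev_queries, camera_feats, prev_bev=None):
    # Hypothetical helper: one update of the BEV map for a single timestamp.
    q = bev_queries
    # Temporal step: each query attends to the current queries and,
    # when available, to the BEV map from the previous timestamp.
    context = q if prev_bev is None else np.concatenate([q, prev_bev], axis=0)
    q = q + attend(q, context)
    # Spatial step: each query gathers features from all camera views.
    # (Dense attention here; no region-of-interest sampling.)
    q = q + attend(q, camera_feats)
    return q

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8                               # toy BEV grid and channel size
bev_queries = rng.normal(size=(H * W, C))       # one query per BEV grid cell
prev_bev = None
for t in range(3):                              # three consecutive frames
    camera_feats = rng.normal(size=(6 * 10, C)) # 6 cameras x 10 tokens (placeholder)
    prev_bev = bev_step(bev_queries, camera_feats, prev_bev)

bev_map = prev_bev.reshape(H, W, C)             # back to the grid layout
print(bev_map.shape)
```

The point of the sketch is the recurrence: the BEV map produced at one timestamp is fed back as `prev_bev` at the next, so history is fused step by step rather than by stacking all past frames.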
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| 3d-object-detection-on-dair-v2x-i | BEVFormer | AP\|R40(easy): 61.4; AP\|R40(hard): 50.7; AP\|R40(moderate): 50.7 |
| 3d-object-detection-on-nuscenes | BEVFormer | NDS: 0.57; mAAE: 0.13; mAOE: 0.38; mAP: 0.48; mASE: 0.26; mATE: 0.58; mAVE: 0.38 |
| 3d-object-detection-on-nuscenes-camera-only | BEVFormer | Future Frame: false; NDS: 56.9 |
| bird-s-eye-view-semantic-segmentation-on | BEVFormer | IoU lane - 224x480 - 100x100 at 0.5: 25.7; IoU veh - 224x480 - No vis filter - 100x100 at 0.5: 35.8; IoU veh - 224x480 - Vis filter - 100x100 at 0.5: 42.0; IoU veh - 448x800 - No vis filter - 100x100 at 0.5: 39.0; IoU veh - 448x800 - Vis filter - 100x100 at 0.5: 45.5 |
| bird-s-eye-view-semantic-segmentation-on-lyft | BEVFormer (EfficientNet-b4) | IoU vehicle - 224x480 - Long: 44.5; IoU vehicle - 224x480 - Short: 69.9 |
| bird-s-eye-view-semantic-segmentation-on-lyft | BEVFormer (ResNet-50) | IoU vehicle - 224x480 - Long: 43.2; IoU vehicle - 224x480 - Short: 68.8 |
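
The NDS values in the table are a composite score. In the nuScenes benchmark definition, NDS weights mAP by 5 and adds one term per true-positive error metric (mATE, mASE, mAOE, mAVE, mAAE), each converted to a score as 1 - min(1, error), then divides by 10. The short sketch below applies that formula to the rounded numbers in the nuScenes row as a consistency check; the function name `nds` is hypothetical, and because the inputs are rounded the result only approximates the reported score.

```python
def nds(m_ap, tp_errors):
    # nuScenes detection score: (5 * mAP + sum of (1 - min(1, err))) / 10
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0

# Rounded values copied from the 3d-object-detection-on-nuscenes row.
m_ap = 0.48
tp_errors = {"mATE": 0.58, "mASE": 0.26, "mAOE": 0.38, "mAVE": 0.38, "mAAE": 0.13}

score = nds(m_ap, list(tp_errors.values()))
print(round(score, 3))  # 0.567, in line with the reported NDS of 0.57 (56.9%)
```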