BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Li Zhiqi; Wang Wenhai; Li Hongyang; Xie Enze; Sima Chonghao; Lu Tong; Yu Qiao; Dai Jifeng

Abstract

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves a new state-of-the-art 56.9% in terms of the NDS metric on the nuScenes test set, which is 9.0 points higher than the previous best methods and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
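The data flow described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes plain dense scaled dot-product attention in place of attention restricted to regions of interest, it assumes temporal self-attention runs before spatial cross-attention within a layer, and the names `attend`, `bev_layer`, and all tensor sizes are hypothetical. It shows only how a fixed grid of BEV queries is updated from the previous frame's BEV features and from multi-camera features, frame after frame.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    # Dense scaled dot-product attention (a simplification: every query
    # sees every key, rather than only its regions of interest).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values


def bev_layer(bev_queries, prev_bev, cam_feats):
    # Temporal self-attention: queries attend to themselves and, when
    # available, to the BEV features produced for the previous frame.
    if prev_bev is None:
        memory = bev_queries
    else:
        memory = np.concatenate([bev_queries, prev_bev], axis=0)
    q = bev_queries + attend(bev_queries, memory, memory)

    # Spatial cross-attention: queries gather features from all camera
    # views, flattened here into one (num_cams * tokens, channels) array.
    flat = cam_feats.reshape(-1, cam_feats.shape[-1])
    return q + attend(q, flat, flat)


# Hypothetical sizes: a 4x4 BEV grid, 8 channels, 6 cameras, 5 tokens each.
rng = np.random.default_rng(0)
grid_h, grid_w, channels = 4, 4, 8
bev_queries = rng.normal(size=(grid_h * grid_w, channels))
frames = [rng.normal(size=(6, 5, channels)) for _ in range(3)]

bev = None
for cam_feats in frames:
    # The BEV output of one frame becomes the history for the next frame.
    bev = bev_layer(bev_queries, bev, cam_feats)

print(bev.shape)  # one feature vector per BEV grid cell
```

The loop is the recurrent part: each frame reuses the same grid-shaped queries but fuses the BEV features carried over from the frame before it.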

Code Repositories

zhiqi-li/BEVFormer (official; mentioned in GitHub)
valeoai/pointbev (pytorch; mentioned in GitHub)
fundamentalvision/BEVFormer (pytorch; mentioned in GitHub)

Benchmarks

Benchmark: 3d-object-detection-on-dair-v2x-i
Methodology: BEVFormer
Metrics:
  AP|R40(easy): 61.4
  AP|R40(hard): 50.7
  AP|R40(moderate): 50.7

Benchmark: 3d-object-detection-on-nuscenes
Methodology: BEVFormer
Metrics:
  NDS: 0.57
  mAAE: 0.13
  mAOE: 0.38
  mAP: 0.48
  mASE: 0.26
  mATE: 0.58
  mAVE: 0.38

Benchmark: 3d-object-detection-on-nuscenes-camera-only
Methodology: BEVFormer
Metrics:
  Future Frame: false
  NDS: 56.9

Benchmark: bird-s-eye-view-semantic-segmentation-on
Methodology: BEVFormer
Metrics:
  IoU lane - 224x480 - 100x100 at 0.5: 25.7
  IoU veh - 224x480 - No vis filter - 100x100 at 0.5: 35.8
  IoU veh - 224x480 - Vis filter. - 100x100 at 0.5: 42.0
  IoU veh - 448x800 - No vis filter - 100x100 at 0.5: 39.0
  IoU veh - 448x800 - Vis filter. - 100x100 at 0.5: 45.5

Benchmark: bird-s-eye-view-semantic-segmentation-on-lyft
Methodology: BEVFormer (EfficientNet-b4)
Metrics:
  IoU vehicle - 224x480 - Long: 44.5
  IoU vehicle - 224x480 - Short: 69.9

Benchmark: bird-s-eye-view-semantic-segmentation-on-lyft
Methodology: BEVFormer (ResNet-50)
Metrics:
  IoU vehicle - 224x480 - Long: 43.2
  IoU vehicle - 224x480 - Short: 68.8
