8 months ago

Abstract

Building 3D perception systems for autonomous vehicles that do not rely on high-density LiDAR is a critical research problem because of the expense of LiDAR systems compared to cameras and other sensors. Recent research has developed a variety of camera-only methods, where features are differentiably "lifted" from the multi-camera images onto the 2D ground plane, yielding a "bird's eye view" (BEV) feature representation of the 3D space around the vehicle. This line of work has produced a variety of novel "lifting" methods, but we observe that other details in the training setups have shifted at the same time, making it unclear what really matters in top-performing methods. We also observe that using cameras alone is not a real-world constraint, considering that additional sensors like radar have been integrated into real vehicles for years already. In this paper, we first of all attempt to elucidate the high-impact factors in the design and training protocol of BEV perception models. We find that batch size and input resolution greatly affect performance, while lifting strategies have a more modest effect -- even a simple parameter-free lifter works well. Second, we demonstrate that radar data can provide a substantial boost to performance, helping to close the gap between camera-only and LiDAR-enabled systems. We analyze the radar usage details that lead to good performance, and invite the community to re-consider this commonly-neglected part of the sensor platform.
