
Abstract

The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single "bird's-eye-view" coordinate frame for consumption by motion planning. We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a rasterized bird's-eye-view grid. By training on the entire camera rig, we provide evidence that our model is able to learn not only how to represent images but how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error. On standard bird's-eye-view tasks such as object segmentation and map segmentation, our model outperforms all baselines and prior work. In pursuit of the goal of learning dense representations for motion planning, we show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network. We benchmark our approach against models that use oracle depth from lidar. Project page with code: https://nv-tlabs.github.io/lift-splat-shoot .
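
The three steps named in the abstract can be illustrated with a toy sketch. The abstract does not specify the mechanics, so everything below is an assumption made for illustration: "lift" is taken to weight a per-pixel context feature by a softmax distribution over discrete depth bins, "splat" is taken to sum-pool the lifted features into square grid cells, and "shoot" is taken to pick the template trajectory with the lowest summed cost. The function names `lift`, `splat`, and `shoot` and all parameters are hypothetical and are not taken from the project's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift(depth_logits, context):
    """Lift one pixel into D feature vectors, one per depth bin (assumed form).

    Each depth bin receives the context feature scaled by the probability
    that the pixel lies at that depth.
    """
    probs = softmax(depth_logits)
    return [[p * c for c in context] for p in probs]


def splat(points, features, grid_size, cell):
    """Sum-pool point features into a square bird's-eye-view grid.

    points   : (x, y) positions in metres, with the grid centred on the origin
    features : one feature vector per point
    """
    channels = len(features[0])
    grid = [[[0.0] * channels for _ in range(grid_size)] for _ in range(grid_size)]
    half = grid_size * cell / 2.0
    for (x, y), feat in zip(points, features):
        i = math.floor((x + half) / cell)
        j = math.floor((y + half) / cell)
        if 0 <= i < grid_size and 0 <= j < grid_size:  # drop points off the grid
            for k in range(channels):
                grid[i][j][k] += feat[k]
    return grid


def shoot(cost_map, templates):
    """Return the index of the template trajectory with the lowest summed cost.

    templates : list of trajectories, each a list of (i, j) grid cells
    """
    costs = [sum(cost_map[i][j] for i, j in traj) for traj in templates]
    return min(range(len(templates)), key=costs.__getitem__)
```

In this sketch, a pixel whose two depth bins are equally likely contributes half of its context feature to each bin; if both bins fall into the same grid cell, `splat` adds them back together. A real system would place each depth bin in 3D using camera intrinsics and extrinsics, which the sketch leaves out.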
