Brady Zhou; Philipp Krähenbühl

Abstract
We present cross-view transformers, an efficient attention-based model for map-view semantic segmentation from multiple cameras. Our architecture implicitly learns a mapping from individual camera views into a canonical map-view representation using a camera-aware cross-view attention mechanism. Each camera uses positional embeddings that depend on its intrinsic and extrinsic calibration. These embeddings allow a transformer to learn the mapping across different views without ever explicitly modeling it geometrically. The architecture consists of a convolutional image encoder for each view and cross-view transformer layers to infer a map-view semantic segmentation. Our model is simple, easily parallelizable, and runs in real time. The presented architecture performs at state-of-the-art on the nuScenes dataset, with 4x faster inference speeds. Code is available at https://github.com/bradyz/cross_view_transformers.
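To make the idea in the abstract concrete, below is a minimal PyTorch sketch of a single camera-aware cross-view attention layer: learned map-view queries attend over per-camera image features whose keys carry a positional embedding derived from each camera's intrinsic and extrinsic calibration. This is not the authors' implementation (see the linked repository for that); the module name `CrossViewAttention`, the ray-plus-camera-center embedding, and all shapes and arguments are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossViewAttention(nn.Module):
    """Sketch of one cross-view attention layer: learned map-view queries attend over
    per-camera features whose keys carry a calibration-aware positional embedding
    (here, per-pixel ray directions and camera centers expressed in the ego frame)."""

    def __init__(self, dim=128, map_size=25):
        super().__init__()
        self.dim = dim
        # Learned map-view (bird's-eye-view) query grid of map_size x map_size latent cells.
        self.map_queries = nn.Parameter(torch.randn(map_size * map_size, dim))
        # MLP turning (ray direction, camera center) into a camera-aware key embedding.
        self.ray_embed = nn.Sequential(nn.Linear(6, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def camera_embedding(self, K, E, hw):
        """Unproject feature-map pixel centers into rays in the ego frame.

        K: (B, N, 3, 3) intrinsics; E: (B, N, 4, 4) camera-to-ego extrinsics."""
        h, w = hw
        ys, xs = torch.meshgrid(
            torch.arange(h, device=K.device, dtype=K.dtype),
            torch.arange(w, device=K.device, dtype=K.dtype),
            indexing="ij",
        )
        pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)   # (h*w, 3)
        rays_cam = torch.einsum("bnij,pj->bnpi", torch.inverse(K), pix)           # (B, N, h*w, 3)
        rays_ego = torch.einsum("bnij,bnpj->bnpi", E[..., :3, :3], rays_cam)
        rays_ego = F.normalize(rays_ego, dim=-1)
        center = E[..., :3, 3].unsqueeze(2).expand(-1, -1, h * w, -1)             # (B, N, h*w, 3)
        return self.ray_embed(torch.cat([rays_ego, center], dim=-1))              # (B, N, h*w, dim)

    def forward(self, feats, K, E):
        """feats: (B, N, C, H, W) features from N cameras, with C == dim."""
        B, N, C, H, W = feats.shape
        x = feats.flatten(3).transpose(2, 3).reshape(B, N * H * W, C)             # (B, N*H*W, C)
        pos = self.camera_embedding(K, E, (H, W)).reshape(B, N * H * W, -1)
        q = self.to_q(self.map_queries).unsqueeze(0).expand(B, -1, -1)            # (B, Q, C)
        k = self.to_k(x + pos)                                                    # keys see calibration
        v = self.to_v(x)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.dim ** 0.5, dim=-1)     # (B, Q, N*H*W)
        return attn @ v                                                           # (B, Q, C) map-view features


# Toy usage: 6 cameras, 128-channel features at 28x60, identity calibration.
layer = CrossViewAttention(dim=128, map_size=25)
feats = torch.randn(2, 6, 128, 28, 60)
K = torch.eye(3).repeat(2, 6, 1, 1)
E = torch.eye(4).repeat(2, 6, 1, 1)
out = layer(feats, K, E)  # (2, 625, 128): one feature per map-view cell
```

In the full model, the output map-view features would be refined by further transformer layers and decoded into the semantic segmentation; here only the attention step is sketched.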
Code Repositories
- https://github.com/bradyz/cross_view_transformers
Benchmarks
| Benchmark | Methodology | Metric | IoU |
|---|---|---|---|
| bird-s-eye-view-semantic-segmentation-on | CVT | IoU veh. - 224x480 - No vis. filter - 100x100 at 0.5 | 31.4 |
| bird-s-eye-view-semantic-segmentation-on | CVT | IoU veh. - 224x480 - Vis. filter - 100x100 at 0.5 | 36.0 |
| bird-s-eye-view-semantic-segmentation-on | CVT | IoU veh. - 448x800 - No vis. filter - 100x100 at 0.5 | 32.5 |
| bird-s-eye-view-semantic-segmentation-on | CVT | IoU veh. - 448x800 - Vis. filter - 100x100 at 0.5 | 37.7 |