
SAM4D: Segment Anything in Camera and LiDAR Streams

Jianyun Xu, Song Wang, Ziqian Ni, Chunyong Hu, Sheng Yang, Jianke Zhu, Qiang Li


Abstract

We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability of the proposed SAM4D and its great potential for data annotation.
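The ego-motion compensation that MCMA relies on can be illustrated with a minimal sketch: points observed in a past ego frame are warped into the current ego frame with a homogeneous pose transform, so that memory features from earlier timesteps stay spatially aligned with the current scene. The function name and arguments below are illustrative assumptions, not SAM4D's actual API.

```python
import numpy as np

def ego_motion_compensate(points, T_prev_to_curr):
    """Warp a past frame's LiDAR points into the current ego frame.

    points:          (N, 3) xyz coordinates in the previous ego frame.
    T_prev_to_curr:  (4, 4) homogeneous transform from the previous
                     ego pose to the current one (from the ego odometry).
    Illustrative sketch only; names are assumptions, not the paper's code.
    """
    # Lift to homogeneous coordinates: (N, 3) -> (N, 4).
    homo = np.hstack([points, np.ones((len(points), 1))])
    # Apply the transform and drop the homogeneous coordinate.
    return (homo @ T_prev_to_curr.T)[:, :3]
```

With points expressed in a common current-frame coordinate system, cross-modal memory attention can match features by 3D position across time without the ego vehicle's own motion appearing as apparent scene motion.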
