
LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, Deqing Sun

Abstract

Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or by the limited effective memory of recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory, which anchors the global coordinate frame and prevents scale drift, with a non-parametric Sliding Window Attention (SWA) mechanism, which preserves uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of only 128 frames yet generalize to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods, reducing ATE on KITTI by over 74%, and achieves robust, globally consistent reconstruction over unprecedented horizons.

One-sentence Summary

Researchers from Google DeepMind and UC Berkeley present LoGeR, a feedforward model that scales 3D reconstruction to long videos by combining Test-Time Training for global consistency with Sliding Window Attention for local precision, eliminating the need for post-optimization while achieving superior accuracy on datasets with thousands of frames.

Key Contributions

  • Feedforward geometric models currently struggle to scale to minute-long videos due to quadratic attention complexity and limited memory, creating a critical gap between short-window reconstruction and the need for global consistency over long sequences.
  • LoGeR introduces a novel chunk-wise architecture with a hybrid memory module that combines parametric Test-Time Training to anchor the global coordinate frame and non-parametric Sliding Window Attention to preserve high-precision local alignment.
  • Trained on sequences of only 128 frames, the model generalizes to thousands of frames and achieves state-of-the-art performance by reducing Absolute Trajectory Error on KITTI by over 74% and improving results by 30.8% on a new 19k-frame VBR benchmark.

Introduction

Large-scale dense 3D reconstruction is essential for applications ranging from autonomous driving to generative world-building, yet current methods struggle to balance computational efficiency with long-range consistency. While classical optimization pipelines can handle city-scale scenes, they rely on slow offline processes and fail on sparse inputs, whereas modern feedforward geometric models offer speed but are limited to short, bounded scenes due to quadratic attention complexity and a lack of training data for long sequences. To bridge this gap, the authors propose LoGeR, a feedforward framework that utilizes a hybrid memory module combining non-parametric sliding window attention for high-fidelity local details and parametric associative memory for global structural integrity. This approach allows the model to process massive sequences of up to 19,000 frames with linear computational cost, effectively overcoming the context and data walls that previously prevented feedforward models from scaling to real-world, long-horizon trajectories.

Dataset

  • Dataset Composition and Sources: The authors utilize a diverse mixture of 14 large-scale datasets containing both real-world and synthetic scenes across indoor, outdoor, and autonomous driving environments to support long-context geometric reconstruction.

  • Key Details for Each Subset:

    • Navigation and large-scale scene datasets like TartanAirV2 and VKITTI2 are heavily weighted to encourage long-range geometric reasoning.
    • DL3DV receives a high sampling weight due to its exceptional real-world scene diversity, which aids model generalization.
    • Smaller or object-centric datasets are down-weighted in the mixture.
    • The OmniWorld-Game contribution is limited to a subset of 5,000 sequences based on the publicly released data at the time of training.
  • Data Usage and Mixture Ratios: The training configuration employs relative sampling percentages as summarized in Table 4, where the mixture is specifically tuned to provide sufficient long-horizon signals and diverse scene priors.

  • Processing and Filtering Strategies:

    • All datasets are standardized into multi-view sequences with 48 views (or 128 views for H200 GPU training) at a uniform resolution of 504 × 280.
    • The sampling strategy follows the CUT3R approach.
    • Rigorous depth filtering is applied to ensure geometric supervision quality.
    • Metric-scale datasets like ARKitScenes and ScanNet use a maximum depth threshold (e.g., 80.0 meters), while others like DL3DV and TartanAir use percentile-based clipping (e.g., 90th or 98th percentile) to mask noisy or invalid depth values.
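To make the two filtering regimes concrete, here is a minimal sketch of such a depth mask. This is illustrative only, not the authors' code; the function name and the return convention (zeroing invalid pixels) are assumptions, while the thresholds (80.0 m cap, 90th/98th percentile clipping) come from the text above.

```python
import numpy as np

def filter_depth(depth, max_depth=None, percentile=None):
    """Mask noisy/invalid depth values (illustrative sketch).

    Metric-scale data uses a fixed cap (e.g. 80.0 m); other datasets
    clip at a per-image percentile (e.g. the 90th or 98th).
    Returns the depth map with invalid pixels zeroed, plus the mask.
    """
    valid = np.isfinite(depth) & (depth > 0)          # drop inf/NaN/non-positive
    if max_depth is not None:
        valid &= depth < max_depth                    # fixed metric cap
    if percentile is not None:
        cutoff = np.percentile(depth[valid], percentile)
        valid &= depth <= cutoff                      # percentile-based clipping
    return np.where(valid, depth, 0.0), valid

depth = np.array([[0.5, 2.0],
                  [80.5, np.inf]])                    # toy 2x2 depth map
filtered, mask = filter_depth(depth, max_depth=80.0)  # keeps 0.5 and 2.0
```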

Method

The authors propose LoGeR, a novel architecture designed to scale dense 3D reconstruction to extremely long video sequences without post-optimization. To overcome the quadratic complexity of global attention and the scarcity of long-horizon training data, the method processes video streams sequentially by chunk. This approach tightly bounds computational cost while ensuring that local inferences remain within the distribution of existing short-context training data.
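The chunking itself can be sketched as a simple index iterator. This is a hypothetical helper, not the paper's implementation: the chunk size of 48 matches the training configuration described later, but the overlap of 4 frames is an assumed value chosen only to illustrate that consecutive chunks share frames for later alignment.

```python
def iter_chunks(num_frames, chunk_size=48, overlap=4):
    """Yield (start, end) frame-index ranges over a video stream.

    Hypothetical sketch of chunk-wise processing: consecutive chunks
    share `overlap` frames so downstream alignment has common views.
    Cost is linear in num_frames, unlike global attention.
    """
    start = 0
    while start < num_frames:
        end = min(start + chunk_size, num_frames)
        yield start, end
        if end == num_frames:
            break
        start = end - overlap  # step forward, keeping an overlap

chunks = list(iter_chunks(100, chunk_size=48, overlap=4))
# -> [(0, 48), (44, 92), (88, 100)]
```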

Refer to the framework diagram for an overview of the proposed chunk-wise processing pipeline and its performance on long sequences.

The core innovation lies in a learning-based hybrid memory module that manages coherence across chunk boundaries. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment.

As shown in the figure below, the architecture processes input sequences in consecutive chunks, utilizing specific mechanisms for the first chunk versus following chunks to propagate information effectively.

Within each residual block of the geometry backbone, the authors introduce the hybrid memory system. The process begins with per-frame attention to extract spatial features independently for each frame. To align adjacent chunks, sparse sliding-window attention layers are inserted at a subset of network depths. These layers attend to tokens from both the previous chunk $\mathcal{C}^{m-1}$ and the current chunk $\mathcal{C}^{m}$, establishing a lossless information highway for high-fidelity feature propagation.
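The read pattern of such a layer, where current-chunk queries attend over the concatenation of previous-chunk and current-chunk tokens, can be sketched in a few lines of single-head attention. All shapes and names here are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sliding_window_attention(x_prev, x_cur, Wq, Wk, Wv):
    """Single-head attention where current-chunk tokens attend to the
    concatenation of previous- and current-chunk tokens (sketch).
    """
    q = x_cur @ Wq                                    # queries from current chunk
    kv_src = np.concatenate([x_prev, x_cur], axis=0)  # keys/values span both chunks
    k, v = kv_src @ Wk, kv_src @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # row-wise softmax
    return attn @ v

rng = np.random.default_rng(0)
D, d, Tp, Tc = 8, 4, 5, 3                             # feature dims, chunk lengths
out = sliding_window_attention(
    rng.normal(size=(Tp, D)), rng.normal(size=(Tc, D)),
    rng.normal(size=(D, d)), rng.normal(size=(D, d)), rng.normal(size=(D, d)))
```

Because only two adjacent chunks ever enter the attention window, the per-chunk cost stays bounded regardless of total sequence length.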

To integrate global context, the model maintains a set of fast weights $W^{m}$ that summarize information up to chunk $m$. The TTT layer performs an apply-then-update procedure at the chunk level. In the apply operation, the TTT layers use historical information stored in the weights to modulate the network's processing of the current chunk. In the update operation, the weights are edited to store information from the current chunk, conceptually compressing important but redundant geometric information. The mathematical formulation for the TTT update and apply operations is:

$$\text{Update operation:} \quad W \gets W - \eta \, \nabla_{W}\, \mathcal{L}\big(f_{W}(\mathbf{k}), \mathbf{v}\big)$$

$$\text{Apply operation:} \quad o = f_{W}(\mathbf{q})$$

where $\eta$ is the learning rate and $\mathcal{L}$ is a loss function encouraging the function $f_{W}$ to link keys with corresponding values. Finally, within each chunk, a bidirectional attention module is employed for powerful geometric reasoning under a bounded context window.
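The update/apply cycle above can be made concrete with the simplest possible choices: a linear $f_{W}(\mathbf{k}) = W\mathbf{k}$ and a squared-error $\mathcal{L}$. Both choices are simplifying assumptions for illustration; the paper does not specify this exact parameterization here.

```python
import numpy as np

def ttt_update(W, K, V, eta=0.02):
    """One gradient step on L(W) = 0.5 * ||W K - V||_F^2, nudging the
    fast weights to associate key columns K with value columns V.
    (Linear f_W and squared-error loss are simplifying assumptions.)
    """
    grad = (W @ K - V) @ K.T          # d/dW of 0.5*||W K - V||^2
    return W - eta * grad

def ttt_apply(W, Q):
    """Read out: o = f_W(q) for each query column."""
    return W @ Q

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))           # ground-truth key->value map
K = rng.normal(size=(d, 32))          # one chunk's keys
V = A @ K                             # corresponding values
W = np.zeros((d, d))                  # fast weights start empty
for _ in range(300):                  # repeated inner-loop updates
    W = ttt_update(W, K, V)
err = np.linalg.norm(ttt_apply(W, K) - V)   # near zero after convergence
```

After enough updates the fast weights have absorbed the key-value association, so a later apply step can retrieve it without re-attending to the original tokens; this is the compression role the TTT memory plays across chunks.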

For training, the authors employ a progressive curriculum strategy to stabilize the optimization of recurrent TTT layers. The schedule begins with shorter sequences and gradually increases complexity, forcing the model to shift reliance from local Sliding Window Attention to the global TTT hidden state. Additionally, to mitigate prediction errors in very long streams, a variant called LoGeR* incorporates a purely feedforward alignment step. This step aligns raw predictions into a consistent global coordinate system by computing a rigid SE(3) transformation using overlapping frames between chunks.
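The rigid SE(3) alignment used by the LoGeR* variant admits a standard closed-form solution; below is a generic sketch via the SVD-based (Kabsch) construction on corresponding points from the overlapping frames. The function name and point-cloud interface are assumptions, not the authors' implementation.

```python
import numpy as np

def rigid_align(P, Q):
    """Least-squares SE(3) transform (R, t) mapping points P onto Q,
    via the standard SVD (Kabsch) solution. P, Q: (N, 3) arrays of
    corresponding points, e.g. from frames shared between chunks.
    """
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mu_p).T @ (Q - mu_q)                 # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflections
    R = Vt.T @ S @ U.T
    t = mu_q - R @ mu_p
    return R, t

# Usage: recover a known rotation and translation from noiseless points.
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, -2.0, 0.5])
Q = P @ R_true.T + t_true
R, t = rigid_align(P, Q)
```

Because this step is itself closed-form and feedforward, it preserves the no-post-optimization property while stitching chunk predictions into one global coordinate system.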

Experiment

  • Long-sequence evaluation on KITTI and VBR benchmarks demonstrates that LoGeR effectively mitigates accumulated drift over thousands of frames, maintaining global scale and trajectory consistency where prior methods fail.
  • Short-sequence tests on 7-Scenes, ScanNet, and TUM-Dynamics confirm that the proposed model and its variant significantly outperform existing feedforward and optimization-based approaches in 3D reconstruction and camera pose estimation.
  • Ablation studies validate that the hybrid architecture is essential, with the Test-Time Training layer ensuring global consistency and the Sliding Window Attention layer preserving local geometric smoothness.
  • Experiments on data mixture and curriculum training prove that incorporating large-scale navigation datasets and a progressive training schedule are critical for generalization and stabilizing recurrent layer optimization.
