TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Weijie Wang Zimu Li Jinchuan Shi Zeyu Zhang Botao Ye Marc Pollefeys Donny Y. Chen Bohan Zhuang

Abstract

Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most existing methods remain centered on Gaussian primitives and expose surfaces only indirectly: extracting a usable mesh for downstream simulation, physics reasoning, or embodied interaction still requires expensive post-hoc steps that break the feed-forward promise. This limitation is especially pronounced in pose-free settings, where scene structure and camera parameters must be estimated jointly from sparse observations. We present TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and directly exports simulation-ready mesh scenes from a single forward pass. Given input images, the network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics. Rather than regressing triangle orientation as an unconstrained latent variable, our approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization. A mono-normal bootstrap schedule further stabilizes early training, while opacity and blur scheduling progressively sharpens the learned surface representation for direct mesh extraction. Experiments on RealEstate10K and DL3DV show that this representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Because the rendering primitives are themselves surface triangles, the output can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction.

One-sentence Summary

TriSplat is a feed-forward 3D reconstruction network that replaces Gaussian primitives with oriented triangle primitives, derives geometry normals from predicted point maps and refines them with an image-conditioned head, and directly exports simulation-ready meshes in a single pass, yielding geometry-faithful reconstructions on RealEstate10K and DL3DV without costly post-hoc processing.

Key Contributions

  • TriSplat is a feed-forward network that replaces Gaussian primitives with oriented triangle representations to jointly predict local geometry, appearance, and camera poses from sparse unposed images in a single forward pass. This native triangular format directly exports textured meshes, eliminating the lossy post-hoc extraction steps required by existing methods.
  • The method employs a normal-anchored triangle construction pipeline that derives surface orientation from predicted point maps, refines normals using an image-conditioned head, and stabilizes training through mono-normal bootstrapping and validity-aware masking. Anchoring triangle orientation to explicit local geometry rather than unconstrained latent variables improves surface fidelity and rendering stability.
  • Evaluations on RealEstate10K and DL3DV demonstrate that the triangle-native representation yields geometry-faithful reconstructions and superior surface accuracy compared to Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Zero-shot testing on ScanNet confirms cross-dataset generalization, validating the approach as a direct simulation-ready solution for downstream physics and rendering pipelines.

Introduction

Reconstructing 3D scenes from sparse images is essential for robotics and embodied AI, where downstream tasks like physics simulation and collision detection require explicit surface meshes that integrate seamlessly with engines such as Unity or Isaac Sim. While recent feed-forward models accelerate reconstruction by predicting Gaussian primitives directly from images, these representations capture surfaces only implicitly, forcing users to rely on expensive post-hoc mesh extraction steps that undermine the efficiency of the initial prediction. The authors propose TriSplat, a feed-forward network that represents scenes using oriented triangle primitives to deliver simulation-ready meshes in a single forward pass. By jointly predicting geometry, appearance, and camera poses, and by anchoring triangle orientation to image-refined surface normals, TriSplat eliminates the need for intermediate conversion and produces geometry-faithful outputs that physics engines can consume immediately.

Dataset

  • Dataset Composition and Sources: The authors generate synthetic evaluation data entirely within Unity and NVIDIA Isaac Sim environments. Rather than relying on real-world captures, they construct chronological four-frame dynamic sequences directly from the model's exported triangle meshes.
  • Subset Details: The synthetic sequences are organized into three simulation categories: robotic grasping, ball dynamics, and multi-platform locomotion. Each subset is derived automatically from the exported meshes without manual cleanup or format conversion.
  • Data Usage and Processing: The sequences drive physics-based validation. The authors extract the underlying meshes using two distinct pipelines. The TSDF fusion method applies a 0.005 voxel size, 0.1 SDF truncation, and 5.0 depth truncation, while masking pixels with rendered alpha below 0.3 and retaining only the 50 largest connected components containing at least 50 triangles. The direct export method prunes triangles with opacity below 0.10 after temperature scaling at 5.0, deduplicates vertices using quantized position hashing at 10^-5 precision, and calculates per-triangle colors from zeroth-order spherical harmonics.
  • Orientation and Frame Construction: Triangle orientations are anchored to predicted 3D geometry to prevent hard-edged artifacts. Raw normals are computed via finite differences on the dense point map, smoothed with average pooling, and corrected by a zero-initialized U-Net that ingests appearance, depth, and validity masks. Early training stability relies on a mono-normal bootstrap that gradually transitions from a pretrained monocular estimator to the model's own predictions using a cosine decay schedule. The final normals are converted into orthonormal tangent frames to drive triangle rendering.
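
The direct export steps described above (opacity pruning followed by vertex deduplication through quantized position hashing) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function name `export_mesh` and its input layout are hypothetical, and the opacities are assumed to have already been temperature-scaled. Only the 0.10 threshold and the 10^-5 precision come from the text.

```python
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Face = Tuple[int, int, int]


def export_mesh(
    triangles: Sequence[Tuple[Vec3, Vec3, Vec3]],
    opacities: Sequence[float],
    min_opacity: float = 0.10,  # pruning threshold quoted in the text
    precision: float = 1e-5,    # quantization step quoted in the text
) -> Tuple[List[Vec3], List[Face]]:
    """Prune low-opacity triangles, then merge vertices sharing a quantized position."""
    vertex_index: Dict[Tuple[int, int, int], int] = {}  # quantized key -> vertex id
    vertices: List[Vec3] = []
    faces: List[Face] = []
    for tri, alpha in zip(triangles, opacities):
        if alpha < min_opacity:
            continue  # drop nearly transparent triangles
        face = []
        for v in tri:
            # Hash key: position snapped to a grid with cell size `precision`.
            key = (round(v[0] / precision), round(v[1] / precision), round(v[2] / precision))
            if key not in vertex_index:
                vertex_index[key] = len(vertices)
                vertices.append(v)
            face.append(vertex_index[key])
        faces.append((face[0], face[1], face[2]))
    return vertices, faces
```

Two triangles that share an edge end up referencing the same two vertex indices, which is what lets downstream tools treat the output as a connected mesh rather than a triangle soup.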

Method

The authors leverage a feed-forward architecture to reconstruct scenes from sparse, unposed images in a single forward pass, directly outputting oriented triangle primitives that form a mesh without post-processing. The framework begins with an encoder built upon a DINOv2 backbone, augmented with 2D rotary position embeddings and per-pixel intrinsic information to support pose-free reconstruction. This backbone feeds into a custom transformer decoder that alternates between intra-view self-attention and cross-view joint attention, enabling local spatial reasoning and multi-view correspondence aggregation. The decoder produces feature tokens that are processed by three parallel heads: a point head, a camera head, and a primitive head.

The point head predicts a dense local 3D point map $\mathbf{P} \in \mathbb{R}^{H \times W \times 3}$ for each input view. For each pixel, it outputs three unconstrained scalars $(u, v, z')$, where the depth is recovered as $z = \exp(z')$ to ensure strict positivity, and the 3D point is computed as $\mathbf{p} = z \cdot (u, v, 1)^\top$. This parameterization couples lateral position with depth, mirroring the projective image-formation model. The camera head regresses per-view SE(3) camera-to-world poses by mean-pooling decoder tokens and projecting a $3 \times 3$ matrix onto SO(3) via SVD orthogonalization, with all poses expressed relative to the first view. The primitive head predicts per-pixel triangle attributes, including a density logit, three scale logits, a quaternion, spherical-harmonic appearance coefficients, and a blur parameter. To provide direct access to appearance, the input RGB image is patch-embedded and additively fused into the features before decoding. All dense heads employ pixel-shuffle upsampling to reach full spatial resolution.
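
The point parameterization and the SO(3) projection described above can be illustrated with a small NumPy sketch. The function names are hypothetical, and the determinant correction in `project_to_so3` is the standard way to guarantee a proper rotation; the text only states that SVD orthogonalization is used, so that detail is an assumption.

```python
import numpy as np


def decode_points(raw: np.ndarray) -> np.ndarray:
    """Map raw head outputs (u, v, z') of shape (H, W, 3) to points z * (u, v, 1)."""
    u, v, z_raw = raw[..., 0], raw[..., 1], raw[..., 2]
    z = np.exp(z_raw)  # exponential keeps depth strictly positive
    return np.stack([z * u, z * v, z], axis=-1)


def project_to_so3(m: np.ndarray) -> np.ndarray:
    """Closest rotation (Frobenius norm) to an arbitrary 3x3 matrix, via SVD."""
    u, _, vt = np.linalg.svd(m)
    # Assumed detail: flip the last axis if needed so that det(R) = +1.
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt
```

Because lateral position is multiplied by depth, a fixed `(u, v)` traces a ray through the camera center as `z'` varies, which is the coupling the paragraph refers to.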

The predicted point maps and camera poses define the triangle centers $\mathbf{c}$ in world space. Each triangle is instantiated from a canonical equilateral template $\mathcal{T} \in \mathbb{R}^{3 \times 3}$. The three raw scale logits are mapped via sigmoid to a bounded interval and converted to world-space sizes using the predicted depth and the intrinsic-derived pixel footprint. Let $\mathbf{s}$ denote the resulting scale vector, $\mathbf{R}_n$ the tangent-frame rotation that orients the triangle along the local surface, and $\mathbf{R}_c$ the camera-to-world rotation. The $k$-th vertex is computed as $\mathbf{v}_k = \mathbf{R}_c \, \mathbf{R}_n \big( \mathcal{T}_k \odot \mathbf{s} \big) + \mathbf{c}$, where $\odot$ denotes element-wise multiplication. The resulting oriented triangles are rendered by a differentiable triangle rasterizer via tile-based sorting and front-to-back alpha compositing, producing RGB images, depth maps, and surface normals.
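
As a concrete reading of the vertex formula, the sketch below evaluates $\mathbf{v}_k = \mathbf{R}_c \mathbf{R}_n (\mathcal{T}_k \odot \mathbf{s}) + \mathbf{c}$ for a single triangle. The template used here (an equilateral triangle in the local xy-plane with unit circumradius) is an assumption made for illustration; the text does not specify the canonical template's layout.

```python
import numpy as np

# Hypothetical canonical template: rows are T_k, lying in the local xy-plane.
_angles = np.deg2rad([90.0, 210.0, 330.0])
TEMPLATE = np.stack([np.cos(_angles), np.sin(_angles), np.zeros(3)], axis=1)


def triangle_vertices(center, scale, R_n, R_c, template=TEMPLATE):
    """Return a (3, 3) array whose rows are v_k = R_c R_n (T_k * s) + c."""
    local = template * scale            # element-wise scaling of every T_k
    rotation = R_c @ R_n                # tangent frame first, then camera-to-world
    return local @ rotation.T + center  # row-vector form of rotation @ x + c
```

With identity rotations the triangle stays in the xy-plane around `center`; swapping in a tangent-frame rotation tilts it to lie along the local surface.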

The point maps serve a dual purpose: they define triangle centers and provide the geometric foundation for deriving triangle orientation. This is achieved through a geometry-anchored normal refinement process. The framework predicts raw geometry normals from the point map, which are then refined via a lightweight U-Net that takes as input the raw geometry normal, smoothed geometry normal, downsampled RGB image, predicted depth, and a validity mask. The refinement head outputs a residual correction, which is applied to the raw normal to produce refined normals. These refined normals are used to compute the tangent-frame rotation $\mathbf{R}_n$, which orients the triangle along the local surface. At pixels where the geometry-based rotation is valid, the network's predicted quaternion is overridden by the geometry-derived quaternion; at invalid pixels, the network quaternion is retained as a fallback. This geometry-anchored orientation ensures that the triangles are properly aligned with the scene's surface geometry.
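
The geometric part of this pipeline (raw normals from the point map, a tangent frame per normal, and the valid/invalid override) can be sketched in NumPy as below. The sketch leaves out the smoothing and the U-Net residual, and several details are assumptions: central differences, the sign convention that makes normals face the camera, the helper-axis construction of the frame, and all function names.

```python
import numpy as np


def normals_from_points(points: np.ndarray, eps: float = 1e-8):
    """Finite-difference normals on an (H, W, 3) point map, plus a validity mask."""
    dx = np.zeros_like(points)
    dy = np.zeros_like(points)
    dx[:, 1:-1] = points[:, 2:] - points[:, :-2]  # central difference along image x
    dy[1:-1, :] = points[2:, :] - points[:-2, :]  # central difference along image y
    n = np.cross(dy, dx)  # assumed sign: normals point back toward the camera
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    valid = norm[..., 0] > eps  # border pixels and degenerate patches are invalid
    n = np.where(valid[..., None], n / np.maximum(norm, eps), 0.0)
    return n, valid


def tangent_frame(n: np.ndarray) -> np.ndarray:
    """Right-handed orthonormal frame with columns [t, b, n] for one unit normal."""
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t = np.cross(helper, n)
    t = t / np.linalg.norm(t)
    b = np.cross(n, t)
    return np.stack([t, b, n], axis=1)


def select_orientation(valid: np.ndarray, geometric: np.ndarray, predicted: np.ndarray):
    """Use the geometry-derived orientation where valid, else the network fallback."""
    return np.where(valid[..., None], geometric, predicted)
```

The third column of the frame is the normal itself, so a template lying in the local xy-plane is rotated into the tangent plane of the surface.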

Experiment

The evaluation assesses TriSplat across surface geometry, novel-view rendering, depth and normal accuracy, and runtime efficiency through controlled ablations and cross-dataset zero-shot tests on RealEstate10K, DL3DV, and ScanNet. Experiments validate that its geometry-anchored normal pipeline and progressive training schedules yield smooth, coherent surface orientations, and that each architectural component mitigates a specific reconstruction failure mode. The native triangle representation accounts for both efficiency and simulation readiness: it eliminates costly post-hoc mesh extraction and enables direct integration into standard physics engines for collision detection, locomotion, and interactive tasks. Overall, the results indicate that TriSplat preserves rendering fidelity through mesh export, making it an efficient, simulation-ready alternative to conventional Gaussian-based methods.

The authors conduct an ablation study on the triangle scale range in TriSplat, evaluating its impact on surface geometry and mesh-rendering quality. Varying the range changes both Chamfer Distance (CD) and F1 score, and different combinations trade geometry fidelity against rendering quality. A smaller minimum scale combined with a larger maximum scale improves F1 score but increases LPIPS. The optimal range balances coverage against artifact control, as shown by consistent CD and PSNR trends.
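
To make the "scale range" in this ablation concrete, the sketch below shows one plausible form of the bounded mapping described in the Method section: a sigmoid squashes the raw logit into `[s_min, s_max]`, and the result is converted to a world-space size through a per-pixel footprint. The bounds `s_min` and `s_max`, the pinhole footprint `depth / focal_px`, and the function names are all assumptions; the page does not report the actual values or formula.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def world_scale(logit: float, depth: float, focal_px: float,
                s_min: float = 0.5, s_max: float = 4.0) -> float:
    """Map a raw scale logit to a world-space size (bounds in pixels are hypothetical)."""
    size_px = s_min + (s_max - s_min) * _sigmoid(logit)
    pixel_footprint = depth / focal_px  # assumed pinhole model: width of one pixel at this depth
    return size_px * pixel_footprint
```

Under this reading, lowering `s_min` lets triangles shrink to capture fine detail, while raising `s_max` lets them cover gaps, which matches the coverage-versus-artifact trade-off the ablation describes.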

The authors compare the inference time of TriSplat against several baseline methods across different numbers of input views. TriSplat is significantly faster than all other methods, particularly with few input views, where it stays well under one second. Gaussian-based baselines and volumetric methods take substantially longer, and their inference time grows with the number of input views. TriSplat's end-to-end efficiency is attributed to its triangle-native representation, which eliminates the need for post-hoc mesh extraction.

The authors evaluate TriSplat on depth and normal accuracy, where it outperforms all compared methods. It achieves the best depth accuracy, with the lowest relative and difference errors, and the most accurate normals, with the lowest mean angular error and the highest accuracy below 30 degrees. The improvements are attributed to the geometry-anchored normal pipeline and a bootstrap schedule that prioritizes orientation quality and consistency.
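
For readers unfamiliar with the normal metrics mentioned here, the following sketch computes the mean angular error and the fraction of pixels whose error falls below a threshold (30 degrees by default). It follows the usual definitions of these metrics and is not taken from the authors' evaluation code; the function name is hypothetical.

```python
import numpy as np


def normal_metrics(pred: np.ndarray, gt: np.ndarray, threshold_deg: float = 30.0):
    """Return (mean angular error in degrees, fraction of normals under the threshold)."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    # Clip guards against arccos domain errors caused by rounding.
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))
    return float(angles.mean()), float((angles < threshold_deg).mean())
```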

The authors evaluate the impact of opacity temperature scheduling on TriSplat's geometry and rendering quality. A fixed low temperature leads to poor gradient coverage and the worst performance across all metrics, while a fixed high temperature produces soft surfaces and degrades rendering quality despite better geometry. The default schedule, which increases the temperature from 1.0 to 5.0 over 16K steps, achieves the best trade-off, improving both geometry and rendering metrics compared to either extreme.
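
A schedule of this kind can be written in a few lines. The endpoints (1.0 and 5.0) and the 16K-step duration come from the text; the linear ramp shape and the function name are assumptions, since the page does not say how the temperature is interpolated or how it enters the opacity activation.

```python
def opacity_temperature(step: int, start: float = 1.0, end: float = 5.0,
                        ramp_steps: int = 16_000) -> float:
    """Temperature at a training step: assumed linear ramp, held constant afterwards."""
    progress = min(max(step / ramp_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * progress
```

The two fixed-temperature ablation arms correspond to calling this with `start == end`.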

The authors evaluate TriSplat's surface geometry and mesh-rendering quality across different input view counts, comparing it against several baseline methods. TriSplat achieves the best surface geometry metrics at every view count and consistently outperforms all baselines in mesh-rendering quality. Its performance stays strong and stable as the view count varies, and the gains in precision and recall become more pronounced with more input views. The widening gap between TriSplat and the baselines indicates that the method scales better with the number of input views.
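
The surface geometry metrics referred to here (Chamfer Distance, precision, recall, F1) are typically computed between point sets sampled from the predicted and reference surfaces. The brute-force sketch below uses one common convention: unsquared distances and the sum of the two directional means. The exact definition and distance threshold used in the evaluation are not given on this page, so both are assumptions.

```python
import numpy as np


def chamfer_and_f1(pred: np.ndarray, gt: np.ndarray, tau: float):
    """Chamfer distance, precision, recall and F1 for (N, 3) and (M, 3) point sets."""
    # Pairwise distances, shape (N, M); fine for small sets, use a KD-tree for large ones.
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    d_pred = d.min(axis=1)  # each predicted point to its nearest reference point
    d_gt = d.min(axis=0)    # each reference point to its nearest predicted point
    chamfer = d_pred.mean() + d_gt.mean()
    precision = (d_pred < tau).mean()
    recall = (d_gt < tau).mean()
    denom = precision + recall
    f1 = 0.0 if denom == 0 else 2.0 * precision * recall / denom
    return float(chamfer), float(precision), float(recall), float(f1)
```

Precision penalizes predicted surface that floats away from the reference, while recall penalizes reference surface the prediction fails to cover, which is why the two are reported alongside CD.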

The evaluation encompasses ablation studies validating the impact of triangle scale ranges and opacity temperature scheduling, alongside comparative assessments measuring inference efficiency, geometric accuracy, and scalability across varying input view counts. Qualitatively, the findings indicate that optimized hyperparameter ranges successfully reconcile surface fidelity with rendering quality, while the triangle-native representation bypasses costly post-processing to achieve substantially faster inference than competing approaches. Additionally, the geometry-anchored normal pipeline and dedicated scheduling strategy consistently produce superior depth and orientation accuracy, with performance advantages growing more distinct as input data increases. Collectively, these results establish TriSplat as a highly efficient and scalable framework that reliably outperforms existing baselines in both geometric precision and visual reconstruction.

