7 months ago

Abstract

Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D-consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.
