6 months ago

Abstract

3D local editing of specified regions is crucial for game industry and robotinteraction. Recent methods typically edit rendered multi-view images and thenreconstruct 3D models, but they face challenges in precisely preservingunedited regions and overall coherence. Inspired by structured 3D generativemodels, we propose VoxHammer, a novel training-free approach that performsprecise and coherent editing in 3D latent space. Given a 3D model, VoxHammerfirst predicts its inversion trajectory and obtains its inverted latents andkey-value tokens at each timestep. Subsequently, in the denoising and editingphase, we replace the denoising features of preserved regions with thecorresponding inverted latents and cached key-value tokens. By retaining thesecontextual features, this approach ensures consistent reconstruction ofpreserved areas and coherent integration of edited parts. To evaluate theconsistency of preserved regions, we constructed Edit3D-Bench, ahuman-annotated dataset comprising hundreds of samples, each with carefullylabeled 3D editing regions. Experiments demonstrate that VoxHammersignificantly outperforms existing methods in terms of both 3D consistency ofpreserved regions and overall quality. Our method holds promise forsynthesizing high-quality edited paired data, thereby laying the datafoundation for in-context 3D generation. See our project page athttps://huanngzh.github.io/VoxHammer-Page/.

Source PDF View Code