
Abstract

Transformer networks have demonstrated superior performance in many computer vision tasks. In a multi-view 3D reconstruction algorithm following this paradigm, self-attention has to process a large number of intricate image tokens when many views are given as input. This excess of information makes the model extremely difficult to train. To alleviate the problem, recent methods either compress the number of tokens representing each view or discard the attention operations between tokens from different views. Both strategies hurt performance. We therefore propose long-range grouping attention (LGA), based on the divide-and-conquer principle. Tokens from all views are divided into groups, and attention is computed separately within each group. The tokens in each group are sampled from all views and provide a macro-level representation of the view they come from. The diversity among groups guarantees the richness of feature learning. This yields an effective and efficient encoder that connects inter-view features using LGA and extracts intra-view features using the standard self-attention layer. In addition, a novel progressive upsampling decoder is designed to generate voxels at relatively high resolution. Building on these components, we construct a powerful transformer-based network called LRGT. Experimental results on ShapeNet verify that our method achieves SOTA accuracy in multi-view reconstruction. Code will be available at https://github.com/LiyingCV/Long-Range-Grouping-Transformer.
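
The grouping idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how tokens are sampled into groups, so the strided rule used here (token `n` of every view goes to group `n % num_groups`) is an assumption, as are the function names `group_tokens`, `ungroup_tokens` and `grouped_attention`. Learned query/key/value projections and multiple heads are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def group_tokens(tokens, num_groups):
    """Rearrange (V, N, D) tokens into (G, V * N // G, D).

    Assumed sampling rule: token n of each view is assigned to group n % G,
    so every group holds N // G tokens from every view.
    """
    V, N, D = tokens.shape
    assert N % num_groups == 0, "N must be divisible by num_groups"
    x = tokens.reshape(V, N // num_groups, num_groups, D)
    x = x.transpose(2, 0, 1, 3)  # (G, V, N // G, D)
    return x.reshape(num_groups, V * (N // num_groups), D)

def ungroup_tokens(grouped, num_views):
    """Inverse of group_tokens: (G, V * N // G, D) back to (V, N, D)."""
    G, M, D = grouped.shape
    per_view = M // num_views
    x = grouped.reshape(G, num_views, per_view, D)
    x = x.transpose(1, 2, 0, 3)  # (V, N // G, G, D)
    return x.reshape(num_views, per_view * G, D)

def grouped_attention(tokens, num_groups):
    """Scaled dot-product attention computed independently inside each group."""
    V, N, D = tokens.shape
    g = group_tokens(tokens, num_groups)            # (G, M, D)
    scores = g @ g.transpose(0, 2, 1) / np.sqrt(D)  # (G, M, M)
    out = softmax(scores, axis=-1) @ g              # (G, M, D)
    return ungroup_tokens(out, V)                   # (V, N, D)

# Example: 3 views, 8 tokens per view, 4 channels, 2 groups.
tokens = np.arange(3 * 8 * 4, dtype=float).reshape(3, 8, 4)
grouped = group_tokens(tokens, num_groups=2)
attended = grouped_attention(tokens, num_groups=2)
```

In this sketch, full attention over all `V * N` tokens would build one score matrix of size `(V * N)^2`, whereas the grouped version builds `G` matrices of size `(V * N / G)^2`, a factor of `G` fewer entries, while each group still sees tokens from every view.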
