6 months ago

Abstract

This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Diffusion models have gained prominence for their effectiveness in high-fidelity image generation. While conventional approaches rely on convolutional U-Net architectures, recent Transformer-based designs have demonstrated superior performance and scalability. However, Transformer architectures, which tokenize input data (via "patchification"), face a trade-off between visual fidelity and computational complexity due to the quadratic nature of self-attention operations concerning token length. While larger patch sizes enable attention computation efficiency, they struggle to capture fine-grained visual details, leading to image distortions. To address this challenge, we propose augmenting the Diffusion model with the Multi-Resolution network (DiMR), a framework that refines features across multiple resolutions, progressively enhancing detail from low to high resolution. Additionally, we introduce Time-Dependent Layer Normalization (TD-LN), a parameter-efficient approach that incorporates time-dependent parameters into layer normalization to inject time information and achieve superior performance. Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, where DiMR-XL variants outperform prior diffusion models, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512. Project page: https://qihao067.github.io/projects/DiMR
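
The trade-off between patch size and attention cost described in the abstract can be illustrated with a short arithmetic sketch. This is not code from the paper: the function names `num_tokens` and `attention_pairs` are hypothetical, and the 32 x 32 input grid is an assumed example size chosen only for illustration. The sketch counts how many tokens non-overlapping patchification produces and how many query-key pairs full self-attention must then score.

```python
def num_tokens(grid_size: int, patch_size: int) -> int:
    """Tokens produced by splitting a square grid into non-overlapping square patches."""
    if grid_size % patch_size != 0:
        raise ValueError("grid_size must be divisible by patch_size")
    patches_per_side = grid_size // patch_size
    return patches_per_side * patches_per_side


def attention_pairs(tokens: int) -> int:
    """Query-key pairs scored by full self-attention: quadratic in the token count."""
    return tokens * tokens


if __name__ == "__main__":
    grid_size = 32  # assumed example input size, not taken from the paper
    for patch_size in (8, 4, 2):
        n = num_tokens(grid_size, patch_size)
        print(f"patch={patch_size}: tokens={n}, attention pairs={attention_pairs(n)}")
```

Under these assumptions, halving the patch size quadruples the token count and multiplies the number of attention pairs by sixteen, which is the cost that makes small patches expensive and large patches attractive despite their loss of fine detail.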
