Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie

Abstract

Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256×256 (no guidance) and 1.13 at both 256×256 and 512×512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.
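
To make the RAE idea concrete, here is a minimal sketch (not the authors' released code) of the structure the abstract describes: a frozen pretrained representation encoder paired with a trainable decoder, so that diffusion runs in the encoder's token space rather than a VAE latent. The class name, the 4-layer transformer decoder, the linear patch head, and the square-grid reassembly are illustrative assumptions; the paper's actual decoder architecture and reconstruction losses may differ.

```python
import torch
import torch.nn as nn


class RepresentationAutoencoder(nn.Module):
    """Illustrative RAE: frozen representation encoder + trainable decoder.

    Assumes the encoder maps images to a sequence of patch tokens of
    shape (B, N, latent_dim), as DINO/SigLIP/MAE backbones do.
    """

    def __init__(self, encoder: nn.Module, latent_dim: int, patch_size: int = 16):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():   # freeze: only the decoder trains
            p.requires_grad_(False)
        # Hypothetical decoder: a small transformer stack followed by a
        # linear head mapping each latent token back to an RGB patch.
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=8, batch_first=True
        )
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_pixels = nn.Linear(latent_dim, 3 * patch_size ** 2)
        self.patch_size = patch_size

    @torch.no_grad()
    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Pixels -> semantically rich latents; the DiT operates here.
        return self.encoder(x)                # (B, N, latent_dim)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # Latents -> pixels: predict one RGB patch per token, then
        # reassemble the (assumed square) token grid into an image.
        patches = self.to_pixels(self.decoder(z))   # (B, N, 3*p*p)
        B, N, _ = patches.shape
        side, p = int(N ** 0.5), self.patch_size
        img = patches.view(B, side, side, 3, p, p)
        return img.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, side * p, side * p)
```

Under these assumptions, the decoder would be trained with a reconstruction objective such as `F.mse_loss(rae.decode(rae.encode(x)), x)` while the encoder stays frozen, and the diffusion transformer would then be trained on `rae.encode(x)` in place of VAE latents.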
