Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

Xiaoyu Yue Zidong Wang Yuqing Wang Wenlong Zhang Xihui Liu Wanli Ouyang Lei Bai Luping Zhou

Abstract

Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.
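The abstract does not specify the exact form of the self-supervised objectives, but the general idea of augmenting the next-token prediction loss with an auxiliary consistency term can be sketched as follows. This is a minimal illustration, not the paper's implementation: the DINO/SimSiam-style view-alignment term, the function names, and the weighting factor `lam` are all assumptions introduced here for clarity.

```python
import math

def cross_entropy(logits, target):
    # Standard next-token prediction loss for one position:
    # softmax over the vocabulary, then negative log-likelihood.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -math.log(exps[target] / z)

def cosine_distance(a, b):
    # Illustrative self-supervised alignment term: 1 - cosine similarity
    # between feature vectors of two views of the same image
    # (a DINO/SimSiam-style choice, assumed here, not taken from the paper).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def self_guided_loss(logits, target, feat_view1, feat_view2, lam=0.5):
    # Hypothetical combined objective: generative loss plus a weighted
    # self-supervised consistency loss, reflecting the abstract's claim
    # that ST-AR adds self-supervised objectives during training.
    return cross_entropy(logits, target) + lam * cosine_distance(feat_view1, feat_view2)

# Example: when both views produce identical features, the auxiliary
# term vanishes and the loss reduces to plain next-token prediction.
loss = self_guided_loss([1.0, 2.0, 0.5], 1, [1.0, 2.0], [1.0, 2.0])
```

In a real model the features would come from intermediate transformer activations and the loss would be averaged over all token positions; the scalar version above only shows how the two terms combine.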
