EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, Jiaming Liu

Abstract

Recent advancements in UNet-based diffusion models, such as ControlNet and IP-Adapter, have introduced effective spatial and subject control mechanisms. However, the DiT (Diffusion Transformer) architecture still struggles with efficient and flexible control. To tackle this issue, we propose EasyControl, a novel framework designed to unify condition-guided diffusion transformers with high efficiency and flexibility. Our framework is built on three key innovations. First, we introduce a lightweight Condition Injection LoRA Module. This module processes conditional signals in isolation, acting as a plug-and-play solution. It avoids modifying the base model weights, ensuring compatibility with customized models and enabling the flexible injection of diverse conditions. Notably, this module also supports harmonious and robust zero-shot multi-condition generalization, even when trained only on single-condition data. Second, we propose a Position-Aware Training Paradigm. This approach standardizes input conditions to fixed resolutions, allowing the generation of images with arbitrary aspect ratios and flexible resolutions. At the same time, it optimizes computational efficiency, making the framework more practical for real-world applications. Third, we develop a Causal Attention Mechanism combined with the KV Cache technique, adapted for conditional generation tasks. This innovation significantly reduces the latency of image synthesis, improving the overall efficiency of the framework. Through extensive experiments, we demonstrate that EasyControl achieves exceptional performance across various application scenarios. These innovations collectively make our framework highly efficient, flexible, and suitable for a wide range of tasks.

One-sentence Summary

The authors, affiliated with Tiamat AI, ShanghaiTech University, National University of Singapore, and Liblib AI, propose EasyControl—a lightweight, plug-and-play framework for diffusion transformers that enables efficient spatial and subject/face control via a Condition Injection LoRA Module, Position-Aware Training, and a Causal Attention Mechanism with KV Cache, achieving zero-shot multi-condition generalization and reduced latency, making it highly suitable for real-world image generation applications.

Key Contributions

  • EasyControl addresses the inefficiency and inflexibility of condition-guided diffusion transformers by introducing a lightweight Condition Injection LoRA Module that processes conditional signals in isolation through a parallel branch, enabling plug-and-play integration without modifying base model weights and supporting robust zero-shot multi-condition generalization even with single-condition training.

  • The framework enhances computational efficiency and resolution flexibility via a Position-Aware Training Paradigm that normalizes input conditions to fixed resolutions and employs Position-Aware Interpolation, allowing consistent generation across arbitrary aspect ratios and resolutions while reducing sequence length and inference overhead.

  • By replacing full attention with a Causal Attention Mechanism integrated with KV Cache, EasyControl achieves significant latency reduction through precomputed and reused condition feature key-value pairs, marking the first application of KV Cache in conditional generation and substantially improving inference speed.

Introduction

The authors leverage the growing adoption of Diffusion Transformers (DiT) in image generation, which offer higher quality and resolution than traditional UNet-based models but face challenges in efficiency, multi-condition control, and plug-and-play flexibility. Prior methods suffer from quadratic computational costs due to full attention over long token sequences, struggle with stable coordination across multiple conditions—especially in zero-shot combinations—and often introduce parameter conflicts that degrade performance during style transfer or customization. To address these issues, the authors introduce EasyControl, a lightweight, plug-and-play framework that enables efficient and flexible condition-guided generation. It achieves this through three core innovations: a Condition Injection LoRA module that isolates condition signals in a parallel branch, preserving the frozen backbone while enabling seamless integration; a Position-Aware Training Paradigm that normalizes input resolution and interpolates tokens to maintain spatial consistency across resolutions; and a Causal Attention mechanism with KV Cache that precomputes and reuses condition features, drastically reducing inference latency. Together, these advances enable high-efficiency, zero-shot multi-condition generalization, robust resolution flexibility, and strong compatibility with custom models—advancing the practical deployment of DiT-based generation systems.

Dataset

  • The dataset is composed of multiple specialized subsets tailored to different control tasks: MultiGen-20M for spatial control (depth, canny, OpenPose), Subject200K for subject control, and a curated subset of LAION-Face combined with a private multi-view human dataset for face control.
  • MultiGen-20M contains 20 million images and serves as the primary source for spatial control tasks. Subject200K provides 200,000 images focused on subject consistency. The LAION-Face subset is filtered for high-quality face images, augmented with a private collection of multi-view human images.
  • All human images in the private multi-view dataset are preprocessed using InsightFace to ensure precise cropping and facial alignment, enhancing input consistency and accuracy (a preprocessing sketch follows this list).
  • The authors use these datasets to train their model by combining them into a training mixture, with specific ratios optimized for each control type. The data is processed to align inputs with corresponding control signals, and cropping strategies are applied uniformly to maintain spatial and semantic coherence across all subsets.
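A minimal, hedged sketch of the face preprocessing step described above, using InsightFace's bundled detector; the model name (`buffalo_l`), detection size, and the 512-pixel aligned crop are illustrative assumptions rather than the authors' exact settings:

```python
# Hedged sketch of face cropping/alignment with InsightFace, consistent with the
# preprocessing described above; detector name, det_size, and output size are
# assumptions, not the authors' exact configuration.
import cv2
from insightface.app import FaceAnalysis
from insightface.utils import face_align

app = FaceAnalysis(name="buffalo_l")          # bundled detection + landmark models
app.prepare(ctx_id=0, det_size=(640, 640))

def crop_and_align(image_path: str, out_size: int = 512):
    img = cv2.imread(image_path)              # BGR image, as InsightFace expects
    faces = app.get(img)
    if not faces:
        return None                           # skip images without a detected face
    # Keep the largest detected face by bounding-box area.
    face = max(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
    # Similarity-transform crop using the 5-point landmarks for consistent alignment.
    return face_align.norm_crop(img, landmark=face.kps, image_size=out_size)
```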

Method

The authors leverage the FLUX.1 diffusion transformer architecture as the foundation for EasyControl, extending it with a modular framework designed for efficient and flexible condition-guided image generation. The overall framework integrates several key components: a Condition Injection LoRA Module, a Position-Aware Training Paradigm, a Causal Attention mechanism, and a KV Cache for inference. Refer to the framework diagram for a visual overview of the system.

The core of the method is the Condition Injection LoRA Module, which enables the efficient and plug-and-play integration of conditional signals into the pre-trained DiT model. This module operates by introducing a dedicated Condition Branch that processes the input condition independently. The authors apply Low-Rank Adaptation (LoRA) to adaptively enhance the query, key, and value (QKV) features of the Condition Branch, while leaving the text and noise branches unmodified. This targeted adaptation allows the model to inject conditional information without disrupting the pre-trained representations of text and noise, ensuring high-fidelity generation. The LoRA transformation is defined as $\Delta Q_c, \Delta K_c, \Delta V_c = B_Q A_Q Z_c,\; B_K A_K Z_c,\; B_V A_V Z_c$, where $A_i, B_i$ are low-rank matrices and $Z_c$ denotes the condition-branch features, and the updated QKV features are $Q_c' = Q_c + \Delta Q_c$, etc. This design ensures that the model can flexibly integrate diverse conditions while maintaining compatibility with customized models.
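A minimal PyTorch sketch of this idea follows; the module and parameter names (`CondInjectionLoRA`, `hidden_dim`, `rank`) are illustrative assumptions rather than the authors' implementation, but the update mirrors the $Q_c' = Q_c + B_Q A_Q Z_c$ form above.

```python
# Minimal sketch (PyTorch) of the Condition Injection LoRA idea described above.
# Names and hyperparameters are illustrative, not the authors' code.
import torch
import torch.nn as nn

class CondInjectionLoRA(nn.Module):
    """Low-rank updates applied only to the condition branch's Q/K/V."""

    def __init__(self, hidden_dim: int, rank: int = 16):
        super().__init__()
        # One (A, B) pair per projection; B is zero-initialized so training
        # starts from the frozen base model's behavior.
        self.A = nn.ModuleDict({k: nn.Linear(hidden_dim, rank, bias=False) for k in "qkv"})
        self.B = nn.ModuleDict({k: nn.Linear(rank, hidden_dim, bias=False) for k in "qkv"})
        for k in "qkv":
            nn.init.zeros_(self.B[k].weight)

    def forward(self, q_c, k_c, v_c, z_c):
        # Q'_c = Q_c + B_Q A_Q Z_c (and likewise for K, V); the text and noise
        # branches are left untouched, so the base weights stay frozen.
        q_c = q_c + self.B["q"](self.A["q"](z_c))
        k_c = k_c + self.B["k"](self.A["k"](z_c))
        v_c = v_c + self.B["v"](self.A["v"](z_c))
        return q_c, k_c, v_c
```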

To manage the flow of information between the different input modalities, the framework employs a Causal Attention mechanism. This unidirectional attention restricts each position in the sequence to attend only to previous positions and itself, enforcing a causal structure. The authors design two specialized causal attention mechanisms to handle different scenarios. For single-condition training, Causal Conditional Attention is used, which blocks attention from the condition branch to the denoising (text and noise) branch, allowing only the reverse flow. This isolation enables decoupled Key-Value (KV) Cache states for each branch during inference, reducing redundant computation. For multi-condition inference, Causal Mutual Attention is employed. This mechanism allows all conditions to interact normally with the denoising tokens but prevents cross-condition interactions by applying a mask that blocks attention between tokens from different condition blocks. This ensures that while multiple conditions are integrated, they do not interfere with each other during generation.
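The masking logic can be illustrated with a short sketch. The snippet below assumes a token layout of denoising (text + noise) tokens followed by condition tokens and boolean masks where `True` means "may attend"; it is not the authors' code, but it captures the two rules: conditions never attend to denoising tokens, and under multi-condition inference, tokens of different conditions never attend to each other.

```python
# Minimal sketch of the two causal attention masks described above.
# Layout assumption: [denoising tokens | condition tokens, one block per condition].
import torch

def causal_conditional_mask(n_den: int, n_cond: int) -> torch.Tensor:
    """Single-condition training: condition tokens cannot attend to denoising tokens."""
    n = n_den + n_cond
    mask = torch.ones(n, n, dtype=torch.bool)
    mask[n_den:, :n_den] = False   # block condition -> denoising attention only
    return mask

def causal_mutual_mask(n_den: int, cond_lens: list[int]) -> torch.Tensor:
    """Multi-condition inference: additionally block attention across condition blocks."""
    n = n_den + sum(cond_lens)
    mask = torch.ones(n, n, dtype=torch.bool)
    mask[n_den:, :n_den] = False   # keep condition K/V independent of denoising tokens
    starts = [n_den]
    for length in cond_lens:
        starts.append(starts[-1] + length)
    for i in range(len(cond_lens)):
        for j in range(len(cond_lens)):
            if i != j:             # tokens of condition i may not attend to condition j
                mask[starts[i]:starts[i + 1], starts[j]:starts[j + 1]] = False
    return mask

# Example: two conditions of 1024 tokens each alongside 4096 denoising tokens.
# mask = causal_mutual_mask(4096, [1024, 1024])
```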

The Position-Aware Training Paradigm is designed to improve computational efficiency and resolution flexibility. It involves downscaling high-resolution control signals to a fixed target resolution (e.g., $512 \times 512$) before encoding them into latent space. To preserve spatial alignment, especially for spatial conditions like canny maps, the authors introduce Position-Aware Interpolation (PAI). This strategy interpolates position encodings during the resizing process, ensuring that the spatial relationships between patches in the original and resized images are maintained. For subject conditions, a PE Offset Strategy is applied, which adds a fixed displacement to the position encodings in the height dimension to separate them from spatial conditions. The loss function used for training is a flow-matching loss, defined as $\bar{L}_{RF} = \mathbb{E}_{t,\, \epsilon \sim N(0, I)} \left\| v_\theta(z, t, c_i) - (\epsilon - x_0) \right\|_2^2$, which guides the model to predict the correct velocity field for denoising.
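The following sketch illustrates both pieces under stated assumptions: position ids for the resized condition tokens are rescaled linearly back to the original image's extent (the exact interpolation scheme, patch size, and PE offset value are assumptions), and the flow-matching loss follows the formula above.

```python
# Minimal sketch of Position-Aware Interpolation and the flow-matching loss
# described above. The linear rescaling of position ids, the patch size, and the
# fixed PE offset are illustrative assumptions, not the authors' exact settings.
import torch

def position_aware_ids(orig_h: int, orig_w: int, target: int = 512,
                       patch: int = 16, pe_offset: float = 0.0) -> torch.Tensor:
    """(row, col) position ids for tokens of a condition resized to target x target."""
    n_tok = target // patch                               # token grid side after resizing
    scale_h, scale_w = orig_h / target, orig_w / target   # map back to the original extent
    rows = torch.arange(n_tok, dtype=torch.float32) * scale_h + pe_offset
    cols = torch.arange(n_tok, dtype=torch.float32) * scale_w
    grid = torch.stack(torch.meshgrid(rows, cols, indexing="ij"), dim=-1)
    return grid.reshape(-1, 2)                            # one (row, col) pair per token

def flow_matching_loss(v_pred: torch.Tensor, x0: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """L_RF = E || v_theta(z, t, c_i) - (eps - x0) ||_2^2, as in the formula above."""
    return ((v_pred - (eps - x0)) ** 2).mean()

# Spatial conditions keep pe_offset = 0 so they stay aligned with the noise latents;
# subject conditions would use a fixed height offset to keep their positions disjoint.
```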

Finally, the framework achieves efficient inference by leveraging the KV Cache technique. The unique design of the Causal Attention mechanism, which isolates the conditioning branch from the denoising timestep, allows the Key-Value pairs of all conditional features to be precomputed and stored only once at the initial timestep. These cached pairs are then reused across all subsequent denoising steps, eliminating the need for $N$-fold recomputation and significantly reducing inference latency.
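A compact sketch of this caching pattern is given below, with `cond_kv_proj` standing in for the (LoRA-adapted) condition key/value projection; it illustrates the reuse logic rather than the authors' implementation.

```python
# Minimal sketch of the KV Cache reuse described above: the condition branch's
# keys/values are computed once at the first denoising step and reused afterwards,
# since the causal mask makes them independent of the denoising tokens and timestep.
import torch

class ConditionKVCache:
    def __init__(self):
        self.k = None
        self.v = None

    def get(self, cond_tokens: torch.Tensor, cond_kv_proj):
        if self.k is None:                       # first denoising step only
            self.k, self.v = cond_kv_proj(cond_tokens)
        return self.k, self.v                    # reused at every later step

# During inference, each denoising step concatenates the cached condition K/V with
# the freshly computed text/noise K/V before attention, avoiding N-fold recomputation.
```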

Experiment

  • Single-condition generation: Validates robust text consistency and controllability across Canny, Depth, and Subject conditions. On COCO 2017 and Concept-101 benchmarks, achieved state-of-the-art performance in controllability, text consistency (CLIP-Score), and generation quality (FID, MAN-IQA), outperforming ControlNet, OminiControl, and Uni-ControlNet.
  • Multi-condition integration: Demonstrates superior identity preservation and controllability under face + OpenPose conditions. On a custom dataset, achieved the best Face Similarity, lowest MJPE (controllability), lowest FID, highest MAN-IQA, and highest CLIP-Score, surpassing ControlNet+IP-Adapter, ControlNet+Redux, Uni-ControlNet, and ID customization methods.
  • Resolution adaptability: Maintains strong controllability and image quality across resolutions from low to high (up to 2560×3520), outperforming ControlNet and OminiControl, which exhibit distortion and degradation at extreme resolutions.
  • Efficiency: On a single A100 GPU, achieved 16.3 seconds inference time for single-condition generation (a 58% reduction vs. the ablated version) and 18.3 seconds for dual-condition tasks (a 75% reduction vs. the ablated version), with only 15M parameters (vs. 3B for ControlNet), demonstrating high efficiency and compactness.

Results show that the proposed method achieves competitive identity preservation with CLIP-I and DINO-I scores, while outperforming baseline methods in generative quality as measured by FID and MAN-IQA, and achieving the highest text consistency with a CLIP-Score of 0.283. The authors use this table to demonstrate superior performance in subject control tasks compared to IP-Adapter, OminiControl, and Uni-ControlNet.

Results show that the proposed method achieves the fastest inference time of 16.3 seconds in single-condition settings and 18.3 seconds in double-condition settings, with a parameter count of 15M and 30M respectively, outperforming baseline methods in efficiency while maintaining a significantly smaller model size. The full model demonstrates a 58% reduction in inference time compared to the ablated version without PATP and KV Cache in single-condition tasks, and a 75% reduction in double-condition tasks, highlighting the effectiveness of these mechanisms in improving inference speed without compromising model compactness.

Results show that the proposed method achieves the best performance across all metrics in multi-condition generation with OpenPose and face inputs. It attains the highest face similarity, the lowest mean joint position error, the best generative quality, and the strongest text consistency compared to baseline methods.

Results show that the proposed method achieves the highest controllability and text consistency under Canny conditions, with an F1 score of 0.311 and a CLIP-Score of 0.286, while also achieving the best generative quality with a MAN-IQA score of 0.503. Under depth conditions, the method demonstrates superior controllability with a score of 1092 and maintains strong text consistency with a CLIP-Score of 0.289, while achieving competitive generative quality with a MAN-IQA score of 0.469.

