Guiding a Diffusion Transformer with the Internal Dynamics of Itself
Abstract
Diffusion models have a powerful ability to capture the entire (conditional) data distribution. However, lacking sufficient training and data to cover low-probability regions, they are penalized for failing to generate high-quality images corresponding to these regions. To improve generation quality, guidance strategies such as classifier-free guidance (CFG) can steer samples toward high-probability regions during sampling. However, standard CFG often leads to over-simplified or distorted samples. The alternative line of work, guiding a diffusion model with a bad version of itself, is limited by carefully designed degradation strategies, extra training, and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which introduces auxiliary supervision on an intermediate layer during training and extrapolates between the intermediate and deep layers' outputs during sampling. This simple strategy yields significant improvements in both training efficiency and generation quality across various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 at 80 epochs and FID=1.75 at 800 epochs. More impressively, LightningDiT-XL/1+IG achieves FID=1.34, outperforming all of these methods by a large margin. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.
Community
🔥 New SOTA on 256 × 256 ImageNet generation. We present Internal Guidance (IG), a simple yet powerful guidance mechanism for Diffusion Transformers. LightningDiT-XL/1 + IG sets a new state of the art with FID = 1.07 on ImageNet (balanced sampling), while achieving FID = 1.24 without classifier-free guidance. IG delivers dramatic quality gains with far fewer training epochs, adds negligible overhead, and works as a drop-in upgrade for modern diffusion transformers.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank (2025)
- MeanFlow Transformers with Representation Autoencoders (2025)
- One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation (2025)
- Saddle-Free Guidance: Improved On-Manifold Sampling without Labels or Additional Training (2025)
- PixelDiT: Pixel Diffusion Transformers for Image Generation (2025)
- Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization (2025)
- SoFlow: Solution Flow Models for One-Step Generative Modeling (2025)
arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/guiding-a-diffusion-transformer-with-the-internal-dynamics-of-itself-3782-46c86a94
Main Results of Internal Guidance for Diffusion Transformers
Core Innovation: Internal Guidance (IG)
The paper introduces Internal Guidance (IG), a simple yet highly effective strategy that guides diffusion transformers using their own internal dynamics. The key insight is to leverage the difference between intermediate and deep layer outputs within the same model to improve generation quality.
Framework Overview
Figure 1: The IG framework introduces auxiliary supervision during training and uses intermediate layer outputs during sampling to guide the final generation results. With an IG scale of 1.4, a CFG scale of 1.45, and a guidance interval, the model achieves the state-of-the-art FID of 1.19.
Key Technical Contributions
1. Training with Intermediate Supervision
- Adds an auxiliary loss on intermediate-layer outputs during training (see the sketch after this list):
- L_intermediate = ||D_i(x_t, t) - x_0||²
- L_final = ||D_f(x_t, t) - x_0||²
- Total loss: L = L_final + λ × L_intermediate
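A minimal PyTorch-style sketch of how such an auxiliary loss could be wired into a training step. The linear noising interpolant, the `return_intermediate` flag, and the `intermediate_head` name are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def ig_training_loss(model, intermediate_head, x0, cond, lam=1.0):
    """Denoising loss with auxiliary supervision on an intermediate layer.

    Assumptions (illustrative, not from the paper): the model returns its
    final prediction together with one intermediate hidden state, the noising
    process is a simple linear interpolant (flow-matching style, as in SiT),
    and `intermediate_head` maps the hidden state to the same target space
    as the final output.
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    x_t = (1.0 - t) * x0 + t * noise               # assumed linear interpolant

    pred_final, h_mid = model(x_t, t.flatten(), cond, return_intermediate=True)
    pred_mid = intermediate_head(h_mid)

    target = x0                                    # or noise / velocity, depending on parameterization
    loss_final = F.mse_loss(pred_final, target)
    loss_mid = F.mse_loss(pred_mid, target)
    return loss_final + lam * loss_mid             # L = L_final + λ × L_intermediate
```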
2. Sampling with Internal Guidance
During sampling, IG extrapolates between the intermediate and deep layers' outputs:
D_w(x; c) = D_i(x; c) + w(D_f(x; c) - D_i(x; c))
This creates a self-guidance effect without requiring additional sampling steps or degraded models.
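A hedged sketch of the extrapolation rule inside a sampler. It assumes a single forward pass yields both the intermediate and the final prediction, and the `intermediate_head` attribute name is an assumption carried over from the training sketch above.

```python
import torch

@torch.no_grad()
def internal_guidance(model, x_t, t, cond, w=1.4):
    """Apply D_w(x; c) = D_i(x; c) + w * (D_f(x; c) - D_i(x; c)).

    Assumption: one forward pass returns the final (deep) prediction D_f and
    the intermediate hidden state, so guidance adds no extra network
    evaluations; `model.intermediate_head` is the head trained with the
    auxiliary loss (name is illustrative).
    """
    d_final, h_mid = model(x_t, t, cond, return_intermediate=True)
    d_mid = model.intermediate_head(h_mid)   # D_i: intermediate-layer prediction
    return d_mid + w * (d_final - d_mid)     # w > 1 extrapolates past the weaker prediction
```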
Quantitative Results
State-of-the-Art Performance on ImageNet 256×256
| Method | Epochs | FID↓ (w/o CFG) | FID↓ (w/ CFG) |
|---|---|---|---|
| SiT-XL/2+IG | 80 | 5.31 | - |
| SiT-XL/2+IG | 800 | 1.75 | 1.46 |
| LightningDiT-XL/1+IG | 60 | 2.42 | - |
| LightningDiT-XL/1+IG | 680 | 1.34 | 1.19 |
Key Achievement: LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19 when combined with CFG.
Visual Quality Comparison
Figure 3: Analysis on a 2D toy dataset shows:
- (a) Standard sampling produces outliers
- (b) CFG reduces outliers but loses diversity
- (c) IG maintains diversity like Autoguidance
- (d) IG+CFG achieves superior results by combining their strengths
Figure 7: IG alone produces better results than baseline, and combining IG with CFG further improves generation quality by reducing inaccurate content while maintaining visual fidelity.
Scalability Analysis
Figure 6: The relative improvement of IG over vanilla models becomes increasingly significant as model size grows, demonstrating excellent scalability properties.
Key Advantages
1. Plug-and-Play Design
- No separately trained degraded model or extra guidance network required
- Compatible with existing diffusion transformers
- Minimal computational overhead (<0.5% increase in parameters)
2. Diversity Preservation
Unlike CFG, which can lead to over-simplified results, IG maintains sample diversity while improving quality.
3. Training Acceleration
The intermediate supervision alone alleviates gradient vanishing and achieves convergence comparable to complex self-supervised regularization methods.
4. Complementary with Existing Methods
- Combines effectively with CFG (see the sketch after this list)
- Compatible with guidance intervals
- Works across different model architectures (DiT, SiT, LightningDiT)
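One plausible way to compose IG with CFG inside a guidance interval, sketched under loud assumptions: the interval bounds are placeholders, the sequential composition (IG on each branch, then CFG across branches) is not taken from the paper, and `internal_guidance` refers to the sketch above.

```python
import torch

@torch.no_grad()
def ig_plus_cfg(model, x_t, t, cond, null_cond, w_ig=1.4, w_cfg=1.45,
                interval=(0.0, 0.75)):
    """One plausible composition of IG with CFG inside a guidance interval.

    Assumptions (not taken from the paper): guidance is applied only when the
    scalar timestep `t` lies inside `interval` (bounds are placeholders), and
    CFG extrapolates between the IG-guided conditional and unconditional
    predictions.
    """
    lo, hi = interval
    if not (lo <= float(t) <= hi):
        # Outside the interval, fall back to the plain conditional prediction
        # (IG with w = 1 reduces to the final-layer output).
        return internal_guidance(model, x_t, t, cond, w=1.0)

    d_cond = internal_guidance(model, x_t, t, cond, w=w_ig)
    d_uncond = internal_guidance(model, x_t, t, null_cond, w=w_ig)
    return d_uncond + w_cfg * (d_cond - d_uncond)  # standard CFG extrapolation
```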
Computational Efficiency
| Method | #Params (M) | FLOPs (G)↓ | Latency (s)↓ | FID↓ |
|---|---|---|---|---|
| SiT-XL/2 + REPA | 675 | 114.46 | 6.18 | 5.90 |
| SiT-XL/2 + IG | 678 (+0.44%) | 114.47 (+0.01%) | 6.19 (+0.16%) | 1.75 (-70.34%) |
IG achieves substantial FID improvement with negligible computational overhead.
Conclusion
Internal Guidance represents a breakthrough in diffusion model guidance strategies by:
- Eliminating the need for complex degradation strategies or additional training
- Achieving state-of-the-art results with minimal overhead
- Providing a scalable solution that works better with larger models
- Enabling practical deployment in large-scale image generation systems
The method's simplicity combined with its effectiveness makes it a compelling addition to the diffusion model toolkit, particularly for large-scale applications where efficiency and quality are both critical.