Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training

1The Hong Kong Polytechnic University, 2OPPO Research Institute
Teaser Image

Overview

[Question]: Can internal features be used as effective semantic guidance signals to improve the training of DiT models?

[Answer]: Yes! They can even provide better feature guidance than the external pretrained DINO features used in REPA.

[Question]: Can any feature serve as effective guidance?

[Answer]: No. We find that the most effective guiding features should meet two criteria (a quick check of criterion (2) is sketched after this list):

(1) They should have a clean structure, in the sense that they can effectively help shallow blocks distinguish noise from signal.

(2) They should be semantically discriminative, making it easier for shallow layers to learn effective representations.
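As a hands-on illustration of criterion (2), one can fit a linear probe on pooled guiding features: the higher the probe accuracy, the more semantically discriminative the features. The sketch below is our own minimal example, not the paper's protocol; the function name, inputs, and training setup are assumptions.

```python
import torch
import torch.nn.functional as F

def separability_probe(feats, labels, num_classes, steps=200, lr=1e-2):
    # feats:  (N, C) pooled per-image guiding features (hypothetical input)
    # labels: (N,) integer class ids
    probe = torch.nn.Linear(feats.shape[1], num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(probe(feats), labels)  # train only the linear probe
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = (probe(feats).argmax(dim=-1) == labels).float().mean()
    return acc.item()  # higher accuracy -> more discriminative guiding features
```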

Overview Figure

REPA leverages pre-trained DINO model features to effectively highlight semantically meaningful regions. In contrast, SRA and LayerSync rely on internal features from the in-training model, whose weak semantic representations limit the guidance they provide to shallow layers. Our approach instead generates features with clearer structural organization and richer semantics.

Method

We propose a two-stage training framework, namely Self-Transcendence.

First, we use clean VAE features as guidance to help the model distinguish useful information from noise in its shallow layers. After a certain number of iterations, the model has learned more meaningful representations.
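Conceptually, Stage 1 is a representation-alignment objective with the clean VAE latents as the target. The following is a minimal PyTorch sketch assuming a small projector maps shallow-block tokens into the latent space and alignment is measured by cosine similarity; the exact loss form, projector, and tensor shapes are our assumptions, not the paper's specification.

```python
import torch.nn.functional as F

def stage1_vae_alignment(block_tokens, clean_latents, projector):
    # block_tokens:  (B, N, C) tokens from a shallow DiT block on the noisy input
    # clean_latents: (B, N, D) patchified clean VAE latents of the same image
    # projector:     small MLP mapping C -> D (assumed, as in REPA-style alignment)
    pred = F.normalize(projector(block_tokens), dim=-1)
    tgt = F.normalize(clean_latents, dim=-1)
    return -(pred * tgt).sum(dim=-1).mean()  # maximize token-wise cosine similarity
```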

We then freeze this model and use its representations as a fixed teacher. To enhance the semantic expressiveness of the features, we build a self-guided representation that better aligns with the target conditions.
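Stage 2 then swaps the alignment target: the frozen Stage-1 model supplies the teacher features. A hedged sketch is shown below; the feature-extraction hook, block index, and conditioning interface are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_features(frozen_model, x_t, t, cond, block=8):
    # hypothetical hook: tokens from a deeper block of the frozen Stage-1 model;
    # conditioning is passed so the teacher representation reflects the target conditions
    return frozen_model.forward_with_features(x_t, t, cond, block=block)

def stage2_self_guidance(student_tokens, teacher_tokens, projector):
    # same cosine alignment as Stage 1, but against the model's own frozen features
    pred = F.normalize(projector(student_tokens), dim=-1)
    tgt = F.normalize(teacher_tokens, dim=-1)
    return -(pred * tgt).sum(dim=-1).mean()
```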

Framework Figure

The framework of our proposed Self-Transcendence. The spark icon indicates that the parameters of this layer are trainable, while the snowflake icon indicates that they are frozen.

Result

Self-Transcendence demonstrates superior class separability.

Despite the strong semantic separability obtained by REPA, it relies heavily on extensive and time-consuming pre-training with external data, which may not always be feasible or desirable. Although recent methods such as LayerSync and SRA achieve self-acceleration, their features lack the stable structural guidance and semantic separability needed for training. In contrast, our Self-Transcendence not only achieves self-acceleration but also demonstrates semantic separability comparable to that of REPA.

Motivation Study

t-SNE visualizations of the guiding features extracted from (a) REPA, (b) LayerSync, (c) VAE features, and (d) our Self-Transcendence with t = 0.4 at the 200K iteration of SiT-XL/2. Different colors represent different classes. Like REPA, our internal guiding features demonstrate superior class separability.
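Plots of this kind can be reproduced with scikit-learn on pooled guiding features. A minimal sketch follows; the random placeholder data stands in for real extracted features and labels.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data: substitute (N, C) pooled guiding features and their class ids.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 768)).astype(np.float32)
labels = rng.integers(0, 10, size=500)

emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=4)
plt.title("Class separability of guiding features (t-SNE)")
plt.show()
```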

Self-Transcendence is easier, more efficient, and more effective.

Compared to methods that rely on external features, such as REPA, our Self-Transcendence is easier and more efficient since it uses no external data. Moreover, it outperforms all other self-contained techniques, achieving results comparable to or even better than REPA.

Quantitative Results

Comparisons with acceleration methods across different backbones on ImageNet. ↓ and ↑ indicate whether lower or higher values are better, respectively.

Self-Transcendence obtains better structural generation.

We compare the vanilla SiT model, the REPA-enhanced model, and our trained model across 100K to 400K iterations on two ImageNet classes, as illustrated in the figure below. Both our model and REPA converge faster than the vanilla SiT model, and our method tends to obtain better structural generation. For example, in the shoe class, our method generates more realistic shapes.

Interactive Generation Figure

Visual comparison of generated samples from SiT-XL/2 models at different training iterations. For all models, we apply the same seed, noise, and sampling strategy with a CFG scale of 4.0.
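For reference, the fixed-seed, CFG-4.0 protocol looks roughly like the following; the model signature and the velocity-style output are assumptions based on standard SiT-style samplers, not the exact evaluation code.

```python
import torch

torch.manual_seed(0)           # same seed -> identical starting noise for every model
x = torch.randn(1, 4, 32, 32)  # shared initial noise (assumed latent shape)

def cfg_prediction(model, x_t, t, cond, null_cond, scale=4.0):
    # classifier-free guidance at the scale of 4.0 used in the figure:
    # push the conditional prediction away from the unconditional one
    out_cond = model(x_t, t, cond)
    out_uncond = model(x_t, t, null_cond)
    return out_uncond + scale * (out_cond - out_uncond)
```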

BibTeX

@misc{sun2026externalguidanceunleashingsemantic,
      title={Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training}, 
      author={Lingchen Sun and Rongyuan Wu and Zhengqiang Zhang and Ruibin Li and Yujing Sun and Shuaizheng Liu and Lei Zhang},
      year={2026},
      eprint={2601.07773},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.07773}, 
}