WiT motivation figure
An overview of our Waypoint Diffusion Transformers (WiT). (a) and (b) show the trajectories before and after waypoints are introduced. In standard pixel-space flow matching (a), mapping noise directly onto an entangled, non-discriminative pixel manifold (d) induces severe trajectory conflict. With discriminative semantic waypoints (c), WiT converts the noise-to-pixel task into two stable, decoupled mappings. Routing the transport path through waypoints disentangles the generative flow and mitigates path overlap. Consequently, WiT converges significantly faster than the baseline (e) while yielding highly realistic generated samples (f).

Method

A decoupled semantic-to-pixel generation pipeline

01

Construct semantic waypoints

Dense DINOv3 features are projected with PCA to a compact 64D semantic manifold. These waypoints preserve discriminative structure without inheriting the full regression burden of high-dimensional foundation-model features.
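The projection in step 01 can be sketched with plain PCA over per-patch features. This is an illustrative sketch under assumed shapes (196 patch tokens, 768-dim features), not the paper's implementation; `build_waypoints` is a hypothetical helper name:

```python
import numpy as np

def build_waypoints(features, d=64):
    """Project dense per-patch features (N, C) onto a compact d-dim
    semantic space via PCA. Returns the waypoints (N, d) plus the fitted
    mean and basis so the same projection can be reused at training time."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Principal directions from the SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:d].T                 # (C, d) top-d principal axes
    waypoints = centered @ basis     # (N, d) compact semantic waypoints
    return waypoints, mean, basis

# Toy usage: 196 patch tokens with 768-dim features (DINOv3-like shape).
rng = np.random.default_rng(0)
feats = rng.standard_normal((196, 768))
wp, mu, W = build_waypoints(feats, d=64)
```

Keeping only the top 64 principal axes is what gives the waypoints their "compact" character: the regression target is far lower-dimensional than the raw foundation-model features.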

02

Predict waypoints from noisy pixels

A 21M-parameter waypoint generator infers clean semantic anchors directly from the current pixel-space noisy state, making semantic navigation dynamic throughout ODE integration rather than a one-shot side condition.
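The dynamic re-prediction can be sketched with a simple Euler integrator: at every step the waypoint generator is re-run on the current noisy state, and the pixel model is conditioned on the fresh prediction. `waypoint_fn` and `velocity_fn` are hypothetical stand-ins for the Waypoints Generator and the Pixel Space Generator; this is a sketch of the control flow, not the paper's sampler:

```python
import numpy as np

def sample(z, velocity_fn, waypoint_fn, steps=50):
    """Euler ODE sampler with per-step semantic waypoints.

    z:           initial noise, shape (N, C).
    waypoint_fn: predicts clean semantic waypoints s_hat from (z, t).
    velocity_fn: predicts the flow velocity from (z, t, s_hat).
    """
    dt = 1.0 / steps
    for i in range(steps):
        t = i / steps
        s_hat = waypoint_fn(z, t)              # dynamic semantic anchor
        z = z + dt * velocity_fn(z, t, s_hat)  # Euler step along the ODE
    return z

# Toy check with a contraction field: v(z, t, s) = -z pulls z toward 0.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8))
out = sample(z0.copy(), lambda z, t, s: -z, lambda z, t: z.mean(axis=-1))
```

The point of the loop structure is that `s_hat` is recomputed inside the integrator, so semantic guidance tracks the evolving state instead of being fixed once before sampling.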

03

Inject semantics with Just-Pixel AdaLN

Instead of concatenating extra tokens or channels, WiT injects semantic guidance as spatially varying affine modulation inside the diffusion transformer, preserving the native pixel attention manifold while enforcing local structure.
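The modulation can be sketched as a layer norm followed by a token-wise affine map, assuming the waypoint network has already produced a per-token scale and shift. A minimal numpy sketch (illustrative shapes, not the paper's code):

```python
import numpy as np

def just_pixel_adaln(x, gamma, beta, eps=1e-6):
    """Spatially varying affine modulation over pixel tokens.

    x:     (N, C) pixel tokens inside a transformer block.
    gamma: (N, C) per-token scale derived from the semantic waypoints.
    beta:  (N, C) per-token shift derived from the semantic waypoints.
    LayerNorm without a learned affine, then (1 + gamma) * x + beta, so
    each spatial location gets its own modulation rather than a single
    global vector as in standard AdaLN."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return (1.0 + gamma) * x_norm + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((196, 256))          # 196 pixel tokens, width 256
gamma = 0.1 * rng.standard_normal((196, 256))
beta = 0.1 * rng.standard_normal((196, 256))
y = just_pixel_adaln(x, gamma, beta)
```

Because the output keeps the same token layout as the input, no extra tokens or channels are introduced, which is what leaves the native pixel attention manifold intact.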

WiT architecture overview
Overview of the WiT architecture. Left: A lightweight Waypoints Generator (21M params) predicts Semantic Waypoints from the noisy state z_t. Right: The Pixel Space Generator synthesizes the image, utilizing these predicted waypoints as spatial conditions via the Just-Pixel AdaLN mechanism.
Just-Pixel AdaLN figure
(a) Just-Pixel AdaLN: The predicted semantic waypoints provide spatially varying modulation. (b) Visualization of the predicted semantic waypoints and intermediate pixel states during inference. Left: The evolving noisy pixel states z_t at different integration timesteps. Right: The corresponding spatial semantic waypoints s_hat_0 dynamically inferred by our lightweight Waypoints Generator.
Algorithm 1: Training procedure of WiT
Algorithm 2: Inference procedure of WiT

Results

Faster convergence, stronger realism, more stable trajectories

Matched JiT-L/16 early

WiT-L/16 reaches FID 2.36 after 265 epochs, matching JiT-L/16 at 600 epochs.

Stronger final scale-up

WiT-XL/16 reaches FID 2.09 and IS 311.8 after 600 epochs on ImageNet 256x256.

Conflict actually drops

Peak pairwise trajectory conflict improves by 1.62x, supporting the claim that semantic waypoints reduce path overlap rather than only masking it.

Comprehensive quantitative comparison

| Method | Params | Epochs | IS | FID-50K |
|---|---|---|---|---|
| *Latent-space Diffusion Models* | | | | |
| DiT-XL/2 | 675M + 49M | - | 278.2 | 2.27 |
| SiT-XL/2 | 675M + 49M | - | 277.5 | 2.06 |
| REPA (SiT-XL/2) | 675M + 49M | - | 305.7 | 1.42 |
| LightningDiT-XL/2 | 675M + 49M | - | 295.3 | 1.35 |
| DDT-XL/2 | 675M + 49M | - | 310.6 | 1.26 |
| RAE (DiT^DH-XL/2) | 839M + 415M | - | 262.6 | 1.13 |
| *Pixel-space Models (Non-diffusion)* | | | | |
| JetFormer | 2.8B | - | - | 6.64 |
| FractalMAR-H | 848M | - | 348.9 | 6.15 |
| *Pixel-space Diffusion Models* | | | | |
| ADM-G | 554M | - | 186.7 | 4.59 |
| RIN | 410M | - | 182.0 | 3.42 |
| SiD (UViT/2) | 2B | - | 256.3 | 2.44 |
| PixelFlow (XL/4) | 677M | - | 282.1 | 1.98 |
| PixNerd (XL/16) | 700M | - | 297.0 | 2.15 |
| JiT-H/16 | 953M | - | 303.4 | 1.86 |
| JiT-G/16 | 2B | - | 292.6 | 1.82 |
| LF-DIT-L/16 | 465M | 200 | - | 2.48 |
| *Direct Baselines & Ours* | | | | |
| JiT-B/16 | 131M | 200 | - | 4.37 |
| WiT-B/16 (Ours) | 131M + 21M | 200 | 270.7 | 3.34 |
| JiT-B/16 | 131M | 600 | 275.1 | 3.66 |
| WiT-B/16 (Ours) | 131M + 21M | 600 | 280.2 | 3.03 |
| JiT-L/16 | 459M | 200 | - | 2.79 |
| WiT-L/16 (Ours) | 459M + 21M | 200 | 289.1 | 2.38 |
| JiT-L/16 | 459M | 600 | 298.5 | 2.36 |
| WiT-L/16 (Ours) | 459M + 21M | 265 | 293.7 | 2.36 |
| WiT-L/16 (Ours) | 459M + 21M | 600 | 303.3 | 2.22 |
| WiT-XL/16 (Ours) | 676M + 21M | 200 | 292.3 | 2.16 |
| WiT-XL/16 (Ours) | 676M + 21M | 600 | 311.8 | 2.09 |

Full main-table results from the paper, covering latent-space diffusion, pixel-space non-diffusion, pixel-space diffusion, and direct JiT/WiT comparisons on ImageNet 256x256.

Trajectory conflict metrics

  • 1.55x more stable midpoint pairwise conflict
  • 1.62x more stable maximum peak conflict
  • 1.13x more stable midpoint CFG relative L2 distance

Ablation summary

  • PCA d=64 is the best semantic bottleneck
  • Just-Pixel AdaLN outperforms channel concat and in-context concat
  • WiT-B/16 reaches IS 270.73 and FID 3.34 at 200 epochs

Repository Status

The code repository is live; training and inference code will be released here

This repository is the official landing point for WiT. The current public state is the project page plus a placeholder code entry. Training and inference code will be released in this same repository after the paper is formally published.

Acknowledgments

Support and thanks

We thank Qiming Hu for insightful discussions and feedback. This work was partially supported by computational resources from TPU Research Cloud (TRC).

Citation

BibTeX