WiT motivation figure
An overview of our Waypoint Diffusion Transformers (WiT). (a) and (b) show the trajectories before and after waypoints are introduced. In standard pixel-space flow matching (a), mapping noise directly onto an entangled, non-discriminative pixel manifold (d) induces severe trajectory conflict. With discriminative semantic waypoints (c), WiT converts the noise-to-pixel task into two stable, decoupled mappings. Routing the transport path through waypoints disentangles the generative flow and mitigates path overlap. Consequently, WiT converges significantly faster than the baseline (e) while yielding highly realistic generated samples (f).

Method

A decoupled semantic-to-pixel generation pipeline

01

Construct semantic waypoints

Dense DINOv3 features are projected with PCA to a compact 64D semantic manifold. These waypoints preserve discriminative structure without inheriting the full regression burden of high-dimensional foundation-model features.
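The projection in step 01 can be sketched with plain PCA over per-patch features. This is an illustrative sketch under assumed shapes (196 patch tokens, 768-dim features), not the paper's implementation; `build_waypoints` is a hypothetical helper name:

```python
import numpy as np

def build_waypoints(features, d=64):
    """Project dense per-patch features (N, C) onto a compact d-dim
    semantic space via PCA. Returns the waypoints (N, d) plus the fitted
    mean and basis so the same projection can be reused at training time."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Principal directions from the SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:d].T                 # (C, d) top-d principal axes
    waypoints = centered @ basis     # (N, d) compact semantic waypoints
    return waypoints, mean, basis

# Toy usage: 196 patch tokens with 768-dim features (DINOv3-like shape).
rng = np.random.default_rng(0)
feats = rng.standard_normal((196, 768))
wp, mu, W = build_waypoints(feats, d=64)
```

Keeping only the top 64 principal axes is what gives the waypoints their "compact" character: the regression target is far lower-dimensional than the raw foundation-model features.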

02

Predict waypoints from noisy pixels

A 21M-parameter waypoint generator infers clean semantic anchors directly from the current pixel-space noisy state, making semantic navigation dynamic throughout ODE integration rather than a one-shot side condition.
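The dynamic re-prediction can be sketched with a simple Euler integrator: at every step the waypoint generator is re-run on the current noisy state, and the pixel model is conditioned on the fresh prediction. `waypoint_fn` and `velocity_fn` are hypothetical stand-ins for the Waypoints Generator and the Pixel Space Generator; this is a sketch of the control flow, not the paper's sampler:

```python
import numpy as np

def sample(z, velocity_fn, waypoint_fn, steps=50):
    """Euler ODE sampler with per-step semantic waypoints.

    z:           initial noise, shape (N, C).
    waypoint_fn: predicts clean semantic waypoints s_hat from (z, t).
    velocity_fn: predicts the flow velocity from (z, t, s_hat).
    """
    dt = 1.0 / steps
    for i in range(steps):
        t = i / steps
        s_hat = waypoint_fn(z, t)              # dynamic semantic anchor
        z = z + dt * velocity_fn(z, t, s_hat)  # Euler step along the ODE
    return z

# Toy check with a contraction field: v(z, t, s) = -z pulls z toward 0.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8))
out = sample(z0.copy(), lambda z, t, s: -z, lambda z, t: z.mean(axis=-1))
```

The point of the loop structure is that `s_hat` is recomputed inside the integrator, so semantic guidance tracks the evolving state instead of being fixed once before sampling.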

03

Inject semantics with Just-Pixel AdaLN

Instead of concatenating extra tokens or channels, WiT injects semantic guidance as spatially varying affine modulation inside the diffusion transformer, preserving the native pixel attention manifold while enforcing local structure.
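The modulation can be sketched as a layer norm followed by a token-wise affine map, assuming the waypoint network has already produced a per-token scale and shift. A minimal numpy sketch (illustrative shapes, not the paper's code):

```python
import numpy as np

def just_pixel_adaln(x, gamma, beta, eps=1e-6):
    """Spatially varying affine modulation over pixel tokens.

    x:     (N, C) pixel tokens inside a transformer block.
    gamma: (N, C) per-token scale derived from the semantic waypoints.
    beta:  (N, C) per-token shift derived from the semantic waypoints.
    LayerNorm without a learned affine, then (1 + gamma) * x + beta, so
    each spatial location gets its own modulation rather than a single
    global vector as in standard AdaLN."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return (1.0 + gamma) * x_norm + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((196, 256))          # 196 pixel tokens, width 256
gamma = 0.1 * rng.standard_normal((196, 256))
beta = 0.1 * rng.standard_normal((196, 256))
y = just_pixel_adaln(x, gamma, beta)
```

Because the output keeps the same token layout as the input, no extra tokens or channels are introduced, which is what leaves the native pixel attention manifold intact.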

WiT architecture overview
Overview of the WiT architecture. Left: A lightweight Waypoints Generator (21M params) predicts Semantic Waypoints from the noisy state z_t. Right: The Pixel Space Generator synthesizes the image, utilizing these predicted waypoints as spatial conditions via the Just-Pixel AdaLN mechanism.
Just-Pixel AdaLN figure
(a) Just-Pixel AdaLN: The predicted semantic waypoints provide spatially varying modulation. (b) Visualization of the predicted semantic waypoints and intermediate pixel states during inference. Left: The evolving noisy pixel states z_t at different integration timesteps. Right: The corresponding spatial semantic waypoints s_hat_0 dynamically inferred by our lightweight Waypoints Generator.
Algorithm 1: Training procedure of WiT
Algorithm 2: Inference procedure of WiT

Results

Faster convergence, stronger realism, more stable trajectories

Matched JiT-L/16 early

WiT-L/16 reaches FID 2.36 after 265 epochs, matching JiT-L/16 at 600 epochs.

Stronger final scale-up

WiT-XL/16 reaches FID 2.09 and IS 311.8 after 600 epochs on ImageNet 256x256.

Conflict actually drops

Peak pairwise trajectory conflict improves by 1.62x, supporting the claim that semantic waypoints reduce path overlap rather than only masking it.

Comprehensive quantitative comparison

| Method | Params | Epochs | IS | FID-50K |
|---|---|---|---|---|
| *Latent-space Diffusion Models* | | | | |
| DiT-XL/2 | 675M + 49M | - | 278.2 | 2.27 |
| SiT-XL/2 | 675M + 49M | - | 277.5 | 2.06 |
| REPA (SiT-XL/2) | 675M + 49M | - | 305.7 | 1.42 |
| LightningDiT-XL/2 | 675M + 49M | - | 295.3 | 1.35 |
| DDT-XL/2 | 675M + 49M | - | 310.6 | 1.26 |
| RAE (DiT^DH-XL/2) | 839M + 415M | - | 262.6 | 1.13 |
| *Pixel-space Models (Non-diffusion)* | | | | |
| JetFormer | 2.8B | - | - | 6.64 |
| FractalMAR-H | 848M | - | 348.9 | 6.15 |
| *Pixel-space Diffusion Models* | | | | |
| ADM-G | 554M | - | 186.7 | 4.59 |
| RIN | 410M | - | 182.0 | 3.42 |
| SiD (UViT/2) | 2B | - | 256.3 | 2.44 |
| PixelFlow (XL/4) | 677M | - | 282.1 | 1.98 |
| PixNerd (XL/16) | 700M | - | 297.0 | 2.15 |
| JiT-H/16 | 953M | - | 303.4 | 1.86 |
| JiT-G/16 | 2B | - | 292.6 | 1.82 |
| LF-DIT-L/16 | 465M | 200 | - | 2.48 |
| *Direct Baselines & Ours* | | | | |
| JiT-B/16 | 131M | 200 | - | 4.37 |
| WiT-B/16 (Ours) | 131M + 21M | 200 | 270.7 | 3.34 |
| JiT-B/16 | 131M | 600 | 275.1 | 3.66 |
| WiT-B/16 (Ours) | 131M + 21M | 600 | 280.2 | 3.03 |
| JiT-L/16 | 459M | 200 | - | 2.79 |
| WiT-L/16 (Ours) | 459M + 21M | 200 | 289.1 | 2.38 |
| JiT-L/16 | 459M | 600 | 298.5 | 2.36 |
| WiT-L/16 (Ours) | 459M + 21M | 265 | 293.7 | 2.36 |
| WiT-L/16 (Ours) | 459M + 21M | 600 | 303.3 | 2.22 |
| WiT-XL/16 (Ours) | 676M + 21M | 200 | 292.3 | 2.16 |
| WiT-XL/16 (Ours) | 676M + 21M | 600 | 311.8 | 2.09 |

Full main-table results from the paper, covering latent-space diffusion, pixel-space non-diffusion, pixel-space diffusion, and direct JiT/WiT comparisons on ImageNet 256x256.

Trajectory conflict metrics

  • 1.55x more stable midpoint pairwise conflict
  • 1.62x more stable maximum peak conflict
  • 1.13x more stable midpoint CFG relative L2 distance

Ablation summary

  • PCA d=64 is the best semantic bottleneck
  • Just-Pixel AdaLN outperforms channel concat and in-context concat
  • WiT-B/16 reaches IS 270.73 and FID 3.34 at 200 epochs

Repository Status

The code repository is live; training and inference code will be released here

This repository is the official landing point for WiT. The current public state is the project page plus a placeholder code entry. Training and inference code will be released in this same repository after the paper is formally published.

Acknowledgments

Support and thanks

We thank Qiming Hu for insightful discussions and feedback. This work was partially supported by computational resources from TPU Research Cloud (TRC).

Citation

BibTeX