Abstract
Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. Treating all spatial regions with uniform computation is therefore a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel training-free framework that addresses this challenge by accelerating sampling in the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the evolution of the full latent state based on computations from a dynamically selected, sparse subset of anchor tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7$\times$ speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.
How Does JiT Work?
Motivation: The generative process of diffusion models exhibits a well-established coarse-to-fine characteristic, synthesizing low-frequency global structures first before progressively refining high-frequency details. The JiT framework exploits this inherent spatial redundancy through a progressive token expansion strategy. Instead of processing all tokens uniformly from the start, JiT initially feeds only a sparse subset of anchor tokens into the model to establish global semantics. As the generation progresses, the active token set progressively expands. The model ultimately returns to full-token computation in the final stages to synthesize fine-grained details, thereby achieving training-free spatial acceleration.
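This coarse-to-fine behavior can be captured by a schedule that maps the sampling time to the fraction of active tokens. The sketch below is purely illustrative: the function name, the polynomial ramp, and all parameter values (`t_full`, `f_min`, `power`) are hypothetical stand-ins, not the schedule used by JiT.

```python
import numpy as np

def active_token_fraction(t, t_full=0.8, f_min=0.15, power=2.0):
    """Illustrative coarse-to-fine schedule (hypothetical, not JiT's
    exact formula): start from a sparse fraction f_min of anchor tokens
    and grow polynomially until all tokens are active at t >= t_full."""
    if t >= t_full:
        return 1.0
    ramp = (t / t_full) ** power  # rises from 0 to 1 as t approaches t_full
    return f_min + (1.0 - f_min) * ramp

# Early steps touch few tokens; the final steps use all of them.
fractions = [active_token_fraction(t) for t in np.linspace(0.0, 1.0, 6)]
```

Any monotone ramp with this shape yields the same qualitative effect: most denoising steps run on a small token subset, and only the detail-refining tail pays for full-token computation.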
Overview: As illustrated in the figure above, JiT is driven by the following components:
- (a) SAG-ODE: In standard generation, the model must process all tokens at each timestep. The SAG-ODE significantly reduces the computational load of the diffusion Transformer by calculating the velocity field on only a sparse subset of anchor tokens. Concurrently, it utilizes an augmented lifter operator $\mathbf{\Pi}_{k}$ to extrapolate this localized velocity field to the full spatial dimension. This ensures that the model receives cohesive, structurally-aware, full-space dynamic guidance derived solely from computations on the anchor set.
- (b) DMF & ITA: During generation, the set of active tokens must progressively expand to incorporate finer details. To determine which regions to activate, the ITA strategy computes the local variance of the velocity field and dynamically prioritizes regions with high information density. To resolve the distribution shifts and artifacts caused by the instantaneous injection of new tokens, the DMF executes a deterministic state transition that maps the newly activated tokens to a structurally and statistically correct target state.
- (c) & (d) The Dynamic Sampling Trajectory: The sampling trajectory from $t=0$ to $t=1$ visually demonstrates the dynamic resource allocation. The early generation stages (Stages 2 and 1) are driven by a sparse subset of active tokens (red flow) to establish global semantics. As the generation progresses, the active token set expands, reserving full computation only for the final detail-refining stage (Stage 0). We also provide a visualization of the sampling process below. You can drag the slider to observe the token expansion moments.
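The interplay of these components can be sketched as a toy sampling loop. Everything below is a simplified assumption for illustration: `velocity` is a stand-in for the diffusion Transformer, `lift` is a nearest-anchor toy version of the lifter $\mathbf{\Pi}_{k}$, the per-token channel variance is a crude proxy for ITA's local-variance score, and the DMF remapping of newly activated tokens is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 64, 8  # number of tokens, channels per token (toy sizes)

def velocity(x_active, t):
    # Stand-in for the diffusion Transformer's velocity field,
    # evaluated only on the active (anchor) tokens.
    return -x_active * (1.0 - t)

def lift(v_active, active_idx, n_tokens):
    # Toy lifter: extrapolate the sparse velocity field to the full
    # token dimension by copying from the nearest anchor token.
    nearest = np.abs(np.arange(n_tokens)[:, None] - active_idx[None, :]).argmin(axis=1)
    return v_active[nearest]

def expand_by_variance(v_full, active_idx, k):
    # ITA-style selection (simplified): activate the k inactive tokens
    # whose per-token channel variance of the velocity is largest.
    inactive = np.setdiff1d(np.arange(v_full.shape[0]), active_idx)
    scores = v_full[inactive].var(axis=1)
    new = inactive[np.argsort(scores)[-k:]]
    return np.sort(np.concatenate([active_idx, new]))

x = rng.standard_normal((N, D))  # latent state, pure noise at t = 0
active = np.arange(0, N, 4)      # sparse initial anchor set (1 in 4 tokens)
for t in np.linspace(0.0, 1.0, 10, endpoint=False):
    v_full = lift(velocity(x[active], t), active, N)  # sparse pass + lift
    x += 0.1 * v_full                                 # Euler step on the full state
    if len(active) < N:                               # progressive token expansion
        active = expand_by_variance(v_full, active, k=6)
```

The key property this sketch illustrates is that every forward pass runs on `x[active]` only, while the Euler update still advances the full latent `x`; the active set grows until the final steps run at full resolution.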
Predicted clean image
Noisy image
4$\times$ Image Generation Acceleration🚀
7$\times$ Image Generation Acceleration🚀🚀
Extend to Video Generation Acceleration
4$\times$ Acceleration🚀
7$\times$ Acceleration🚀🚀
BibTeX
@misc{sun2026justintimetrainingfreespatialacceleration,
      title={Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers},
      author={Wenhao Sun and Ji Li and Zhaoqiang Liu},
      year={2026},
      eprint={2603.10744},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.10744}
}