Abstract
Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. Treating all spatial regions with uniform computation is therefore a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel training-free framework that addresses this challenge by accelerating sampling in the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the evolution of the full latent state based on computations from a dynamically selected, sparse subset of anchor tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7$\times$ speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.
How Does JiT Work?
Motivation: The generative process of diffusion models exhibits a well-established coarse-to-fine characteristic, synthesizing low-frequency global structures first before progressively refining high-frequency details. The JiT framework exploits this inherent spatial redundancy through a progressive token expansion strategy. Instead of processing all tokens uniformly from the start, JiT initially feeds only a sparse subset of anchor tokens into the model to establish global semantics. As the generation progresses, the active token set progressively expands. The model ultimately returns to full-token computation in the final stages to synthesize fine-grained details, thereby achieving training-free spatial acceleration.
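This coarse-to-fine behavior can be captured by a schedule that maps the sampling time to the fraction of active tokens. The sketch below is purely illustrative: the function name, the polynomial ramp, and all parameter values (`t_full`, `f_min`, `power`) are hypothetical stand-ins, not the schedule used by JiT.

```python
import numpy as np

def active_token_fraction(t, t_full=0.8, f_min=0.15, power=2.0):
    """Illustrative coarse-to-fine schedule (hypothetical, not JiT's
    exact formula): start from a sparse fraction f_min of anchor tokens
    and grow polynomially until all tokens are active at t >= t_full."""
    if t >= t_full:
        return 1.0
    ramp = (t / t_full) ** power  # rises from 0 to 1 as t approaches t_full
    return f_min + (1.0 - f_min) * ramp

# Early steps touch few tokens; the final steps use all of them.
fractions = [active_token_fraction(t) for t in np.linspace(0.0, 1.0, 6)]
```

Any monotone ramp with this shape yields the same qualitative effect: most denoising steps run on a small token subset, and only the detail-refining tail pays for full-token computation.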
Overview: As illustrated in the figure above, JiT is driven by the following components:
- (a) SAG-ODE: In standard generation, the model must process all tokens at each timestep. The SAG-ODE significantly reduces the computational load of the diffusion Transformer by calculating the velocity field on only a sparse subset of anchor tokens. Concurrently, it utilizes an augmented lifter operator $\mathbf{\Pi}_{k}$ to extrapolate this localized velocity field to the full spatial dimension. This ensures that the model receives cohesive, structurally-aware, full-space dynamic guidance derived solely from computations on the anchor set.
- (b) DMF & ITA: During generation, the set of active tokens must progressively expand to incorporate finer details. To determine which regions to activate, the ITA strategy computes the local variance of the velocity field and dynamically prioritizes regions with high information density. To resolve the distribution shifts and artifacts caused by the instantaneous injection of new tokens, the DMF executes a deterministic state transition that maps the newly activated tokens to a structurally and statistically correct target state.
- (c) & (d) The Dynamic Sampling Trajectory: The sampling trajectory from $t=0$ to $t=1$ visually demonstrates the dynamic resource allocation. The early generation stages (Stages 2 and 1) are driven by a sparse subset of active tokens (red flow) to establish global semantics. As the generation progresses, the active token set expands, reserving full computation only for the final detail-refining stage (Stage 0). We also provide a visualization of the sampling process below. You can drag the slider to observe the token expansion moments.
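The interplay of these components can be sketched as a toy sampling loop. Everything below is a simplified assumption for illustration: `velocity` is a stand-in for the diffusion Transformer, `lift` is a nearest-anchor toy version of the lifter $\mathbf{\Pi}_{k}$, the per-token channel variance is a crude proxy for ITA's local-variance score, and the DMF remapping of newly activated tokens is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 64, 8  # number of tokens, channels per token (toy sizes)

def velocity(x_active, t):
    # Stand-in for the diffusion Transformer's velocity field,
    # evaluated only on the active (anchor) tokens.
    return -x_active * (1.0 - t)

def lift(v_active, active_idx, n_tokens):
    # Toy lifter: extrapolate the sparse velocity field to the full
    # token dimension by copying from the nearest anchor token.
    nearest = np.abs(np.arange(n_tokens)[:, None] - active_idx[None, :]).argmin(axis=1)
    return v_active[nearest]

def expand_by_variance(v_full, active_idx, k):
    # ITA-style selection (simplified): activate the k inactive tokens
    # whose per-token channel variance of the velocity is largest.
    inactive = np.setdiff1d(np.arange(v_full.shape[0]), active_idx)
    scores = v_full[inactive].var(axis=1)
    new = inactive[np.argsort(scores)[-k:]]
    return np.sort(np.concatenate([active_idx, new]))

x = rng.standard_normal((N, D))  # latent state, pure noise at t = 0
active = np.arange(0, N, 4)      # sparse initial anchor set (1 in 4 tokens)
for t in np.linspace(0.0, 1.0, 10, endpoint=False):
    v_full = lift(velocity(x[active], t), active, N)  # sparse pass + lift
    x += 0.1 * v_full                                 # Euler step on the full state
    if len(active) < N:                               # progressive token expansion
        active = expand_by_variance(v_full, active, k=6)
```

The key property this sketch illustrates is that every forward pass runs on `x[active]` only, while the Euler update still advances the full latent `x`; the active set grows until the final steps run at full resolution.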
Predicted clean image
Noisy image
4$\times$ Image Generation Acceleration🚀
7$\times$ Image Generation Acceleration🚀🚀
Extend to Video Generation Acceleration
4$\times$ Acceleration🚀
7$\times$ Acceleration🚀🚀
BibTeX
@misc{sun2026justintimetrainingfreespatialacceleration,
      title={Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers},
      author={Wenhao Sun and Ji Li and Zhaoqiang Liu},
      year={2026},
      eprint={2603.10744},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.10744}
}