StreamDiT enables real-time text-to-video generation at 16 FPS on a single GPU (H100)
(1-minute-long videos)
(5-minute-long video)
We also applied our method to a 30B model to test the scalability of our approach (Note: StreamDiT-30B is not real-time on a single H100)
Recently, great progress has been made in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, enabling the synthesis of high-quality videos. However, existing models typically produce only short clips offline, which restricts their use in interactive and real-time applications.
This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching with a moving buffer added. We design mixed training with different partitioning schemes of buffered frames to improve both content consistency and visual quality. StreamDiT modeling is based on an adaLN DiT with time-varying embeddings and window attention.
As an instantiation of the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored to StreamDiT. Sampling distillation is performed within each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in the buffer.
Finally, our distilled model reaches real-time performance at 16 FPS on one GPU and generates video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications such as streaming generation, interactive generation, and video-to-video translation.
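As a back-of-the-envelope illustration of the NFE claim above (the step counts below are hypothetical, not the paper's actual configuration):

```python
# Hypothetical step counts, chosen only to illustrate the NFE accounting.
chunks_in_buffer = 8   # the buffer is partitioned into 8 chunks
teacher_steps = 8      # multistep sampling steps per chunk before distillation

# Teacher: each chunk is re-denoised with multiple steps at every buffer
# position it passes through.
nfe_teacher = chunks_in_buffer * teacher_steps   # 64 evaluations per chunk

# Distilled: one model call per buffer position, so a chunk is fully
# denoised after traversing the buffer once.
nfe_distilled = chunks_in_buffer                 # 8 evaluations per chunk
```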
Moving Buffer with Flow Matching: Our partitioning scheme enables streaming generation by maintaining a moving buffer of frames. The flow matching process operates on partitioned segments, allowing for continuous video generation while preserving temporal consistency across frame transitions.
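Below is a minimal sketch of the idea, assuming a linear noise schedule over buffer positions and illustrative buffer/chunk sizes; the actual training and sampling details are in the paper.

```python
import torch

BUFFER = 8   # latent frames held in the buffer (hypothetical size)
CHUNK = 2    # frames emitted per denoising step (hypothetical size)

# Noise level is tied to buffer position: near-clean at the head (t ~ 0),
# near-pure noise at the tail (t ~ 1).
T = torch.linspace(0.0, 1.0, BUFFER)
DT = CHUNK / (BUFFER - 1)  # noise removed as a frame shifts by one chunk

def stream_chunks(model, text_emb, num_chunks):
    buf = torch.randn(BUFFER, 16, 64, 64)  # latent frames; shape illustrative
    for _ in range(num_chunks):
        v = model(buf, T, text_emb)          # predicted velocity for all frames
        buf = buf - DT * v                   # Euler step of the flow-matching ODE
        out, buf = buf[:CHUNK], buf[CHUNK:]  # pop the (near-)clean head chunk
        buf = torch.cat([buf, torch.randn(CHUNK, 16, 64, 64)])  # push fresh noise
        yield out                            # decode with the TAE and display
```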
Time-Varying DiT: Our model uses an adaptive layer normalization (adaLN) DiT with time-varying embeddings to handle the streaming nature of video generation, enabling efficient processing of temporal sequences.
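A sketch of what per-frame adaLN modulation could look like; the shapes and module names here are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

class TimeVaryingAdaLN(nn.Module):
    """Sketch: adaLN modulation with a distinct timestep embedding per frame.

    In a standard DiT, all tokens share one timestep embedding; with a moving
    buffer, each frame carries its own noise level, so scale/shift are
    computed per frame.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mod = nn.Linear(dim, 2 * dim)  # -> (scale, shift)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x:     (B, F, N, D)  tokens grouped by frame
        # t_emb: (B, F, D)     one embedding per frame, not one per batch
        scale, shift = self.mod(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)
```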
Window Attention: The window attention mechanism reduces computational complexity while maintaining quality, enabling real-time performance by focusing attention on relevant temporal windows.
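For intuition, here is a generic non-overlapping window attention in PyTorch. Whether StreamDiT windows over space, time, or both is not detailed on this page, so treat this as a sketch under stated assumptions.

```python
import torch
import torch.nn.functional as F

def window_attention(q, k, v, window: int):
    """Sketch of non-overlapping window attention along the token axis.

    q, k, v: (B, H, N, D) with N divisible by `window`. Tokens attend only
    within their window, so cost drops from O(N^2) to O(N * window).
    """
    B, H, N, D = q.shape
    split = lambda x: x.reshape(B, H, N // window, window, D)
    q, k, v = map(split, (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v)  # attention per window
    return out.reshape(B, H, N, D)
```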
Generation Variants: StreamDiT supports multiple generation variants for different streaming scenarios. The unified framework handles various partitioning schemes and chunk sizes, enabling flexible deployment for different real-time applications while maintaining consistent quality across all modes.
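One way to describe such a scheme as a configuration object; the field names below are illustrative assumptions, not the paper's notation.

```python
from dataclasses import dataclass

@dataclass
class Partitioning:
    # Field names are assumptions for this sketch.
    chunk_size: int   # frames per chunk
    num_chunks: int   # chunks held in the moving buffer
    micro_steps: int  # denoising steps a frame spends at each noise level

    @property
    def buffer_frames(self) -> int:
        return self.chunk_size * self.num_chunks

# Two example modes that a unified framework could cover:
fifo_like = Partitioning(chunk_size=1, num_chunks=16, micro_steps=1)
chunked   = Partitioning(chunk_size=4, num_chunks=4, micro_steps=2)
```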
Interactive inference pipeline of StreamDiT: StreamDiT is designed for real-time responsiveness and interactivity, and its inference pipeline is structured accordingly. To reduce latency, the DiT denoiser, TAE (VAE) decoder, and text encoder run in separate processes. A prompt callback function runs continuously, listening for new user prompts in real time. When a user provides a new prompt, the text encoder converts it into a text embedding, which is sent to the DiT process to replace the existing embedding. Subsequent denoising steps then use the updated embedding through cross-attention, dynamically adjusting the direction of text guidance. This design lets users interactively influence and modify video content in real time through prompt inputs.
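A toy version of this process layout, with stubs standing in for the text encoder, DiT denoiser, and TAE decoder; all function names here are hypothetical, not the actual implementation.

```python
import multiprocessing as mp
import time

# Stubs standing in for the real components (hypothetical names).
def encode_text(prompt):
    return f"emb({prompt})"            # would return a text embedding

def denoise_next_chunk(emb):
    time.sleep(0.5)                    # stand-in for one DiT buffer step
    return f"chunk conditioned on {emb}"

def decode_and_display(latent):
    print(latent)                      # would be the TAE decode + display

def text_encoder_proc(prompt_q, emb_q):
    while True:
        emb_q.put(encode_text(prompt_q.get()))   # wake on each new prompt

def denoiser_proc(emb_q, latent_q):
    emb = emb_q.get()                  # block until the first prompt arrives
    while True:
        while not emb_q.empty():       # hot-swap the embedding; cross-attention
            emb = emb_q.get()          # uses it from the next denoising step on
        latent_q.put(denoise_next_chunk(emb))

def decoder_proc(latent_q):
    while True:
        decode_and_display(latent_q.get())

if __name__ == "__main__":
    prompt_q, emb_q, latent_q = mp.Queue(), mp.Queue(), mp.Queue()
    mp.Process(target=text_encoder_proc, args=(prompt_q, emb_q), daemon=True).start()
    mp.Process(target=denoiser_proc, args=(emb_q, latent_q), daemon=True).start()
    mp.Process(target=decoder_proc, args=(latent_q,), daemon=True).start()
    while True:                        # the prompt callback loop
        prompt_q.put(input("prompt> "))
```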
A little boy riding his bike in a garden in spring. -> A little boy riding his bike in a garden in summer. -> A little boy riding his bike in a garden in fall. -> A little boy riding his bike in a garden in winter.
A cat is walking in a garden. -> A tiger is walking in a garden.
Serene nature with a calm lake and cloudy sky in daylight. -> Quiet lake at night under a glowing moon and fading twilight. -> Fireworks exploding over a lake.
A man is walking in a desert. -> A man is walking in a cyberpunk city.
A horse is running on a grassland. -> A cheetah is running on a grassland. -> A horse is running on a grassland.
A man is walking in a desert. -> A man is walking towards a beach. -> A man is walking on a beach.
StreamDiT-4B achieves real-time performance at 16 FPS on a single GPU while maintaining competitive quality with existing methods. Our model generates 512p video streams with temporal consistency and high visual fidelity.
We implemented the existing methods on top of our base 4B T2V model to perform apples-to-apples comparisons with StreamDiT.
Reuse and Diffuse
FIFO-Diffusion
Ours (Teacher)
Ours (Distilled)
Prompt: An old man takes a pleasant stroll in Antarctica during a beautiful sunset. The old man wears a bright green dress that reaches down to his ankles, and a wide-brimmed sun hat that shields his face from the sun. The man's skin is weathered and wrinkled, with a kind face and a gentle smile. He walks slowly and deliberately, taking in the breathtaking scenery around him. The Antarctic landscape stretches out behind him, with snow-covered peaks and ice shelves glistening in the fading light. The sky above is a kaleidoscope of colors, with hues of pink, orange, and purple blending together in a beautiful sunset. The man's shadow stretches out across the snow as he walks, with the sun casting a warm glow over the entire scene. The lighting is soft and golden, with the sunset casting long shadows across the icy landscape. The video is shot in a cinematic style.
Reuse and Diffuse
FIFO-Diffusion
Ours (Teacher)
Ours (Distilled)
Prompt: Camera tracking shot. New York City is submerged underwater like Atlantis. The city's skyscrapers and buildings are covered in coral and seaweed, with schools of fish darting in and out of the windows. A large whale swims down the middle of the street, its massive body gliding effortlessly through the water. Sea turtles and sharks of various species swim through the streets, some swimming alongside the whale. The Empire State Building and the Statue of Liberty are visible in the distance, covered in coral and anemones. The streetlights are still on, casting a warm glow over the scene. The water is a deep blue, with a few rays of sunlight filtering down from above. The fish and other sea creatures are swimming and playing in the streets, as if they have always lived there. The video is shot in a cinematic style.
StreamDiT achieves real-time performance at 16 FPS on a single GPU, opening up interactive applications that were previously impractical with existing text-to-video generation models.
Generate continuous video streams in real time without length limitations, ideal for live content creation, streaming platforms, and interactive media applications.
Real-time video-to-video translation and style transfer. Users can see immediate results as they modify prompts or parameters.
Generate dynamic video content in real time for gaming environments and VR experiences, maintaining a smooth frame rate for immersive interactions. Potential use cases also include real-time avatar control for a personalized and responsive virtual presence.
StreamDiT is well suited for real-time world simulation in robotics. Furthermore, the streaming diffusion framework can be applied to diffusion policies, enabling temporally fine-grained, seamless, and continuous action generation.
@misc{kodaira2025streamditrealtimestreamingtexttovideo,
  title={StreamDiT: Real-Time Streaming Text-to-Video Generation},
  author={Akio Kodaira and Tingbo Hou and Ji Hou and Masayoshi Tomizuka and Yue Zhao},
  year={2025},
  eprint={2507.03745},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.03745},
}