Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
Step-Video Team StepFun
Abstract
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos of up to 102 frames conditioned on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines on this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V on the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
1 Introduction
Text-driven image-to-video (TI2V) models are favored by video content creators because they offer greater control than text-to-video (T2V) models. Existing commercial video generation engines such as Gen-3 RunwayML [2024], Kling Kuaishou [2024], and Hailuo MiniMax [2024] offer this capability to users. Recently, open-source TI2V models such as HunyuanVideo-I2V Kong et al. [2025] and Wan2.1 Team [2025] have also been released.
In this report, we introduce Step-Video-TI2V, a new state-of-the-art TI2V model. Instead of training it from scratch, we continue the pre-training of the recently released Step-Video-T2V Ma et al. [2025], adapting it for the image-to-video generation task. To evaluate the performance of different TI2V models, we build Step-Video-TI2V-Eval as a new benchmark dataset for the TI2V task. We compare Step-Video-TI2V with open-source and commercial TI2V engines on the benchmark, demonstrating its state-of-the-art performance in image-to-video generation.
Our contributions are four-fold. First, Step-Video-TI2V is a powerful open-source TI2V model with the largest model size to date. Second, it enables explicit control over the motion dynamics of generated videos, providing users with greater flexibility. Third, it excels in the anime-style TI2V task due to the training data composition. Fourth, we introduce a new benchmark dataset specifically designed for the TI2V task, fostering future research and evaluation in this domain.
2 Model
2.1 TI2V Framework
Rather than pre-training from scratch, we train Step-Video-TI2V based on Step-Video-T2V Ma et al. [2025], a 30B-parameter open-source text-to-video model. To incorporate the image condition as the first frame of the generated video, we encode it into latent representations using Step-Video-T2V's Video-VAE and concatenate them along the channel dimension of the video latent (§2.2). Additionally, we introduce a motion score condition, enabling users to control the dynamic level of the video generated from the image condition (§2.3). Figure 1 shows an overview of our framework, highlighting these two modifications to the pre-trained T2V model. Image Conditioning and Motion Conditioning are elaborated in §2.2 and §2.3, respectively.
Figure 1: Overview of Step-Video-TI2V. Based on the pre-trained T2V model, we introduce two key modifications: Image Conditioning and Motion Conditioning. These enhancements enable video generation from a given image while allowing users to adjust the dynamic level of the output video.
2.2 Image Conditioning
Given an image condition of shape $[1, 3, h, w]$ (representing the number of frames, channels, height, and width), we first add random Gaussian noise to obtain a noise-augmented conditional image Zhang et al. [2023], Blattmann et al. [2023] and project it into a latent-space representation $Z_c \in \mathbb{R}^{1 \times c \times h' \times w'}$ using the Video-VAE in Step-Video-T2V.
We then concatenate $Z_c$ along the channel dimension of the video latent $Z_v \in \mathbb{R}^{f' \times c \times h' \times w'}$. To ensure compatibility with the video latent frames, we apply zero padding of shape $[f' - 1, c, h', w']$ to $Z_c$. The final latent input to the DiT is $Z_t = [Z_v; Z_c] \in \mathbb{R}^{f' \times 2c \times h' \times w'}$. Accordingly, the channel dimension of the corresponding patch-embedding module in the DiT is extended to $2c$.
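The steps above can be summarized in a short PyTorch-style sketch. The helper `vae_encode`, the noise scale `sigma`, and the concrete tensor shapes are illustrative assumptions, not the actual Step-Video-TI2V implementation.

```python
# Minimal sketch of the image-conditioning path, assuming a VAE encoder
# `vae_encode` that maps an image of shape [1, 3, h, w] to a latent of
# shape [1, c, h', w'] and a video latent of shape [f', c, h', w'].
import torch

def build_ti2v_latent(image, video_latent, vae_encode, sigma=0.05):
    """Concatenate the (noise-augmented) image latent with the video latent
    along the channel dimension, zero-padding the remaining frames."""
    # Noise-augmented conditional image (Zhang et al. 2023; Blattmann et al. 2023).
    noisy_image = image + sigma * torch.randn_like(image)

    # Project to latent space with the Video-VAE: Z_c of shape [1, c, h', w'].
    z_c = vae_encode(noisy_image)

    f, c, h, w = video_latent.shape                # Z_v: [f', c, h', w']
    pad = torch.zeros(f - 1, c, h, w,              # zero padding for frames 2..f'
                      dtype=z_c.dtype, device=z_c.device)
    z_c_full = torch.cat([z_c, pad], dim=0)        # [f', c, h', w']

    # Channel-wise concatenation: Z_t = [Z_v; Z_c] with 2c channels per frame.
    return torch.cat([video_latent, z_c_full], dim=1)   # [f', 2c, h', w']
```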
2.3 Motion Conditioning
Existing open-source TI2V models generate videos directly from input text and images. However, most video generation models encounter stability issues: some generated videos exhibit high motion dynamics but contain artifacts, while others are more stable but have low motion dynamics. To balance motion dynamics and stability while giving users control, we introduce a motion embedding that enables explicit control over the motion dynamics of the generated videos.
To incorporate motion scores into training, we use OpenCV OpenCV Developers [2021] to compute optical flow between video frames. Frames are sampled at 12-frame intervals and converted to grayscale, and their flow magnitudes are extracted. The highest magnitudes are selected, and their mean is used to generate motion embeddings. In the adaptive layer normalization (AdaLN-Single) process, an additional conditional MLP block embeds the motion score, which is then combined with the embedded timestep. During inference, the motion score is set by the user as a hyperparameter, allowing direct control over the level of motion dynamics.
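As a rough illustration of this motion-score procedure, the sketch below uses OpenCV's Farneback optical flow. The specific flow algorithm, its parameters, and the fraction of top magnitudes kept are assumptions, since the report does not specify them.

```python
# Sketch of a motion score: mean of the largest optical-flow magnitudes
# between frames sampled every `interval` frames (clip must span at least
# two sampled frames). The top-k fraction is a placeholder assumption.
import cv2
import numpy as np

def motion_score(frames, interval=12, top_frac=0.1):
    sampled = frames[::interval]
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in sampled]

    magnitudes = []
    for prev, curr in zip(grays[:-1], grays[1:]):
        # Dense Farneback optical flow between consecutive sampled frames.
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        magnitudes.append(mag.ravel())

    mags = np.concatenate(magnitudes)
    k = max(1, int(top_frac * mags.size))      # keep only the largest values
    top = np.partition(mags, -k)[-k:]
    return float(top.mean())                   # mean of the selected values
```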
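A minimal sketch of how such a motion score might be injected next to the timestep in an AdaLN-Single block is shown below. The layer sizes, the sinusoidal embedding, and combining the two embeddings by summation are assumptions for illustration, not the exact Step-Video-TI2V design.

```python
# Sketch of motion conditioning via an extra conditional MLP whose output
# is combined with the timestep embedding before the shared AdaLN-Single
# projection. hidden_dim and the 6x modulation split are placeholder choices.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_embedding(x, dim):
    """Standard sinusoidal embedding of a scalar condition (timestep or motion score)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class MotionAdaLNSingle(nn.Module):
    def __init__(self, hidden_dim=1152):
        super().__init__()
        # Conditional MLPs for the timestep and for the motion score.
        self.t_mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
                                   nn.Linear(hidden_dim, hidden_dim))
        self.m_mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
                                   nn.Linear(hidden_dim, hidden_dim))
        # Single projection producing the shared AdaLN modulation parameters.
        self.adaln = nn.Linear(hidden_dim, 6 * hidden_dim)

    def forward(self, timestep, motion_score):
        dim = self.adaln.in_features
        t_emb = self.t_mlp(sinusoidal_embedding(timestep, dim))
        m_emb = self.m_mlp(sinusoidal_embedding(motion_score, dim))
        cond = t_emb + m_emb                       # combine motion with timestep embedding
        return self.adaln(F.silu(cond))            # shift/scale/gate parameters
```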
3 Benchmark
3.1 Dataset
Table 1: Category and Statistics of Step-Video-TI2V-Eval. The corresponding number of instances is indicated in parentheses for each category.
| Category | Fine-grained types and statistics |
|---|---|
| Real-world | I2V Subject (62), I2V Background (14), Camera Motion (11); Dynamic and Artistic Elements: Calligraphy and Painting (7) |
| Anime-style | I2V Subject (19), I2V Background (25), Action (13); Anime Style: 3D Anime (3), Flat Design (5), Illustrated Characters (3), Abstract Background (3), Children's Drawing (5), Storybook Illustration (14), Mecha (3), Graphic Design (3), Modern Aesthetics (6), Line Art (2), Pixel Art (2), Colored Sketch (1); Color Aesthetic: High Saturation (4), High Contrast (3), Neon Ink (4), Black Background with Colored Lines (2) |
We build Step-Video-TI2V-Eval, a new benchmark designed for the text-driven image-to-video generation task. The dataset comprises 178 real-world and 120 anime-style prompt-image pairs, ensuring broad coverage of diverse user scenarios. To achieve comprehensive representation, we developed a fine-grained schema for data collection in both categories.
Expanding on prior work Huang et al. [2024], which categorized I2V Subject, I2V Background, Action and Camera Motion, we further structured the dataset based on category-specific attributes. For real-world scenes, we incorporated Dynamic and Artistic Elements (e.g., surreal imagery and musical performances). For anime-style scenes, we categorized samples by Anime Style (e.g., 3D anime, children’s illustrations) and Color Aesthetic (e.g., ink-based designs, high saturation). A detailed dataset distribution is provided in Table 1.
To ensure high-quality prompts, human annotators crafted test descriptions for each image, specifying object motion and camera movements expected in the generated videos.
3.2 Evaluation Metric
interactions, motion realism, lighting, and collisions. Penalties should be applied if elements such as human bodies, animals, objects, or backgrounds exhibit distortions or unnatural deformations. If neither model is clearly superior, a "Tie" should be assigned.
4 Experiments
4.1 Training Data
We constructed a TI2V dataset comprising 5M text-image-video triples and continued training Step-Video-T2V on it. This dataset was carefully filtered to ensure balanced motion dynamics, strong visual aesthetics, diverse concepts, accurate captions, and seamless scene continuity. Due to the initial design of the model, over 80% of the training data consists of anime-style videos, and only anime-style data were used in the first half of the training stage. These factors significantly enhance our model's performance in this category but simultaneously reduce its effectiveness on real-world-style videos (see the results in Table 2).
To enable motion control, we extracted motion scores for all videos in the training data and set thresholds to filter out videos with excessively high or low motion. We integrated the motion score into our model, as described in §2.3. Additionally, we fine-tuned our in-house video captioning model for the TI2V task, focusing primarily on describing the motion dynamics of objects and camera movements. For example, captions like "a flock of birds flying over a tree at sunset, camera pans left" provide detailed motion context for the TI2V task.
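As a toy illustration of this threshold-based filtering, the snippet below keeps only clips whose motion score falls inside a band; the threshold values are placeholders, not the ones used in practice.

```python
# Sketch of motion-based data curation: drop clips whose motion score is
# excessively low or high. The `low`/`high` bounds are illustrative only.
def filter_by_motion(clips, scores, low=2.0, high=25.0):
    """Keep only (clip, score) pairs whose score lies within [low, high]."""
    return [clip for clip, s in zip(clips, scores) if low <= s <= high]
```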