Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
Step-Video Team, StepFun
Abstract
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos of up to 102 frames from both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines on this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V on the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
1 Introduction
Text-driven image-to-video (TI2V) models are favored by video content creators because they offer greater control than text-to-video (T2V) models. Existing commercial video generation engines such as Gen-3 RunwayML [2024], Kling Kuaishou [2024], and Hailuo MiniMax [2024] offer this capability to users. Recently, open-source TI2V models such as HunyuanVideo-I2V Kong et al. [2025] and Wan2.1 Team [2025] have also been released.
In this report, we introduce Step-Video-TI2V, a new state-of-the-art TI2V model. Instead of training it from scratch, we continue the pre-training of the recently released Step-Video-T2V Ma et al. [2025], adapting it for the image-to-video generation task. To evaluate the performance of different TI2V models, we build Step-Video-TI2V-Eval as a new benchmark dataset for the TI2V task. We compare Step-Video-TI2V with open-source and commercial TI2V engines on the benchmark, demonstrating its state-of-the-art performance in image-to-video generation.
Our contributions are four-fold. First, Step-Video-TI2V is a powerful open-source TI2V model with the largest model size to date. Second, it enables explicit control over the motion dynamics of generated videos, providing users with greater flexibility. Third, it excels in the anime-style TI2V task due to the training data composition. Fourth, we introduce a new benchmark dataset specifically designed for the TI2V task, fostering future research and evaluation in this domain.
2 Model
2.1 TI2V Framework
Rather than pre-training from scratch, we train Step-Video-TI2V based on Step-Video-T2V Ma et al. [2025], a 30B-parameter open-source text-to-video model. To incorporate the image condition as the first frame of the generated video, we encode it into latent representations using Step-Video-T2V's Video-VAE and concatenate them along the channel dimension of the video latent (§2.2). Additionally, we introduce a motion score condition, enabling users to control the dynamic level of the video generated from the image condition (§2.3). Figure 1 shows an overview of our framework, highlighting these two modifications to the pre-trained T2V model. Image Conditioning and Motion Conditioning are detailed in §2.2 and §2.3, respectively.

Figure 1: Overview of Step-Video-TI2V. Based on the pre-trained T2V model, we introduce two key modifications: Image Conditioning and Motion Conditioning. These enhancements enable video generation from a given image while allowing users to adjust the dynamic level of the output video.
2.2 Image Conditioning
Given an image condition of shape $[1,3,h,w]$ (representing the number of frames, channels, height, and width), we first add random Gaussian noise to obtain a noise-augmented conditional image Zhang et al. [2023], Blattmann et al. [2023] and project it into latent space representation $\mathbf{Z}_{c}\in \mathbb{R}^{1\times c\times h'\times w'}$ using the Video-VAE in Step-Video-T2V.
We then concatenate $\mathbf{Z}_{c}$ along the channel dimension of the video latent $\mathbf{Z}_{v} \in \mathbb{R}^{f'\times c\times h'\times w'}$. To ensure compatibility with the video latent frames, we apply zero padding of shape $[f'-1, c, h', w']$ to $\mathbf{Z}_{c}$. The final latent input to the DiT is $\mathbf{Z}_{t}=[\mathbf{Z}_{v};\mathbf{Z}_{c}]\in \mathbb{R}^{f'\times 2c\times h'\times w'}$. Accordingly, the channel dimension of the corresponding patch embedding module in the DiT is extended to $2c$.
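The shape bookkeeping above can be sketched as follows. This is an illustrative NumPy mock of the latent concatenation, not the model's actual implementation; the function name and use of NumPy arrays in place of VAE latents are our own for illustration:

```python
import numpy as np

def build_dit_latent_input(z_v, z_c):
    """Combine the video latent z_v of shape (f', c, h', w') with the
    image-condition latent z_c of shape (1, c, h', w'): zero-pad z_c with
    [f'-1, c, h', w'] zeros along the frame axis, then concatenate along
    the channel axis, yielding a latent of shape (f', 2c, h', w')."""
    f, c, h, w = z_v.shape
    assert z_c.shape == (1, c, h, w), "condition latent must match one frame"
    pad = np.zeros((f - 1, c, h, w), dtype=z_c.dtype)   # zero padding for frames 2..f'
    z_c_padded = np.concatenate([z_c, pad], axis=0)     # (f', c, h', w')
    return np.concatenate([z_v, z_c_padded], axis=1)    # (f', 2c, h', w')
```

Because the channel count doubles from $c$ to $2c$, the DiT's patch embedding layer must likewise be widened to accept $2c$ input channels.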
2.3 Motion Conditioning
Existing open-source TI2V models generate videos directly from input text and images. However, most video generation models encounter stability issues—some generated videos exhibit high motion dynamics but contain artifacts, while others are more stable but have low motion dynamics. To balance motion dynamics and stability while providing users with control, we introduce motion embedding, enabling explicit control over the motion dynamics of the generated videos.
To incorporate motion scores into training, we use OpenCV OpenCV Developers [2021] to compute optical flow between video frames. Frames are sampled at 12-frame intervals, converted to grayscale, and their flow magnitudes are extracted. The highest magnitudes are selected, and their mean is used to generate the motion embedding. In the adaptive layer normalization (AdaLN-Single) process, an additional conditional MLP block embeds the motion score, which is then combined with the embedded timestep. During inference, the motion score is set by the user as a hyperparameter, allowing direct control over the level of motion dynamics.
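The score aggregation can be sketched as below. The per-pixel magnitudes would come from an optical-flow routine such as OpenCV's `cv2.calcOpticalFlowFarneback` on grayscale frame pairs; the `top_fraction` parameter is our assumption, since the report does not state how many of the highest values are kept:

```python
import numpy as np

def motion_score(flow_magnitudes, top_fraction=0.1):
    """Aggregate per-pixel optical-flow magnitudes into one motion score:
    sort magnitudes in descending order, keep the top fraction, and return
    their mean. top_fraction is a placeholder, not the value used in training."""
    flat = np.sort(np.asarray(flow_magnitudes, dtype=float).ravel())[::-1]
    k = max(1, int(flat.size * top_fraction))  # number of highest values to keep
    return float(flat[:k].mean())
```

At training time this scalar would be fed through the conditional MLP and added to the timestep embedding inside AdaLN-Single; at inference the user supplies the scalar directly.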
3 Benchmark
3.1 Dataset
Table 1: Category and Statistics of Step-Video-TI2V-Eval. The corresponding number of instances is indicated in parentheses for each category.
| Category | Fine-Grained Types and Statistics |
|---|---|
| Real-World | I2V Subject (62), I2V Background (14), Camera Motion (11); Dynamic and Artistic Elements: Calligraphy and Painting (7) |
| Anime-Style | I2V Subject (19), I2V Background (25), Action (13); Anime Style: 3D Anime (3), Flat Design (5), Illustrated Characters (3), Abstract Background (3), Children's Drawing (5), Storybook Illustration (14), Mecha (3), Graphic Design (3), Modern Aesthetics (6), Line Art (2), Pixel Art (2), Colored Sketch (1); Color Aesthetic: High Saturation (4), High Contrast (3), Neon Ink (4), Black Background with Colored Lines (2) |
We build Step-Video-TI2V-Eval, a new benchmark designed for the text-driven image-to-video generation task. The dataset comprises 178 real-world and 120 anime-style prompt-image pairs, ensuring broad coverage of diverse user scenarios. To achieve comprehensive representation, we developed a fine-grained schema for data collection in both categories.
Expanding on prior work Huang et al. [2024], which categorized I2V Subject, I2V Background, Action and Camera Motion, we further structured the dataset based on category-specific attributes. For real-world scenes, we incorporated Dynamic and Artistic Elements (e.g., surreal imagery and musical performances). For anime-style scenes, we categorized samples by Anime Style (e.g., 3D anime, children’s illustrations) and Color Aesthetic (e.g., ink-based designs, high saturation). A detailed dataset distribution is provided in Table 1.
To ensure high-quality prompts, human annotators crafted test descriptions for each image, specifying object motion and camera movements expected in the generated videos.
3.2 Evaluation Metric
Human annotators compare paired model outputs along three dimensions: instruction adherence, subject and background consistency, and physical plausibility, the last of which covers interactions, motion realism, lighting, and collisions. Penalties should be applied if elements such as human bodies, animals, objects, or backgrounds exhibit distortions or unnatural deformations. If neither model is clearly superior, a "Tie" should be assigned.
4 Experiments
4.1 Training Data
We constructed a TI2V dataset comprising 5M text-image-video triples and continued training Step-Video-T2V on it. This dataset was carefully filtered to ensure balanced motion dynamics, strong visual aesthetics, diverse concepts, accurate captions, and seamless scene continuity. Due to the initial design of the model, over 80% of the training data consists of anime-style videos, and only anime-style data were used in the first half of the training stage. These factors significantly enhance our model's performance in this category but simultaneously reduce its effectiveness on real-world-style videos (see the results listed in Table 2).
To enable motion control, we extracted motion scores for all videos in the training data and set thresholds to filter out videos with excessively high or low motion. We integrated the motion score into our model, as described in $\S2.3$ . Additionally, we fine-tuned our in-house video captioning model for the TI2V task, focusing primarily on describing the motion dynamics of objects and camera movements. For example, captions like "a flock of birds flying over a tree at sunset, camera pans left" provide detailed motion context to the TI2V task.
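The threshold-based filtering step can be sketched as follows; the `lo`/`hi` bounds here are illustrative placeholders, since the report does not state the actual cutoff values used in training:

```python
def filter_by_motion_score(videos, lo=2.0, hi=20.0):
    """Keep only (video_id, motion_score) pairs whose score falls inside the
    [lo, hi] band, discarding near-static clips and overly jittery ones.
    The lo/hi defaults are hypothetical, not the values used for Step-Video-TI2V."""
    return [(vid, score) for vid, score in videos if lo <= score <= hi]
```

The surviving clips then carry their motion score into training as the conditioning signal described in §2.3.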
4.2 Evaluation Result
4.2.1 Evaluation on Step-Video-TI2V-Eval
We compare Step-Video-TI2V with two recently released open-source TI2V models, OSTopA and OSTopB, as well as two closed-source commercial TI2V models, CSTopC and CSTopD. All four models originate from China.
Table 2: Comparison with baseline TI2V models using Step-Video-TI2V-Eval. 4 prompts were rejected by CSTopC. 15 prompts were rejected by CSTopD.
| Evaluation Dimension | Domain | vs. OSTopA | vs. OSTopB | vs. CSTopC | vs. CSTopD |
|---|---|---|---|---|---|
| Instruction Adherence | Real | 37-63-79 | 101-48-29 | 41-46-73 | 92-51-18 |
| | Anime | 40-35-44 | 94-16-10 | 52-35-47 | 87-18-17 |
| Subject & Background Consistency | Real | 46-92-39 | 43-71-64 | 45-65-50 | 36-77-47 |
| | Anime | 42-61-18 | 50-35-35 | 29-62-43 | 37-63-23 |
| Physical Plausibility | Real | 52-57-49 | 71-40-66 | 58-33-69 | 67-33-60 |
| | Anime | 75-17-28 | 67-30-24 | 78-17-39 | 68-41-14 |
| Total Score | | 292-325-277 | 426-240-228 | 303-258-321 | 387-283-179 |
Overall, Step-Video-TI2V achieves state-of-the-art performance on the total score across the three evaluation dimensions in Step-Video-TI2V-Eval, either outperforming or matching leading open-source and closed-source commercial TI2V models.
Specifically, Step-Video-TI2V performs worse on the Instruction Adherence dimension than OSTopA and CSTopC, primarily due to the limited amount of real-world-style data included during training. We will continue refining Step-Video-TI2V with more balanced training data and will release an improved checkpoint in the near future. We also observed that Step-Video-TI2V performs exceptionally well on test cases with camera-movement requirements, thanks to the strong performance of our video captioning model in accurately describing camera movements in videos.
4.2.2 Evaluation on VBench-I2V
VBench Huang et al. [2024] is a comprehensive benchmark suite that deconstructs "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. We use the VBench-I2V benchmark to assess the performance of Step-Video-TI2V alongside other TI2V models.
Evaluation results are listed in Table 3, showing that Step-Video-TI2V achieves state-of-the-art results on VBench-I2V compared to the two most recently released open-source TI2V models. Because the motion complexity of VBench-I2V is lower than that of Step-Video-TI2V-Eval, the instruction-adherence issue is less pronounced on this benchmark. We also present two results for Step-Video-TI2V, with the motion score set to 5 and 10, respectively. As expected, this mechanism effectively balances the motion dynamics and stability (or consistency) of the generated videos.
Table 3: Comparison with two open-source TI2V models using VBench-I2V.
| Metric | Step-Video-TI2V (motion=10) | Step-Video-TI2V (motion=5) | OSTopA | OSTopB |
|---|---|---|---|---|
| Total Score | 87.98 | 87.80 | 87.49 | 86.77 |
| I2V Score | 95.11 | 95.50 | 94.63 | 93.25 |
| Video-Text Camera Motion | 48.15 | 49.22 | 29.58 | 46.45 |
| Video-Image Subject Consistency | 97.44 | 97.85 | 97.73 | 95.88 |
| Video-Image Background Consistency | 98.45 | 98.63 | 98.83 | 96.47 |
| Quality Score | 80.86 | 80.11 | 80.36 | 80.28 |
| Subject Consistency | 95.62 | 96.02 | 94.52 | 96.28 |
| Background Consistency | 96.92 | 97.06 | 96.47 | 97.38 |
| Motion Smoothness | 99.08 | 99.24 | 98.09 | 99.10 |
| Dynamic Degree | 48.78 | 36.58 | 53.41 | 38.13 |
| Aesthetic Quality | 61.74 | 62.29 | 61.04 | 61.82 |
| Imaging Quality | 70.17 | 70.43 | 71.12 | 70.82 |
5 Conclusion
In this work, we introduce Step-Video-TI2V, an open-source model that achieves state-of-the-art performance on the TI2V task. Our model offers explicit control over motion dynamics, providing users with enhanced flexibility in video generation. It also excels in the anime-style TI2V task, owing to the specific composition of the training data. To advance the field further, we present a new benchmark dataset tailored for the TI2V task, laying the groundwork for future research and evaluation. We hope our contributions will drive innovation in video generation and empower the broader research and application communities.
References
RunwayML. Gen-3 alpha. https://runwayml.com/research/introducing-gen-3-alpha, 2024.
Kuaishou. Kling. https://klingai.kuaishou.com, 2024.
MiniMax. Hailuo. https://hailuoai.com/video, 2024.
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yang Li, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Dax Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, and Caesar Zhong. HunyuanVideo: A systematic framework for large video generative models, 2025. URL https://arxiv.org/abs/2412.03603.
Wan Team. Wan: Open and advanced large-scale video generative models. 2025.
Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang,
Contributors and Acknowledgments
We designate core contributors as those who were involved in the development of Step-Video-TI2V throughout its entire process, while contributors are those who worked on early versions or contributed part-time. Contributors are listed in alphabetical order by first name.
• Core Contributors:
