[Paper Translation] HybridFlow: A Flexible and Efficient RLHF Framework


Original paper: https://arxiv.org/pdf/2409.19256v2


HybridFlow: A Flexible and Efficient RLHF Framework

Abstract

Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents computation of a neural network (NN) and each edge denotes data dependencies between the NNs. RLHF complicates the dataflow by expanding each node into a distributed LLM training or generation program, and each edge into a many-to-many multicast. Traditional RL frameworks execute the dataflow using a single controller to instruct both intra-node computation and inter-node communication, which can be inefficient in RLHF due to large control dispatch overhead for distributed intra-node computation. Existing RLHF systems adopt a multi-controller paradigm, which can be inflexible due to nesting distributed computation and data communication. We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. We carefully design a set of hierarchical APIs that decouple and encapsulate computation and data dependencies in the complex RLHF dataflow, allowing efficient operation orchestration to implement RLHF algorithms and flexible mapping of the computation onto various devices. We further design a 3D-Hybrid Engine for efficient actor model resharding between training and generation phases, with zero memory redundancy and significantly reduced communication overhead. Our experimental results demonstrate $1.53\times{\sim}20.57\times$ throughput improvement when running various RLHF algorithms using HybridFlow, as compared with state-of-the-art baselines. HybridFlow source code will be available at https://github.com/volcengine/verl.

CCS Concepts: • Computing methodologies $\rightarrow$ Distributed computing methodologies; Machine learning.

Keywords: Distributed systems, Reinforcement Learning from Human Feedback

ACM Reference Format:

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. HybridFlow: A Flexible and Efficient RLHF Framework. In Twentieth European Conference on Computer Systems (EuroSys ’25), March 30- April 3, 2025, Rotterdam, Netherlands. ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3689031.3696075

1 Introduction

Large language models (LLMs) such as GPT [11], Llama [73] and Claude [7] have revolutionized various artificial intelligence (AI) applications, ranging from writing [2] and searching [52] to coding [63]. LLMs are first pre-trained on trillions of tokens from books, websites, etc., via next-word prediction to accumulate broad knowledge [11]. Next, LLMs are trained on domain-specific datasets via supervised fine-tuning (SFT), to be able to follow human instructions [11]. Despite the outstanding capabilities of LLMs on natural language tasks after pre-training and SFT, detrimental and biased content in the training datasets may still mislead an LLM into generating toxic and undesirable content. Reinforcement Learning from Human Feedback (RLHF) is introduced to further align an LLM with human values, for building helpful and harmless AI applications [7, 55].

RLHF is built upon traditional RL algorithms [4, 68, 78], e.g., Proximal Policy Optimization (PPO) [68] and REINFORCE [78]. The widely adopted PPO-based RLHF system typically consists of four LLMs [7, 55]: an actor, a critic, a reference policy network and a reward model. PPO-based RLHF proceeds in iterations, each with three stages: (1) response generation using the actor model with a batch of prompts;

(2) preparation of training data by scoring the generated responses through a single forward pass of the critic, reference policy, and reward models; (3) learning from human preference by updating the actor and critic through forward and backward computation. Other RLHF variants [19, 43] follow similar stages but involve different numbers of models and different data dependencies among the models.

Traditional RL can be modeled as a dataflow [46], which is a directed acyclic graph (DAG): each node in the RL dataflow represents computation of a neural network (e.g., an actor or critic network, which can be a CNN or MLP); each edge denotes a data dependency between NN computations (e.g., the output of the critic is used as input to actor training [68]). The RLHF dataflow is more complex, with more complicated models involved (e.g., LLMs for the actor/critic/reference/reward models), each running distinct computation, and more diverse data dependencies among them (i.e., multicast between distributed model partitions). Training and generation of an LLM in the RLHF dataflow require distributed computation (e.g., using tensor/pipeline/data parallelism) [40, 71]. Therefore, each node in the RLHF dataflow is a complex distributed program, corresponding to the distributed computation of the respective LLM. Models in different nodes typically use different parallelism strategies as their workloads vary. An edge represents data resharding, which is often a many-to-many multicast. Consequently, flexible representation and efficient execution of the complex and resource-intensive RLHF dataflow are imperative.
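To make the DAG view concrete, here is a toy sketch of the PPO-style RLHF dataflow as an adjacency map (node names are illustrative, not HybridFlow's API), topologically sorted to recover a valid execution order:

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Toy PPO-style RLHF dataflow: each key maps to the set of nodes it depends on.
# Nodes are (distributed) model computations; edges are data dependencies.
rlhf_dag = {
    "actor_generate": set(),                                          # stage 1
    "critic_infer":   {"actor_generate"},                             # stage 2
    "ref_infer":      {"actor_generate"},
    "reward_infer":   {"actor_generate"},
    "actor_train":    {"critic_infer", "ref_infer", "reward_infer"},  # stage 3
    "critic_train":   {"critic_infer", "ref_infer", "reward_infer"},
}

order = list(TopologicalSorter(rlhf_dag).static_order())
print(order[0])  # actor_generate: generation must precede all other nodes
```

In a real RLHF system each node would be a distributed LLM program rather than a label, but the dependency structure the controller must respect is exactly this graph.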

Traditional RL frameworks such as RLLib [45] and RLLib Flow [46] utilize a hierarchical single-controller paradigm to run RL dataflows. A centralized controller assigns nodes in the dataflow to different processes and coordinates their execution order. Each node process can further spawn more workers to perform computation, again following the single-controller paradigm. However, they only provide primitives for data-parallel training and are constrained to neural networks that are at most hundreds of MB in size [45, 46]. In the RLHF dataflow, each node corresponds to an LLM with up to billions of operators, computed using some complex parallelism. A single-controller paradigm is inefficient due to the substantial overhead of dispatching operators to distributed accelerators [1, 9].

Existing RLHF systems adopt a multi-controller paradigm to manage intra-node computation and inter-node data resharding [17, 30, 80]. Each controller independently manages the computation of one device and uses multiple point-to-point operations to coordinate data dependencies between different nodes. This multi-controller paradigm introduces negligible dispatch overhead when performing LLM computation (detailed in §2.2).

However, without central control, it is inflexible to implement various RLHF dataflows, as modifying a single node to adapt to different data dependencies requires changing all dependent nodes' implementations, hindering code reuse.

To address these limitations, we propose HybridFlow, a flexible and efficient RLHF framework to easily represent and execute diverse RLHF dataflows, attaining high throughput. Our key observation is that utilizing the single-controller paradigm on the inter-node level enables flexible expression of various data dependencies and easy coordination of inter-node data resharding with minimal overhead, while integrating the multi-controller paradigm within intra-node computation enhances computation efficiency substantially. We advocate a hierarchical hybrid programming model to generate RLHF dataflows. At the node level, multiple model classes are provided that encapsulate distributed computation (training, inference and generation) of different LLMs in the dataflow into primitive APIs. These APIs can seamlessly support various parallelism strategies from the existing LLM frameworks, including 3D parallelism [71], ZeRO [59], and PyTorch FSDP [57], and perform distributed computation under the multi-controller paradigm. Among the nodes, a set of transfer protocols are designed to hide the complexity of data resharding from users, as coordinated by a single controller. This programming model abstracts away the complexity of distributed computing, allowing users to implement an RLHF dataflow in a few lines of code and run RLHF through a single process of the single controller. It also effectively decouples intra-node computation and inter-node data transfer, allowing independent optimization of each model without changing the code of other models in the dataflow.
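The resulting controller script can be sketched as follows. This is a minimal single-process illustration of the idea: stub classes stand in for the model classes, so the method names and stub bodies here are hypothetical rather than HybridFlow's actual implementation, but they show how the driver expresses only data dependencies while distributed details stay hidden inside each class.

```python
# Stub classes standing in for HybridFlow-style model classes. In the real
# system each method call would fan out to distributed workers under the
# multi-controller paradigm; here everything runs locally for illustration.
class Worker:
    def __init__(self):
        self.updates = 0

class Actor(Worker):
    def generate_sequences(self, prompts):           # stage 1: generation
        return {"prompts": prompts,
                "responses": [p + " <resp>" for p in prompts]}
    def update_actor(self, batch):                   # stage 3: training
        self.updates += 1

class Critic(Worker):
    def compute_values(self, batch):                 # stage 2: preparation
        batch["values"] = [0.0] * len(batch["responses"])
        return batch
    def update_critic(self, batch):                  # stage 3: training
        self.updates += 1

class Reference(Worker):
    def compute_log_prob(self, batch):
        batch["ref_log_probs"] = [0.0] * len(batch["responses"])
        return batch

class RewardModel(Worker):
    def compute_reward(self, batch):
        batch["rewards"] = [1.0] * len(batch["responses"])
        return batch

# Single-controller driver: the whole PPO-style dataflow in a few lines.
actor, critic, ref, rm = Actor(), Critic(), Reference(), RewardModel()
for prompts in [["q1", "q2"], ["q3"]]:
    batch = actor.generate_sequences(prompts)
    batch = critic.compute_values(batch)
    batch = ref.compute_log_prob(batch)
    batch = rm.compute_reward(batch)
    critic.update_critic(batch)
    actor.update_actor(batch)
print(actor.updates)  # 2
```

Note how swapping in a different RLHF variant (e.g., dropping the critic) would only change this driver loop, not the per-model code.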

Training and generation of the actor model represent major computation in the RLHF dataflow. We further design a 3D-Hybrid Engine to enable efficient execution of training and generation of the actor model, introducing zero memory redundancy and significantly reduced communication overhead during model parameter resharding between the training and generation stages. Our hybrid programming model also facilitates flexible placement of models onto the same or different sets of GPU devices. This allows us to provide an effective algorithm to optimize GPU allocation and placement of the models, with various model sizes and distinct workloads, for any RLHF dataflow. Our contributions in designing HybridFlow are summarized as follows:

• We propose a hierarchical hybrid programming model for conveniently building the RLHF dataflow. This programming model enables efficient distributed execution of intra-node computation and flexible inter-node data resharding and transfer, for various RLHF algorithms (§4).
• We design a 3D-Hybrid Engine that executes training and generation of the actor model with high computation efficiency and zero-redundancy transition between the training stage and the generation stage (§5).
• We devise an effective mapping algorithm to automatically identify optimized GPU allocation and placement of each node (model) in the RLHF dataflow (§6).
• We conduct extensive experiments comparing HybridFlow with state-of-the-art RLHF systems [17, 30, 82] under various RLHF algorithms, model sizes and cluster scales. Our evaluation demonstrates $1.53\times{\sim}20.57\times$ throughput improvements.

Figure 1. Dataflow graph of 3 RLHF algorithms [19, 43, 55]. Stages ①, ②, ③ represent Generation, Preparation, and Training, respectively.

We have open-sourced HybridFlow and believe that HybridFlow can boost future RLHF research and development.

2 Background and Motivation

2.1 Reinforcement Learning from Human Feedback

RLHF Workflow. RLHF aligns the linguistic space of LLMs with human values, using a set of human-ranked candidates of given prompts [7, 19, 41, 43, 55, 70, 91]. An RLHF system typically consists of multiple models, e.g., an actor, a critic, a reference policy, and one or multiple reward models. The actor and the reference are each a pre-trained/fine-tuned LLM (i.e., the LLM that is undergoing RLHF). The critic and reward models can be different LLMs fine-tuned on the human preference dataset, with the language modeling head replaced by a scalar output head [7, 55]. The RLHF workflow can be decomposed into 3 stages (Figure 1), and we take PPO as an example:

•Stage 1 (Generation): The actor produces responses from a batch of prompts using auto-regressive generation.

•Stage 2 (Preparation): Using prompts and generated responses, the critic computes their values [66, 68], the reference policy computes their reference log probabilities, and the reward model computes their rewards [7, 55], all via a single pass of forward computation of the respective model.

•Stage 3 (Learning/Training): The actor and the critic are updated via Adam [38], using the batch of data produced by previous stages and the loss function [55].

Other RLHF algorithms largely follow the 3-stage workflow as well (Figure 1(b)(c)). Safe-RLHF [19] introduces an auxiliary pretrain loss following PPO-ptx [55] and includes an additional cost model to fit human preferences and safety labels simultaneously. ReMax [43] requires an additional generation pass for variance reduction and eliminates the critic model in the dataflow. Researchers are actively exploring novel RLHF algorithms [41, 70, 91] and integrating traditional RL methods into RLHF domains [37]. These variances necessitate a flexible representation of the RLHF dataflow graph to accommodate diverse algorithmic requirements.


Figure 2. Programming model used in RLHF systems. (a) Existing RLHF systems adopt the multi-controller paradigm. (b) HybridFlow utilizes a hybrid programming model: the single controller coordinates models; each model uses the multi-controller paradigm for distributed computation. Inactive nodes in grey represent operations not executed at this time.

Parallelism Strategies. LLMs are trained and served with data, pipeline, and tensor parallelism [36, 40, 54]. With data parallelism (DP), the input data is split into multiple subsets; each subset is processed by a separate device (e.g., a GPU) [69]. ZeRO [59] is a memory-optimized solution for DP training, progressively sharding optimizer states, gradients, and model parameters across GPUs. Pipeline parallelism (PP) [32, 53] and tensor parallelism (TP) [71] distribute model parameters, gradients and optimizer states across multiple GPUs. Modern distributed training frameworks like Megatron-LM [71] and MegaScale [36] utilize 3D parallelism or PTD parallelism [54], where P, T, D stand for PP, TP, DP, respectively. In 3D parallelism, PP size represents the number of pipeline stages in model training, TP size refers to the number of shards that a tensor is partitioned into, and DP size is the number of model replicas. LLM serving systems employ 3D parallelism similar to training while only model parameters and KVCache are sharded [16, 29, 40].
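As a sanity check on the (p, t, d) notation, a flat GPU rank can be mapped to its (pipeline stage, data replica, tensor shard) coordinates. The sketch below assumes TP varies fastest, then DP, then PP; real frameworks such as Megatron-LM make the axis ordering configurable, so treat this as illustrative only.

```python
def rank_to_ptd(rank, p, t, d):
    """Map a flat GPU rank to (pp_stage, dp_replica, tp_shard) coordinates.

    Assumes world_size = p * t * d with TP varying fastest, then DP, then PP.
    The actual rank-to-group mapping in 3D-parallel frameworks may differ.
    """
    assert 0 <= rank < p * t * d
    tp = rank % t              # tensor shard index
    dp = (rank // t) % d       # data-parallel replica index
    pp = rank // (t * d)       # pipeline stage index
    return pp, dp, tp

# 8 GPUs: p=2 pipeline stages, t=2 tensor shards, d=2 replicas
print(rank_to_ptd(5, p=2, t=2, d=2))  # (1, 0, 1)
```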

LLM models in the RLHF dataflow may perform distinct computations, including training (one forward pass, one backward pass and model update), inference (one forward pass) and generation (auto-regressive generation with multiple forward passes). In particular, training and generation are performed on the actor model, training and inference on the critic, and inference on reference policy and reward models. Distinct parallel strategies can be applied to different models for varied computations to achieve optimal throughput.

2.2 Programming Model for Distributed ML

Single-Controller. It employs a centralized controller to manage the overall execution flow of the distributed program. With centralized control logic, users can build core functionalities of the dataflow as a single process (Figure 2(b)), while the controller automatically generates distributed workers to carry out the computation. With a global view of the hardware and dataflow graph, the single-controller paradigm allows flexible and optimized resource mapping and execution order coordination among dataflow tasks. However, coordination messages are passed from the controller to all workers, incurring significant dispatch overhead when executing expansive dataflow graphs on large clusters [1, 9].

Multi-Controller. Each device (aka worker) has its own controller. State-of-the-art distributed LLM training and serving systems adopt the multi-controller paradigm, due to its scalability and low dispatch overhead (control messaging is largely passed from CPU to GPU over fast PCIe links) [36, 40, 60, 71]. As shown in the example multi-controller RLHF implementation in Figure 2(a), a separate program is run for each model, and all workers of one model execute the same program. Each worker only possesses a local view of the system state and requires point-to-point communication between two models (blue code and arrows) to coordinate model execution order. To implement an RLHF workflow in the multi-controller architecture, a user must intricately integrate the code for collective communication, computation, and point-to-point data transfer in the program run at each device. This leads to deeply nested code of computation and data transfer that is challenging to develop, maintain, and optimize. In Figure 2(a), each model performs local computation and all_gather operations (black code), while the actor model must explicitly manage send operations to the critic and reward models, and the latter must correspondingly implement receive operations at precise points in their programs.
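This coupling can be illustrated with a toy two-worker program, using an in-process queue to stand in for a point-to-point link (NCCL or Gloo in a real system): both roles run the same function, and the actor's send must line up exactly with the critic's receive, which is what makes changing the dataflow invasive.

```python
import queue
import threading

# Stands in for a point-to-point channel between the actor's and critic's GPUs.
chan = queue.Queue()

def worker(role, out):
    """Every worker runs the SAME program; behavior branches on its role.
    Data dependencies appear as explicit, position-sensitive send/recv calls."""
    if role == "actor":
        responses = ["resp0", "resp1"]   # local computation (generation)
        chan.put(responses)              # explicit send to the critic
    elif role == "critic":
        responses = chan.get()           # must match the actor's send exactly
        out["values"] = [len(r) for r in responses]  # local computation

out = {}
threads = [threading.Thread(target=worker, args=(r, out))
           for r in ("actor", "critic")]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(out["values"])  # [5, 5]
```

Adding a new consumer of the responses (say, a cost model) would require editing the actor's branch as well, mirroring the code-reuse problem described above.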

2.3 RLHF Characteristics

Heterogeneous model workloads. The actor, critic, reference and reward models in RLHF may execute training, inference or generation at different stages, with different memory footprints and computation demands. For the reference policy and reward models, only their model parameters need to be stored in GPU memory, as they perform only forward pass computation. For the actor and the critic, their model parameters, gradients, and optimizer states must be stored as they undergo model training. Moreover, a small actor model (e.g., a 7B pre-trained/fine-tuned LLM) can be paired with larger critic and reward models (e.g., 70B LLMs) in RLHF for better alignment [7]. Given such heterogeneity, different parallelism strategies and tailored optimizations are needed for running each model during RLHF.

Unbalanced computation between actor training and generation. In the RLHF dataflow, training and generation of the actor model are represented by two nodes (Figure 1), which often render the majority of the workload in each RLHF iteration (e.g., 58.9% of total RLHF time with HybridFlow). Actor training is computation-bound [24], often requiring a larger model-parallel (MP) size (i.e., the number of partitions the model is partitioned into) and distributing the workload to more GPUs, e.g., 8 partitions of a 7B model on 8 GPUs. Using the same parallelism strategy (e.g., the same MP size) for generation can lead to underutilization of GPU computation resources due to its memory-bound nature [40]. Previous studies show that combining a larger DP size with a smaller MP size (hybrid data and model parallelism), e.g., partitioning a 7B model into two and replicating it four times on 8 GPUs, can improve the generation throughput [44, 92]. Although using different parallelism strategies for actor training and generation may optimize throughput in both stages, resharding the actor model weights at runtime between the two stages can incur significant communication and memory overhead. For example, aligning a 70B actor model requires transferring 140GB of model weights from training to generation per RLHF iteration, taking up to 36.4% of an iteration time when the two stages are on different devices [30].
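The 140GB figure follows directly from storing each of the 70B parameters in 16-bit precision (2 bytes per parameter, assuming bf16/fp16 weights):

```python
def weights_gb(n_params_billion, bytes_per_param=2):
    """Size of model weights in GB, assuming 16-bit (bf16/fp16) parameters."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# A 70B actor transfers its full weight set per training->generation reshard.
print(weights_gb(70))  # 140.0 (GB per RLHF iteration)
```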


Figure 3. Dataflow execution given a model placement plan. Blocks with numbers represent GPUs. In dashed boxes, the models are placed on different sets of devices and can be concurrently computed. Reference model (blue) and reward model (green) are colocated on the same set of GPUs and executed sequentially.

Diverse model placement requirements. Strategic device placement of models in the RLHF dataflow is necessary, according to computation workloads and data dependencies of the models. Figure 3 gives an example model placement plan and the corresponding RLHF execution flow. Models placed on different sets of devices can be executed in parallel if no data dependencies exist. Models placed on the same set of GPUs, referred to as colocated models, share the GPU memory and are executed sequentially in a time-sharing manner, as out-of-memory (OOM) error may easily happen if colocated LLMs execute concurrently.
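A placement plan like the one in Figure 3 can be captured as a mapping from models to GPU sets; two models are colocated exactly when their GPU sets intersect (the names and device ids below are illustrative):

```python
# Toy placement plan mirroring Figure 3: actor and critic on disjoint GPUs,
# reference and reward models colocated on the same GPUs.
placement = {
    "actor":  {0, 1, 2, 3},
    "critic": {4, 5},
    "ref":    {6, 7},
    "reward": {6, 7},
}

def colocated(m1, m2):
    """Colocated models share GPU memory and must run sequentially
    (time-sharing); models on disjoint GPU sets can run in parallel."""
    return bool(placement[m1] & placement[m2])

print(colocated("ref", "reward"))    # True  -> executed sequentially
print(colocated("actor", "critic"))  # False -> can be computed concurrently
```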

We observe a compromise: placing models on different devices permits parallel processing but may inevitably lead to some GPU idle time, given staged model execution in RLHF. In Figure 3, actor and critic are placed separately, performing training in parallel, but incurring 1/3 of their GPU time being idle during other RLHF stages. Supporting various placement strategies and maximizing device utilization are crucial for optimizing RLHF performance at any model size and cluster scale.

Table 1. Comparison of RLHF frameworks. Figures illustrate execution of one PPO iteration. Numbers 1-6 represent response generation, reward model inference, reference model inference, critic inference, actor training, and critic training, respectively.

| RLHF system | DeepSpeed-Chat | OpenRLHF | NeMo-Aligner | HybridFlow |
|---|---|---|---|---|
| Parallelism | Training: ZeRO; Generation: TP | Training: ZeRO; Generation: TP | 3D parallelism for both training and generation | Training: 3D, ZeRO, FSDP; Generation: 3D parallelism |
| Actor weights in training & generation | Model resharding from ZeRO to TP | Two copies of actor weights in the two stages | Same model partition in both stages (shared weights) | Zero-redundancy model resharding |
| Model placement | All models on the same set of devices | Each model on different devices | Actor/reference on some GPUs; critic/reward on the other GPUs | Supports various model placements |
| Execution pattern | (per-framework execution timelines are shown as figures in the original table and are not reproduced here) | | | |

2.4 Limitations of existing RLHF systems

Inflexible support for various RLHF dataflow graphs.

Existing RLHF systems adopt the multi-controller paradigm for dataflow implementation [17, 30, 80, 82]. To implement various RLHF algorithms, a user must navigate and manage code that mixes collective communication, model computation (potentially using various distributed training/serving frameworks), and point-to-point data transfer. This code structure lacks modularity/function encapsulation, making the RLHF systems tightly coupled with specific LLM training and serving frameworks. Consequently, a user needs to implement and optimize different RLHF dataflows case-by-case [46], hindering code reuse and increasing the risk of making mistakes. Existing RLHF frameworks only support the PPO algorithm. In addition, limited parallel strategies are supported due to implementation complexity. For example, to incorporate 3D parallelism for LLM training and generation in DeepSpeed-Chat [82], one may have to re-implement the whole system due to the mixed code structure.

Inefficient RLHF execution. Table 1 summarizes parallelism strategies, model placement, and execution patterns adopted by the existing RLHF systems. DeepSpeed-Chat [82] and OpenRLHF [30] adopt ZeRO-3 for actor training and TP for actor generation. OpenRLHF uses different copies of the actor model on different devices for training and generation, incurring redundant memory usage and frequent weight synchronization among devices. DeepSpeed-Chat maintains the same copy of the actor model on the same set of devices for training and generation, and reshards model weights between training and generation (due to different parallelisms used in the two stages), which may still incur substantial memory and communication overhead for large models (detailed in §5.4). NeMo-Aligner [17] uses the same 3D parallelism configurations in actor training and generation, experiencing low generation throughput (§8.4).

Existing RLHF frameworks are limited to one model placement plan and hence one RLHF execution pattern, as shown in Table 1. Implementing a different placement is difficult, requiring changes to the inner logic of model initialization and inter-node data transfer, as highlighted in blue in Figure 2. OpenRLHF and NeMo-Aligner allow concurrent model computation in the preparation and learning stages; in the generation stage, models except the actor are idle, wasting the GPUs they occupy. DeepSpeed-Chat colocates all models on the same set of devices, and each device runs each model sequentially according to the RLHF dataflow. With unbalanced workloads among the models, such a placement can be inefficient in resource utilization (evaluated in §8.3).

2.5 Design Considerations

To tackle limitations of existing systems, the key question is - How to design a flexible and efficient programming model to implement RLHF dataflow? A single-controller design is particularly advantageous at the inter-node level due to its flexibility in coordinating data transfer, execution order, and resource virtualization among distributed computation of different models [9, 50]. The RLHF dataflow graph typically consists of only a few nodes. Dispatching control messages to different nodes from the single-controller incurs negligible overhead as compared to distributed computation required for nodes (models) in the dataflow. The multi-controller paradigm, known for its low latency in dispatching operators to accelerators [20], can be leveraged in distributed computation of each model. With these insights, we propose a hierarchical hybrid programming model for RLHF dataflow implementation. Our key design principle is to combine single-controller and multi-controller paradigms in a hybrid manner. This design ensures flexible expression and efficient execution of RLHF dataflow, maintaining low control overhead at both inter-node and intra-node levels. As shown in Figure 2(b), this paradigm decouples intra-node distributed computation and inter-node data transfer, allowing each model to focus solely on local computation without managing inter-node communication.

为了应对现有系统的局限性,关键问题在于如何设计一个灵活高效的编程模型来实现RLHF数据流?单控制器设计在节点间级别具有显著优势,因为它在协调数据传输、执行顺序和资源虚拟化方面具有灵活性,特别是在不同模型的分布式计算中 [9, 50]。RLHF数据流图通常仅由少数几个节点组成。与数据流中节点(模型)所需的分布式计算相比,从单控制器向不同节点分发控制消息的开销可以忽略不计。多控制器范式以其在将操作分发到加速器时的低延迟而闻名 [20],可以在每个模型的分布式计算中加以利用。基于这些见解,我们提出了一种分层的混合编程模型来实现RLHF数据流。我们的关键设计原则是以混合方式结合单控制器和多控制器范式。这种设计确保了RLHF数据流的灵活表达和高效执行,同时在节点间和节点内保持低控制开销。如图2(b)所示,这种范式解耦了节点内的分布式计算和节点间的数据传输,使每个模型能够专注于本地计算,而无需管理节点间通信。

3 HybridFlow Overview

3 HybridFlow 概述

Figure 4 depicts the architecture of HybridFlow, which consists of three major components: Hybrid Programming Model,

图 4 描绘了 HybridFlow 的架构,它由三个主要组件组成:混合编程模型、

3D-Hybrid Engine and Auto-Mapping algorithm. The hybrid programming model includes a set of hierarchical APIs to enable flexible expression of the RLHF dataflow and efficient computation of models in the dataflow (§4). The 3D-Hybrid Engine is particularly designed for efficient training and generation of the actor model, allowing different 3D parallel configurations in the two stages and enabling zero memory redundancy and minimized communication overhead during the transition between two stages (§5). The auto-mapping algorithm determines optimized device placement of each model to maximize the throughput of RLHF (§6).

3D混合引擎与自动映射算法。混合编程模型包含一组分层API,用于灵活表达RLHF数据流并高效计算数据流中的模型(§4)。3D混合引擎专为演员模型的高效训练和生成设计,允许在两个阶段中使用不同的3D并行配置,并在阶段转换期间实现零内存冗余和最小化通信开销(§5)。自动映射算法确定每个模型的优化设备放置,以最大化RLHF的吞吐量(§6)。

The workflow of our RLHF system goes as follows. A user provides the following inputs to start the RLHF system: (i) model specifications, including the architecture and size of the actor/critic/reference policy/reward models in the RLHF dataflow; (ii) device placement of the models in the dataflow, as obtained by running the auto-mapping algorithm under given GPU cluster configurations; (iii) parallelism strategy for running each model in each stage, e.g., a tuple of (p, t, d) for 3D parallelism, where p, t, d represent PP size, TP size and DP size, respectively. The single controller program takes these inputs to initialize models in the RLHF dataflow and virtualized resource pool, dispatches operations/models to devices according to the placement plan, and invokes functions run by the multiple controllers on devices to carry out distributed computation of each model.

我们的 RLHF 系统工作流程如下。用户提供以下输入以启动 RLHF 系统:(i) 模型规格,包括 RLHF 数据流中 actor/critic/reference policy/reward 模型的架构和大小;(ii) 在给定 GPU 集群配置下运行自动映射算法获得的数据流中模型的设备放置;(iii) 每个阶段运行每个模型的并行策略,例如 (p, t, d) 元组用于 3D 并行,其中 p, t, d 分别代表 PP 大小、TP 大小和 DP 大小。单个控制器程序接收这些输入以初始化 RLHF 数据流和虚拟化资源池中的模型,根据放置计划将操作/模型分配到设备,并调用设备上由多个控制器运行的函数以执行每个模型的分布式计算。

The multi-controller program implements the ParallelWorker class: it constructs parallel groups of each model among allocated devices according to its parallelism strategies, invokes the 3D-Hybrid Engine for actor training and generation, and can be integrated seamlessly with existing LLM engines [40, 57, 60, 71] for training, inference and generation of other models. The transfer protocols are coordinated by the single controller program to support resharding of data (including prompts, responses, and other model outputs in RLHF) between models with distinct parallelism strategies. The data resharding of the actor between training and generation is handled by 3D-Hybrid Engine.

多控制器程序实现了 ParallelWorker 类:它根据其并行策略在分配的设备中构建每个模型的并行组,调用 3D-Hybrid Engine 进行 actor 训练和生成,并可以无缝集成现有的 LLM 引擎 [40, 57, 60, 71],用于其他模型的训练、推理和生成。传输协议由单个控制器程序协调,以支持具有不同并行策略的模型之间的数据(包括提示、响应和 RLHF 中的其他模型输出)重分片。训练和生成之间的 actor 数据重分片由 3D-Hybrid Engine 处理。

4 Hybrid Programming Model

4 混合编程模型

4.1 Hierarchical APIs

4.1 分层 API

Intra-node: encapsulating distributed program. For distributed computation of each model in different RLHF stages, we provide a base class, 3DParallelWorker. Given allocated devices, it facilitates distributed model weight initialization and establishes 3D parallel groups for each model. A parallel group includes a set of GPUs to host a specific parallel dimension of the model, e.g., different tensor shards in TP and different model replicas in DP. Figure 5(a) illustrates initialization of the actor model with our APIs, while initialization of other models is similar.

节点内部:封装分布式程序。针对不同RLHF阶段中每个模型的分布式计算,我们提供了一个基类 3DParallelWorker。在给定分配的设备后,它有助于分布式模型权重的初始化,并为每个模型建立3D并行组。一个并行组包括一组GPU,用于承载模型的特定并行维度,例如TP中的不同张量分片和DP中的不同模型副本。图5(a)展示了使用我们的API初始化actor模型的过程,而其他模型的初始化过程类似。


Figure 5. An illustration of hierarchical APIs. (a) Model with 3D parallel configuration, resource allocation, and 3DParallelWorker initialization. (b) Asynchronous data resharding between two models with collect and distribute functions in 3D_PROTO.

图 5: 分层 API 的示意图。(a) 具有 3D 并行配置、资源分配和 3DParallelWorker 初始化的模型。(b) 两个模型之间通过 3D_PROTO 中的 collect 和 distribute 函数进行异步数据重分片。

Inheriting from the 3DParallelWorker class, several model classes, for actor, critic, reference, and reward model, respectively, are provided. Each of these model classes encapsulates APIs to implement the model's distributed forward and backward computation, auto-regressive generation, and optimizer updates, decoupling the distributed computation code from data dependencies on other models. These APIs can be easily implemented by reusing the computation scripts from existing LLM systems. For example, the computation involved in the update_actor function of ActorWorker (the class for the actor model) is similar to the pre-training scripts in Megatron-LM [71]. A model class encapsulates fundamental operations for implementing various RLHF algorithms, e.g., generate_sequences in the actor model class for generating responses based on the prompts and compute_reward in the reward model class for evaluating responses through a forward pass. (More APIs are detailed in Appendix A).

继承自 3DParallelWorker 类,分别为 actor、critic、reference 和 reward 模型提供了多个模型类。每个模型类封装了实现模型的分布式前向和反向计算、自回归生成和优化器更新的 API,将分布式计算代码与其他模型的数据依赖解耦。这些 API 可以通过重用现有大语言模型系统中的计算脚本来轻松实现。例如,ActorWorker(actor 模型的类)中 update_actor 函数涉及的计算与 Megatron-LM [71] 中的预训练脚本类似。模型类封装了实现各种 RLHF 算法的基本操作,例如在 actor 模型类中通过 generate_sequences 基于提示生成响应,在 reward 模型类中通过 compute_reward 执行一次前向传递来评估响应。(更多 API 详见附录 A)。
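下面给出一个极简的示意性草图(纯 Python,非 verl 的实际实现;类的内部逻辑均为占位假设),演示模型类如何封装操作并对外暴露 generate_sequences、update_actor、compute_reward 等原语 API:

```python
class Base3DParallelWorker:
    """占位基类:真实的 3DParallelWorker 还会初始化分布式权重与 3D 并行组。"""
    def __init__(self, config, resource_pool):
        self.config = config
        self.resource_pool = resource_pool  # 分配给该模型的 GPU 集合

class ActorWorker(Base3DParallelWorker):
    def generate_sequences(self, prompts, do_sample=True):
        # 占位:真实实现为分布式自回归生成
        return {"prompts": list(prompts),
                "responses": [p + " <resp>" for p in prompts]}

    def update_actor(self, batch, loss_func="ppo"):
        # 占位:真实实现为分布式前向/反向与优化器更新
        return {"loss": 0.0, "loss_func": loss_func}

class RewardWorker(Base3DParallelWorker):
    def compute_reward(self, batch):
        # 占位:真实实现为一次分布式前向打分
        batch["rewards"] = [len(r) for r in batch["responses"]]
        return batch

actor = ActorWorker(config={}, resource_pool=[0, 1, 2, 3])
reward = RewardWorker(config={}, resource_pool=[0, 1, 2, 3])
batch = actor.generate_sequences(["q1", "q2"])
batch = reward.compute_reward(batch)
```

真实实现中,这些方法内部会调用 Megatron-LM 等引擎完成分布式计算;此处仅示意调用契约。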

Besides the base class 3DParallelWorker that implements 3D parallelism, we further provide base classes for PyTorch FSDP (FSDPWorker) and ZeRO (ZeROWorker), and the corresponding model classes inheriting each base class, to support different parallelism strategies in model computation. ParallelWorker in Figure 4 denotes one of these base classes.

除了实现 3D 并行的基类 3DParallelWorker,我们还提供了 PyTorch FSDP (FSDPWorker) 和 ZeRO (ZeROWorker) 的基类,以及继承每个基类的相应模型类,以支持模型计算中的不同并行策略。图 4 中的 ParallelWorker 表示这些基类之一。

Inter-node: unifying data resharding implementation between models. Many-to-many multicast is involved for data transfer between models employing different parallelism strategies on different devices. We unify this data transfer implementation by associating each operation in each model class with a transfer protocol, using @register. Each transfer protocol consists of a collect function and a distribute function, to aggregate output data and distribute input data according to the parallelism strategy of each model. In the example in Figure 5(a), the update_actor operation is registered to transfer protocol 3D_PROTO, as 3D parallelism is used for actor training. In 3D_PROTO, the collect function gathers all the output data of the corresponding model function (e.g., the loss scalar returned from update_actor) in each DP group to the single controller, and the distribute function distributes the input data of the registered function (e.g., advantages for update_actor) to each DP group. Data resharding is enabled using the source model's output collect function and the destination model's input distribute function. Figure 5(b) illustrates data resharding between the actor (generation) and the critic (inference), where computation of the models adopts different 3D parallelism strategies. The single controller gathers data futures using the collect function in 3D_PROTO of the actor (steps $\textcircled{1}$-$\textcircled{3}$) and sends them to the critic (step $\textcircled{4}$); the critic distributes the received data futures to each DP group using the distribute function in its 3D_PROTO (step $\textcircled{5}$). Then remote data is retrieved from the actor to the critic, with each of the critic's GPUs only fetching the required local batch of the actor's output data according to its DP rank (step $\textcircled{6}$). The actual data transfer only occurs between GPUs, avoiding any central bottleneck.

节点间:统一模型间的数据重分片实现。在不同设备上采用不同并行策略的模型之间进行数据传输时,涉及多对多的多播。我们通过使用 @register 将模型类中的每个操作关联到一个传输协议,来统一这一数据传输实现。每个传输协议由一个 collect 函数和一个 distribute 函数组成,根据每个模型的并行策略来聚合输出数据并分发输入数据。在图 5(a) 的示例中,由于 actor 训练采用了 3D 并行,update_actor 操作被注册到传输协议 3D_PROTO。在 3D_PROTO 中,collect 函数将每个 DP 组中相应模型函数的所有输出数据(例如,update_actor 返回的损失标量)聚合到单个控制器,distribute 函数将被注册函数的输入数据(例如,update_actor 所需的优势值)分发到每个 DP 组。数据重分片通过源模型的输出 collect 函数和目标模型的输入 distribute 函数实现。图 5(b) 展示了 actor(生成)和 critic(推理)之间的数据重分片,其中两个模型的计算采用了不同的 3D 并行策略。单个控制器使用 actor 的 3D_PROTO 中的 collect 函数收集数据 future(步骤 $\textcircled{1}$-$\textcircled{3}$),并将其发送给 critic(步骤 $\textcircled{4}$);critic 使用其 3D_PROTO 中的 distribute 函数将接收到的数据 future 分发给每个 DP 组(步骤 $\textcircled{5}$)。然后,远程数据从 actor 检索到 critic,critic 的每个 GPU 根据其 DP rank 仅获取 actor 输出数据中所需的本地批次(步骤 $\textcircled{6}$)。实际的数据传输仅在 GPU 之间进行,避免了任何中心瓶颈。
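下面用纯 Python 模拟 collect/distribute 的重分片思路(示意性草图:函数名与切分逻辑均为假设,真实系统传输的是 GPU 上的张量分片,且只传递 future):

```python
def collect_dp(shards):
    """collect:单控制器把源模型各 DP 组的本地输出聚合为全局 batch。"""
    merged = []
    for shard in shards:
        merged.extend(shard)
    return merged

def distribute_dp(batch, dp_size):
    """distribute:按目标模型的 DP 大小把全局 batch 均匀切分。"""
    assert len(batch) % dp_size == 0
    n = len(batch) // dp_size
    return [batch[i * n:(i + 1) * n] for i in range(dp_size)]

# actor 以 DP=2 产生输出(每个 DP 组一个本地 batch),critic 以 DP=4 消费
actor_outputs = [[0, 1, 2, 3], [4, 5, 6, 7]]
global_batch = collect_dp(actor_outputs)
critic_inputs = distribute_dp(global_batch, dp_size=4)
```

源模型的 collect 与目标模型的 distribute 组合起来,就完成了两种并行布局之间的数据重分片。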

We provide 8 transfer protocols, including 3D_PROTO, DP_PROTO, ONE_TO_ALL, etc., that cover most data resharding scenarios (detailed in Appendix B). A user can further extend the transfer protocols through implementing customized collect and distribute functions.

我们提供了 8 种传输协议,包括 3D_PROTO、DP_PROTO、ONE_TO_ALL 等,涵盖了大多数数据重分片场景(详见附录 B)。用户可以通过实现自定义的收集和分发函数进一步扩展传输协议。

Facilitating flexible model placement. We provide a ResourcePool class that virtualizes a set of GPU devices. When applying a ResourcePool instance to a model class (Figure 5(a)), distributed computation of the model will be mapped to the devices. Models utilizing the same ResourcePool instance are colocated on the same set of GPUs; models are placed on different sets of GPUs when different ResourcePool instances are applied in their model classes. We assume no overlap between different ResourcePool instances.

促进灵活的模型部署。我们提供了一个 ResourcePool 类,用于虚拟化一组 GPU 设备。当将一个 ResourcePool 实例应用于模型类时(图 5(a)),模型的分布式计算将被映射到这些设备上。使用相同 ResourcePool 实例的模型将被部署在同一组 GPU 上;当模型类中应用不同的 ResourcePool 实例时,模型将被部署在不同的 GPU 组上。我们假设不同的 ResourcePool 实例之间没有重叠。
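ResourcePool 的共置语义可以用一个最小示意(假设性代码)说明:使用同一实例的模型共置,不同实例对应不同的 GPU 集合:

```python
class ResourcePool:
    """虚拟化一组 GPU;内部字段为示意。"""
    def __init__(self, gpu_ids):
        self.gpu_ids = tuple(gpu_ids)

pool_a = ResourcePool(range(0, 8))    # GPU 0-7
pool_b = ResourcePool(range(8, 16))   # GPU 8-15
placement = {"actor": pool_a, "ref": pool_a,
             "critic": pool_b, "reward": pool_b}

def colocated(m1, m2):
    # 使用同一 ResourcePool 实例的模型被共置在同一组 GPU 上
    return placement[m1] is placement[m2]
```

切换放置方案只需改动模型类绑定的 ResourcePool 实例,而无需改动 RLHF 算法代码。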

Asynchronous dataflow execution. When models are placed on separate sets of devices, their execution is triggered automatically as soon as their inputs become available [50]. In Figure 5(b), the data future from the actor is immediately returned after the controller's call (steps $\textcircled{1}$-$\textcircled{3}$); the controller then initiates a new call to the critic and distributes the futures following the transfer protocol (steps $\textcircled{4}$-$\textcircled{5}$). When some models are placed on the same set of devices, they are executed sequentially based on the calling order. With our programming model, HybridFlow is flexible in supporting diverse distributed execution patterns without any code change of the RLHF algorithm (Figure 6).

异步数据流执行。当模型被放置在不同的设备集上时,一旦它们的输入可用,它们的执行就会自动触发 [50]。在图 5(b) 中,来自 actor 的数据 future 在控制器调用后立即返回(步骤 $\textcircled{1}$-$\textcircled{3}$);然后控制器发起对 critic 的新调用,并按照传输协议分发 future(步骤 $\textcircled{4}$-$\textcircled{5}$)。当某些模型被放置在同一设备集上时,它们会根据调用顺序依次执行。通过我们的编程模型,HybridFlow 能够灵活支持各种分布式执行模式,而无需对 RLHF 算法进行任何代码更改(图 6)。
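异步触发的效果可以用标准库的 future 粗略模拟(示意性草图:真实系统中由 Ray 的 future 承担该角色,且结果以引用形式在 GPU 间传递):

```python
from concurrent.futures import ThreadPoolExecutor

def actor_generate(prompts):          # 模拟放在设备集 A 上的 actor
    return [p + "_resp" for p in prompts]

def critic_values(batch):             # 模拟放在设备集 B 上的 critic
    return [len(x) for x in batch]

with ThreadPoolExecutor() as pool:
    fut = pool.submit(actor_generate, ["a", "b"])   # 调用立即返回 future
    # 控制器不必阻塞在 actor 上;critic 在其输入(future 的结果)就绪后执行
    values = pool.submit(critic_values, fut.result()).result()
```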

图 6: 使用混合编程模型实现 PPO、ReMax 和 Safe-RLHF 的示例。

```python
# 通过复用 RewardWorker 初始化成本模型
cost = RewardWorker(cost_config, resource_pool)  # 为 Safe-RLHF 添加
# 省略其他模型初始化
# algo_type 指定不同 RLHF 算法的数值计算
# PPO、ReMax 和 Safe-RLHF 的示例
for (prompts, pretrain_batch) in dataloader:
    # 阶段 1: 生成响应
    batch = actor.generate_sequences(prompts)
    batch = actor.generate_sequences(prompts, do_sample=False)  # 为 ReMax 添加
    # 阶段 2: 准备经验
    batch = critic.compute_values(batch)  # 在 ReMax 中不需要
    batch = reference.compute_log_prob(batch)
    batch = reward.compute_reward(batch)
    batch = cost.compute_cost(batch)  # 为 Safe-RLHF 添加
    batch = compute_advantages(batch, algo_type)
    # 阶段 3: Actor 和 critic 训练
    critic_metrics = critic.update_critic(batch, loss_func=algo_type)  # 在 ReMax 中不需要
    pretrain_loss = actor.compute_loss(pretrain_batch)  # 为 Safe-RLHF 添加
    batch["pretrain_loss"] = pretrain_loss  # 为 Safe-RLHF 添加
    actor_metrics = actor.update_actor(batch, loss_func=algo_type)
```

4.2 Implementation of different RLHF algorithms

4.2 不同RLHF算法的实现

Our APIs enable streamlined development of various RLHF algorithms (dataflows). Users can implement an RLHF algorithm in a few lines of code as a single process program to run on the single controller, which involves a sequence of primitive API calls to invoke distributed computation of models. Examples of PPO, ReMax, and Safe-RLHF are given in Figure 6. PPO can be implemented in just 8 lines by invoking model operations including compute_values and generate_sequences, which are executed under the multi-controller paradigm on multiple GPUs. To adapt to Safe-RLHF, which integrates an additional cost model to evaluate safety preferences and the pre-training loss for the actor, only 5 more lines of code are added on top of the PPO implementation. To adapt to ReMax, one additional call to actor generation is needed, and the critic-related code can be removed.

我们的 API 支持各种 RLHF 算法(数据流)的简化开发。用户只需几行代码即可将一个 RLHF 算法实现为在单控制器上运行的单进程程序,其中包含一系列原语 API 调用来触发模型的分布式计算。图 6 给出了 PPO、ReMax 和 Safe-RLHF 的示例。PPO 只需 8 行代码即可实现,通过调用 compute_values 和 generate_sequences 等模型操作,这些操作在多个 GPU 上以多控制器范式执行。为了适配 Safe-RLHF(它集成了额外的成本模型来评估安全偏好,并为 actor 引入预训练损失),只需在 PPO 实现的基础上添加 5 行代码。为了适配 ReMax,需要额外调用一次 actor 生成,并可删除与 critic 相关的代码。

Achieving flexibility. This flexibility of extension is crucial for researchers to explore different RLHF algorithms: they can reuse the distributed computation encapsulated in each model class and simply adjust the code for numerical computations according to specific algorithms, such as GAE [67] in compute_advantages and KL divergence in the loss functions of actor and critic. The streamlined development can be attributed to the hybrid programming model. Our modular API design simplifies development, facilitates extensive code reuse, and enables directly incorporating the codebase of existing LLM training/serving frameworks. It also decouples model computation and data transfer among models. Any change in the distributed frameworks does not affect the code of the RLHF algorithm (Figure 6), enabling individualized optimization for each model's execution (§5). Flexible placement of models with diverse workloads is supported, enabling optimized mapping of RLHF dataflow onto various devices (§6).

实现灵活性。这种扩展的灵活性对于研究人员探索不同的 RLHF 算法至关重要:他们可以复用封装在每个模型类中的分布式计算,只需根据具体算法调整数值计算代码,例如 compute_advantages 中的 GAE [67] 以及 actor 和 critic 损失函数中的 KL 散度。这种简化的开发归功于混合编程模型。我们的模块化 API 设计简化了开发,促进了广泛的代码重用,并能够直接集成现有大语言模型训练/服务框架的代码库。它还解耦了模型计算和模型之间的数据传输。分布式框架的任何更改都不会影响 RLHF 算法的代码(图 6),从而实现对每个模型执行的个性化优化(§5)。支持具有不同工作负载的模型的灵活放置,从而优化 RLHF 数据流在各种设备上的映射(§6)。

Figure 7. 3D-Hybrid Engine workflow in one RLHF iteration. 4 GPUs are used for actor training and generation. 1-2-2 $(p{-}t{-}d)$ parallel groups are used in training and 1-1-2-2 $(p_g{-}t_g{-}d_g{-}d)$ parallel groups are used in generation.

图 7: 3D-Hybrid Engine 在一次 RLHF 迭代中的工作流程。4 个 GPU 用于 actor 训练和生成。训练中使用 1-2-2 $(p{-}t{-}d)$ 并行组,生成中使用 1-1-2-2 $(p_g{-}t_g{-}d_g{-}d)$ 并行组。

5 3D-Hybrid Engine

5 3D-混合引擎

We design the 3D-Hybrid Engine to support efficient training and generation of the actor model, targeting significant RLHF throughput improvement.

我们设计了3D-Hybrid Engine,以支持高效的演员模型训练和生成,旨在显著提升RLHF的吞吐量。

5.1 Parallel Groups

5.1 并行组

To eliminate redundant actor model copies, we advocate deploying actor training and generation stages on the same set of devices, $N_{a}$ GPUs allocated to the actor, and execute them sequentially on the same copy of actor model weights. Nonetheless, actor training and generation may well adopt different 3D parallelism strategies, i.e., the generation stage typically requires smaller TP and PP sizes but a larger DP size, than the training stage (§2.3). 3D-Hybrid Engine enables efficient model parameter resharding between actor training and generation across the same set of devices in this context.

为了消除冗余的演员模型副本,我们建议将演员训练和生成阶段部署在同一组设备上,即分配给演员的 $N_{a}$ 个 GPU 上,并在同一副本的演员模型权重上顺序执行它们。然而,演员训练和生成阶段可能会采用不同的 3D 并行策略,即生成阶段通常需要更小的 TP 和 PP 规模,但 DP 规模比训练阶段更大(§2.3)。3D-Hybrid Engine 在这种情况下能够高效地在同一组设备上实现演员训练和生成之间的模型参数重分片。

Let $p{-}t{-}d$ denote 3D parallel groups constructed for actor training, corresponding to the set of GPUs to host $p$ pipeline stages, $t$ tensor shards, and $d$ model replicas [54]. 3D-Hybrid Engine builds different parallel groups for actor training and generation, according to their different 3D parallelism strategies, respectively. We use $p_g$, $t_g$, and $d_g$ to denote the size of the generation pipeline parallel group, generation tensor parallel group, and micro data parallel group, respectively, in the generation stage. $d_g$ indicates the ratio of model replica number in generation over that in training, i.e., each DP replica in training becomes $d_g$ micro DP replicas, to process $d_g$ micro batches of prompts and responses. We have $N_a = p \times t \times d = p_g \times t_g \times d_g \times d$, such that $d_g = \frac{p}{p_g} \cdot \frac{t}{t_g}$. The micro DP groups are employed exclusively in the actor generation stage to render a larger DP size for full device utilization. The generation parallel groups are denoted by $p_g{-}t_g{-}d_g{-}d$.

令 $p{-}t{-}d$ 表示为演员训练构建的3D并行组,对应于托管 $p$ 个流水线阶段、$t$ 个张量分片和 $d$ 个模型副本的GPU集合 [54]。3D-Hybrid Engine 根据不同的3D并行策略,分别为演员训练和生成构建不同的并行组。我们使用 $p_g$、$t_g$ 和 $d_g$ 分别表示生成阶段的生成流水线并行组、生成张量并行组和微数据并行组的大小。$d_g$ 表示生成阶段的模型副本数与训练阶段的比例,即训练中的每个DP副本变为 $d_g$ 个微DP副本,以处理 $d_g$ 个微批次的提示和响应。我们有 $N_a = p \times t \times d = p_g \times t_g \times d_g \times d$,因此 $d_g = \frac{p}{p_g} \cdot \frac{t}{t_g}$。微DP组仅在演员生成阶段使用,以提供更大的DP大小实现设备的充分利用。生成并行组表示为 $p_g{-}t_g{-}d_g{-}d$。
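上述约束可以直接用几行 Python 验证(以图 7 的训练 1-2-2、生成 1-1-2-2 配置为例):

```python
def micro_dp_size(p, t, p_g, t_g):
    """d_g = (p/p_g) * (t/t_g),要求生成并行度整除训练并行度。"""
    assert (p * t) % (p_g * t_g) == 0
    return (p * t) // (p_g * t_g)

# 图 7 的配置:训练 1-2-2 (p-t-d),生成 1-1-2-2 (p_g-t_g-d_g-d)
p, t, d = 1, 2, 2
p_g, t_g = 1, 1
d_g = micro_dp_size(p, t, p_g, t_g)
N_a = p * t * d
# 两种分解给出同一 GPU 数:N_a = p*t*d = p_g*t_g*d_g*d
```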

5.2 3D-Hybrid Engine Workflow

5.2 3D混合引擎工作流程

Between actor training in iteration $i$ of RLHF and actor generation in iteration $i+1$, the actor model parameters need to be resharded and prompts data to be distributed, following the parallel group configurations in the two stages. In iteration $i+1$ of RLHF, 3D-Hybrid Engine gathers the actor model parameters updated in iteration $i$ (step $\textcircled{1}$ in Figure 7), for generation within each micro DP group. Then, the batch of prompts is loaded to each model replica (step $\textcircled{2}$), which generates responses (Generation stage of RLHF). Following this, 3D-Hybrid Engine performs an all-gather operation on the generation results within each micro DP group (step $\textcircled{3}$), and re-partitions model parameters according to the 3D parallelism for actor training (step $\textcircled{4}$). With model weights, prompts and responses correctly re-distributed, the loss of the actor model is computed and actor model weights are updated following the RLHF algorithm (step $\textcircled{5}$) - the actor training stage of iteration $i+1$.

在 RLHF 迭代 $i$ 中的 actor 训练和迭代 $i+1$ 中的 actor 生成之间,actor 模型参数需要根据两个阶段的并行组配置进行重新分片,并将提示数据分发。在 RLHF 迭代 $i+1$ 中,3D-Hybrid Engine 收集在迭代 $i$ 中更新的 actor 模型参数(图 7 中的步骤 $\textcircled{1}$),以在每个微 DP 组内进行生成。然后,将一批提示加载到每个模型副本中(步骤 $\textcircled{2}$),并生成响应(RLHF 的生成阶段)。随后,3D-Hybrid Engine 在每个微 DP 组内对生成结果执行 all-gather 操作(步骤 $\textcircled{3}$),并根据 3D 并行性重新分区模型参数以进行 actor 训练(步骤 $\textcircled{4}$)。在正确重新分配模型权重、提示和响应后,计算 actor 模型的损失,并根据 RLHF 算法更新 actor 模型权重(步骤 $\textcircled{5}$)——迭代 $i+1$ 的 actor 训练阶段。


Figure 8. Model weights resharding. 2 machines each with 4 GPUs are used for actor training and generation.

图 8: 模型权重重新分片。使用两台机器,每台机器配备 4 个 GPU,用于 Actor 训练和生成。

5.3 Zero redundancy model resharding

5.3 零冗余模型重分片

Parallel grouping methods in 3D parallelism are typically as follows: PP and TP groups are formed by assigning consecutive ranks to pipeline stages and tensor shards, respectively; DP groups are constructed by selecting ranks at regular intervals, determined by the product of PP size and TP size. In Figure 8(a), actor training uses 3D parallel groups 1-4-2: there is one PP group for all GPUs (for illustration clarity); the TP groups are [G1, G2, G3, G4], [G5, G6, G7, G8], and the DP groups are [G1, G5], [G2, G6], [G3, G7], [G4, G8]. Suppose the same parallel grouping methods are used but with different parallel sizes, e.g., 1-2-2-2 for generation in Figure 8(a). During the transition from training to generation, 3D-Hybrid Engine applies all-gather operations among the model parallel groups to aggregate all parameters, and then retains only a subset of model weights on each device for its generation, according to the parallel groups the device belongs to. On some GPUs (e.g., G2, G3, G6, G7), there is no overlap between training and generation model weights, and separate memory is needed to maintain weights for subsequent training as well (grey boxes in Figure 8(a)). We call the system HybridFlow-V when 3D-Hybrid Engine uses the above vanilla parallel grouping methods in the two stages.

3D并行中的并行分组方法通常如下:PP组和TP组分别通过为流水线阶段和张量分片分配连续的rank来形成;DP组通过以固定间隔选择rank来构建,间隔由PP大小和TP大小的乘积决定。在图8(a)中,actor训练使用了3D并行分组,1-4-2:所有GPU有一个PP组(为了说明清晰);TP组是[G1, G2, G3, G4]、[G5, G6, G7, G8],DP组是[G1, G5]、[G2, G6]、[G3, G7]、[G4, G8]。假设使用相同的并行分组方法但并行大小不同,例如图8(a)中生成的1-2-2-2。在从训练过渡到生成时,3D-Hybrid Engine在模型并行组之间应用all-gather操作以聚合所有参数,然后根据设备所属的并行组,仅保留每个设备上的部分模型权重用于生成。在某些GPU上(例如G2、G3、G6、G7),训练和生成的模型权重没有重叠,还需要单独的内存来维护后续训练的权重(图8(a)中的灰色框)。当3D-Hybrid Engine在两个阶段使用上述vanilla并行分组方法时,我们称该系统为HybridFlow-V。

Table 2. Transition overhead between training and generation

表 2. 训练和生成之间的转换开销

| | DS-Chat | HybridFlow-V | HybridFlow |
|---|---|---|---|
| 通信量 (Comm. Vol.) | $\frac{tpd-1}{tpd}M$ | $\frac{tp-1}{tp}M$ | $\frac{tp-t_g p_g}{t_g p_g \cdot tp}M$ |
| 峰值内存 (Peak Mem.) | $M$ | $M$ | $\frac{1}{t_g p_g}M$ |
| 冗余 (Redundancy) | $\frac{1}{tpd}M$ | $\frac{1}{tp}M$ | $0$ |

We further design a new parallel grouping method for 3D-Hybrid Engine to use in the generation stage, which eliminates the redundancy in weight storage and leads to minimal memory footprint and communication due to actor model resharding between training and generation. Specifically, we form generation TP and PP groups by selecting ranks at regular intervals, determined by $\frac{t}{t_{g}}$ and $\frac{p}{p_{g}}$, and construct micro DP groups by sequentially assigning ranks along the generation TP or PP dimensions. In Figure 8(b), 1-2-2-2 parallel groups are used in generation: the generation TP groups are [G1, G3], [G2, G4], [G5, G7], [G6, G8]; and the micro DP groups are [G1, G2], [G3, G4], [G5, G6], [G7, G8]. This strategic rearrangement of generation parallel groups leads to overlap between training and generation model weights on each device, enabling reuse of training weights during generation and zero redundancy in device memory usage due to model resharding. In addition, 3D-Hybrid Engine conducts several all-gather operations concurrently, one within each micro DP group, leading to significantly reduced communication overhead.

我们进一步为3D-Hybrid Engine设计了一种新的并行分组方法,用于生成阶段,该方法消除了权重存储中的冗余,并在训练和生成之间由于actor模型的重新分片而实现了最小的内存占用和通信开销。具体来说,我们通过按$\frac{t}{t_{g}}$和$\frac{p}{p_{g}}$的间隔选择rank来形成生成TP和PP组,并沿着生成TP或PP维度顺序分配rank来构建微DP组。在图8(b)中,生成阶段使用了1-2-2-2并行分组:生成TP组为[G1, G3]、[G2, G4]、[G5, G7]、[G6, G8];微DP组为[G1, G2]、[G3, G4]、[G5, G6]、[G7, G8]。这种生成并行组的战略性重新排列使得每个设备上的训练和生成模型权重重叠,从而在生成过程中重用训练权重,并且由于模型重新分片而实现了设备内存使用的零冗余。此外,3D-Hybrid Engine在每个微DP组内同时进行多个all-gather操作,显著减少了通信开销。
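按 §5.3 的分组规则,可以用如下草图重建图 8(b) 的生成并行组(设 G1..G8 对应 rank 0..7;为简化假设 p = p_g = 1,函数名为示意):

```python
def generation_groups(t, t_g, d, d_g):
    """重建生成阶段的 TP 组与微 DP 组(假设 p = p_g = 1,rank 按训练 TP 连续编号)。"""
    n = t * d                      # 总 rank 数
    stride = t // t_g              # 生成 TP 组内相邻 rank 的间隔
    tp_groups, micro_dp_groups = [], []
    for base in range(0, n, t):    # 遍历每个训练 TP 组(连续 t 个 rank)
        for off in range(stride):  # 按间隔 stride 选取 rank,构成生成 TP 组
            tp_groups.append([base + off + k * stride for k in range(t_g)])
        for g in range(t // d_g):  # 连续 d_g 个 rank 构成一个微 DP 组
            micro_dp_groups.append([base + g * d_g + k for k in range(d_g)])
    return tp_groups, micro_dp_groups

# 训练 1-4-2,生成 1-2-2-2;G1..G8 即 rank 0..7
tp_groups, micro_dp_groups = generation_groups(t=4, t_g=2, d=2, d_g=2)
```

输出与图 8(b) 一致:生成 TP 组间隔取 rank,微 DP 组取连续 rank,因而每个设备保留的生成权重分片恰好是其训练分片的子集。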

5.4 Transition overhead

5.4 转换开销

In Table 2, we compare communication overhead and memory footprint during the transition between training and generation stages, among different actor engine designs. We assume the model size of the actor is $M$ and $N_a$ GPUs are used for its training and generation. The actor engine in DeepSpeed-Chat conducts an all-gather operation across all GPUs during transition; HybridFlow-V performs this all-gather within training TP and PP groups. The communication volumes for these operations are $\frac{N_a-1}{N_a}M=\frac{tpd-1}{tpd}M$ for DeepSpeed-Chat and $\frac{tp-1}{tp}M$ for HybridFlow-V, calculated following [13]. Both engines aggregate all model parameters in each GPU's memory before subsequently partitioning model states according to the generation parallel groups, resulting in a peak memory usage of model parameters $M$. As they cannot reuse training weights during generation on some GPUs, training weights need to be maintained on them, amounting to $\frac{1}{tpd}M$ and $\frac{1}{tp}M$ redundant memory consumption, respectively.

表 2 中,我们比较了不同 actor 引擎设计在训练和生成阶段转换期间的通信开销和内存占用情况。我们假设 actor 模型的大小为 $M$,并使用 $N_a$ 个 GPU 进行其训练和生成。DeepSpeed-Chat 中的 actor 引擎在转换期间在所有 GPU 之间执行 all-gather 操作;HybridFlow-V 则在训练 TP 和 PP 组内执行此 all-gather 操作。按照 [13] 计算,这些操作的通信量分别为 DeepSpeed-Chat 的 $\frac{N_a-1}{N_a}M=\frac{tpd-1}{tpd}M$ 和 HybridFlow-V 的 $\frac{tp-1}{tp}M$。两个引擎都会先在每个 GPU 的内存中聚合全部模型参数,再按生成并行组划分模型状态,从而导致模型参数的峰值内存使用量为 $M$。由于它们在某些 GPU 上无法在生成期间复用训练权重,需要在这些 GPU 上额外维护训练权重,分别带来 $\frac{1}{tpd}M$ 和 $\frac{1}{tp}M$ 的冗余内存消耗。

With our parallel grouping method for the generation stage, HybridFlow confines the all-gather operation within each micro DP group. The communication overhead is reduced to $\frac{d_g-1}{tp}M=\frac{tp-t_g p_g}{t_g p_g \cdot tp}M$. Each GPU only needs to collect remote parameters within its micro DP group and can reuse the training weights in generation. Therefore, the peak memory usage of model parameters in HybridFlow precisely matches the model partition size on each GPU in generation, eliminating any redundancy in GPU memory usage.

通过我们在生成阶段采用的并行分组方法,HybridFlow 将 all-gather 操作限制在每个微 DP 组内。通信开销降低为 $\frac{d_g-1}{tp}M=\frac{tp-t_g p_g}{t_g p_g \cdot tp}M$。每个 GPU 只需收集其微 DP 组内的远程参数,并可在生成期间复用训练权重。因此,HybridFlow 中模型参数的峰值内存使用量与生成时每个 GPU 上的模型分区大小恰好一致,消除了 GPU 内存使用中的任何冗余。
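以图 8 的配置(训练 1-4-2、生成 1-2-2-2)代入表 2 的公式,可以验证三种引擎每 GPU 的转换通信量(以模型大小 $M$ 为单位的分数;函数名为示意):

```python
from fractions import Fraction

def comm_vol(engine, p, t, d, p_g=1, t_g=1):
    """每 GPU 转换通信量(以模型大小 M 为单位),按表 2 的公式。"""
    if engine == "ds-chat":
        return Fraction(p * t * d - 1, p * t * d)
    if engine == "hybridflow-v":
        return Fraction(p * t - 1, p * t)
    if engine == "hybridflow":
        d_g = (p * t) // (p_g * t_g)
        return Fraction(d_g - 1, p * t)   # 等于 (tp - t_g*p_g)/(t_g*p_g*tp)
    raise ValueError(engine)

# 图 8 的配置:训练 1-4-2,生成 1-2-2-2
v1 = comm_vol("ds-chat", 1, 4, 2)
v2 = comm_vol("hybridflow-v", 1, 4, 2)
v3 = comm_vol("hybridflow", 1, 4, 2, p_g=1, t_g=2)
```

在该配置下三者分别为 $7/8\,M$、$3/4\,M$ 与 $1/4\,M$,体现了把 all-gather 限制在微 DP 组内带来的节省。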

6 Auto Device Mapping

6 自动设备映射

Our hybrid programming model requires users to input the following configurations, which are referred to as a mapping of the RLHF dataflow to the given devices: (a) device placement of the models in the dataflow; (b) the corresponding parallelism strategy for running each model in each stage.

我们的混合编程模型要求用户输入以下配置,这些配置被称为RLHF数据流到给定设备的映射:(a) 数据流中模型的设备放置;(b) 每个阶段中运行每个模型的相应并行策略。

We provide an efficient algorithm (Algorithm 1) for users to identify the optimized mapping of executing the RLHF dataflow on a given cluster of devices, that minimizes the end-to-end latency of each RLHF iteration. Given a dataflow $D$ , we first explore all possible placement plans $\mathcal{P}$ for the models in the given cluster (Line 3). For example, the PPO algorithm involves four models, resulting in 15 possible placements (from the Bell partition problem [10, 62]), ranging from a completely standalone placement where all models are placed on different devices (e.g., OpenRLHF’s placement) to colocating all models on the same set of devices (e.g., DeepSpeed-Chat’s placement). We refer to colocated models on the same set of GPUs as a colocated set. Models in a colocated set can employ different parallelism strategies across the same set of GPUs. We identify the smallest number of GPUs to be allocated to each of the colocated model sets, $A_{m i n}$ , based on memory consumption of colocated models, ensuring no out-of-memory errors (Line 9).

我们提供了一种高效的算法(算法1),用于帮助用户在给定设备集群上识别执行RLHF数据流的最优映射,从而最小化每次RLHF迭代的端到端延迟。给定一个数据流$D$,我们首先探索了在给定集群中所有可能的模型放置计划$\mathcal{P}$(第3行)。例如,PPO算法涉及四个模型,导致15种可能的放置(来自Bell分区问题[10, 62]),从完全独立的放置(所有模型放置在不同设备上,例如OpenRLHF的放置)到将所有模型放在同一组设备上(例如DeepSpeed-Chat的放置)。我们将放置在同一组GPU上的模型称为一个共置集。共置集中的模型可以在同一组GPU上采用不同的并行策略。我们根据共置模型的内存消耗,确定了分配给每个共置模型集的最小GPU数量$A_{min}$,以确保不会出现内存不足的错误(第9行)。

Next, starting from the minimal GPU allocation in $A_{min}$, we enumerate all feasible device allocations to each colocated model set (Lines 10-12). Given device allocation $A$ to the colocated set and computation workload $W$ of models in the set, we explore optimized parallelism strategies for each model in the auto parallel module, minimizing model execution latency. The workload $W$ includes input and output shapes and computation (training, inference or generation) of each model. In auto parallel, we utilize a simulator module simu to estimate the latency of different parallel strategies, following previous research [42, 84, 90, 92] (outlined in Appendix C).

接下来,从 $A_{m i n}$ 中的最小 GPU 分配开始,我们枚举所有可行的设备分配到每个共置模型集(第 10-12 行)。给定共置集的设备分配 $A$ 和集中模型的计算工作量 𝑊,我们在自动并行模块中为每个模型探索优化的并行策略,以最小化模型执行延迟。工作量 $W$ 包括每个模型的输入和输出形状以及计算(训练、推理或生成)。在自动并行中,我们利用模拟器模块 simu 来估计不同并行策略的延迟,遵循先前的研究 [42, 84, 90, 92](见附录 C 的概述)。

The d_cost module estimates the end-to-end latency of the RLHF dataflow under given model placement and parallelism strategies, by iterating through all stages in the dataflow graph and summing up latencies of all stages (Lines 25-34). For models in the same colocated set and involving computation in the same stage (such as actor and critic both performing model update in RLHF training stage), their execution latencies are summed up (Line 32). For models in different colocated sets, their execution within the same stage can be parallelized, and the latency of the stage is determined by the maximum execution time among different sets (Line 33). We identify the best device placement of the models with their corresponding parallelism strategies, achieving minimal execution time per RLHF iteration (Lines 18-23).

d_cost模块通过遍历数据流图中的所有阶段并累加所有阶段的延迟,估计给定模型放置和并行策略下RLHF数据流的端到端延迟(第25-34行)。对于位于同一共置集并在同一阶段涉及计算的模型(例如actor和critic在RLHF训练阶段都执行模型更新),它们的执行延迟会被累加(第32行)。对于位于不同共置集的模型,它们在相同阶段的执行可以并行化,该阶段的延迟由不同集中最大执行时间决定(第33行)。我们确定模型的最佳设备放置及其相应的并行策略,以实现每次RLHF迭代的最小执行时间(第18-23行)。

```
算法 1: RLHF 数据流的设备映射
 1: 输入: RLHF 数据流图 D, RLHF 数据流中的大语言模型 L = [l1, l2, ..., lk],
    各模型的工作负载 W, GPU 总数 N, 每个 GPU 的内存容量 Q
 2: 输出: RLHF 数据流中模型的最优设备映射
 3: P ← get_placements(D, L, N)
 4: C* ← ∞
 5: best_mapping ← ∅
 6: for all plm ∈ P do
 7:     Cplm ← ∞
 8:     best_plm_alloc ← ∅
 9:     Amin ← get_min_alloc(plm, Q, N)
10:     for all A ∈ enum_alloc(N, Amin) do
11:         L' ← []
12:         for all set ∈ plm do
13:             for all l ∈ set do
14:                 l' ← auto_parallel(A, Amin, l, W)
15:                 L'.append(l')
16:         plm.update(L')
17:         Calloc ← d_cost(D, plm, W)
18:         if Calloc < Cplm then
19:             Cplm ← Calloc
20:             best_plm_alloc ← (plm, A)
21:     if Cplm < C* then
22:         C* ← Cplm
23:         best_mapping ← best_plm_alloc
24: return best_mapping
25: 过程 d_cost(D, plm, W):
26:     s ← D 中的阶段数
27:     c ← [0] × s    // 将每个阶段的延迟初始化为 0
28:     for all set ∈ plm do
29:         cg ← [0] × s
30:         for all i ∈ {0, ..., s−1} do
31:             for all l ∈ set do
32:                 cg[i] ← cg[i] + simu(l, W[i])
33:             c[i] ← max{c[i], cg[i]}
34:     return sum(c)
```
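d_cost 过程的核心思路(共置集内同阶段延迟相加、不同共置集之间同阶段取最大)可以用如下可运行的 Python 草图表达(放置方案与延迟数字均为假设值,代替 simu 的真实估计):

```python
def d_cost(placement, latency):
    """估算一次 RLHF 迭代延迟:共置集内同阶段相加,共置集之间同阶段取最大。"""
    num_stages = len(next(iter(latency.values())))
    total = 0.0
    for i in range(num_stages):
        stage_lat = 0.0
        for colocated_set in placement:           # 不同共置集可并行执行
            set_lat = sum(latency[m][i] for m in colocated_set)  # 同集内串行
            stage_lat = max(stage_lat, set_lat)
        total += stage_lat
    return total

# 假设的两集共置方案与三阶段延迟(生成/准备/训练)
placement = [["actor", "ref"], ["critic", "reward"]]
latency = {"actor": [8.0, 1.0, 4.0], "ref": [0.0, 1.0, 0.0],
           "critic": [0.0, 1.0, 3.0], "reward": [0.0, 1.0, 0.0]}
cost = d_cost(placement, latency)
```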

The complexity of Algorithm 1 is $O\left(\frac{(N-1)!}{(k-1)!(N-k)!}\right)$, where $k$ is the number of models in the dataflow and $N$ is the total number of devices to run the dataflow. This is the worst-case complexity for enumerating all possible device allocations for a placement strategy (i.e., the standalone placement), calculated by assigning $N$ devices to $k$ models (known as the integer partition problem [6]). For better efficiency, we cache parallelism strategies identified for each model on a number of devices $A$, to eliminate redundant searches for the same parallelism strategies when the model is placed on different sets of $A$ GPUs in different placement strategies.

算法 1 的复杂度为 $O\left(\frac{(N-1)!}{(k-1)!(N-k)!}\right)$,其中 $k$ 是数据流中的模型数量,$N$ 是运行数据流的总设备数量。这是枚举所有可能的设备分配策略(即独立放置)的最坏情况复杂度,通过将 $N$ 个设备分配给 $k$ 个模型(称为整数划分问题 [6])计算得出。为了提高效率,我们缓存了在 $A$ 个设备上为每个模型确定的并行策略,以消除在不同放置策略中,当模型放置在不同的 $A$ 个 GPU 集合上时,对相同并行策略的冗余搜索。
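正文的最坏情况复杂度即"把 $N$ 个 GPU 分给 $k$ 个独立放置的模型(每个至少 1 个)"的方案数 $\binom{N-1}{k-1}$,可以用几行代码在小规模上交叉验证:

```python
from math import comb
from itertools import product

def num_allocations(N, k):
    """把 N 个 GPU 分给 k 个独立放置的模型(每个至少 1 个)的方案数 C(N-1, k-1)。"""
    return comb(N - 1, k - 1)

def brute_force(N, k):
    """小规模下的暴力枚举,用于交叉验证组合公式。"""
    return sum(1 for a in product(range(1, N + 1), repeat=k) if sum(a) == N)
```

例如 $N=16$、$k=4$(PPO 的四个模型全部独立放置)时共有 $\binom{15}{3}=455$ 种分配方案,这也说明了缓存每种设备数下并行策略搜索结果的必要性。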

Though we assume $N$ homogeneous GPUs when running the auto mapping algorithm, Algorithm 1 can be readily extended for optimizing model mapping over heterogeneous devices, by considering heterogeneous devices in simu and auto parallel modules [88].

尽管我们在运行自动映射算法时假设了 $N$ 个同构 GPU,但通过在 simu 和 auto parallel 模块中考虑异构设备 [88],算法 1 可以轻松扩展为优化异构设备上的模型映射。

7 Implementation

7 实现

HybridFlow is implemented in around 12k lines of Python code (LoC).

HybridFlow 以约 12k 行 Python 代码 (LoC) 实现。

Hybrid programming model. The hierarchical APIs are implemented with $1.8\mathrm{k}$ LoC. The centralized single controller is built on top of Ray [50] and uses Remote Procedure Calls (RPC) to coordinate the execution order of different models and transfer data between models following the dataflow. These intermediate data are stored in TensorDict [57]. In our multi-controller paradigm for distributed computation, each model function runs on a separate process across various devices, with control messages relayed from each controller's CPU process to the corresponding GPU. Our implementation supports Megatron-LM, PyTorch FSDP, and DeepSpeed as the LLM training and inference engines, and vLLM for auto-regressive generation. In vLLM, we replace the centralized KVCache manager with a distributed manager to align with the multi-controller paradigm.

混合编程模型。分层 API 使用 $1.8\mathrm{k}$ 行代码实现。集中式单一控制器构建在 Ray [50] 之上,并使用远程过程调用 (RPC) 来协调不同模型的执行顺序,并按照数据流在模型之间传递数据。这些中间数据存储在 TensorDict [57] 中。在我们的分布式计算多控制器范式中,每个模型函数在不同设备上的单独进程中运行,控制消息从每个控制器的 CPU 进程传递到相应的 GPU。我们的实现支持 Megatron-LM、PyTorch FSDP 和 DeepSpeed 作为大语言模型训练和推理引擎,并使用 vLLM 进行自回归生成。在 vLLM 中,我们将集中式 KVCache 管理器替换为分布式管理器,以与多控制器范式保持一致。

3D-Hybrid Engine. Its main logic is implemented with 2.4k LoC on top of Megatron-LM and vLLM. We store actor model weights for training and generation stages on separate memory buffers, offload generation weights to the CPU memory during training, reload generation weights back to GPU memory during the transition, and use both buffers in generation. We use NCCL communication primitives [35] to collect and concatenate model parameters in each micro DP group during the transition between training and generation. We offload KVCache to CPU memory after generation and reload it back to GPU in the next iteration.

3D-Hybrid Engine。其主要逻辑在 Megatron-LM 和 vLLM 的基础上用 2.4k LoC 实现。我们将训练和生成阶段的 actor 模型权重存储在不同的内存缓冲区中,在训练期间将生成权重卸载到 CPU 内存,在过渡期间将生成权重重新加载到 GPU 内存,并在生成阶段同时使用两个缓冲区。在训练和生成之间的过渡期间,我们使用 NCCL 通信原语 [35] 来收集和连接每个微 DP 组中的模型参数。在生成之后,我们将 KVCache 卸载到 CPU 内存,并在下一次迭代中将其重新加载到 GPU。
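The per-micro-DP-group gather during the train-to-generation transition can be pictured with a small pure-Python simulation. The layout below (contiguous shard grouping, list concatenation in place of an NCCL all-gather) is an assumed simplification for illustration, not the engine's actual rank mapping:

```python
# Illustrative simulation of train->generation weight resharding
# (hypothetical layout; the real engine uses NCCL all-gather on GPUs).

def reshard(params, train_tp, gen_tp):
    """Regroup train_tp weight shards into gen_tp shards via a simulated
    all-gather inside each micro data-parallel (DP) group."""
    assert train_tp % gen_tp == 0
    micro_dp = train_tp // gen_tp          # d_g = training TP / generation TP
    shard_len = len(params) // train_tp
    train_shards = [params[i*shard_len:(i+1)*shard_len] for i in range(train_tp)]
    # One gather per micro DP group: ranks g*micro_dp .. (g+1)*micro_dp-1
    # concatenate their shards to form generation shard g.
    gen_shards = []
    for g in range(gen_tp):
        group = train_shards[g*micro_dp:(g+1)*micro_dp]
        gen_shards.append([x for shard in group for x in shard])
    return gen_shards

params = list(range(16))                   # stand-in for a flat weight tensor
gen = reshard(params, train_tp=8, gen_tp=2)
assert gen[0] == list(range(8))            # first generation TP shard
assert gen[1] == list(range(8, 16))        # second generation TP shard
```

Each generation shard is assembled with a single concatenation per group, mirroring the "only one all-gather operation per micro DP group" property described in Section 8.4.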

Auto-Mapping Algorithm is implemented with 1.9k LoC, together with three simulators for training, inference, and generation workloads. The algorithm is run before starting the RLHF dataflow on CPU, to generate device mapping and parallelism strategies for dataflow initialization.

Auto-Mapping 算法以 1.9k 行代码实现,并附带三个用于训练、推理和生成工作负载的模拟器。该算法在 CPU 上启动 RLHF 数据流之前运行,以生成设备映射和并行策略用于数据流初始化。

8 Evaluation

8 评估

8.1 Experimental Setup

8.1 实验设置

Testbed. We deploy HybridFlow on a cluster of 16 machines (128 GPUs). Each machine is equipped with 8 NVIDIA A100- 80GB GPUs inter-connected with 600GB/s NVLink. The inter-machine bandwidth is 200Gbps. Our experiments use the following software versions: CUDA12.1, PyTorch 2.1.2, Megatron-core 0.6.0, NCCL 2.18.1, and vLLM 0.3.1.

测试平台。我们在一个由16台机器(128个GPU)组成的集群上部署了HybridFlow。每台机器配备了8个NVIDIA A100-80GB GPU,通过600GB/s的NVLink互连。机器间的带宽为200Gbps。我们的实验使用了以下软件版本:CUDA 12.1、PyTorch 2.1.2、Megatron-core 0.6.0、NCCL 2.18.1和vLLM 0.3.1。

Models and RLHF algorithms. We run the RLHF dataflow (Figure 1) of the PPO [68], ReMax [43] and Safe-RLHF [19] algorithms. PPO is one of the most popular algorithms for RLHF [7, 55], consisting of actor, critic, reference policy, and reward models. Each model is a Llama [73] model with sizes ranging from 7B to 70B. Safe-RLHF has an additional cost model whose architecture and size are the same as the reward model's, and ReMax eliminates the critic model. We use mixed precision for actor and critic training, i.e., BF16 for model parameters and FP32 for gradients and optimizer states, with the Adam [38] optimizer in all experiments. BF16 is used in model inference and auto-regressive generation. If not specified, the experiment results are obtained from PPO.

Baselines. We compare HybridFlow with state-of-the-art RLHF systems including DeepSpeed-Chat [82] v0.14.0, OpenRLHF [30] v0.2.5, and NeMo-Aligner [17] v0.2.0 (detailed in Table 1). NeMo-Aligner does not support the ReMax algorithm. We do not compare HybridFlow to other frameworks such as Trlx [27], Hugging Face DDP [79], and ColossalChat [15], as they are less representative and slower than the above baselines (as reported in [82]).

模型与RLHF算法。我们运行了PPO [68]、ReMax [43] 和 Safe-RLHF [19] 算法的RLHF数据流(图1)。PPO是最流行的RLHF算法之一 [7, 55],由actor、critic、参考策略和奖励模型组成。每个模型都是Llama [73] 模型,大小从7B到70B不等。Safe-RLHF有一个额外的成本模型,其架构和大小与奖励模型相同,而ReMax则去掉了critic模型。我们在actor和critic训练中使用混合精度,即模型参数使用BF16,梯度和优化器状态使用FP32,所有实验均使用Adam [38] 优化器。模型推理和自回归生成使用BF16。如未特别说明,实验结果均来自PPO。

基线。我们将HybridFlow与最先进的RLHF系统进行比较,包括DeepSpeed-Chat [82] v0.14.0、OpenRLHF [30] v0.2.5 和 NeMo-Aligner [17] v0.2.0(详见表1)。NeMo-Aligner不支持ReMax算法。我们没有将HybridFlow与其他框架如Trlx [27]、Hugging Face DDP [79] 和 ColossalChat [15] 进行比较,因为它们不如上述基线具有代表性且速度较慢(如[82] 中所述)。


Figure 11. Safe-RLHF throughput. Numbers in the parentheses are HybridFlow speedups compared with the baselines

图 11: Safe-RLHF 吞吐量。括号中的数字是 HybridFlow 相比于基线的加速比

We use RLHF throughput (tokens/sec) as the performance metric, computed by dividing the total number of tokens in prompts and responses in a global batch by one RLHF iteration time. All reported performance numbers are averaged over 5 training iterations after a warm-up of 10 iterations.

我们使用 RLHF 吞吐量 (tokens/sec) 作为性能指标,通过将全局批次中提示和响应的总 token 数除以一次 RLHF 迭代时间来计算。所有报告的性能数字均在 10 次迭代预热后,取 5 次训练迭代的平均值。
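The metric can be stated as a one-line formula. The function below is a hypothetical helper that mirrors the description above, not code from HybridFlow:

```python
def rlhf_throughput(prompt_tokens: int, response_tokens: int,
                    global_batch_size: int, iter_time_s: float) -> float:
    """Tokens/sec: total prompt+response tokens in a global batch
    divided by one RLHF iteration's wall-clock time."""
    return global_batch_size * (prompt_tokens + response_tokens) / iter_time_s

# With the paper's setting (1024 prompt + 1024 response tokens, batch 1024),
# a hypothetical 60 s iteration gives:
assert rlhf_throughput(1024, 1024, 1024, 60.0) == 1024 * 2048 / 60.0
```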

Datasets and hyperparameters. We perform RLHF on the "Dahoas/full-hh-rlhf" dataset [7] from Hugging Face, which is widely used for LLM alignment [64, 85]. As the baseline systems may not incorporate the continuous-batching optimization [83] during generation, for a fair comparison, we enforce the same length on all responses to be generated. In each experiment, the input prompt length and the output response length are both 1024, and the global batch size of input prompts to the actor model is 1024. The number of PPO epochs is 1 and the number of PPO update iterations per epoch is 8, aligning with previous RLHF research [31, 55, 81].

数据集和超参数。我们在 Hugging Face 的 "Dahoas/full-hh-rlhf" 数据集 [7] 上进行 RLHF,该数据集广泛用于大语言模型对齐 [64, 85]。由于基线系统在生成过程中可能未采用连续批处理优化 [83],为了公平比较,我们强制所有生成的响应长度相同。在每次实验中,输入提示长度和输出响应长度均为 1024,actor 模型输入提示的全局批量大小为 1024。PPO 的 epoch 数为 1,每个 epoch 的 PPO 更新迭代次数为 8,与之前的 RLHF 研究一致 [31, 55, 81]。

8.2 End-to-End performance

8.2 端到端性能

Figures 9, 10, and 11 show RLHF throughput when running PPO, ReMax, and Safe-RLHF respectively. The actor, critic, reference, and reward models in this set of experiments are of the same size, following previous practice [7, 55, 82]. The number of GPUs used in experiments of different model sizes ranges from the smallest number of GPUs that can run RLHF without OOM to 128 GPUs. We do not enable offloading of optimizer states [61] in the experiments, for fair comparison.

Overall performance. We observe that HybridFlow consistently outperforms the baselines across all model scales. In

图 9、图 10 和图 11 分别展示了运行 PPO、ReMax 和 Safe-RLHF 时的 RLHF 吞吐量。本组实验中的 actor、critic、reference 和 reward 模型大小相同,遵循了之前的实践 [7, 55, 82]。实验中使用的 GPU 数量从能够运行 RLHF 而不发生 OOM 的最小 GPU 数量到 128 个 GPU 不等。为了公平比较,我们在实验中没有启用优化器状态卸载 [61]。

总体性能。我们观察到,HybridFlow 在所有模型规模上始终优于基线。


Figure 12. Throughput of HybridFlow under different placements

图 12: 不同部署下 HybridFlow 的吞吐量

Figure 9 for PPO, HybridFlow outperforms DeepSpeed-Chat, OpenRLHF and NeMo-Aligner by $3.67\times$ (up to $7.84\times$), $3.25\times$ (up to $5.93\times$) and $12.52\times$ (up to $20.57\times$), respectively. This is mainly because HybridFlow effectively executes generation, inference, and training in all RLHF stages by sharding the models with different parallelism strategies to fit the various computation workloads. HybridFlow achieves the highest average speedup of $9.64\times$ when training 70B models, as it reduces the transition overhead by up to $71.2%$ and $89.1%$ compared to DeepSpeed-Chat and OpenRLHF, which also incur large inter-machine communication when training with ZeRO-3. Due to the lack of KVCache in its generation engine, NeMo-Aligner’s main performance bottleneck lies in the generation stage, which accounts for up to $81.2%$ of its RLHF iteration time. Similar results can be observed in Figures 10 and 11, validating the efficiency of HybridFlow on running various RLHF algorithms.

在图 9 的 PPO 实验中,HybridFlow 的性能分别比 DeepSpeed-Chat、OpenRLHF 和 NeMo-Aligner 高出 $3.67\times$ (最高 $7.84\times$)、$3.25\times$ (最高 $5.93\times$) 和 $12.52\times$ (最高 $20.57\times$)。这主要是因为 HybridFlow 通过采用不同的并行策略对模型进行分片,以适配各种计算负载,从而在 RLHF 的所有阶段有效执行生成、推理和训练。在训练 70B 模型时,HybridFlow 实现了最高的平均加速比 $9.64\times$,因为与 DeepSpeed-Chat 和 OpenRLHF 相比,HybridFlow 将转换开销减少了高达 $71.2%$ 和 $89.1%$,而后者在使用 ZeRO-3 训练时还会产生大量的跨机器通信。由于生成引擎中缺乏 KVCache,NeMo-Aligner 的主要性能瓶颈在于生成阶段,其占 RLHF 迭代时间的比例高达 $81.2%$。类似的结论可以在图 10 和图 11 中观察到,验证了 HybridFlow 在运行各种 RLHF 算法时的效率。

Scalability. HybridFlow achieves at least $2.09\times$ speedup on 8 GPUs. With increasing GPUs, the strong scaling efficiency of HybridFlow on various model scales is $66.8%$, computed by dividing the throughput at the largest scale by the throughput at the smallest scale, and then by $\frac{\operatorname*{max}.\#\operatorname{of}\mathrm{GPUs}}{\operatorname*{min}.\#\operatorname{of}\mathrm{GPUs}}$ [5], averaged over the three algorithms and all model scales. Scaling to a large number of GPUs with a fixed global batch size results in smaller local batch sizes for each worker, potentially causing GPU underutilization. Running 7B models on 128 GPUs, HybridFlow still outperforms the best baseline OpenRLHF by $1.68\times$, $1.53\times$, and $1.71\times$ on PPO, ReMax, and Safe-RLHF respectively. This can be attributed to HybridFlow’s ability to adapt the best placement strategies for different models and cluster sizes to minimize RLHF time. OpenRLHF performs better in a larger GPU cluster but less efficiently on smaller ones.

可扩展性。HybridFlow 在 8 个 GPU 上至少实现了 $2.09\times$ 的加速。随着 GPU 数量的增加,HybridFlow 在各种模型规模上的强扩展效率为 $66.8%$,通过将最大规模下的吞吐量除以最小规模下的吞吐量,再除以 $\frac{\operatorname*{max}.\#\operatorname{of}\mathrm{GPUs}}{\operatorname*{min}.\#\operatorname{of}\mathrm{GPUs}}$ [5] 计算得出,该结果是对三种算法和所有模型规模的平均值。在固定全局批处理大小的情况下扩展到大量 GPU 会导致每个工作者的本地批处理大小变小,可能导致 GPU 利用率不足。在 128 个 GPU 上运行 7B 模型时,HybridFlow 仍然在 PPO、ReMax 和 Safe-RLHF 上分别以 $1.68\times$、$1.53\times$ 和 $1.71\times$ 的性能优于最佳基线 OpenRLHF。这可以归因于 HybridFlow 能够根据不同模型和集群大小适配最佳放置策略,以最小化 RLHF 时间。OpenRLHF 在较大的 GPU 集群中表现更好,但在较小的集群中效率较低。
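The strong scaling efficiency quoted above reduces to achieved speedup divided by ideal linear speedup. The helper name and the example numbers (other than the roughly 66.8% figure it approximates) are illustrative:

```python
def strong_scaling_efficiency(tput_small, tput_large, gpus_small, gpus_large):
    """Throughput speedup relative to ideal linear speedup when scaling
    from the smallest to the largest cluster."""
    return (tput_large / tput_small) / (gpus_large / gpus_small)

# Perfect linear scaling gives 1.0.
assert strong_scaling_efficiency(1.0, 16.0, 8, 128) == 1.0
# E.g. a ~10.7x throughput gain on 16x the GPUs is ~66.8% efficiency.
assert abs(strong_scaling_efficiency(1.0, 10.69, 8, 128) - 0.668) < 1e-3
```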

8.3 Model Placement

8.3 模型部署

In this experiment, we implement various model placements of the PPO algorithm in HybridFlow, under the same model and cluster settings as in Sec. 8.2: (i) colocate, the placement strategy in DeepSpeed-Chat; (ii) standalone, the placement strategy in OpenRLHF; (iii) split, NeMo-Aligner’s colocation placement (actor and reference policy on the same set of devices, and critic and reward model on another); and (iv) hybridflow, the optimized placement obtained by Algorithm 1.

在本实验中,我们在与第8.2节相同的模型和集群设置下,在HybridFlow中实现了PPO算法的多种模型放置策略:(i) colocate,DeepSpeed-Chat中的放置策略;(ii) standalone,OpenRLHF中的策略;(iii) split,NeMo-Aligner的共置放置(演员和参考策略在同一组设备上,评论者和奖励模型在另一组设备上);(iv) hybridflow,通过算法1获得的优化放置策略。

Comparison of different model placements. Figure 12 reveals that the optimized placement of HybridFlow under different numbers of GPUs varies. From 16 to 64 GPUs, colocating all models on the same set of devices yields the best performance. For 96 to 128 GPUs with 34B models and 96 GPUs with 13B models, the split strategy becomes optimal. The split strategy divides GPUs evenly between the two sets of models, as their sizes are equal. For 13B models on 128 GPUs, the standalone strategy achieves the highest throughput. In this case, HybridFlow allocates 64 GPUs for the actor, 32 for the critic, and 16 each for the reference and reward models. In smaller clusters, the computation of all models can fully utilize GPU resources; the colocate strategy ensures maximum GPU usage in different RLHF stages. In larger clusters, RLHF throughput under the colocate placement fails to scale up linearly, as the batch size is fixed and the computation-to-communication ratio decreases with a larger DP size on more GPUs. The standalone and split strategies place models on different devices with a smaller DP size for each model in larger clusters, facilitating parallel execution of different models in the same stages. In all cases, our Algorithm 1 produces the best placement with the highest training throughput.

不同模型放置策略的对比。图 12 揭示了在不同 GPU 数量下,HybridFlow 的优化放置策略有所不同。在 16 到 64 个 GPU 的情况下,将所有模型放置在同一组设备上能获得最佳性能。对于 96 到 128 个 GPU 的 34B 模型以及 96 个 GPU 的 13B 模型,拆分策略变得最优。拆分策略将 GPU 均匀分配给两组模型,因为它们的规模相等。对于 128 个 GPU 的 13B 模型,独立策略实现了最高的吞吐量。在这种情况下,HybridFlow 为 actor 分配 64 个 GPU,为 critic 分配 32 个 GPU,为参考模型和奖励模型各分配 16 个 GPU。在较小的集群中,所有模型的计算可以充分利用 GPU 资源;同置策略确保了在不同 RLHF 阶段的最大 GPU 使用率。在较大的集群中,由于批量大小固定且计算与通信的比率随着更多 GPU 上 DP 规模的增加而降低,同置放置下的 RLHF 吞吐量无法线性扩展。独立和拆分策略将模型放置在不同的设备上,每个模型在较大集群中具有较小的 DP 规模,从而促进了同一阶段中不同模型的并行执行。在所有情况下,我们的算法 1 都生成了具有最高训练吞吐量的最佳放置策略。

Larger critic and reward model. We further evaluate model placements when running PPO with a 13B actor and reference policy and 70B critic and reward models (larger critic and reward models are expected to produce better alignment [7]). Figure 13 shows that the colocate strategy still outperforms others by $44.8%$ on average with up to 64 GPUs. The split strategy achieves higher throughput with 96 GPUs. When scaling to 128 GPUs, the best placement obtained by Algorithm 1 colocates actor, reference, and reward models on 64 GPUs while allocating the remaining 64 GPUs to critic. On the same number of GPUs, actor and reference policy’s computation time is much smaller than critic and reward model, and colocating the reward model with actor and reference policy reduces the GPU idle time in the experience preparation stage. In general, distributing actor and critic on different devices for parallel execution in the training stage leads to higher throughput in large clusters.

更大的 critic 和奖励模型。我们进一步评估了在使用13B的actor和参考策略以及70B的critic和奖励模型运行PPO时的模型放置情况(更大的critic和奖励模型预计会产生更好的对齐效果[7])。图13显示,在最多64个GPU的情况下,colocate策略仍然平均比其他策略高出44.8%。split策略在96个GPU时实现了更高的吞吐量。当扩展到128个GPU时,算法1获得的最佳放置策略将actor、参考和奖励模型共置于64个GPU上,同时将剩余的64个GPU分配给critic。在相同数量的GPU上,actor和参考策略的计算时间远小于critic和奖励模型,将奖励模型与actor和参考策略共置减少了经验准备阶段的GPU空闲时间。总的来说,在大型集群中,将actor和critic分布在不同设备上以进行并行执行,在训练阶段会带来更高的吞吐量。

8.4 3D-Hybrid Engine

8.4 3D 混合引擎

Transition time comparison. Figure 14 shows the transition time between actor training and generation stages on various model scales, which is the time to reshard model weights from training to generation, under the same settings as in $\S8.2$. OpenRLHF’s transition time includes weight synchronization time between two copies of the actor model on different devices. HybridFlow reduces the transition time by $55.2%$ (11.7s) on average and the transition overhead by up to $89.1%$ (78.2s) with 70B models, while maintaining consistent overhead across different cluster scales. This is attributed to our new parallel grouping method for the generation stage (§5.4). In baseline methods, all model parameters must be collected during transition, necessitating layer-by-layer collections multiple times to prevent OOM. HybridFlow enables zero memory redundancy during transition and requires only one all-gather operation per micro DP group.

转换时间比较。图 14 展示了在不同模型规模下,actor 训练和生成阶段之间的转换时间,即在 $\S8.2$ 相同设置下,将模型权重从训练重新分片到生成的时间。OpenRLHF 的转换时间包括在不同设备上的 actor 模型两个副本之间的权重同步时间。HybridFlow 平均将转换时间减少了 $55.2%$ (11.7s),并在 70B 模型上将转换开销降低了多达 $89.1%$ (78.2s),同时在不同集群规模下保持了一致的开销。这归功于我们在生成阶段的新并行分组方法(§5.4)。在基线方法中,所有模型参数必须在转换期间收集,需要多次逐层收集以防止 OOM。HybridFlow 在转换期间实现了零内存冗余,并且每个微 DP 组只需要一次全收集操作。


Figure 14. Transition time between actor training and generation.

图 14: Actor训练和生成之间的过渡时间


Figure 15. Time breakdown on different generation parallel sizes of the actor model on 16 GPUs.

图 15: 在 16 个 GPU 上不同生成并行大小下 actor 模型的时间分解。

Transition and generation time. We further validate the need to use different parallel sizes in actor training and generation in HybridFlow. In this experiment, all models are colocated on the same set of GPUs, and the KVCache for generation is allocated using the remaining GPU memory (i.e., best-effort allocation). Figure 15 gives the transition and generation time when running RLHF on 16 GPUs with 7B and 13B models, respectively, with training parallel groups 1-8-2 (following the p-t-d convention) and varying the generation TP group size $t_{g}$ from 1 to 8. The generation PP group size remains constant at $p_{g}{=}1$ and the micro DP group size $d_{g}$ is computed as $\frac{8}{t_{g}}$. We observe that applying a smaller generation TP group size, $t_{g}{=}2$ for 7B models and $t_{g}{=}4$ for 13B models, reduces the generation latency by $60.3%$ and $36.4%$, respectively. Conversely, using the same TP size as training ($t_{g}{=}8$), following the NeMo-Aligner approach, results in the largest generation latency due to GPU underutilization. Further reducing $t_{g}$ fails to achieve higher speedup, as a smaller $t_{g}$ necessitates maintaining a larger KVCache per GPU.

转换和生成时间。我们进一步验证了在 HybridFlow 中对 actor 训练和生成使用不同并行规模的必要性。在此实验中,所有模型共置在同一组 GPU 上,生成的 KVCache 使用剩余的 GPU 内存分配(即尽力而为分配)。图 15 展示了在 16 个 GPU 上分别运行 7B 和 13B 模型的 RLHF 时,训练并行组为 1-8-2(遵循 p-t-d 惯例)且生成 TP 组大小 $t_{g}$ 从 1 到 8 变化时的转换和生成时间。生成 PP 组大小保持为 $p_{g}{=}1$,微 DP 组大小 $d_{g}$ 计算为 $\frac{8}{t_{g}}$。我们观察到,对于 7B 模型应用较小的生成 TP 组大小 $t_{g}{=}2$,对于 13B 模型应用 $t_{g}{=}4$,分别将生成延迟减少了 $60.3%$ 和 $36.4%$。相反,使用与训练相同的 TP 大小 ($t_{g}{=}8$)(遵循 NeMo-Aligner 方法),由于 GPU 利用率不足,导致生成延迟最大。进一步减少 $t_{g}$ 未能实现更高的加速,因为较小的 $t_{g}$ 需要在每个 GPU 上维护更大的 KVCache。
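The relation $d_g = 8/t_g$ in this experiment is an instance of the general constraint that the generation model-parallel size must divide the training model-parallel size. A small illustrative helper (names are ours, not HybridFlow's API):

```python
def micro_dp_size(train_pp: int, train_tp: int, gen_pp: int, gen_tp: int) -> int:
    """Micro DP group size d_g formed when resharding a training
    model-parallel group (p x t ranks) into smaller generation groups."""
    train_mp = train_pp * train_tp
    gen_mp = gen_pp * gen_tp
    assert train_mp % gen_mp == 0, "generation MP must divide training MP"
    return train_mp // gen_mp

# Figure 15 setting: training 1-8-2 (p-t-d); generation p_g = 1.
assert micro_dp_size(1, 8, 1, 2) == 4   # t_g = 2 -> d_g = 4
assert micro_dp_size(1, 8, 1, 8) == 1   # same TP as training -> no micro DP
```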


Figure 16. Runtime of the device mapping algorithm. The model size and # of GPUs are simultaneously scaled.

图 16: 设备映射算法的运行时间。模型大小和 GPU 数量同时扩展。

8.5 Algorithm Runtime

8.5 算法运行时间

Figure 16 shows the running time of Algorithm 1, which is significantly shorter than the days of actual RLHF training. A linear growth of running time is exhibited, revealing good scalability of the device mapping algorithm with model size and cluster size. Most of the running time is spent on estimating the execution latency of each model’s parallel strategies. More parallelism strategies are available for a larger model, requiring more simulations to identify the optimal one for each placement plan. Our caching of the optimal parallelism strategies of the models, to be reapplied across different placements, reduces the search time for the best placement to at most half an hour.

图 16 展示了算法 1 的运行时间,其显著短于实际的 RLHF 训练天数。运行时间呈线性增长,表明设备映射算法在模型大小和集群大小方面具有良好的可扩展性。大部分运行时间用于估计每个模型的并行策略的执行延迟。较大的模型有更多的并行策略可供选择,需要更多的模拟来确定每个放置计划的最优策略。我们对模型的最优并行策略进行缓存,以便在不同的放置方案中重复应用,这将搜索最佳放置方案的时间减少到最多半小时。
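The caching described above is essentially memoization keyed by model and device count. A sketch of the idea (the search body is a placeholder heuristic, not the paper's simulator-driven search):

```python
from functools import lru_cache

SEARCH_CALLS = 0  # counts how many real searches run

@lru_cache(maxsize=None)
def best_parallelism(model_name: str, num_devices: int):
    """Memoized parallelism search: a model placed on different sets of
    A GPUs across placements reuses one search for (model, A)."""
    global SEARCH_CALLS
    SEARCH_CALLS += 1
    # Stand-in for the simulator-driven search: pick the largest TP <= 8
    # that divides the device count (placeholder heuristic only).
    tp = max(t for t in (1, 2, 4, 8) if num_devices % t == 0 and t <= num_devices)
    return {"tp": tp, "dp": num_devices // tp}

best_parallelism("actor", 16)
best_parallelism("actor", 16)   # a different placement, same A=16: cache hit
assert SEARCH_CALLS == 1
assert best_parallelism("actor", 16) == {"tp": 8, "dp": 2}
```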

9 Discussions

9 讨论

Fault Tolerance. HybridFlow is orthogonal to existing fault-tolerance approaches [22, 34, 49, 76, 93] and already incorporates checkpointing. Failures can be detected by NCCL errors, and silent data corruption by checksums. Our programming model enables the single controller to coordinate checkpoint operations via RPC, allowing the saving of model states within each ParallelWorker group. This includes saving parameters of actor/critic models, dataloader IDs, and Random Number Generator (RNG) states to ensure system-wide consistency. Moreover, HybridFlow can also employ redundancy-based fault-tolerance methods, such as broadcast parameters and CPU checkpoints, for fast recovery if enough healthy model replicas are available [76, 93].

容错性。HybridFlow 与现有的容错方法 [22, 34, 49, 76, 93] 正交,并且已经内置了检查点功能。故障可以通过 NCCL 错误检测,静默数据损坏可以通过校验和检测。我们的编程模型使单一控制器能够通过 RPC 协调检查点操作,从而在每个 ParallelWorker 组内保存模型状态,包括保存 actor/critic 模型的参数、数据加载器 ID 和随机数生成器 (RNG) 状态,以确保系统级一致性。此外,如果有足够的健康模型副本可用,HybridFlow 还可以采用基于冗余的容错方法(例如广播参数和 CPU 检查点)实现快速恢复 [76, 93]。

Placement Insights. We conclude three main insights for model placement and GPU allocation in RLHF training. 1) Allocating more GPUs to the actor model can reduce the time-consuming generation latency, which cannot be parallelized with other models. 2) When each model’s computation can fully utilize GPU resources, colocating all the models is most effective when training on relatively small-scale clusters. 3) When scaling up to large-scale clusters (i.e., strong scaling), distributing the actor and critic models on different devices for parallel execution in the training and preparation stages helps achieve higher throughput.

模型放置洞察。我们总结了RLHF训练中模型放置和GPU分配的三个主要洞察。1) 为演员模型分配更多GPU可以减少耗时的生成延迟,因为生成过程无法与其他模型并行。2) 当每个模型的计算都能充分利用GPU资源时,在相对较小的集群上训练时,将所有模型放在一起是最有效的。3) 当扩展到大规模集群(即强扩展)时,将演员模型和评论模型分布在不同设备上并行执行训练和准备阶段,有助于实现更高的吞吐量。

Resource multiplexing. HybridFlow enables colocation of models on shared devices by utilizing time-sharing for GPU computation. Recent research in DNN task scheduling has developed fine-grained resource multiplexing techniques, primarily aimed at achieving the service-level objectives of individual tasks [8, 18, 26, 47, 56, 77]. Although the Resource Pool implementation supports parallel execution of colocated models, HybridFlow generally adheres to sequential execution to prevent GPU resource contention and OOM issues, as discussed in Section 2.3. Applying GPU sharing and heterogeneous resources in RLHF training poses distinct challenges, as it seeks to balance the computation workload and manage complex data dependencies among various tasks. Investigating fine-grained auto-mapping algorithms for GPU sharing in RLHF training, coupled with model offload optimization and the integration of heterogeneous devices, would be a promising direction for future research.

From alignment to reasoning. In RLHF for LLM alignment, the reward signal is generated by the reward model. Besides alignment tasks, similar algorithms (e.g., PPO and GRPO [70]) can be applied to other domains, such as code generation and mathematical reasoning. For these tasks, a ground truth may exist for each prompt, which can be determined by assessing the correctness of the output value for each code test case and verifying the accuracy of mathematical results. Therefore, the reward model can be replaced by non-neural-network reward modules, such as a sandbox environment [87] for evaluating generated code or a reward function [14, 65] to validate mathematical results. HybridFlow can seamlessly integrate these reward modules by wrapping them as remote functions and orchestrating their execution within the single-process script, providing a flexible and efficient framework for diverse reinforcement learning applications.

资源复用。HybridFlow 通过利用 GPU 计算的时分复用技术,实现了共享设备上模型的共存。最近在 DNN 任务调度方面的研究已经开发了细粒度的资源复用技术,主要旨在实现单个任务的服务级别目标 [8, 18, 26, 47, 56, 77]。尽管资源池实现支持共存模型的并行执行,但 HybridFlow 通常遵循顺序执行,以防止 GPU 资源争用或 OOM 问题,如第 2.3 节所述。在 RLHF 训练中应用 GPU 共享和异构资源带来了独特的挑战,因为它需要平衡计算工作负载并管理各种任务之间的复杂数据依赖关系。研究 RLHF 训练中的细粒度自动映射算法,结合模型卸载优化和异构设备的集成,将是未来研究的一个有前景的方向。

从对齐到推理。在用于大语言模型对齐的 RLHF 中,奖励信号由奖励模型生成。除了对齐任务外,类似的算法(例如 PPO 和 GRPO [70])可以应用于其他领域,如代码生成和数学推理。对于这些任务,每个提示可能存在一个基本事实,可以通过评估每个代码测试用例的输出值的正确性以及验证数学结果的准确性来确定。因此,奖励模型可以被非神经网络的奖励模块取代,例如用于评估生成代码的沙盒环境 [87],或用于验证数学结果的奖励函数 [14, 65]。HybridFlow 可以通过将这些奖励模块封装为远程函数并在单进程脚本中编排它们的执行,无缝集成这些奖励模块,为多样化的强化学习应用提供灵活高效的框架。

10 Related Work

10 相关工作

RL frameworks. There have been plenty of frameworks for RL, ranging from general-purpose RL systems designed for small-scale DNNs [12, 25, 28, 39, 45, 46] to RLHF systems specifically optimized for LLMs [15, 17, 30, 80, 82]. We have thoroughly examined closely related work in $\S2$, and we discuss more RL frameworks in this section. These RL frameworks [12, 25, 28, 39, 74], similar to recent RLHF systems, use a hodgepodge of multi-controller frameworks to implement their algorithms. They establish multiple long-running distributed programs, with each component coordinating the execution order through hard-coded data synchronization. Gear [74] further optimized the experience replay segment of the RL pipeline. However, all these frameworks fail to support LLM training, inference, and generation in RLHF.

强化学习 (Reinforcement Learning, RL) 框架。已有许多用于 RL 的框架,从针对小规模深度神经网络 (Deep Neural Networks, DNNs) 的通用 RL 系统设计 [12, 25, 28, 39, 45, 46],到专门为大语言模型 (Large Language Models, LLMs) 优化的 RLHF 系统 [15, 17, 30, 80, 82]。我们在 $\S2$ 中详细检查了密切相关的工作,并将在本节中讨论更多 RL 框架。这些 RL 框架 [12, 25, 28, 39, 74] 与最近的 RLHF 系统类似,使用多控制器的混合框架来实现其算法。它们建立了多个长期运行的分布式程序,每个组件通过硬编码的数据同步来协调执行顺序。Gear [74] 进一步优化了 RL 管道的经验回放部分。然而,所有这些框架都无法支持 RLHF 中的 LLM 训练、推理和生成。

LLM training and serving systems. TorchDDP [57] and Horovod [69] support data-parallel training. ByteScheduler [58] and DeepSpeed [60] extend data parallelism with communication and memory optimizations. Numerous systems [23, 36, 48, 54, 71, 75, 89] optimize large model training through model parallelisms such as tensor parallelism and pipeline parallelism, which partition models across devices. LLM serving systems [3, 16, 40, 72, 83, 92] also adopt data and model parallelism to accelerate auto-regressive generation, with specialized optimizations like continuous batching [83] and chunked prefill [3]. Note that all the above frameworks adopt the multi-controller paradigm for efficient computation.

大语言模型训练与服务系统。TorchDDP [57] 和 Horovod [69] 支持数据并行训练。Byte Scheduler [58] 和 DeepSpeed [60] 通过通信和内存优化扩展了数据并行性。许多系统 [23, 36, 48, 54, 71, 75, 89] 通过张量并行和管道并行等模型并行技术优化了大模型训练,将模型分割到多个设备上。大语言模型服务系统 [3, 16, 40, 72, 83, 92] 也采用了数据和模型并行技术,通过连续批处理 [83] 和分块预填充 [3] 等专门优化加速自回归生成。需要注意的是,上述所有框架都采用多控制器范式以实现高效计算。

Dataflow systems. Dataflow systems like MapReduce [21], Spark [86], Dryad [33], and Naiad [51] are popular for analytics and ML workloads, but they lack support for dynamic task graphs. Ray [50] unifies task-parallel and actor programming models in a single dynamic task graph and implements a scalable distributed scheduler and a global control store, which is adopted by many RL frameworks [45, 46]. Pathways [9], a closed-source project for TPUs, is designed to easily express complex parallelism patterns and fine-grained control flow within a single DNN model, such as pipeline parallelism and Mixture-of-Experts with sparse computation. It employs an asynchronous distributed dataflow design that enables parallel control-plane execution despite data dependencies, reducing the dispatch overhead of the single-controller paradigm. Its main focus lies on single-model training, requiring complex compilation of each sub-network of a DNN model. HybridFlow could integrate Pathways as a submodule to implement the computation of models in the RLHF dataflow.

数据流系统。MapReduce [21]、Spark [86]、Dryad [33] 和 Naiad [51] 等数据流系统在分析和机器学习 (ML) 工作负载中很受欢迎,但它们缺乏对动态任务图的支持。Ray [50] 在单个动态任务图中统一了任务并行和 Actor 编程模型,并实现了可扩展的分布式调度器和全局控制存储,许多强化学习 (RL) 框架都采用了它 [45, 46]。Pathways [9] 是一个针对 TPU 的闭源项目,旨在轻松表达单个深度神经网络 (DNN) 模型中的复杂并行模式和控制流,例如管道并行和稀疏计算的专家混合模型。它采用异步分布式数据流设计,尽管存在数据依赖关系,仍能并行执行控制平面,从而减少单控制器范式中的调度开销。其主要关注点在于单模型训练,需要对 DNN 模型的每个子网络进行复杂编译。HybridFlow 可以将 Pathways 作为子模块集成,以实现在 RLHF 数据流中的模型计算。

11 Conclusion

11 结论

HybridFlow is an RLHF framework that enables flexible representation and efficient execution of diverse RLHF algorithms. We propose a hybrid programming model that allows users to easily build RLHF dataflow in a few lines of code by encapsulating distributed computation of different LLMs into primitive APIs and hiding the complexity of data resharding among nodes. Our 3D-Hybrid Engine ensures efficient execution of training and generation of the actor model, with zero memory redundancy and significantly reduced communication overhead for model parameter resharding. Furthermore, our effective mapping algorithm optimizes GPU allocation and placement of models in the RLHF dataflow. Extensive experiments demonstrate that HybridFlow achieves $1.53\times$ to $20.57\times$ speedup compared to state-of-the-art RLHF systems under various model sizes and cluster scales.

HybridFlow 是一个 RLHF 框架,能够灵活表示并高效执行多种 RLHF 算法。我们提出了一种混合编程模型,通过将不同大语言模型的分布式计算封装到基础 API 中,并隐藏节点间数据重分片的复杂性,使用户能够轻松地用几行代码构建 RLHF 数据流。我们的 3D-Hybrid 引擎确保了演员模型的训练和生成的高效执行,实现了零内存冗余,并显著减少了模型参数重分片的通信开销。此外,我们的有效映射算法优化了 GPU 分配和 RLHF 数据流中模型的放置。大量实验表明,在不同模型大小和集群规模下,HybridFlow 相较于最先进的 RLHF 系统实现了 $1.53\times$ 到 $20.57\times$ 的加速。

Acknowledgments

致谢

We would like to thank our shepherd Y. Charlie Hu and the anonymous reviewers for their constructive feedback. We thank Xin Liu, Yangrui Chen, and Ningxin Zheng for their insightful feedback on this project. This work was supported in part by a ByteDance Research Collaboration Project, and grants from Hong Kong RGC under the contracts HKU 17204423 and C7004-22G (CRF).

我们感谢我们的指导者 Y. Charlie Hu 以及匿名评审人员提供的建设性反馈。我们感谢 Xin Liu、Yangrui Chen 和 Ningxin Zheng 对本项目的深刻反馈。本工作部分得到了字节跳动研究合作项目的支持,以及香港 RGC 根据合同 HKU 17204423 和 C7004-22G (CRF) 提供的资助。

References

参考文献

Table 3. The transfer protocols in HybridFlow.

表 3: HybridFlow 中的传输协议。

| 传输协议 | 分发功能 | 收集功能 | 使用场景 |
| --- | --- | --- | --- |
| ONE_TO_ALL | 将数据广播到所有 rank。 | 从所有 rank 收集数据。 | 所有工作节点方法具有相同的输入并运行相同的代码,例如模型初始化。 |
| 3D_PROTO | 拆分数据,分散到所有数据并行 (DP) rank 并在组内广播。 | 从所有 DP 组中 p=-1, t=0 的工作节点收集并拼接数据。 | 模型在每个数据并行组内的多个工作节点之间分片。模型的输出仅存在于最后一个流水线阶段,并在数据并行组之间复制。这是 Megatron-LM、DeepSpeed 等框架中 3D 并行训练的典型场景。 |
| 3D_ALL_MICRO_DP | 按微 DP 大小拆分数据,分散到所有微 DP 组并在组内广播。 | 从所有微 DP 组中 local_rank=0 的工作节点收集并拼接数据。 | 与 HybridEngine 一起使用,用于处理策略模型在训练和推理之间切换时的 3D 并行方案。 |
| 3D_PP_ONLY | 将数据广播到所有 rank。 | 从所有 PP 组中 t=0, d=0 的工作节点收集并拼接数据。 | 用于检查权重名称,因为它们在 TP 和 DP 组中是相同的。 |
| DP_PROTO | 将数据拆分为批次并分散到所有 DP rank。 | 从所有 DP rank 收集并拼接数据。 | 在数据并行模式下训练模型。 |
| ALL_TO_ALL | 无操作。 | 从所有 rank 收集数据。 | 调试时使用。用户可以手动定义每个工作节点的输入并分别检查其输出。 |

A Primitive APIs in HybridFlow

A HybridFlow 中的原语 API

In HybridFlow, we implemented the primitives of each model in RLHF training by inheriting from the 3DParallelWorker, FSDPWorker and ZeROWorker classes. The functions of these model classes are designed to decouple the distributed computation code and provide fundamental operations in RLHF for the users. This primitive design is compatible with the auto-regressive generation, forward pass, backward pass, and model update operations in existing distributed inference and training frameworks. Users can easily customize the RLHF training dataflow (by adapting the numerical computation in the provided functions) according to the algorithm’s design and benefit from reusing the underlying distributed computation implementation. We illustrate the meaning and the actual computations of these APIs in Table 4.

在 HybridFlow 中,我们通过继承 3DParallelWorker、FSDPWorker 和 ZeROWorker 来实现 RLHF 训练中每个模型的原语。这些模型类的功能旨在解耦分布式计算代码,并为用户提供 RLHF 中的基本操作。这种原语设计与现有分布式推理和训练框架中的自回归生成 (auto-regressive generation)、前向传播 (forward pass)、反向传播 (backward pass) 以及模型更新操作兼容。用户可以根据算法设计轻松定制 RLHF 训练数据流(通过调整提供函数中的数值计算),并受益于底层分布式计算实现的重用。我们在表 4 中说明了这些 API 的含义和实际计算。
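As an illustration of how the Table 4 primitives compose, the stub below wires a toy PPO step out of the listed method names; all bodies are placeholders (not HybridFlow's implementation) and the advantage computation is deliberately simplified to $A = r - V$:

```python
# Hypothetical stubs following the API names in Table 4.

class ActorStub:
    def generate_sequence(self, prompts):
        return [p + ["<resp>"] for p in prompts]      # fake responses
    def compute_log_prob(self, seqs):
        return [0.0 for _ in seqs]                    # placeholder log-probs
    def update_actor(self, seqs, advantages):
        return {"loss": sum(advantages) / len(advantages)}

class CriticStub:
    def compute_values(self, seqs):
        return [1.0 for _ in seqs]                    # placeholder values

def ppo_step(actor, critic, prompts, rewards):
    seqs = actor.generate_sequence(prompts)
    values = critic.compute_values(seqs)
    advantages = [r - v for r, v in zip(rewards, values)]  # simplified A = r - V
    return actor.update_actor(seqs, advantages)

out = ppo_step(ActorStub(), CriticStub(), [["p1"], ["p2"]], [1.5, 0.5])
assert out["loss"] == 0.0  # mean of advantages 0.5 and -0.5
```

In the real framework these methods run as distributed workers; the point here is only that an RLHF algorithm reduces to a short single-process composition of the primitives.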

B Transfer Protocols

B 传输协议

We implemented transfer protocols that cover all common use cases of data resharding between models in the RLHF dataflow. Users can utilize these pre-defined protocols to generate any RLHF dataflow. Moreover, users can easily define their own transfer protocols by implementing a collect function and a distribute function. Transfer protocols decouple the complicated data resharding from distributed training. We denote $p$, $t$, $d$ as the rank of a worker in its pipeline-, tensor-, and data-parallel group, respectively. We illustrate these predefined protocols in Table 3.

我们实现了涵盖 RLHF 数据流中模型间数据重分片所有常见用例的传输协议。用户可以利用这些预定义的协议生成任何 RLHF 数据流。此外,用户可以通过实现一个收集函数和一个分发函数来轻松定义自己的传输协议。传输协议将复杂的数据重分片与分布式训练解耦。我们用 $p$、$t$、$d$ 分别表示工作节点在流水线并行、张量并行和数据并行组中的 rank。我们在表 3 中展示了这些预定义的协议。
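A transfer protocol reduces to a (distribute, collect) function pair. The sketch below mimics DP_PROTO from Table 3 with plain lists; the function names are hypothetical, not HybridFlow's API:

```python
# Illustrative (distribute, collect) pair mirroring Table 3's DP_PROTO.

def dp_distribute(data, dp_size):
    """Split a global batch into dp_size chunks, one per DP rank."""
    chunk = len(data) // dp_size
    return [data[i*chunk:(i+1)*chunk] for i in range(dp_size)]

def dp_collect(outputs):
    """Concatenate per-rank outputs back into a global batch."""
    return [x for part in outputs for x in part]

batch = list(range(8))
shards = dp_distribute(batch, dp_size=4)
assert shards == [[0, 1], [2, 3], [4, 5], [6, 7]]
assert dp_collect(shards) == batch           # round trip restores the batch
```

A user-defined protocol would supply the same two functions, with the framework invoking distribute on the producer's output and collect on the consumer's input.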

C Auto-Parallelism Algorithm

C 自动并行化算法

Algorithm 2 outlines the search process of the optimal parallelism strategy of each model. Starting from the minimal model parallelism size of each model (to prevent OOM when colocating with multiple workers), we enumerate all feasible parallel configurations based on the number of GPUs and the number of GPUs per machine $U$. The default value of $U$ is 8. We use the simu module to estimate the latency of each model based on its workload. This module includes three simulators for training, inference, and generation workloads, all analytical models following previous research [42, 84, 92]. The training and inference workloads are compute-bound while the generation workload is memory-bound. For the actor model, we first find the parallelism strategy for training and record the memory usage in the training stage. During actor generation, KVCache requirements are calculated using the batch size and max sequence length. If the model-parallel size for the generation stage cannot accommodate both parameters and KVCache, we increase it. Then, we seek the optimal strategy with the corresponding KVCache allocation by comparing the latency estimations. Developing a comprehensive autoregressive generation simulator that accounts for variable KVCache sizes could further enhance the auto-mapping process in RLHF research.

算法 2 概述了每个模型的最优并行策略搜索过程。从每个模型的最小模型并行大小开始(以防止与多个工作节点共存时出现 OOM),我们基于 GPU 数量以及每台机器的 GPU 数量 $U$ 枚举所有可行的并行配置。$U$ 的默认值为 8。我们使用 simu 模块根据工作负载估计每个模型的延迟。该模块包括用于训练、推理和生成工作负载的三个模拟器,均为遵循先前研究 [42, 84, 92] 的分析模型。训练和推理工作负载是计算密集型,而生成工作负载是内存密集型。对于 actor 模型,我们首先找到训练的并行策略,并记录训练阶段的内存使用情况。在 actor 生成期间,使用批量大小和最大序列长度计算 KVCache 需求。如果生成阶段的模型并行大小无法同时容纳参数和 KVCache,我们会增加它。然后,通过比较延迟估计,我们寻找具有相应 KVCache 分配的最佳策略。开发一个考虑可变 KVCache 大小的全面自回归生成模拟器,可以进一步增强 RLHF 研究中的自动映射过程。

Algorithm 2 Auto-Parallelism Algorithm

算法 2: 自动并行算法

Table 4. Key functions provided in each model class. The users can use these provided functions to construct various RLHF algorithms in a few lines of code.

表 4: 每个模型类提供的关键功能。用户可以使用这些提供的函数通过几行代码构建各种 RLHF 算法。

| 模型 | API | 计算 | 解释 |
| --- | --- | --- | --- |
| Actor | generate_sequence | 自回归生成 | 基于一批提示,Actor 模型生成一批序列。 |
| Actor | compute_log_prob | 前向传播 | Actor 模型计算提示和响应中每个 Token 的对数概率。该对数概率与使用相同模型精度生成时返回的对数概率相同。(在 PPO 中可选) |
| Actor | compute_loss | 前向传播 | Actor 模型基于预训练数据集计算预训练损失 [7, 19, 55]。 |
| Actor | update_actor | 前向、反向传播和模型更新 | 基于优势、回报(由 compute_advantage 计算)和预训练损失,Actor 模型计算训练损失并更新其权重。我们为多种 RLHF 算法实现了各种损失,包括 PPO [55]、Safe-RLHF [19]、ReMax [43]、GRPO [70] 等。 |
| Critic | compute_values | 前向传播 | Critic 模型计算每个提示和响应的值。 |
| Critic | update_critic | 前向、反向传播和模型更新 | 基于这些值和回报,Critic 模型计算平方误差损失并更新其权重。我们还为多种 RLHF 算法实现了 Critic 损失,包括 PPO [55]、Safe-RLHF [19]、ReMax [43]、GRPO [70] 等。 |
| Reference Policy | compute_ref_log_prob | 前向传播 | 参考模型计算提示和响应中每个 Token 的参考对数概率。该对数概率用作评估 Actor 模型差异并约束其学习过程的基准。 |
| Reward | compute_reward | 前向传播 | 奖励模型进行前向计算,为给定的提示和响应计算分数。奖励可以是 Token 级别或样本级别。 |
| — | compute_advantage | 数值计算 | 基于奖励模型给出的奖励值和当前策略模型的响应计算优势值。该计算不涉及模型前向传播。 |