[Paper Translation] HybridFlow: A Flexible and Efficient RLHF Framework


Original paper: https://arxiv.org/pdf/2409.19256v2


HybridFlow: A Flexible and Efficient RLHF Framework

Abstract

Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents computation of a neural network (NN) and each edge denotes data dependencies between the NNs. RLHF complicates the dataflow by expanding each node into a distributed LLM training or generation program, and each edge into a many-to-many multicast. Traditional RL frameworks execute the dataflow using a single controller to instruct both intra-node computation and inter-node communication, which can be inefficient in RLHF due to large control dispatch overhead for distributed intra-node computation. Existing RLHF systems adopt a multi-controller paradigm, which can be inflexible due to nesting distributed computation and data communication. We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. We carefully design a set of hierarchical APIs that decouple and encapsulate computation and data dependencies in the complex RLHF dataflow, allowing efficient operation orchestration to implement RLHF algorithms and flexible mapping of the computation onto various devices. We further design a 3D-Hybrid Engine for efficient actor model resharding between training and generation phases, with zero memory redundancy and significantly reduced communication overhead. Our experimental results demonstrate $1.53\times{\sim}20.57\times$ throughput improvement when running various RLHF algorithms using HybridFlow, as compared with state-of-the-art baselines. HybridFlow source code will be available at https://github.com/volcengine/verl.

CCS Concepts: • Computing methodologies $\rightarrow$ Distributed computing methodologies; Machine learning.

Keywords: Distributed systems, Reinforcement Learning from Human Feedback

ACM Reference Format:

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. HybridFlow: A Flexible and Efficient RLHF Framework. In Twentieth European Conference on Computer Systems (EuroSys ’25), March 30- April 3, 2025, Rotterdam, Netherlands. ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3689031.3696075

1 Introduction

Large language models (LLMs) such as GPT [11], Llama [73] and Claude [7] have revolutionized various artificial intelligence (AI) applications, ranging from writing [2] and searching [52] to coding [63]. LLMs are first pre-trained on trillions of tokens from books, websites, etc., via next-word prediction to accumulate broad knowledge [11]. Next, LLMs are trained on domain-specific datasets via supervised fine-tuning (SFT), to be able to follow human instructions [11]. Despite the outstanding capabilities of LLMs on natural language tasks after pre-training and SFT, detrimental and biased content in the training datasets may still mislead an LLM into generating toxic and undesirable content. Reinforcement Learning from Human Feedback (RLHF) is introduced to further align an LLM with human values, for building helpful and harmless AI applications [7, 55].

RLHF is built upon traditional RL algorithms [4, 68, 78], e.g., Proximal Policy Optimization (PPO) [68] and REINFORCE [78]. The widely adopted PPO-based RLHF system typically consists of four LLMs [7, 55]: an actor, a critic, a reference policy network and a reward model. PPO-based RLHF proceeds in iterations, each with three stages: (1) response generation using the actor model with a batch of prompts;

(2) preparation of training data by scoring the generated responses through a single forward pass of the critic, reference policy, and reward models; (3) learning from human preference by updating the actor and critic through forward and backward computation. Other RLHF variants [19, 43] follow similar stages but involve different numbers of models and different data dependencies among the models.

Traditional RL can be modeled as a dataflow [46], which is a directed acyclic graph (DAG): each node in the RL dataflow represents computation of a neural network (e.g., an actor or critic network, which can be a CNN or MLP); each edge denotes a data dependency between NN computations (e.g., the output of the critic is used as input to actor training [68]). The RLHF dataflow is more complex, with more complicated models involved (e.g., LLMs for the actor/critic/reference/reward models), each running distinct computation, and more diverse data dependencies among them (i.e., multicast between distributed model partitions). Training and generation of an LLM in the RLHF dataflow require distributed computation (e.g., using tensor/pipeline/data parallelism) [40, 71]. Therefore, each node in the RLHF dataflow is a complex distributed program, corresponding to the distributed computation of the respective LLM. Models in different nodes typically use different parallelism strategies as their workloads vary. An edge represents data resharding, which is often a many-to-many multicast. Consequently, flexible representation and efficient execution of the complex and resource-intensive RLHF dataflow are imperative.
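To make the DAG view concrete, here is a toy sketch of the PPO-style RLHF dataflow as an adjacency map (node names are illustrative, not HybridFlow's API), topologically sorted to recover a valid execution order:

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Toy PPO-style RLHF dataflow: each key maps to the set of nodes it depends on.
# Nodes are (distributed) model computations; edges are data dependencies.
rlhf_dag = {
    "actor_generate": set(),                                          # stage 1
    "critic_infer":   {"actor_generate"},                             # stage 2
    "ref_infer":      {"actor_generate"},
    "reward_infer":   {"actor_generate"},
    "actor_train":    {"critic_infer", "ref_infer", "reward_infer"},  # stage 3
    "critic_train":   {"critic_infer", "ref_infer", "reward_infer"},
}

order = list(TopologicalSorter(rlhf_dag).static_order())
print(order[0])  # actor_generate: generation must precede all other nodes
```

In a real RLHF system each node would be a distributed LLM program rather than a label, but the dependency structure the controller must respect is exactly this graph.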

Traditional RL frameworks such as RLLib [45] and RLLib Flow [46] utilize a hierarchical single-controller paradigm to run RL dataflows. A centralized controller assigns nodes in the dataflow to different processes and coordinates their execution order. Each node process can further spawn more workers to perform computation, again following the single-controller paradigm. However, they only provide primitives for data-parallel training and are constrained to neural networks that are at most hundreds of MB in size [45, 46]. In the RLHF dataflow, each node corresponds to an LLM with up to billions of operators, computed using some complex parallelism. A single-controller paradigm is inefficient due to the substantial overhead of dispatching operators to distributed accelerators [1, 9].

Existing RLHF systems adopt a multi-controller paradigm to manage intra-node computation and inter-node data resharding [17, 30, 80]. Each controller independently manages the computation of one device and uses multiple point-to-point operations to coordinate data dependencies between different nodes. This multi-controller paradigm introduces negligible dispatch overhead when performing LLM computation (detailed in §2.2).

However, without central control, it is inflexible to implement various RLHF dataflows, as modifying a single node to adapt to different data dependencies requires changing all dependent nodes' implementations, hindering code reuse.

To address these limitations, we propose HybridFlow, a flexible and efficient RLHF framework to easily represent and execute diverse RLHF dataflows, attaining high throughput. Our key observation is that utilizing the single-controller paradigm on the inter-node level enables flexible expression of various data dependencies and easy coordination of inter-node data resharding with minimal overhead, while integrating the multi-controller paradigm within intra-node computation enhances computation efficiency substantially. We advocate a hierarchical hybrid programming model to generate RLHF dataflows. At the node level, multiple model classes are provided that encapsulate distributed computation (training, inference and generation) of different LLMs in the dataflow into primitive APIs. These APIs can seamlessly support various parallelism strategies from the existing LLM frameworks, including 3D parallelism [71], ZeRO [59], and PyTorch FSDP [57], and perform distributed computation under the multi-controller paradigm. Among the nodes, a set of transfer protocols are designed to hide the complexity of data resharding from users, as coordinated by a single controller. This programming model abstracts away the complexity of distributed computing, allowing users to implement an RLHF dataflow in a few lines of code and run RLHF through a single process of the single controller. It also effectively decouples intra-node computation and inter-node data transfer, allowing independent optimization of each model without changing the code of other models in the dataflow.
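The resulting controller script can be sketched as follows. This is a minimal single-process illustration of the idea: stub classes stand in for the model classes, so the method names and stub bodies here are hypothetical rather than HybridFlow's actual implementation, but they show how the driver expresses only data dependencies while distributed details stay hidden inside each class.

```python
# Stub classes standing in for HybridFlow-style model classes. In the real
# system each method call would fan out to distributed workers under the
# multi-controller paradigm; here everything runs locally for illustration.
class Worker:
    def __init__(self):
        self.updates = 0

class Actor(Worker):
    def generate_sequences(self, prompts):           # stage 1: generation
        return {"prompts": prompts,
                "responses": [p + " <resp>" for p in prompts]}
    def update_actor(self, batch):                   # stage 3: training
        self.updates += 1

class Critic(Worker):
    def compute_values(self, batch):                 # stage 2: preparation
        batch["values"] = [0.0] * len(batch["responses"])
        return batch
    def update_critic(self, batch):                  # stage 3: training
        self.updates += 1

class Reference(Worker):
    def compute_log_prob(self, batch):
        batch["ref_log_probs"] = [0.0] * len(batch["responses"])
        return batch

class RewardModel(Worker):
    def compute_reward(self, batch):
        batch["rewards"] = [1.0] * len(batch["responses"])
        return batch

# Single-controller driver: the whole PPO-style dataflow in a few lines.
actor, critic, ref, rm = Actor(), Critic(), Reference(), RewardModel()
for prompts in [["q1", "q2"], ["q3"]]:
    batch = actor.generate_sequences(prompts)
    batch = critic.compute_values(batch)
    batch = ref.compute_log_prob(batch)
    batch = rm.compute_reward(batch)
    critic.update_critic(batch)
    actor.update_actor(batch)
print(actor.updates)  # 2
```

Note how swapping in a different RLHF variant (e.g., dropping the critic) would only change this driver loop, not the per-model code.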

Training and generation of the actor model represent major computation in the RLHF dataflow. We further design a 3D-Hybrid Engine to enable efficient execution of training and generation of the actor model, introducing zero memory redundancy and significantly reduced communication overhead during model parameter resharding between the training and generation stages. Our hybrid programming model also facilitates flexible placement of models onto the same or different sets of GPU devices. This allows us to provide an effective algorithm to optimize GPU allocation and placement of the models, with various model sizes and distinct workloads, for any RLHF dataflow. Our contributions in designing HybridFlow are summarized as follows:

• We propose a hierarchical hybrid programming model for conveniently building the RLHF dataflow. This programming model enables efficient distributed execution of intra-node computation and flexible inter-node data resharding and transfer, for various RLHF algorithms (§4).
• We design a 3D-Hybrid Engine that executes training and generation of the actor model with high computation efficiency and zero-redundancy transition between the training stage and the generation stage (§5).
• We devise an effective mapping algorithm to automatically identify optimized GPU allocation and placement of each node (model) in the RLHF dataflow (§6).
• We conduct extensive experiments comparing HybridFlow with state-of-the-art RLHF systems [17, 30, 82] under various RLHF algorithms, model sizes and cluster scales. Our evaluation demonstrates $1.53\times{\sim}20.57\times$ throughput improvements.

Figure 1. Dataflow graph of 3 RLHF algorithms [19, 43, 55]. Stages ①, ②, ③ represent Generation, Preparation, and Training, respectively.

We have open-sourced HybridFlow and believe that HybridFlow can boost future RLHF research and development.

2 Background and Motivation

2.1 Reinforcement Learning from Human Feedback

RLHF Workflow. RLHF aligns the linguistic space of LLMs with human values, using a set of human-ranked candidates of given prompts [7, 19, 41, 43, 55, 70, 91]. An RLHF system typically consists of multiple models, e.g., an actor, a critic, a reference policy, and one or multiple reward models. The actor and the reference are each a pre-trained/fine-tuned LLM (i.e., the LLM that is undergoing RLHF). The critic and reward models can be different LLMs fine-tuned on the human preference dataset, with the language modeling head replaced by a scalar output head [7, 55]. The RLHF workflow can be decomposed into 3 stages (Figure 1), and we take PPO as an example:

•Stage 1 (Generation): The actor produces responses from a batch of prompts using auto-regressive generation.

•Stage 2 (Preparation): Using prompts and generated responses, the critic computes their values [66, 68], the reference policy computes their reference log probabilities, and the reward model computes their rewards [7, 55], all via a single pass of forward computation of the respective model.

•Stage 3 (Learning/Training): The actor and the critic are updated via Adam [38], using the batch of data produced by previous stages and the loss function [55].

Other RLHF algorithms largely follow the 3-stage workflow as well (Figure 1(b)(c)). Safe-RLHF [19] introduces an auxiliary pretrain loss following PPO-ptx [55] and includes an additional cost model to fit human preferences and safety labels simultaneously. ReMax [43] requires an additional generation pass for variance reduction and eliminates the critic model in the dataflow. Researchers are actively exploring novel RLHF algorithms [41, 70, 91] and integrating traditional RL methods into RLHF domains [37]. These variances necessitate a flexible representation of the RLHF dataflow graph to accommodate diverse algorithmic requirements.


Figure 2. Programming model used in RLHF systems. (a) Existing RLHF systems adopt the multi-controller paradigm. (b) HybridFlow utilizes a hybrid programming model: the single controller coordinates models; each model uses the multi-controller paradigm for distributed computation. Inactive nodes in grey represent operations not executed at this time.

Parallelism Strategies. LLMs are trained and served with data, pipeline, and tensor parallelism [36, 40, 54]. With data parallelism (DP), the input data is split into multiple subsets; each subset is processed by a separate device (e.g., a GPU) [69]. ZeRO [59] is a memory-optimized solution for DP training, progressively sharding optimizer states, gradients, and model parameters across GPUs. Pipeline parallelism (PP) [32, 53] and tensor parallelism (TP) [71] distribute model parameters, gradients and optimizer states across multiple GPUs. Modern distributed training frameworks like Megatron-LM [71] and MegaScale [36] utilize 3D parallelism or PTD parallelism [54], where P, T, D stand for PP, TP, DP, respectively. In 3D parallelism, PP size represents the number of pipeline stages in model training, TP size refers to the number of shards that a tensor is partitioned into, and DP size is the number of model replicas. LLM serving systems employ 3D parallelism similar to training while only model parameters and KVCache are sharded [16, 29, 40].
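As a sanity check on the (p, t, d) notation, a flat GPU rank can be mapped to its (pipeline stage, data replica, tensor shard) coordinates. The sketch below assumes TP varies fastest, then DP, then PP; real frameworks such as Megatron-LM make the axis ordering configurable, so treat this as illustrative only.

```python
def rank_to_ptd(rank, p, t, d):
    """Map a flat GPU rank to (pp_stage, dp_replica, tp_shard) coordinates.

    Assumes world_size = p * t * d with TP varying fastest, then DP, then PP.
    The actual rank-to-group mapping in 3D-parallel frameworks may differ.
    """
    assert 0 <= rank < p * t * d
    tp = rank % t              # tensor shard index
    dp = (rank // t) % d       # data-parallel replica index
    pp = rank // (t * d)       # pipeline stage index
    return pp, dp, tp

# 8 GPUs: p=2 pipeline stages, t=2 tensor shards, d=2 replicas
print(rank_to_ptd(5, p=2, t=2, d=2))  # (1, 0, 1)
```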

LLM models in the RLHF dataflow may perform distinct computations, including training (one forward pass, one backward pass and model update), inference (one forward pass) and generation (auto-regressive generation with multiple forward passes). In particular, training and generation are performed on the actor model, training and inference on the critic, and inference on reference policy and reward models. Distinct parallel strategies can be applied to different models for varied computations to achieve optimal throughput.

2.2 Programming Model for Distributed ML

Single-Controller. It employs a centralized controller to manage the overall execution flow of the distributed program. With centralized control logic, users can build core functionalities of the dataflow as a single process (Figure 2(b)), while the controller automatically generates distributed workers to carry out the computation. With a global view of the hardware and dataflow graph, the single-controller paradigm allows flexible and optimized resource mapping and execution order coordination among dataflow tasks. However, coordination messages are passed from the controller to all workers, incurring significant dispatch overhead when executing expansive dataflow graphs on large clusters [1, 9].

Multi-Controller. Each device (aka worker) has its own controller. State-of-the-art distributed LLM training and serving systems adopt the multi-controller paradigm, due to its scalability and low dispatch overhead (control messaging is largely passed from CPU to GPU over fast PCIe links) [36, 40, 60, 71]. As shown in the example multi-controller RLHF implementation in Figure 2(a), a separate program is run for each model, and all workers of one model execute the same program. Each worker only possesses a local view of the system state and requires point-to-point communication between two models (blue code and arrows) to coordinate model execution order. To implement an RLHF workflow in the multi-controller architecture, a user must intricately integrate the code for collective communication, computation, and point-to-point data transfer in the program run at each device. This leads to deeply nested code of computation and data transfer that is challenging to develop, maintain, and optimize. In Figure 2(a), each model performs local computation and all_gather operations (black code), while the actor model must explicitly manage send operations to the critic and reward models, and the latter must correspondingly implement receive operations at precise points in their programs.
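This coupling can be illustrated with a toy two-worker program, using an in-process queue to stand in for a point-to-point link (NCCL or Gloo in a real system): both roles run the same function, and the actor's send must line up exactly with the critic's receive, which is what makes changing the dataflow invasive.

```python
import queue
import threading

# Stands in for a point-to-point channel between the actor's and critic's GPUs.
chan = queue.Queue()

def worker(role, out):
    """Every worker runs the SAME program; behavior branches on its role.
    Data dependencies appear as explicit, position-sensitive send/recv calls."""
    if role == "actor":
        responses = ["resp0", "resp1"]   # local computation (generation)
        chan.put(responses)              # explicit send to the critic
    elif role == "critic":
        responses = chan.get()           # must match the actor's send exactly
        out["values"] = [len(r) for r in responses]  # local computation

out = {}
threads = [threading.Thread(target=worker, args=(r, out))
           for r in ("actor", "critic")]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(out["values"])  # [5, 5]
```

Adding a new consumer of the responses (say, a cost model) would require editing the actor's branch as well, mirroring the code-reuse problem described above.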

2.3 RLHF Characteristics

Heterogeneous model workloads. The actor, critic, reference and reward models in RLHF may execute training, inference or generation at different stages, with different memory footprints and computation demands. For the reference policy and reward models, only their model parameters need to be stored in GPU memory, as they perform only forward pass computation. For the actor and the critic, their model parameters, gradients, and optimizer states must be stored as they undergo model training. Moreover, a small actor model (e.g., a 7B pre-trained/fine-tuned LLM) can be paired with larger critic and reward models (e.g., 70B LLMs) in RLHF for better alignment [7]. Given such heterogeneity, different parallelism strategies and tailored optimizations are needed for running each model during RLHF.

Unbalanced computation between actor training and generation. In the RLHF dataflow, training and generation of the actor model are represented by two nodes (Figure 1), which often render the majority of the workload in each RLHF iteration (e.g., 58.9% of total RLHF time with HybridFlow). Actor training is computation-bound [24], often requiring a larger model-parallel (MP) size (i.e., the number of partitions the model is partitioned into) and distributing the workload to more GPUs, e.g., 8 partitions of a 7B model on 8 GPUs. Using the same parallelism strategy (e.g., the same MP size) for generation can lead to underutilization of GPU computation resources due to its memory-bound nature [40]. Previous studies show that combining a larger DP size with a smaller MP size (hybrid data and model parallelism), e.g., partitioning a 7B model into two and replicating it four times on 8 GPUs, can improve the generation throughput [44, 92]. Although using different parallelism strategies for actor training and generation may optimize throughput in both stages, resharding the actor model weights at runtime between the two stages can incur significant communication and memory overhead. For example, aligning a 70B actor model requires transferring 140GB of model weights from training to generation per RLHF iteration, taking up to 36.4% of an iteration time when the two stages are on different devices [30].
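The 140GB figure follows directly from storing each of the 70B parameters in 16-bit precision (2 bytes per parameter, assuming bf16/fp16 weights):

```python
def weights_gb(n_params_billion, bytes_per_param=2):
    """Size of model weights in GB, assuming 16-bit (bf16/fp16) parameters."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# A 70B actor transfers its full weight set per training->generation reshard.
print(weights_gb(70))  # 140.0 (GB per RLHF iteration)
```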


Figure 3. Dataflow execution given a model placement plan. Blocks with numbers represent GPUs. In dashed boxes, the models are placed on different sets of devices and can be concurrently computed. Reference model (blue) and reward model (green) are colocated on the same set of GPUs and executed sequentially.

Diverse model placement requirements. Strategic device placement of models in the RLHF dataflow is necessary, according to computation workloads and data dependencies of the models. Figure 3 gives an example model placement plan and the corresponding RLHF execution flow. Models placed on different sets of devices can be executed in parallel if no data dependencies exist. Models placed on the same set of GPUs, referred to as colocated models, share the GPU memory and are executed sequentially in a time-sharing manner, as out-of-memory (OOM) error may easily happen if colocated LLMs execute concurrently.
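A placement plan like the one in Figure 3 can be captured as a mapping from models to GPU sets; two models are colocated exactly when their GPU sets intersect (the names and device ids below are illustrative):

```python
# Toy placement plan mirroring Figure 3: actor and critic on disjoint GPUs,
# reference and reward models colocated on the same GPUs.
placement = {
    "actor":  {0, 1, 2, 3},
    "critic": {4, 5},
    "ref":    {6, 7},
    "reward": {6, 7},
}

def colocated(m1, m2):
    """Colocated models share GPU memory and must run sequentially
    (time-sharing); models on disjoint GPU sets can run in parallel."""
    return bool(placement[m1] & placement[m2])

print(colocated("ref", "reward"))    # True  -> executed sequentially
print(colocated("actor", "critic"))  # False -> can be computed concurrently
```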

We observe a compromise: placing models on different devices permits parallel processing but may inevitably lead to some GPU idle time, given staged model execution in RLHF. In Figure 3, actor and critic are placed separately, performing training in parallel, but incurring 1/3 of their GPU time being idle during other RLHF stages. Supporting various placement strategies and maximizing device utilization are crucial for optimizing RLHF performance at any model size and cluster scale.

Table 1. Comparison of RLHF frameworks. Figures illustrate execution of one PPO iteration. Numbers 1-6 represent response generation, reward model inference, reference model inference, critic inference, actor training, and critic training, respectively.

| RLHF system | DeepSpeed-Chat | OpenRLHF | NeMo-Aligner | HybridFlow |
|---|---|---|---|---|
| Parallelism | Training: ZeRO; Generation: TP | Training: ZeRO; Generation: TP | 3D parallelism for both training and generation | Training: 3D, ZeRO, FSDP; Generation: 3D parallelism |
| Actor weights in training & generation | Model resharding from ZeRO to TP | Two copies of actor weights in the two stages | Same model partition in both stages (shared weights) | Zero-redundancy model resharding |
| Model placement | All models on the same set of devices | Each model on different devices | Actor/reference on some GPUs; critic/reward on the other GPUs | Supports various model placements |
| Execution pattern | (per-framework execution timelines are shown as figures in the original table and are not reproduced here) | | | |

2.4 Limitations of existing RLHF systems

Inflexible support for various RLHF dataflow graphs.

Existing RLHF systems adopt the multi-controller paradigm for dataflow implementation [17, 30, 80, 82]. To implement various RLHF algorithms, a user must navigate and manage code that mixes collective communication, model computation (potentially using various distributed training/serving frameworks), and point-to-point data transfer. This code structure lacks modularity/function encapsulation, making the RLHF systems tightly coupled with specific LLM training and serving frameworks. Consequently, a user needs to implement and optimize different RLHF dataflows case-by-case [46], hindering code reuse and increasing the risk of making mistakes. Existing RLHF frameworks only support the PPO algorithm. In addition, limited parallel strategies are supported due to implementation complexity. For example, to incorporate 3D parallelism for LLM training and generation in DeepSpeed-Chat [82], one may have to re-implement the whole system due to the mixed code structure.

Inefficient RLHF execution. Table 1 summarizes parallelism strategies, model placement, and execution patterns adopted by the existing RLHF systems. DeepSpeed-Chat [82] and OpenRLHF [30] adopt ZeRO-3 for actor training and TP for actor generation. OpenRLHF uses different copies of the actor model on different devices for training and generation, incurring redundant memory usage and frequent weight synchronization among devices. DeepSpeed-Chat maintains the same copy of the actor model on the same set of devices for training and generation, and reshards model weights between training and generation (due to different parallelisms used in the two stages), which may still incur substantial memory and communication overhead for large models (detailed in §5.4). NeMo-Aligner [17] uses the same 3D parallelism configurations in actor training and generation, experiencing low generation throughput (§8.4).

Existing RLHF frameworks are limited to one model placement plan and hence one RLHF execution pattern, as shown in Table 1. Implementing a different placement is difficult, requiring changes to the inner logic of model initialization and inter-node data transfer, as highlighted in blue in Figure 2. OpenRLHF and NeMo-Aligner allow concurrent model computation in the preparation and learning stages; in the generation stage, models except the actor are idle, wasting the GPUs they occupy. DeepSpeed-Chat colocates all models on the same set of devices, and each device runs each model sequentially according to the RLHF dataflow. With unbalanced workloads among the models, such a placement can be inefficient in resource utilization (evaluated in §8.3).

2.5 Design Considerations

To tackle limitations of existing systems, the key question is - How to design a flexible and efficient programming model to implement RLHF dataflow? A single-controller design is particularly advantageous at the inter-node level due to its flexibility in coordinating data transfer, execution order, and resource virtualization among distributed computation of different models [9, 50]. The RLHF dataflow graph typically consists of only a few nodes. Dispatching control messages to different nodes from the single-controller incurs negligible overhead as compared to distributed computation required for nodes (models) in the dataflow. The multi-controller paradigm, known for its low latency in dispatching operators to accelerators [20], can be leveraged in distributed computation of each model. With these insights, we propose a hierarchical hybrid programming model for RLHF dataflow implementation. Our key design principle is to combine single-controller and multi-controller paradigms in a hybrid manner. This design ensures flexible expression and efficient execution of RLHF dataflow, maintaining low control overhead at both inter-node and intra-node levels. As shown in Figure 2(b), this paradigm decouples intra-node distributed computation and inter-node data transfer, allowing each model to focus solely on local computation without managing inter-node communication.

为了应对现有系统的局限性,关键问题在于如何设计一个灵活高效的编程模型来实现RLHF数据流?单控制器设计在节点间级别具有显著优势,因为它在协调数据传输、执行顺序和资源虚拟化方面具有灵活性,特别是在不同模型的分布式计算中 [9, 50]。RLHF数据流图通常仅由少数几个节点组成。与数据流中节点(模型)所需的分布式计算相比,从单控制器向不同节点分发控制消息的开销可以忽略不计。多控制器范式以其在将操作分发到加速器时的低延迟而闻名 [20],可以在每个模型的分布式计算中加以利用。基于这些见解,我们提出了一种分层的混合编程模型来实现RLHF数据流。我们的关键设计原则是以混合方式结合单控制器和多控制器范式。这种设计确保了RLHF数据流的灵活表达和高效执行,同时在节点间和节点内保持低控制开销。如图2(b)所示,这种范式解耦了节点内的分布式计算和节点间的数据传输,使每个模型能够专注于本地计算,而无需管理节点间通信。

3 HybridFlow Overview

3 HybridFlow 概述

Figure 4 depicts the architecture of HybridFlow, which consists of three major components: Hybrid Programming Model,

图 4 描绘了 HybridFlow 的架构,它由三个主要组件组成:混合编程模型、

3D-Hybrid Engine and Auto-Mapping algorithm. The hybrid programming model includes a set of hierarchical APIs to enable flexible expression of the RLHF dataflow and efficient computation of models in the dataflow (§4). The 3D-Hybrid Engine is particularly designed for efficient training and generation of the actor model, allowing different 3D parallel configurations in the two stages and enabling zero memory redundancy and minimized communication overhead during the transition between two stages (§5). The auto-mapping algorithm determines optimized device placement of each model to maximize the throughput of RLHF (§6).

3D混合引擎与自动映射算法。混合编程模型包含一组分层API,用于灵活表达RLHF数据流并高效计算数据流中的模型(§4)。3D混合引擎专为演员模型的高效训练和生成设计,允许在两个阶段中使用不同的3D并行配置,并在阶段转换期间实现零内存冗余和最小化通信开销(§5)。自动映射算法确定每个模型的优化设备放置,以最大化RLHF的吞吐量(§6)。

The workflow of our RLHF system goes as follows. A user provides the following inputs to start the RLHF system: (i) model specifications, including the architecture and size of the actor/critic/reference policy/reward models in the RLHF dataflow; (ii) device placement of the models in the dataflow, as obtained by running the auto-mapping algorithm under given GPU cluster configurations; (iii) parallelism strategy for running each model in each stage, e.g., a tuple of (p, t, d) for 3D parallelism, where p, t, d represent PP size, TP size and DP size, respectively. The single controller program takes these inputs to initialize models in the RLHF dataflow and virtualized resource pool, dispatches operations/models to devices according to the placement plan, and invokes functions run by the multiple controllers on devices to carry out distributed computation of each model.

我们的 RLHF 系统工作流程如下。用户提供以下输入以启动 RLHF 系统:(i) 模型规格,包括 RLHF 数据流中 actor/critic/reference policy/reward 模型的架构和大小;(ii) 在给定 GPU 集群配置下运行自动映射算法获得的数据流中模型的设备放置;(iii) 每个阶段运行每个模型的并行策略,例如 (p, t, d) 元组用于 3D 并行,其中 p, t, d 分别代表 PP 大小、TP 大小和 DP 大小。单个控制器程序接收这些输入以初始化 RLHF 数据流和虚拟化资源池中的模型,根据放置计划将操作/模型分配到设备,并调用设备上由多个控制器运行的函数以执行每个模型的分布式计算。

The multi-controller program implements the ParallelWorker class: it constructs parallel groups of each model among allocated devices according to its parallelism strategies, invokes the 3D-Hybrid Engine for actor training and generation, and can be integrated seamlessly with existing LLM engines [40, 57, 60, 71] for training, inference and generation of other models. The transfer protocols are coordinated by the single controller program to support resharding of data (including prompts, responses, and other model outputs in RLHF) between models with distinct parallelism strategies. The data resharding of the actor between training and generation is handled by 3D-Hybrid Engine.

多控制器程序实现了 ParallelWorker 类:它根据其并行策略在分配的设备中构建每个模型的并行组,调用 3D-Hybrid Engine 进行 actor 训练和生成,并可以无缝集成现有的 LLM 引擎 [40, 57, 60, 71],用于其他模型的训练、推理和生成。传输协议由单个控制器程序协调,以支持具有不同并行策略的模型之间的数据(包括提示、响应和 RLHF 中的其他模型输出)重分片。训练和生成之间的 actor 数据重分片由 3D-Hybrid Engine 处理。

4 Hybrid Programming Model

4 混合编程模型

4.1 Hierarchical APIs

4.1 分层 API

Intra-node: encapsulating distributed program. For distributed computation of each model in different RLHF stages, we provide a base class, 3DParallelWorker. Given allocated devices, it facilitates distributed model weight initialization and establishes 3D parallel groups for each model. A parallel group includes a set of GPUs to host a specific parallel dimension of the model, e.g., different tensor shards in TP and different model replicas in DP. Figure 5(a) illustrates initialization of the actor model with our APIs, while initialization of other models is similar.

节点内部:封装分布式程序。针对不同RLHF阶段中每个模型的分布式计算,我们提供了一个基类 3DParallelWorker。在给定分配的设备后,它有助于分布式模型权重的初始化,并为每个模型建立3D并行组。一个并行组包括一组GPU,用于承载模型的特定并行维度,例如TP中的不同张量分片和DP中的不同模型副本。图5(a)展示了使用我们的API初始化actor模型的过程,而其他模型的初始化过程类似。


Figure 5. An illustration of hierarchical APIs. (a) Model with 3D parallel configuration, resource allocation, and 3DParallelWorker initialization. (b) Asynchronous data resharding between two models with collect and distribute functions in 3D_PROTO.

图 5: 分层 API 的示意图。(a) 具有 3D 并行配置、资源分配和 3DParallelWorker 初始化的模型。(b) 两个模型之间通过 3D_PROTO 中的 collect 和 distribute 函数进行异步数据重分片。

Inheriting from the 3DParallelWorker class, several model classes, for actor, critic, reference, and reward model, respectively, are provided. Each of these model classes encapsulates APIs to implement the model's distributed forward and backward computation, auto-regressive generation, and optimizer updates, decoupling the distributed computation code from data dependencies on other models. These APIs can be easily implemented by reusing the computation scripts from existing LLM systems. For example, the computation involved in the update_actor function of ActorWorker (the class for the actor model) is similar to the pre-training scripts in Megatron-LM [71]. A model class encapsulates fundamental operations for implementing various RLHF algorithms, e.g., generate_sequences in the actor model class for generating responses based on the prompts and compute_reward in the reward model class for evaluating responses through a forward pass. (More APIs are detailed in Appendix A).

继承自 3DParallelWorker 类,分别为 actor、critic、reference 和 reward 模型提供了多个模型类。每个模型类封装了实现模型的分布式前向和反向计算、自回归生成和优化器更新的 API,将分布式计算代码与其他模型的数据依赖解耦。这些 API 可以通过重用现有大语言模型系统中的计算脚本来轻松实现。例如,ActorWorker(actor 模型的类)中 update_actor 函数涉及的计算与 Megatron-LM [71] 中的预训练脚本类似。模型类封装了实现各种 RLHF 算法的基本操作,例如在 actor 模型类中通过 generate_sequences 基于提示生成响应,在 reward 模型类中通过 compute_reward 执行一次前向传递来评估响应。(更多 API 详见附录 A)。
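下面给出一个极简的示意性草图(纯 Python,非 verl 的实际实现;类的内部逻辑均为占位假设),演示模型类如何封装操作并对外暴露 generate_sequences、update_actor、compute_reward 等原语 API:

```python
class Base3DParallelWorker:
    """占位基类:真实的 3DParallelWorker 还会初始化分布式权重与 3D 并行组。"""
    def __init__(self, config, resource_pool):
        self.config = config
        self.resource_pool = resource_pool  # 分配给该模型的 GPU 集合

class ActorWorker(Base3DParallelWorker):
    def generate_sequences(self, prompts, do_sample=True):
        # 占位:真实实现为分布式自回归生成
        return {"prompts": list(prompts),
                "responses": [p + " <resp>" for p in prompts]}

    def update_actor(self, batch, loss_func="ppo"):
        # 占位:真实实现为分布式前向/反向与优化器更新
        return {"loss": 0.0, "loss_func": loss_func}

class RewardWorker(Base3DParallelWorker):
    def compute_reward(self, batch):
        # 占位:真实实现为一次分布式前向打分
        batch["rewards"] = [len(r) for r in batch["responses"]]
        return batch

actor = ActorWorker(config={}, resource_pool=[0, 1, 2, 3])
reward = RewardWorker(config={}, resource_pool=[0, 1, 2, 3])
batch = actor.generate_sequences(["q1", "q2"])
batch = reward.compute_reward(batch)
```

真实实现中,这些方法内部会调用 Megatron-LM 等引擎完成分布式计算;此处仅示意调用契约。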

Besides the base class 3DParallelWorker that implements 3D parallelism, we further provide base classes for PyTorch FSDP (FSDPWorker) and ZeRO (ZeROWorker), and the corresponding model classes inheriting each base class, to support different parallelism strategies in model computation. ParallelWorker in Figure 4 denotes one of these base classes.

除了实现 3D 并行的基类 3DParallelWorker,我们还提供了 PyTorch FSDP (FSDPWorker) 和 ZeRO (ZeROWorker) 的基类,以及继承每个基类的相应模型类,以支持模型计算中的不同并行策略。图 4 中的 ParallelWorker 表示这些基类之一。

Inter-node: unifying data resharding implementation between models. Many-to-many multicast is involved for data transfer between models employing different parallelism strategies on different devices. We unify this data transfer implementation by associating each operation in each model class with a transfer protocol, using @register. Each transfer protocol consists of a collect function and a distribute function, to aggregate output data and distribute input data according to the parallelism strategy of each model. In the example in Figure 5(a), the update_actor operation is registered to transfer protocol 3D_PROTO, as 3D parallelism is used for actor training. In 3D_PROTO, the collect function gathers all the output data of the corresponding model function (e.g., the loss scalar returned from update_actor) in each DP group to the single controller, and the distribute function distributes the input data of the registered function (e.g., advantages for update_actor) to each DP group. Data resharding is enabled using the source model's output collect function and the destination model's input distribute function. Figure 5(b) illustrates data resharding between the actor (generation) and the critic (inference), where computation of the models adopts different 3D parallelism strategies. The single controller gathers data futures using the collect function in 3D_PROTO of the actor (steps $\textcircled{1}$-$\textcircled{3}$) and sends them to the critic (step $\textcircled{4}$); the critic distributes the received data futures to each DP group using the distribute function in its 3D_PROTO (step $\textcircled{5}$). Then remote data is retrieved from the actor to the critic, with each of the critic's GPUs only fetching the required local batch of the actor's output data according to its DP rank (step $\textcircled{6}$). The actual data transfer only occurs between GPUs, avoiding any central bottleneck.

节点间:统一模型间的数据重分片实现。在不同设备上采用不同并行策略的模型之间进行数据传输时,涉及多对多的多播。我们通过使用 @register 将模型类中的每个操作关联到一个传输协议,来统一这一数据传输实现。每个传输协议由一个 collect 函数和一个 distribute 函数组成,根据每个模型的并行策略来聚合输出数据并分发输入数据。在图 5(a) 的示例中,由于 actor 训练采用了 3D 并行,update_actor 操作被注册到传输协议 3D_PROTO。在 3D_PROTO 中,collect 函数将每个 DP 组中相应模型函数的所有输出数据(例如,update_actor 返回的损失标量)聚合到单个控制器,distribute 函数将被注册函数的输入数据(例如,update_actor 所需的优势值)分发到每个 DP 组。数据重分片通过源模型的输出 collect 函数和目标模型的输入 distribute 函数实现。图 5(b) 展示了 actor(生成)和 critic(推理)之间的数据重分片,其中两个模型的计算采用了不同的 3D 并行策略。单个控制器使用 actor 的 3D_PROTO 中的 collect 函数收集数据 future(步骤 $\textcircled{1}$-$\textcircled{3}$),并将其发送给 critic(步骤 $\textcircled{4}$);critic 使用其 3D_PROTO 中的 distribute 函数将接收到的数据 future 分发给每个 DP 组(步骤 $\textcircled{5}$)。然后,远程数据从 actor 检索到 critic,critic 的每个 GPU 根据其 DP rank 仅获取 actor 输出数据中所需的本地批次(步骤 $\textcircled{6}$)。实际的数据传输仅在 GPU 之间进行,避免了任何中心瓶颈。
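下面用纯 Python 模拟 collect/distribute 的重分片思路(示意性草图:函数名与切分逻辑均为假设,真实系统传输的是 GPU 上的张量分片,且只传递 future):

```python
def collect_dp(shards):
    """collect:单控制器把源模型各 DP 组的本地输出聚合为全局 batch。"""
    merged = []
    for shard in shards:
        merged.extend(shard)
    return merged

def distribute_dp(batch, dp_size):
    """distribute:按目标模型的 DP 大小把全局 batch 均匀切分。"""
    assert len(batch) % dp_size == 0
    n = len(batch) // dp_size
    return [batch[i * n:(i + 1) * n] for i in range(dp_size)]

# actor 以 DP=2 产生输出(每个 DP 组一个本地 batch),critic 以 DP=4 消费
actor_outputs = [[0, 1, 2, 3], [4, 5, 6, 7]]
global_batch = collect_dp(actor_outputs)
critic_inputs = distribute_dp(global_batch, dp_size=4)
```

源模型的 collect 与目标模型的 distribute 组合起来,就完成了两种并行布局之间的数据重分片。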

We provide 8 transfer protocols, including 3D_PROTO, DP_PROTO, ONE_TO_ALL, etc., that cover most data resharding scenarios (detailed in Appendix B). A user can further extend the transfer protocols through implementing customized collect and distribute functions.

我们提供了 8 种传输协议,包括 3D_PROTO、DP_PROTO、ONE_TO_ALL 等,涵盖了大多数数据重分片场景(详见附录 B)。用户可以通过实现自定义的收集和分发函数进一步扩展传输协议。

Facilitating flexible model placement. We provide a ResourcePool class that virtualizes a set of GPU devices. When applying a ResourcePool instance to a model class (Figure 5(a)), distributed computation of the model will be mapped to the devices. Models utilizing the same ResourcePool instance are colocated on the same set of GPUs; models are placed on different sets of GPUs when different ResourcePool instances are applied in their model classes. We assume no overlap between different ResourcePool instances.

促进灵活的模型部署。我们提供了一个 ResourcePool 类,用于虚拟化一组 GPU 设备。当将一个 ResourcePool 实例应用于模型类时(图 5(a)),模型的分布式计算将被映射到这些设备上。使用相同 ResourcePool 实例的模型将被部署在同一组 GPU 上;当模型类中应用不同的 ResourcePool 实例时,模型将被部署在不同的 GPU 组上。我们假设不同的 ResourcePool 实例之间没有重叠。
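ResourcePool 的共置语义可以用一个最小示意(假设性代码)说明:使用同一实例的模型共置,不同实例对应不同的 GPU 集合:

```python
class ResourcePool:
    """虚拟化一组 GPU;内部字段为示意。"""
    def __init__(self, gpu_ids):
        self.gpu_ids = tuple(gpu_ids)

pool_a = ResourcePool(range(0, 8))    # GPU 0-7
pool_b = ResourcePool(range(8, 16))   # GPU 8-15
placement = {"actor": pool_a, "ref": pool_a,
             "critic": pool_b, "reward": pool_b}

def colocated(m1, m2):
    # 使用同一 ResourcePool 实例的模型被共置在同一组 GPU 上
    return placement[m1] is placement[m2]
```

切换放置方案只需改动模型类绑定的 ResourcePool 实例,而无需改动 RLHF 算法代码。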

Asynchronous dataflow execution. When models are placed on separate sets of devices, their execution is triggered automatically as soon as their inputs become available [50]. In Figure 5(b), the data future from the actor is immediately returned after the controller's call (steps $\textcircled{1}$-$\textcircled{3}$); the controller then initiates a new call to the critic and distributes the futures following the transfer protocol (steps $\textcircled{4}$-$\textcircled{5}$). When some models are placed on the same set of devices, they are executed sequentially based on the calling order. With our programming model, HybridFlow is flexible in supporting diverse distributed execution patterns without any code change of the RLHF algorithm (Figure 6).

异步数据流执行。当模型被放置在不同的设备集上时,一旦它们的输入可用,它们的执行就会自动触发 [50]。在图 5(b) 中,来自 actor 的数据 future 在控制器调用后立即返回(步骤 $\textcircled{1}$-$\textcircled{3}$);然后控制器发起对 critic 的新调用,并按照传输协议分发 future(步骤 $\textcircled{4}$-$\textcircled{5}$)。当某些模型被放置在同一设备集上时,它们会根据调用顺序依次执行。通过我们的编程模型,HybridFlow 能够灵活支持各种分布式执行模式,而无需对 RLHF 算法进行任何代码更改(图 6)。
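异步触发的效果可以用标准库的 future 粗略模拟(示意性草图:真实系统中由 Ray 的 future 承担该角色,且结果以引用形式在 GPU 间传递):

```python
from concurrent.futures import ThreadPoolExecutor

def actor_generate(prompts):          # 模拟放在设备集 A 上的 actor
    return [p + "_resp" for p in prompts]

def critic_values(batch):             # 模拟放在设备集 B 上的 critic
    return [len(x) for x in batch]

with ThreadPoolExecutor() as pool:
    fut = pool.submit(actor_generate, ["a", "b"])   # 调用立即返回 future
    # 控制器不必阻塞在 actor 上;critic 在其输入(future 的结果)就绪后执行
    values = pool.submit(critic_values, fut.result()).result()
```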

图 6: 使用混合编程模型实现 PPO、ReMax 和 Safe-RLHF 的示例。

```python
# 通过复用 RewardWorker 初始化成本模型
cost = RewardWorker(cost_config, resource_pool)  # 为 Safe-RLHF 添加
# 省略其他模型初始化
# algo_type 指定不同 RLHF 算法的数值计算
# PPO、ReMax 和 Safe-RLHF 的示例
for (prompts, pretrain_batch) in dataloader:
    # 阶段 1: 生成响应
    batch = actor.generate_sequences(prompts)
    batch = actor.generate_sequences(prompts, do_sample=False)  # 为 ReMax 添加
    # 阶段 2: 准备经验
    batch = critic.compute_values(batch)  # 在 ReMax 中不需要
    batch = reference.compute_log_prob(batch)
    batch = reward.compute_reward(batch)
    batch = cost.compute_cost(batch)  # 为 Safe-RLHF 添加
    batch = compute_advantages(batch, algo_type)
    # 阶段 3: Actor 和 critic 训练
    critic_metrics = critic.update_critic(batch, loss_func=algo_type)  # 在 ReMax 中不需要
    pretrain_loss = actor.compute_loss(pretrain_batch)  # 为 Safe-RLHF 添加
    batch["pretrain_loss"] = pretrain_loss  # 为 Safe-RLHF 添加
    actor_metrics = actor.update_actor(batch, loss_func=algo_type)
```

4.2 Implementation of different RLHF algorithms

4.2 不同RLHF算法的实现

Our APIs enable streamlined development of various RLHF algorithms (dataflows). Users can implement an RLHF algorithm in a few lines of code as a single process program to run on the single controller, which involves a sequence of primitive API calls to invoke distributed computation of models. Examples of PPO, ReMax, and Safe-RLHF are given in Figure 6. PPO can be implemented in just 8 lines by invoking model operations including compute_values and generate_sequences, which are executed under the multi-controller paradigm on multiple GPUs. To adapt to Safe-RLHF, which integrates an additional cost model to evaluate safety preferences and the pre-training loss for the actor, only 5 more lines of code are added on top of the PPO implementation. To adapt to ReMax, one additional call to actor generation is needed, and the critic-related code can be removed.

我们的 API 支持各种 RLHF 算法(数据流)的简化开发。用户只需几行代码即可将一个 RLHF 算法实现为在单控制器上运行的单进程程序,其中包含一系列原语 API 调用来触发模型的分布式计算。图 6 给出了 PPO、ReMax 和 Safe-RLHF 的示例。PPO 只需 8 行代码即可实现,通过调用 compute_values 和 generate_sequences 等模型操作,这些操作在多个 GPU 上以多控制器范式执行。为了适配 Safe-RLHF(它集成了额外的成本模型来评估安全偏好,并为 actor 引入预训练损失),只需在 PPO 实现的基础上添加 5 行代码。为了适配 ReMax,需要额外调用一次 actor 生成,并可删除与 critic 相关的代码。

Achieving flexibility. This flexibility of extension is crucial for researchers to explore different RLHF algorithms: they can reuse the distributed computation encapsulated in each model class and simply adjust the code for numerical computations according to specific algorithms, such as GAE [67] in compute_advantages and KL divergence in the loss functions of actor and critic. The streamlined development can be attributed to the hybrid programming model. Our modular API design simplifies development, facilitates extensive code reuse, and enables directly incorporating the codebase of existing LLM training/serving frameworks. It also decouples model computation and data transfer among models. Any change in the distributed frameworks does not affect the code of the RLHF algorithm (Figure 6), enabling individualized optimization for each model's execution (§5). Flexible placement of models with diverse workloads is supported, enabling optimized mapping of RLHF dataflow onto various devices (§6).

实现灵活性。这种扩展的灵活性对于研究人员探索不同的 RLHF 算法至关重要:他们可以复用封装在每个模型类中的分布式计算,只需根据具体算法调整数值计算代码,例如 compute_advantages 中的 GAE [67] 以及 actor 和 critic 损失函数中的 KL 散度。这种简化的开发归功于混合编程模型。我们的模块化 API 设计简化了开发,促进了广泛的代码重用,并能够直接集成现有大语言模型训练/服务框架的代码库。它还解耦了模型计算和模型之间的数据传输。分布式框架的任何更改都不会影响 RLHF 算法的代码(图 6),从而实现对每个模型执行的个性化优化(§5)。支持具有不同工作负载的模型的灵活放置,从而优化 RLHF 数据流在各种设备上的映射(§6)。

Figure 7. 3D-Hybrid Engine workflow in one RLHF iteration. 4 GPUs are used for actor training and generation. 1-2-2 $(p{-}t{-}d)$ parallel groups are used in training and 1-1-2-2 $(p_g{-}t_g{-}d_g{-}d)$ parallel groups are used in generation.

图 7: 3D-Hybrid Engine 在一次 RLHF 迭代中的工作流程。4 个 GPU 用于 actor 训练和生成。训练中使用 1-2-2 $(p{-}t{-}d)$ 并行组,生成中使用 1-1-2-2 $(p_g{-}t_g{-}d_g{-}d)$ 并行组。

5 3D-Hybrid Engine

5 3D-混合引擎

We design the 3D-Hybrid Engine to support efficient training and generation of the actor model, targeting significant RLHF throughput improvement.

我们设计了3D-Hybrid Engine,以支持高效的演员模型训练和生成,旨在显著提升RLHF的吞吐量。

5.1 Parallel Groups

5.1 并行组

To eliminate redundant actor model copies, we advocate deploying actor training and generation stages on the same set of devices, $N_{a}$ GPUs allocated to the actor, and execute them sequentially on the same copy of actor model weights. Nonetheless, actor training and generation may well adopt different 3D parallelism strategies, i.e., the generation stage typically requires smaller TP and PP sizes but a larger DP size, than the training stage (§2.3). 3D-Hybrid Engine enables efficient model parameter resharding between actor training and generation across the same set of devices in this context.

为了消除冗余的演员模型副本,我们建议将演员训练和生成阶段部署在同一组设备上,即分配给演员的 $N_{a}$ 个 GPU 上,并在同一副本的演员模型权重上顺序执行它们。然而,演员训练和生成阶段可能会采用不同的 3D 并行策略,即生成阶段通常需要更小的 TP 和 PP 规模,但 DP 规模比训练阶段更大(§2.3)。3D-Hybrid Engine 在这种情况下能够高效地在同一组设备上实现演员训练和生成之间的模型参数重分片。

Let $p{-}t{-}d$ denote 3D parallel groups constructed for actor training, corresponding to the set of GPUs to host $p$ pipeline stages, $t$ tensor shards, and $d$ model replicas [54]. 3D-Hybrid Engine builds different parallel groups for actor training and generation, according to their different 3D parallelism strategies, respectively. We use $p_g$, $t_g$, and $d_g$ to denote the size of the generation pipeline parallel group, generation tensor parallel group, and micro data parallel group, respectively, in the generation stage. $d_g$ indicates the ratio of model replica number in generation over that in training, i.e., each DP replica in training becomes $d_g$ micro DP replicas, to process $d_g$ micro batches of prompts and responses. We have $N_a = p \times t \times d = p_g \times t_g \times d_g \times d$, such that $d_g = \frac{p}{p_g} \cdot \frac{t}{t_g}$. The micro DP groups are employed exclusively in the actor generation stage to render a larger DP size for full device utilization. The generation parallel groups are denoted by $p_g{-}t_g{-}d_g{-}d$.

令 $p{-}t{-}d$ 表示为演员训练构建的3D并行组,对应于托管 $p$ 个流水线阶段、$t$ 个张量分片和 $d$ 个模型副本的GPU集合 [54]。3D-Hybrid Engine 根据不同的3D并行策略,分别为演员训练和生成构建不同的并行组。我们使用 $p_g$、$t_g$ 和 $d_g$ 分别表示生成阶段的生成流水线并行组、生成张量并行组和微数据并行组的大小。$d_g$ 表示生成阶段的模型副本数与训练阶段的比例,即训练中的每个DP副本变为 $d_g$ 个微DP副本,以处理 $d_g$ 个微批次的提示和响应。我们有 $N_a = p \times t \times d = p_g \times t_g \times d_g \times d$,因此 $d_g = \frac{p}{p_g} \cdot \frac{t}{t_g}$。微DP组仅在演员生成阶段使用,以提供更大的DP大小实现设备的充分利用。生成并行组表示为 $p_g{-}t_g{-}d_g{-}d$。
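上述约束可以直接用几行 Python 验证(以图 7 的训练 1-2-2、生成 1-1-2-2 配置为例):

```python
def micro_dp_size(p, t, p_g, t_g):
    """d_g = (p/p_g) * (t/t_g),要求生成并行度整除训练并行度。"""
    assert (p * t) % (p_g * t_g) == 0
    return (p * t) // (p_g * t_g)

# 图 7 的配置:训练 1-2-2 (p-t-d),生成 1-1-2-2 (p_g-t_g-d_g-d)
p, t, d = 1, 2, 2
p_g, t_g = 1, 1
d_g = micro_dp_size(p, t, p_g, t_g)
N_a = p * t * d
# 两种分解给出同一 GPU 数:N_a = p*t*d = p_g*t_g*d_g*d
```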

5.2 3D-Hybrid Engine Workflow

5.2 3D混合引擎工作流程

Between actor training in iteration $i$ of RLHF and actor generation in iteration $i+1$, the actor model parameters need to be resharded and prompts data to be distributed, following the parallel group configurations in the two stages. In iteration $i+1$ of RLHF, 3D-Hybrid Engine gathers the actor model parameters updated in iteration $i$ (step $\textcircled{1}$ in Figure 7), for generation within each micro DP group. Then, the batch of prompts is loaded to each model replica (step $\textcircled{2}$), which generates responses (Generation stage of RLHF). Following this, 3D-Hybrid Engine performs an all-gather operation on the generation results within each micro DP group (step $\textcircled{3}$), and re-partitions model parameters according to the 3D parallelism for actor training (step $\textcircled{4}$). With model weights, prompts and responses correctly re-distributed, the loss of the actor model is computed and actor model weights are updated following the RLHF algorithm (step $\textcircled{5}$) - the actor training stage of iteration $i+1$.

在 RLHF 迭代 $i$ 中的 actor 训练和迭代 $i+1$ 中的 actor 生成之间,actor 模型参数需要根据两个阶段的并行组配置进行重新分片,并将提示数据分发。在 RLHF 迭代 $i+1$ 中,3D-Hybrid Engine 收集在迭代 $i$ 中更新的 actor 模型参数(图 7 中的步骤 $\textcircled{1}$),以在每个微 DP 组内进行生成。然后,将一批提示加载到每个模型副本中(步骤 $\textcircled{2}$),并生成响应(RLHF 的生成阶段)。随后,3D-Hybrid Engine 在每个微 DP 组内对生成结果执行 all-gather 操作(步骤 $\textcircled{3}$),并根据 3D 并行性重新分区模型参数以进行 actor 训练(步骤 $\textcircled{4}$)。在正确重新分配模型权重、提示和响应后,计算 actor 模型的损失,并根据 RLHF 算法更新 actor 模型权重(步骤 $\textcircled{5}$)——迭代 $i+1$ 的 actor 训练阶段。


Figure 8. Model weights resharding. 2 machines each with 4 GPUs are used for actor training and generation.

图 8: 模型权重重新分片。使用两台机器,每台机器配备 4 个 GPU,用于 Actor 训练和生成。

5.3 Zero redundancy model resharding

5.3 零冗余模型重分片

Parallel grouping methods in 3D parallelism are typically as follows: PP and TP groups are formed by assigning consecutive ranks to pipeline stages and tensor shards, respectively; DP groups are constructed by selecting ranks at regular intervals, determined by the product of PP size and TP size. In Figure 8(a), actor training uses 3D parallel groups 1-4-2: there is one PP group for all GPUs (for illustration clarity); the TP groups are [G1, G2, G3, G4], [G5, G6, G7, G8], and the DP groups are [G1, G5], [G2, G6], [G3, G7], [G4, G8]. Suppose the same parallel grouping methods are used but with different parallel sizes, e.g., 1-2-2-2 for generation in Figure 8(a). During the transition from training to generation, 3D-Hybrid Engine applies all-gather operations among the model parallel groups to aggregate all parameters, and then retains only a subset of model weights on each device for its generation, according to the parallel groups the device belongs to. On some GPUs (e.g., G2, G3, G6, G7), there is no overlap between training and generation model weights, and separate memory is needed to maintain weights for subsequent training as well (grey boxes in Figure 8(a)). We call the system HybridFlow-V when 3D-Hybrid Engine uses the above vanilla parallel grouping methods in the two stages.

3D并行中的并行分组方法通常如下:PP组和TP组分别通过为流水线阶段和张量分片分配连续的rank来形成;DP组通过以固定间隔选择rank来构建,间隔由PP大小和TP大小的乘积决定。在图8(a)中,actor训练使用了3D并行分组,1-4-2:所有GPU有一个PP组(为了说明清晰);TP组是[G1, G2, G3, G4]、[G5, G6, G7, G8],DP组是[G1, G5]、[G2, G6]、[G3, G7]、[G4, G8]。假设使用相同的并行分组方法但并行大小不同,例如图8(a)中生成的1-2-2-2。在从训练过渡到生成时,3D-Hybrid Engine在模型并行组之间应用all-gather操作以聚合所有参数,然后根据设备所属的并行组,仅保留每个设备上的部分模型权重用于生成。在某些GPU上(例如G2、G3、G6、G7),训练和生成的模型权重没有重叠,还需要单独的内存来维护后续训练的权重(图8(a)中的灰色框)。当3D-Hybrid Engine在两个阶段使用上述vanilla并行分组方法时,我们称该系统为HybridFlow-V。

Table 2. Transition overhead between training and generation

表 2. 训练和生成之间的转换开销

| | DS-Chat | HybridFlow-V | HybridFlow |
|---|---|---|---|
| 通信量 (Comm. Vol.) | $\frac{tpd-1}{tpd}M$ | $\frac{tp-1}{tp}M$ | $\frac{tp-t_g p_g}{t_g p_g \cdot tp}M$ |
| 峰值内存 (Peak Mem.) | $M$ | $M$ | $\frac{1}{t_g p_g}M$ |
| 冗余 (Redundancy) | $\frac{1}{tpd}M$ | $\frac{1}{tp}M$ | $0$ |

We further design a new parallel grouping method for 3D-Hybrid Engine to use in the generation stage, which eliminates the redundancy in weight storage and leads to minimal memory footprint and communication due to actor model resharding between training and generation. Specifically, we form generation TP and PP groups by selecting ranks at regular intervals, determined by $\frac{t}{t_{g}}$ and $\frac{p}{p_{g}}$, and construct micro DP groups by sequentially assigning ranks along the generation TP or PP dimensions. In Figure 8(b), 1-2-2-2 parallel groups are used in generation: the generation TP groups are [G1, G3], [G2, G4], [G5, G7], [G6, G8]; and the micro DP groups are [G1, G2], [G3, G4], [G5, G6], [G7, G8]. This strategic rearrangement of generation parallel groups leads to overlap between training and generation model weights on each device, enabling reuse of training weights during generation and zero redundancy in device memory usage due to model resharding. In addition, 3D-Hybrid Engine conducts several all-gather operations concurrently, one within each micro DP group, leading to significantly reduced communication overhead.

我们进一步为3D-Hybrid Engine设计了一种新的并行分组方法,用于生成阶段,该方法消除了权重存储中的冗余,并在训练和生成之间由于actor模型的重新分片而实现了最小的内存占用和通信开销。具体来说,我们通过按$\frac{t}{t_{g}}$和$\frac{p}{p_{g}}$的间隔选择rank来形成生成TP和PP组,并沿着生成TP或PP维度顺序分配rank来构建微DP组。在图8(b)中,生成阶段使用了1-2-2-2并行分组:生成TP组为[G1, G3]、[G2, G4]、[G5, G7]、[G6, G8];微DP组为[G1, G2]、[G3, G4]、[G5, G6]、[G7, G8]。这种生成并行组的战略性重新排列使得每个设备上的训练和生成模型权重重叠,从而在生成过程中重用训练权重,并且由于模型重新分片而实现了设备内存使用的零冗余。此外,3D-Hybrid Engine在每个微DP组内同时进行多个all-gather操作,显著减少了通信开销。
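按 §5.3 的分组规则,可以用如下草图重建图 8(b) 的生成并行组(设 G1..G8 对应 rank 0..7;为简化假设 p = p_g = 1,函数名为示意):

```python
def generation_groups(t, t_g, d, d_g):
    """重建生成阶段的 TP 组与微 DP 组(假设 p = p_g = 1,rank 按训练 TP 连续编号)。"""
    n = t * d                      # 总 rank 数
    stride = t // t_g              # 生成 TP 组内相邻 rank 的间隔
    tp_groups, micro_dp_groups = [], []
    for base in range(0, n, t):    # 遍历每个训练 TP 组(连续 t 个 rank)
        for off in range(stride):  # 按间隔 stride 选取 rank,构成生成 TP 组
            tp_groups.append([base + off + k * stride for k in range(t_g)])
        for g in range(t // d_g):  # 连续 d_g 个 rank 构成一个微 DP 组
            micro_dp_groups.append([base + g * d_g + k for k in range(d_g)])
    return tp_groups, micro_dp_groups

# 训练 1-4-2,生成 1-2-2-2;G1..G8 即 rank 0..7
tp_groups, micro_dp_groups = generation_groups(t=4, t_g=2, d=2, d_g=2)
```

输出与图 8(b) 一致:生成 TP 组间隔取 rank,微 DP 组取连续 rank,因而每个设备保留的生成权重分片恰好是其训练分片的子集。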

5.4 Transition overhead

5.4 转换开销

In Table 2, we compare communication overhead and memory footprint during the transition between training and generation stages, among different actor engine designs. We assume the model size of the actor is $M$ and $N_a$ GPUs are used for its training and generation. The actor engine in DeepSpeed-Chat conducts an all-gather operation across all GPUs during transition; HybridFlow-V performs this all-gather within training TP and PP groups. The communication volumes for these operations are $\frac{N_a-1}{N_a}M=\frac{tpd-1}{tpd}M$ for DeepSpeed-Chat and $\frac{tp-1}{tp}M$ for HybridFlow-V, calculated following [13]. Both engines aggregate all model parameters in each GPU's memory before subsequently partitioning model states according to the generation parallel groups, resulting in a peak memory usage of model parameters $M$. As they cannot reuse training weights during generation on some GPUs, training weights need to be maintained on them, amounting to $\frac{1}{tpd}M$ and $\frac{1}{tp}M$ redundant memory consumption, respectively.

表 2 中,我们比较了不同 actor 引擎设计在训练和生成阶段转换期间的通信开销和内存占用情况。我们假设 actor 模型的大小为 $M$,并使用 $N_a$ 个 GPU 进行其训练和生成。DeepSpeed-Chat 中的 actor 引擎在转换期间在所有 GPU 之间执行 all-gather 操作;HybridFlow-V 则在训练 TP 和 PP 组内执行此 all-gather 操作。按照 [13] 计算,这些操作的通信量分别为 DeepSpeed-Chat 的 $\frac{N_a-1}{N_a}M=\frac{tpd-1}{tpd}M$ 和 HybridFlow-V 的 $\frac{tp-1}{tp}M$。两个引擎都会先在每个 GPU 的内存中聚合全部模型参数,再按生成并行组划分模型状态,从而导致模型参数的峰值内存使用量为 $M$。由于它们在某些 GPU 上无法在生成期间复用训练权重,需要在这些 GPU 上额外维护训练权重,分别带来 $\frac{1}{tpd}M$ 和 $\frac{1}{tp}M$ 的冗余内存消耗。

With our parallel grouping method for the generation stage, HybridFlow confines the all-gather operation within each micro DP group. The communication overhead is reduced to $\frac{d_g-1}{tp}M=\frac{tp-t_g p_g}{t_g p_g \cdot tp}M$. Each GPU only needs to collect remote parameters within its micro DP group and can reuse the training weights in generation. Therefore, the peak memory usage of model parameters in HybridFlow precisely matches the model partition size on each GPU in generation, eliminating any redundancy in GPU memory usage.

通过我们在生成阶段采用的并行分组方法,HybridFlow 将 all-gather 操作限制在每个微 DP 组内。通信开销降低为 $\frac{d_g-1}{tp}M=\frac{tp-t_g p_g}{t_g p_g \cdot tp}M$。每个 GPU 只需收集其微 DP 组内的远程参数,并可在生成期间复用训练权重。因此,HybridFlow 中模型参数的峰值内存使用量与生成时每个 GPU 上的模型分区大小恰好一致,消除了 GPU 内存使用中的任何冗余。
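以图 8 的配置(训练 1-4-2、生成 1-2-2-2)代入表 2 的公式,可以验证三种引擎每 GPU 的转换通信量(以模型大小 $M$ 为单位的分数;函数名为示意):

```python
from fractions import Fraction

def comm_vol(engine, p, t, d, p_g=1, t_g=1):
    """每 GPU 转换通信量(以模型大小 M 为单位),按表 2 的公式。"""
    if engine == "ds-chat":
        return Fraction(p * t * d - 1, p * t * d)
    if engine == "hybridflow-v":
        return Fraction(p * t - 1, p * t)
    if engine == "hybridflow":
        d_g = (p * t) // (p_g * t_g)
        return Fraction(d_g - 1, p * t)   # 等于 (tp - t_g*p_g)/(t_g*p_g*tp)
    raise ValueError(engine)

# 图 8 的配置:训练 1-4-2,生成 1-2-2-2
v1 = comm_vol("ds-chat", 1, 4, 2)
v2 = comm_vol("hybridflow-v", 1, 4, 2)
v3 = comm_vol("hybridflow", 1, 4, 2, p_g=1, t_g=2)
```

在该配置下三者分别为 $7/8\,M$、$3/4\,M$ 与 $1/4\,M$,体现了把 all-gather 限制在微 DP 组内带来的节省。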

6 Auto Device Mapping

6 自动设备映射

Our hybrid programming model requires users to input the following configurations, which are referred to as a mapping of the RLHF dataflow to the given devices: (a) device placement of the models in the dataflow; (b) the corresponding parallelism strategy for running each model in each stage.

我们的混合编程模型要求用户输入以下配置,这些配置被称为RLHF数据流到给定设备的映射:(a) 数据流中模型的设备放置;(b) 每个阶段中运行每个模型的相应并行策略。

We provide an efficient algorithm (Algorithm 1) for users to identify the optimized mapping of executing the RLHF dataflow on a given cluster of devices, that minimizes the end-to-end latency of each RLHF iteration. Given a dataflow $D$ , we first explore all possible placement plans $\mathcal{P}$ for the models in the given cluster (Line 3). For example, the PPO algorithm involves four models, resulting in 15 possible placements (from the Bell partition problem [10, 62]), ranging from a completely standalone placement where all models are placed on different devices (e.g., OpenRLHF’s placement) to colocating all models on the same set of devices (e.g., DeepSpeed-Chat’s placement). We refer to colocated models on the same set of GPUs as a colocated set. Models in a colocated set can employ different parallelism strategies across the same set of GPUs. We identify the smallest number of GPUs to be allocated to each of the colocated model sets, $A_{m i n}$ , based on memory consumption of colocated models, ensuring no out-of-memory errors (Line 9).

我们提供了一种高效的算法(算法1),用于帮助用户在给定设备集群上识别执行RLHF数据流的最优映射,从而最小化每次RLHF迭代的端到端延迟。给定一个数据流$D$,我们首先探索了在给定集群中所有可能的模型放置计划$\mathcal{P}$(第3行)。例如,PPO算法涉及四个模型,导致15种可能的放置(来自Bell分区问题[10, 62]),从完全独立的放置(所有模型放置在不同设备上,例如OpenRLHF的放置)到将所有模型放在同一组设备上(例如DeepSpeed-Chat的放置)。我们将放置在同一组GPU上的模型称为一个共置集。共置集中的模型可以在同一组GPU上采用不同的并行策略。我们根据共置模型的内存消耗,确定了分配给每个共置模型集的最小GPU数量$A_{min}$,以确保不会出现内存不足的错误(第9行)。

Next, starting from the minimal GPU allocation in $A_{min}$, we enumerate all feasible device allocations to each colocated model set (Lines 10-12). Given device allocation $A$ to the colocated set and computation workload $W$ of models in the set, we explore optimized parallelism strategies for each model in the auto parallel module, minimizing model execution latency. The workload $W$ includes input and output shapes and computation (training, inference or generation) of each model. In auto parallel, we utilize a simulator module simu to estimate the latency of different parallel strategies, following previous research [42, 84, 90, 92] (outlined in Appendix C).

接下来,从 $A_{m i n}$ 中的最小 GPU 分配开始,我们枚举所有可行的设备分配到每个共置模型集(第 10-12 行)。给定共置集的设备分配 $A$ 和集中模型的计算工作量 𝑊,我们在自动并行模块中为每个模型探索优化的并行策略,以最小化模型执行延迟。工作量 $W$ 包括每个模型的输入和输出形状以及计算(训练、推理或生成)。在自动并行中,我们利用模拟器模块 simu 来估计不同并行策略的延迟,遵循先前的研究 [42, 84, 90, 92](见附录 C 的概述)。

The d_cost module estimates the end-to-end latency of the RLHF dataflow under given model placement and parallelism strategies, by iterating through all stages in the dataflow graph and summing up latencies of all stages (Lines 25-34). For models in the same colocated set and involving computation in the same stage (such as actor and critic both performing model update in RLHF training stage), their execution latencies are summed up (Line 32). For models in different colocated sets, their execution within the same stage can be parallelized, and the latency of the stage is determined by the maximum execution time among different sets (Line 33). We identify the best device placement of the models with their corresponding parallelism strategies, achieving minimal execution time per RLHF iteration (Lines 18-23).

d_cost模块通过遍历数据流图中的所有阶段并累加所有阶段的延迟,估计给定模型放置和并行策略下RLHF数据流的端到端延迟(第25-34行)。对于位于同一共置集并在同一阶段涉及计算的模型(例如actor和critic在RLHF训练阶段都执行模型更新),它们的执行延迟会被累加(第32行)。对于位于不同共置集的模型,它们在相同阶段的执行可以并行化,该阶段的延迟由不同集中最大执行时间决定(第33行)。我们确定模型的最佳设备放置及其相应的并行策略,以实现每次RLHF迭代的最小执行时间(第18-23行)。

```
算法 1: RLHF 数据流的设备映射
 1: 输入: RLHF 数据流图 D, RLHF 数据流中的大语言模型 L = [l1, l2, ..., lk],
    各模型的工作负载 W, GPU 总数 N, 每个 GPU 的内存容量 Q
 2: 输出: RLHF 数据流中模型的最优设备映射
 3: P ← get_placements(D, L, N)
 4: C* ← ∞
 5: best_mapping ← ∅
 6: for all plm ∈ P do
 7:     Cplm ← ∞
 8:     best_plm_alloc ← ∅
 9:     Amin ← get_min_alloc(plm, Q, N)
10:     for all A ∈ enum_alloc(N, Amin) do
11:         L' ← []
12:         for all set ∈ plm do
13:             for all l ∈ set do
14:                 l' ← auto_parallel(A, Amin, l, W)
15:                 L'.append(l')
16:         plm.update(L')
17:         Calloc ← d_cost(D, plm, W)
18:         if Calloc < Cplm then
19:             Cplm ← Calloc
20:             best_plm_alloc ← (plm, A)
21:     if Cplm < C* then
22:         C* ← Cplm
23:         best_mapping ← best_plm_alloc
24: return best_mapping
25: 过程 d_cost(D, plm, W):
26:     s ← D 中的阶段数
27:     c ← [0] × s    // 将每个阶段的延迟初始化为 0
28:     for all set ∈ plm do
29:         cg ← [0] × s
30:         for all i ∈ {0, ..., s−1} do
31:             for all l ∈ set do
32:                 cg[i] ← cg[i] + simu(l, W[i])
33:             c[i] ← max{c[i], cg[i]}
34:     return sum(c)
```
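d_cost 过程的核心思路(共置集内同阶段延迟相加、不同共置集之间同阶段取最大)可以用如下可运行的 Python 草图表达(放置方案与延迟数字均为假设值,代替 simu 的真实估计):

```python
def d_cost(placement, latency):
    """估算一次 RLHF 迭代延迟:共置集内同阶段相加,共置集之间同阶段取最大。"""
    num_stages = len(next(iter(latency.values())))
    total = 0.0
    for i in range(num_stages):
        stage_lat = 0.0
        for colocated_set in placement:           # 不同共置集可并行执行
            set_lat = sum(latency[m][i] for m in colocated_set)  # 同集内串行
            stage_lat = max(stage_lat, set_lat)
        total += stage_lat
    return total

# 假设的两集共置方案与三阶段延迟(生成/准备/训练)
placement = [["actor", "ref"], ["critic", "reward"]]
latency = {"actor": [8.0, 1.0, 4.0], "ref": [0.0, 1.0, 0.0],
           "critic": [0.0, 1.0, 3.0], "reward": [0.0, 1.0, 0.0]}
cost = d_cost(placement, latency)
```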

The complexity of Algorithm 1 is $O\left(\frac{(N-1)!}{(k-1)!(N-k)!}\right)$, where $k$ is the number of models in the dataflow and $N$ is the total number of devices to run the dataflow. This is the worst-case complexity for enumerating all possible device allocations for a placement strategy (i.e., the standalone placement), calculated by assigning $N$ devices to $k$ models (known as the integer partition problem [6]). For better efficiency, we cache parallelism strategies identified for each model on a number of devices $A$, to eliminate redundant searches for the same parallelism strategies when the model is placed on different sets of $A$ GPUs in different placement strategies.

算法 1 的复杂度为 $O\left(\frac{(N-1)!}{(k-1)!(N-k)!}\right)$,其中 $k$ 是数据流中的模型数量,$N$ 是运行数据流的总设备数量。这是枚举所有可能的设备分配策略(即独立放置)的最坏情况复杂度,通过将 $N$ 个设备分配给 $k$ 个模型(称为整数划分问题 [6])计算得出。为了提高效率,我们缓存了在 $A$ 个设备上为每个模型确定的并行策略,以消除在不同放置策略中,当模型放置在不同的 $A$ 个 GPU 集合上时,对相同并行策略的冗余搜索。
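正文的最坏情况复杂度即"把 $N$ 个 GPU 分给 $k$ 个独立放置的模型(每个至少 1 个)"的方案数 $\binom{N-1}{k-1}$,可以用几行代码在小规模上交叉验证:

```python
from math import comb
from itertools import product

def num_allocations(N, k):
    """把 N 个 GPU 分给 k 个独立放置的模型(每个至少 1 个)的方案数 C(N-1, k-1)。"""
    return comb(N - 1, k - 1)

def brute_force(N, k):
    """小规模下的暴力枚举,用于交叉验证组合公式。"""
    return sum(1 for a in product(range(1, N + 1), repeat=k) if sum(a) == N)
```

例如 $N=16$、$k=4$(PPO 的四个模型全部独立放置)时共有 $\binom{15}{3}=455$ 种分配方案,这也说明了缓存每种设备数下并行策略搜索结果的必要性。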

Though we assume $N$ homogeneous GPUs when running the auto mapping algorithm, Algorithm 1 can be readily extended for optimizing model mapping over heterogeneous devices, by considering heterogeneous devices in simu and auto parallel modules [88].

尽管我们在运行自动映射算法时假设了 $N$ 个同构 GPU,但通过在 simu 和 auto parallel 模块中考虑异构设备 [88],算法 1 可以轻松扩展为优化异构设备上的模型映射。

7 Implementation

7 实现

HybridFlow is implemented in around 12k lines of Python code (LoC).

HybridFlow 以约 12k 行 Python 代码 (LoC) 实现。

Hybrid programming model. The hierarchical APIs are implemented with $1.8\mathrm{k}$ LoC. The centralized single controller is built on top of Ray [50] and uses Remote Procedure Calls (RPC) to coordinate the execution order of different models and transfer data between models following the dataflow. These intermediate data are stored in TensorDict [57]. In our multi-controller paradigm for distributed computation, each model function runs on a separate process across various devices, with control messages relayed from each controller's CPU process to the corresponding GPU. Our implementation supports Megatron-LM, PyTorch FSDP, and DeepSpeed as the LLM training and inference engines, and vLLM for auto-regressive generation. In vLLM, we replace the centralized KVCache manager with a distributed manager to align with the multi-controller paradigm.

混合编程模型。分层 API 使用 $1.8\mathrm{k}$ 行代码实现。集中式单一控制器构建在 Ray [50] 之上,并使用远程过程调用 (RPC) 来协调不同模型的执行顺序,并按照数据流在模型之间传递数据。这些中间数据存储在 TensorDict [57] 中。在我们的分布式计算多控制器范式中,每个模型函数在不同设备上的单独进程中运行,控制消息从每个控制器的 CPU 进程传递到相应的 GPU。我们的实现支持 Megatron-LM、PyTorch FSDP 和 DeepSpeed 作为大语言模型训练和推理引擎,并使用 vLLM 进行自回归生成。在 vLLM 中,我们将集中式 KVCache 管理器替换为分布式管理器,以与多控制器范式保持一致。

3D-Hybrid Engine. Its main logic is implemented with 2.4k LoC on top of Megatron-LM and vLLM. We store actor model weights for training and generation stages on separate memory buffers, offload generation weights to the CPU memory during training, reload generation weights back to GPU memory during the transition, and use both buffers in generation. We use NCCL communication primitives [35] to collect and concatenate model parameters in each micro DP group during the transition between training and generation. We offload KVCache to CPU memory after generation and reload it back to GPU in the next iteration.

3D-Hybrid Engine。其主要逻辑在 Megatron-LM 和 vLLM 的基础上用 2.4k LoC 实现。我们将训练和生成阶段的 actor 模型权重存储在不同的内存缓冲区中,在训练期间将生成权重卸载到 CPU 内存,在过渡期间将生成权重重新加载到 GPU 内存,并在生成阶段同时使用两个缓冲区。在训练和生成之间的过渡期间,我们使用 NCCL 通信原语 [35] 来收集和连接每个微 DP 组中的模型参数。在生成之后,我们将 KVCache 卸载到 CPU 内存,并在下一次迭代中将其重新加载到 GPU。
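The per-micro-DP-group gather during the train-to-generation transition can be pictured with a small pure-Python simulation. The layout below (contiguous shard grouping, list concatenation in place of an NCCL all-gather) is an assumed simplification for illustration, not the engine's actual rank mapping:

```python
# Illustrative simulation of train->generation weight resharding
# (hypothetical layout; the real engine uses NCCL all-gather on GPUs).

def reshard(params, train_tp, gen_tp):
    """Regroup train_tp weight shards into gen_tp shards via a simulated
    all-gather inside each micro data-parallel (DP) group."""
    assert train_tp % gen_tp == 0
    micro_dp = train_tp // gen_tp          # d_g = training TP / generation TP
    shard_len = len(params) // train_tp
    train_shards = [params[i*shard_len:(i+1)*shard_len] for i in range(train_tp)]
    # One gather per micro DP group: ranks g*micro_dp .. (g+1)*micro_dp-1
    # concatenate their shards to form generation shard g.
    gen_shards = []
    for g in range(gen_tp):
        group = train_shards[g*micro_dp:(g+1)*micro_dp]
        gen_shards.append([x for shard in group for x in shard])
    return gen_shards

params = list(range(16))                   # stand-in for a flat weight tensor
gen = reshard(params, train_tp=8, gen_tp=2)
assert gen[0] == list(range(8))            # first generation TP shard
assert gen[1] == list(range(8, 16))        # second generation TP shard
```

Each generation shard is assembled with a single concatenation per group, mirroring the "only one all-gather operation per micro DP group" property described in Section 8.4.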

Auto-Mapping Algorithm is implemented with 1.9k LoC, together with three simulators for training, inference, and generation workloads. The algorithm is run before starting the RLHF dataflow on CPU, to generate device mapping and parallelism strategies for dataflow initialization.

Auto-Mapping 算法以 1.9k 行代码实现,并附带三个用于训练、推理和生成工作负载的模拟器。该算法在 CPU 上启动 RLHF 数据流之前运行,以生成设备映射和并行策略用于数据流初始化。

8 Evaluation

8 评估

8.1 Experimental Setup

8.1 实验设置

Testbed. We deploy HybridFlow on a cluster of 16 machines (128 GPUs). Each machine is equipped with 8 NVIDIA A100- 80GB GPUs inter-connected with 600GB/s NVLink. The inter-machine bandwidth is 200Gbps. Our experiments use the following software versions: CUDA12.1, PyTorch 2.1.2, Megatron-core 0.6.0, NCCL 2.18.1, and vLLM 0.3.1.

测试平台。我们在一个由16台机器(128个GPU)组成的集群上部署了HybridFlow。每台机器配备了8个NVIDIA A100-80GB GPU,通过600GB/s的NVLink互连。机器间的带宽为200Gbps。我们的实验使用了以下软件版本:CUDA 12.1、PyTorch 2.1.2、Megatron-core 0.6.0、NCCL 2.18.1和vLLM 0.3.1。

Models and RLHF algorithms. We run the RLHF dataflow (Figure 1) of the PPO [68], ReMax [43] and Safe-RLHF [19] algorithms. PPO is one of the most popular algorithms for RLHF [7, 55], consisting of actor, critic, reference policy, and reward models. Each model is a Llama [73] model with sizes ranging from 7B to 70B. Safe-RLHF has an additional cost model whose architecture and size are the same as the reward model's, and ReMax eliminates the critic model. We use mixed precision for actor and critic training, i.e., BF16 for model parameters and FP32 for gradients and optimizer states, with the Adam [38] optimizer in all experiments. BF16 is used in model inference and auto-regressive generation. If not specified, the experiment results are obtained from PPO.

Baselines. We compare HybridFlow with state-of-the-art RLHF systems including DeepSpeed-Chat [82] v0.14.0, OpenRLHF [30] v0.2.5, and NeMo-Aligner [17] v0.2.0 (detailed in Table 1). NeMo-Aligner does not support the ReMax algorithm. We do not compare HybridFlow to other frameworks such as Trlx [27], Hugging Face DDP [79], and ColossalChat [15], as they are less representative and slower than the above baselines (as reported in [82]).

模型与RLHF算法。我们运行了PPO [68]、ReMax [43] 和 Safe-RLHF [19] 算法的RLHF数据流(图1)。PPO是最流行的RLHF算法之一 [7, 55],由actor、critic、参考策略和奖励模型组成。每个模型都是Llama [73] 模型,大小从7B到70B不等。Safe-RLHF有一个额外的成本模型,其架构和大小与奖励模型相同,而ReMax则去掉了critic模型。我们在actor和critic训练中使用混合精度,即模型参数使用BF16,梯度和优化器状态使用FP32,所有实验均使用Adam [38] 优化器。模型推理和自回归生成使用BF16。如未特别说明,实验结果均来自PPO。

基线。我们将HybridFlow与最先进的RLHF系统进行比较,包括DeepSpeed-Chat [82] v0.14.0、OpenRLHF [30] v0.2.5 和 NeMo-Aligner [17] v0.2.0(详见表1)。NeMo-Aligner不支持ReMax算法。我们没有将HybridFlow与其他框架如Trlx [27]、Hugging Face DDP [79] 和 ColossalChat [15] 进行比较,因为它们不如上述基线具有代表性且速度较慢(如[82] 中所述)。


Figure 11. Safe-RLHF throughput. Numbers in the parentheses are HybridFlow speedups compared with the baselines

图 11: Safe-RLHF 吞吐量。括号中的数字是 HybridFlow 相比于基线的加速比

We use RLHF throughput (tokens/sec) as the performance metric, computed by dividing the total number of tokens in prompts and responses in a global batch by one RLHF iteration time. All reported performance numbers are averaged over 5 training iterations after a warm-up of 10 iterations.

我们使用 RLHF 吞吐量 (tokens/sec) 作为性能指标,通过将全局批次中提示和响应的总 token 数除以一次 RLHF 迭代时间来计算。所有报告的性能数字均在 10 次迭代预热后,取 5 次训练迭代的平均值。
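The metric can be stated as a one-line formula. The function below is a hypothetical helper that mirrors the description above, not code from HybridFlow:

```python
def rlhf_throughput(prompt_tokens: int, response_tokens: int,
                    global_batch_size: int, iter_time_s: float) -> float:
    """Tokens/sec: total prompt+response tokens in a global batch
    divided by one RLHF iteration's wall-clock time."""
    return global_batch_size * (prompt_tokens + response_tokens) / iter_time_s

# With the paper's setting (1024 prompt + 1024 response tokens, batch 1024),
# a hypothetical 60 s iteration gives:
assert rlhf_throughput(1024, 1024, 1024, 60.0) == 1024 * 2048 / 60.0
```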

Datasets and hyperparameters. We perform RLHF on the "Dahoas/full-hh-rlhf" dataset [7] from Hugging Face, which is widely used for LLM alignment [64, 85]. As the baseline systems may not incorporate the continuous-batching optimization [83] during generation, for a fair comparison, we enforce the same length on all responses to be generated. In each experiment, the input prompt length and the output response length are both 1024, and the global batch size of input prompts to the actor model is 1024. The number of PPO epochs is 1 and the number of PPO update iterations per epoch is 8, aligning with previous RLHF research [31, 55, 81].

数据集和超参数。我们在 Hugging Face 的 "Dahoas/full-hh-rlhf" 数据集 [7] 上进行 RLHF,该数据集广泛用于大语言模型对齐 [64, 85]。由于基线系统在生成过程中可能未采用连续批处理优化 [83],为了公平比较,我们强制所有生成的响应长度相同。在每次实验中,输入提示长度和输出响应长度均为 1024,actor 模型输入提示的全局批量大小为 1024。PPO 的 epoch 数为 1,每个 epoch 的 PPO 更新迭代次数为 8,与之前的 RLHF 研究一致 [31, 55, 81]。

8.2 End-to-End performance

8.2 端到端性能

Figures 9, 10, and 11 show RLHF throughput when running PPO, ReMax, and Safe-RLHF respectively. The actor, critic, reference, and reward models in this set of experiments are of the same size, following previous practice [7, 55, 82]. The number of GPUs used in experiments of different model sizes ranges from the smallest number of GPUs that can run RLHF without OOM to 128 GPUs. We do not enable offloading of optimizer states [61] in the experiments, for fair comparison.

Overall performance. We observe that HybridFlow consistently outperforms the baselines across all model scales. In

图 9、图 10 和图 11 分别展示了运行 PPO、ReMax 和 Safe-RLHF 时的 RLHF 吞吐量。本组实验中的 actor、critic、reference 和 reward 模型大小相同,遵循了之前的实践 [7, 55, 82]。实验中使用的 GPU 数量从能够运行 RLHF 而不发生 OOM 的最小 GPU 数量到 128 个 GPU 不等。为了公平比较,我们在实验中没有启用优化器状态卸载 [61]。

总体性能。我们观察到,HybridFlow 在所有模型规模上始终优于基线。


Figure 12. Throughput of HybridFlow under different placements

图 12: 不同部署下 HybridFlow 的吞吐量

Figure 9 for PPO, HybridFlow outperforms DeepSpeed-Chat, OpenRLHF and NeMo-Aligner by $3.67\times$ (up to $7.84\times$), $3.25\times$ (up to $5.93\times$) and $12.52\times$ (up to $20.57\times$), respectively. This is mainly because HybridFlow effectively executes generation, inference, and training in all RLHF stages by sharding the models with different parallelism strategies to fit the various computation workloads. HybridFlow achieves the highest average speedup of $9.64\times$ when training 70B models, as it reduces the transition overhead by up to $71.2%$ and $89.1%$ compared to DeepSpeed-Chat and OpenRLHF, which also incur large inter-machine communication when training with ZeRO-3. Due to the lack of KVCache in its generation engine, NeMo-Aligner’s main performance bottleneck lies in the generation stage, which accounts for up to $81.2%$ of its RLHF iteration time. Similar results can be observed in Figures 10 and 11, validating the efficiency of HybridFlow on running various RLHF algorithms.

在图 9 的 PPO 实验中,HybridFlow 的性能分别比 DeepSpeed-Chat、OpenRLHF 和 NeMo-Aligner 高出 $3.67\times$ (最高 $7.84\times$)、$3.25\times$ (最高 $5.93\times$) 和 $12.52\times$ (最高 $20.57\times$)。这主要是因为 HybridFlow 通过采用不同的并行策略对模型进行分片,以适配各种计算负载,从而在 RLHF 的所有阶段有效执行生成、推理和训练。在训练 70B 模型时,HybridFlow 实现了最高的平均加速比 $9.64\times$,因为与 DeepSpeed-Chat 和 OpenRLHF 相比,HybridFlow 将转换开销减少了高达 $71.2%$ 和 $89.1%$,而后者在使用 ZeRO-3 训练时还会产生大量的跨机器通信。由于生成引擎中缺乏 KVCache,NeMo-Aligner 的主要性能瓶颈在于生成阶段,其占 RLHF 迭代时间的比例高达 $81.2%$。类似的结论可以在图 10 和图 11 中观察到,验证了 HybridFlow 在运行各种 RLHF 算法时的效率。

Scalability. HybridFlow achieves at least $2.09\times$ speedup on 8 GPUs. With increasing GPUs, the strong scaling efficiency of HybridFlow on various model scales is $66.8%$, computed by dividing the throughput at the largest scale by the throughput at the smallest scale, and then by $\frac{\operatorname*{max}.\#\operatorname{of}\mathrm{GPUs}}{\operatorname*{min}.\#\operatorname{of}\mathrm{GPUs}}$ [5], averaged over the three algorithms and all model scales. Scaling to a large number of GPUs with a fixed global batch size results in smaller local batch sizes for each worker, potentially causing GPU underutilization. Running 7B models on 128 GPUs, HybridFlow still outperforms the best baseline OpenRLHF by $1.68\times$, $1.53\times$, and $1.71\times$ on PPO, ReMax, and Safe-RLHF respectively. This can be attributed to HybridFlow’s ability to adapt the best placement strategies for different models and cluster sizes to minimize RLHF time. OpenRLHF performs better in a larger GPU cluster but less efficiently on smaller ones.

可扩展性。HybridFlow 在 8 个 GPU 上至少实现了 $2.09\times$ 的加速。随着 GPU 数量的增加,HybridFlow 在各种模型规模上的强扩展效率为 $66.8%$,通过将最大规模下的吞吐量除以最小规模下的吞吐量,再除以 $\frac{\operatorname*{max}.\#\operatorname{of}\mathrm{GPUs}}{\operatorname*{min}.\#\operatorname{of}\mathrm{GPUs}}$ [5] 计算得出,该结果是对三种算法和所有模型规模的平均值。在固定全局批处理大小的情况下扩展到大量 GPU 会导致每个工作者的本地批处理大小变小,可能导致 GPU 利用率不足。在 128 个 GPU 上运行 7B 模型时,HybridFlow 仍然在 PPO、ReMax 和 Safe-RLHF 上分别以 $1.68\times$、$1.53\times$ 和 $1.71\times$ 的性能优于最佳基线 OpenRLHF。这可以归因于 HybridFlow 能够根据不同模型和集群大小适配最佳放置策略,以最小化 RLHF 时间。OpenRLHF 在较大的 GPU 集群中表现更好,但在较小的集群中效率较低。
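The strong scaling efficiency quoted above reduces to achieved speedup divided by ideal linear speedup. The helper name and the example numbers (other than the roughly 66.8% figure it approximates) are illustrative:

```python
def strong_scaling_efficiency(tput_small, tput_large, gpus_small, gpus_large):
    """Throughput speedup relative to ideal linear speedup when scaling
    from the smallest to the largest cluster."""
    return (tput_large / tput_small) / (gpus_large / gpus_small)

# Perfect linear scaling gives 1.0.
assert strong_scaling_efficiency(1.0, 16.0, 8, 128) == 1.0
# E.g. a ~10.7x throughput gain on 16x the GPUs is ~66.8% efficiency.
assert abs(strong_scaling_efficiency(1.0, 10.69, 8, 128) - 0.668) < 1e-3
```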

8.3 Model Placement

8.3 模型部署

In this experiment, we implement various model placements of the PPO algorithm in HybridFlow, under the same model and cluster settings as in Sec. 8.2: (i) colocate, the placement strategy in DeepSpeed-Chat; (ii) standalone, the placement strategy in OpenRLHF; (iii) split, NeMo-Aligner’s colocation placement (actor and reference policy on the same set of devices, and critic and reward model on another); and (iv) hybridflow, the optimized placement obtained by Algorithm 1.

在本实验中,我们在与第8.2节相同的模型和集群设置下,在HybridFlow中实现了PPO算法的多种模型放置策略:(i) colocate,DeepSpeed-Chat中的放置策略;(ii) standalone,OpenRLHF中的策略;(iii) split,NeMo-Aligner的共置放置(演员和参考策略在同一组设备上,评论者和奖励模型在另一组设备上);(iv) hybridflow,通过算法1获得的优化放置策略。

Comparison of different model placements. Figure 12 reveals that the optimized placement of HybridFlow under different numbers of GPUs varies. From 16 to 64 GPUs, colocating all models on the same set of devices yields the best performance. For 96 to 128 GPUs with 34B models and 96 GPUs with 13B models, the split strategy becomes optimal. The split strategy divides GPUs evenly between the two sets of models, as their sizes are equal. For 13B models on 128 GPUs, the standalone strategy achieves the highest throughput. In this case, HybridFlow allocates 64 GPUs for the actor, 32 for the critic, and 16 each for the reference and reward models. In smaller clusters, the computation of all models can fully utilize GPU resources; the colocate strategy ensures maximum GPU usage in different RLHF stages. In larger clusters, RLHF throughput under the colocate placement fails to scale up linearly, as the batch size is fixed and the computation-to-communication ratio decreases with a larger DP size on more GPUs. The standalone and split strategies place models on different devices with a smaller DP size for each model in larger clusters, facilitating parallel execution of different models in the same stages. In all cases, our Algorithm 1 produces the best placement with the highest training throughput.

不同模型放置策略的对比。图 12 揭示了在不同 GPU 数量下,HybridFlow 的优化放置策略有所不同。在 16 到 64 个 GPU 的情况下,将所有模型放置在同一组设备上能获得最佳性能。对于 96 到 128 个 GPU 的 34B 模型以及 96 个 GPU 的 13B 模型,拆分策略变得最优。拆分策略将 GPU 均匀分配给两组模型,因为它们的规模相等。对于 128 个 GPU 的 13B 模型,独立策略实现了最高的吞吐量。在这种情况下,HybridFlow 为 actor 分配 64 个 GPU,为 critic 分配 32 个 GPU,为参考模型和奖励模型各分配 16 个 GPU。在较小的集群中,所有模型的计算可以充分利用 GPU 资源;同置策略确保了在不同 RLHF 阶段的最大 GPU 使用率。在较大的集群中,由于批量大小固定且计算与通信的比率随着更多 GPU 上 DP 规模的增加而降低,同置放置下的 RLHF 吞吐量无法线性扩展。独立和拆分策略将模型放置在不同的设备上,每个模型在较大集群中具有较小的 DP 规模,从而促进了同一阶段中不同模型的并行执行。在所有情况下,我们的算法 1 都生成了具有最高训练吞吐量的最佳放置策略。

Larger critic and reward model. We further evaluate model placements when running PPO with a 13B actor and reference policy and 70B critic and reward models (larger critic and reward models are expected to produce better alignment [7]). Figure 13 shows that the colocate strategy still outperforms others by $44.8%$ on average with up to 64 GPUs. The split strategy achieves higher throughput with 96 GPUs. When scaling to 128 GPUs, the best placement obtained by Algorithm 1 colocates actor, reference, and reward models on 64 GPUs while allocating the remaining 64 GPUs to critic. On the same number of GPUs, actor and reference policy’s computation time is much smaller than critic and reward model, and colocating the reward model with actor and reference policy reduces the GPU idle time in the experience preparation stage. In general, distributing actor and critic on different devices for parallel execution in the training stage leads to higher throughput in large clusters.

更大的 critic 和奖励模型。我们进一步评估了在使用13B的actor和参考策略以及70B的critic和奖励模型运行PPO时的模型放置情况(更大的critic和奖励模型预计会产生更好的对齐效果[7])。图13显示,在最多64个GPU的情况下,colocate策略仍然平均比其他策略高出44.8%。split策略在96个GPU时实现了更高的吞吐量。当扩展到128个GPU时,算法1获得的最佳放置策略将actor、参考和奖励模型共置于64个GPU上,同时将剩余的64个GPU分配给critic。在相同数量的GPU上,actor和参考策略的计算时间远小于critic和奖励模型,将奖励模型与actor和参考策略共置减少了经验准备阶段的GPU空闲时间。总的来说,在大型集群中,将actor和critic分布在不同设备上以进行并行执行,在训练阶段会带来更高的吞吐量。

8.4 3D-Hybrid Engine

8.4 3D 混合引擎

Transition time comparison. Figure 14 shows the transition time between actor training and generation stages on various model scales, which is the time to reshard model weights from training to generation, under the same settings as in $\S8.2$. OpenRLHF’s transition time includes weight synchronization time between two copies of the actor model on different devices. HybridFlow reduces the transition time by $55.2%$ (11.7s) on average and the transition overhead by up to $89.1%$ (78.2s) with 70B models, while maintaining consistent overhead across different cluster scales. This is attributed to our new parallel grouping method for the generation stage (§5.4). In baseline methods, all model parameters must be collected during transition, necessitating layer-by-layer collections multiple times to prevent OOM. HybridFlow enables zero memory redundancy during transition and requires only one all-gather operation per micro DP group.

转换时间比较。图 14 展示了在不同模型规模下,actor 训练和生成阶段之间的转换时间,即在 $\S8.2$ 相同设置下,将模型权重从训练重新分片到生成的时间。OpenRLHF 的转换时间包括在不同设备上的 actor 模型两个副本之间的权重同步时间。HybridFlow 平均将转换时间减少了 $55.2%$ (11.7s),并在 70B 模型上将转换开销降低了多达 $89.1%$ (78.2s),同时在不同集群规模下保持了一致的开销。这归功于我们在生成阶段的新并行分组方法(§5.4)。在基线方法中,所有模型参数必须在转换期间收集,需要多次逐层收集以防止 OOM。HybridFlow 在转换期间实现了零内存冗余,并且每个微 DP 组只需要一次全收集操作。


Figure 14. Transition time between actor training and generation.

图 14: Actor训练和生成之间的过渡时间


Figure 15. Time breakdown on different generation parallel sizes of the actor model on 16 GPUs.

图 15: 在 16 个 GPU 上不同生成并行大小下 actor 模型的时间分解。

Transition and generation time. We further validate the need to use different parallel sizes in actor training and generation in HybridFlow. In this experiment, all models are colocated on the same set of GPUs, and the KVCache for generation is allocated using the remaining GPU memory (i.e., best-effort allocation). Figure 15 gives the transition and generation time when running RLHF on 16 GPUs with 7B and 13B models, respectively, with training parallel groups 1-8-2 (following the p-t-d convention) and varying the generation TP group size $t_{g}$ from 1 to 8. The generation PP group size remains constant at $p_{g}{=}1$ and the micro DP group size $d_{g}$ is computed as $\frac{8}{t_{g}}$. We observe that applying a smaller generation TP group size, $t_{g}{=}2$ for 7B models and $t_{g}{=}4$ for 13B models, reduces the generation latency by $60.3%$ and $36.4%$, respectively. Conversely, using the same TP size as training ($t_{g}{=}8$), following the NeMo-Aligner approach, results in the largest generation latency due to GPU underutilization. Further reducing $t_{g}$ fails to achieve higher speedup, as a smaller $t_{g}$ necessitates maintaining a larger KVCache per GPU.

转换和生成时间。我们进一步验证了在 HybridFlow 中对 actor 训练和生成使用不同并行规模的必要性。在此实验中,所有模型共置在同一组 GPU 上,生成的 KVCache 使用剩余的 GPU 内存分配(即尽力而为分配)。图 15 展示了在 16 个 GPU 上分别运行 7B 和 13B 模型的 RLHF 时,训练并行组为 1-8-2(遵循 p-t-d 惯例)且生成 TP 组大小 $t_{g}$ 从 1 到 8 变化时的转换和生成时间。生成 PP 组大小保持为 $p_{g}{=}1$,微 DP 组大小 $d_{g}$ 计算为 $\frac{8}{t_{g}}$。我们观察到,对于 7B 模型应用较小的生成 TP 组大小 $t_{g}{=}2$,对于 13B 模型应用 $t_{g}{=}4$,分别将生成延迟减少了 $60.3%$ 和 $36.4%$。相反,使用与训练相同的 TP 大小 ($t_{g}{=}8$)(遵循 NeMo-Aligner 方法),由于 GPU 利用率不足,导致生成延迟最大。进一步减少 $t_{g}$ 未能实现更高的加速,因为较小的 $t_{g}$ 需要在每个 GPU 上维护更大的 KVCache。
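The relation $d_g = 8/t_g$ in this experiment is an instance of the general constraint that the generation model-parallel size must divide the training model-parallel size. A small illustrative helper (names are ours, not HybridFlow's API):

```python
def micro_dp_size(train_pp: int, train_tp: int, gen_pp: int, gen_tp: int) -> int:
    """Micro DP group size d_g formed when resharding a training
    model-parallel group (p x t ranks) into smaller generation groups."""
    train_mp = train_pp * train_tp
    gen_mp = gen_pp * gen_tp
    assert train_mp % gen_mp == 0, "generation MP must divide training MP"
    return train_mp // gen_mp

# Figure 15 setting: training 1-8-2 (p-t-d); generation p_g = 1.
assert micro_dp_size(1, 8, 1, 2) == 4   # t_g = 2 -> d_g = 4
assert micro_dp_size(1, 8, 1, 8) == 1   # same TP as training -> no micro DP
```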


Figure 16. Runtime of the device mapping algorithm. The model size and # of GPUs are simultaneously scaled.

图 16: 设备映射算法的运行时间。模型大小和 GPU 数量同时扩展。

8.5 Algorithm Runtime

8.5 算法运行时间

Figure 16 shows the running time of Algorithm 1, which is significantly shorter than the days of actual RLHF training. A linear growth of running time is exhibited, revealing good scalability of the device mapping algorithm with model size and cluster size. Most of the running time is spent on estimating the execution latency of each model’s parallel strategies. More parallelism strategies are available for a larger model, requiring more simulations to identify the optimal one for each placement plan. Our caching of the optimal parallelism strategies of the models, to be reapplied across different placements, reduces the search time for the best placement to at most half an hour.

图 16 展示了算法 1 的运行时间,其显著短于实际的 RLHF 训练天数。运行时间呈线性增长,表明设备映射算法在模型大小和集群大小方面具有良好的可扩展性。大部分运行时间用于估计每个模型的并行策略的执行延迟。较大的模型有更多的并行策略可供选择,需要更多的模拟来确定每个放置计划的最优策略。我们对模型的最优并行策略进行缓存,以便在不同的放置方案中重复应用,这将搜索最佳放置方案的时间减少到最多半小时。
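The caching described above is essentially memoization keyed by model and device count. A sketch of the idea (the search body is a placeholder heuristic, not the paper's simulator-driven search):

```python
from functools import lru_cache

SEARCH_CALLS = 0  # counts how many real searches run

@lru_cache(maxsize=None)
def best_parallelism(model_name: str, num_devices: int):
    """Memoized parallelism search: a model placed on different sets of
    A GPUs across placements reuses one search for (model, A)."""
    global SEARCH_CALLS
    SEARCH_CALLS += 1
    # Stand-in for the simulator-driven search: pick the largest TP <= 8
    # that divides the device count (placeholder heuristic only).
    tp = max(t for t in (1, 2, 4, 8) if num_devices % t == 0 and t <= num_devices)
    return {"tp": tp, "dp": num_devices // tp}

best_parallelism("actor", 16)
best_parallelism("actor", 16)   # a different placement, same A=16: cache hit
assert SEARCH_CALLS == 1
assert best_parallelism("actor", 16) == {"tp": 8, "dp": 2}
```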

9 Discussions

9 讨论

Fault Tolerance. HybridFlow is orthogonal to existing fault-tolerance approaches [22, 34, 49, 76, 93] and already incorporates checkpointing. Failures can be detected by NCCL errors, and silent data corruption by checksums. Our programming model enables the single controller to coordinate checkpoint operations via RPC, allowing the saving of model states within each ParallelWorker group. This includes saving parameters of actor/critic models, dataloader IDs, and Random Number Generator (RNG) states to ensure system-wide consistency. Moreover, HybridFlow can also employ redundancy-based fault-tolerance methods, such as broadcast parameters and CPU checkpoints, for fast recovery if enough healthy model replicas are available [76, 93].

容错性。HybridFlow 与现有的容错方法 [22, 34, 49, 76, 93] 正交,并且已经内置了检查点功能。故障可以通过 NCCL 错误检测,静默数据损坏可以通过校验和检测。我们的编程模型使单一控制器能够通过 RPC 协调检查点操作,从而在每个 ParallelWorker 组内保存模型状态,包括保存 actor/critic 模型的参数、数据加载器 ID 和随机数生成器 (RNG) 状态,以确保系统级一致性。此外,如果有足够的健康模型副本可用,HybridFlow 还可以采用基于冗余的容错方法(例如广播参数和 CPU 检查点)实现快速恢复 [76, 93]。

Placement Insights. We conclude three main insights for model placement and GPU allocation in RLHF training. 1) Allocating more GPUs to the actor model can reduce the time-consuming generation latency, which cannot be parallelized with other models. 2) When each model’s computation can fully utilize GPU resources, colocating all the models is most effective when training on relatively small-scale clusters. 3) When scaling up to large-scale clusters (i.e., strong scaling), distributing the actor and critic models on different devices for parallel execution in the training and preparation stages helps achieve higher throughput.

模型放置洞察。我们总结了RLHF训练中模型放置和GPU分配的三个主要洞察。1) 为演员模型分配更多GPU可以减少耗时的生成延迟,因为生成过程无法与其他模型并行。2) 当每个模型的计算都能充分利用GPU资源时,在相对较小的集群上训练时,将所有模型放在一起是最有效的。3) 当扩展到大规模集群(即强扩展)时,将演员模型和评论模型分布在不同设备上并行执行训练和准备阶段,有助于实现更高的吞吐量。

Resource multiplexing. HybridFlow enables colocation of models on shared devices by utilizing time-sharing for GPU computation. Recent research in DNN task scheduling has developed fine-grained resource multiplexing techniques, primarily aimed at achieving the service-level objectives of individual tasks [8, 18, 26, 47, 56, 77]. Although the Resource Pool implementation supports parallel execution of colocated models, HybridFlow generally adheres to sequential execution to prevent GPU resource contention and OOM issues, as discussed in Section 2.3. Applying GPU sharing and heterogeneous resources in RLHF training poses distinct challenges, as it seeks to balance the computation workload and manage complex data dependencies among various tasks. Investigating fine-grained auto-mapping algorithms for GPU sharing in RLHF training, coupled with model offload optimization and the integration of heterogeneous devices, would be a promising direction for future research.

From alignment to reasoning. In RLHF for LLM alignment, the reward signal is generated by the reward model. Besides alignment tasks, similar algorithms (e.g., PPO and GRPO [70]) can be applied to other domains, such as code generation and mathematical reasoning. For these tasks, a ground truth may exist for each prompt, which can be determined by assessing the correctness of the output value for each code test case and verifying the accuracy of mathematical results. Therefore, the reward model can be replaced by non-neural-network reward modules, such as a sandbox environment [87] for evaluating generated code or a reward function [14, 65] to validate mathematical results. HybridFlow can seamlessly integrate these reward modules by wrapping them as remote functions and orchestrating their execution within the single-process script, providing a flexible and efficient framework for diverse reinforcement learning applications.

资源复用。HybridFlow 通过利用 GPU 计算的时分复用技术,实现了共享设备上模型的共存。最近在 DNN 任务调度方面的研究已经开发了细粒度的资源复用技术,主要旨在实现单个任务的服务级别目标 [8, 18, 26, 47, 56, 77]。尽管资源池实现支持共存模型的并行执行,但 HybridFlow 通常遵循顺序执行,以防止 GPU 资源争用或 OOM 问题,如第 2.3 节所述。在 RLHF 训练中应用 GPU 共享和异构资源带来了独特的挑战,因为它需要平衡计算工作负载并管理各种任务之间的复杂数据依赖关系。研究 RLHF 训练中的细粒度自动映射算法,结合模型卸载优化和异构设备的集成,将是未来研究的一个有前景的方向。

从对齐到推理。在用于大语言模型对齐的 RLHF 中,奖励信号由奖励模型生成。除了对齐任务外,类似的算法(例如 PPO 和 GRPO [70])可以应用于其他领域,如代码生成和数学推理。对于这些任务,每个提示可能存在一个基本事实,可以通过评估每个代码测试用例的输出值的正确性以及验证数学结果的准确性来确定。因此,奖励模型可以被非神经网络的奖励模块取代,例如用于评估生成代码的沙盒环境 [87],或用于验证数学结果的奖励函数 [14, 65]。HybridFlow 可以通过将这些奖励模块封装为远程函数并在单进程脚本中编排它们的执行,无缝集成这些奖励模块,为多样化的强化学习应用提供灵活高效的框架。

10 Related Work

10 相关工作

RL frameworks. There have been plenty of frameworks for RL, ranging from general-purpose RL systems designed for small-scale DNNs [12, 25, 28, 39, 45, 46] to RLHF systems specifically optimized for LLMs [15, 17, 30, 80, 82]. We have thoroughly examined closely related work in $\S2$, and we discuss more RL frameworks in this section. These RL frameworks [12, 25, 28, 39, 74], similar to recent RLHF systems, use a hodgepodge of multi-controller frameworks to implement their algorithms. They establish multiple long-running distributed programs, with each component coordinating the execution order through hard-coded data synchronization. Gear [74] further optimized the experience replay segment of the RL pipeline. However, all these frameworks fail to support LLM training, inference, and generation in RLHF.

强化学习 (Reinforcement Learning, RL) 框架。已有许多用于 RL 的框架,从针对小规模深度神经网络 (Deep Neural Networks, DNNs) 的通用 RL 系统设计 [12, 25, 28, 39, 45, 46],到专门为大语言模型 (Large Language Models, LLMs) 优化的 RLHF 系统 [15, 17, 30, 80, 82]。我们在 $\S2$ 中详细检查了密切相关的工作,并将在本节中讨论更多 RL 框架。这些 RL 框架 [12, 25, 28, 39, 74] 与最近的 RLHF 系统类似,使用多控制器的混合框架来实现其算法。它们建立了多个长期运行的分布式程序,每个组件通过硬编码的数据同步来协调执行顺序。Gear [74] 进一步优化了 RL 管道的经验回放部分。然而,所有这些框架都无法支持 RLHF 中的 LLM 训练、推理和生成。

LLM training and serving systems. TorchDDP [57] and Horovod [69] support data-parallel training. ByteScheduler [58] and DeepSpeed [60] extend data parallelism with communication and memory optimizations. Numerous systems [23, 36, 48, 54, 71, 75, 89] optimize large model training through model parallelisms such as tensor parallelism and pipeline parallelism, which partition models across devices. LLM serving systems [3, 16, 40, 72, 83, 92] also adopt data and model parallelism to accelerate auto-regressive generation, with specialized optimizations like continuous batching [83] and chunked prefill [3]. Note that all the above frameworks adopt the multi-controller paradigm for efficient computation.

大语言模型训练与服务系统。TorchDDP [57] 和 Horovod [69] 支持数据并行训练。Byte Scheduler [58] 和 DeepSpeed [60] 通过通信和内存优化扩展了数据并行性。许多系统 [23, 36, 48, 54, 71, 75, 89] 通过张量并行和管道并行等模型并行技术优化了大模型训练,将模型分割到多个设备上。大语言模型服务系统 [3, 16, 40, 72, 83, 92] 也采用了数据和模型并行技术,通过连续批处理 [83] 和分块预填充 [3] 等专门优化加速自回归生成。需要注意的是,上述所有框架都采用多控制器范式以实现高效计算。

Dataflow systems. Dataflow systems like MapReduce [21], Spark [86], Dryad [33], and Naiad [51] are popular for analytics and ML workloads, but they lack support for dynamic task graphs. Ray [50] unifies task-parallel and actor programming models in a single dynamic task graph and implements a scalable distributed scheduler and a global control store, which is adopted by many RL frameworks [45, 46]. Pathways [9], a closed-source project for TPUs, is designed to easily express complex parallelism patterns and fine-grained control flow within a single DNN model, such as pipeline parallelism and Mixture-of-Experts with sparse computation. It employs an asynchronous distributed dataflow design that enables parallel control-plane execution despite data dependencies, reducing the dispatch overhead of the single-controller paradigm. Its main focus lies on single-model training, requiring complex compilation of each sub-network of a DNN model. HybridFlow could integrate Pathways as a submodule to implement the computation of models in the RLHF dataflow.

数据流系统。MapReduce [21]、Spark [86]、Dryad [33] 和 Naiad [51] 等数据流系统在分析和机器学习 (ML) 工作负载中很受欢迎,但它们缺乏对动态任务图的支持。Ray [50] 在单个动态任务图中统一了任务并行和 Actor 编程模型,并实现了可扩展的分布式调度器和全局控制存储,许多强化学习 (RL) 框架都采用了它 [45, 46]。Pathways [9] 是一个针对 TPU 的闭源项目,旨在轻松表达单个深度神经网络 (DNN) 模型中的复杂并行模式和控制流,例如管道并行和稀疏计算的专家混合模型。它采用异步分布式数据流设计,尽管存在数据依赖关系,仍能并行执行控制平面,从而减少单控制器范式中的调度开销。其主要关注点在于单模型训练,需要对 DNN 模型的每个子网络进行复杂编译。HybridFlow 可以将 Pathways 作为子模块集成,以实现在 RLHF 数据流中的模型计算。

11 Conclusion

11 结论

HybridFlow is an RLHF framework that enables flexible representation and efficient execution of diverse RLHF algorithms. We propose a hybrid programming model that allows users to easily build RLHF dataflow in a few lines of code by encapsulating distributed computation of different LLMs into primitive APIs and hiding the complexity of data resharding among nodes. Our 3D-Hybrid Engine ensures efficient execution of training and generation of the actor model, with zero memory redundancy and significantly reduced communication overhead for model parameter resharding. Furthermore, our effective mapping algorithm optimizes GPU allocation and placement of models in the RLHF dataflow. Extensive experiments demonstrate that HybridFlow achieves $1.53\times$ to $20.57\times$ speedup compared to state-of-the-art RLHF systems under various model sizes and cluster scales.

HybridFlow 是一个 RLHF 框架,能够灵活表示并高效执行多种 RLHF 算法。我们提出了一种混合编程模型,通过将不同大语言模型的分布式计算封装到基础 API 中,并隐藏节点间数据重分片的复杂性,使用户能够轻松地用几行代码构建 RLHF 数据流。我们的 3D-Hybrid 引擎确保了演员模型的训练和生成的高效执行,实现了零内存冗余,并显著减少了模型参数重分片的通信开销。此外,我们的有效映射算法优化了 GPU 分配和 RLHF 数据流中模型的放置。大量实验表明,在不同模型大小和集群规模下,HybridFlow 相较于最先进的 RLHF 系统实现了 $1.53\times$ 到 $20.57\times$ 的加速。

Acknowledgments

致谢

We would like to thank our shepherd Y. Charlie Hu and the anonymous reviewers for their constructive feedback. We thank Xin Liu, Yangrui Chen, and Ningxin Zheng for their insightful feedback on this project. This work was supported in part by a ByteDance Research Collaboration Project, and grants from Hong Kong RGC under the contracts HKU 17204423 and C7004-22G (CRF).

我们感谢我们的指导者 Y. Charlie Hu 以及匿名评审人员提供的建设性反馈。我们感谢 Xin Liu、Yangrui Chen 和 Ningxin Zheng 对本项目的深刻反馈。本工作部分得到了字节跳动研究合作项目的支持,以及香港 RGC 根据合同 HKU 17204423 和 C7004-22G (CRF) 提供的资助。

References

参考文献

Table 3. The transfer protocols in HybridFlow.

表 3: HybridFlow 中的传输协议。

| 传输协议 | 分发功能 | 收集功能 | 使用场景 |
| --- | --- | --- | --- |
| ONE_TO_ALL | 将数据广播到所有 rank。 | 从所有 rank 收集数据。 | 所有工作节点方法具有相同的输入并运行相同的代码,例如模型初始化。 |
| 3D_PROTO | 拆分数据,分散到所有数据并行 (DP) rank 并在组内广播。 | 从所有 DP 组中 p=-1, t=0 的工作节点收集并拼接数据。 | 模型在每个数据并行组内的多个工作节点之间分片。模型的输出仅存在于最后一个流水线阶段,并在数据并行组之间复制。这是 Megatron-LM、DeepSpeed 等框架中 3D 并行训练的典型场景。 |
| 3D_ALL_MICRO_DP | 按微 DP 大小拆分数据,分散到所有微 DP 组并在组内广播。 | 从所有微 DP 组中 local_rank=0 的工作节点收集并拼接数据。 | 与 HybridEngine 一起使用,用于处理策略模型在训练和推理之间切换时的 3D 并行方案。 |
| 3D_PP_ONLY | 将数据广播到所有 rank。 | 从所有 PP 组中 t=0, d=0 的工作节点收集并拼接数据。 | 用于检查权重名称,因为它们在 TP 和 DP 组中是相同的。 |
| DP_PROTO | 将数据拆分为批次并分散到所有 DP rank。 | 从所有 DP rank 收集并拼接数据。 | 在数据并行模式下训练模型。 |
| ALL_TO_ALL | 无操作。 | 从所有 rank 收集数据。 | 调试时使用。用户可以手动定义每个工作节点的输入并分别检查其输出。 |

A Primitive APIs in HybridFlow

A HybridFlow 中的原语 API

In HybridFlow, we implemented the primitives of each model in RLHF training by inheriting from the 3DParallelWorker, FSDPWorker and ZeROWorker classes. The functions of these model classes are designed to decouple the distributed computation code and provide fundamental operations in RLHF for the users. This primitive design is compatible with the auto-regressive generation, forward pass, backward pass, and model update operations in existing distributed inference and training frameworks. Users can easily customize the RLHF training dataflow (by adapting the numerical computation in the provided functions) according to the algorithm’s design and benefit from reusing the underlying distributed computation implementation. We illustrate the meaning and the actual computations of these APIs in Table 4.

在 HybridFlow 中,我们通过继承 3DParallelWorker、FSDPWorker 和 ZeROWorker 来实现 RLHF 训练中每个模型的原语。这些模型类的功能旨在解耦分布式计算代码,并为用户提供 RLHF 中的基本操作。这种原语设计与现有分布式推理和训练框架中的自回归生成 (auto-regressive generation)、前向传播 (forward pass)、反向传播 (backward pass) 以及模型更新操作兼容。用户可以根据算法设计轻松定制 RLHF 训练数据流(通过调整提供函数中的数值计算),并受益于底层分布式计算实现的重用。我们在表 4 中说明了这些 API 的含义和实际计算。
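As an illustration of how the Table 4 primitives compose, the stub below wires a toy PPO step out of the listed method names; all bodies are placeholders (not HybridFlow's implementation) and the advantage computation is deliberately simplified to $A = r - V$:

```python
# Hypothetical stubs following the API names in Table 4.

class ActorStub:
    def generate_sequence(self, prompts):
        return [p + ["<resp>"] for p in prompts]      # fake responses
    def compute_log_prob(self, seqs):
        return [0.0 for _ in seqs]                    # placeholder log-probs
    def update_actor(self, seqs, advantages):
        return {"loss": sum(advantages) / len(advantages)}

class CriticStub:
    def compute_values(self, seqs):
        return [1.0 for _ in seqs]                    # placeholder values

def ppo_step(actor, critic, prompts, rewards):
    seqs = actor.generate_sequence(prompts)
    values = critic.compute_values(seqs)
    advantages = [r - v for r, v in zip(rewards, values)]  # simplified A = r - V
    return actor.update_actor(seqs, advantages)

out = ppo_step(ActorStub(), CriticStub(), [["p1"], ["p2"]], [1.5, 0.5])
assert out["loss"] == 0.0  # mean of advantages 0.5 and -0.5
```

In the real framework these methods run as distributed workers; the point here is only that an RLHF algorithm reduces to a short single-process composition of the primitives.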

B Transfer Protocols

B 传输协议

We implemented transfer protocols that cover all common use cases of data resharding between models in the RLHF dataflow. Users can utilize these pre-defined protocols to generate any RLHF dataflow. Moreover, users can easily define their own transfer protocols by implementing a collect function and a distribute function. Transfer protocols decouple the complicated data resharding from distributed training. We denote $p$, $t$, $d$ as the rank of a worker in its pipeline-, tensor-, and data-parallel group, respectively. We illustrate these predefined protocols in Table 3.

我们实现了涵盖 RLHF 数据流中模型间数据重分片所有常见用例的传输协议。用户可以利用这些预定义的协议生成任何 RLHF 数据流。此外,用户可以通过实现一个收集函数和一个分发函数来轻松定义自己的传输协议。传输协议将复杂的数据重分片与分布式训练解耦。我们用 $p$、$t$、$d$ 分别表示工作节点在流水线并行、张量并行和数据并行组中的 rank。我们在表 3 中展示了这些预定义的协议。
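A transfer protocol reduces to a (distribute, collect) function pair. The sketch below mimics DP_PROTO from Table 3 with plain lists; the function names are hypothetical, not HybridFlow's API:

```python
# Illustrative (distribute, collect) pair mirroring Table 3's DP_PROTO.

def dp_distribute(data, dp_size):
    """Split a global batch into dp_size chunks, one per DP rank."""
    chunk = len(data) // dp_size
    return [data[i*chunk:(i+1)*chunk] for i in range(dp_size)]

def dp_collect(outputs):
    """Concatenate per-rank outputs back into a global batch."""
    return [x for part in outputs for x in part]

batch = list(range(8))
shards = dp_distribute(batch, dp_size=4)
assert shards == [[0, 1], [2, 3], [4, 5], [6, 7]]
assert dp_collect(shards) == batch           # round trip restores the batch
```

A user-defined protocol would supply the same two functions, with the framework invoking distribute on the producer's output and collect on the consumer's input.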

C Auto-Parallelism Algorithm

C 自动并行化算法

Algorithm 2 outlines the search process of the optimal parallelism strategy of each model. Starting from the minimal model parallelism size of each model (to prevent OOM when colocating with multiple workers), we enumerate all feasible parallel configurations based on the number of GPUs and the number of GPUs per machine $U$. The default value of $U$ is 8. We use the simu module to estimate the latency of each model based on its workload. This module includes three simulators for training, inference, and generation workloads, all analytical models following previous research [42, 84, 92]. The training and inference workloads are compute-bound while the generation workload is memory-bound. For the actor model, we first find the parallelism strategy for training and record the memory usage in the training stage. During actor generation, KVCache requirements are calculated using the batch size and max sequence length. If the model-parallel size for the generation stage cannot accommodate both parameters and KVCache, we increase it. Then, we seek the optimal strategy with the corresponding KVCache allocation by comparing the latency estimations. Developing a comprehensive autoregressive generation simulator that accounts for variable KVCache sizes could further enhance the auto-mapping process in RLHF research.

算法 2 概述了每个模型的最优并行策略搜索过程。从每个模型的最小模型并行大小开始(以防止与多个工作节点共存时出现 OOM),我们基于 GPU 数量以及每台机器的 GPU 数量 $U$ 枚举所有可行的并行配置。$U$ 的默认值为 8。我们使用 simu 模块根据工作负载估计每个模型的延迟。该模块包括用于训练、推理和生成工作负载的三个模拟器,均为遵循先前研究 [42, 84, 92] 的分析模型。训练和推理工作负载是计算密集型,而生成工作负载是内存密集型。对于 actor 模型,我们首先找到训练的并行策略,并记录训练阶段的内存使用情况。在 actor 生成期间,使用批量大小和最大序列长度计算 KVCache 需求。如果生成阶段的模型并行大小无法同时容纳参数和 KVCache,我们会增加它。然后,通过比较延迟估计,我们寻找具有相应 KVCache 分配的最佳策略。开发一个考虑可变 KVCache 大小的全面自回归生成模拟器,可以进一步增强 RLHF 研究中的自动映射过程。

Algorithm 2 Auto-Parallelism Algorithm

算法 2: 自动并行算法

Table 4. Key functions provided in each model class. The users can use these provided functions to construct various RLHF algorithms in a few lines of code.

表 4: 每个模型类提供的关键功能。用户可以使用这些提供的函数通过几行代码构建各种 RLHF 算法。

| 模型 | API | 计算 | 解释 |
| --- | --- | --- | --- |
| Actor | generate_sequence | 自回归生成 | 基于一批提示,Actor 模型生成一批序列。 |
| Actor | compute_log_prob | 前向传播 | Actor 模型计算提示和响应中每个 Token 的对数概率。该对数概率与使用相同模型精度生成时返回的对数概率相同。(在 PPO 中可选) |
| Actor | compute_loss | 前向传播 | Actor 模型基于预训练数据集计算预训练损失 [7, 19, 55]。 |
| Actor | update_actor | 前向、反向传播和模型更新 | 基于优势、回报(由 compute_advantage 计算)和预训练损失,Actor 模型计算训练损失并更新其权重。我们为多种 RLHF 算法实现了各种损失,包括 PPO [55]、Safe-RLHF [19]、ReMax [43]、GRPO [70] 等。 |
| Critic | compute_values | 前向传播 | Critic 模型计算每个提示和响应的值。 |
| Critic | update_critic | 前向、反向传播和模型更新 | 基于这些值和回报,Critic 模型计算平方误差损失并更新其权重。我们还为多种 RLHF 算法实现了 Critic 损失,包括 PPO [55]、Safe-RLHF [19]、ReMax [43]、GRPO [70] 等。 |
| Reference Policy | compute_ref_log_prob | 前向传播 | 参考模型计算提示和响应中每个 Token 的参考对数概率。该对数概率用作评估 Actor 模型差异并约束其学习过程的基准。 |
| Reward | compute_reward | 前向传播 | 奖励模型进行前向计算,为给定的提示和响应计算分数。奖励可以是 Token 级别或样本级别。 |
| — | compute_advantage | 数值计算 | 基于奖励模型给出的奖励值和当前策略模型的响应计算优势值。该计算不涉及模型前向传播。 |