LMFlow: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models
Shizhe Diao♡∗, Rui Pan♡∗, Hanze Dong♡∗, KaShun Shum♡, Jipeng Zhang♡, Wei Xiong♠, Tong Zhang♠ ♡The Hong Kong University of Science and Technology ♠University of Illinois Urbana-Champaign {sdiaoaa, rpan, hdongaj}@ust.hk tozhang@illinois.edu
Abstract
Foundation models have demonstrated a great ability to achieve general human-level intelligence far beyond traditional approaches. As the technique keeps attracting attention from the AI community, an increasing number of foundation models are becoming publicly accessible. However, a significant shortcoming of most of these models lies in their performance in specialized-domain and task-specific applications, necessitating domain- and task-aware finetuning to develop effective scientific language models. As the number of available foundation models and specialized tasks keeps growing, the job of training scientific language models becomes highly nontrivial. In this paper, we initiate steps to tackle this issue. We introduce an extensible and lightweight toolkit, LMFlow, which aims to simplify the domain- and task-aware finetuning of general foundation models. LMFlow offers a complete finetuning workflow for a foundation model to support specialized training with limited computing resources. Furthermore, it supports continuous pretraining, instruction tuning, parameter-efficient finetuning, alignment tuning, inference acceleration, long context generalization, model customization, and even multimodal finetuning, along with carefully designed and extensible APIs. This toolkit has been thoroughly tested and is available at github.com/OptimalScale/LMFlow.
1 Introduction
Foundation models (FMs), and in particular large language models (LLMs), have demonstrated general abilities to perform different tasks beyond what was possible previously. While a number of pretrained large models, including GPT-J (Wang and Komatsuzaki, 2021), Bloom (Scao et al., 2022), LLaMA (Touvron et al., 2023a,b), etc., are publicly available and have already been incorporated into the Hugging Face model repository (HuggingFace, 2022), there is no publicly available toolkit that can be easily used to perform finetuning and inference for these different models. For specialized domains or tasks, it is necessary to further finetune such LLMs to achieve improved performance on such domains or tasks. The purpose of this package is to offer a simple-to-use and lightweight toolkit so that developers and researchers can perform efficient finetuning and inference of scientific language models with limited resources. The typical process to train a scientific language model is shown in Figure 1.
LMFlow enhances and streamlines the aforementioned finetuning procedures, enabling the efficient and effective training of a scientific language model. We focus on improving training speed: for example, starting from a 7-billion-parameter LLaMA model, it takes only one Nvidia 3090 GPU and five hours to train a medical LLaMA comparable to ChatGPT. Beyond speed, we also aspire to achieve higher model performance. We used this framework to train Medical LLaMA, a series of models with 7-billion, 13-billion, 33-billion, and
Table 1: Comparison with competing packages. Cont. PT: continuous pretraining. FT: finetuning. RLHF: reinforcement learning from human feedback. Deploy.: deployment. Adapt.: domain/task adaptation. Acc.: acceleration techniques for finetuning and inference. LC: long context generalization. VE: vocabulary extension. MM: multimodal training.
| Package | Cont. PT | FT | RLHF | Deploy. | Adapt. | Acc. | LC | VE | MM |
|---|---|---|---|---|---|---|---|---|---|
| Transformers (Wolf et al., 2020) | | | | | | | | | |
| Accelerate (Gugger et al., 2022) | | | | | | | | | |
| Deepspeed (Rasley et al., 2020) | | | | | | | | | |
| Trl (von Werra et al., 2020) | | | | | | | | | |
| LMFlow (ours) | | | | | | | | | |
65-billion parameters, on a single machine and have released the model weights for academic research. Using LMFlow, anyone can train their own scientific or personalized language models. Each person can choose the appropriate foundation model according to their available resources, for tasks such as question answering, companionship, and expert consultations in various domains. The larger the model and data size, the longer the training time and the better the results. Compared with existing packages, LMFlow encompasses a multitude of features that are absent in others, such as the support for long context generalization, as shown in Table 1. Most importantly, LMFlow stands out as a comprehensive, full-cycle foundation model adaptation toolkit. While other packages excel in specific areas like finetuning, they lack functionalities like RLHF and others. To our knowledge, LMFlow is the first to offer a complete pipeline that integrates all these processes. This holistic toolkit allows for more robust and adaptable language model training and inference, setting a new standard in the field of natural language processing.
2 Related Work
In recent years, the finetuning of large language models (LLMs) has gained significant attention, especially for scientific domain applications. The necessity of adapting these general-purpose models to specific domains or tasks has led to the development of various scientific language models. Lehman et al. (2023) conducted an extensive empirical analysis of the performance of various language models on clinical tasks and found that specialized clinical models, even smaller in size, significantly outperform larger general-domain models when finetuned on domain-specific data. This emphasizes the importance of domain specialization in achieving higher accuracy in safety-critical fields like healthcare. As a result, a series of scientific large models have emerged, including but not limited to language models for Science (Beltagy et al., 2019; Luu et al., 2021; Taylor et al., 2022), Mathematics (Yue et al., 2023; Yu et al., 2023; Gao et al., 2023), Physics (Nguyen et al., 2023; Zheng et al., 2023b; Perkowski et al., 2024), Chemistry and Materials Science (Cao et al., 2023; Shetty et al., 2023; Rubungo et al., 2023), Biology and Medicine (Lee et al., 2020; Zhang et al., 2023; Singhal et al., 2023; Wu et al., 2023; Han et al., 2023; Wang et al., 2023; Yang et al., 2024), and Information Retrieval (Lassance et al., 2023). We recommend that readers refer to a paper list of scientific language models, which includes a more comprehensive range of works related to scientific language models. Among these works, LMFlow has successfully helped in training AstroLLaMA-Chat (Perkowski et al., 2024) and MarineGPT (Zheng et al., 2023b). The Medical LLaMA trained in the medical domain within this paper also demonstrates the effectiveness of LMFlow. In summary, our proposed LMFlow offers a comprehensive toolkit for efficient and effective finetuning of foundation models across various specialized domains.
3 Toolkit Overview
3.1 System Design
An illustration of the LMFlow system design is shown in Figure 1. There are four stages for improving the performance of a publicly available foundation model. The first stage is domain adaptation, which involves modifying the model to better handle a specific domain by training it on that domain. The second stage is task adaptation, which involves adapting the model to perform a specific task, such as summarization, question-answering, and translation. The third stage is instruction finetuning, which involves adjusting the model's parameters based on instructional question-answer pairs. The final stage is reinforcement learning with human feedback, which involves using human feedback to further align the model with human preferences. LMFlow provides a complete finetuning workflow for these four stages, supporting large language models' specialized training with limited computing resources. In particular, LMFlow supports continuous pretraining, instruction tuning, RLHF, efficient tuning, and accelerated inference, detailed in the following subsections.

Figure 1: The system design of LMFlow. Starting from a publicly available foundation model, there are four possible stages including (1) domain adaptation, (2) task adaptation, (3) instruction finetuning, and (4) reinforcement learning with human feedback.
3.2 Installation
LMFlow has been fully tested on Linux OS (Ubuntu 20.04) and can be installed by executing the following commands.
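The commands below follow the project README at the time of writing; the conda environment name and Python version are illustrative and may differ in the current README:

```shell
# Setup as suggested by the LMFlow README (environment name and
# Python version are illustrative):
git clone https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
pip install -e .
```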
3.3 Data Format
LMFlow accepts several .json files as input. Users can provide a list of .json files under a specified dataset directory. For example,
```
|- path_to_dataset
  |- data_1.json
  |- data_2.json
  |- another_data.json
  |- ...
```
Each .json file shall have the following format (for example, three instances with four keys each):
```json
{
  "type": "TYPE",
  "instances": [
    {
      "KEY_1": "VALUE_1.1",
      "KEY_2": "VALUE_1.2",
      "KEY_3": "VALUE_1.3",
      "KEY_4": "VALUE_1.4"
    },
    {
      "KEY_1": "VALUE_2.1",
      "KEY_2": "VALUE_2.2",
      "KEY_3": "VALUE_2.3",
      "KEY_4": "VALUE_2.4"
    },
    {
      "KEY_1": "VALUE_3.1",
      "KEY_2": "VALUE_3.2",
      "KEY_3": "VALUE_3.3",
      "KEY_4": "VALUE_3.4"
    }
  ]
}
```
where TYPE indicates the dataset type and defines the set of keys { KEY_1, KEY_2, ... } and their corresponding interpretations. The two supported .json formats are detailed as follows.
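A few lines of Python suffice to load and sanity-check such a dataset directory; the helper below is an illustrative sketch, not part of LMFlow's API:

```python
import json
from pathlib import Path

def load_dataset_dir(path):
    """Load every .json file under a dataset directory and check the
    shared layout: a top-level "type" string plus a list of "instances"
    whose keys are consistent within each file."""
    instances = []
    for json_file in sorted(Path(path).glob("*.json")):
        with open(json_file, encoding="utf-8") as f:
            data = json.load(f)
        if "type" not in data or "instances" not in data:
            raise ValueError(f"{json_file}: missing 'type' or 'instances'")
        # iterating a dict yields its keys, so frozenset(inst) is the key set
        keys = {frozenset(inst) for inst in data["instances"]}
        if len(keys) > 1:
            raise ValueError(f"{json_file}: inconsistent instance keys")
        instances.extend(data["instances"])
    return instances
```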
TextOnly This is the most common dataset type, which contains only raw text in each sample. This type of dataset can be used as the training set for text decoder models, or as the input of decoder models / encoder-decoder models. Its format is as follows (three instances, for example):
```json
{
  "type": "text_only",
  "instances": [
    { "text": "SAMPLE_TEXT_1" },
    { "text": "SAMPLE_TEXT_2" },
    { "text": "SAMPLE_TEXT_3" }
  ]
}
```
Text2Text This is the dataset type mostly used for inference, which contains a pair of texts in each sample. This type of dataset can be used as the training set for text encoder-decoder models, or as question-answer pairs for evaluating model inference. Its format is as follows (three instances, for example):
```json
{
  "type": "text2text",
  "instances": [
    {
      "input": "SAMPLE_INPUT_1",
      "output": "SAMPLE_OUTPUT_1"
    },
    {
      "input": "SAMPLE_INPUT_2",
      "output": "SAMPLE_OUTPUT_2"
    },
    {
      "input": "SAMPLE_INPUT_3",
      "output": "SAMPLE_OUTPUT_3"
    }
  ]
}
```
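As an illustration, question-answer pairs can be serialized into this Text2Text layout with standard-library JSON tooling (the `write_text2text` helper below is hypothetical, not an LMFlow utility):

```python
import json

def write_text2text(pairs, path):
    """Serialize (input, output) string pairs into the Text2Text
    dataset format described above."""
    data = {
        "type": "text2text",
        "instances": [{"input": q, "output": a} for q, a in pairs],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```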
3.4 Continuous Pretraining
The endeavor to bridge the divide between pretraining domains and downstream domains has led to the adoption of a prevalent approach known as continuous pretraining (Beltagy et al., 2019; Alsentzer et al., 2019; Huang et al., 2019; Lee et al., 2020), which involves ongoing pretraining on an extensive collection of unlabeled data specific to a given domain. LMFlow supports continuous pretraining natively, which is an effective way to adapt LLMs to a specific domain. Users just need to collect a set of unlabeled data and prepare it in the TextOnly data format. The subsequent process is handled by autoregressive training.
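Autoregressive training reduces to next-token prediction: given a TextOnly sample tokenized into ids, the targets are simply the inputs shifted by one position. A minimal sketch (tokenizer and model omitted):

```python
def shift_for_causal_lm(token_ids):
    """Build (input, target) sequences for next-token prediction:
    the model sees tokens [0..n-2] and is trained to predict
    tokens [1..n-1]."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets
```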
3.5 Instruction Tuning
Instruction tuning (Sanh et al.; Wei et al.; Chung et al., 2022; Muennighoff et al., 2022; Wang et al., 2022), also called supervised finetuning, is an approach to enhance the performance of language models by training them to follow natural language instructions. This involves training the model on a small set of task-specific data, most of which is in prompt-answer format, including positive or negative examples, prompts, constraints, and other elements commonly present in human language. Instruction tuning enables LLMs to provide more accurate and relevant responses to user queries, making them more effective conversational agents.
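Concretely, instruction tuning amounts to supervised finetuning on formatted prompt-answer strings. A minimal formatting sketch is shown below; the template wording is illustrative (real recipes such as Alpaca-style templates differ in detail):

```python
def format_instruction(instruction, response, input_text=""):
    """Render one instruction-tuning example as a single training
    string. The section headers here are illustrative placeholders."""
    prompt = f"### Instruction:\n{instruction}\n"
    if input_text:  # optional extra context for the instruction
        prompt += f"### Input:\n{input_text}\n"
    return prompt + f"### Response:\n{response}"
```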
3.6 RLHF as Finetuning
There is a growing need to explore alternative pretraining objectives that can guide LLMs to generate text that aligns with human preferences. By doing so, we can ensure that LLMs produce text that is more helpful, honest, and harmless for humans, following the 'HHH' rules (Askell et al., 2021). Ouyang et al. (2022) divide the alignment process into three steps: SFT, reward modeling, and reward optimization (RLHF). We have integrated all of these steps into our LMFlow framework. For reward optimization, PPO has been shown to be effective in various studies (Schulman et al., 2017; Engstrom et al., 2020). However, it relies on a trial-and-error approach through interaction with the environment, making it less stable and efficient than supervised learning (Choshen et al., 2019). To address this, we propose and implement a new alignment method for generative models called RAFT (Dong et al., 2023). RAFT uses a reward model to rank the output of the generative model, allowing us to continue training with supervised finetuning (SFT)-like techniques on the selected samples. This approach encourages the generative model to prioritize samples with higher rewards and offers significant computational advantages over PPO, resulting in substantial savings in memory and gradient computations. Moreover, due to the stability of SFT-like training, our approach exhibits lower sample complexity and requires fewer learnable parameters, making it easily adaptable to any generative model. We believe this alignment algorithm represents a competitive and innovative approach that contributes to the well-behaved behavior of generative models.
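The heart of RAFT can be sketched as a selection loop: sample several candidate responses per prompt, score them with the reward model, and keep only the highest-reward ones for SFT-style training. In the sketch below, `generate` and `reward` are stand-in stubs for the generative model and the reward model:

```python
def raft_select(prompts, generate, reward, k=4):
    """One RAFT data-collection step: for each prompt, draw k candidate
    responses, score them with the reward model, and keep the best one
    for subsequent SFT-style training."""
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda resp: reward(prompt, resp))
        selected.append((prompt, best))
    return selected  # then: run supervised finetuning on `selected`
```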
Table 2: The performance on Massive Multitask Language Understanding (MMLU) benchmark. Bold represents the best among each dataset.
| MODEL | anatomy | clinical knowledge | college biology | college medicine | medical genetics | professional medicine | Average |
|---|---|---|---|---|---|---|---|
| LLaMA-33B | 39.2 | 40.3 | 44.4 | 32.9 | 36.0 | 43.0 | 39.3 |
| Galactica-30B | 32.5 | 26.0 | 30.5 | 25.4 | 39.0 | 23.1 | 29.4 |
| Galactica-120B | 58.5 | 59.2 | 68.7 | 57.2 | 68.0 | 59.6 | 61.9 |
| OPT-175B | 28.9 | 21.9 | 30.6 | 35.0 | 27.9 | | |
| BLOOM-176B | 37.0 | 29.8 | 28.5 | 36.0 | 25.4 | | |
| Gopher-280B | 56.3 | 67.2 | 70.8 | 60.1 | 69.0 | 64.0 | 64.6 |
| GPT-3.5 | 56.3 | 69.8 | 72.2 | 61.3 | 70.0 | 70.2 | 66.6 |
| Task-tuned LLaMA-33B (LoRA) | 51.8 | 65.2 | 70.1 | 58.3 | 65.6 | 66.5 | 62.9 |
Table 3: The overall performance of task-tuned LLaMA models and the comparison with human and existing models on three medical datasets. PubMedQA and MedMCQA are evaluated on in-domain tests and MedQA-USMLE is evaluated on the out-of-domain test. Bold represents the best among each dataset.
| MODEL | PubMedQA (ID) | MedQA-USMLE (OOD) | MedMCQA (ID) | Average |
|---|---|---|---|---|
| Human (pass) | | 60.0 | 50.0 | |
| Human (expert) | 78.0 | 87.0 | 90.0 | 85.0 |
| InstructGPT-175B | 73.2 | 46.0 | 44.0 | 54.4 |
| ChatGPT | 63.9 | 57.0 | 44.7 | 55.2 |
| LLaMA-7B | 5.2 | 27.1 | 24.3 | 18.9 |
| LLaMA-33B | 1.8 | 43.4 | 30.3 | 25.2 |
| Task-tuned LLaMA-7B (full) | 75.1 | 44.5 | 49.9 | 56.5 |
| Task-tuned LLaMA-33B (LoRA) | 74.0 | 51.3 | 50.2 | 58.5 |
3.7 Efficient Tuning
LMFlow supports low-rank adaptation (LoRA) (Hu et al.) tuning based on the implementation of huggingface/peft (Mangrulkar et al., 2022). LoRA is an efficient tuning method that freezes the weights of the pretrained model and injects trainable rank-decomposition matrices into each layer of the Transformer architecture. This approach significantly reduces the number of trainable parameters. On top of that, LMFlow integrates QLoRA (Dettmers et al., 2023), allowing the training of even larger LLMs.
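The parameter saving is easy to quantify: LoRA replaces the update of a frozen d_out × d_in weight with two trainable factors B (d_out × r) and A (r × d_in), cutting trainable parameters from d_out · d_in to r · (d_out + d_in). A quick illustration (the matrix size and rank below are arbitrary examples):

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters added by a LoRA adapter on one weight
    matrix: B is (d_out x rank) and A is (rank x d_in)."""
    return d_out * rank + rank * d_in

full = 4096 * 4096                                 # full finetune of one 4096x4096 matrix
lora = lora_trainable_params(4096, 4096, rank=8)   # LoRA trains ~0.4% of that
print(full, lora)
```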
3.8 Inference
LMFlow provides an easy-to-use inference interface for LLMs, which supports parameter partitioning with the zero-offload strategies introduced by Deepspeed (Ren et al., 2021). In LMFlow, the inference interface is provided by an inferencer class. The inferencer contains two important inference methods: inference and stream inference, which differ in whether the output is printed word by word in real time. Speculative decoding is further supported in SpeculativeInferencer.
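In outline, speculative decoding has a small draft model propose a few tokens that the large target model then verifies, keeping the longest agreeing prefix plus one corrected (or bonus) token. The sketch below uses greedy verification with stub next-token functions; real implementations verify against the target distribution probabilistically:

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Greedy speculative decoding step: the draft model proposes k
    tokens; the target model checks them in order and we keep the
    longest agreeing prefix, plus one corrected (or bonus) token."""
    # 1) draft model proposes k tokens autoregressively
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2) target model verifies the proposals in order
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(expected)  # replace the first disagreement
            break
    else:
        accepted.append(target_next(ctx))  # all accepted: one bonus token
    return accepted
```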
4 API Documentation
Please refer to https://optimalscale.github.io/LMFlow/autoapi/index.html for the details of the API documentation.
5 Results
In this section, we will provide experimental results and case studies of LMFlow in task tuning, instruction tuning, and alignment tuning.
5.1 Task Tuning
The aim of task tuning is to enhance a language model's proficiency in a specific field, such as the medical or financial domain, by imparting domain-specific information that allows it to better adapt to the target subject matter. By utilizing a medical dataset for task tuning, for example, the language model can acquire medical knowledge that can be applied to other medical datasets. To highlight the importance of this approach, we employed task tuning on LLaMA models in the medical domain and assessed their performance. The evaluations on three medical datasets revealed significant enhancements on both in-domain (PubMedQA (Jin et al., 2019), MedMCQA (Pal et al., 2022)) and out-of-domain (MedQA-USMLE (Jin et al., 2021)) datasets. The results are shown in Table 3. The LLaMA-33B (LoRA) performance is achieved with only about 16 hours of finetuning on the training split of PubMedQA
Table 4: Performance on the Hugging Face Open LLM Leaderboard. We conduct the comparisons under the same setting as the Hugging Face Open LLM Leaderboard, which uses the EleutherAI Language Model Evaluation Harness (Gao et al., 2021). ARC-C, HellaSwag, MMLU, and TruthfulQA are evaluated with 25-shot, 10-shot, 5-shot, and 0-shot, following the standard setting.
| MODEL | ARC-C | HellaSwag | MMLU | TruthfulQA | Average |
|---|---|---|---|---|---|
| **7B** | | | | | |
| LLaMA-7B (Touvron et al., 2023a) | 46.6 | 75.6 | 34.2 | 34.1 | 47.6 |
| Baize-7B-v2 (Xu et al., 2023) | 44.5 | 73.3 | 35.6 | 40.8 | 48.6 |
| MPT-7B (Team,2023) | 47.7 | 77.7 | 35.6 | 33.4 | 48.6 |
| Falcon-7B (Penedo et al.,2023) | 47.9 | 78.1 | 35.0 | 34.3 | 48.8 |
| Robin-7B-v2 | 49.4 | 74.6 | 39.8 | 43.0 | 51.7 |
| **13B** | | | | | |
| Alpaca-13B (Taori et al.,2023) | 51.9 | 77.6 | 37.6 | 39.6 | 51.7 |
| LLaMA-13B (Touvron et al.,2023a) | 50.8 | 78.9 | 37.7 | 39.9 | 51.8 |
| Vicuna-13B (Zheng et al., 2023a) | 47.4 | 75.2 | 39.6 | 49.8 | 53.7 |
| Baize-13B-v2 (Xu et al.,2023) | 50.3 | 77.1 | 39.4 | 48.3 | 53.8 |
| Robin-13B-v2 | 56.5 | 80.4 | 48.8 | 50.8 | 59.1 |
| **>30B** | | | | | |
| LLaMA-33B (Touvron et al.,2023a) | 57.1 | 82.6 | 45.7 | 42.3 | 56.9 |
| LLaMA-65B (Touvron et al.,2023a) | 57.8 | 84.2 | 48.8 | 42.3 | 58.3 |
| Falcon-40B (Penedo et al.,2023) | 61.9 | 85.3 | 52.7 | 41.7 | 60.4 |
| Guanaco-65B-merged (Dettmers et al., 2023) | 60.2 | 84.6 | 52.7 | 51.3 | 62.2 |
| Falcon-40B-instruct (Penedo et al., 2023) | 61.6 | 84.4 | 54.1 | 52.5 | 63.2 |
| Robin-33B-v2 | 62.5 | 84.3 | 57.8 | 51.9 | 64.1 |
| Robin-65B-v2 | 61.9 | 84.6 | 62.6 | 51.8 | 65.2 |
Table 5: Results on HH-RLHF dataset. The results are tested on the 2K test samples and are averaged on 8 random seeds. The LLaMA-7B-SFT is the SFT-aligned model. Reward and PPL denote the mean reward and perplexity, respectively. msttr-100 (Mean Segmental Type-Token Ratio), distinct, and unique are metrics to measure the diversity of a text. Pred. Length is the average length of predictions.
| Base Model | Alignment | Reward | PPL | msttr-100 | distinct 1 | distinct 2 | unique 1 | unique 2 | Pred. Length |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-7B | - | -0.435 | 4.781 | 0.579 | 0.032 | 0.258 | 7651 | 96071 | 119.9 |
| LLaMA-7B | SFT | 0.772 | 3.781 | 0.597 | 0.031 | 0.250 | 8198 | 110759 | 145.4 |
| LLaMA-7B-SFT | PPO | 2.077 | 4.156 | 0.597 | 0.033 | 0.262 | 7370 | 102437 | 127.8 |
| LLaMA-7B-SFT | RAFT | 2.294 | 4.031 | 0.611 | 0.032 | 0.258 | 8691 | 123576 | 156.2 |
and MedMCQA with a single 8×A100 server. Furthermore, we conducted experiments on Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020) to further confirm the out-of-domain robustness of the task tuning. The results are shown in Table 2.
5.2 Instruction Tuning
Following previous work in instruction tuning (Wang et al., 2022; Taori et al., 2023; Zheng et al., 2023a), we finetune the model with a combination of ShareGPT, GPT-4-LLM (Peng et al., 2023), and BELLE (Ji et al., 2023a,b). This data fusion takes the Chinese and English data balance into consideration. Furthermore, we only sample a small subset from ShareGPT and BELLE instead of using the full data, which would require large computational resources. We call our instruction-tuned model Robin. We trained Robin-7B-v2, Robin-13B-v2, Robin-33B-v2, and Robin-65B-v2 based on the respective LLaMA base models. The delta weights of Robin are released at https://github.com/OptimalScale/LMFlow#model-zoo. In order to evaluate the models' instruction-following ability, we participate in the Hugging Face Open LLM Leaderboard. The performance is shown in Table 4. Specifically, we have carried out in-depth finetuning based on the entire LLaMA series, including 7B, 13B, 33B, and 65B, all of which have achieved superior results. Robin-7B-v2 scored 51.7 in the Open LLM standard test, and Robin-13B-v2 even reached as high as 59.1, ranking sixth and surpassing many 33B models. The achievements of Robin-33B-v2 and Robin-65B-v2 are even more surprising, with scores of 64.1 and 65.2 respectively, firmly securing the top positions.
5.3 Alignment Tuning
We conduct experiments on the HH-RLHF (Helpful and Harmless) dataset (Bai et al., 2022), which is collected for model alignment according to human preferences. The performance is reported in Table 5. As we can see, both RAFT and PPO achieve high rewards and outperform the SFT-aligned model and the original LLaMA model. In comparison, RAFT achieves a better perplexity and tends to reply with more details, as RAFT's responses are usually longer. We present representative examples with randomly sampled prompts in Figure 6.
我们在HH-RLHF(有益无害)数据集(Bai等人,2022)上进行了实验,该数据集是为根据人类偏好进行模型对齐而收集的。性能结果如表5所示。可以看出,RAFT和PPO都获得了较高的奖励分数,优于经过SFT对齐的模型和原始LLaMA模型。相比之下,RAFT实现了更低的困惑度,并且倾向于给出更详细的回复,因为RAFT的响应通常更长。我们在图6中展示了随机采样提示的代表性示例。
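The data-collection step at the heart of RAFT (Dong et al., 2023) can be summarized as: sample several candidate responses per prompt, score them with the reward model, and keep only the highest-reward response for ordinary supervised finetuning. A minimal sketch, with hypothetical stand-ins for the generator and reward model:

```python
def raft_select(prompts, generate, reward, k=4):
    """One RAFT data-collection round: for each prompt, sample k candidate
    responses, score them with the reward model, and keep only the best.
    The resulting (prompt, response) pairs then feed a standard SFT step.
    """
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda resp: reward(prompt, resp))
        selected.append((prompt, best))
    return selected

# Toy stand-ins: "generation" cycles through canned replies, and the
# "reward model" simply prefers longer responses.
replies = iter(["ok", "a detailed helpful answer", "no", "maybe later"])
pairs = raft_select(["How do I stay safe online?"],
                    generate=lambda p: next(replies),
                    reward=lambda p, r: len(r))
```

This reward-ranked filtering is what makes RAFT more stable to train than PPO-style policy gradients: each round reduces to supervised learning on the selected pairs.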
6 Conclusion
In conclusion, the LMFlow toolkit offers an extensible, lightweight, and easy-to-use solution for developers and researchers to perform efficient training of scientific language models with limited resources. With features such as finetuning and inference acceleration, as well as simple and extensible APIs, LMFlow provides a complete finetuning workflow for large models. Moreover, with the ability to customize training and achieve performance comparable to or even better than ChatGPT, LMFlow represents a significant step forward in the development of large scientific models and their application to specialized tasks.
Acknowledgements
We thank the anonymous reviewers for their valuable suggestions and comments. Shizhe Diao and Rui Pan were supported by the Hong Kong Ph.D. Fellowship Scheme (HKPFS).
Broader Impact and Responsible Use
LMFlow is designed to offer substantial capabilities for scientific language model development. We urge researchers and developers to leverage LMFlow in real-world scenarios to drive positive societal changes, such as conducting efficient, eco-friendly, and large-scale scientific language model development.
Despite these benefits, there is a potential for misuse of LMFlow. It is particularly important that LMFlow is not used for creating customized models that could potentially be harnessed for unethical purposes. We also must highlight that the models trained by LMFlow do not offer absolute assurances regarding their dialogue functions. Users may encounter inaccuracies or biases in predictions. Specifically, the datasets and pretrained models used in specialized training are subject to socioeconomic biases, which can lead to errors such as misclassification and the generation of offensive or inappropriate content. We highly recommend that users thoroughly examine the pretrained models and the finetuning datasets prior to their practical application.
We are committed to the continuous improvement of LMFlow. Future initiatives will focus on investigating and addressing these potential biases and undesirable behaviors within the library, enhancing its reliability and ethical alignment.
References
Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78.
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3606–3611.
He Cao, Zijing Liu, Xingyu Lu, Yuan Yao, and Yu Li. 2023. InstructMol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint arXiv:2311.16208.
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
Leshem Choshen, Lior Fox, Zohar Aizenbud, and Omri Abend. 2019. On the weaknesses of reinforcement learning for neural machine translation. arXiv preprint arXiv:1907.01752.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, KaShun SHUM, and Tong Zhang. 2023. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research.
Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. 2020. Implementation matters in deep policy gradients: A case study on ppo and trpo. arXiv preprint arXiv:2005.12729.
Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. 2023. G-LLaVA: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370.
Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation. Zenodo.
Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. 2022. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.
Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K Bressem. 2023. Medalpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint arXiv:2304.08247.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. In International Conference on Learning Representations.
Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations.
Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv preprint arXiv:1904.05342.
Hugging Face. 2022. Hugging Face. huggingface.co.
Yunjie Ji, Yong Deng, Yan Gong, Yiping Peng, Qiang Niu, Baochang Ma, and Xiangang Li. 2023a. BELLE: Be everyone's large language model engine. https://github.com/LianjiaTech/BELLE.
Yunjie Ji, Yong Deng, Yan Gong, Yiping Peng, Qiang Niu, Lei Zhang, Baochang Ma, and Xiangang Li. 2023b. Exploring the impact of instruction data scaling on large language models: An empirical study on real-world use cases. arXiv preprint arXiv:2303.14742.
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577.
Carlos Lassance, Hervé Dejean, and Stéphane Clinchant. 2023. An experimental study on pretraining transformers from scratch for IR. In European Conference on Information Retrieval, pages 504–520. Springer.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics, 36(4):1234–1240.
Eric Lehman, Evan Hernandez, Diwakar Mahajan, Jonas Wulff, Micah J Smith, Zachary Ziegler, Daniel Nadler, Peter Szolovits, Alistair Johnson, and Emily Alsentzer. 2023. Do we still need clinical language models? arXiv preprint arXiv:2302.08091.
Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR.
Kelvin Luu, Xinyi Wu, Rik Koncel-Kedziorski, Kyle Lo, Isabel Cachola, and Noah A Smith. 2021. Explaining relationships between scientific documents. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2130–2144.
Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, and Sayak Paul. 2022. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
Tuan Dung Nguyen, Yuan-Sen Ting, Ioana Ciuca, Charlie O'Neill, Ze-Chang Sun, Maja Jablonska, Sandor Kruk, Ernest Perkowski, Jack Miller, Jason Li, et al. 2023. AstroLLaMA: Towards specialized foundation models in astronomy. arXiv preprint arXiv:2309.06126.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277.
Ernest Perkowski, Rui Pan, Tuan Dung Nguyen, Yuan-Sen Ting, Sandor Kruk, Tong Zhang, Charlie O'Neill, Maja Jablonska, Zechang Sun, Michael J Smith, et al. 2024. AstroLLaMA-Chat: Scaling AstroLLaMA with conversational and diverse datasets. Research Notes of the AAS, 8(1):7.
Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, and Tong Zhang. 2023. DetGPT: Detect what you need via reasoning. arXiv preprint arXiv:2305.14167.
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.
Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, and Minjia Zhang. 2021. ZeRO-Offload: Democratizing billion-scale model training.
Andre Niyongabo Rubungo, Craig Arnold, Barry P Rand, and Adji Bousso Dieng. 2023. LLM-Prop: Predicting physical and electronic properties of crystalline solids from their text descriptions. arXiv preprint arXiv:2310.14029.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Pranav Shetty, Arunkumar Chitteth Rajan, Chris Kuenneth, Sonakshi Gupta, Lakshmi Prerana Panchumarti, Lauren Holm, Chao Zhang, and Rampi Ramprasad. 2023. A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing. npj Computational Materials, 9(1):52.
Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. 2023. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. github.com/tatsu-lab/stanford_alpaca.
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085.
MosaicML NLP Team. 2023. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. 2020. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl.
Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model.
Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023. Huatuo: Tuning llama model with chinese medical knowledge. arXiv preprint arXiv:2304.06975.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023. Pmc-llama: Towards building open-source language models for medicine. arXiv preprint arXiv:2305.10415.
Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. 2023. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196.
Xianjun Yang, Junfeng Gao, Wenxin Xue, and Erik Alexandersson. 2024. PLLaMa: An open-source large language model for plant science. arXiv preprint arXiv:2401.01600.
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653.
Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, et al. 2023. Huatuogpt, towards taming language model to be a doctor. arXiv preprint arXiv:2305.15075.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023a. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
Ziqiang Zheng, Jipeng Zhang, Tuan-Anh Vu, Shizhe Diao, Yue Him Wong Tim, and Sai-Kit Yeung. 2023b. Marinegpt: Unlocking secrets of ocean to the public. arXiv preprint arXiv:2310.13596.
A Appendix
A.1 Video Demonstration
Here are two video demos trained and deployed by LMFlow:
• Emotional Companion bot: https://www.youtube.com/watch?v=BDSME4f2AjU
• Multimodal reasoning-based object detection bot: https://www.youtube.com/watch?v=YXNyh6bGqyI
Prompt 1:
Human: How can I get my girlfriend to cook more?
Assistant: Have you tried reminding her of how nice the food tastes?
Human: I could do it more.
Assistant:
Responses:
LLaMA: When you cook do you use a lot of spices?
LLaMA-SFT: That’s an easy one, so it may not be the most effective strategy. Do you know of any other strategies?
LLaMA-PPO: That might help. Also, make sure the kitchen is always clean and organized. This will help keep things fresh for cooking. Good luck!
LLaMA-RAFT: That’s a good idea! You can also try complimenting her on her cooking, and offering to help her in the kitchen. She might enjoy cooking more if she feels appreciated. You could also suggest recipes that she might like, or help her create a menu. Cooking can be a lot of fun, and it’s a great way to spend time together.
Prompt 2:
Human: How do you get a child to do homework?
Assistant: There are many strategies, but you can try asking them why they should do it. Ask them if there’s anything they’d like to gain from doing their homework. Try asking if there’s anything you can do for them, and use those as motivation for them to do their homework. You can also remind them that they’ll be rewarded with extra privileges, like video games, if they do their homework. Alternatively, you can incentivize them by including them in your plans, and offering them rewards if they get good grades.
Human: So make it so there is something in it for them?
Assistant:
Responses:
LLaMA: Yes. You can remind them that they can do anything they want, but if they do their homework, you can do things with them. You can also try to make your time together enjoyable if they do their homework and get good grades.
LLaMA-SFT: Yes, exactly. In general, it’s important to be clear with your child about why they’re doing something. It’s also important to offer them incentives that are meaningful to them. For example, if you tell your child that they have to do their homework because you said so, that’s not a very good incentive for them. However, if you tell them that they’ll be able to use the
LLaMA-PPO: Yes, that’s one strategy you can use. It can help motivate them to do their homework. Good luck with your students! I hope this helps. Have a great day!
LLaMA-PPO: 是的,这是一种可以采用的策略。它能激励学生完成作业。祝你和学生们相处顺利!希望这些建议有所帮助。祝你今天愉快!
LLaMA-RAFT: Yes, that’s a good strategy! Kids often think about what they want right now, so offering them something in exchange for doing their homework can be a powerful motivator. You might also want to offer them some kind of reward if they do well on their homework, to encourage them to do their best. Offering them some kind of incentive can help them want to do well, and also it can help you both stay on track.
LLaMA-RAFT: 这个策略很棒!孩子们通常只考虑眼前需求,因此用交换条件督促他们完成作业会非常有效。不妨承诺在他们认真完成作业后给予奖励,以此激励他们全力以赴。适当的激励措施既能激发孩子的学习动力,也能帮助你们共同保持进度。
Prompt 3:
提示 3:
Human: I am doing a project on weather disasters in the United States. Can you tell me what the strongest hurricane was to ever hit the U.S. was?
我正在做一个关于美国气象灾害的项目。你能告诉我美国有史以来遭遇的最强飓风是哪一场吗?
Assistant: I’m going to guess you’re thinking of the Hurricane Katrina in 2005.
我猜你想到的是2005年的卡特里娜飓风。
Human: Well, I’m not really thinking of anything in particular. Was Katrina the ...
Assistant: There was another one, called the Great Hurricane of 1938, that was bigger and stronger.
人:嗯,我其实没特指什么。卡特里娜是...
助手:还有一场更大更强的飓风,叫1938年大飓风。
Human: Are you sure the date is correct? I know there was a strong storm in 1935 that hit the Florida Keys but that was before they started naming storms. Is that the one you are referring to?
你确定日期没错吗?我知道1935年佛罗里达群岛遭遇过一场强风暴,但那时还没开始给风暴命名。你指的是那次吗?
Assistant:
助手:
Responses:
响应:
LLaMA: Yeah, the Great Hurricane of 1938.
LLaMA:是的,1938年大飓风。
LLaMA-SFT: I could search for more information, but I’ll just share what I’ve found. Here’s a review of the conditions in 1935 and 1938. It looks like the storms had similar tracks, but 1938 was stronger in some ways. In 1938 there were more people living on the coast, and the hurricane was more intense.
LLaMA-SFT: 我可以搜索更多信息,但先分享现有发现。这是1935年和1938年飓风情况的对比分析。两次风暴路径相似,但1938年飓风在某些方面更强。当时沿海居住人口更多,且飓风强度更高。
LLaMA-PPO: Yes, the hurricane in 1935 was stronger than Katrina. The 1938 storm was bigger. I hope this helps. Good luck with your project!
LLaMA-PPO: 是的,1935年的飓风比卡特里娜更强。1938年的风暴规模更大。希望这些信息对你有帮助。祝你的项目顺利!
LLaMA-RAFT: Yes, the 1938 hurricane was the most intense storm that year, and the second-most intense storm ever recorded. It caused over 6,000 deaths. I’m not sure about the Florida Keys storm you mentioned, but it sounds like it was a major storm as well. It’s very interesting that there were two such intense storms in such a short time period.
LLaMA-RAFT: 是的,1938年飓风是当年最强烈的风暴,也是有记录以来第二强烈的风暴。它导致超过6000人死亡。我不确定你提到的佛罗里达群岛风暴,但听起来那也是一场重大风暴。如此短时间内接连发生两场如此强烈的风暴,这非常有趣。
Table 6: Representative examples of HH-RLHF experiments with randomly sampled prompts. These examples are directly taken from the RAFT paper (Dong et al., 2023).
表 6: 随机采样提示的HH-RLHF实验代表性示例。这些示例直接取自RAFT论文 (Dong et al., 2023)。
