[论文翻译]Taiyi: 面向多样化生物医学任务的双语精调大语言模型


原文地址:https://arxiv.org/pdf/2311.11608


Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks

Taiyi: 面向多样化生物医学任务的双语精调大语言模型

Ling Luo* , Jinzhong Ning, Yingwen Zhao, Zhijun Wang, Zeyuan Ding, Peng Chen, Weiru Fu, Qinyu Han, Guangtao Xu, Yunzhi Qiu, Dinghao Pan, Jiru Li, Hao Li, Wenduo Feng, Senbo Tu, Yuqi Liu, Zhihao Yang, Jian Wang, Yuanyuan Sun, Hongfei Lin

School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China * Corresponding author: lingluo@dlut.edu.cn

大连理工大学计算机科学与技术学院,大连 116024

  • 通讯作者:lingluo@dlut.edu.cn

Abstract

摘要

Objective: Most existing fine-tuned biomedical large language models (LLMs) focus on enhancing performance in monolingual biomedical question answering and conversation tasks. To investigate the effectiveness of fine-tuned LLMs on diverse biomedical NLP tasks in different languages, we present Taiyi, a bilingual fine-tuned LLM for diverse biomedical tasks.

目标:现有大多数经过微调的生物医学大语言模型主要关注提升单语种生物医学问答和对话任务的性能。为探究微调后大语言模型在不同语言的多样化生物医学自然语言处理任务中的有效性,我们提出了双语生物医学任务微调模型Taiyi。

Materials and Methods: We first curated a comprehensive collection of 140 existing biomedical text mining datasets (102 English and 38 Chinese datasets) across over 10 task types. Subsequently, a two-stage strategy is proposed for supervised fine-tuning to optimize the model performance across varied tasks.

材料与方法:我们首先整理了140个现有生物医学文本挖掘数据集(102个英文和38个中文数据集),涵盖10余种任务类型。随后,提出了一种两阶段监督微调策略,以优化模型在不同任务中的性能。

Results: Experimental results on 13 test sets covering named entity recognition, relation extraction, text classification, and question answering tasks demonstrate that Taiyi achieves superior performance compared to general LLMs. The case study involving additional biomedical NLP tasks further shows Taiyi's considerable potential for bilingual biomedical multi-tasking.

结果:在涵盖命名实体识别、关系抽取、文本分类、问答任务的13个测试集上的实验结果表明,Taiyi相比通用大语言模型展现出更优性能。针对生物医学NLP任务的案例研究进一步揭示了Taiyi在双语生物医学多任务处理方面的巨大潜力。

Conclusion: Leveraging rich high-quality biomedical corpora and developing effective fine-tuning strategies can significantly improve the performance of LLMs within the biomedical domain. Taiyi shows bilingual multi-tasking capability through supervised fine-tuning. However, tasks such as information extraction, which are not generation tasks in nature, remain challenging for LLM-based generative approaches, and they still underperform the conventional discriminative approaches of smaller language models.

结论:利用丰富的高质量生物医学语料库并开发有效的微调策略,可以显著提升大语言模型在生物医学领域的性能。Taiyi通过监督微调展现了双语多任务处理能力。然而,对于信息抽取等本质上非生成式的任务,基于大语言模型的生成式方法仍面临挑战,其表现仍逊于传统小语言模型的判别式方法。

Key words: natural language processing; large language model; supervised fine-tuning; biomedical multi-tasking.

关键词: 自然语言处理; 大语言模型; 监督微调; 生物医学多任务

INTRODUCTION

引言

Recently, the release of ChatGPT [1] and the subsequent launch of GPT-4 [2] have received widespread attention around the world, which promotes the development of large language models (LLMs) that have billions of parameters and are trained on hundreds of billions of tokens. These LLMs (such as GPT-4, PaLM [3], LLaMA [4], and GLM [5]) have shown promising results and achieved new state-of-the-art performance in various natural language processing (NLP) tasks. Compared with previous pre-trained language models (such as BERT [6]), LLMs exhibit enhanced text generation and comprehension abilities. Moreover, their few-shot/zero-shot learning and generalization abilities help address unseen or complicated tasks. Due to these advantages, LLMs have been explored for application in various fields, such as law [7], education [8], finance [9], and biomedicine and health [10].

近来,ChatGPT[1]的发布及随后推出的GPT-4[2]引发了全球广泛关注,推动了大语言模型(LLMs)的发展——这些模型具有数十亿参数,并接受了数千亿token的训练。以GPT-4、PaLM[3]、LLaMA[4]和GLM[5]为代表的LLMs在各类自然语言处理(NLP)任务中展现出卓越性能,创造了新的技术标杆。与此前的预训练语言模型(如BERT[6])相比,LLMs显著提升了文本生成与理解能力,其少样本/零样本学习及泛化能力更能应对未知复杂任务。凭借这些优势,LLMs已在法律[7]、教育[8]、金融[9]、生物医药与健康[10]等领域展开应用探索。

LLMs in the biomedical and healthcare domain are essential and hold great potential for handling the scale and complexity of biomedical data, as well as for providing more personalized and empathetic medical care, ultimately advancing the quality and effectiveness of healthcare practices. However, biomedical text exhibits some distinct features compared to text in the general domain, such as complex terminologies, ambiguous abbreviations, more complex syntax, and less common vocabulary. Owing to these domain characteristics, most existing LLMs trained on general-domain data encounter difficulties when they are applied to the biomedical domain [11]. Therefore, some domain-specific LLMs have been developed to facilitate the development of NLP methods for biomedical applications. For example, Google’s team recently developed Med-PaLM2 [12], a fine-tuned LLM using medical data based on PaLM [3], which has achieved high performance on medical QA datasets with an accuracy of over 80%. Apart from the non-open biomedical LLMs, there have been some efforts to develop open-source biomedical LLMs [13-20]. Most of the models are derived from open-source general LLMs (such as LLaMA [4], BLOOMZ [21], and GLM [5]) and are fine-tuned with monolingual (e.g., English or Chinese) question answering (QA) and conversation data. They are primarily directed toward enhancing performance in biomedical QA and conversation-oriented tasks. However, the efficacy and suitability of LLMs on diverse biomedical NLP tasks in different languages remain unclear and warrant further investigation.

生物医学和医疗健康领域的大语言模型对于处理生物医学数据的规模和复杂性、提供更具个性化和同理心的医疗服务至关重要,且具备巨大潜力,最终将提升医疗实践的质量与效率。然而,与通用领域文本相比,生物医学文本具有术语复杂、缩写歧义、句法结构更繁琐、词汇更生僻等显著特征。由于这些领域特性,大多数基于通用领域数据训练的大语言模型在应用于生物医学领域时会遇到困难[11]。为此,研究者开发了若干领域专用大语言模型以推动生物医学NLP方法的发展。例如,Google团队近期基于PaLM[3]利用医学数据微调出Med-PaLM2[12],该模型在医学问答数据集上实现了超过80%的准确率。除未开源的生物医学大语言模型外,学界也涌现了一批开源模型[13-20]。这些模型大多衍生自开源通用大语言模型(如LLaMA[4]、BLOOMZ[21]和GLM[5]),并通过单语种(如英语或中文)问答和对话数据进行微调,主要致力于提升生物医学问答和对话任务的性能。但这些大语言模型在不同语言的多样化生物医学NLP任务中的有效性与适用性仍不明确,有待进一步探究。

To address these problems, we present Taiyi, a bilingual (English and Chinese) fine-tuned large language model for diverse biomedical tasks. First, a comprehensive collection including 102 English and 38 Chinese datasets is assembled, covering over 10 biomedical task types. To facilitate task-specific requirements and enable consistent formatting across all datasets, standardized data schemas are designed and universally applied during dataset curation. Then, in the supervised fine-tuning (SFT) phase, we propose a two-stage fine-tuning strategy. In contrast to simple single-stage fine-tuning, this strategy significantly optimizes model performance across a diversity of tasks. Finally, the evaluation of Taiyi is conducted on 13 biomedical NLP test sets. Experimental results demonstrate the promising potential of Taiyi in bilingual multi-task learning. It achieves superior performance on biomedicine-specific tasks compared to general LLMs. However, there remains room for improvement compared to current state-of-the-art models specialized for individual tasks.

为解决这些问题,我们推出了Taiyi——一个针对多样化生物医学任务进行双语(英语和中文)微调的大语言模型。首先,我们整合了包含102个英文和38个中文数据集的综合资源库,覆盖超过10种生物医学任务类型。为满足任务特定需求并确保所有数据集格式统一,在数据整理过程中设计并统一应用了标准化数据架构。随后,在有监督微调(SFT)阶段,我们提出了两阶段微调策略。与简单的单阶段微调相比,该策略显著优化了模型在多样化任务中的表现。最终,我们在13个生物医学自然语言处理测试集上对Taiyi进行评估。实验结果表明,Taiyi在双语多任务学习中展现出显著潜力。相较于通用大语言模型,其在生物医学专项任务中表现更优,但与当前单项任务专用最先进模型相比仍存在提升空间。

RELATED WORK

相关工作

Recent advancements in LLMs, such as GPT-4 [2] and PaLM [3], have attracted considerable attention due to their instruction-following ability and human-like responses. Subsequently, researchers have attempted to replicate the GPT series by developing open-source foundation models such as LLaMA [4], Bloom [22], Falcon [23], GLM [5], and Qwen [24]. These LLMs present strong performances on various NLP tasks, including zero- and few-shot learning scenarios. The promising capabilities of LLMs have sparked interest and potential applications in various fields, particularly in the biomedical domain.

近年来,大语言模型(LLM)如GPT-4 [2]和PaLM [3]因其指令跟随能力和类人响应生成能力引发了广泛关注。随后,研究者们尝试复现GPT系列模型,开发出LLaMA [4]、Bloom [22]、Falcon [23]、GLM [5]和Qwen [24]等开源基础模型。这些大语言模型在各种自然语言处理(NLP)任务中展现出强大性能,包括零样本和少样本学习场景。大语言模型的卓越能力激发了多个领域的应用兴趣,尤其在生物医学领域展现出巨大潜力。

When applied to specific domains like biomedicine, large-scale models often perform sub-optimally. To improve the performance of the models on biomedical tasks, there have been some efforts in training LLMs specifically for the biomedical domain. For example, MedAlpaca [14] builds upon medical data to fine-tune Stanford Alpaca for applications related to medical question answering and dialogue. ChatDoctor [19] is designed to simulate a conversation between a doctor and a patient, fine-tuning LLaMA with medical literature. Additionally, Med-PaLM [25] has shown promising performance on the MedQA exam based on clinical corpora and human feedback. Meanwhile, aiming at the Chinese medical domain, Chinese LLMs such as BenTsao [16], DoctorGLM [17], and HuatuoGPT [20] have been developed on Chinese medical dialogue data. More recently, Zhongjing [18] and ChiMed-GPT [26] adopted a full training pipeline from pre-training and SFT to reinforcement learning with human feedback. An overview of existing LLMs in the biomedical domain can be found in Supplementary Table S1. Most existing open-source LLMs focus on fine-tuning with monolingual QA and conversation data. Different from the above monolingual LLMs, our Taiyi embarks on a bilingual (English and Chinese) biomedical large language model, aiming to explore the capabilities of large models in handling a variety of bilingual NLP tasks in the biomedical field.


当应用于生物医学等特定领域时,大规模模型往往表现欠佳。为提高大语言模型在生物医学任务中的性能,已有一些针对该领域专门训练大语言模型的尝试。例如,MedAlpaca [14] 基于医学数据对 Stanford Alpaca 进行微调,用于医疗问答和对话相关应用。ChatDoctor [19] 旨在模拟医患对话,通过医学文献对 LLaMA 进行微调。此外,Med-PaLM [25] 在基于临床语料库和人类反馈的 MedQA 考试中展现出优异表现。同时,针对中文医疗领域,基于中文医疗对话数据开发了诸如 BenTsao [16]、DoctorGLM [17] 和 HuatuoGPT [20] 等中文大语言模型。近期推出的 Zhongjing [18] 和 ChiMed-GPT [26] 则采用了从预训练、监督微调(SFT)到基于人类反馈的强化学习的完整训练流程。现有生物医学领域大语言模型的概述可参见补充材料表 S1。现有大多数开源大语言模型主要集中于单语言问答和对话数据的微调。与上述单语言大语言模型不同,我们的 Taiyi 致力于开发双语(中英文)生物医学大语言模型,旨在探索大模型处理生物医学领域多种双语自然语言处理任务的能力。

MATERIALS AND METHODS

材料与方法


Figure 1. Overview of the Taiyi framework. Note that the English in brackets is not part of the input and output; it is the translation of the Chinese.


图 1: Taiyi框架概览。注意括号中的英文并非输入输出内容,而是中文的翻译。

Figure 1 shows the overview of our framework. We first collected and curated a set of existing bilingual (Chinese and English) biomedical natural language processing (BioNLP) corpora. Then these corpora were converted to the instruction data used to fine-tune the general LLM. In the supervised fine-tuning phase, we propose a two-stage supervised instruction fine-tuning strategy. Finally, our Taiyi model can be applied to various BioNLP tasks and it is evaluated on 13 test sets covering 4 BioNLP task types.

图 1 展示了我们框架的概览。我们首先收集并整理了一组现有的双语(中英文)生物医学自然语言处理(BioNLP)语料库。随后将这些语料库转换为用于微调通用大语言模型的指令数据。在有监督微调阶段,我们提出了一种两阶段有监督指令微调策略。最终,我们的Taiyi模型可应用于各类BioNLP任务,并在涵盖4种BioNLP任务类型的13个测试集上进行了评估。

Training data

训练数据

Dataset collection

数据集收集

This study aims to explore the capabilities of LLMs in handling a variety of bilingual BioNLP tasks by supervised fine-tuning. Therefore, we focus on manually annotated English and Chinese biomedical corpora. To fully utilize available BioNLP resources, we make efforts to aggregate comprehensive sets of open-source datasets in both English and Chinese. The data collection is primarily from two sources: existing English/Chinese BioNLP shared task datasets and the training data used for existing biomedical LLMs.

本研究旨在通过监督微调探索大语言模型(LLM)处理多种双语生物医学自然语言处理(BioNLP)任务的能力。为此,我们重点关注人工标注的英文和中文生物医学语料库。为充分利用现有BioNLP资源,我们致力于整合全面的开源英文和中文数据集。数据收集主要来自两个渠道:现有英文/中文BioNLP共享任务数据集,以及现有生物医学大语言模型使用的训练数据。


Figure 2. Overview of the dataset collection. In the tree map (left), the entire graph is represented by a large rectangle, representing the 140 bilingual open-source datasets collected. This large rectangle is divided into smaller rectangles of various colors, with each small rectangle representing a specific BioNLP task and listing the names of all datasets related to that task. In the scatter plot (top right), each data point represents a BioNLP task, with its size determined by the number of datasets associated with that task. The bar chart (bottom right) presents the scale of related datasets for different tasks, further illustrating the number of datasets for each task.

图 2: 数据集收集概览。在矩形树图(左)中,整个图形由一个大矩形表示,代表收集的140个双语开源数据集。这个大矩形被划分为多个不同颜色的小矩形,每个小矩形代表特定的BioNLP任务,并列出了与该任务相关的所有数据集名称。散点图(右上)中每个数据点代表一个BioNLP任务,其大小由该任务关联的数据集数量决定。条形图(右下)展示了不同任务相关数据集的规模,进一步呈现了每个任务的数据集数量。

Our data collection significantly benefited from two major previous efforts in aggregating biomedical text mining datasets – BigBio [27] and CBLUE [28]. BigBio aggregates a large collection of English BioNLP datasets, while CBLUE assembles a wide range of Chinese biomedical natural language understanding datasets. In addition, we also collected some other relevant BioNLP datasets that are not included in BigBio and CBLUE. Finally, we have successfully assembled a total of 140 biomedical datasets. All datasets are categorized into 15 different BioNLP task types as shown in Figure 2: Named Entity Recognition/Normalization (NER/NEN), Relation Extraction (RE), Causal Relation Extraction (CRE), Event Extraction (EE), Coreference Resolution (COREF), Text Classification (TC), Question Answering-Multiple Choice (QA-mc), Question Answering-Simple Answer Questions (QA-sqa), Question Answering-Context-based Answer Questions (QA-cqa), Multi-Round Dialogue (MRD), Machine Translation (MT), Text Pairs-Semantic Similarity (TP-ss), Text Pairs-Textual Entailment (TP-te), Text to Text/Struct-Document Summarization (TT-ds) and Text/Struct-Text to Struct (TT-ts). Among these datasets, there are 38 Chinese datasets covering 10 different BioNLP tasks, and 102 English datasets spanning 12 BioNLP tasks.

我们的数据收集工作极大地受益于此前两项生物医学文本挖掘数据集整合的重要成果——BigBio [27]和CBLUE [28]。BigBio汇总了大量英文生物医学自然语言处理数据集,而CBLUE数据集则整合了广泛的中文生物医学自然语言理解数据集。此外,我们还收集了未包含在BigBio和CBLUE中的其他相关生物医学自然语言处理数据集。最终,我们成功整合了总计140个生物医学数据集。所有数据集按15种不同的生物医学自然语言处理任务类型分类,如图2所示:命名实体识别/标准化(NER/NEN)、关系抽取(RE)、因果关系抽取(CRE)、事件抽取(EE)、共指消解(COREF)、文本分类(TC)、选择题问答(QA-mc)、简答题问答(QA-sqa)、基于上下文的问答(QA-cqa)、多轮对话(MRD)、机器翻译(MT)、文本对-语义相似性(TP-ss)、文本对-文本蕴含(TP-te)、文本到文本/结构化-文档摘要(TT-ds)以及文本/结构化-文本到结构化(TT-ts)。这些数据集中包含38个中文数据集,覆盖10种不同的生物医学自然语言处理任务,以及102个英文数据集,涵盖12种生物医学自然语言处理任务。

For each dataset, we collated key metadata including task types, data size, task descriptions, and links to the dataset and paper. This metadata facilitates full understanding and proper usage of each corpus. The collection of bilingual datasets across diverse biomedical language processing tasks facilitates the comprehensive evaluation and development of multilingual BioNLP models. Details can be accessed at the following link: https://github.com/DUTIR-BioNLP/TaiyiLLM/blob/main/data_file/dataset_inf.md.

针对每个数据集,我们整理了关键元数据,包括任务类型、数据规模、任务描述以及数据集与论文的链接。这些元数据有助于全面理解并正确使用每个语料库。跨多种生物医学语言处理任务的双语数据集集合,为多语言BioNLP模型的全面评估与开发提供了便利。具体细节可通过以下链接访问:https://github.com/DUTIR-BioNLP/TaiyiLLM/blob/main/data_file/dataset_inf.md。

Task schema harmonization

任务模式统一化

In the original datasets, even for the same BioNLP task, there are many different data formats due to different sources and developers. For example, annotation formats include the BioC, CoNLL, and PubTator formats across NER datasets. Such inconsistencies introduce challenges for developing systems that can leverage diverse annotated datasets. Therefore, establishing a universal data format enables interoperability and is beneficial for cross-dataset integration. To be consistent with previous efforts, we extended the BigBio schema to support all tasks in our collection. The complete unified task schema can be found at https://github.com/DUTIR-BioNLP/TaiyiLLM/blob/main/data_file/Task_schemas_en.md.

在原始数据集中,即使是相同的BioNLP任务,由于来源和开发者不同也存在多种数据格式。例如,NER数据集中的标注格式包含BioC、CoNLL和PubTator等多种形式。这种不一致性给开发能利用多样化标注数据的系统带来了挑战。因此,建立通用数据格式有助于实现互操作性,并促进跨数据集整合。为与先前工作保持一致,我们扩展了BigBio模式以支持我们收集的所有任务。完整的统一任务架构详见https://github.com/DUTIR-BioNLP/TaiyiLLM/blob/main/data_file/Task_schemas_en.md。
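As a concrete illustration of such a unified format, the sketch below shows a BigBio-style record for one NER document; the field names and the validation helper are our assumptions for illustration, not the authors' exact schema:

```python
# A minimal BigBio-style unified record for one NER document.
# Field names are illustrative, not the exact schema used in the paper.
ner_record = {
    "id": "doc_0001",
    "passages": [
        {"id": "doc_0001_title", "type": "title",
         "text": "Famotidine-associated delirium.", "offsets": [[0, 31]]}
    ],
    "entities": [
        {"id": "doc_0001_e1", "type": "Chemical",
         "text": ["Famotidine"], "offsets": [[0, 10]]},
        {"id": "doc_0001_e2", "type": "Disease",
         "text": ["delirium"], "offsets": [[22, 30]]},
    ],
}

def validate(record):
    """Check that every entity's character offsets match its surface text."""
    full_text = " ".join(p["text"] for p in record["passages"])
    for ent in record["entities"]:
        for (start, end), surface in zip(ent["offsets"], ent["text"]):
            assert full_text[start:end] == surface, (surface, full_text[start:end])
    return True

validate(ner_record)
```

Storing every corpus in one such structure lets a single conversion script generate instruction data for all datasets, regardless of whether they originally shipped in BioC, CoNLL, or PubTator format.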

Instruction-tuning data construction

Instruction-tuning数据构建

To enable the model to understand task instructions for performing multi-tasking, we constructed instruction data for fine-tuning, covering the tasks described in the above section. Moreover, we also added MOSS data (i.e., Moss-003-sft-data) [29] into the training data to enhance the general conversation and harmlessness abilities of Taiyi. Moss-003-sft-data is well-aligned with the real-world distribution of user intents, covering finer-grained categories and more diverse harmlessness-related data. Previous studies [30, 31] have demonstrated that the quality of the training data plays a critical role in the performance of downstream tasks. Inferior-quality data adversely affects the performance of LLMs. Therefore, we manually analyzed the datasets and selected the high-quality ones. We filtered duplicated training data and removed training documents that also appear in the test sets, so that model performance can be evaluated accurately. The statistics of the final data used for instruction-tuning are presented in Table 1.

为了使模型能够理解执行多任务的任务指令,我们构建了用于微调的指令数据,涵盖上述章节描述的任务。此外,我们还向训练数据中添加了MOSS数据(即Moss-003-sft-data)[29],以增强太乙的通用对话和无害化能力。Moss-003-sft-data与现实世界用户意图分布高度吻合,涵盖更细粒度的类别和更多样化的无害化相关数据。先前研究[30, 31]表明,训练数据的质量对下游任务性能起着关键作用。低质量数据会对大语言模型的性能产生不利影响。因此,我们手动分析了数据集,随后筛选出高质量数据集。我们过滤了重复的训练数据,并剔除了训练集中与测试集重叠的文档,以准确评估模型性能。最终用于指令微调的数据统计如表 1 所示。
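The deduplication and test-set leakage filtering described above can be sketched as follows; the normalization step is an assumption, and the actual pipeline may match documents differently:

```python
def filter_training_data(train_docs, test_docs):
    """Drop exact duplicates within the training set and any training
    document whose text also appears in a test set."""
    def norm(text):
        # Simple whitespace/case normalization for matching;
        # a real pipeline might normalize more aggressively.
        return " ".join(text.lower().split())

    test_texts = {norm(d) for d in test_docs}
    seen, kept = set(), []
    for doc in train_docs:
        key = norm(doc)
        if key in seen or key in test_texts:
            continue  # duplicate or test-set leakage
        seen.add(key)
        kept.append(doc)
    return kept

train = ["Aspirin treats pain.", "aspirin  treats pain.", "BRCA1 is a gene."]
test = ["BRCA1 is a gene."]
print(filter_training_data(train, test))  # → ['Aspirin treats pain.']
```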

Table 1. Statistics of our final instruction-tuning data

| Task Type | English data size | Chinese data size |
| --- | --- | --- |
| Named Entity Recognition | 28,603 | 44,667 |
| Relation Extraction | 17,279 | 26,606 |
| Event Extraction | 2,022 | 2,992 |
| Text Classification | 40,339 | 37,624 |
| Text Pair Task | 11,237 | 45,548 |
| Machine Translation | 74,113 | – |
| Biomedical Question Answering | 57,962 | 129,562 |
| Biomedical Multi-Round Dialogue | 10,000 | 16,391 |
| General Dialogue Data | 560,000 | – |
| Other Additional Tasks | 9,370 | – |
| Total | 1,114,315 | |

表 1: 最终指令微调数据统计

| 任务类型 | 英文数据量 | 中文数据量 |
| --- | --- | --- |
| 命名实体识别 | 28,603 | 44,667 |
| 关系抽取 | 17,279 | 26,606 |
| 事件抽取 | 2,022 | 2,992 |
| 文本分类 | 40,339 | 37,624 |
| 文本对任务 | 11,237 | 45,548 |
| 机器翻译 | 74,113 | – |
| 生物医学问答 | 57,962 | 129,562 |
| 生物医学多轮对话 | 10,000 | 16,391 |
| 通用对话数据 | 560,000 | – |
| 其他附加任务 | 9,370 | – |
| 总计 | 1,114,315 | |

To construct the instruction-tuning data, we design instructional templates for each task. For the QA and dialogue tasks, original questions are used as the model input and answers are used as the output. For other tasks, approximately 15 instruction templates were manually created for each task in English and Chinese, respectively. Some examples of the instruction data can be found in Supplementary Table S2.

为构建指令微调数据,我们为每项任务设计了指令模板。对于问答和对话任务,原始问题直接作为模型输入,答案作为输出。其他任务则分别用英语和中文人工创建了约15个指令模板。部分指令数据示例可参见补充表S2。

Furthermore, some complex NER and RE tasks are divided into multiple subtasks to reduce difficulty and increase diversity. For instance, the BC5CDR [32] task of chemical and disease entity recognition is separated into the chemical entity recognition and disease entity recognition subtasks. The original complex task is retained while these additional subtasks are added to the instruction-tuning data.

此外,一些复杂的命名实体识别(NER)和关系抽取(RE)任务被拆分为多个子任务以降低难度并增加多样性。例如,BC5CDR [32]的化学与疾病实体识别任务被拆分为化学实体识别和疾病实体识别两个子任务。在指令调优数据中既保留原始复杂任务,又新增了这些子任务。
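Putting the pieces together, the following sketch turns one annotated document into an instruction-tuning example and illustrates the subtask splitting (e.g., BC5CDR chemical-only recognition); the template wording is invented for illustration, since the paper's roughly 15 templates per task are not reproduced here:

```python
import json

# One illustrative instruction template (the authors wrote ~15 per task
# per language; this particular wording is an assumption).
NER_TEMPLATE = ("Please identify all {entity_type} entities in the "
                "following biomedical text and list them separated by commas.\n"
                "Text: {text}")

def build_ner_instance(text, entities, entity_type):
    """Convert an annotated document into an (instruction, output) pair,
    keeping only entities of the requested type (subtask splitting)."""
    mentions = [e["text"] for e in entities if e["type"] == entity_type]
    return {
        "instruction": NER_TEMPLATE.format(entity_type=entity_type, text=text),
        "output": ", ".join(mentions) if mentions else "No entities found.",
    }

doc = "Famotidine-associated delirium."
ents = [{"text": "Famotidine", "type": "chemical"},
        {"text": "delirium", "type": "disease"}]

# The full chemical+disease task and the chemical-only subtask can both be
# generated from the same record.
chem_instance = build_ner_instance(doc, ents, "chemical")
print(json.dumps(chem_instance, indent=2))
```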

Model training

模型训练

Base model

基础模型

Recently, Alibaba Cloud's pre-trained LLM Qwen series [24] have been developed and shown promising results in various NLP tasks. We chose the Qwen-7B-base version for supervised instruction fine-tuning. Qwen-7B is a Transformer-based pre-trained language model that obtains generalized language understanding capabilities through self-supervised learning on large-scale high-quality multilingual pre-training corpora. Compared to models like BERT, the training data coverage of Qwen-7B is more extensive, including web texts, academic books, code, and other resources. It has approximately 7 billion parameters and a vocabulary size of around 150,000.

近期,阿里云的预训练大语言模型Qwen系列[24]取得显著进展,在多种NLP任务中展现出优异性能。我们选用Qwen-7B-base版本进行监督式指令微调。Qwen-7B是基于Transformer架构的预训练语言模型,通过在大规模高质量多语种预训练语料上进行自监督学习,获得了通用语言理解能力。相较BERT等模型,Qwen-7B的训练数据覆盖更广,包含网络文本、学术著作、代码等资源,参数量约70亿,词表规模约15万。

On public leaderboards of downstream English and Chinese tasks, Qwen-7B significantly outperforms models of similar size and even surpasses larger models on some tasks. This demonstrates Qwen-7B's strong language understanding and transfer learning abilities. The considerations for choosing Qwen-7B as our pre-trained model are as follows: 1) The model size is moderate, with high training efficiency that meets our computational resource constraints; 2) The training data coverage is extensive, encompassing the common language, domain-specific language, and code data required for our cross-domain task; 3) It achieves strong performance on public benchmarks, with sufficient language understanding capabilities to provide a solid foundation for our downstream tasks.

在英语和汉语下游任务的公开排行榜中,Qwen-7B显著优于同等规模的模型,甚至在某些任务上超越更大规模的模型。这证明了Qwen-7B强大的语言理解和迁移学习能力。选择Qwen-7B作为预训练模型的考虑因素如下:1) 模型规模适中,训练效率高,符合我们的计算资源限制;2) 训练数据覆盖广泛,包含跨领域任务所需的通用语言、领域专用语言及代码数据;3) 在公开基准测试中表现优异,具备足够的语言理解能力,可为下游任务提供坚实基础。

Two-stage Supervised Instruction Fine-tuning

两阶段监督指令微调

During the SFT stage, we performed fine-tuning on our curated dataset of millions of examples. Since the instruction-tuning data involve dozens of datasets across over ten biomedical NLP tasks in both Chinese and English, performing fine-tuning in a simple single stage would result in task interference, preventing convergence on some of the more challenging NLP tasks. Therefore, we propose a two-stage supervised instruction fine-tuning strategy, shown in Figure 3.

在监督微调(SFT)阶段,我们对精心整理的数百万条样本数据集进行了微调。由于指令调优数据涉及十余种生物医学NLP任务的中英文数据集,采用简单的单阶段微调会导致任务干扰,阻碍模型在部分更具挑战性的NLP任务上收敛。为此,我们提出了如图3所示的两阶段监督指令微调策略。


Figure 3. The two-stage training process of Taiyi. Tasks in Type1 are shown on a blue background, and tasks in Type2 on a yellow background.

图 3: Taiyi的两阶段训练过程。Type1中的任务以蓝色背景显示,Type2中的任务以黄色背景显示。

According to the task type and the size of the task dataset, we first manually categorized all the tasks into two types: Type1 and Type2. Most tasks in Type1 are not generation tasks in nature (e.g., NER is a sequence labeling task in nature rather than a generation task), or the size of the task dataset is relatively small. These tasks include information extraction, text classification, text pair tasks, machine translation, and other additional tasks. Type2 contains QA and dialogue tasks, including biomedical QA, biomedical multi-round dialogue, and general dialogue tasks; these tasks are generation tasks in nature. In the first stage of model training, we performed supervised instruction fine-tuning on the data from the Type1 tasks (around 340,000 instances). The best checkpoint was selected for the second stage through a combination of human evaluation and automated metrics on the development sets. In the second training stage, all the training data from the first stage are used as retrospective data and mixed with all data in Type2 for incremental training.

根据任务类型和任务数据集规模,我们首先将所有任务手动划分为两类:Type1和Type2。Type1中的任务大多本质不属于生成式任务(例如NER本质上是序列标注任务而非生成任务),或任务数据集规模相对较小。这类任务包括信息抽取、文本分类、文本对任务、机器翻译及其他附加任务。Type2包含问答和对话任务,涉及生物医学问答、生物医学多轮对话以及通用对话任务,这些任务本质属于生成式任务。在模型训练的第一阶段,我们首先对Type1任务数据(约34万条实例)进行监督式指令微调,通过开发集上的人工评估与自动化指标综合选取最佳检查点用于第二阶段训练。在第二阶段训练中,第一阶段所有训练数据将作为回溯数据使用,并与Type2全部数据进行混合以实现增量训练。

We used 8 A40 GPUs for SFT. The model was trained for 5 epochs (~7 hours per epoch) in stage 1 and 3 epochs (~26 hours per epoch) in stage 2. To improve model training efficiency, we chose QLoRA (Dettmers et al., 2023), an efficient tuning algorithm, to conduct supervised instruction fine-tuning. The main hyperparameters for the training process are set as follows: a per-GPU batch size of 12, a learning rate of 0.0002, a warmup ratio of 0.1, a maximum sequence length of 1024, a LoRA rank of 64, LoRA alpha of 16, and LoRA dropout of 0.05.

我们使用8块A40 GPU进行监督微调(SFT)。第一阶段训练5个周期(每周期约7小时),第二阶段训练3个周期(每周期约26小时)。为提高训练效率,我们采用高效调参算法QLoRA (Dettmers等人,2023)进行监督指令微调。主要超参数设置如下:单GPU批次大小12,学习率0.0002,预热比例0.1,最大长度1024,LoRA秩64,LoRA alpha值16,LoRA dropout率0.05。
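From these reported settings, one can sanity-check the training scale (assuming no gradient accumulation, which the paper does not specify):

```python
import math

# Effective batch and steps per epoch for stage 1, from the reported
# settings: 8 GPUs, per-GPU batch size 12, ~340,000 stage-1 instances.
gpus, per_gpu_batch = 8, 12
effective_batch = gpus * per_gpu_batch          # 96 sequences per optimizer step
stage1_instances = 340_000
steps_per_epoch = math.ceil(stage1_instances / effective_batch)
print(effective_batch, steps_per_epoch)  # → 96 3542
```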

RESULTS

结果

Evaluation tasks, datasets and metrics

评估任务、数据集和指标

To investigate the capability of the Taiyi model for various bilingual BioNLP tasks, we selected four task types (i.e., NER, RE, TC, and multiple-choice QA) for metric-based evaluation and used the other tasks for the case study. The statistics of the test sets are shown in Supplementary Table S3. Further details for each evaluation task are provided below.

为评估太乙模型在多种双语生物自然语言处理任务中的能力,我们选取了四种任务类型(即命名实体识别(NER)、关系抽取(RE)、文本分类(TC)和多项选择问答)作为指标评估对象,其他任务则用于案例分析。测试集统计数据详见补充材料表S3。各评估任务的具体细节如下:

Biomedical Named Entity Recognition (NER). Biomedical NER aims to identify predefined biomedical entities from text, such as diseases, drugs, genes, and proteins. Six biomedical NER datasets (i.e., BC5CDR-Chemical [32], BC5CDR-Disease [32], CHEMDNER [33], NCBI-Disease [34], BioRED [35] and CMeEE-dev [28]) are used for our NER evaluation. The exact-match (the predicted entity text and entity type, ignoring span) micro F1-score was used as the evaluation metric.

生物医学命名实体识别 (NER)。生物医学NER旨在从文本中识别预定义的生物医学实体,如疾病、药物、基因和蛋白质。我们使用了六个生物医学NER数据集(即BC5CDR-Chemcial [32]、BC5CDR-disease [32]、CHEMDNER [33]、NCBI-Disease [34]、BioRED [35]和CMeEE-dev [28])进行NER评估。评估指标采用精确匹配(预测实体文本和实体类型,不考虑跨度)的微平均F1分数。
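The exact-match micro F1 described here can be computed as in the sketch below, treating each document's annotations as a multiset of (entity text, entity type) pairs; this is our illustration of the metric, not the authors' evaluation script:

```python
from collections import Counter

def micro_f1(gold_docs, pred_docs):
    """Exact-match micro F1 over (entity text, entity type) pairs,
    aggregated across documents (spans are ignored, as described)."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        g, p = Counter(gold), Counter(pred)
        overlap = sum((g & p).values())          # exact matches in both sets
        tp += overlap
        fp += sum(p.values()) - overlap          # predicted but not gold
        fn += sum(g.values()) - overlap          # gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [[("famotidine", "Chemical"), ("delirium", "Disease")]]
pred = [[("famotidine", "Chemical"), ("sepsis", "Disease")]]
print(round(micro_f1(gold, pred), 3))  # → 0.5
```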

Biomedical Relation Extraction (RE). Biomedical RE aims to automatically extract predefined relations or associations between biomedical entities from text. In this task, we focus on extracting the entity relation triple (head entity, tail entity, relation type). We selected one English dataset, BC5CDR, and one Chinese dataset, CMeIE-dev [28], for the RE evaluation. The micro F1-score was used as the RE metric.

生物医学关系抽取(RE)。生物医学关系抽取旨在从文本中自动提取预定义的生物医学实体间关系或关联。本任务聚焦于抽取实体关系三元组(头实体, 尾实体, 关系类型)。我们选取了英文数据集BC5CDR和中文数据集CMeIE-dev [28] 进行关系抽取评估,采用微平均F1值作为评价指标。

Biomedical Text Classification (TC). Biomedical TC aims to automatically categorize texts into predefined biomedical class labels. Two manually annotated English multi-label document classification datasets (BC7LitCovid [36] and HoC [37]) and one Chinese dataset (KUAKE-QIC [28]) are used for the TC evaluation. The micro F1-score was used as the metric.

生物医学文本分类 (TC)。生物医学文本分类旨在将文本自动归类到预定义的生物医学类别标签中。评估使用了两个手动标注的英文多标签文档分类数据集 (BC7LitCovid [36] 和 HoC [37]) 以及一个中文数据集 (KUAKE-QIC [28]),并以微平均 F1 分数作为评估指标。

Biomedical Multiple-choice Question Answering (QA-mc). The biomedical multiple-choice question answering task aims to answer multiple-choice questions related to biomedicine and healthcare. PubMedQA [38] and MedQA-USMLE (4-option) [39] are used to evaluate the model performance in English. MedQA-MCMLE (4-option) [39] is used to evaluate the model performance in Chinese. We used accuracy as the metric for this task.

生物医学多选题问答 (QA-mc)。生物医学多选题问答任务旨在回答与生物医学和医疗保健相关的选择题。PubMedQA [38] 和 MedQA-USMLE (4选项) [39] 用于评估模型在英语中的表现。MedQA-MCMLE (4选项) [39] 用于评估模型在中文中的表现。我们使用准确率作为该任务的评估指标。

In addition to the four primary evaluation tasks, the capabilities of the Taiyi model were also shown through examples on several supplementary biomedical NLP tasks, such as Medical Report Generation, Biomedical Event Extraction, Biomedical Machine Translation, Biomedical Title Generation, Biomedical Text Semantic Similarity, and Biomedical Question Answering and Chat. Details can be found in the Supplementary Examples of Taiyi output on the supplementary biomedical NLP tasks.

除四项主要评估任务外,太乙模型的能力还通过若干辅助生物医学NLP任务的示例得以展示,例如医疗报告生成、生物医学事件抽取、生物医学机器翻译、生物医学标题生成、生物医学文本语义相似度、生物医学问答及对话。具体细节可参阅《太乙模型在辅助生物医学NLP任务中的输出补充示例》。

Effectiveness of the two-stage fine-tuning strategy

两阶段微调策略的有效性

In this experiment, we tested the effect of our two-stage fine-tuning strategy on the four tasks. For comparison, we simply combined all data of the four tasks to fine-tune the base model as the baseline. As the inference cost of LLMs is high, we randomly sampled 200 instances from the test set of each dataset. The performances of the models using the simple one-stage and the two-stage strategies on these tasks are shown in Figure 4.

在本实验中,我们测试了两阶段微调策略对四项任务的效果。作为对比基线,我们直接将四项任务的所有数据合并用于微调基础模型。由于大语言模型成本较高,我们从每个数据集的测试集中随机抽取了200个样本。图4展示了使用单阶段简单策略和两阶段策略的模型在各项任务上的表现。


Figure 4. Performances of models using different fine-tuning strategies. One-stage denotes the model fine-tuned by combining all task datasets. Two-stage denotes the model fine-tuned by our two-stage strategy. (A) the results on the NER tasks. (B) the results on the RE tasks. (C) the results on the TC tasks. (D) the results on the QA-mc tasks.

图 4: 采用不同微调策略的模型性能对比。One-stage表示将所有任务数据集合并进行微调的模型,Two-stage表示采用我们提出的两阶段策略微调的模型。(A) NER任务结果。(B) RE任务结果。(C) TC任务结果。(D) QA-mc任务结果。

When we simply combined all task datasets as a one-stage strategy to fine-tune the LLM, the results show poor performance on all tasks. The main reason may be the challenge of converging on different tasks with varying levels of difficulty and dataset sizes. Therefore, we first fine-tuned the model on the Type1 tasks described in the section on Two-stage Supervised Instruction Fine-tuning independently, then combined all data to continue fine-tuning the model in the second stage. The results show that our two-stage strategy outperforms the one-stage strategy on all English and Chinese tasks and achieves a significant average improvement (~10% in the metrics). The two-stage approach allows the model to first become specialized on tasks that are not generation tasks in nature before developing more generalized capabilities across tasks in the second stage.

当我们简单地合并所有任务数据集作为单阶段策略来微调大语言模型时,结果显示在所有任务上表现不佳。主要原因可能是难以收敛不同难度级别和数据规模的任务。因此,我们首先在两阶段监督指令微调章节描述的Type1任务上独立微调模型,然后在第二阶段合并所有数据继续微调模型。结果表明,我们的两阶段策略在所有英文和中文任务上都优于单阶段策略,并在指标上实现了显著的平均提升(约10%)。两阶段方法使模型能够在第二阶段开发跨任务的更通用能力之前,首先专注于本质上非生成类的任务。

Performance of Taiyi on the entire test sets

太乙模型在完整测试集上的表现

The previous experiment demonstrates our two-stage approach achieves significant improvements on the subset of the test sets compared to the simple one-stage training strategy. In this experiment, we evaluate the performance of our Taiyi on the entire test sets of the 13 tasks covering 4 BioNLP task types. The results of ChatGPT 3.5 (GPT-3.5-Turbo) and the state-of-the-art (SOTA) methods based on supervised pretrained language models for each corpus are provided for comparison.

先前实验表明,与简单单阶段训练策略相比,我们的两阶段方法在测试集子集上取得了显著提升。本实验中,我们评估了Taiyi在涵盖4类BioNLP任务的13个完整测试集上的性能,并提供ChatGPT 3.5 (GPT-3.5-Turbo)及各语料库基于监督预训练语言模型的当前最优(SOTA)方法结果作为对比。

Table 2. Performance comparison with other existing methods on the 13 BioNLP tasks

| Task | Datasets | Taiyi | ChatGPT 3.5 | SOTA |
| --- | --- | --- | --- | --- |
| NER (Micro-F1) | BC5CDR-Chem (en) | 0.802 | 0.603 [40] | 0.933 (PubMedBERT [40]) |
| | BC5CDR-Dise (en) | 0.691 | 0.518 [40] | 0.856 (PubMedBERT [40]) |
| | CHEMDNER (en) | 0.799 | 0.365 [41] | 0.924 (BioBERT [42]) |
| | NCBI-Disease (en) | 0.731 | 0.505 [40] | 0.878 (PubMedBERT [40]) |
| | CMeEE (zh) | 0.657 | 0.470 [43] | 0.740 (CBLUE-Leaderboard [28]) |
| RE (Micro-F1) | BC5CDR (en) | 0.375 | 0.142 | 0.450 (BioGPT [44]) |
| | CMeIE (zh) | 0.432 | 0.306 [43] | 0.549 (CBLUE-Leaderboard [28]) |
| TC (Micro-F1) | BC7LitCovid (en) | 0.840 | 0.639 [45] | 0.918 (Bioformer [46]) |
| | HoC (en) | 0.800 | 0.512 [40] | 0.823 (PubMedBERT [40]) |
| | KUAKE-QIC (zh) | 0.774 | 0.485 [43] | 0.859 (CBLUE-Leaderboard [28]) |
| QA (Accuracy) | PubMedQA (en) | 0.544 | 0.765 [40] | 0.558 (PubMedBERT [40]) |
| | MedQA-USMLE (en) | 0.371 | 0.513 [47] | 0.367 (BioBERT-large [39]) |
| | MedQA-MCMLE (zh) | 0.648 | 0.582 [47] | 0.701 (RoBERTa-large [39]) |
| All | Average | 0.651 | 0.493 | 0.735 |

Note: For ChatGPT 3.5 and the SOTA methods, we report previously published results. Since no published ChatGPT 3.5 result is available for relation extraction on the BC5CDR test set, we obtained it using the same prompt via the OpenAI API. For the CMeEE, CMeIE and KUAKE_QIC datasets, the results of Taiyi are reported on the development sets since the gold-standard test sets have not been released. Although they may not be directly comparable, we still list them for reference.

表 2. 在13项BioNLP任务上与其他现有方法的性能对比

| 任务 | 数据集 | Taiyi | ChatGPT 3.5 | SOTA |
| --- | --- | --- | --- | --- |
| NER (Micro-F1) | BC5CDR-Chem (en) | 0.802 | 0.603 [40] | 0.933 (PubMedBERT [40]) |
| | BC5CDR-Dise (en) | 0.691 | 0.518 [40] | 0.856 (PubMedBERT [40]) |
| | CHEMDNER (en) | 0.799 | 0.365 [41] | 0.924 (BioBERT [42]) |
| | NCBI-Disease (en) | 0.731 | 0.505 [40] | 0.878 (PubMedBERT [40]) |
| | CMeEE (zh) | 0.657 | 0.470 [43] | 0.740 (CBLUE-Leaderboard [28]) |
| RE (Micro-F1) | BC5CDR (en) | 0.375 | 0.142 | 0.450 (BioGPT [44]) |
| | CMeIE (zh) | 0.432 | 0.306 [43] | 0.549 (CBLUE-Leaderboard [28]) |
| TC (Micro-F1) | BC7LitCovid (en) | 0.840 | 0.639 [45] | 0.918 (Bioformer [46]) |
| | HOC (en) | 0.800 | 0.512 [40] | 0.823 (PubMedBERT [40]) |
| | KUAKE_QIC (zh) | 0.774 | 0.485 [43] | 0.859 (CBLUE-Leaderboard [28]) |
| QA (Accuracy) | PubMedQA (en) | 0.544 | 0.765 [40] | 0.558 (PubMedBERT [40]) |
| | MedQA-USMLE (en) | 0.371 | 0.513 [47] | 0.367 (BioBERT-large [39]) |
| | MedQA-MCMLE (zh) | 0.648 | 0.582 [47] | 0.701 (RoBERTa-large [39]) |
| ALL | AVE | 0.651 | 0.493 | 0.735 |

注:对于ChatGPT 3.5和SOTA方法的结果,我们提供了已发表论文中的先前结果。由于ChatGPT 3.5在BC5CDR测试集上的关系抽取结果尚未发布,我们通过OpenAI API使用相同的提示获得了ChatGPT的结果。对于CMeEE、CMeIE和KUAKE_QIC数据集,Taiyi的结果是在开发集上报告的,因为黄金标准测试集尚未发布。尽管它们可能无法直接比较,但我们仍将其列出以供参考。

As shown in Table 2, our Taiyi outperforms ChatGPT 3.5 on 11 of the 13 datasets, the exceptions being the two English QA datasets. This shows that fine-tuning an LLM on domain-specific instruction data can improve performance on in-domain downstream tasks. However, the lower English QA results suggest that rich domain knowledge is largely acquired during the pre-training stage of LLMs and is difficult to compensate for via instruction fine-tuning. Compared with the SOTA pre-trained language models, Taiyi achieves comparable results on the QA tasks. However, its performance still falls short by a margin (on average $\sim9%$ in the metrics) on the NER, RE, and TC tasks, where conventional discriminative methods outperform generative methods.

如表 2 所示,我们的 Taiyi 在 13 个数据集中的 11 个上表现优于 ChatGPT3.5,除了两个英文问答数据集。这表明针对特定领域指令数据微调的大语言模型可以提高领域内下游任务的性能。然而,较低的英文问答结果说明丰富的领域知识可能来自大语言模型的预训练阶段,这很难通过指令微调来弥补。与最先进的预训练语言模型相比,Taiyi 模型在问答任务上取得了相当的结果。但在命名实体识别 (NER) 、关系抽取 (RE) 和文本分类 (TC) 任务上,Taiyi 的性能仍存在一定差距 (指标平均下降约 9%) ,这些任务中传统的判别式方法通常优于生成式方法。
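For reference, the micro-F1 used for the extraction tasks in Table 2 pools true positives over the whole test set before computing F1. Below is a minimal sketch with invented entities, not benchmark data:

```python
def micro_f1(gold, pred):
    """gold/pred: one set of (mention, type) pairs per document; strict matching."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))      # pooled true positives
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example (entities are made up for illustration):
gold = [{("aspirin", "Chemical"), ("headache", "Disease")}, {("naloxone", "Chemical")}]
pred = [{("aspirin", "Chemical")}, {("naloxone", "Chemical"), ("fever", "Disease")}]
score = micro_f1(gold, pred)   # tp=2 over 3 predicted and 3 gold -> P = R = F1 = 2/3
```

Micro averaging weights every mention equally, so large datasets and frequent entity types dominate the score, which is the convention the compared systems also follow.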

Performance of Taiyi on the new task

太乙在新任务上的表现

To investigate whether the Taiyi model can support a new task not seen in our fine-tuning data, we held out a recently published biomedical corpus, BioRED, for evaluation. BioRED is a biomedical relation extraction dataset with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene–disease; chemical–chemical) at the document level. Here, we focus on the NER performance of Taiyi on the BioRED test set.

为了探究太乙模型能否应用于支持微调数据中未见过的新任务,我们使用最新发布的生物医学语料库BioRED进行评估。BioRED是一个文档级生物医学关系抽取数据集,包含多种实体类型(如基因/蛋白质、疾病、化学物质)和关系对(如基因-疾病、化学物质-化学物质)。本文主要关注太乙模型在BioRED测试集上的命名实体识别(NER)性能。

Table 3. Performance of the Taiyi model on the BioRED test set

| Entity Type | P | R | F1 |
| --- | --- | --- | --- |
| Chemical | 0.717 | 0.566 | 0.633 |
| Disease | 0.829 | 0.533 | 0.649 |
| Gene | 0.931 | 0.490 | 0.642 |
| Variant | 0.633 | 0.585 | 0.608 |
| Species | 0.640 | 0.286 | 0.395 |
| Cell line | 0.213 | 0.565 | 0.310 |
| All | 0.760 | 0.507 | 0.608 |

表 3: Taiyi模型在BioRED测试集上的性能表现

| 实体类型 | P | R | F1 |
| --- | --- | --- | --- |
| Chemical | 0.717 | 0.566 | 0.633 |
| Disease | 0.829 | 0.533 | 0.649 |
| Gene | 0.931 | 0.490 | 0.642 |
| Variant | 0.633 | 0.585 | 0.608 |
| Species | 0.640 | 0.286 | 0.395 |
| Cell line | 0.213 | 0.565 | 0.310 |
| All | 0.760 | 0.507 | 0.608 |

As shown in Table 3, Taiyi obtains promising results (an overall F1-score of 0.608) on the new BioRED NER task, significantly higher than the performance of ChatGPT (0.391 in F1-score [41]) on the same test set. Among the entity types, Taiyi performs better on chemical, disease, gene, and variant entities, since these types appear in other related datasets (e.g., BC5CDR, GnormPlus, and tmVar) during the SFT stage. Moreover, it is noteworthy that even for the entity types unseen in the SFT stage (i.e., species and cell lines), Taiyi still understands the entity recognition instruction and extracts some species and cell line entities. These results suggest that Taiyi is highly robust and generalizable.

如表 3 所示,Taiyi 在新的 BioRED NER 任务中取得了优异的结果 (F1 分数总体性能为 0.608)。该性能显著高于 ChatGPT 在测试集上的表现 (F1 分数为 0.391 [41])。在这些实体类型中,Taiyi 对化学物质、疾病、基因和变异实体表现更优,因为这些实体在 SFT 阶段已出现在其他相关数据集 (如 BC5CDR、GnormPlus 和 tmVar) 中。此外值得注意的是,即使对于 SFT 阶段未见的实体类型 (即物种和细胞系),Taiyi 仍能理解实体识别指令并提取部分物种和细胞系实体。这些结果表明 Taiyi 具有高度鲁棒性和泛化能力。
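The per-type precision/recall/F1 breakdown reported in Table 3 amounts to a grouped count of strict matches; a minimal sketch with invented entities is shown below.

```python
from collections import defaultdict

def per_type_prf(gold_docs, pred_docs):
    """Each doc is an iterable of (mention, type) pairs; returns {type: (P, R, F1)}."""
    tp, n_pred, n_gold = defaultdict(int), defaultdict(int), defaultdict(int)
    for g_doc, p_doc in zip(gold_docs, pred_docs):
        g, p = set(g_doc), set(p_doc)
        for _, t in g & p:      # strict match on both mention and type
            tp[t] += 1
        for _, t in p:
            n_pred[t] += 1
        for _, t in g:
            n_gold[t] += 1
    scores = {}
    for t in set(n_pred) | set(n_gold):
        prec = tp[t] / n_pred[t] if n_pred[t] else 0.0
        rec = tp[t] / n_gold[t] if n_gold[t] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[t] = (prec, rec, f1)
    return scores

# Toy single-document example (entities invented for illustration):
scores = per_type_prf(
    [[("BRCA1", "Gene"), ("breast cancer", "Disease")]],
    [[("BRCA1", "Gene")]],
)
```

In the toy run, the Gene row is perfect while the Disease row has zero recall, mirroring how the "All" row in Table 3 hides very different per-type behavior.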

Case study

案例研究

To further investigate the benefits of in-domain instruction fine-tuning, we qualitatively analyzed model outputs on various BioNLP tasks and compared Taiyi to Qwen7B-Chat, a published LLM fine-tuned with general-domain instruction data on the same base model Qwen7B. Table 4 shows the answers produced by the different models for the same questions in a Chinese chat. From the results, Taiyi generates more detailed and accurate responses to biomedical questions than the general-domain model Qwen-Chat. More examples for different BioNLP tasks can be found in the Supplementary Case study.

为深入探究领域内指令微调的益处,我们对不同生物医学自然语言处理(BioNLP)任务的模型输出进行了定性分析,并将Taiyi与基于同款基础模型Qwen7B开发的通用领域指令微调大语言模型Qwen7B-Chat进行对比。表4展示了不同模型在中文问答场景下对相同问题的回答。结果显示,相较于通用领域模型Qwen-Chat,Taiyi针对生物医学问题能生成更详尽准确的回答。更多不同BioNLP任务的对比案例可参阅补充案例研究部分。

Input (1): 什么是鼠疫?(What is plague?)

输入 (1): 什么是鼠疫? (What is plague?)

Taiyi:

太乙:

鼠疫是一种由鼠疫杆菌引起的传染病,主要通过跳蚤等媒介传播。鼠疫杆菌主要存在于啮齿类动物体内,当啮齿类动物死亡后,跳蚤等昆虫会吸食其体液,从而将鼠疫杆菌带入人体,引起鼠疫感染。鼠疫的症状包括高热、寒战、淋巴结肿大、咳嗽、呼吸急促等,严重时可导致死亡。鼠疫是一种古老的传染病,历史上曾多次引起大规模的流行病,如14世纪的黑死病,给欧洲造成了巨大的人口和经济损失。(Plague is an infectious disease caused by Yersinia pestis and is mainly spread through fleas and other vectors. Yersinia pestis mainly exists in rodents. When rodents die, fleas and other insects will suck their body fluids, thereby introducing Yersinia pestis into the human body, causing plague infection. Symptoms of plague include high fever, chills, swollen lymph nodes, cough, shortness of breath, etc., which can lead to death in severe cases. Plague is an ancient infectious disease that has caused large-scale epidemics many times in history, such as the Black Death in the 14th century, which caused huge human and economic losses to Europe.)

鼠疫是一种由鼠疫杆菌 (Yersinia pestis) 引起的传染病,主要通过跳蚤等媒介传播。鼠疫杆菌主要存在于啮齿类动物体内,当啮齿类动物死亡后,跳蚤等昆虫会吸食其体液,从而将鼠疫杆菌带入人体,引起鼠疫感染。鼠疫的症状包括高热、寒战、淋巴结肿大、咳嗽、呼吸急促等,严重时可导致死亡。鼠疫是一种古老的传染病,历史上曾多次引起大规模的流行病,如14世纪的黑死病 (Black Death) ,给欧洲造成了巨大的人口和经济损失。

Qwen-Chat:

Qwen-Chat:

鼠疫是一种由鼠疫杆菌引起的急性传染病,它可以在动物之间传播,并且可以通过接触染病动物的排泄物或携带病毒的动物及其分泌物传播给人类。该病的症状包括发热、寒战、淋巴结肿大、呼吸困难和出血等。(Plague is an acute infectious disease caused by Yersinia pestis, which can be transmitted between animals and can be transmitted to humans through contact with the excrement of infected animals or animals carrying the virus and their secretions. Symptoms of the disease include fever, chills, swollen lymph nodes, difficulty breathing and bleeding.)

鼠疫是一种由鼠疫杆菌 (Yersinia pestis) 引起的急性传染病,它可以在动物之间传播,并且可以通过接触染病动物的排泄物或携带病毒的动物及其分泌物传播给人类。该病的症状包括发热、寒战、淋巴结肿大、呼吸困难和出血等。

Input (2):

输入 (2):

有什么预防方法吗?(Are there any prevention methods?)

有什么预防方法吗? (Are there any prevention methods?)

Taiyi:

太乙:

是的,预防鼠疫的方法包括以下几点:

1. 避免接触感染源:尽量避免接触啮齿类动物,如老鼠、土拨鼠等,不要在它们的巢穴附近逗留。
2. 注意个人卫生:保持手部卫生,勤洗手,避免接触有感染风险的物品,如被感染的衣物、被褥等。
3. 食用安全:避免食用被感染的动物或其制品,如鼠肉、鼠肉制品等。
4. 居住环境:保持家居环境清洁卫生,定期消毒,特别是鼠洞、鼠粪等易滋生细菌的地方。
5. 预防接种:对于鼠疫高发区的人群,可以接种鼠疫疫苗,提高免疫力。

总之,预防鼠疫需要从多个方面入手,保持个人卫生、注意饮食安全、保持家居环境清洁卫生等都是预防鼠疫的有效方法。

(Yes, ways to prevent plague include the following:

1. Avoid contact with sources of infection: Try to avoid contact with rodents, such as mice, prairie dogs, etc., and do not stay near their nests.
2. Pay attention to personal hygiene: maintain hand hygiene, wash hands frequently, and avoid contact with items with risk of infection, such as infected clothing, bedding, etc.
3. Food safety: Avoid eating infected animals or their products, such as rat meat and rat meat products.
4. Home environment: Keep the home environment clean and sanitary, and disinfect it regularly, especially rat holes, rat droppings and other places where bacteria can easily breed.
5. Vaccination: People in areas with a high incidence of plague can be vaccinated against plague to improve their immunity.

In short, the prevention of plague requires efforts on multiple fronts. Maintaining personal hygiene, paying attention to food safety, and keeping the home environment clean and hygienic are all effective ways to prevent plague.)

Qwen-Chat:

Qwen-Chat:

预防鼠疫的最有效方法是避免接触携带鼠疫杆菌的动物和它们的排泄物,以及不触摸被污染的物品和表面。另外,应保持个人卫生,经常洗手,避免进食未煮熟的食物,以及及时接种疫苗来预防鼠疫的发生。(The most effective way toprevent plague is to avoid contact with animals that carry Yersinia pestis and their excrement, as well as not touching contaminated objects and surfaces. In addition, you should maintain personal hygiene, wash your hands frequently, avoid eating undercooked food, and get vaccinated in time to prevent the occurrence of plague.)

预防鼠疫的最有效方法是避免接触携带鼠疫杆菌 (Yersinia pestis) 的动物和它们的排泄物,以及不触摸被污染的物品和表面。另外,应保持个人卫生,经常洗手,避免进食未煮熟的食物,以及及时接种疫苗来预防鼠疫的发生。

Note that the English in brackets is not part of the input and output; it is the translation of the Chinese.

注意:括号内的英文并非输入输出内容,而是中文的翻译说明。

DISCUSSION

讨论

Due to the complexity of biomedical data and knowledge gaps across tasks, general LLMs often generate meaningless or completely unrelated responses to a given biomedicine input, or users need to design professional prompts to improve LLM performance on biomedical tasks. To mitigate such issues, we elaborately fine-tuned the LLM on bilingual biomedical data across multiple biomedical tasks. For adapting an LLM to biomedicine, it is crucial to establish a data-centric infrastructure and fine-tuning procedure. In our study, we collected a large number of in-domain training datasets, filtered out high-quality subsets, and optimized reasonable ratios for each dataset. Including the tasks that are not generation tasks in nature (e.g., NER and RE), all tasks are converted into generation tasks via instruction templates to train the model. The proposed two-stage SFT strategy is shown to effectively improve model performance. Owing to these efforts, Taiyi can understand bilingual biomedical task instructions and has strong robustness and generalization capabilities on a variety of tasks.

由于生物医学数据的复杂性和跨任务间的知识鸿沟,通用大语言模型往往会对给定的生物医学输入生成无意义或完全无关的响应,用户需要设计专业提示词(prompt)来提升大语言模型在生物医学任务中的表现。为解决这些问题,我们在多任务双语生物医学数据上对模型进行了精细微调。针对生物医学领域的LLM优化,建立以数据为中心的基础设施和微调流程至关重要。本研究收集了大量领域内训练数据集,筛选出高质量子集并优化了各数据集的比例配比。对于本质上非生成式任务(如NER和RE),均通过指令模板转化为生成式任务进行模型训练。提出的两阶段SFT策略被证明能有效提升模型性能。得益于这些工作,Taiyi能够理解双语生物医学任务指令,并在多样化任务中展现出强大的鲁棒性和泛化能力。

However, Taiyi still has some common limitations of LLMs, including hallucinations, bias and fairness issues, lack of common sense, and deficient biomedical knowledge. For example, when we input the sentence "Please introduce the octacyclines among antibiotics", Taiyi generated the response "Octacyclines are a group of antibiotics that are structurally similar to each other and share a common chemical structure. They are characterized by having a 1,4-dioxane ring in their core structure……". In fact, "octacyclines" are fictitious and there are no such antibiotics. Such hallucinations may potentially lead to severe medical malpractice. Moreover, Taiyi achieves better performance on the Chinese QA task but lower scores on the English QA tasks than ChatGPT 3.5, as shown in Table 2. This suggests that rich biomedical knowledge is largely learned during the pre-training stage of LLMs and is difficult to compensate for via simple instruction fine-tuning. In practice, it is very challenging to train capable biomedical LLMs from scratch, due to the huge compute consumption and the sensitivity to data quality and training tricks. Therefore, it is useful to develop effective tuning strategies and use additional biomedical resources to inject specific knowledge. Our future work will focus on these problems: incorporating knowledge resources (e.g., biomedical knowledge databases and factual information obtained by retrieval technology), improving the biomedical interpretability of the model's output, and aligning with human intentions to improve safety in the medical field.

然而,Taiyi仍存在大语言模型的常见局限,包括幻觉、偏见与公平性缺失、常识缺乏及生物医学知识不足。例如,当我们输入句子"请介绍抗生素中的Octacyclines"时,Taiyi生成的回应是"Octacyclines是一组结构相似且具有共同化学结构的抗生素,其核心结构特征为含有1,4-二氧六环......"。实际上"Octacyclines"是虚构的,并不存在此类抗生素。这种幻觉可能导致严重的医疗事故。此外,如 表2 所示,Taiyi在中文问答任务上表现优于ChatGPT 3.5,但在英文问答任务中得分较低。这表明丰富的生物医学知识可能来自大语言模型的预训练阶段,仅通过简单的指令微调难以弥补。实践中,由于巨大的计算消耗及对数据质量和训练技巧的敏感性,从头训练具备生物医学能力的大语言模型极具挑战性。因此,开发有效的微调策略并利用额外生物医学资源注入特定知识具有重要意义。我们未来的工作将聚焦于以下方向:整合知识资源(如生物医学知识库和检索技术获取的事实信息)、提升模型输出的生物医学可解释性、以及对齐人类意图以提高医疗领域安全性。

CONCLUSION

结论

In this study, we collected diverse bilingual (English and Chinese) BioNLP datasets and standardized their formats. Leveraging these rich training resources and the proposed two-stage supervised fine-tuning approach, Taiyi shows considerable capability on various BioNLP tasks. Furthermore, Taiyi exhibits cross-lingual generalization across similar task scenarios while retaining general-domain conversational abilities. Overall, utilizing rich high-quality biomedical corpora and designing effective fine-tuning strategies can substantially enhance the performance of LLMs within the biomedical domain. Our future work will focus on further enhancing Taiyi's task capabilities, interpretability, and security for biomedical applications.

在本研究中,我们收集了多样化的双语(英文和中文)BioNLP数据集并统一了其格式。利用这些丰富的训练资源和提出的两阶段监督微调方法,Taiyi在各类BioNLP任务中展现出显著能力。此外,Taiyi在相似任务场景中表现出跨语言泛化能力,同时保持通用领域的对话能力。总体而言,利用高质量生物医学语料库并设计有效的微调策略,可大幅提升大语言模型在生物医学领域的性能。我们未来的工作将集中于进一步增强Taiyi在生物医学应用中的任务能力、可解释性和安全性。

Ethical Considerations

伦理考量

This research demonstrates the potential of LLMs in the biomedical domain. The transition from using the Taiyi LLM for doctor-patient dialogue and medical report generation to practical applications in medical services will require abundant additional research to ensure the safety of this technology. In addition, strict expert evaluation for different medical scenarios must be considered in deployment to enable early discovery of diagnostic errors. Noteworthy aspects also include biases and security vulnerabilities inherited from the foundation models.

本研究展示了大语言模型(LLM)在生物医学领域的潜力。从使用Taiyi大语言模型进行医患对话和医疗报告生成,到医疗服务中的实际应用,还需要大量额外研究来确保该技术的安全性。此外,在部署时必须考虑针对不同医疗场景的严格专家评估,以实现早期诊断错误发现。同样值得注意的方面还包括基础模型继承的偏见和安全漏洞。

FUNDING

资金支持

This research was supported by the National Natural Science Foundation of China (No. 62302076) and the Fundamental Research Funds for the Central Universities [No. DUT23RC(3)014].

本研究得到国家自然科学基金(No. 62302076)和中央高校基本科研业务费专项资金[No. DUT23RC(3)014]的资助。

AUTHOR CONTRIBUTIONS

作者贡献

Conception and design: LL, ZY, HL. Data collection and processing: LL, YZ, PC, WrF, YQ, DP, JL, HL, WdF, ST, YL. Model training and evaluation: JN, ZW, LL, ZD, QH, GX. Analysis and interpretation: LL, JN, YZ, YS. Drafting the manuscript: LL, JN, YZ, PC, ZD, WrF, QH. Revising the drafted manuscript: LL, JW, YS, ZY, HL. All authors approved the submitted version.

构思与设计:LL、ZY、HL。数据收集与处理:LL、YZ、PC、WrF、YQ、DP、JL、HL、WdF、ST、YL。模型训练与评估:JN、ZW、LL、ZD、QH、GX。分析与解读:LL、JN、YZ、YS。撰写初稿:LL、JN、YZ、PC、ZD、WrF、QH。修订稿件:LL、JW、YS、ZY、HL。所有作者均认可提交版本。

SUPPLEMENTARY MATERIAL

补充材料

Supplementary material is available at Journal of the American Medical Informatics Association online.

补充材料可在《美国医学信息学会杂志》在线版获取。

CONFLICT OF INTEREST STATEMENT None declared.

利益冲突声明 无。

DATA AVAILABILITY

数据可用性

The benchmark datasets that support the findings of this study are available from the official websites of natural language processing challenges with Data Use Agreements. The data information and model weights of Taiyi are available at https://github.com/DUTIR-BioNLP/Taiyi-LLM.

支持本研究结果的基准数据集可从签署数据使用协议的自然语言处理挑战赛官方网站获取。Taiyi的数据信息和模型权重可在https://github.com/DUTIR-BioNLP/Taiyi-LLM查看。

REFERENCES

参考文献

Human Language Technologies, Volume 1 (Long and Short Papers); 2019.

人类语言技术,第1卷(长论文与短论文);2019。

  1. Cui J, Li Z, Yan Y, Chen B, Yuan L. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092 2023
  2. Cui J, Li Z, Yan Y, Chen B, Yuan L. Chatlaw: 集成外部知识库的开源法律大语言模型. arXiv preprint arXiv:2306.16092 2023
  3. Kasneci E, Seßler K, Küchemann S, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and individual differences 2023;103:102274
  4. Kasneci E, Seßler K, Küchemann S 等. ChatGPT 有益教育吗?论大语言模型在教育领域的机遇与挑战. 学习与个体差异 2023;103:102274
  5. Wang Y, Ma X, Chen W. Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering. arXiv preprint arXiv:2309.02233 2023
  6. Wang Y, Ma X, Chen W. 基于医学教材增强黑盒大语言模型的临床问答能力. arXiv预印本 arXiv:2309.02233 2023

Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks

Taiyi: 面向多样化生物医学任务的双语微调大语言模型

Supplementary Material

补充材料

Table S1. Summarization of existing LLMs in the biomedical domain

| Model Name | Base | Language | Training method | SFT data |
| --- | --- | --- | --- | --- |
| GatorTron [1] | Transformer | En | PT+SFT | BioNLP |
| Med-PaLM [2] | PaLM | En | SFT | QA |
| ChatDoctor [3] | LLaMA | En | SFT | QA and Chat |
| MedAlpaca [4] | LLaMA | En | SFT | QA and Chat |
| PMC-LLaMA [5] | LLaMA | En | CPT+SFT | QA and Chat |
| BenTsao [6] | ChatGLM | Zh | SFT | QA and Chat |
| DoctorGLM [7] | ChatGLM | Zh | SFT | QA and Chat |
| HuatuoGPT [8] | BLOOMZ | Zh | SFT+RL | QA and Chat |
| ClinicalGPT [9] | BLOOM | Zh | SFT+RL | QA and Chat |
| DISC-MedLLM [10] | Baichuan | Zh | SFT | QA and Chat |
| Zhongjing [11] | Ziya-LLaMA | Zh | CPT+SFT+RL | BioNLP |
| ChiMed-GPT [12] | Ziya-LLaMA | Zh | CPT+SFT+RL | QA, Chat |

PT denotes pre-training from scratch; CPT denotes continued pre-training; SFT denotes supervised fine-tuning; RL denotes reinforcement learning.

表 S1: 生物医学领域现有大语言模型汇总

| 模型名称 | 基础架构 | 语言 | 训练方法 | SFT数据 |
| --- | --- | --- | --- | --- |
| GatorTron [1] | Transformer | En | PT+SFT | BioNLP |
| Med-PaLM [2] | PaLM | En | SFT | QA |
| ChatDoctor [3] | LLaMA | En | SFT | QA和对话 |
| MedAlpaca [4] | LLaMA | En | SFT | QA和对话 |
| PMC-LLaMA [5] | LLaMA | En | CPT+SFT | QA和对话 |
| BenTsao [6] | ChatGLM | Zh | SFT | QA和对话 |
| DoctorGLM [7] | ChatGLM | Zh | SFT | QA和对话 |
| HuatuoGPT [8] | BLOOMZ | Zh | SFT+RL | QA和对话 |
| ClinicalGPT [9] | BLOOM | Zh | SFT+RL | QA和对话 |
| DISC-MedLLM [10] | Baichuan | Zh | SFT | QA和对话 |
| Zhongjing [11] | Ziya-LLaMA | Zh | CPT+SFT+RL | BioNLP |
| ChiMed-GPT [12] | Ziya-LLaMA | Zh | CPT+SFT+RL | QA和对话 |

PT表示从头预训练;CPT表示持续预训练;SFT表示监督微调;RL表示强化学习。

Table S2. Examples of the instruction data

Task: NER

表 S2: 指令数据示例

Task: NER

Ex1(zh):

Ex1(zh):

Input:

输入:

从下面文本中识别出指定的实体类型:儿童 SARST 细胞亚群的研究表明,与成人 SARS 相比,儿童细胞下降不明显,证明上述推测成立。实体类型: 微生物类,疾病,药物,医疗程序,医疗设备,临床表现,科室,身体,医学检验项目

从下面文本中识别出指定的实体类型: 儿童 SARST 细胞亚群的研究表明,与成人 SARS 相比,儿童细胞下降不明显,证明上述推测成立。实体类型: 微生物类,疾病,药物,医疗程序,医疗设备,临床表现,科室,身体,医学检验项目

Ex2(en):

Ex2(en):

Input:

输入:

Identify Chemical, Disease entities from the text: "Acute changes of blood ammonia may predict short-term adverse effects of valproic acid. Valproic acid (VPA) was given to 24 epileptic patients who were already being treated with other antiepileptic drugs. A standardized loading dose of VPA was administered, and venous blood was sampled at 0, 1, 2, 3, and 4 hours. Ammonia (NH3) was higher in patients who, during continuous therapy, complained of drowsiness (7 patients) than in those who were symptom-free (17 patients), although VPA plasma levels were similar in both groups. By measuring VPA-induced changes of blood NH3 content, it may be possible to identify patients at higher risk of obtundation when VPA is given chronically."

从文本中识别化学物质和疾病实体:
"血氨的急性变化可能预测丙戊酸(Valproic acid)的短期不良反应。研究对24名已接受其他抗癫痫药物治疗的癫痫患者给予丙戊酸(VPA)。采用标准化负荷剂量给药,并在0、1、2、3和4小时采集静脉血样。在持续治疗期间主诉嗜睡的患者(7例)血氨(NH3)水平高于无症状患者(17例),尽管两组患者的VPA血浆浓度相似。通过测量VPA引起的血氨含量变化,可能有助于识别长期服用VPA时发生意识模糊高风险的患者。"


Output:

输出:

Chemical: valproic acid; Ammonia; NH3; ammonia; VPA; Valproic acid

化学: 丙戊酸 (valproic acid); 氨; NH3; 氨; 丙戊酸 (VPA); 丙戊酸 (Valproic acid)

Disease: epileptic; drowsiness

疾病:癫痫;嗜睡
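A structured answer in the "Type: mention1; mention2" style shown above can be turned back into typed entities with a small parser. This is an illustrative sketch, not the project's official post-processing code.

```python
def parse_ner_output(text):
    """Parse lines like 'Chemical: valproic acid; NH3' into {type: [mentions]}."""
    entities = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip lines that do not follow the template
        etype, mentions = line.split(":", 1)
        entities[etype.strip()] = [m.strip() for m in mentions.split(";") if m.strip()]
    return entities

out = parse_ner_output("Chemical: valproic acid; NH3; VPA\nDisease: epileptic; drowsiness")
# -> {"Chemical": ["valproic acid", "NH3", "VPA"], "Disease": ["epileptic", "drowsiness"]}
```

Converting extraction tasks to this text format is what lets a single generative model cover NER alongside QA, at the cost of needing such a parser before computing span-level metrics.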

Task: RE

任务: RE

Ex1(zh) Input:

Ex1(zh) 输入:

实体关系三元组抽取,以“(头实体, 尾实体, 关系类型)”格式输出:

实体关系三元组抽取,以"(头实体, 尾实体, 关系类型)"格式输出:

13-三体综合征的 $30%\sim60%$ 患儿有泌尿系畸形,可见多囊肾、肾盂积水、双肾及双输尿管。关系类型标签:并发症,发病率,预后状况,鉴别诊断,同义词,相关(转化),影像学检查,外侵部位,死亡率

13-三体综合征的 $30%\sim60%$ 患儿存在泌尿系统畸形,表现为多囊肾、肾盂积水、双肾及双输尿管。关系类型标签:并发症、发病率、预后状况、鉴别诊断、同义词、相关(转化)、影像学检查、外侵部位、死亡率

Output:

输出:

(13-三体综合征, 泌尿系畸形, 并发症); (13-三体综合征, 双肾, 并发症); (13-三体综合征, 多囊肾, 并发症); (13-三体综合征, 双输尿管, 并发症); (13-三体综合征, 肾盂积水, 并发症); (泌尿系畸形, $30%\sim60%$ , 发病率)

(13-三体综合征, 泌尿系畸形, 并发症); (13-三体综合征, 双肾, 并发症); (13-三体综合征, 多囊肾, 并发症); (13-三体综合征, 双输尿管, 并发症); (13-三体综合征, 肾盂积水, 并发症); (泌尿系畸形, $30%\sim60%$ , 发病率)

Ex2(en)

Ex2(en)

Input:

输入:

Output the chemical-induced disease relations in the following text:

输出文本中的化学物质诱发疾病关系:

Phenobarbital-induced dyskinesia in a neurologic ally-impaired child. A 2-year-old child with known neurologic impairment developed a dyskinesia soon after starting phenobarbital therapy for seizures. Known causes of movement disorders were eliminated after evaluation. On repeat challenge with phenobarbital, the dyskinesia recurred. Phenobarbital should be added to the list of anticonvulsant drugs that can cause movement disorders.

苯巴比妥诱发神经系统受损患儿运动障碍。一名已知存在神经系统损伤的2岁儿童在开始使用苯巴比妥治疗癫痫后不久出现运动障碍。经评估排除了其他已知的运动障碍病因。再次使用苯巴比妥激发试验时,运动障碍复发。苯巴比妥应被列入可能引起运动障碍的抗惊厥药物清单。

Output:

输出:

chemical-induced disease relations: [Phenobarbital, dyskinesia]; [phenobarbital, dyskinesia]

化学诱导疾病关系: [Phenobarbital, dyskinesia]; [phenobarbital, dyskinesia]
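The "(head, tail, relation)" answers shown in the Chinese RE example above can be recovered as tuples with a purely lexical scan; a minimal sketch follows (both ASCII and full-width commas are accepted).

```python
import re

def parse_triples(text):
    """Extract (head, tail, relation) tuples from '(h, t, r); (h, t, r)'-style output."""
    triples = []
    for match in re.finditer(r"\(([^()]*)\)", text):
        # split on ASCII ',' or full-width ',' and trim whitespace
        parts = [p.strip() for p in re.split(r"[,,]", match.group(1))]
        if len(parts) == 3:   # ignore fragments that are not well-formed triples
            triples.append(tuple(parts))
    return triples

out = parse_triples("(13-三体综合征, 泌尿系畸形, 并发症); (泌尿系畸形, 30%~60%, 发病率)")
# -> [("13-三体综合征", "泌尿系畸形", "并发症"), ("泌尿系畸形", "30%~60%", "发病率")]
```

A parser like this (or a two-element variant for the `[chemical, disease]` pairs in the English example) is the step between the model's free-text answer and the micro-F1 scoring of extracted relations.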

Task: TC

任务: TC

Ex1(zh)

Ex1(zh)

Input:

输入:

将下面文本分类到指定的类别中:“ 9. 曾接受过用于治疗癌症的免疫细胞治疗方法,包括 CIK、DC、DC-CIK、LAK 细胞等;”

将下面文本分类到指定的类别中:"9. 曾接受过用于治疗癌症的免疫细胞治疗方法,包括 CIK、DC、DC-CIK、LAK 细胞等;"

类别标签:伦理审查, 疾病, 吸烟状况, 预期寿命, 依存性, 肿瘤进展, 受体状态, 过敏耐受, 实验室检查, 年龄, 性别,教育情况, 研究者决定, 健康群体, 知情同意, 酒精使用, 体征(医生检测), 口腔相关, 药物, 参与其它试验, 器官组织状态, 风险评估, 锻炼, 设备, 护理, 成瘾行为, 读写能力, 性取向, 症状(患者感受), 献血, 病例来源, 数据可及性, 特殊病人特征, 怀孕相关, 睡眠, 治疗或手术, 能力, 饮食, 残疾群体, 种族, 含有多类别的语句, 居住情况, 诊断, 疾病 分期

类别标签:伦理审查, 疾病, 吸烟状况, 预期寿命, 依存性, 肿瘤进展, 受体状态, 过敏耐受, 实验室检查, 年龄, 性别, 教育情况, 研究者决定, 健康群体, 知情同意, 酒精使用, 体征(医生检测), 口腔相关, 药物, 参与其它试验, 器官组织状态, 风险评估, 锻炼, 设备, 护理, 成瘾行为, 读写能力, 性取向, 症状(患者感受), 献血, 病例来源, 数据可及性, 特殊病人特征, 怀孕相关, 睡眠, 治疗或手术, 能力, 饮食, 残疾群体, 种族, 含有多类别的语句, 居住情况, 诊断, 疾病分期

Output:

输出:

上述文本被分类为:治疗或手术

上述文本被分类为:治疗或手术

Ex2(en) Input:

Ex2(en) 输入:

Classify the following text into the specified text label: "Influenza activity during the outbreak of coronavirus disease 2019 in Chinese mainland. Since coronavirus disease 2019 (COVID-19) might circulate in the following seasons, it is essential to understand how COVID-19 influences other respiratory diseases, especially influenza. In this study, we analyzed the influenza activity from mid-November 2019 to March 2020 in Chinese mainland and found that the influenza season ended much earlier than previous seasons for all subtypes and lineages, which may have resulted from the circulation of COVID-19 and measures such as travel control and personal protection. These findings provide rudimentary knowledge of the co-circulation patterns of the two types of viruses."

将以下文本分类到指定的文本标签中: "2019冠状病毒病疫情期间中国大陆的流感活动情况。由于2019冠状病毒病 (COVID-19) 可能在后续季节持续传播,了解COVID-19如何影响其他呼吸道疾病(尤其是流感)至关重要。本研究分析了2019年11月中旬至2020年3月中国大陆的流感活动情况,发现所有亚型和谱系的流感季结束时间均较往年大幅提前,这可能是由COVID-19的传播以及旅行管控和个人防护等措施所致。这些发现为理解两类病毒的共循环模式提供了基础认知。"

Text Labels: Case Report, Prevention, Transmission, Diagnosis, Mechanism, Treatment, Epidemic Forecasting Output:

文本标签:病例报告、预防、传播、诊断、机制、治疗、疫情预测

Result: Prevention

结果:预防

Task: MT

任务:机器翻译

Ex1(zh)

Ex1(zh)

Input:

输入:

将下面文本翻译成英语:"死亡的风险比(手术 vs. 非手术)为 $1.29$(95% CI,$0.97\sim1.72$;$P=0.08$),相当于中位总生存期分别为 50.6 个月和 64.7 个月。"

死亡的风险比(手术 vs. 非手术)为 $1.29$(95% CI,$0.97\sim1.72$;$P=0.08$),相当于中位总生存期分别为 50.6 个月和 64.7 个月。

Output:

输出:

The hazard ratio for death (surgery vs. no surgery) was 1.29 (95% confidence interval [CI], 0.97 to 1.72; P = 0.08), which corresponded to a median overall survival of 50.6 months and 64.7 months, respectively.

死亡风险比(手术 vs. 非手术)为1.29 (95%置信区间[ci], 0.97至1.72; p=0.08),对应的中位总生存期分别为50.6个月和64.7个月。

Ex2(en)

Ex2(en)

Input:

输入:

Machine Translation from English to Chinese: "Unfortunately, recurrent disease develops in more than $80%$ of women."

英语到中文的机器翻译:"不幸的是,超过 $80%$ 的女性会出现疾病复发。"

Output:

输出:

但遗憾的是,有 $80%$ 患者的卵巢癌会复发.

但遗憾的是,有 $80%$ 患者的卵巢癌会复发。

Table S3. Statistics of our evaluation test sets

| Task | Data | Lan | Size | Metrics |
| --- | --- | --- | --- | --- |
| NER | BC5CDR-chem | En | 500 abstracts | Micro-F1 |
| | BC5CDR-dise | En | 500 abstracts | |
| | CHEMDNER | En | 3,000 abstracts | |
| | NCBI-Disease | En | 100 abstracts | |
| | BioRED | En | 100 abstracts | |
| | CMeEE-dev | Zh | 5,000 sentences | |
| RE | BC5CDR | En | 500 abstracts | Micro-F1 |
| | CMeIE-dev | Zh | 3,585 sentences | |
| TC | BC7LitCovid | En | 6,239 abstracts | Micro-F1 |
| | HOC | En | 3,547 sentences | |
| | KUAKE_QIC-dev | Zh | 1,955 sentences | |
| QA-mc | PubMedQA | En | 500 questions | Accuracy |
| | MedQA-USMLE | En | 1,273 questions | |
| | MedQA-MCMLE | Zh | 3,426 questions | |

表 S3: 评估测试集统计

| 任务 | 数据 | 语言 | 规模 | 指标 |
| --- | --- | --- | --- | --- |
| NER | BC5CDR-chem | En | 500 篇摘要 | Micro-F1 |
| | BC5CDR-dise | En | 500 篇摘要 | |
| | CHEMDNER | En | 3,000 篇摘要 | |
| | NCBI-Disease | En | 100 篇摘要 | |
| | BioRED | En | 100 篇摘要 | |
| | CMeEE-dev | Zh | 5,000 句 | |
| RE | BC5CDR | En | 500 篇摘要 | Micro-F1 |
| | CMeIE-dev | Zh | 3,585 句 | |
| TC | BC7LitCovid | En | 6,239 篇摘要 | Micro-F1 |
| | HOC | En | 3,547 句 | |
| | KUAKE_QIC-dev | Zh | 1,955 句 | |
| QA-mc | PubMedQA | En | 500 题 | Accuracy |
| | MedQA-USMLE | En | 1,273 题 | |
| | MedQA-MCMLE | Zh | 3,426 题 | |

Examples of Taiyi output on the supplementary biomedical NLP tasks

补充生物医学NLP任务中的太乙输出示例

In addition to the four primary evaluation tasks, the capabilities of the Taiyi model are also shown through examples (Table S4) on several supplementary biomedical NLP tasks. Medical Report Generation (MRG) aims to automatically summarize information during the diagnostic process; the task requires the language model to generate medical reports in a specified format based on patient self-reports and doctor-patient dialogues. Biomedical Event Extraction (EE) extracts structured biomedical events from unstructured biomedical literature, including mining the biological entities described in the literature as well as the fine-grained relationships between biomedical events. Biomedical Machine Translation (MT) converts biomedical texts from one natural language to another; this task can assist in translating various medical questions and recommendations, disease names, symptoms, and other information between English and Chinese. Biomedical Title Generation (TG) aims to generate informative article titles that summarize input biomedical text paragraphs. Biomedical Text Semantic Similarity (SS) aims to determine the degree of association or similarity between two medical texts; it typically involves text matching, similarity measurement, and semantic understanding, and can be used in applications such as medical information retrieval and document matching. Biomedical Question Answering and Chat (CHAT) needs to understand questions posed in natural language and provide accurate answers; different from simple QA, the dialogue task is capable of simulating multi-round natural language conversations.

除四项主要评估任务外,太乙模型的能力还通过若干补充性生物医学NLP任务示例展示(见表S4),例如:

- **医疗报告生成(MRG)**:旨在自动汇总诊断过程中的信息。该任务要求大语言模型根据患者自述和医患对话生成指定格式的医疗报告。
- **生物医学事件抽取(EE)**:需从非结构化的生物医学文献中提取结构化事件,包括挖掘文献中描述的生物学实体及生物医学事件间的细粒度关系。
- **生物医学机器翻译(MT)**:将生物医学文本在自然语言间转换,可辅助中英文医学问题建议、疾病名称、症状等信息的互译。
- **生物医学标题生成(TG)**:为输入的生物医学文本段落生成具有总结性的信息型文章标题。
- **生物医学文本语义相似度(SS)**:用于判定两段医学文本的关联或相似程度,通常涉及文本匹配、相似度测量和语义理解,可应用于医学信息检索和文档匹配等场景。
- **生物医学问答与对话(CHAT)**:需理解自然语言提问并提供精准回答。与简单QA不同,该对话任务能模拟多轮自然语言交互。

Table S4. Example responses generated by Taiyi on the supplementary BioNLP tasks

Task: Chat (en)

Input (1):

Do you know colchicine?

Taiyi:

Yes, Colchicine is a medication used to treat gout, which is a type of arthritis caused by high levels of uric acid in the … caused by the uric acid crystals in the joints.

Input (2):

Does it have any potential side effects?

Taiyi:

Yes, Colchicine has potential side effects, including gastrointestinal issues such as nausea, vomiting, diarrhea, and abdominal pain. It can also cause allergic reactions such as rash, itching, hives, or difficulty breathing. In some cases, …

表 S4: 太乙在补充生物医学自然语言处理任务中的示例回答

任务: 聊天(英文)
输入 (1):
你知道秋水仙碱吗?
太乙: 是的,秋水仙碱是一种用于治疗痛风的药物,痛风是由关节中尿酸结晶引起的一种关节炎。
输入 (2):
它有什么潜在的副作用吗?
太乙:
是的,秋水仙碱有潜在的副作用,包括胃肠道问题如恶心、呕吐、腹泻和腹痛。它也可能引起过敏反应如皮疹、瘙痒、荨麻疹或呼吸困难。在某些情况下,

Task: Medical Report Generation (zh)

任务: 医疗报告生成

上述问诊对话的诊疗报告如下:

上述问诊对话的诊疗报告如下:

主诉:咳嗽咳痰伴鼻流涕。 现病史:患儿无明显诱因下出现咳嗽咳痰伴鼻流涕情况,无恶心呕吐,无其他不适,无其他疾病史,无药物过敏史。 辅助检查:无。 既往史:既往体健。 诊断:上呼吸道感染。 建议:口服氨溴索,桔贝合剂,若条件允许,雾化吸入治疗。

主诉:咳嗽咳痰伴鼻流涕。
现病史:患儿无明显诱因下出现咳嗽咳痰伴鼻流涕情况,无恶心呕吐,无其他不适,无其他疾病史,无药物过敏史。
辅助检查:无。
既往史:既往体健。
诊断:上呼吸道感染。
建议:口服氨溴索,桔贝合剂,若条件允许,雾化吸入治疗。

Task: Event Extraction (zh)

任务:事件抽取

找出指定的临床发现事件属性:因患者需期末考试,故予以口服“雷贝拉唑钠肠溶片”治疗,现腹痛情况明显好转。事件抽取说明:临床发现事件由主体词,发生状态,描述词和解剖部位组成

找出指定的临床发现事件属性:因患者需期末考试,故予以口服“雷贝拉唑钠肠溶片”治疗,现腹痛情况明显好转。事件抽取说明:临床发现事件由主体词、发生状态、描述词和解剖部位组成

Taiyi:

太乙:

上述句子中的临床发现事件如下:主体词:疼痛;发生状态:;描述词:明显好转;解剖部位:腹

上述句子中的临床发现事件如下:主体词:疼痛;发生状态:;描述词:明显好转;解剖部位:腹

Task: Event Extraction (en)

任务: 事件抽取

Input:

输入:

Task: Machine Translation (zh-en)

任务:机器翻译(中英)

Input:

输入:

将下面文本翻译成英文:

Translate the following text into English:

光动力疗法(PDT)作为一种新兴的肿瘤治疗手段,因其不良反应较少、靶向性好、可重复治疗等优点,已广泛应用于临床多种肿瘤的治疗。相比于手术、化疗及放疗等传统治疗策略,光动力疗法不仅可杀伤原位肿瘤,还可通过激活 机体的免疫效应对转移瘤发挥抑制作用。然而,PDT 诱导免疫效应的高低受多种因素影响,包括光敏剂在细胞内的定位和剂量、光参数、肿瘤内的氧浓度、免疫功能的完整性等。本文针对 PDT 介导抗肿瘤免疫效应的相关机制,以及PDT 免疫效应的主要影响因素进行综述,以探讨 PDT 用于肿瘤治疗的未来发展方向。

光动力疗法(PDT)作为一种新兴的肿瘤治疗手段,因其不良反应较少、靶向性好、可重复治疗等优点,已广泛应用于临床多种肿瘤的治疗。相比于手术、化疗及放疗等传统治疗策略,光动力疗法不仅可杀伤原位肿瘤,还可通过激活机体的免疫效应对转移瘤发挥抑制作用。然而,PDT诱导免疫效应的高低受多种因素影响,包括光敏剂在细胞内的定位和剂量、光参数、肿瘤内的氧浓度、免疫功能的完整性等。本文针对PDT介导抗肿瘤免疫效应的相关机制,以及PDT免疫效应的主要影响因素进行综述,以探讨PDT用于肿瘤治疗的未来发展方向。

Taiyi:

太乙:

As a new tumor treatment, photodynamic therapy (PDT) has been widely used in clinical practice due to its advantages of fewer side effects, high target specificity, and the possibility of repeated treatment. Compared with traditional surgery, chemotherapy, and radiotherapy, PDT can not only kill the primary tumor, but also inhibit metastatic tumors by activating the body's immune effect. However, the strength of the PDT-induced immune effect is affected by many factors, including the localization and dosage of photosensitizers in cells, light parameters, oxygen concentration in the tumor, and the integrity of immune function. This paper reviews the related mechanisms of the PDT-mediated anti-tumor immune effect and the main influencing factors, and discusses the future development direction of PDT in tumor treatment.

作为一种新型肿瘤治疗手段,光动力疗法(PDT)凭借副作用小、靶向性高、可重复治疗等优势已在临床广泛应用。与传统手术、化疗及放疗相比,PDT不仅能杀伤原发肿瘤,还能通过激活机体免疫效应抑制转移瘤。但PDT诱导免疫效应的强弱受多种因素影响,包括光敏剂定位与剂量、光照参数、肿瘤内氧浓度及免疫功能完整性等。本文综述了PDT介导抗肿瘤免疫效应的相关机制及主要影响因素,并探讨了PDT在肿瘤治疗领域的未来发展方向。

Task: Machine Translation (en-zh)

任务:机器翻译(英译中)

Input:

输入:

Translate the following text into Chinese:

翻译以下文本为中文:

Metabolic-associated fatty liver disease (MAFLD), formerly known as non-alcoholic fatty liver disease (NAFLD), is one of the most popular chronic liver diseases in the world. There are many explorations on the epidemiology and risk factors of the disease published in recent years. With the increasing burden of MAFLD on medical care and social economy, drug treatment of MAFLD has become an urgent research hotspot. Based on this, the current epidemiological status, risk factors, diagnostic criteria and treatment of MAFLD were summarized in this review in order to enlighten the clinical practice and scientific research of MAFLD.

代谢相关脂肪性肝病 (MAFLD) ,原名非酒精性脂肪性肝病 (NAFLD) ,是全球最常见的慢性肝病之一。近年来关于该病流行病学及危险因素的探索成果颇丰。随着MAFLD对医疗和社会经济负担的日益加重,其药物治疗已成为亟待攻克的研究热点。本文据此综述MAFLD的流行现状、危险因素、诊断标准及治疗进展,以期为MAFLD临床实践与科研工作提供启示。

Taiyi:

太一:

代谢综合征相关性脂肪肝(MAFLD)是世界范围内最为流行的慢性肝病之一。近年来,该病的流行病学及危险因素等方面的研究颇多。随着 MAFLD 在医疗保健及社会经济方面所造成的负担日益加重,该病的药物治疗已成为研究热点。本文对该病的流行病学现状、危险因素、诊断标准及治疗等方面进行了综述,以期为临床实践及 MAFLD 的科学研究提供参考。

代谢综合征相关性脂肪肝(MAFLD)是世界范围内最为流行的慢性肝病之一。近年来,该病的流行病学及危险因素等方面的研究颇多。随着 MAFLD 在医疗保健及社会经济方面所造成的负担日益加重,该病的药物治疗已成为研究热点。本文对该病的流行病学现状、危险因素、诊断标准及治疗等方面进行了综述,以期为临床实践及 MAFLD 的科学研究提供参考。

Task: Title Generation (zh)

任务:标题生成(中文)

Input:

输入:

请给下面摘要起标题:

请为以下摘要拟定标题:

气管食管瘘是指气管或支气管与食管之间的病理性瘘道,包括气管-食管瘘和支气管-食管瘘,临床以气管-食管瘘较多见。气管食管瘘致病原因较多,可引起严重的并发症,是对患者生活质量影响显著、治疗困难和病死率较高的疾病。 气管食管瘘目前治疗方式较多,但多数疗效欠佳,对新兴治疗手段的需求迫切。胸腹部 X 线摄影检出鼻胃管滞留是气管食管瘘诊断的金标准,其主要治疗方法包括外科手术治疗、支架置入、局部生物胶水封闭、干细胞治疗等。本文综述近年气管食管瘘诊断与治疗的主要研究进展,旨在为该病的临床诊治提供参考。

Tracheoesophageal fistula refers to a pathological tract between the trachea or bronchi and the esophagus, including tracheoesophageal fistula and bronchoesophageal fistula, with tracheoesophageal fistula being more common clinically. The causes of tracheoesophageal fistula are diverse and can lead to severe complications, significantly impacting patients' quality of life, presenting challenges in treatment, and exhibiting a high mortality rate. Currently, there are multiple treatment approaches for tracheoesophageal fistula, but most yield suboptimal outcomes, creating an urgent need for novel therapeutic strategies. The gold standard for diagnosing tracheoesophageal fistula is the detection of a retained nasogastric tube via chest and abdominal X-ray. Primary treatment methods include surgical intervention, stent placement, local biological glue sealing, and stem cell therapy. This article reviews recent advances in the diagnosis and treatment of tracheoesophageal fistula, aiming to provide references for its clinical management.

Taiyi:

太一:

气管食管瘘的诊断与治疗进展

气管食管瘘的诊断与治疗进展

Task: Title Generation (en)

任务:标题生成(英文)

Input:

输入:

Output a title for the following abstract: The incidence of diabetes mellitus has been increasing, prompting the search for non-invasive diagnostic methods. Although current methods exist, these have certain limitations, such as low reliability and accuracy, difficulty in individual patient adjustment, and discomfort during use. This paper presents a novel approach for diagnosing diabetes using high-frequency ultrasound (HFU) and a convolutional neural network (CNN). This method is based on the observation that glucose in red blood cells (RBCs) forms glycated hemoglobin (HbA1c) and accumulates on its surface. The study incubated RBCs with different glucose concentrations, collected acoustic reflection signals from them using a custom-designed 90-MHz transducer, and analyzed the signals using a CNN. The CNN was applied to the frequency spectra and spectrograms of the signal to identify correlations between changes in RBC properties owing to glucose concentration and signal features. The results confirmed the efficacy of the CNN-based approach with a classification accuracy of 0.98. This non-invasive diagnostic technology using HFU and CNN holds promise for in vivo diagnosis without the need for blood collection.

为以下摘要生成标题:糖尿病的发病率不断上升,促使人们寻找无创诊断方法。尽管现有方法可用,但其存在一定局限性,如可靠性和准确性较低、难以针对患者个体进行调整以及使用过程中的不适感。本文提出了一种利用高频超声(HFU)和卷积神经网络(CNN)诊断糖尿病的新方法。该方法基于如下观察:红细胞(RBC)中的葡萄糖会形成糖化血红蛋白(HbA1c)并在红细胞表面积累。研究将红细胞与不同浓度的葡萄糖共同孵育,使用定制设计的90 MHz换能器采集其声反射信号,并利用CNN对信号进行分析。CNN被应用于信号的频谱和声谱图,以识别葡萄糖浓度引起的红细胞特性变化与信号特征之间的相关性。结果证实了基于CNN方法的有效性,分类准确率达0.98。这种利用HFU和CNN的无创诊断技术有望实现无需采血的在体诊断。

Taiyi:

太一:

Application of high-frequency ultrasound and convolutional neural network in non-invasive diagnosis of diabetes

高频超声与卷积神经网络在糖尿病无创诊断中的应用

Task: Text Classification (zh)

任务:文本分类(中文)

Input:

输入:

Taiyi: 上述文本被分类为:医疗费用

Taiyi: 上述文本被分类为:医疗费用

Task: Text Classification (en)

任务:文本分类(英文)

Input:

输入:

Task: Text Semantic Similarity (zh)

任务:文本语义相似度(中文)

Input:

输入:

Taiyi: 语义不相同

Taiyi: 语义不相同

Task: Text Semantic Similarity (en)

任务:文本语义相似度(英文)

Input:

输入:

Case study

案例研究

To further investigate the benefits of in-domain instruction fine-tuning, we qualitatively analyzed model outputs on various BioNLP tasks and compared Taiyi to Qwen7B-Chat, a published LLM fine-tuned with general-domain instruction data on the same base model, Qwen7B. Table S5 shows some examples.

为了进一步探究领域内指令微调的益处,我们定性分析了模型在多种生物自然语言处理任务上的输出,并将Taiyi与基于相同基础模型Qwen7B、使用通用领域指令数据微调的已发布大语言模型Qwen7B-Chat进行了对比。表S5展示了一些示例。

Task: NER (zh)

任务:命名实体识别(中文)

Input: 从以下文本中识别出指定的实体类型:

输入:从以下文本中识别出指定的实体类型:

Task: NER (en)

任务:命名实体识别(英文)

Input:

输入:

Identify Chemical, Disease entities in the following text:

识别以下文本中的化学物质和疾病实体:

Phenobarbital-induced dyskinesia in a neurologically-impaired child. A 2-year-old child with known neurologic impairment developed a dyskinesia soon after starting phenobarbital therapy for seizures. Known causes of movement disorders were eliminated after evaluation. On repeat challenge with phenobarbital, the dyskinesia recurred. Phenobarbital should be added to the list of anticonvulsant drugs that can cause movement disorders.

苯巴比妥诱发神经系统受损儿童的运动障碍。
一名2岁已知神经系统受损的儿童在开始使用苯巴比妥治疗癫痫后不久出现运动障碍。经评估排除了其他已知运动障碍病因。再次使用苯巴比妥激发试验时,运动障碍复发。苯巴比妥应被列入可能引起运动障碍的抗惊厥药物清单。

Taiyi:

太一:

Chemical: Phenobarbital; phenobarbital
Disease: dyskinesia; neurologically-impaired; neurologic impairment; seizures; movement disorders

化学: 苯巴比妥 (Phenobarbital)
疾病: 运动障碍; 神经功能缺损; 神经损伤; 癫痫发作; 运动障碍

Qwen-Chat:

Qwen-Chat:

Chemical: Phenobarbital, Anticonvulsant drugs
Disease entities: Neurological impairment, Seizures, Movement disorders

化学物质:苯巴比妥、抗惊厥药物
疾病实体:神经功能损伤、癫痫发作、运动障碍

Task: RE (zh)

任务:关系抽取(中文)

Input:

输入:

Taiyi:

太一:

具有药物治疗关系的实体对如下:[疟疾感染, 氯喹]; [疟疾感染, 羟氯喹]

具有药物治疗关系的实体对如下: [疟疾感染, 氯喹]; [疟疾感染, 羟氯喹]

Task: RE (en)

任务:关系抽取(英文)

Input:

输入:

From the results, Taiyi generates more detailed and accurate responses to biomedical questions than the general-domain model Qwen-Chat. In the Chinese NER case, Qwen-Chat extracts entities incorrectly, while Taiyi precisely identifies the disease and medical-procedure entities. Similarly, Taiyi identifies more correct chemical entities than Qwen-Chat in the English NER case. In the final relation extraction case, Taiyi properly extracts complex biomedical relations between proteins, diseases, and chemicals, which Qwen-Chat fails to capture; a model that has not been fine-tuned with domain task instructions struggles to understand the biomedical relation extraction instructions. These examples highlight the advantages of domain-specific instruction fine-tuning for enhancing Taiyi's performance on diverse BioNLP tasks compared to a model fine-tuned with general instructions only. More examples on different BioNLP tasks can be found in the Supplementary Material.

从结果来看,与通用领域模型Qwen-Chat相比,Taiyi对生物医学问题生成的回答更为详细准确。在中文命名实体识别(NER)案例中,Qwen-Chat错误地提取了实体,而Taiyi精准识别了疾病和医疗程序实体。同样地,在英文NER案例中,Taiyi比Qwen-Chat识别出更多正确的化学实体。在最后的关系抽取案例中,Taiyi正确提取了蛋白质、疾病与化学物质之间复杂的生物医学关系,而Qwen-Chat无法捕捉这些关联。未经领域任务指令微调的模型难以理解生物医学关系抽取指令。这些案例凸显了相比通用指令微调模型,领域特定指令微调对提升Taiyi在多样化生物自然语言处理(BioNLP)任务性能的优势。更多不同BioNLP任务的案例可参阅补充材料。
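The qualitative comparison above can also be made quantitative. As a minimal sketch (not part of the paper), the code below parses a Taiyi-style flat NER response such as "Chemical: a; b Disease: c" into per-type entity sets and scores it against gold annotations with micro F1; the response format, helper names, and scoring choice are illustrative assumptions only.

```python
# Illustrative sketch: parse a flat "Type: a; b Type2: c" NER response
# (as in the Taiyi examples above) and score it against gold entities.
# Format and helper names are assumptions for demonstration, not the
# paper's evaluation code.
import re

def parse_ner_response(text, types):
    """Split a flat 'Type: a; b Type2: c' string into {type: set(entities)}."""
    # Locate each known type label, then slice the text between labels.
    spans = sorted((m.start(), t) for t in types
                   for m in re.finditer(rf"{re.escape(t)}\s*:", text))
    result = {t: set() for t in types}
    for i, (start, t) in enumerate(spans):
        end = spans[i + 1][0] if i + 1 < len(spans) else len(text)
        chunk = text[start:end].split(":", 1)[1]
        result[t] |= {e.strip() for e in chunk.split(";") if e.strip()}
    return result

def micro_f1(pred, gold):
    """Micro F1 over (type, entity) pairs."""
    p = {(t, e) for t, es in pred.items() for e in es}
    g = {(t, e) for t, es in gold.items() for e in es}
    tp = len(p & g)
    if not p or not g or not tp:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)

pred = parse_ner_response(
    "Chemical: Phenobarbital; phenobarbital "
    "Disease: dyskinesia; seizures; movement disorders",
    ["Chemical", "Disease"])
gold = {"Chemical": {"Phenobarbital", "phenobarbital"},
        "Disease": {"dyskinesia", "seizures", "movement disorders"}}
print(round(micro_f1(pred, gold), 2))  # exact match here -> 1.0
```

Scoring on (type, entity) pairs means a correct surface form under the wrong type counts as an error, which matches how the NER examples above are judged.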

REFERENCES

参考文献
