[论文翻译]Taiyi: 面向多样化生物医学任务的双语精调大语言模型


原文地址:https://arxiv.org/pdf/2311.11608


Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks

Taiyi: 面向多样化生物医学任务的双语精调大语言模型

Ling Luo* , Jinzhong Ning, Yingwen Zhao, Zhijun Wang, Zeyuan Ding, Peng Chen, Weiru Fu, Qinyu Han, Guangtao Xu, Yunzhi Qiu, Dinghao Pan, Jiru Li, Hao Li, Wenduo Feng, Senbo Tu, Yuqi Liu, Zhihao Yang, Jian Wang, Yuanyuan Sun, Hongfei Lin

School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China * Corresponding author: lingluo@dlut.edu.cn

大连理工大学计算机科学与技术学院,大连 116024

  • 通讯作者:lingluo@dlut.edu.cn

Abstract

摘要

Objective: Most existing fine-tuned biomedical large language models (LLMs) focus on enhancing performance in monolingual biomedical question answering and conversation tasks. To investigate the effectiveness of fine-tuned LLMs on diverse biomedical NLP tasks in different languages, we present Taiyi, a bilingual fine-tuned LLM for diverse biomedical tasks.

目标:现有大多数经过微调的生物医学大语言模型主要关注提升单语种生物医学问答和对话任务的性能。为探究微调后大语言模型在不同语言的多样化生物医学自然语言处理任务中的有效性,我们提出了双语生物医学任务微调模型Taiyi。

Materials and Methods: We first curated a comprehensive collection of 140 existing biomedical text mining datasets (102 English and 38 Chinese datasets) across over 10 task types. Subsequently, a two-stage strategy is proposed for supervised fine-tuning to optimize the model performance across varied tasks.

材料与方法:我们首先整理了140个现有生物医学文本挖掘数据集(102个英文和38个中文数据集),涵盖10余种任务类型。随后,提出了一种两阶段监督微调策略,以优化模型在不同任务中的性能。

Results: Experimental results on 13 test sets covering named entity recognition, relation extraction, text classification, and question answering tasks demonstrate that Taiyi achieves superior performance compared to general LLMs. The case study involving additional biomedical NLP tasks further shows Taiyi's considerable potential for bilingual biomedical multi-tasking.

结果:在涵盖命名实体识别、关系抽取、文本分类、问答任务的13个测试集上的实验结果表明,Taiyi相比通用大语言模型展现出更优性能。针对生物医学NLP任务的案例研究进一步揭示了Taiyi在双语生物医学多任务处理方面的巨大潜力。

Conclusion: Leveraging rich high-quality biomedical corpora and developing effective fine-tuning strategies can significantly improve the performance of LLMs within the biomedical domain. Taiyi shows bilingual multi-tasking capability through supervised fine-tuning. However, tasks such as information extraction, which are not generation tasks in nature, remain challenging for LLM-based generative approaches, and they still underperform the conventional discriminative approaches of smaller language models.

结论:利用丰富的高质量生物医学语料库并开发有效的微调策略,可以显著提升大语言模型在生物医学领域的性能。Taiyi通过监督微调展现了双语多任务处理能力。然而,对于信息抽取等本质上非生成式的任务,基于大语言模型的生成式方法仍面临挑战,其表现仍逊于传统小语言模型的判别式方法。

Key words: natural language processing; large language model; supervised fine-tuning; biomedical multi-tasking.

关键词: 自然语言处理; 大语言模型; 监督微调; 生物医学多任务

INTRODUCTION

引言

Recently, the release of ChatGPT [1] and the subsequent launch of GPT-4 [2] have received widespread attention around the world, which promotes the development of large language models (LLMs) that have billions of parameters and are trained on hundreds of billions of tokens. These LLMs (such as GPT-4, PaLM [3], LLaMA [4], and GLM [5]) have shown promising results and achieved new state-of-the-art performance in various natural language processing (NLP) tasks. Compared with previous pre-trained language models (such as BERT [6]), LLMs exhibit enhanced text generation and comprehension abilities. Moreover, their few-shot/zero-shot learning and generalization abilities help address unseen or complicated tasks. Due to these advantages, LLMs have been explored for application in various fields, such as law [7], education [8], finance [9], and biomedicine and health [10].

近来,ChatGPT[1]的发布及随后推出的GPT-4[2]引发了全球广泛关注,推动了大语言模型(LLMs)的发展——这些模型具有数十亿参数,并接受了数千亿token的训练。以GPT-4、PaLM[3]、LLaMA[4]和GLM[5]为代表的LLMs在各类自然语言处理(NLP)任务中展现出卓越性能,创造了新的技术标杆。与此前的预训练语言模型(如BERT[6])相比,LLMs显著提升了文本生成与理解能力,其少样本/零样本学习及泛化能力更能应对未知复杂任务。凭借这些优势,LLMs已在法律[7]、教育[8]、金融[9]、生物医药与健康[10]等领域展开应用探索。

LLMs in the biomedical and healthcare domain are essential and hold great potential for handling the scale and complexity of biomedical data, as well as for providing more personalized and empathetic medical care, ultimately advancing the quality and effectiveness of healthcare practices. However, biomedical text exhibits some distinct features compared to text in the general domain, such as complex terminologies, ambiguous abbreviations, more complex syntax, and less common vocabulary. Owing to these domain characteristics, most existing LLMs trained on general-domain data encounter difficulties when they are applied to the biomedical domain [11]. Therefore, some domain-specific LLMs have been developed to facilitate the development of NLP methods for biomedical applications. For example, Google’s team recently developed Med-PaLM2 [12], a fine-tuned LLM using medical data based on PaLM [3], which has achieved high performance on medical QA datasets with an accuracy of over 80%. Apart from the non-open biomedical LLMs, there have been some efforts to develop open-source biomedical LLMs [13-20]. Most of the models are derived from open-source general LLMs (such as LLaMA [4], BLOOMZ [21], and GLM [5]) and are fine-tuned with monolingual (e.g., English or Chinese) question answering (QA) and conversation data. They are primarily directed toward enhancing performance in biomedical QA and conversation-oriented tasks. However, the efficacy and suitability of LLMs on diverse biomedical NLP tasks in different languages remain unclear and warrant further investigation.

生物医学和医疗健康领域的大语言模型对于处理生物医学数据的规模和复杂性、提供更具个性化和同理心的医疗服务至关重要,且具备巨大潜力,最终将提升医疗实践的质量与效率。然而,与通用领域文本相比,生物医学文本具有术语复杂、缩写歧义、句法结构更繁琐、词汇更生僻等显著特征。由于这些领域特性,大多数基于通用领域数据训练的大语言模型在应用于生物医学领域时会遇到困难[11]。为此,研究者开发了若干领域专用大语言模型以推动生物医学NLP方法的发展。例如,Google团队近期基于PaLM[3]利用医学数据微调出Med-PaLM2[12],该模型在医学问答数据集上实现了超过80%的准确率。除未开源的生物医学大语言模型外,学界也涌现了一批开源模型[13-20]。这些模型大多衍生自开源通用大语言模型(如LLaMA[4]、BLOOMZ[21]和GLM[5]),并通过单语种(如英语或中文)问答和对话数据进行微调,主要致力于提升生物医学问答和对话任务的性能。但这些大语言模型在不同语言的多样化生物医学NLP任务中的有效性与适用性仍不明确,有待进一步探究。

To address these problems, we present Taiyi, a bilingual (English and Chinese) fine-tuned large language model for diverse biomedical tasks. First, a comprehensive collection including 102 English and 38 Chinese datasets is assembled, covering over 10 biomedical task types. To facilitate task-specific requirements and enable consistent formatting across all datasets, standardized data schemas are designed and universally applied during dataset curation. Then, in the supervised fine-tuning (SFT) phase, we propose a two-stage fine-tuning strategy. In contrast to simple single-stage fine-tuning, this strategy significantly optimizes model performance across a diversity of tasks. Finally, the evaluation of Taiyi is conducted on 13 biomedical NLP test sets. Experimental results demonstrate the promising potential of Taiyi in bilingual multi-task learning. It achieves superior performance on biomedicine-specific tasks compared to general LLMs. However, there remains room for improvement compared to current state-of-the-art models specialized for individual tasks.

为解决这些问题,我们推出了Taiyi——一个针对多样化生物医学任务进行双语(英语和中文)微调的大语言模型。首先,我们整合了包含102个英文和38个中文数据集的综合资源库,覆盖超过10种生物医学任务类型。为满足任务特定需求并确保所有数据集格式统一,在数据整理过程中设计并统一应用了标准化数据架构。随后,在有监督微调(SFT)阶段,我们提出了两阶段微调策略。与简单的单阶段微调相比,该策略显著优化了模型在多样化任务中的表现。最终,我们在13个生物医学自然语言处理测试集上对Taiyi进行评估。实验结果表明,Taiyi在双语多任务学习中展现出显著潜力。相较于通用大语言模型,其在生物医学专项任务中表现更优,但与当前单项任务专用最先进模型相比仍存在提升空间。

RELATED WORK

相关工作

Recent advancements in LLMs, such as GPT-4 [2] and PaLM [3], have attracted considerable attention due to their instruction-following ability and human-like responses. Subsequently, researchers have attempted to replicate the GPT series by developing open-source foundation models such as LLaMA [4], Bloom [22], Falcon [23], GLM [5], and Qwen [24]. These LLMs present strong performances on various NLP tasks, including zero- and few-shot learning scenarios. The promising capabilities of LLMs have sparked interest and potential applications in various fields, particularly in the biomedical domain.

近年来,大语言模型(LLM)如GPT-4 [2]和PaLM [3]因其指令跟随能力和类人响应生成能力引发了广泛关注。随后,研究者们尝试复现GPT系列模型,开发出LLaMA [4]、Bloom [22]、Falcon [23]、GLM [5]和Qwen [24]等开源基础模型。这些大语言模型在各种自然语言处理(NLP)任务中展现出强大性能,包括零样本和少样本学习场景。大语言模型的卓越能力激发了多个领域的应用兴趣,尤其在生物医学领域展现出巨大潜力。

When applied to specific domains like biomedicine, large-scale models often perform sub-optimally. To improve the performance of the models on biomedical tasks, there have been some efforts in training LLMs specifically for the biomedical domain. For example, MedAlpaca [14] builds upon medical data to fine-tune Stanford Alpaca for applications related to medical question answering and dialogue. ChatDoctor [19] is designed to simulate a conversation between a doctor and a patient, fine-tuning LLaMA with medical literature. Additionally, Med-PaLM [25] has shown promising performance on the MedQA exam based on clinical corpora and human feedback. Meanwhile, aiming at the Chinese medical domain, Chinese LLMs such as BenTsao [16], DoctorGLM [17], and HuatuoGPT [20] have been developed on Chinese medical dialogue data. More recently, Zhongjing [18] and ChiMed-GPT [26] adopted a full training pipeline from pre-training and SFT to reinforcement learning with human feedback. An overview of existing LLMs in the biomedical domain can be found in Supplementary Table S1. Most existing open-source LLMs focus on fine-tuning with monolingual QA and conversation data. Different from the above monolingual LLMs, our Taiyi embarks on a bilingual (English and Chinese) biomedical large language model, aiming to explore the capabilities of large models in handling a variety of bilingual NLP tasks in the biomedical field.


当应用于生物医学等特定领域时,大规模模型往往表现欠佳。为提高大语言模型在生物医学任务中的性能,已有一些针对该领域专门训练大语言模型的尝试。例如,MedAlpaca [14] 基于医学数据对 Stanford Alpaca 进行微调,用于医疗问答和对话相关应用。ChatDoctor [19] 旨在模拟医患对话,通过医学文献对 LLaMA 进行微调。此外,Med-PaLM [25] 在基于临床语料库和人类反馈的 MedQA 考试中展现出优异表现。同时,针对中文医疗领域,基于中文医疗对话数据开发了诸如 BenTsao [16]、DoctorGLM [17] 和 HuatuoGPT [20] 等中文大语言模型。近期推出的 Zhongjing [18] 和 ChiMed-GPT [26] 则采用了从预训练、监督微调(SFT)到基于人类反馈的强化学习的完整训练流程。现有生物医学领域大语言模型的概述可参见补充材料表 S1。现有大多数开源大语言模型主要集中于单语言问答和对话数据的微调。与上述单语言大语言模型不同,我们的 Taiyi 致力于开发双语(中英文)生物医学大语言模型,旨在探索大模型处理生物医学领域多种双语自然语言处理任务的能力。

MATERIALS AND METHODS

材料与方法


Figure 1. Overview of the Taiyi framework. Note that the English in brackets is not part of the input and output; it is the translation of the Chinese.


图 1: Taiyi框架概览。注意括号中的英文并非输入输出内容,而是中文的翻译。

Figure 1 shows the overview of our framework. We first collected and curated a set of existing bilingual (Chinese and English) biomedical natural language processing (BioNLP) corpora. Then these corpora were converted to the instruction data used to fine-tune the general LLM. In the supervised fine-tuning phase, we propose a two-stage supervised instruction fine-tuning strategy. Finally, our Taiyi model can be applied to various BioNLP tasks and it is evaluated on 13 test sets covering 4 BioNLP task types.

图 1 展示了我们框架的概览。我们首先收集并整理了一组现有的双语(中英文)生物医学自然语言处理(BioNLP)语料库。随后将这些语料库转换为用于微调通用大语言模型的指令数据。在有监督微调阶段,我们提出了一种两阶段有监督指令微调策略。最终,我们的Taiyi模型可应用于各类BioNLP任务,并在涵盖4种BioNLP任务类型的13个测试集上进行了评估。

Training data

训练数据

Dataset collection

数据集收集

This study aims to explore the capabilities of LLMs in handling a variety of bilingual BioNLP tasks by supervised fine-tuning. Therefore, we focus on manually annotated English and Chinese biomedical corpora. To fully utilize available BioNLP resources, we make efforts to aggregate comprehensive sets of open-source datasets in both English and Chinese. The data collection is primarily from two sources: existing English/Chinese BioNLP shared task datasets and the training data used for existing biomedical LLMs.

本研究旨在通过监督微调探索大语言模型(LLM)处理多种双语生物医学自然语言处理(BioNLP)任务的能力。为此,我们重点关注人工标注的英文和中文生物医学语料库。为充分利用现有BioNLP资源,我们致力于整合全面的开源英文和中文数据集。数据收集主要来自两个渠道:现有英文/中文BioNLP共享任务数据集,以及现有生物医学大语言模型使用的训练数据。


Figure 2. Overview of the dataset collection. In the tree map (left), the entire graph is represented by a large rectangle, representing the 140 bilingual open-source datasets collected. This large rectangle is divided into smaller rectangles of various colors, with each small rectangle representing a specific BioNLP task and listing the names of all datasets related to that task. In the scatter plot (top right), each data point represents a BioNLP task, with its size determined by the number of datasets associated with that task. The bar chart (bottom right) presents the scale of related datasets for different tasks, further illustrating the number of datasets for each task.

图 2: 数据集收集概览。在矩形树图(左)中,整个图形由一个大矩形表示,代表收集的140个双语开源数据集。这个大矩形被划分为多个不同颜色的小矩形,每个小矩形代表特定的BioNLP任务,并列出了与该任务相关的所有数据集名称。散点图(右上)中每个数据点代表一个BioNLP任务,其大小由该任务关联的数据集数量决定。条形图(右下)展示了不同任务相关数据集的规模,进一步呈现了每个任务的数据集数量。

Our data collection significantly benefited from two major previous efforts in aggregating biomedical text mining datasets – BigBio [27] and CBLUE [28]. BigBio aggregates a large collection of English BioNLP datasets, while CBLUE assembles a wide range of Chinese biomedical natural language understanding datasets. In addition, we also collected some other relevant BioNLP datasets that are not included in BigBio and CBLUE. Finally, we have successfully assembled a total of 140 biomedical datasets. All datasets are categorized into 15 different BioNLP task types as shown in Figure 2: Named Entity Recognition/Normalization (NER/NEN), Relation Extraction (RE), Causal Relation Extraction (CRE), Event Extraction (EE), Coreference Resolution (COREF), Text Classification (TC), Question Answering-Multiple Choice (QA-mc), Question Answering-Simple Answer Questions (QA-sqa), Question Answering-Context-based Answer Questions (QA-cqa), Multi-Round Dialogue (MRD), Machine Translation (MT), Text Pairs-Semantic Similarity (TP-ss), Text Pairs-Textual Entailment (TP-te), Text to Text/Struct-Document Summarization (TT-ds) and Text/Struct-Text to Struct (TT-ts). Among these datasets, there are 38 Chinese datasets covering 10 different BioNLP tasks, and 102 English datasets spanning 12 BioNLP tasks.

我们的数据收集工作极大地受益于此前两项生物医学文本挖掘数据集整合的重要成果——BigBio [27]和CBLUE [28]。BigBio汇总了大量英文生物医学自然语言处理数据集,而CBLUE数据集则整合了广泛的中文生物医学自然语言理解数据集。此外,我们还收集了未包含在BigBio和CBLUE中的其他相关生物医学自然语言处理数据集。最终,我们成功整合了总计140个生物医学数据集。所有数据集按15种不同的生物医学自然语言处理任务类型分类,如图2所示:命名实体识别/标准化(NER/NEN)、关系抽取(RE)、因果关系抽取(CRE)、事件抽取(EE)、共指消解(COREF)、文本分类(TC)、选择题问答(QA-mc)、简答题问答(QA-sqa)、基于上下文的问答(QA-cqa)、多轮对话(MRD)、机器翻译(MT)、文本对-语义相似性(TP-ss)、文本对-文本蕴含(TP-te)、文本到文本/结构化-文档摘要(TT-ds)以及文本/结构化-文本到结构化(TT-ts)。这些数据集中包含38个中文数据集,覆盖10种不同的生物医学自然语言处理任务,以及102个英文数据集,涵盖12种生物医学自然语言处理任务。

For each dataset, we collated key metadata including task types, data size, task descriptions, and links to the dataset and paper. This metadata facilitates full understanding and proper usage of each corpus. The collection of bilingual datasets across diverse biomedical language processing tasks facilitates the comprehensive evaluation and development of multilingual BioNLP models. Details can be accessed at the following link: https://github.com/DUTIR-BioNLP/TaiyiLLM/blob/main/data_file/dataset_inf.md.

针对每个数据集,我们整理了关键元数据,包括任务类型、数据规模、任务描述以及数据集与论文的链接。这些元数据有助于全面理解并正确使用每个语料库。跨多种生物医学语言处理任务的双语数据集集合,为多语言BioNLP模型的全面评估与开发提供了便利。具体细节可通过以下链接访问:https://github.com/DUTIR-BioNLP/TaiyiLLM/blob/main/data_file/dataset_inf.md。

Task schema harmonization

任务模式统一化

In the original datasets, even for the same BioNLP task, there are many different data formats due to different sources and developers. For example, annotation formats include the BioC, CoNLL, and PubTator formats across NER datasets. Such inconsistencies introduce challenges for developing systems that can leverage diverse annotated datasets. Therefore, establishing a universal data format enables interoperability and is beneficial for cross-dataset integration. To be consistent with previous efforts, we extended the BigBio schema to support all tasks in our collection. The complete unified task schema can be found at https://github.com/DUTIR-BioNLP/TaiyiLLM/blob/main/data_file/Task_schemas_en.md.

在原始数据集中,即使是相同的BioNLP任务,由于来源和开发者不同也存在多种数据格式。例如,NER数据集中的标注格式包含BioC、CoNLL和PubTator等多种形式。这种不一致性给开发能利用多样化标注数据的系统带来了挑战。因此,建立通用数据格式有助于实现互操作性,并促进跨数据集整合。为与先前工作保持一致,我们扩展了BigBio模式以支持我们收集的所有任务。完整的统一任务架构详见https://github.com/DUTIR-BioNLP/TaiyiLLM/blob/main/data_file/Task_schemas_en.md。
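As a concrete illustration of such a unified format, the sketch below shows a BigBio-style record for one NER document; the field names and the validation helper are our assumptions for illustration, not the authors' exact schema:

```python
# A minimal BigBio-style unified record for one NER document.
# Field names are illustrative, not the exact schema used in the paper.
ner_record = {
    "id": "doc_0001",
    "passages": [
        {"id": "doc_0001_title", "type": "title",
         "text": "Famotidine-associated delirium.", "offsets": [[0, 31]]}
    ],
    "entities": [
        {"id": "doc_0001_e1", "type": "Chemical",
         "text": ["Famotidine"], "offsets": [[0, 10]]},
        {"id": "doc_0001_e2", "type": "Disease",
         "text": ["delirium"], "offsets": [[22, 30]]},
    ],
}

def validate(record):
    """Check that every entity's character offsets match its surface text."""
    full_text = " ".join(p["text"] for p in record["passages"])
    for ent in record["entities"]:
        for (start, end), surface in zip(ent["offsets"], ent["text"]):
            assert full_text[start:end] == surface, (surface, full_text[start:end])
    return True

validate(ner_record)
```

Storing every corpus in one such structure lets a single conversion script generate instruction data for all datasets, regardless of whether they originally shipped in BioC, CoNLL, or PubTator format.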

Instruction-tuning data construction

Instruction-tuning数据构建

To enable the model to understand task instructions for performing multi-tasking, we constructed instruction data for fine-tuning, covering the tasks described in the above section. Moreover, we also added MOSS data (i.e., Moss-003-sft-data) [29] into the training data to enhance the general conversation and harmlessness abilities of Taiyi. Moss-003-sft-data is well-aligned with the real-world distribution of user intents, covering finer-grained categories and more diverse harmlessness-related data. Previous studies [30, 31] have demonstrated that the quality of the training data plays a critical role in the performance of downstream tasks. Inferior-quality data adversely affects the performance of LLMs. Therefore, we manually analyzed the datasets and selected the high-quality ones. We filtered duplicated training data and removed training documents that also appear in the test sets, so that model performance can be evaluated accurately. The statistics of the final data used for instruction-tuning are presented in Table 1.

为了使模型能够理解执行多任务的任务指令,我们构建了用于微调的指令数据,涵盖上述章节描述的任务。此外,我们还向训练数据中添加了MOSS数据(即Moss-003-sft-data)[29],以增强太乙的通用对话和无害化能力。Moss-003-sft-data与现实世界用户意图分布高度吻合,涵盖更细粒度的类别和更多样化的无害化相关数据。先前研究[30, 31]表明,训练数据的质量对下游任务性能起着关键作用。低质量数据会对大语言模型的性能产生不利影响。因此,我们手动分析了数据集,随后筛选出高质量数据集。我们过滤了重复的训练数据,并剔除了训练集中与测试集重叠的文档,以准确评估模型性能。最终用于指令微调的数据统计如表 1 所示。
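The deduplication and test-set leakage filtering described above can be sketched as follows; the normalization step is an assumption, and the actual pipeline may match documents differently:

```python
def filter_training_data(train_docs, test_docs):
    """Drop exact duplicates within the training set and any training
    document whose text also appears in a test set."""
    def norm(text):
        # Simple whitespace/case normalization for matching;
        # a real pipeline might normalize more aggressively.
        return " ".join(text.lower().split())

    test_texts = {norm(d) for d in test_docs}
    seen, kept = set(), []
    for doc in train_docs:
        key = norm(doc)
        if key in seen or key in test_texts:
            continue  # duplicate or test-set leakage
        seen.add(key)
        kept.append(doc)
    return kept

train = ["Aspirin treats pain.", "aspirin  treats pain.", "BRCA1 is a gene."]
test = ["BRCA1 is a gene."]
print(filter_training_data(train, test))  # → ['Aspirin treats pain.']
```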

Table 1. Statistics of our final instruction-tuning data

| Task Type | English data size | Chinese data size |
| --- | --- | --- |
| Named Entity Recognition | 28,603 | 44,667 |
| Relation Extraction | 17,279 | 26,606 |
| Event Extraction | 2,022 | 2,992 |
| Text Classification | 40,339 | 37,624 |
| Text Pair Task | 11,237 | 45,548 |
| Machine Translation | 74,113 | – |
| Biomedical Question Answering | 57,962 | 129,562 |
| Biomedical Multi-Round Dialogue | 10,000 | 16,391 |
| General Dialogue Data | 560,000 | – |
| Other Additional Tasks | 9,370 | – |
| Total | 1,114,315 | |

表 1: 最终指令微调数据统计

| 任务类型 | 英文数据量 | 中文数据量 |
| --- | --- | --- |
| 命名实体识别 | 28,603 | 44,667 |
| 关系抽取 | 17,279 | 26,606 |
| 事件抽取 | 2,022 | 2,992 |
| 文本分类 | 40,339 | 37,624 |
| 文本对任务 | 11,237 | 45,548 |
| 机器翻译 | 74,113 | – |
| 生物医学问答 | 57,962 | 129,562 |
| 生物医学多轮对话 | 10,000 | 16,391 |
| 通用对话数据 | 560,000 | – |
| 其他附加任务 | 9,370 | – |
| 总计 | 1,114,315 | |

To construct the instruction-tuning data, we design instructional templates for each task. For the QA and dialogue tasks, original questions are used as the model input and answers are used as the output. For other tasks, approximately 15 instruction templates were manually created for each task in English and Chinese, respectively. Some examples of the instruction data can be found in Supplementary Table S2.

为构建指令微调数据,我们为每项任务设计了指令模板。对于问答和对话任务,原始问题直接作为模型输入,答案作为输出。其他任务则分别用英语和中文人工创建了约15个指令模板。部分指令数据示例可参见补充表S2。

Furthermore, some complex NER and RE tasks are divided into multiple subtasks to reduce difficulty and increase diversity. For instance, the BC5CDR [32] task of chemical and disease entity recognition is separated into the chemical entity recognition and disease entity recognition subtasks. The original complex task is retained while these additional subtasks are added to the instruction-tuning data.

此外,一些复杂的命名实体识别(NER)和关系抽取(RE)任务被拆分为多个子任务以降低难度并增加多样性。例如,BC5CDR [32]的化学与疾病实体识别任务被拆分为化学实体识别和疾病实体识别两个子任务。在指令调优数据中既保留原始复杂任务,又新增了这些子任务。
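Putting the pieces together, the following sketch turns one annotated document into an instruction-tuning example and illustrates the subtask splitting (e.g., BC5CDR chemical-only recognition); the template wording is invented for illustration, since the paper's roughly 15 templates per task are not reproduced here:

```python
import json

# One illustrative instruction template (the authors wrote ~15 per task
# per language; this particular wording is an assumption).
NER_TEMPLATE = ("Please identify all {entity_type} entities in the "
                "following biomedical text and list them separated by commas.\n"
                "Text: {text}")

def build_ner_instance(text, entities, entity_type):
    """Convert an annotated document into an (instruction, output) pair,
    keeping only entities of the requested type (subtask splitting)."""
    mentions = [e["text"] for e in entities if e["type"] == entity_type]
    return {
        "instruction": NER_TEMPLATE.format(entity_type=entity_type, text=text),
        "output": ", ".join(mentions) if mentions else "No entities found.",
    }

doc = "Famotidine-associated delirium."
ents = [{"text": "Famotidine", "type": "chemical"},
        {"text": "delirium", "type": "disease"}]

# The full chemical+disease task and the chemical-only subtask can both be
# generated from the same record.
chem_instance = build_ner_instance(doc, ents, "chemical")
print(json.dumps(chem_instance, indent=2))
```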

Model training

模型训练

Base model

基础模型

Recently, Alibaba Cloud's pre-trained LLM Qwen series [24] have been developed and shown promising results in various NLP tasks. We chose the Qwen-7B-base version for supervised instruction fine-tuning. Qwen-7B is a Transformer-based pre-trained language model that obtains generalized language understanding capabilities through self-supervised learning on large-scale high-quality multilingual pre-training corpora. Compared to models like BERT, the training data coverage of Qwen-7B is more extensive, including web texts, academic books, code, and other resources. It has approximately 7 billion parameters and a vocabulary size of around 150,000.

近期,阿里云的预训练大语言模型Qwen系列[24]取得显著进展,在多种NLP任务中展现出优异性能。我们选用Qwen-7B-base版本进行监督式指令微调。Qwen-7B是基于Transformer架构的预训练语言模型,通过在大规模高质量多语种预训练语料上进行自监督学习,获得了通用语言理解能力。相较BERT等模型,Qwen-7B的训练数据覆盖更广,包含网络文本、学术著作、代码等资源,参数量约70亿,词表规模约15万。

On public leaderboards of downstream English and Chinese tasks, Qwen-7B significantly outperforms models of similar size and even surpasses larger models on some tasks. This demonstrates Qwen-7B's strong language understanding and transfer learning abilities. The considerations for choosing Qwen-7B as our pre-trained model are as follows: 1) The model size is moderate, with high training efficiency that meets our computational resource constraints; 2) The training data coverage is extensive, encompassing the common language, domain-specific language, and code data required for our cross-domain task; 3) It achieves strong performance on public benchmarks, with sufficient language understanding capabilities to provide a solid foundation for our downstream tasks.

在英语和汉语下游任务的公开排行榜中,Qwen-7B显著优于同等规模的模型,甚至在某些任务上超越更大规模的模型。这证明了Qwen-7B强大的语言理解和迁移学习能力。选择Qwen-7B作为预训练模型的考虑因素如下:1) 模型规模适中,训练效率高,符合我们的计算资源限制;2) 训练数据覆盖广泛,包含跨领域任务所需的通用语言、领域专用语言及代码数据;3) 在公开基准测试中表现优异,具备足够的语言理解能力,可为下游任务提供坚实基础。

Two-stage Supervised Instruction Fine-tuning

两阶段监督指令微调

During the SFT stage, we performed fine-tuning on our curated dataset of millions of examples. Since the instruction-tuning data involve dozens of datasets across over ten biomedical NLP tasks in both Chinese and English, performing fine-tuning in a simple single stage would result in task interference, preventing convergence on some of the more challenging NLP tasks. Therefore, we propose a two-stage supervised instruction fine-tuning strategy, shown in Figure 3.

在监督微调(SFT)阶段,我们对精心整理的数百万条样本数据集进行了微调。由于指令调优数据涉及十余种生物医学NLP任务的中英文数据集,采用简单的单阶段微调会导致任务干扰,阻碍模型在部分更具挑战性的NLP任务上收敛。为此,我们提出了如图3所示的两阶段监督指令微调策略。


Figure 3. The two-stage training process of Taiyi. Tasks in Type1 are shown on a blue background, and tasks in Type2 on a yellow background.

图 3: Taiyi的两阶段训练过程。Type1中的任务以蓝色背景显示,Type2中的任务以黄色背景显示。

According to the task type and the size of the task dataset, we first manually categorized all the tasks into two types: Type1 and Type2. Most tasks in Type1 are not generation tasks in nature (e.g., NER is a sequence labeling task in nature rather than a generation task), or the size of the task dataset is relatively small. These tasks include information extraction, text classification, text pair tasks, machine translation, and other additional tasks. Type2 contains QA and dialogue tasks, including biomedical QA, biomedical multi-round dialogue, and general dialogue tasks; these tasks are generation tasks in nature. In the first stage of model training, we performed supervised instruction fine-tuning on the data from the Type1 tasks (around 340,000 instances). The best checkpoint was selected for the second stage through a combination of human evaluation and automated metrics on the development sets. In the second training stage, all the training data from the first stage are used as retrospective data and mixed with all data in Type2 for incremental training.

根据任务类型和任务数据集规模,我们首先将所有任务手动划分为两类:Type1和Type2。Type1中的任务大多本质不属于生成式任务(例如NER本质上是序列标注任务而非生成任务),或任务数据集规模相对较小。这类任务包括信息抽取、文本分类、文本对任务、机器翻译及其他附加任务。Type2包含问答和对话任务,涉及生物医学问答、生物医学多轮对话以及通用对话任务,这些任务本质属于生成式任务。在模型训练的第一阶段,我们首先对Type1任务数据(约34万条实例)进行监督式指令微调,通过开发集上的人工评估与自动化指标综合选取最佳检查点用于第二阶段训练。在第二阶段训练中,第一阶段所有训练数据将作为回溯数据使用,并与Type2全部数据进行混合以实现增量训练。

We used 8 A40 GPUs for SFT. The model was trained for 5 epochs (~7 hours per epoch) in stage 1 and 3 epochs (~26 hours per epoch) in stage 2. To improve model training efficiency, we chose QLoRA (Dettmers et al., 2023), an efficient tuning algorithm, to conduct supervised instruction fine-tuning. The main hyperparameters for the training process are set as follows: a per-GPU batch size of 12, a learning rate of 0.0002, a warmup ratio of 0.1, a maximum sequence length of 1024, a LoRA rank of 64, LoRA alpha of 16, and LoRA dropout of 0.05.

我们使用8块A40 GPU进行监督微调(SFT)。第一阶段训练5个周期(每周期约7小时),第二阶段训练3个周期(每周期约26小时)。为提高训练效率,我们采用高效调参算法QLoRA (Dettmers等人,2023)进行监督指令微调。主要超参数设置如下:单GPU批次大小12,学习率0.0002,预热比例0.1,最大长度1024,LoRA秩64,LoRA alpha值16,LoRA dropout率0.05。
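From these reported settings, one can sanity-check the training scale (assuming no gradient accumulation, which the paper does not specify):

```python
import math

# Effective batch and steps per epoch for stage 1, from the reported
# settings: 8 GPUs, per-GPU batch size 12, ~340,000 stage-1 instances.
gpus, per_gpu_batch = 8, 12
effective_batch = gpus * per_gpu_batch          # 96 sequences per optimizer step
stage1_instances = 340_000
steps_per_epoch = math.ceil(stage1_instances / effective_batch)
print(effective_batch, steps_per_epoch)  # → 96 3542
```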

RESULTS

结果

Evaluation tasks, datasets and metrics

评估任务、数据集和指标

To investigate the capability of the Taiyi model for various bilingual BioNLP tasks, we selected four task types (i.e., NER, RE, TC, and multiple-choice QA) for metric-based evaluation and used the other tasks for the case study. The statistics of the test sets are shown in Supplementary Table S3. Further details for each evaluation task are provided below.

为评估太乙模型在多种双语生物自然语言处理任务中的能力,我们选取了四种任务类型(即命名实体识别(NER)、关系抽取(RE)、文本分类(TC)和多项选择问答)作为指标评估对象,其他任务则用于案例分析。测试集统计数据详见补充材料表S3。各评估任务的具体细节如下:

Biomedical Named Entity Recognition (NER). Biomedical NER aims to identify predefined biomedical entities from text, such as diseases, drugs, genes, and proteins. Six biomedical NER datasets (i.e., BC5CDR-Chemical [32], BC5CDR-Disease [32], CHEMDNER [33], NCBI-Disease [34], BioRED [35] and CMeEE-dev [28]) are used for our NER evaluation. The exact-match (the predicted entity text and entity type, ignoring span) micro F1-score was used as the evaluation metric.

生物医学命名实体识别 (NER)。生物医学NER旨在从文本中识别预定义的生物医学实体,如疾病、药物、基因和蛋白质。我们使用了六个生物医学NER数据集(即BC5CDR-Chemcial [32]、BC5CDR-disease [32]、CHEMDNER [33]、NCBI-Disease [34]、BioRED [35]和CMeEE-dev [28])进行NER评估。评估指标采用精确匹配(预测实体文本和实体类型,不考虑跨度)的微平均F1分数。
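The exact-match micro F1 described here can be computed as in the sketch below, treating each document's annotations as a multiset of (entity text, entity type) pairs; this is our illustration of the metric, not the authors' evaluation script:

```python
from collections import Counter

def micro_f1(gold_docs, pred_docs):
    """Exact-match micro F1 over (entity text, entity type) pairs,
    aggregated across documents (spans are ignored, as described)."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        g, p = Counter(gold), Counter(pred)
        overlap = sum((g & p).values())          # exact matches in both sets
        tp += overlap
        fp += sum(p.values()) - overlap          # predicted but not gold
        fn += sum(g.values()) - overlap          # gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [[("famotidine", "Chemical"), ("delirium", "Disease")]]
pred = [[("famotidine", "Chemical"), ("sepsis", "Disease")]]
print(round(micro_f1(gold, pred), 3))  # → 0.5
```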

Biomedical Relation Extraction (RE). Biomedical RE aims to automatically extract predefined relations or associations between biomedical entities from text. In this task, we focus on extracting the entity relation triple (head entity, tail entity, relation type). We selected one English dataset, BC5CDR, and one Chinese dataset, CMeIE-dev [28], for the RE evaluation. The micro F1-score was used as the RE metric.

生物医学关系抽取(RE)。生物医学关系抽取旨在从文本中自动提取预定义的生物医学实体间关系或关联。本任务聚焦于抽取实体关系三元组(头实体, 尾实体, 关系类型)。我们选取了英文数据集BC5CDR和中文数据集CMeIE-dev [28] 进行关系抽取评估,采用微平均F1值作为评价指标。

Biomedical Text Classification (TC). Biomedical TC aims to automatically categorize texts into predefined biomedical class labels. Two manually annotated English multi-label document classification datasets (BC7LitCovid [36] and HoC [37]) and one Chinese dataset (KUAKE-QIC [28]) are used for the TC evaluation. The micro F1-score was used as the metric.

生物医学文本分类 (TC)。生物医学文本分类旨在将文本自动归类到预定义的生物医学类别标签中。评估使用了两个手动标注的英文多标签文档分类数据集 (BC7LitCovid [36] 和 HoC [37]) 以及一个中文数据集 (KUAKE-QIC [28]),并以微平均 F1 分数作为评估指标。

Biomedical Multiple-choice Question Answering (QA-mc). The biomedical multiple-choice question answering task aims to answer multiple-choice questions related to biomedicine and healthcare. PubMedQA [38] and MedQA-USMLE (4-option) [39] are used to evaluate the model performance in English. MedQA-MCMLE (4-option) [39] is used to evaluate the model performance in Chinese. We used accuracy as the metric for this task.

生物医学多选题问答 (QA-mc)。生物医学多选题问答任务旨在回答与生物医学和医疗保健相关的选择题。PubMedQA [38] 和 MedQA-USMLE (4选项) [39] 用于评估模型在英语中的表现。MedQA-MCMLE (4选项) [39] 用于评估模型在中文中的表现。我们使用准确率作为该任务的评估指标。

In addition to the four primary evaluation tasks, the capabilities of the Taiyi model were also shown through examples on several supplementary biomedical NLP tasks, such as Medical Report Generation, Biomedical Event Extraction, Biomedical Machine Translation, Biomedical Title Generation, Biomedical Text Semantic Similarity, and Biomedical Question Answering and Chat. Details can be found in the Supplementary Examples of Taiyi output on the supplementary biomedical NLP tasks.

除四项主要评估任务外,太乙模型的能力还通过若干辅助生物医学NLP任务的示例得以展示,例如医疗报告生成、生物医学事件抽取、生物医学机器翻译、生物医学标题生成、生物医学文本语义相似度、生物医学问答及对话。具体细节可参阅《太乙模型在辅助生物医学NLP任务中的输出补充示例》。

Effectiveness of the two-stage fine-tuning strategy

两阶段微调策略的有效性

In this experiment, we tested the effect of our two-stage fine-tuning strategy on the four tasks. For comparison, we simply combined all data of the four tasks to fine-tune the base model as the baseline. As the inference cost of LLMs is high, we randomly sampled 200 instances from the test set of each dataset. The performances of the models using the simple one-stage and the two-stage strategies on these tasks are shown in Figure 4.

在本实验中,我们测试了两阶段微调策略对四项任务的效果。作为对比基线,我们直接将四项任务的所有数据合并用于微调基础模型。由于大语言模型成本较高,我们从每个数据集的测试集中随机抽取了200个样本。图4展示了使用单阶段简单策略和两阶段策略的模型在各项任务上的表现。


Figure 4. Performances of models using different fine-tuning strategies. One-stage denotes the model fine-tuned by combining all task datasets. Two-stage denotes the model fine-tuned by our two-stage strategy. (A) the results on the NER tasks. (B) the results on the RE tasks. (C) the results on the TC tasks. (D) the results on the QA-mc tasks.

图 4: 采用不同微调策略的模型性能对比。One-stage表示将所有任务数据集合并进行微调的模型,Two-stage表示采用我们提出的两阶段策略微调的模型。(A) NER任务结果。(B) RE任务结果。(C) TC任务结果。(D) QA-mc任务结果。

When we simply combined all task datasets as a one-stage strategy to fine-tune the LLM, the results show poor performance on all tasks. The main reason may be the challenge of converging on different tasks with varying levels of difficulty and dataset sizes. Therefore, we first fine-tuned the model on the Type1 tasks described in the section on Two-stage Supervised Instruction Fine-tuning independently, then combined all data to continue fine-tuning the model in the second stage. The results show that our two-stage strategy outperforms the one-stage strategy on all English and Chinese tasks and achieves a significant average improvement (~10% in the metrics). The two-stage approach allows the model to first become specialized on tasks that are not generation tasks in nature before developing more generalized capabilities across tasks in the second stage.

当我们简单地合并所有任务数据集作为单阶段策略来微调大语言模型时,结果显示在所有任务上表现不佳。主要原因可能是难以收敛不同难度级别和数据规模的任务。因此,我们首先在两阶段监督指令微调章节描述的Type1任务上独立微调模型,然后在第二阶段合并所有数据继续微调模型。结果表明,我们的两阶段策略在所有英文和中文任务上都优于单阶段策略,并在指标上实现了显著的平均提升(约10%)。两阶段方法使模型能够在第二阶段开发跨任务的更通用能力之前,首先专注于本质上非生成类的任务。

Performance of Taiyi on the entire test sets

太乙模型在完整测试集上的表现

The previous experiment demonstrates our two-stage approach achieves significant improvements on the subset of the test sets compared to the simple one-stage training strategy. In this experiment, we evaluate the performance of our Taiyi on the entire test sets of the 13 tasks covering 4 BioNLP task types. The results of ChatGPT 3.5 (GPT-3.5-Turbo) and the state-of-the-art (SOTA) methods based on supervised pretrained language models for each corpus are provided for comparison.

先前实验表明,与简单单阶段训练策略相比,我们的两阶段方法在测试集子集上取得了显著提升。本实验中,我们评估了Taiyi在涵盖4类BioNLP任务的13个完整测试集上的性能,并提供ChatGPT 3.5 (GPT-3.5-Turbo)及各语料库基于监督预训练语言模型的当前最优(SOTA)方法结果作为对比。

Table 2. Performance comparison with other existing methods on the 13 BioNLP tasks

| Task | Datasets | Taiyi | ChatGPT 3.5 | SOTA |
| --- | --- | --- | --- | --- |
| NER (Micro-F1) | BC5CDR-Chem (en) | 0.802 | 0.603 [40] | 0.933 (PubMedBERT [40]) |
| | BC5CDR-Dise (en) | 0.691 | 0.518 [40] | 0.856 (PubMedBERT [40]) |
| | CHEMDNER (en) | 0.799 | 0.365 [41] | 0.924 (BioBERT [42]) |
| | NCBI-Disease (en) | 0.731 | 0.505 [40] | 0.878 (PubMedBERT [40]) |
| | CMeEE (zh) | 0.657 | 0.470 [43] | 0.740 (CBLUE-Leaderboard [28]) |
| RE (Micro-F1) | BC5CDR (en) | 0.375 | 0.142 | 0.450 (BioGPT [44]) |
| | CMeIE (zh) | 0.432 | 0.306 [43] | 0.549 (CBLUE-Leaderboard [28]) |
| TC (Micro-F1) | BC7LitCovid (en) | 0.840 | 0.639 [45] | 0.918 (Bioformer [46]) |
| | HoC (en) | 0.800 | 0.512 [40] | 0.823 (PubMedBERT [40]) |
| | KUAKE-QIC (zh) | 0.774 | 0.485 [43] | 0.859 (CBLUE-Leaderboard [28]) |
| QA (Accuracy) | PubMedQA (en) | 0.544 | 0.765 [40] | 0.558 (PubMedBERT [40]) |
| | MedQA-USMLE (en) | 0.371 | 0.513 [47] | 0.367 (BioBERT-large [39]) |
| | MedQA-MCMLE (zh) | 0.648 | 0.582 [47] | 0.701 (RoBERTa-large [39]) |
| All | Average | 0.651 | 0.493 | 0.735 |

Note: For ChatGPT 3.5 and the SOTA methods, we report previously published results. Since no published ChatGPT 3.5 result is available for relation extraction on the BC5CDR test set, we obtained it using the same prompt via the OpenAI API. For the CMeEE, CMeIE and KUAKE_QIC datasets, the results of Taiyi are reported on the development sets since the gold-standard test sets have not been released. Although they may not be directly comparable, we still list them for reference.

表 2. 在13项BioNLP任务上与其他现有方法的性能对比

| 任务 | 数据集 | Taiyi | ChatGPT 3.5 | SOTA |
| --- | --- | --- | --- | --- |
| NER (Micro-F1) | BC5CDR-Chem (en) | 0.802 | 0.603 [40] | 0.933 (PubMedBERT [40]) |
| | BC5CDR-Dise (en) | 0.691 | 0.518 [40] | 0.856 (PubMedBERT [40]) |
| | CHEMDNER (en) | 0.799 | 0.365 [41] | 0.924 (BioBERT [42]) |
| | NCBI-Disease (en) | 0.731 | 0.505 [40] | 0.878 (PubMedBERT [40]) |
| | CMeEE (zh) | 0.657 | 0.470 [43] | 0.740 (CBLUE-Leaderboard [28]) |
| RE (Micro-F1) | BC5CDR (en) | 0.375 | 0.142 | 0.450 (BioGPT [44]) |
| | CMeIE (zh) | 0.432 | 0.306 [43] | 0.549 (CBLUE-Leaderboard [28]) |
| TC (Micro-F1) | BC7LitCovid (en) | 0.840 | 0.639 [45] | 0.918 (Bioformer [46]) |
| | HOC (en) | 0.800 | 0.512 [40] | 0.823 (PubMedBERT [40]) |
| | KUAKE_QIC (zh) | 0.774 | 0.485 [43] | 0.859 (CBLUE-Leaderboard [28]) |
| QA (Accuracy) | PubMedQA (en) | 0.544 | 0.765 [40] | 0.558 (PubMedBERT [40]) |
| | MedQA-USMLE (en) | 0.371 | 0.513 [47] | 0.367 (BioBERT-large [39]) |
| | MedQA-MCMLE (zh) | 0.648 | 0.582 [47] | 0.701 (RoBERTa-large [39]) |
| ALL | AVE | 0.651 | 0.493 | 0.735 |

注:对于ChatGPT 3.5和SOTA方法的结果,我们提供了已发表论文中的先前结果。由于ChatGPT 3.5在BC5CDR测试集上的关系抽取结果尚未发布,我们通过OpenAI API使用相同的提示获得了ChatGPT的结果。对于CMeEE、CMeIE和KUAKE_QIC数据集,Taiyi的结果是在开发集上报告的,因为黄金标准测试集尚未发布。尽管它们可能无法直接比较,但我们仍将其列出以供参考。

As shown in Table 2, our Taiyi outperforms ChatGPT 3.5 on 11 of the 13 datasets, the exceptions being the two English QA datasets. This shows that fine-tuning an LLM on domain-specific instruction data can improve performance on in-domain downstream tasks. However, the lower English QA results suggest that rich domain knowledge is largely acquired during the pre-training stage of LLMs and is difficult to compensate for via instruction fine-tuning. Compared with the SOTA pre-trained language models, Taiyi achieves comparable results on the QA tasks. However, its performance still falls short by a margin (on average $\sim9%$ in the metrics) on the NER, RE, and TC tasks, where conventional discriminative methods outperform generative methods.

如表 2 所示,我们的 Taiyi 在 13 个数据集中的 11 个上表现优于 ChatGPT3.5,除了两个英文问答数据集。这表明针对特定领域指令数据微调的大语言模型可以提高领域内下游任务的性能。然而,较低的英文问答结果说明丰富的领域知识可能来自大语言模型的预训练阶段,这很难通过指令微调来弥补。与最先进的预训练语言模型相比,Taiyi 模型在问答任务上取得了相当的结果。但在命名实体识别 (NER) 、关系抽取 (RE) 和文本分类 (TC) 任务上,Taiyi 的性能仍存在一定差距 (指标平均下降约 9%) ,这些任务中传统的判别式方法通常优于生成式方法。
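For reference, the micro-F1 used for the extraction tasks in Table 2 pools true positives over the whole test set before computing F1. Below is a minimal sketch with invented entities, not benchmark data:

```python
def micro_f1(gold, pred):
    """gold/pred: one set of (mention, type) pairs per document; strict matching."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))      # pooled true positives
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example (entities are made up for illustration):
gold = [{("aspirin", "Chemical"), ("headache", "Disease")}, {("naloxone", "Chemical")}]
pred = [{("aspirin", "Chemical")}, {("naloxone", "Chemical"), ("fever", "Disease")}]
score = micro_f1(gold, pred)   # tp=2 over 3 predicted and 3 gold -> P = R = F1 = 2/3
```

Micro averaging weights every mention equally, so large datasets and frequent entity types dominate the score, which is the convention the compared systems also follow.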

Performance of Taiyi on the new task

太乙在新任务上的表现

To investigate whether the Taiyi model can support a new task not seen in our fine-tuning data, we held out a recently published biomedical corpus, BioRED, for evaluation. BioRED is a biomedical relation extraction dataset with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene–disease; chemical–chemical) at the document level. Here, we focus on the NER performance of Taiyi on the BioRED test set.

为了探究太乙模型能否应用于支持微调数据中未见过的新任务,我们使用最新发布的生物医学语料库BioRED进行评估。BioRED是一个文档级生物医学关系抽取数据集,包含多种实体类型(如基因/蛋白质、疾病、化学物质)和关系对(如基因-疾病、化学物质-化学物质)。本文主要关注太乙模型在BioRED测试集上的命名实体识别(NER)性能。

Table 3. Performance of the Taiyi model on the BioRED test set

| Entity Type | P | R | F1 |
| --- | --- | --- | --- |
| Chemical | 0.717 | 0.566 | 0.633 |
| Disease | 0.829 | 0.533 | 0.649 |
| Gene | 0.931 | 0.490 | 0.642 |
| Variant | 0.633 | 0.585 | 0.608 |
| Species | 0.640 | 0.286 | 0.395 |
| Cell line | 0.213 | 0.565 | 0.310 |
| All | 0.760 | 0.507 | 0.608 |

表 3: Taiyi模型在BioRED测试集上的性能表现

| 实体类型 | P | R | F1 |
| --- | --- | --- | --- |
| Chemical | 0.717 | 0.566 | 0.633 |
| Disease | 0.829 | 0.533 | 0.649 |
| Gene | 0.931 | 0.490 | 0.642 |
| Variant | 0.633 | 0.585 | 0.608 |
| Species | 0.640 | 0.286 | 0.395 |
| Cell line | 0.213 | 0.565 | 0.310 |
| All | 0.760 | 0.507 | 0.608 |

As shown in Table 3, Taiyi obtains promising results (an overall F1-score of 0.608) on the new BioRED NER task, significantly higher than the performance of ChatGPT (0.391 in F1-score [41]) on the same test set. Among the entity types, Taiyi performs better on chemical, disease, gene, and variant entities, since these types appear in other related datasets (e.g., BC5CDR, GnormPlus, and tmVar) during the SFT stage. Moreover, it is noteworthy that even for the entity types unseen in the SFT stage (i.e., species and cell lines), Taiyi still understands the entity recognition instruction and extracts some species and cell line entities. These results suggest that Taiyi is highly robust and generalizable.

如表 3 所示,Taiyi 在新的 BioRED NER 任务中取得了优异的结果 (F1 分数总体性能为 0.608)。该性能显著高于 ChatGPT 在测试集上的表现 (F1 分数为 0.391 [41])。在这些实体类型中,Taiyi 对化学物质、疾病、基因和变异实体表现更优,因为这些实体在 SFT 阶段已出现在其他相关数据集 (如 BC5CDR、GnormPlus 和 tmVar) 中。此外值得注意的是,即使对于 SFT 阶段未见的实体类型 (即物种和细胞系),Taiyi 仍能理解实体识别指令并提取部分物种和细胞系实体。这些结果表明 Taiyi 具有高度鲁棒性和泛化能力。
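The per-type precision/recall/F1 breakdown reported in Table 3 amounts to a grouped count of strict matches; a minimal sketch with invented entities is shown below.

```python
from collections import defaultdict

def per_type_prf(gold_docs, pred_docs):
    """Each doc is an iterable of (mention, type) pairs; returns {type: (P, R, F1)}."""
    tp, n_pred, n_gold = defaultdict(int), defaultdict(int), defaultdict(int)
    for g_doc, p_doc in zip(gold_docs, pred_docs):
        g, p = set(g_doc), set(p_doc)
        for _, t in g & p:      # strict match on both mention and type
            tp[t] += 1
        for _, t in p:
            n_pred[t] += 1
        for _, t in g:
            n_gold[t] += 1
    scores = {}
    for t in set(n_pred) | set(n_gold):
        prec = tp[t] / n_pred[t] if n_pred[t] else 0.0
        rec = tp[t] / n_gold[t] if n_gold[t] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[t] = (prec, rec, f1)
    return scores

# Toy single-document example (entities invented for illustration):
scores = per_type_prf(
    [[("BRCA1", "Gene"), ("breast cancer", "Disease")]],
    [[("BRCA1", "Gene")]],
)
```

In the toy run, the Gene row is perfect while the Disease row has zero recall, mirroring how the "All" row in Table 3 hides very different per-type behavior.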

Case study

案例研究

To further investigate the benefits of in-domain instruction fine-tuning, we qualitatively analyzed model outputs on various BioNLP tasks and compared Taiyi to Qwen7B-Chat, a published LLM fine-tuned with general-domain instruction data on the same base model Qwen7B. Table 4 shows the answers produced by the different models for the same questions in a Chinese chat. From the results, Taiyi generates more detailed and accurate responses to biomedical questions than the general-domain model Qwen-Chat. More examples for different BioNLP tasks can be found in the Supplementary Case study.

为深入探究领域内指令微调的益处,我们对不同生物医学自然语言处理(BioNLP)任务的模型输出进行了定性分析,并将Taiyi与基于同款基础模型Qwen7B开发的通用领域指令微调大语言模型Qwen7B-Chat进行对比。表4展示了不同模型在中文问答场景下对相同问题的回答。结果显示,相较于通用领域模型Qwen-Chat,Taiyi针对生物医学问题能生成更详尽准确的回答。更多不同BioNLP任务的对比案例可参阅补充案例研究部分。

Input (1): 什么是鼠疫?(What is plague?)

输入 (1): 什么是鼠疫? (What is plague?)

Taiyi:

太乙:

鼠疫是一种由鼠疫杆菌引起的传染病,主要通过跳蚤等媒介传播。鼠疫杆菌主要存在于啮齿类动物体内,当啮齿类动物死亡后,跳蚤等昆虫会吸食其体液,从而将鼠疫杆菌带入人体,引起鼠疫感染。鼠疫的症状包括高热、寒战、淋巴结肿大、咳嗽、呼吸急促等,严重时可导致死亡。鼠疫是一种古老的传染病,历史上曾多次引起大规模的流行病,如14世纪的黑死病,给欧洲造成了巨大的人口和经济损失。(Plague is an infectious disease caused by Yersinia pestis and is mainly spread through fleas and other vectors. Yersinia pestis mainly exists in rodents. When rodents die, fleas and other insects will suck their body fluids, thereby introducing Yersinia pestis into the human body, causing plague infection. Symptoms of plague include high fever, chills, swollen lymph nodes, cough, shortness of breath, etc., which can lead to death in severe cases. Plague is an ancient infectious disease that has caused large-scale epidemics many times in history, such as the Black Death in the 14th century, which caused huge human and economic losses to Europe.)

鼠疫是一种由鼠疫杆菌 (Yersinia pestis) 引起的传染病,主要通过跳蚤等媒介传播。鼠疫杆菌主要存在于啮齿类动物体内,当啮齿类动物死亡后,跳蚤等昆虫会吸食其体液,从而将鼠疫杆菌带入人体,引起鼠疫感染。鼠疫的症状包括高热、寒战、淋巴结肿大、咳嗽、呼吸急促等,严重时可导致死亡。鼠疫是一种古老的传染病,历史上曾多次引起大规模的流行病,如14世纪的黑死病 (Black Death) ,给欧洲造成了巨大的人口和经济损失。

Qwen-Chat:

Qwen-Chat:

鼠疫是一种由鼠疫杆菌引起的急性传染病,它可以在动物之间传播,并且可以通过接触染病动物的排泄物或携带病毒的动物及其分泌物传播给人类。该病的症状包括发热、寒战、淋巴结肿大、呼吸困难和出血等。(Plague is an acute infectious disease caused by Yersinia pestis, which can be transmitted between animals and can be transmitted to humans through contact with the excrement of infected animals or animals carrying the virus and their secretions. Symptoms of the disease include fever, chills, swollen lymph nodes, difficulty breathing and bleeding.)

鼠疫是一种由鼠疫杆菌 (Yersinia pestis) 引起的急性传染病,它可以在动物之间传播,并且可以通过接触染病动物的排泄物或携带病毒的动物及其分泌物传播给人类。该病的症状包括发热、寒战、淋巴结肿大、呼吸困难和出血等。

Input (2):

输入 (2):

有什么预防方法吗?(Are there any prevention methods?)

有什么预防方法吗? (Are there any prevention methods?)

Taiyi:

太乙:

是的,预防鼠疫的方法包括以下几点:

1. 避免接触感染源:尽量避免接触啮齿类动物,如老鼠、土拨鼠等,不要在它们的巢穴附近逗留。
2. 注意个人卫生:保持手部卫生,勤洗手,避免接触有感染风险的物品,如被感染的衣物、被褥等。
3. 食用安全:避免食用被感染的动物或其制品,如鼠肉、鼠肉制品等。
4. 居住环境:保持家居环境清洁卫生,定期消毒,特别是鼠洞、鼠粪等易滋生细菌的地方。
5. 预防接种:对于鼠疫高发区的人群,可以接种鼠疫疫苗,提高免疫力。

总之,预防鼠疫需要从多个方面入手,保持个人卫生、注意饮食安全、保持家居环境清洁卫生等都是预防鼠疫的有效方法。

(Yes, ways to prevent plague include the following:

1. Avoid contact with sources of infection: Try to avoid contact with rodents, such as mice, prairie dogs, etc., and do not stay near their nests.
2. Pay attention to personal hygiene: maintain hand hygiene, wash hands frequently, and avoid contact with items with risk of infection, such as infected clothing, bedding, etc.
3. Food safety: Avoid eating infected animals or their products, such as rat meat and rat meat products.
4. Home environment: Keep the home environment clean and sanitary, and disinfect it regularly, especially rat holes, rat droppings and other places where bacteria can easily breed.
5. Vaccination: People in areas with a high incidence of plague can be vaccinated against plague to improve their immunity.

In short, the prevention of plague requires efforts on multiple fronts. Maintaining personal hygiene, paying attention to food safety, and keeping the home environment clean and hygienic are all effective ways to prevent plague.)

Qwen-Chat:

Qwen-Chat:

预防鼠疫的最有效方法是避免接触携带鼠疫杆菌的动物和它们的排泄物,以及不触摸被污染的物品和表面。另外,应保持个人卫生,经常洗手,避免进食未煮熟的食物,以及及时接种疫苗来预防鼠疫的发生。(The most effective way toprevent plague is to avoid contact with animals that carry Yersinia pestis and their excrement, as well as not touching contaminated objects and surfaces. In addition, you should maintain personal hygiene, wash your hands frequently, avoid eating undercooked food, and get vaccinated in time to prevent the occurrence of plague.)

预防鼠疫的最有效方法是避免接触携带鼠疫杆菌 (Yersinia pestis) 的动物和它们的排泄物,以及不触摸被污染的物品和表面。另外,应保持个人卫生,经常洗手,避免进食未煮熟的食物,以及及时接种疫苗来预防鼠疫的发生。

Note that the English in brackets is not part of the input and output; it is the translation of the Chinese.

注意:括号内的英文并非输入输出内容,而是中文的翻译说明。

DISCUSSION

讨论

Due to the complexity of biomedical data and knowledge gaps across tasks, general LLMs often generate meaningless or completely unrelated responses to a given biomedicine input, or users need to design professional prompts to improve LLM performance on biomedical tasks. To mitigate such issues, we elaborately fine-tuned the LLM on bilingual biomedical data across multiple biomedical tasks. For adapting an LLM to biomedicine, it is crucial to establish a data-centric infrastructure and fine-tuning procedure. In our study, we collected a large number of in-domain training datasets, filtered out high-quality subsets, and optimized reasonable ratios for each dataset. Including the tasks that are not generation tasks in nature (e.g., NER and RE), all tasks are converted into generation tasks via instruction templates to train the model. The proposed two-stage SFT strategy is shown to effectively improve model performance. Owing to these efforts, Taiyi can understand bilingual biomedical task instructions and has strong robustness and generalization capabilities on a variety of tasks.

由于生物医学数据的复杂性和跨任务间的知识鸿沟,通用大语言模型往往会对给定的生物医学输入生成无意义或完全无关的响应,用户需要设计专业提示词(prompt)来提升大语言模型在生物医学任务中的表现。为解决这些问题,我们在多任务双语生物医学数据上对模型进行了精细微调。针对生物医学领域的LLM优化,建立以数据为中心的基础设施和微调流程至关重要。本研究收集了大量领域内训练数据集,筛选出高质量子集并优化了各数据集的比例配比。对于本质上非生成式任务(如NER和RE),均通过指令模板转化为生成式任务进行模型训练。提出的两阶段SFT策略被证明能有效提升模型性能。得益于这些工作,Taiyi能够理解双语生物医学任务指令,并在多样化任务中展现出强大的鲁棒性和泛化能力。

However, Taiyi still has some common limitations of LLMs, including hallucinations, bias and fairness issues, lack of common sense, and deficient biomedical knowledge. For example, when we input the sentence "Please introduce the octacyclines among antibiotics", Taiyi generated the response "Octacyclines are a group of antibiotics that are structurally similar to each other and share a common chemical structure. They are characterized by having a 1,4-dioxane ring in their core structure……". In fact, "octacyclines" are fictitious and there are no such antibiotics. Such hallucinations may potentially lead to severe medical malpractice. Moreover, Taiyi achieves better performance on the Chinese QA task but lower scores on the English QA tasks than ChatGPT 3.5, as shown in Table 2. This suggests that rich biomedical knowledge is largely learned during the pre-training stage of LLMs and is difficult to compensate for via simple instruction fine-tuning. In practice, it is very challenging to train capable biomedical LLMs from scratch, due to the huge compute consumption and the sensitivity to data quality and training tricks. Therefore, it is useful to develop effective tuning strategies and use additional biomedical resources to inject specific knowledge. Our future work will focus on these problems: incorporating knowledge resources (e.g., biomedical knowledge databases and factual information obtained by retrieval technology), improving the biomedical interpretability of the model's output, and aligning with human intentions to improve safety in the medical field.

然而,Taiyi仍存在大语言模型的常见局限,包括幻觉、偏见与公平性缺失、常识缺乏及生物医学知识不足。例如,当我们输入句子"请介绍抗生素中的Octacyclines"时,Taiyi生成的回应是"Octacyclines是一组结构相似且具有共同化学结构的抗生素,其核心结构特征为含有1,4-二氧六环......"。实际上"Octacyclines"是虚构的,并不存在此类抗生素。这种幻觉可能导致严重的医疗事故。此外,如 表2 所示,Taiyi在中文问答任务上表现优于ChatGPT 3.5,但在英文问答任务中得分较低。这表明丰富的生物医学知识可能来自大语言模型的预训练阶段,仅通过简单的指令微调难以弥补。实践中,由于巨大的计算消耗及对数据质量和训练技巧的敏感性,从头训练具备生物医学能力的大语言模型极具挑战性。因此,开发有效的微调策略并利用额外生物医学资源注入特定知识具有重要意义。我们未来的工作将聚焦于以下方向:整合知识资源(如生物医学知识库和检索技术获取的事实信息)、提升模型输出的生物医学可解释性、以及对齐人类意图以提高医疗领域安全性。

CONCLUSION

结论

In this study, we collected diverse bilingual (English and Chinese) BioNLP datasets and standardized their formats. Leveraging these rich training resources and the proposed two-stage supervised fine-tuning approach, Taiyi shows considerable capability on various BioNLP tasks. Furthermore, Taiyi exhibits cross-lingual generalization across similar task scenarios while retaining general-domain conversational abilities. Overall, utilizing rich high-quality biomedical corpora and designing effective fine-tuning strategies can substantially enhance the performance of LLMs within the biomedical domain. Our future work will focus on further enhancing Taiyi's task capabilities, interpretability, and security for biomedical applications.

在本研究中,我们收集了多样化的双语(英文和中文)BioNLP数据集并统一了其格式。利用这些丰富的训练资源和提出的两阶段监督微调方法,Taiyi在各类BioNLP任务中展现出显著能力。此外,Taiyi在相似任务场景中表现出跨语言泛化能力,同时保持通用领域的对话能力。总体而言,利用高质量生物医学语料库并设计有效的微调策略,可大幅提升大语言模型在生物医学领域的性能。我们未来的工作将集中于进一步增强Taiyi在生物医学应用中的任务能力、可解释性和安全性。

Ethical Considerations

伦理考量

This research demonstrates the potential of LLMs in the biomedical domain. The transition from using the Taiyi LLM for doctor-patient dialogue and medical report generation to practical applications in medical services will require abundant additional research to ensure the safety of this technology. In addition, strict expert evaluation for different medical scenarios must be considered in deployment to enable early discovery of diagnostic errors. Noteworthy aspects also include biases and security vulnerabilities inherited from the foundation models.

本研究展示了大语言模型(LLM)在生物医学领域的潜力。从使用Taiyi大语言模型进行医患对话和医疗报告生成,到医疗服务中的实际应用,还需要大量额外研究来确保该技术的安全性。此外,在部署时必须考虑针对不同医疗场景的严格专家评估,以实现早期诊断错误发现。同样值得注意的方面还包括基础模型继承的偏见和安全漏洞。

FUNDING

资金支持

This research was supported by the National Natural Science Foundation of China (No. 62302076) and the Fundamental Research Funds for the Central Universities [No. DUT23RC(3)014].

本研究得到国家自然科学基金(No. 62302076)和中央高校基本科研业务费专项资金[No. DUT23RC(3)014]的资助。

AUTHOR CONTRIBUTIONS

作者贡献

Conception and design: LL, ZY, HL. Data collection and processing: LL, YZ, PC, WrF, YQ, DP, JL, HL, WdF, ST, YL. Model training and evaluation: JN, ZW, LL, ZD, QH, GX. Analysis and interpretation: LL, JN, YZ, YS. Drafting the manuscript: LL, JN, YZ, PC, ZD, WrF, QH. Revising the drafted manuscript: LL, JW, YS, ZY, HL. All authors approved the submitted version.

构思与设计:LL、ZY、HL。数据收集与处理:LL、YZ、PC、WrF、YQ、DP、JL、HL、WdF、ST、YL。模型训练与评估:JN、ZW、LL、ZD、QH、GX。分析与解读:LL、JN、YZ、YS。撰写初稿:LL、JN、YZ、PC、ZD、WrF、QH。修订稿件:LL、JW、YS、ZY、HL。所有作者均认可提交版本。

SUPPLEMENTARY MATERIAL

补充材料

Supplementary material is available at Journal of the American Medical Informatics Association online.

补充材料可在《美国医学信息学会杂志》在线版获取。

CONFLICT OF INTEREST STATEMENT None declared.

利益冲突声明 无。

DATA AVAILABILITY

数据可用性

The benchmark datasets that support the findings of this study are available from the official websites of natural language processing challenges with Data Use Agreements. The data information and model weights of Taiyi are available at https://github.com/DUTIR-BioNLP/Taiyi-LLM.

支持本研究结果的基准数据集可从签署数据使用协议的自然语言处理挑战赛官方网站获取。Taiyi的数据信息和模型权重可在https://github.com/DUTIR-BioNLP/Taiyi-LLM查看。

REFERENCES

参考文献

Human Language Technologies, Volume 1 (Long and Short Papers); 2019.

人类语言技术,第1卷(长论文与短论文);2019。

  1. Cui J, Li Z, Yan Y, Chen B, Yuan L. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092 2023
  2. Cui J, Li Z, Yan Y, Chen B, Yuan L. Chatlaw: 集成外部知识库的开源法律大语言模型. arXiv preprint arXiv:2306.16092 2023
  3. Kasneci E, Seßler K, Küchemann S, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and individual differences 2023;103:102274
  4. Kasneci E, Seßler K, Küchemann S 等. ChatGPT 有益教育吗?论大语言模型在教育领域的机遇与挑战. 学习与个体差异 2023;103:102274
  5. Wang Y, Ma X, Chen W. Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering. arXiv preprint arXiv:2309.02233 2023
  6. Wang Y, Ma X, Chen W. 基于医学教材增强黑盒大语言模型的临床问答能力. arXiv预印本 arXiv:2309.02233 2023

Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks

Taiyi: 面向多样化生物医学任务的双语微调大语言模型

Supplementary Material

补充材料

Table S1. Summarization of existing LLMs in the biomedical domain

| Model Name | Base | Language | Training method | SFT data |
| --- | --- | --- | --- | --- |
| GatorTron [1] | Transformer | En | PT+SFT | BioNLP |
| Med-PaLM [2] | PaLM | En | SFT | QA |
| ChatDoctor [3] | LLaMA | En | SFT | QA and Chat |
| MedAlpaca [4] | LLaMA | En | SFT | QA and Chat |
| PMC-LLaMA [5] | LLaMA | En | CPT+SFT | QA and Chat |
| BenTsao [6] | ChatGLM | Zh | SFT | QA and Chat |
| DoctorGLM [7] | ChatGLM | Zh | SFT | QA and Chat |
| HuatuoGPT [8] | BLOOMZ | Zh | SFT+RL | QA and Chat |
| ClinicalGPT [9] | BLOOM | Zh | SFT+RL | QA and Chat |
| DISC-MedLLM [10] | Baichuan | Zh | SFT | QA and Chat |
| Zhongjing [11] | Ziya-LLaMA | Zh | CPT+SFT+RL | BioNLP |
| ChiMed-GPT [12] | Ziya-LLaMA | Zh | CPT+SFT+RL | QA, Chat |

PT denotes pre-training from scratch; CPT denotes continued pre-training; SFT denotes supervised fine-tuning; RL denotes reinforcement learning.

表 S1: 生物医学领域现有大语言模型汇总

| 模型名称 | 基础架构 | 语言 | 训练方法 | SFT数据 |
| --- | --- | --- | --- | --- |
| GatorTron [1] | Transformer | En | PT+SFT | BioNLP |
| Med-PaLM [2] | PaLM | En | SFT | QA |
| ChatDoctor [3] | LLaMA | En | SFT | QA和对话 |
| MedAlpaca [4] | LLaMA | En | SFT | QA和对话 |
| PMC-LLaMA [5] | LLaMA | En | CPT+SFT | QA和对话 |
| BenTsao [6] | ChatGLM | Zh | SFT | QA和对话 |
| DoctorGLM [7] | ChatGLM | Zh | SFT | QA和对话 |
| HuatuoGPT [8] | BLOOMZ | Zh | SFT+RL | QA和对话 |
| ClinicalGPT [9] | BLOOM | Zh | SFT+RL | QA和对话 |
| DISC-MedLLM [10] | Baichuan | Zh | SFT | QA和对话 |
| Zhongjing [11] | Ziya-LLaMA | Zh | CPT+SFT+RL | BioNLP |
| ChiMed-GPT [12] | Ziya-LLaMA | Zh | CPT+SFT+RL | QA和对话 |

PT表示从头预训练;CPT表示持续预训练;SFT表示监督微调;RL表示强化学习。

Table S2. Examples of the instruction data

Task: NER

表 S2: 指令数据示例

Task: NER

Ex1(zh):

Ex1(zh):

Input:

输入:

从下面文本中识别出指定的实体类型:儿童 SARST 细胞亚群的研究表明,与成人 SARS 相比,儿童细胞下降不明显,证明上述推测成立。实体类型: 微生物类,疾病,药物,医疗程序,医疗设备,临床表现,科室,身体,医学检验项目

从下面文本中识别出指定的实体类型: 儿童 SARST 细胞亚群的研究表明,与成人 SARS 相比,儿童细胞下降不明显,证明上述推测成立。实体类型: 微生物类,疾病,药物,医疗程序,医疗设备,临床表现,科室,身体,医学检验项目

Ex2(en):

Ex2(en):

Input:

输入:

Identify Chemical, Disease entities from the text: "Acute changes of blood ammonia may predict short-term adverse effects of valproic acid. Valproic acid (VPA) was given to 24 epileptic patients who were already being treated with other antiepileptic drugs. A standardized loading dose of VPA was administered, and venous blood was sampled at 0, 1, 2, 3, and 4 hours. Ammonia (NH3) was higher in patients who, during continuous therapy, complained of drowsiness (7 patients) than in those who were symptom-free (17 patients), although VPA plasma levels were similar in both groups. By measuring VPA-induced changes of blood NH3 content, it may be possible to identify patients at higher risk of obtundation when VPA is given chronically."

从文本中识别化学物质和疾病实体:
"血氨的急性变化可能预测丙戊酸(Valproic acid)的短期不良反应。研究对24名已接受其他抗癫痫药物治疗的癫痫患者给予丙戊酸(VPA)。采用标准化负荷剂量给药,并在0、1、2、3和4小时采集静脉血样。在持续治疗期间主诉嗜睡的患者(7例)血氨(NH3)水平高于无症状患者(17例),尽管两组患者的VPA血浆浓度相似。通过测量VPA引起的血氨含量变化,可能有助于识别长期服用VPA时发生意识模糊高风险的患者。"


Output:

输出:

Chemical: valproic acid; Ammonia; NH3; ammonia; VPA; Valproic acid

化学: 丙戊酸 (valproic acid); 氨; NH3; 氨; 丙戊酸 (VPA); 丙戊酸 (Valproic acid)

Disease: epileptic; drowsiness

疾病:癫痫;嗜睡
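A structured answer in the "Type: mention1; mention2" style shown above can be turned back into typed entities with a small parser. This is an illustrative sketch, not the project's official post-processing code.

```python
def parse_ner_output(text):
    """Parse lines like 'Chemical: valproic acid; NH3' into {type: [mentions]}."""
    entities = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip lines that do not follow the template
        etype, mentions = line.split(":", 1)
        entities[etype.strip()] = [m.strip() for m in mentions.split(";") if m.strip()]
    return entities

out = parse_ner_output("Chemical: valproic acid; NH3; VPA\nDisease: epileptic; drowsiness")
# -> {"Chemical": ["valproic acid", "NH3", "VPA"], "Disease": ["epileptic", "drowsiness"]}
```

Converting extraction tasks to this text format is what lets a single generative model cover NER alongside QA, at the cost of needing such a parser before computing span-level metrics.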

Task: RE

任务: RE

Ex1(zh) Input:

Ex1(zh) 输入:

实体关系三元组抽取,以“(头实体, 尾实体, 关系类型)”格式输出:

实体关系三元组抽取,以"(头实体, 尾实体, 关系类型)"格式输出:

13-三体综合征的 $30%\sim60%$ 患儿有泌尿系畸形,可见多囊肾、肾盂积水、双肾及双输尿管。关系类型标签:并发症,发病率,预后状况,鉴别诊断,同义词,相关(转化),影像学检查,外侵部位,死亡率

13-三体综合征的 $30%\sim60%$ 患儿存在泌尿系统畸形,表现为多囊肾、肾盂积水、双肾及双输尿管。关系类型标签:并发症、发病率、预后状况、鉴别诊断、同义词、相关(转化)、影像学检查、外侵部位、死亡率

Output:

输出:

(13-三体综合征, 泌尿系畸形, 并发症); (13-三体综合征, 双肾, 并发症); (13-三体综合征, 多囊肾, 并发症); (13-三体综合征, 双输尿管, 并发症); (13-三体综合征, 肾盂积水, 并发症); (泌尿系畸形, $30%\sim60%$ , 发病率)

(13-三体综合征, 泌尿系畸形, 并发症); (13-三体综合征, 双肾, 并发症); (13-三体综合征, 多囊肾, 并发症); (13-三体综合征, 双输尿管, 并发症); (13-三体综合征, 肾盂积水, 并发症); (泌尿系畸形, $30%\sim60%$ , 发病率)

Ex2(en)

Ex2(en)

Input:

输入:

Output the chemical-induced disease relations in the following text:

输出文本中的化学物质诱发疾病关系:

Phenobarbital-induced dyskinesia in a neurologic ally-impaired child. A 2-year-old child with known neurologic impairment developed a dyskinesia soon after starting phenobarbital therapy for seizures. Known causes of movement disorders were eliminated after evaluation. On repeat challenge with phenobarbital, the dyskinesia recurred. Phenobarbital should be added to the list of anticonvulsant drugs that can cause movement disorders.

苯巴比妥诱发神经系统受损患儿运动障碍。一名已知存在神经系统损伤的2岁儿童在开始使用苯巴比妥治疗癫痫后不久出现运动障碍。经评估排除了其他已知的运动障碍病因。再次使用苯巴比妥激发试验时,运动障碍复发。苯巴比妥应被列入可能引起运动障碍的抗惊厥药物清单。

Output:

输出:

chemical-induced disease relations: [Phenobarbital, dyskinesia]; [phenobarbital, dyskinesia]

化学诱导疾病关系: [Phenobarbital, dyskinesia]; [phenobarbital, dyskinesia]
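The "(head, tail, relation)" answers shown in the Chinese RE example above can be recovered as tuples with a purely lexical scan; a minimal sketch follows (both ASCII and full-width commas are accepted).

```python
import re

def parse_triples(text):
    """Extract (head, tail, relation) tuples from '(h, t, r); (h, t, r)'-style output."""
    triples = []
    for match in re.finditer(r"\(([^()]*)\)", text):
        # split on ASCII ',' or full-width ',' and trim whitespace
        parts = [p.strip() for p in re.split(r"[,,]", match.group(1))]
        if len(parts) == 3:   # ignore fragments that are not well-formed triples
            triples.append(tuple(parts))
    return triples

out = parse_triples("(13-三体综合征, 泌尿系畸形, 并发症); (泌尿系畸形, 30%~60%, 发病率)")
# -> [("13-三体综合征", "泌尿系畸形", "并发症"), ("泌尿系畸形", "30%~60%", "发病率")]
```

A parser like this (or a two-element variant for the `[chemical, disease]` pairs in the English example) is the step between the model's free-text answer and the micro-F1 scoring of extracted relations.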

Task: TC

任务: TC

Ex1(zh)

Ex1(zh)

Input:

输入:

将下面文本分类到指定的类别中:“ 9. 曾接受过用于治疗癌症的免疫细胞治疗方法,包括 CIK、DC、DC-CIK、LAK 细胞等;”

将下面文本分类到指定的类别中:"9. 曾接受过用于治疗癌症的免疫细胞治疗方法,包括 CIK、DC、DC-CIK、LAK 细胞等;"

类别标签:伦理审查, 疾病, 吸烟状况, 预期寿命, 依存性, 肿瘤进展, 受体状态, 过敏耐受, 实验室检查, 年龄, 性别,教育情况, 研究者决定, 健康群体, 知情同意, 酒精使用, 体征(医生检测), 口腔相关, 药物, 参与其它试验, 器官组织状态, 风险评估, 锻炼, 设备, 护理, 成瘾行为, 读写能力, 性取向, 症状(患者感受), 献血, 病例来源, 数据可及性, 特殊病人特征, 怀孕相关, 睡眠, 治疗或手术, 能力, 饮食, 残疾群体, 种族, 含有多类别的语句, 居住情况, 诊断, 疾病 分期

类别标签:伦理审查, 疾病, 吸烟状况, 预期寿命, 依存性, 肿瘤进展, 受体状态, 过敏耐受, 实验室检查, 年龄, 性别, 教育情况, 研究者决定, 健康群体, 知情同意, 酒精使用, 体征(医生检测), 口腔相关, 药物, 参与其它试验, 器官组织状态, 风险评估, 锻炼, 设备, 护理, 成瘾行为, 读写能力, 性取向, 症状(患者感受), 献血, 病例来源, 数据可及性, 特殊病人特征, 怀孕相关, 睡眠, 治疗或手术, 能力, 饮食, 残疾群体, 种族, 含有多类别的语句, 居住情况, 诊断, 疾病分期

Output:

输出:

上述文本被分类为:治疗或手术

上述文本被分类为:治疗或手术

Ex2(en) Input:

Ex2(en) 输入:

Classify the following text into the specified text label: "Influenza activity during the outbreak of coronavirus disease 2019 in Chinese mainland. Since coronavirus disease 2019 (COVID-19) might circulate in the following seasons, it is essential to understand how COVID-19 influences other respiratory diseases, especially influenza. In this study, we analyzed the influenza activity from mid-November 2019 to March 2020 in Chinese mainland and found that the influenza season ended much earlier than previous seasons for all subtypes and lineages, which may have resulted from the circulation of COVID-19 and measures such as travel control and personal protection. These findings provide rudimentary knowledge of the co-circulation patterns of the two types of viruses."

将以下文本分类到指定的文本标签中: "2019冠状病毒病疫情期间中国大陆的流感活动情况。由于2019冠状病毒病 (COVID-19) 可能在后续季节持续传播,了解COVID-19如何影响其他呼吸道疾病(尤其是流感)至关重要。本研究分析了2019年11月中旬至2020年3月中国大陆的流感活动情况,发现所有亚型和谱系的流感季结束时间均较往年大幅提前,这可能是由COVID-19的传播以及旅行管控和个人防护等措施所致。这些发现为理解两类病毒的共循环模式提供了基础认知。"

Text Labels: Case Report, Prevention, Transmission, Diagnosis, Mechanism, Treatment, Epidemic Forecasting Output:

文本标签:病例报告、预防、传播、诊断、机制、治疗、疫情预测

Result: Prevention

结果:预防

Task: MT

任务:机器翻译

Ex1(zh)

Ex1(zh)

Input:

输入:

将下面文本翻译成英语:"死亡的风险比(手术 vs. 非手术)为 $1.29$(95% CI,$0.97\sim1.72$;$P=0.08$),相当于中位总生存期分别为 50.6 个月和 64.7 个月。"

死亡的风险比(手术 vs. 非手术)为 $1.29$(95% CI,$0.97\sim1.72$;$P=0.08$),相当于中位总生存期分别为 50.6 个月和 64.7 个月。

Output:

输出:

The hazard ratio for death (surgery vs. no surgery) was 1.29 (95% confidence interval [CI], 0.97 to 1.72; P = 0.08), which corresponded to a median overall survival of 50.6 months and 64.7 months, respectively.

死亡风险比(手术 vs. 非手术)为1.29 (95%置信区间[ci], 0.97至1.72; p=0.08),对应的中位总生存期分别为50.6个月和64.7个月。

Ex2(en)

Ex2(en)

Input:

输入:

Machine Translation from English to Chinese: "Unfortunately, recurrent disease develops in more than $80%$ of women."

英语到中文的机器翻译:"不幸的是,超过 $80%$ 的女性会出现疾病复发。"

Output:

输出:

但遗憾的是,有 $80%$ 患者的卵巢癌会复发.

但遗憾的是,有 $80%$ 患者的卵巢癌会复发。

Table S3. Statistics of our evaluation test sets

| Task | Data | Lan | Size | Metrics |
| --- | --- | --- | --- | --- |
| NER | BC5CDR-chem | En | 500 abstracts | Micro-F1 |
| | BC5CDR-dise | En | 500 abstracts | |
| | CHEMDNER | En | 3,000 abstracts | |
| | NCBI-Disease | En | 100 abstracts | |
| | BioRED | En | 100 abstracts | |
| | CMeEE-dev | Zh | 5,000 sentences | |
| RE | BC5CDR | En | 500 abstracts | Micro-F1 |
| | CMeIE-dev | Zh | 3,585 sentences | |
| TC | BC7LitCovid | En | 6,239 abstracts | Micro-F1 |
| | HOC | En | 3,547 sentences | |
| | KUAKE_QIC-dev | Zh | 1,955 sentences | |
| QA-mc | PubMedQA | En | 500 questions | Accuracy |
| | MedQA-USMLE | En | 1,273 questions | |
| | MedQA-MCMLE | Zh | 3,426 questions | |

表 S3: 评估测试集统计

| 任务 | 数据 | 语言 | 规模 | 指标 |
| --- | --- | --- | --- | --- |
| NER | BC5CDR-chem | En | 500 篇摘要 | Micro-F1 |
| | BC5CDR-dise | En | 500 篇摘要 | |
| | CHEMDNER | En | 3,000 篇摘要 | |
| | NCBI-Disease | En | 100 篇摘要 | |
| | BioRED | En | 100 篇摘要 | |
| | CMeEE-dev | Zh | 5,000 句 | |
| RE | BC5CDR | En | 500 篇摘要 | Micro-F1 |
| | CMeIE-dev | Zh | 3,585 句 | |
| TC | BC7LitCovid | En | 6,239 篇摘要 | Micro-F1 |
| | HOC | En | 3,547 句 | |
| | KUAKE_QIC-dev | Zh | 1,955 句 | |
| QA-mc | PubMedQA | En | 500 题 | Accuracy |
| | MedQA-USMLE | En | 1,273 题 | |
| | MedQA-MCMLE | Zh | 3,426 题 | |

Examples of Taiyi output on the supplementary biomedical NLP tasks

补充生物医学NLP任务中的太乙输出示例

In addition to the four primary evaluation tasks, the capabilities of the Taiyi model are also shown through examples (Table S4) on several supplementary biomedical NLP tasks. Medical Report Generation (MRG) aims to automatically summarize information during the diagnostic process; the task requires the language model to generate medical reports in a specified format based on patient self-reports and doctor-patient dialogues. Biomedical Event Extraction (EE) extracts structured biomedical events from unstructured biomedical literature, including mining the biological entities described in the literature as well as the fine-grained relationships between biomedical events. Biomedical Machine Translation (MT) converts biomedical texts from one natural language to another; this task can assist in translating various medical questions and recommendations, disease names, symptoms, and other information between English and Chinese. Biomedical Title Generation (TG) aims to generate informative article titles that summarize input biomedical text paragraphs. Biomedical Text Semantic Similarity (SS) aims to determine the degree of association or similarity between two medical texts; it typically involves text matching, similarity measurement, and semantic understanding, and can be used in applications such as medical information retrieval and document matching. Biomedical Question Answering and Chat (CHAT) needs to understand questions posed in natural language and provide accurate answers; different from simple QA, the dialogue task is capable of simulating multi-round natural language conversations.

除四项主要评估任务外,太乙模型的能力还通过若干补充性生物医学NLP任务示例展示(见表S4),例如:

- **医疗报告生成(MRG)**:旨在自动汇总诊断过程中的信息。该任务要求大语言模型根据患者自述和医患对话生成指定格式的医疗报告。
- **生物医学事件抽取(EE)**:需从非结构化的生物医学文献中提取结构化事件,包括挖掘文献中描述的生物学实体及生物医学事件间的细粒度关系。
- **生物医学机器翻译(MT)**:将生物医学文本在自然语言间转换,可辅助中英文医学问题建议、疾病名称、症状等信息的互译。
- **生物医学标题生成(TG)**:为输入的生物医学文本段落生成具有总结性的信息型文章标题。
- **生物医学文本语义相似度(SS)**:用于判定两段医学文本的关联或相似程度,通常涉及文本匹配、相似度测量和语义理解,可应用于医学信息检索和文档匹配等场景。
- **生物医学问答与对话(CHAT)**:需理解自然语言提问并提供精准回答。与简单QA不同,该对话任务能模拟多轮自然语言交互。

Table S4. Example responses generated by Taiyi on the supplementary BioNLP tasks

Task: Chat (en)

Input (1):

Do you know colchicine?

Taiyi:

Yes, Colchicine is a medication used to treat gout, which is a type of arthritis caused by high levels of uric acid in the … caused by the uric acid crystals in the joints.

Input (2):

Does it have any potential side effects?

Taiyi:

Yes, Colchicine has potential side effects, including gastrointestinal issues such as nausea, vomiting, diarrhea, and abdominal pain. It can also cause allergic reactions such as rash, itching, hives, or difficulty breathing. In some cases, …

表 S4: 太乙在补充生物医学自然语言处理任务中的示例回答

任务: 聊天(英文)
输入 (1):
你知道秋水仙碱吗?
太乙: 是的,秋水仙碱是一种用于治疗痛风的药物,痛风是由关节中尿酸结晶引起的一种关节炎。
输入 (2):
它有什么潜在的副作用吗?
太乙:
是的,秋水仙碱有潜在的副作用,包括胃肠道问题如恶心、呕吐、腹泻和腹痛。它也可能引起过敏反应如皮疹、瘙痒、荨麻疹或呼吸困难。在某些情况下,

Task: Medical Report Generation (zh)

任务: 医疗报告生成

上述问诊对话的诊疗报告如下:

上述问诊对话的诊疗报告如下:

主诉:咳嗽咳痰伴鼻流涕。 现病史:患儿无明显诱因下出现咳嗽咳痰伴鼻流涕情况,无恶心呕吐,无其他不适,无其他疾病史,无药物过敏史。 辅助检查:无。 既往史:既往体健。 诊断:上呼吸道感染。 建议:口服氨溴索,桔贝合剂,若条件允许,雾化吸入治疗。

主诉:咳嗽咳痰伴鼻流涕。
现病史:患儿无明显诱因下出现咳嗽咳痰伴鼻流涕情况,无恶心呕吐,无其他不适,无其他疾病史,无药物过敏史。
辅助检查:无。
既往史:既往体健。
诊断:上呼吸道感染。
建议:口服氨溴索,桔贝合剂,若条件允许,雾化吸入治疗。

Task: Event Extraction (zh)

任务:事件抽取

找出指定的临床发现事件属性:因患者需期末考试,故予以口服“雷贝拉唑钠肠溶片”治疗,现腹痛情况明显好转。事件抽取说明:临床发现事件由主体词,发生状态,描述词和解剖部位组成

找出指定的临床发现事件属性:因患者需期末考试,故予以口服“雷贝拉唑钠肠溶片”治疗,现腹痛情况明显好转。事件抽取说明:临床发现事件由主体词、发生状态、描述词和解剖部位组成

Taiyi:

太乙:

上述句子中的临床发现事件如下:主体词:疼痛;发生状态:;描述词:明显好转;解剖部位:腹

上述句子中的临床发现事件如下:主体词:疼痛;发生状态:;描述词:明显好转;解剖部位:腹

Task: Event Extraction (en)

任务: 事件抽取

Input:

输入:

Task: Machine Translation (zh-en)

任务:机器翻译(中英)

Input:

输入:

将下面文本翻译成英文:

Translate the following text into English:

光动力疗法(PDT)作为一种新兴的肿瘤治疗手段,因其不良反应较少、靶向性好、可重复治疗等优点,已广泛应用于临床多种肿瘤的治疗。相比于手术、化疗及放疗等传统治疗策略,光动力疗法不仅可杀伤原位肿瘤,还可通过激活 机体的免疫效应对转移瘤发挥抑制作用。然而,PDT 诱导免疫效应的高低受多种因素影响,包括光敏剂在细胞内的定位和剂量、光参数、肿瘤内的氧浓度、免疫功能的完整性等。本文针对 PDT 介导抗肿瘤免疫效应的相关机制,以及PDT 免疫效应的主要影响因素进行综述,以探讨 PDT 用于肿瘤治疗的未来发展方向。

光动力疗法(PDT)作为一种新兴的肿瘤治疗手段,因其不良反应较少、靶向性好、可重复治疗等优点,已广泛应用于临床多种肿瘤的治疗。相比于手术、化疗及放疗等传统治疗策略,光动力疗法不仅可杀伤原位肿瘤,还可通过激活机体的免疫效应对转移瘤发挥抑制作用。然而,PDT诱导免疫效应的高低受多种因素影响,包括光敏剂在细胞内的定位和剂量、光参数、肿瘤内的氧浓度、免疫功能的完整性等。本文针对PDT介导抗肿瘤免疫效应的相关机制,以及PDT免疫效应的主要影响因素进行综述,以探讨PDT用于肿瘤治疗的未来发展方向。

Taiyi:

太乙:

As a new tumor treatment, photodynamic therapy (PDT) has been widely used in clinical practice due to its advantages of fewer side effects, high target specificity, and the possibility of repeated treatment. Compared with traditional surgery, chemotherapy, and radiotherapy, PDT can not only kill the primary tumor, but also inhibit metastatic tumors by activating the body's immune effect. However, the strength of the PDT-induced immune effect is affected by many factors, including the localization and dosage of photosensitizers in cells, light parameters, oxygen concentration in the tumor, and the integrity of immune function. This paper reviews the related mechanisms of the PDT-mediated anti-tumor immune effect and the main influencing factors, and discusses the future development direction of PDT in tumor treatment.

作为一种新型肿瘤治疗手段,光动力疗法(PDT)凭借副作用小、靶向性高、可重复治疗等优势已在临床广泛应用。与传统手术、化疗及放疗相比,PDT不仅能杀伤原发肿瘤,还能通过激活机体免疫效应抑制转移瘤。但PDT诱导免疫效应的强弱受多种因素影响,包括光敏剂定位与剂量、光照参数、肿瘤内氧浓度及免疫功能完整性等。本文综述了PDT介导抗肿瘤免疫效应的相关机制及主要影响因素,并探讨了PDT在肿瘤治疗领域的未来发展方向。

Task: Machine Translation (en-zh)

任务:机器翻译(英译中)

Input:

输入:

Translate the following text into Chinese:

翻译以下文本为中文:

Metabolic-associated fatty liver disease (MAFLD), formerly known as non-alcoholic fatty liver disease (NAFLD), is one of the most popular chronic liver diseases in the world. There are many explorations on the epidemiology and risk factors of the disease published in recent years. With the increasing burden of MAFLD on medical care and social economy, drug treatment of MAFLD has become an urgent research hotspot. Based on this, the current epidemiological status, risk factors, diagnostic criteria and treatment of MAFLD were summarized in this review in order to enlighten the clinical practice and scientific research of MAFLD.

代谢相关脂肪性肝病 (MAFLD) ,原名非酒精性脂肪性肝病 (NAFLD) ,是全球最常见的慢性肝病之一。近年来关于该病流行病学及危险因素的探索成果颇丰。随着MAFLD对医疗和社会经济负担的日益加重,其药物治疗已成为亟待攻克的研究热点。本文据此综述MAFLD的流行现状、危险因素、诊断标准及治疗进展,以期为MAFLD临床实践与科研工作提供启示。

Taiyi:

太一:

代谢综合征相关性脂肪肝(MAFLD)是世界范围内最为流行的慢性肝病之一。近年来,该病的流行病学及危险因素等方面的研究颇多。随着 MAFLD 在医疗保健及社会经济方面所造成的负担日益加重,该病的药物治疗已成为研究热点。本文对该病的流行病学现状、危险因素、诊断标准及治疗等方面进行了综述,以期为临床实践及 MAFLD 的科学研究提供参考。

代谢综合征相关性脂肪肝(MAFLD)是世界范围内最为流行的慢性肝病之一。近年来,该病的流行病学及危险因素等方面的研究颇多。随着 MAFLD 在医疗保健及社会经济方面所造成的负担日益加重,该病的药物治疗已成为研究热点。本文对该病的流行病学现状、危险因素、诊断标准及治疗等方面进行了综述,以期为临床实践及 MAFLD 的科学研究提供参考。

Task: Title Generation (zh)

任务:标题生成(中文)

Input:

输入:

请给下面摘要起标题:

请为以下摘要拟定标题:

气管食管瘘是指气管或支气管与食管之间的病理性瘘道,包括气管-食管瘘和支气管-食管瘘,临床以气管-食管瘘较多见。气管食管瘘致病原因较多,可引起严重的并发症,是对患者生活质量影响显著、治疗困难和病死率较高的疾病。 气管食管瘘目前治疗方式较多,但多数疗效欠佳,对新兴治疗手段的需求迫切。胸腹部 X 线摄影检出鼻胃管滞留是气管食管瘘诊断的金标准,其主要治疗方法包括外科手术治疗、支架置入、局部生物胶水封闭、干细胞治疗等。本文综述近年气管食管瘘诊断与治疗的主要研究进展,旨在为该病的临床诊治提供参考。

Tracheoesophageal fistula refers to a pathological tract between the trachea or bronchi and the esophagus, including tracheoesophageal fistula and bronchoesophageal fistula, with tracheoesophageal fistula being more common clinically. The causes of tracheoesophageal fistula are diverse and can lead to severe complications, significantly impacting patients' quality of life, presenting challenges in treatment, and exhibiting a high mortality rate. Currently, there are multiple treatment approaches for tracheoesophageal fistula, but most yield suboptimal outcomes, creating an urgent need for novel therapeutic strategies. The gold standard for diagnosing tracheoesophageal fistula is the detection of a retained nasogastric tube via chest and abdominal X-ray. Primary treatment methods include surgical intervention, stent placement, local biological glue sealing, and stem cell therapy. This article reviews recent advances in the diagnosis and treatment of tracheoesophageal fistula, aiming to provide references for its clinical management.

Taiyi:

太一:

气管食管瘘的诊断与治疗进展

气管食管瘘的诊断与治疗进展

Task: Title Generation (en)

任务:标题生成(英文)

Input:

输入:

Output a title for the following abstract: The incidence of diabetes mellitus has been increasing, prompting the search for non-invasive diagnostic methods. Although current methods exist, these have certain limitations, such as low reliability and accuracy, difficulty in individual patient adjustment, and discomfort during use. This paper presents a novel approach for diagnosing diabetes using high-frequency ultrasound (HFU) and a convolutional neural network (CNN). This method is based on the observation that glucose in red blood cells (RBCs) forms glycated hemoglobin (HbA1c) and accumulates on its surface. The study incubated RBCs with different glucose concentrations, collected acoustic reflection signals from them using a custom-designed 90-MHz transducer, and analyzed the signals using a CNN. The CNN was applied to the frequency spectra and spectrograms of the signal to identify correlations between changes in RBC properties owing to glucose concentration and signal features. The results confirmed the efficacy of the CNN-based approach with a classification accuracy of 0.98. This non-invasive diagnostic technology using HFU and CNN holds promise for in vivo diagnosis without the need for blood collection.

为以下摘要生成标题:糖尿病的发病率不断上升,促使人们寻找无创诊断方法。尽管现有方法可用,但其存在一定局限性,如可靠性和准确性较低、难以针对患者个体进行调整以及使用过程中的不适感。本文提出了一种利用高频超声(HFU)和卷积神经网络(CNN)诊断糖尿病的新方法。该方法基于如下观察:红细胞(RBC)中的葡萄糖会形成糖化血红蛋白(HbA1c)并在红细胞表面积累。研究将红细胞与不同浓度的葡萄糖共同孵育,使用定制设计的90 MHz换能器采集其声反射信号,并利用CNN对信号进行分析。CNN被应用于信号的频谱和声谱图,以识别葡萄糖浓度引起的红细胞特性变化与信号特征之间的相关性。结果证实了基于CNN方法的有效性,分类准确率达0.98。这种利用HFU和CNN的无创诊断技术有望实现无需采血的在体诊断。

Taiyi:

太一:

Application of high-frequency ultrasound and convolutional neural network in non-invasive diagnosis of diabetes

高频超声与卷积神经网络在糖尿病无创诊断中的应用

Task: Text Classification (zh)

任务:文本分类(中文)

Input:

输入:

Taiyi: 上述文本被分类为:医疗费用

Taiyi: 上述文本被分类为:医疗费用

Task: Text Classification (en)

任务:文本分类(英文)

Input:

输入:

Task: Text Semantic Similarity (zh)

任务:文本语义相似度(中文)

Input:

输入:

Taiyi: 语义不相同

Taiyi: 语义不相同

Task: Text Semantic Similarity (en)

任务:文本语义相似度(英文)

Input:

输入:

Case study

案例研究

To further investigate the benefits of in-domain instruction fine-tuning, we qualitatively analyzed model outputs on various BioNLP tasks and compared Taiyi to Qwen7B-Chat, a published LLM fine-tuned with general-domain instruction data on the same base model, Qwen7B. Table S5 shows some examples.

为了进一步探究领域内指令微调的益处,我们定性分析了模型在多种生物自然语言处理任务上的输出,并将Taiyi与基于相同基础模型Qwen7B、使用通用领域指令数据微调的已发布大语言模型Qwen7B-Chat进行了对比。表S5展示了一些示例。

Task: NER (zh)

任务:命名实体识别(中文)

Input: 从以下文本中识别出指定的实体类型:

输入:从以下文本中识别出指定的实体类型:

Task: NER (en)

任务:命名实体识别(英文)

Input:

输入:

Identify Chemical, Disease entities in the following text:

识别以下文本中的化学物质和疾病实体:

Phenobarbital-induced dyskinesia in a neurologically-impaired child. A 2-year-old child with known neurologic impairment developed a dyskinesia soon after starting phenobarbital therapy for seizures. Known causes of movement disorders were eliminated after evaluation. On repeat challenge with phenobarbital, the dyskinesia recurred. Phenobarbital should be added to the list of anticonvulsant drugs that can cause movement disorders.

苯巴比妥诱发神经系统受损儿童的运动障碍。
一名2岁已知神经系统受损的儿童在开始使用苯巴比妥治疗癫痫后不久出现运动障碍。经评估排除了其他已知运动障碍病因。再次使用苯巴比妥激发试验时,运动障碍复发。苯巴比妥应被列入可能引起运动障碍的抗惊厥药物清单。

Taiyi:

太一:

Chemical: Phenobarbital; phenobarbital
Disease: dyskinesia; neurologically-impaired; neurologic impairment; seizures; movement disorders

化学: 苯巴比妥 (Phenobarbital)
疾病: 运动障碍; 神经功能缺损; 神经损伤; 癫痫发作; 运动障碍

Qwen-Chat:

Qwen-Chat:

Chemical: Phenobarbital, Anticonvulsant drugs
Disease entities: Neurological impairment, Seizures, Movement disorders

化学物质:苯巴比妥、抗惊厥药物
疾病实体:神经功能损伤、癫痫发作、运动障碍

Task: RE (zh)

任务:关系抽取(中文)

Input:

输入:

Taiyi:

太一:

具有药物治疗关系的实体对如下:[疟疾感染, 氯喹]; [疟疾感染, 羟氯喹]

具有药物治疗关系的实体对如下: [疟疾感染, 氯喹]; [疟疾感染, 羟氯喹]

Task: RE (en)

任务:关系抽取(英文)

Input:

输入:

From the results, Taiyi generates more detailed and accurate responses to biomedical questions than the general-domain model Qwen-Chat. In the Chinese NER case, Qwen-Chat extracts entities incorrectly, while Taiyi precisely identifies the disease and medical-procedure entities. Similarly, Taiyi identifies more correct chemical entities than Qwen-Chat in the English NER case. In the final relation extraction case, Taiyi properly extracts complex biomedical relations between proteins, diseases, and chemicals, which Qwen-Chat fails to capture; a model that has not been fine-tuned with domain task instructions struggles to understand the biomedical relation extraction instructions. These examples highlight the advantages of domain-specific instruction fine-tuning for enhancing Taiyi's performance on diverse BioNLP tasks compared to a model fine-tuned with general instructions only. More examples on different BioNLP tasks can be found in the Supplementary Material.

从结果来看,与通用领域模型Qwen-Chat相比,Taiyi对生物医学问题生成的回答更为详细准确。在中文命名实体识别(NER)案例中,Qwen-Chat错误地提取了实体,而Taiyi精准识别了疾病和医疗程序实体。同样地,在英文NER案例中,Taiyi比Qwen-Chat识别出更多正确的化学实体。在最后的关系抽取案例中,Taiyi正确提取了蛋白质、疾病与化学物质之间复杂的生物医学关系,而Qwen-Chat无法捕捉这些关联。未经领域任务指令微调的模型难以理解生物医学关系抽取指令。这些案例凸显了相比通用指令微调模型,领域特定指令微调对提升Taiyi在多样化生物自然语言处理(BioNLP)任务性能的优势。更多不同BioNLP任务的案例可参阅补充材料。
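The qualitative comparison above can also be made quantitative. As a minimal sketch (not part of the paper), the code below parses a Taiyi-style flat NER response such as "Chemical: a; b Disease: c" into per-type entity sets and scores it against gold annotations with micro F1; the response format, helper names, and scoring choice are illustrative assumptions only.

```python
# Illustrative sketch: parse a flat "Type: a; b Type2: c" NER response
# (as in the Taiyi examples above) and score it against gold entities.
# Format and helper names are assumptions for demonstration, not the
# paper's evaluation code.
import re

def parse_ner_response(text, types):
    """Split a flat 'Type: a; b Type2: c' string into {type: set(entities)}."""
    # Locate each known type label, then slice the text between labels.
    spans = sorted((m.start(), t) for t in types
                   for m in re.finditer(rf"{re.escape(t)}\s*:", text))
    result = {t: set() for t in types}
    for i, (start, t) in enumerate(spans):
        end = spans[i + 1][0] if i + 1 < len(spans) else len(text)
        chunk = text[start:end].split(":", 1)[1]
        result[t] |= {e.strip() for e in chunk.split(";") if e.strip()}
    return result

def micro_f1(pred, gold):
    """Micro F1 over (type, entity) pairs."""
    p = {(t, e) for t, es in pred.items() for e in es}
    g = {(t, e) for t, es in gold.items() for e in es}
    tp = len(p & g)
    if not p or not g or not tp:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)

pred = parse_ner_response(
    "Chemical: Phenobarbital; phenobarbital "
    "Disease: dyskinesia; seizures; movement disorders",
    ["Chemical", "Disease"])
gold = {"Chemical": {"Phenobarbital", "phenobarbital"},
        "Disease": {"dyskinesia", "seizures", "movement disorders"}}
print(round(micro_f1(pred, gold), 2))  # exact match here -> 1.0
```

Scoring on (type, entity) pairs means a correct surface form under the wrong type counts as an error, which matches how the NER examples above are judged.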

REFERENCES

参考文献
