DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task
Honglin Xiong1* Sheng Wang1,2,3* Yitao Zhu1* Zihao Zhao1* Yuxiao Liu1 Linlin Huang4 Qian Wang1,3 Dinggang Shen1,3
1ShanghaiTech University 2Shanghai Jiao Tong University 3United Imaging Intelligence 4Huashan Hospital, Fudan University
{xionghl, wangqian2, dgshen}@shanghaitech.edu.cn, wsheng@sjtu.edu.cn
Abstract
The recent progress of large language models (LLMs), including ChatGPT and GPT-4, in comprehending and responding to human instructions has been remarkable. Nevertheless, these models typically perform better in English and have not been explicitly trained for the medical domain, resulting in suboptimal precision in diagnoses, drug recommendations, and other medical advice. Additionally, training and deploying a dialogue model is still believed to be impossible for hospitals, hindering the promotion of LLMs. To tackle these challenges, we have collected databases of medical dialogues in Chinese with ChatGPT’s help and adopted several techniques to train an easy-to-deploy LLM. Remarkably, we were able to fine-tune ChatGLM-6B on a single A100 80G in 13 hours, which means having a healthcare-purpose LLM can be very affordable. DoctorGLM is currently an early-stage engineering attempt and contains various mistakes. We are sharing it with the broader community to invite feedback and suggestions to improve its healthcare-focused capabilities: https://github.com/xionghonglin/DoctorGLM.
1. Introduction
Large Language Models (LLMs) are highly advanced artificial intelligence systems that have undergone extensive training on vast amounts of text data. By utilizing deep learning techniques, these models are able to generate responses that resemble human speech, making them incredibly useful in a variety of tasks, such as language translation, question answering, and text generation. OpenAI’s GPT series, among other LLMs, has exhibited remarkable results and has the potential to revolutionize various industries, including marketing, education, and customer service. LLMs are highly sought after for their ability to process and understand large amounts of data, which makes them well-suited to solve complex problems.
Despite their remarkable performance in natural language processing, large language models like ChatGPT and GPT-4 have not been designed specifically for the medical domain. As a result, using LLMs for medical purposes may lead to suboptimal precision in diagnoses, drug recommendations, and other medical advice, potentially causing harm to patients. Another limitation of large language models like ChatGPT and GPT-4 is that they are typically trained in English, which restricts their ability to comprehend and respond to other languages. This can create a barrier for individuals who do not speak English as their first language and limit the accessibility of medical advice to a wider audience. In order to overcome these limitations and better integrate LLMs into the lives of most ordinary people, it’s crucial to develop medical-tailored LLMs that can be trained in multiple languages. This will not only improve the accuracy of medical advice provided by these models but also make it more accessible to a wider audience.
In order to improve the precision and accuracy of medical advice provided by language models in the medical domain, a database of medical dialogues in Chinese has been compiled. This database contains information from a large number of patients, including their symptoms, recommended medications, and the necessary medical tests. The database has been created to provide language models with extensive medical knowledge and to enable them to generate more accurate and personalized responses to medical queries. By incorporating this knowledge, the hope is to improve the ability of language models to diagnose illnesses and provide better recommendations to patients, ultimately improving the quality of healthcare.
To optimize our medical language model for both Chinese and English and, more importantly, to explore a feasible pipeline for customized medical LLMs, we fine-tuned it based on ChatGLM, a pre-trained language model with 6 billion parameters. This model is unique in that it is bilingual, offering proficiency in both English and Chinese. Furthermore, the GLM model has a unique scaling property that allows for INT4 quantization, enabling effective inference on a single RTX 3060 (12G). This scaling property is a major breakthrough in the field of healthcare language modeling, as it allows for more efficient and cost-effective computation on affordable GPUs, making it easier for hospitals to deploy medical dialogue models based on their in-house data. Also, we use low-rank adaptation, which facilitates fine-tuning on an A100 80G GPU. This allows for faster inference times, making it easier for researchers and developers to utilize large-scale language models for a variety of applications.
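As a concrete illustration of this deployment setting, the following is a minimal sketch of loading ChatGLM-6B with INT4 quantization for inference on a 12 GB consumer GPU. It follows the usage pattern published in the ChatGLM-6B repository; the exact method names (e.g., `.quantize(4)`) are assumptions that may differ across model versions.

```python
# Minimal sketch: loading ChatGLM-6B with INT4 quantization for inference
# on a consumer GPU (e.g., RTX 3060 12G). Follows the usage pattern from the
# ChatGLM-6B repository; the .quantize() call is an assumption that may
# differ across model versions.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = (
    AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    .half()          # FP16 weights
    .quantize(4)     # INT4 weight quantization, fits in roughly 6 GB of VRAM
    .cuda()
    .eval()
)

# ChatGLM-6B exposes a chat() helper that tracks multi-turn history.
response, history = model.chat(tokenizer, "你好,我最近持续咳嗽,应该怎么办?", history=[])
print(response)
```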
Figure 1. Overview of DoctorGLM fine-tuning and inference pipeline.
At present, the general public often assumes that large language models (LLMs) are monopolized by technology giants due to the substantial computational costs associated with ChatGPT. However, in this paper, we demonstrate that a specialized Chinese dialogue language model focused on the medical domain can be trained for less than 100 USD. We accomplish this by utilizing parameter-efficient tuning and quantization techniques, enabling the development of an LLM-based system that can be customized for specific tasks. The main contributions of this paper are summarized below:
2. Large Language Models in Healthcare
Figure 2. The implementation of large-scale translation. A tiny and high-quality dataset is built through ChatGPT. The collected dataset serves as a fine-tuning set for a pre-trained language model, enabling it to perform specialized machine translation.
Recent advances in the Transformer architecture [12] and computing power have enabled the training of large language models with billions of parameters, leading to a significant improvement in their ability to summarize, translate, predict and generate human-like text [2, 9, 10]. In the pre-ChatGPT era, several healthcare language models were developed based on general-purpose model weights and training schemes. BioBERT [7] and PubMedBERT [5] are examples of BERT [3] models trained on PubMed for biomedical data, while ClinicalBERT [1] was further trained on the MIMIC dataset and outperformed its predecessor.
Table 1. Statistics of the used datasets. Chat means multi-round QA. Syn. means whether the data is generated by other language models. CMD. means Chinese medical dialogue.
| Dataset | Department | Language | Q&A | Chat | Number | Syn. | Size |
|---|---|---|---|---|---|---|---|
| CMD. | Surgical / Obstetrics and Gynecology / Pediatrics / Internal Medicine | CN | | | 116K / 229K / 117K / 307K | | 52MB / 78MB / 47MB / 102MB |
| MedDialog | Andriatria / Multiple | CN&EN | | | 113K / 3.4M | × | 44MB / 1.5GB |
| ChatDoctor | Multiple | EN | √ | | 5.4K | | 2.9MB |
| HealthcareMagic | Multiple | EN | √ | | 200K | | 216MB |
After ChatGPT showed the potential of 100B-scale models, researchers expanded healthcare language models to a much larger scale and reported very promising results. Med-PaLM [10] was developed in late 2022 using curated biomedical corpora and human feedback, and showed promising results, including 67.6% accuracy on the MedQA exam. ChatGPT, which was not given supplementary medical training, passed all three parts of the USMLE, achieving over 50% accuracy across all exams and surpassing 60% accuracy in the majority of them [6]. ChatCAD [13] combined medical image analysis models with ChatGPT and offered interactive computer-aided diagnosis. ChatDoctor [14] is a medical chat model fine-tuned on the LLaMA model using clinical QA synthesized by ChatGPT.
3. Approach
3.1. Dataset with ChatGPT’s Help
It is worth noting that many high-quality datasets have been released in English. To utilize these available resources, we translated the ChatDoctor [14] dataset to enhance the Chinese language proficiency of DoctorGLM.
A medical-targeted LLM requires professional training data, which places high demands on English-Chinese translation. ChatGPT is capable of professional clinical text translation, but this would incur an overhead of tens of thousands of dollars for a large-scale dataset, which is unacceptable to most researchers. Here, we take a simple and low-cost approach to large-scale translation by leveraging the capabilities of ChatGPT.
Translation of the dataset generally proceeds in two steps, as shown in Figure 2. A subset $X=\{x_{1},x_{2},\ldots,x_{N}\}$ is initially selected from the ChatDoctor dataset, where $x_{n}$ is the raw English text, and the corresponding high-quality translation $Y=\{y_{1},y_{2},\ldots,y_{N}\}$ is obtained through the ChatGPT API. Then, a BART-based pre-trained model [11] is fine-tuned solely on the paired $X$ and $Y$ without any additional datasets. In this way, the language model can distill the expert-level translation knowledge from ChatGPT, and the refined small model can act as an acceptable alternative to the LLM. We translated the ChatDoctor dataset in this way for use in our training.
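A minimal sketch of this two-step pipeline is given below. It assumes the current OpenAI Python client and a publicly available Chinese BART checkpoint (`fnlp/bart-base-chinese`); these names, the prompt wording, and the training arguments are illustrative assumptions rather than the exact setup used in the paper.

```python
# Sketch of the two-step translation distillation: (1) translate a small seed
# set with the ChatGPT API, (2) fine-tune a BART-based translator on the
# resulting (X, Y) pairs only. Model/checkpoint names are assumptions.
from openai import OpenAI
from transformers import (AutoTokenizer, BartForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: build a small, high-quality parallel set X -> Y with ChatGPT.
def translate_with_chatgpt(english_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Translate this clinical dialogue into Chinese:\n" + english_text}],
    )
    return resp.choices[0].message.content

seed_en = ["Doctor, I have had a sore throat and a mild fever for three days."]  # x_n from ChatDoctor
seed_zh = [translate_with_chatgpt(x) for x in seed_en]                            # y_n via the API

# Step 2: fine-tune a BART-based translator solely on the paired (X, Y) data.
ckpt = "fnlp/bart-base-chinese"  # assumed Chinese-capable BART checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = BartForConditionalGeneration.from_pretrained(ckpt)

def encode(x: str, y: str) -> dict:
    features = tokenizer(x, truncation=True, max_length=512)
    features["labels"] = tokenizer(y, truncation=True, max_length=512)["input_ids"]
    return features

train_data = [encode(x, y) for x, y in zip(seed_en, seed_zh)]
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="en2zh-distilled",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8),
    train_dataset=train_data,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()  # the distilled translator then translates the full ChatDoctor set
```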
To develop conversational models of high quality on a limited academic budget, ChatDoctor [14] utilized a strategy where each message from the disease database was entered as an individual prompt into the GPT-3.5-turbo model to generate instruction data automatically. The prompts provided to the ChatGPT API contained the gold standard of diseases, symptoms, and drugs, resulting in a dataset that preserves the conversational fluency of ChatGPT while also achieving higher diagnostic accuracy than ChatGPT alone.
3.2. Prompt Designer
Large language models have achieved remarkable performance in conversational tasks. However, their outputs may be unreliable and deceptive. This issue also exists with ChatGLM, which is utilized in DoctorGLM. To address this problem and enhance the reliability of DoctorGLM’s outputs, we use a prompt designer module that pre-processes the user’s input.
The prompt designer module extracts relevant keywords, such as the name of the disease or symptoms, from the user’s input. The module then uses the name of the most likely disease as a label and generates a brief description based on a professional disease knowledge library. This library comprises a comprehensive collection of detailed documents about various diseases; in particular, it contains 3,231 detailed disease documents, all sourced from the Merck Manual of Diagnosis and Therapy.
The prompt designer’s output includes information about the disease’s symptoms, diagnosis, treatment options, and preventive measures. By providing a professionally generated prompt, the prompt designer expands the expertise and reliability of DoctorGLM for a particular disease. Additionally, it serves as a reliable source of information for users.
The generated prompt is integrated into the large language model, along with the original input, to improve the accuracy and reliability of DoctorGLM’s responses. By using this approach, we can enhance the overall performance of DoctorGLM and provide reliable and trustworthy information to users. This system can also be used with other large language models that have decent in-context learning ability, e.g., ChatGPT, LLaMA and ChatGLM.
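The following is an illustrative sketch of this idea, assuming a simple keyword-matching strategy over a disease knowledge library; the library format, matching rule, and prompt template are assumptions for exposition, not the exact implementation.

```python
# Illustrative sketch of the prompt designer: match disease keywords in the
# user's input against a knowledge library (e.g., entries derived from the
# Merck Manual), then prepend a short professional description to the query
# before it reaches the LLM. Library format and matching rule are assumed.
from typing import Dict, Optional

DISEASE_LIBRARY: Dict[str, str] = {
    # "disease name": "brief description: symptoms, diagnosis, treatment, prevention"
    "肋间神经炎": "肋间神经炎常表现为沿肋间分布的刺痛或隐痛……(摘自疾病知识库)",
    "胆囊结石": "胆囊结石可引起右上腹疼痛……(摘自疾病知识库)",
}

def design_prompt(user_input: str) -> str:
    """Return the user input, optionally prefixed with retrieved disease knowledge."""
    matched: Optional[str] = next(
        (name for name in DISEASE_LIBRARY if name in user_input), None
    )
    if matched is None:
        return user_input  # no reliable match: pass the query through unchanged
    knowledge = DISEASE_LIBRARY[matched]
    return f"已知医学背景({matched}):{knowledge}\n\n患者问题:{user_input}"

# The designed prompt is then fed, together with the original question, to
# DoctorGLM or any LLM with decent in-context learning ability.
prompt = design_prompt("我最近两肋隐痛,是不是肋间神经炎?")
```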
3.3. Training of DoctorGLM
We utilized the ChatGLM-6B model [4, 15] in developing our DoctorGLM. This open bilingual language model is based on the General Language Model (GLM) framework and has 6.2 billion parameters. ChatGLM-6B is optimized for Chinese QA and dialogue, and its technology is similar to ChatGPT’s. The model was trained on approximately 1 trillion tokens of Chinese and English corpus, with additional supervised fine-tuning, feedback bootstrapping, and reinforcement learning from human feedback. Despite having only 6.2 billion parameters, ChatGLM-6B generates answers that are aligned with human preference. Furthermore, we use low-rank adaptation (LoRA) to fine-tune ChatGLM with only 7 million trainable parameters.
The fine-tuning process on the full Chinese medical dialogue dataset was conducted on an A100 GPU for a duration of 8 hours. The hyper-parameters employed in the training process were as follows: a batch size of 4, a learning rate of 2e-5 with the Lion optimizer, a total of 1 epoch, a maximum sequence length of 512 tokens, and a maximum target length of 100 tokens, with no warmup and no weight decay. Low-rank adaptation is applied to $q, v$, with the rank set to 8 and alpha set to 16.
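A sketch of how this configuration could be expressed with the PEFT library is shown below. The target-module name (ChatGLM-6B fuses the q/k/v projections into a single `query_key_value` layer) and the `lion_pytorch` import are assumptions; adapt them to the actual model implementation.

```python
# Sketch of a LoRA fine-tuning setup matching the reported hyper-parameters
# (rank 8, alpha 16, LoRA on the attention projections, batch size 4, lr 2e-5,
# 1 epoch, Lion optimizer, no warmup / weight decay). Module and package names
# are assumptions.
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model
from lion_pytorch import Lion  # assumed third-party Lion optimizer implementation

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],  # ChatGLM-6B fuses q/k/v in one projection
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # on the order of 7M trainable parameters

optimizer = Lion(model.parameters(), lr=2e-5, weight_decay=0.0)

# Training loop (per batch): inputs truncated to 512 tokens, targets to 100.
# for batch in dataloader:
#     loss = model(**batch).loss
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```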
4. Main Results
4.1. Compare to General Purpose Models
Here we demonstrate some QA examples with DoctorGLM in Tables 2, 3 and 4, compared against our base model ChatGLM-6B and ChatGPT (gpt-3.5-turbo). DoctorGLM here uses no additional prompt or filter. This model is trained on CMD., with data from five departments. We also asked a doctor to mark the errors in the responses (see the annotations in Table 2).
In general, DoctorGLM tends to be too arbitrary, while general-purpose models like ChatGPT are more conservative. For example, in Q2 (see Table 3), ChatGLM suggested using Ribavirin Granules.
4.2. Generation Diversity
Top-p and temperature are techniques used in text generation models to control the diversity and quality of the generated output. Top-p (also known as nucleus sampling) selects the most likely words based on their probabilities in the model’s output distribution: it keeps the smallest set of words whose cumulative probability exceeds a certain threshold, which allows for more diverse output than simply selecting the most probable word at each step. Temperature, on the other hand, is a scalar parameter that controls the randomness of the generated output: a higher temperature value results in more diverse and creative output, while a lower value leads to more conservative and predictable output. The temperature parameter rescales the softmax function used to calculate the probabilities of the next word in the generated sequence. The results are presented in Table 6.
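To make the two knobs concrete, the following sketch applies temperature scaling and top-p filtering to a vector of next-token logits; in practice these are simply passed as arguments to `model.generate(..., do_sample=True, top_p=..., temperature=...)`.

```python
# Minimal sketch of temperature scaling and top-p (nucleus) filtering applied
# to next-token logits, making the mechanics of the two parameters explicit.
import torch

def sample_next_token(logits: torch.Tensor, top_p: float = 0.9, temperature: float = 1.0) -> int:
    probs = torch.softmax(logits / temperature, dim=-1)        # temperature rescales the softmax
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p                   # smallest set whose mass exceeds top_p
    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    filtered = filtered / filtered.sum()                       # renormalize over the nucleus
    choice = torch.multinomial(filtered, num_samples=1)
    return int(sorted_idx[choice])
```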
4.3. How many iterations are enough?
For LoRA, we do not know yet. See Table 7 for more detail. During our training, we found that the model collapsed after 50K iterations and could not produce any meaningful output.
4.4. P-tuning vs. LoRA
We also tested P-tuning [8], a method for fine-tuning large language models that optimizes only continuous prompts, significantly reducing storage and memory usage per task. It performs comparably to full fine-tuning while updating only 0.1%-3% of the parameters. We compare LoRA and P-Tuning v2 for our model, and the results are shown in Table 5.
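For reference, a sketch of a P-Tuning v2 setup on ChatGLM-6B is shown below: only a short sequence of continuous prefix embeddings is trained while the 6B backbone stays frozen. The `pre_seq_len` argument and the `prefix_encoder` module name follow the p-tuning example in the ChatGLM-6B repository and are assumptions here; other models could use e.g. PEFT's prefix-tuning configuration instead.

```python
# Sketch of P-Tuning v2 on ChatGLM-6B: train only the continuous prefix
# embeddings (the "prefix encoder") and freeze the 6B backbone. The
# pre_seq_len argument and module name are assumptions based on the
# ChatGLM-6B repository's p-tuning example.
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True, pre_seq_len=128)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", config=config, trust_remote_code=True).half().cuda()

# Freeze everything except the prefix encoder (the continuous prompts).
for name, param in model.named_parameters():
    param.requires_grad = "prefix_encoder" in name
```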
5. Cost
Training an LLM from scratch with web-scale data can be a costly endeavor, but fine-tuning is a more economical approach. DoctorGLM’s training process can handle approximately 80,000 single question-and-answer pairs per hour per GPU. Assuming that three epochs are necessary and that a cloud computing server with an A100 GPU costs approximately 5 USD per hour, the total training time required is 3.75 hours, which translates to a cost of approximately 18.75 USD for fine-tuning DoctorGLM on 100,000 QA pairs.
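Spelled out, the figures above combine as:

$$\frac{3~\text{epochs}\times 100{,}000~\text{QA pairs}}{80{,}000~\text{pairs per GPU-hour}} = 3.75~\text{GPU-hours},\qquad 3.75~\text{h}\times 5~\text{USD/h}\approx 18.75~\text{USD}.$$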
On the other hand, the inference process for DoctorGLM requires only about 13 GB of GPU memory and can be performed on a consumer-level GPU such as an RTX 3090. This results in a total cost (inference PC) of approximately 1500 USD.
It’s worth noting that these costs may vary depending on the current price of GPUs and your location, but they remain relatively affordable for many research institutions and hospitals.
Table 2. Doctor’s comments are marked in blue. Factual errors are marked in red. Improper diagnosis are marked in green.
| Q1近一个多月来,本人两肋、前胸包括胸口及两旁、后背两边、脊椎骨上段、包括扇子骨等处,就是上半身,发生不适隐痛,上半身很纠结,好像里面有手揪住一样,不舒展爽朗,十分难受,一会儿前胸和两肋不适难受,一会儿后背两边不适难受,同时发生隐痛。扭动身体,做做伸展活动,或者用拳头抵住揉搓不适隐痛处,身体就会舒展一些好受一些。总的来说就是上半身有肋骨的地方和上半段脊椎骨范围内发生不适和隐痛,一般同时两三个隐痛点,就是肋骨最下段的骨梢也发生不适隐痛。没有肋骨地方不痛,如肚子处就不隐痛。本人于2011年11月做了胆囊摘除和胆总管结石手术。不是腹腔镜,是开了一条约18公分的大口子和两个导流小洞。此种不适难受隐痛情况2012年春天发生过一次,时间不长自动好了。但这次严重时间又长,不适和隐痛感觉有时呈块状,有时呈条 |
| A1(DoctorGLM)病情分析:您好,这种情况考虑是肋间神经炎引起的疼痛 (还需与心绞痛、心脏神经官能症、胸膜炎、肋软骨炎等鉴别,可以根据鉴别诊断补充问诊,明确诊断) 指导意见:建议口服消炎药,止痛药,维生素B族,按疗程治疗,疼痛缓解后,练习抬胸运动,可缓解疼痛,除了正规治疗胆石症外,患者需要多咨询专家建议,和医生保持沟通,患者还需要重视护理方面,例如合理饮食,保持心情愉快与此同时患者还要注意选择一家专业医院诊治,这样才能得到良好的治疗效果 A1(ChatGPT)很抱歉听到您最近出现了这些不适和隐痛的症状。根据您的描述,您的症状似乎涉及到上半身多个部位的肋骨和脊椎骨等骨骼结构,以及随之产生的不适和隐痛感觉。由于您之前做过胆囊摘除和胆总管结石手术,这些症状可能与手术本身有关, (距离手术时间很久了,后来自动缓解,中间未再发作,不太考虑与手术直接相关。) 但也有可能是其他原因引起的。因此,建议您及时就医,进行详细的身体检查和相关检查,以确定症状的具体原因和处理方法。您可以咨询家庭医生或内科医生,他们可以进行初步的检查和评估,并根据需要进行相应的检查,如X光、CT、MIRI或血液检查等。在等待就医的过程中,您可以注意休息、保持良好的姿势、避免过度运动或重物提起,同时可以使用热敷或冷敷等方法缓解疼痛和不适感。最后,建议您保持乐观心态,积极配合医 |
| A1(ChatGLM)所述症状可能是由多种原因引起的,包括身体、心理和情绪等方面。根据所述症状,建议进行以下检查和测试: 1.医生体检:进行全面的身体检查,包括心肺听诊、血压测试、视力、听力、皮肤测试等,以确定任何潜在的健康问题。 2.心电图(EKG):测试心脏健康,评估是否有潜在的心脏问题。 3.核磁共振(MIRI):测试骨骼和肌肉系统的健康,以及评估是否有潜在的神经问题。 4.血液检查:测试贫血、感染、肿瘤等问题。 5.心理健康测试:测试情绪和心理健康,例如压力测试、焦虑测试等。 如果症状持续存在或变得更加严重,请尽快咨询医生以获取专业建议和治疗。(过于笼统,没有针对性) |
6. Technical Limitations
This work is in a very early stage and contains numerous mistakes, making it unsuitable for any commercial or clinical use. One of the reasons we have published our work is to invite the broader community to help improve this healthcare-focused language model, with the aim of making it more accessible, affordable, and convenient for a larger audience. Below are some critical technical issues we encountered during this project:
- DoctorGLM experiences a loss in capability during logistic training, and it occasionally repeats itself (see Table 2 A4 (DoctorGLM)). We suspect that fine-tuning typically incurs a higher alignment cost compared to reinforcement learning with human feedback (RLHF).
- Generating a response takes approximately 15 to 50 seconds, depending on token length, which is significantly slower than interacting with ChatGPT via the web API. This delay is partly due to the chatbot’s typing indicator.
References
[1] Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323, 2019.