[论文翻译]BianQue: 通过ChatGPT优化的多轮健康对话平衡健康大语言模型的提问与建议能力


原文地址:https://arxiv.org/pdf/2310.15896


BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT

BianQue: 通过ChatGPT优化的多轮健康对话平衡健康大语言模型的提问与建议能力

Abstract

摘要

Large language models (LLMs) have performed well in providing general and extensive health suggestions in single-turn conversations, exemplified by systems such as ChatGPT, ChatGLM, ChatDoctor, and DoctorGLM. However, the limited information provided by users during a single turn results in inadequate personalization and targeting of the generated suggestions, which requires users to independently select the useful part. This is mainly caused by the missing ability to engage in multi-turn questioning. In real-world medical consultations, doctors usually employ a series of iterative inquiries to comprehend the patient's condition thoroughly, enabling them to subsequently provide effective and personalized suggestions, which can be defined as the chain of questioning (CoQ) for LLMs. To improve the CoQ of LLMs, we propose BianQue, a ChatGLM-based LLM finetuned with the self-constructed health conversation dataset BianQueCorpus, which consists of multiple turns of questioning and health suggestions polished by ChatGPT. Experimental results demonstrate that the proposed BianQue can simultaneously balance the capabilities of both questioning and health suggestions, which will help promote the research and application of LLMs in the field of proactive health.

大语言模型(LLM)在单轮对话中提供通用广泛的健康建议方面表现优异,例如ChatGPT、ChatGLM、ChatDoctor、DoctorGLM等系统。然而单轮对话中用户提供的信息有限,导致生成建议的个性化和针对性不足,需要用户自行筛选有效部分。这主要源于模型缺乏多轮追问能力。在实际医疗问诊中,医生通常通过一系列迭代询问来全面了解患者状况,从而后续提供有效且个性化的建议,这种模式可定义为大语言模型的追问链(CoQ)。为提升大语言模型的追问链能力,我们提出基于ChatGLM微调的BianQue模型,其训练数据为自建的健康对话数据集BianQueCorpus,该数据集包含经ChatGPT优化的多轮追问和健康建议。实验结果表明,BianQue能同时平衡追问和健康建议生成能力,这将推动大语言模型在主动健康领域的研究与应用。

1 Introduction

1 引言

Recently, large language models (LLMs), e.g. ChatGPT (OpenAI, 2022), LLaMA (Touvron et al., 2023), and ChatGLM (Zeng et al., 2023), have been extensively applied in various fields. Through high-quality instruction fine-tuning and reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022), LLMs already possess stunning language comprehension, generation, and knowledge reasoning abilities. Overall, users are amazed by the excellent suggestion ability of LLMs.

近年来,大语言模型(LLM)如ChatGPT (OpenAI, 2022)、LLaMA (Touvron et al., 2023)、ChatGLM (Zeng et al., 2023)已被广泛应用于各个领域。通过高质量指令微调和基于人类反馈的强化学习(RLHF)(Ouyang et al., 2022),大语言模型已具备惊人的语言理解、生成和知识推理能力。总体而言,用户对其出色的建议能力感到惊叹。

However, LLMs are deficient in "questioning", which is an important way to proactively understand users' needs in medical, psychological, educational, and other application scenarios. When we engage in healthcare conversations with these LLMs (ChatGPT, ChatGLM, SparkDesk), they do not yet possess the ability to conduct multiple rounds of questioning, as presented in Appendix B. The above LLMs generally provide reasonable and universal suggestions based on the single-turn instruction provided by users. However, in the real world, doctors often need to conduct multiple turns of questioning with patients in order to provide targeted advice, as shown in Figure 1. During the user consultation, the doctor raises different questions in the first 9 turns of the conversation to understand the specific situation of the baby. The above multi-turn questioning process can be defined as the Chain of Questioning (CoQ). We found that current LLMs lack CoQ capabilities because they lack training data for multiple rounds of questioning during the instruction fine-tuning and RLHF stages. When researchers construct instructions and answers, on the one hand, they ignore multi-turn conversation history, and on the other hand, the answers are usually suggestions rather than questions.

然而,大语言模型在"提问"能力上存在不足,而这是医疗、心理、教育等应用场景中主动理解用户需求的重要方式。当我们与ChatGPT、ChatGLM、SparkDesk等大语言模型进行医疗健康对话时,它们尚不具备如附录B所示的多轮提问能力。上述模型通常仅根据用户单轮指令提供合理但通用的建议。但现实中医生往往需要如图1所示,通过多轮问诊才能给出针对性建议。在用户咨询过程中,医生前9轮对话通过不同提问来了解婴儿具体情况。这种多轮提问过程可定义为提问链(CoQ)。我们发现当前大语言模型缺乏CoQ能力,因为其在指令微调和RLHF阶段缺乏多轮提问的训练数据。研究者在构建指令和答案时,一方面忽略了多轮对话历史,另一方面答案通常为建议而非问题。

At present, research on LLMs in the health field mainly focuses on evaluating the performance of existing models, constructing suitable datasets, and instruction fine-tuning. Singhal et al. (2022) proposed a medical Q&A benchmark, MultiMedQA, for evaluating the clinical knowledge QA abilities of LLMs. Li et al. (2023) constructed a real doctor-patient dialogue dataset, HealthCareMagic-100k, and used it to fine-tune ChatDoctor based on LLaMA. Similar health LLMs have been released one after another, e.g. BenTsao (本草) (Wang et al., 2023b), ChatGLM-6B-Med (Wang et al., 2023b), DoctorGLM (Xiong et al., 2023), MedAlpaca (Han et al., 2023), ClinicalGPT (Wang et al., 2023a), etc. These models are basically based on the assumption that "users can clearly describe their problems or situations". Therefore, during the model construction phase, the questioning ability of the model was not considered. Although these models have achieved good performance in the field of medical QA, they do not have the ability to ask users questions.

目前,健康领域大语言模型的研究主要聚焦于评估现有模型性能、构建合适的数据集以及指令微调。Singhal等人(2022)提出了医疗问答基准MultiMedQA,用于评估大语言模型的临床知识问答能力。Li等人(2023)构建了真实医患对话数据集HealthCareMagic-100k,并基于LLaMA微调出ChatDoctor。类似医疗大语言模型相继发布,例如本草(BenTsao)(Wang等人,2023b)、ChatGLM-6B-Med(Wang等人,2023b)、DoctorGLM(Xiong等人,2023)、MedAlpaca(Han等人,2023)、ClinicalGPT(Wang等人,2023a)等。这些模型基本基于"用户能清晰描述自身问题或状况"的假设,因此在模型构建阶段未考虑模型的提问能力。尽管这些模型在医疗问答领域表现优异,但都不具备向用户提问的能力。

To enhance the questioning ability of LLMs, we constructed a multi-turn health conversation dataset named BianQueCorpus, in which the targets consist of balanced proportions of questions (46.2%) and suggestions (53.8%), as shown in Figure 2. Meanwhile, we present BianQue, a health LLM that is specifically designed for balancing the questioning and suggestion ability. The results on multi-turn health conversation datasets demonstrate that BianQue outperforms existing models and ChatGPT, especially in the ability to conduct a chain of questioning (CoQ).

为提升大语言模型(LLM)的提问能力,我们构建了名为BianQueCorpus的多轮健康对话数据集,其中目标(医生回答)由均衡比例的问题(46.2%)和建议(53.8%)构成,如图2所示。同时,我们提出了BianQue——一个专为平衡提问与建议能力而设计的健康领域大语言模型。多轮健康对话数据集上的实验结果表明,BianQue显著优于现有模型及ChatGPT,尤其是在提问链(CoQ)能力方面。


Figure 2: Proportion of questions and suggestions in answers of BianQueCorpus.

图 2: BianQueCorpus回答中问题与建议的比例分布。

2 Methodology

2 方法论

2.1 BianQueCorpus: Balancing Questioning and Suggestion

2.1 扁鹊语料库:平衡提问与建议

In the field of Chinese health conversational AI, there are already some multi-turn conversation datasets, e.g. MedDialog-CN (He et al., 2020), IMCS-V2 (Chen et al., 2022), CHIP-MDCFNPC (Zhang et al., 2022), and MedDG (Zhang et al., 2022). However, these conversations are often crawled from internet consultation platforms, such as 好大夫. These datasets are often mixed with a large amount of noise, such as missing content, missing images, reward information, private content, incomplete JSON content, website links, website tips, voice recordings, text automatically replied by the system, etc. We first collected real-world multi-turn health conversations through data outsourcing services. Then, we performed a two-stage data optimization process: (i) we constructed an automatic data cleaning process based on regular expressions to improve the quality of the existing conversation datasets; (ii) we designed a polishing prompt (see Figure 4) and used ChatGPT to polish the doctors' suggestions in multi-turn conversations, because doctors often respond very briefly on internet platforms, lacking detailed analysis and suggestions. The whole construction process of BianQueCorpus is presented in Figure 3. We ultimately obtained a multi-turn health conversation dataset consisting of 2,437,190 samples, in which questions account for 46.2% of the doctors' answers.

在中文健康对话AI领域,已有一些多轮对话数据集,例如MedDialog-CN (He et al., 2020)、IMCS-V2 (Chen et al., 2022)、CHIP-MDCFNPC (Zhang et al., 2022)、MedDG (Zhang et al., 2022)。然而这些对话通常爬取自网络问诊平台(如好大夫),数据中常混杂大量噪声,包括内容缺失、图片缺失、打赏信息、隐私内容、不完整的JSON内容、网站链接、网站提示、录音文件、系统自动回复文本等。我们首先通过数据外包服务收集了真实世界的多轮健康对话,随后实施了两阶段数据优化流程:(i) 基于正则表达式构建自动化清洗流程以提升现有对话数据集质量;(ii) 设计润色提示词(见图4),利用ChatGPT对多轮对话中医生的建议进行润色,因为网络平台中医生的回复往往过于简略,缺乏详细分析和建议。BianQueCorpus的完整构建流程如图3所示。最终我们获得了包含2,437,190条样本的多轮健康对话数据集,其中问题占医生回答的46.2%。
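As a sketch of stage (i), the regex-based cleaning can be implemented with Python's `re` package; the patterns below are illustrative assumptions, not the actual 50 rules used for BianQueCorpus:

```python
import re

# Illustrative cleaning rules; the real pipeline uses ~50 regular expressions
# targeting noise such as links, media placeholders, and reward messages.
CLEAN_PATTERNS = [
    re.compile(r"https?://\S+"),          # website links
    re.compile(r"\[语音\]|\[图片\]"),      # voice/image placeholders
    re.compile(r"(感谢|谢谢).{0,4}打赏"),  # reward/tip messages
]

def clean_utterance(text: str) -> str:
    """Apply every pattern in turn, then collapse leftover whitespace."""
    for pattern in CLEAN_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

An utterance like `"建议多喝水 https://example.com/x [图片]"` would be reduced to `"建议多喝水"` by these rules.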

2.2 BianQue Model

2.2 BianQue 模型

We chose the ChatGLM-6B (Du et al., 2022; Zeng et al., 2023) as the base LLM architecture to construct the BianQue, since it is open source and has excellent Chinese understanding and generation performance. The input of model is defined as:

我们选择ChatGLM-6B (Du et al., 2022; Zeng et al., 2023) 作为构建BianQue的基础大语言模型架构,因为其开源且具备出色的中文理解与生成性能。模型输入定义为:

$input = u_{1}^{u} + {'\backslash n'} + u_{1}^{p} + \cdots + u_{N}^{u} + {'\backslash n'} + u_{N}^{p}$, where $u_{i}^{u} =$ '病人:' $+$ utterance$_{i}^{u}$, $u_{i}^{p} =$ '医生:' $+$ utterance$_{i}^{p}$ $(i < N)$, $u_{N}^{p} =$ '医生:', and $N$ is the number of dialogue turns.

$input = u_{1}^{u} + {'\backslash n'} + u_{1}^{p} + \cdots + u_{N}^{u} + {'\backslash n'} + u_{N}^{p}$,其中 $u_{i}^{u} =$ '病人:' $+$ utterance$_{i}^{u}$,$u_{i}^{p} =$ '医生:' $+$ utterance$_{i}^{p}$($i < N$),$u_{N}^{p} =$ '医生:',$N$ 表示对话轮数。
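The input construction can be sketched in Python. The role prefixes follow the definition above; using '\n' between consecutive turns as well is an assumption, since the formula only shows the separator inside a turn:

```python
def build_input(history, current_query):
    """Build the model input from a multi-turn conversation.

    history:       list of (patient_utterance, doctor_utterance) pairs
                   for turns 1..N-1.
    current_query: the patient utterance of turn N; the returned string
                   ends with the bare prefix '医生:' so the model
                   generates the doctor's next reply.
    """
    parts = []
    for patient, doctor in history:
        parts.append("病人:" + patient)
        parts.append("医生:" + doctor)
    parts.append("病人:" + current_query)
    parts.append("医生:")
    return "\n".join(parts)
```

For example, `build_input([("我家宝宝感冒了", "感冒多久了?")], "三天了")` yields a string ending in the generation prefix `医生:`.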

3 Experiments

3 实验

3.1 Baselines and Benchmarks

3.1 基线方法与基准测试

We select ChatGLM-6B (Zeng et al., 2023), ChatGPT (gpt-3.5-turbo) (OpenAI, 2022), and DoctorGLM (Xiong et al., 2023) as the baseline models. Comparative experiments were conducted on the test sets of MedDialog-CN, IMCS-V2, CHIP-MDCFNPC, and MedDG respectively, since they are multi-turn conversation datasets that have both suggestions and questions in their targets.

我们选择 ChatGLM-6B (Zeng et al., 2023)、ChatGPT (gpt-3.5-turbo) (OpenAI, 2022) 和 DoctorGLM (Xiong et al., 2023) 作为基线模型。由于 MedDialog-CN、IMCS-V2、CHIP-MDCFNPC 和 MedDG 是多轮对话数据集且目标中同时包含建议和问题,我们分别在它们的测试集上进行了对比实验。

3.2 Implementation details

3.2 实现细节

BianQue is finetuned on the proposed BianQueCorpus using the WarmupDecayLR learning rate scheduler with warmup_steps = 1,000 and a maximum learning rate of 5e-5. During the training stage, the maximum input length is set to 1,536, while the maximum target length is set to 512. A batch size of 80 and 25,000 global training steps are applied. The decoding algorithm of top-p sampling with $p = 0.75$ and temperature $\tau = 0.95$ is applied in the inference stage.

BianQue 在提出的 BianQueCorpus 上进行了微调,采用 WarmupDecayLR 学习率调度器,其中 warmup_steps = 1,000,最大学习率为 5e-5。训练阶段的最大输入长度设置为 1,536,最大目标长度设置为 512,批处理大小为 80,全局训练步数为 25,000。推理阶段采用 top-p 采样解码算法,其中 $p = 0.75$,温度 $\tau = 0.95$。
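A minimal sketch of such a warmup-decay schedule: linear warmup to the peak rate, then linear decay over the remaining steps (matching e.g. DeepSpeed's WarmupDecayLR; the exact decay shape used in training is an assumption):

```python
def warmup_decay_lr(step, warmup_steps=1000, max_lr=5e-5, total_steps=25000):
    """Learning rate at a given global step: linear warmup to max_lr over
    warmup_steps, then linear decay towards 0 at total_steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

The rate climbs from 0 to 5e-5 over the first 1,000 steps and reaches 0 again at step 25,000.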

3.3 Results and Analysis

3.3 结果与分析

Following ClinicalGPT (Wang et al., 2023a), we evaluated BianQue and other models with the metrics BLEU-1/2/3/4 (Papineni et al., 2002) and ROUGE-1/2/L (Lin, 2004). In addition, we define a new metric to measure the model's Proactive Questioning Ability (PQA):

遵循ClinicalGPT (Wang et al., 2023a) 的评估方法,我们采用BLEU-1/2/3/4 (Papineni et al., 2002) 和ROUGE-1/2/L (Lin, 2004) 指标对BianQue及其他模型进行评测。此外,我们定义了一项新指标来衡量模型的主动提问能力 (Proactive Questioning Ability, PQA):

$$
PQA = \frac{2 P_{q} R_{q}}{P_{q} + R_{q}}, \quad
P_{q} = \frac{Q_{tp}}{Q_{tp} + Q_{\bar{t}p}}, \quad
R_{q} = \frac{Q_{tp}}{Q_{tp} + Q_{t\bar{p}}}
$$


Figure 3: Construction process of Bian Que Corpus dataset and BianQue model.

图 3: Bian Que Corpus数据集与BianQue模型的构建流程。


Figure 4: The prompt used for polishing the suggestions of doctors based on real-world multi-turn conversation context.

图 4: 基于真实多轮对话场景优化医生建议所使用的提示词。

where $Q_{tp}$ is the number of samples whose target and prediction are both questions, $Q_{t\bar{p}}$ is the number of samples with a question target but a suggestion prediction, and $Q_{\bar{t}p}$ is the number of samples with a suggestion target but a question prediction.

其中 $Q_{tp}$ 是目标和预测均为问题的样本数量,$Q_{t\bar{p}}$ 是目标为问题但预测为建议的样本数量,$Q_{\bar{t}p}$ 是目标为建议但预测为问题的样本数量。
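PQA can thus be read as the F1 score of a binary question-vs-suggestion classification in which questions are the positive class. A small illustrative sketch (the 'q'/'s' labels are assumptions for the example):

```python
def pqa(targets, predictions):
    """F1-style Proactive Questioning Ability.

    targets/predictions are sequences of labels: 'q' for a question
    reply, 's' for a suggestion reply.
    """
    pairs = list(zip(targets, predictions))
    q_tp = sum(t == "q" and p == "q" for t, p in pairs)  # question target, question prediction
    q_fn = sum(t == "q" and p == "s" for t, p in pairs)  # question target, suggestion prediction
    q_fp = sum(t == "s" and p == "q" for t, p in pairs)  # suggestion target, question prediction
    if q_tp == 0:
        return 0.0
    precision = q_tp / (q_tp + q_fp)
    recall = q_tp / (q_tp + q_fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, with targets "qqss" and predictions "qsqs", precision and recall are both 0.5, so PQA is 0.5.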

As shown in Table 1, BianQue demonstrates considerable performance on MedDialog-CN, IMCS-V2, CHIP-MDCFNPC, and MedDG, achieving better scores than the other models across all metrics.

如表 1 所示,BianQue 在 MedDialog-CN、IMCS-V2、CHIP-MDCFNPC 和 MedDG 数据集上展现出显著性能,所有指标均优于其他模型。

4 Conclusion and Future Work

4 结论与未来工作

In this study, we introduced BianQue, a health LLM with balanced questioning and suggestion ability, which is finetuned on the proposed large-scale multi-turn health conversation dataset BianQueCorpus, in which the targets consist of balanced proportions of questions (46.2%) and suggestions (53.8%). The empirical results highlight its superior multi-turn questioning ability. Future work needs to further focus on the conversion mechanism between questioning and suggestion.

本研究介绍了BianQue——一个具备均衡提问与建议能力的健康领域大语言模型,该模型基于我们提出的大规模多轮健康对话数据集BianQueCorpus微调而成。该数据集中目标内容包含均衡比例的问题(46.2%)和建议(53.8%)。实证结果突显了其卓越的多轮提问能力。未来工作需要进一步聚焦提问与建议之间的转换机制。

Limitations

局限性

It must be emphasized that there are potential risks when using generative language models for health conversations. Doctors in the real world are rigorous in diagnosing diseases and providing medication guidance. However, the current state-of-the-art LLMs (e.g. ChatGPT) still cannot guarantee the accuracy of the text they generate. Therefore, it is necessary to set up inspection and error correction mechanisms for the health suggestions generated by LLMs. At the same time, when LLMs learn the ability to proactively question, their usage risk also increases, as the models may ask users some questions related to privacy. For example, when users consult AI about cold-related issues, the AI may proactively inquire about their age, gender, and other private information. Further privacy protection mechanisms need to be considered in the research and application of LLMs. Overall, the methods proposed in this article are still in the early research stage, and the questioning and suggestion mechanisms are not clear enough. The proposed model is limited to academic research and cannot be used in real-world deployment.

必须强调,使用生成式语言模型进行健康对话存在潜在风险。现实中的医生在诊断疾病和提供用药指导时非常严谨,然而当前最先进的大语言模型(如ChatGPT)仍无法保证其生成文本的准确性。因此,有必要对大语言模型生成的健康建议设置检查和纠错机制。同时,当大语言模型学会主动提问的能力时,其使用风险也会增加,因为模型可能会询问用户一些涉及隐私的问题。例如,当用户向AI咨询感冒相关问题时,AI可能会主动询问其年龄、性别等隐私信息。在大语言模型的研究和应用中,需要进一步考虑隐私保护机制。总体而言,本文提出的方法仍处于早期研究阶段,提问和建议机制不够明确。所提出的模型仅限于学术研究,无法在实际场景中部署。

Ethics Statement

伦理声明

The BianQue model is committed to improving the proactive questioning ability of LLMs, rather than providing very professional medical diagnoses or advice. The multi-turn conversation dataset used in this study is mainly based on real-world doctor-patient conversations, which have gone through a strict data cleansing process to eliminate private information and dirty text content. To this end, we constructed 50 regular expressions and used the re package for filtering. We compared the data quality before and after data cleansing, and the excellent rate increased from 82% to 93%. Due to the lack of human feedback during the model finetuning stage, the current version of the model may involve user privacy when asking questions, which is particularly important to note. On the other hand, the health recommendations generated by the model have not undergone rigorous examination and proofreading, and therefore cannot be used as a substitute for real-world doctors. We emphasize that this is an early research-oriented model, rather than a mature and directly applicable one. Therefore, future work needs to combine RLHF to improve the safety level of model-generated questions or suggestions. Besides, when BianQue is applied to downstream scenarios, it is necessary to inform users in advance that the answers they see are generated by a health AI and are for reference only.

扁鹊(BianQue)模型致力于提升大语言模型的主动提问能力,而非提供非常专业的医疗诊断或建议。本研究使用的多轮对话数据集主要基于真实世界医患对话,经过严格的数据清洗流程以消除隐私信息和脏文本内容。为此,我们构建了50个正则表达式并使用re包进行过滤。对比清洗前后的数据质量,优秀率从82%提升至93%。由于模型微调阶段缺乏人类反馈,当前版本模型在提问时可能涉及用户隐私,这一点需要特别注意。另一方面,模型生成的健康建议未经严格审查校对,因此不能替代真实医生。我们强调这是一个早期研究导向的模型,而非成熟可直接应用的模型。因此未来工作需要结合RLHF来提升模型生成问题或建议的安全水平。此外,当扁鹊应用于下游场景时,需提前告知用户所见回答由健康AI生成,仅供参考。

Table 1: Evaluation results.

| Dataset | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | R-1 | R-2 | R-L | PQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MedDialog-CN | ChatGLM-6B | 7.28 | 3.72 | 2.10 | 1.23 | 10.86 | 0.92 | 7.43 | 0.20 |
| | DoctorGLM | 10.39 | 5.06 | 2.94 | 1.80 | 13.27 | 1.04 | 11.17 | 0.01 |
| | ChatGPT | 7.61 | 3.90 | 2.21 | 1.30 | 11.11 | 0.96 | 7.82 | 0.28 |
| | BianQue | 11.12 | 6.50 | 4.42 | 3.10 | 15.55 | 2.15 | 12.96 | 0.53 |
| IMCS-V2 | ChatGLM-6B | 6.83 | 3.61 | 2.12 | 1.30 | 10.24 | 1.03 | 7.26 | 0.36 |
| | DoctorGLM | 8.38 | 4.22 | 2.52 | 1.55 | 11.87 | 0.95 | 9.22 | 0.06 |
| | ChatGPT | 8.46 | 4.54 | 2.71 | 1.70 | 11.48 | 1.29 | 8.97 | 0.38 |
| | BianQue | 14.50 | 10.16 | 7.85 | 6.23 | 21.73 | 6.24 | 19.09 | 0.70 |
| CHIP-MDCFNPC | ChatGLM-6B | 6.22 | 3.11 | 1.81 | 1.10 | 9.62 | 0.85 | 0.67 | 0.35 |
| | DoctorGLM | 8.59 | 4.33 | 2.68 | 1.71 | 12.05 | 1.11 | 9.68 | 0.05 |
| | ChatGPT | 7.52 | 3.74 | 2.20 | 1.36 | 10.51 | 0.97 | 8.03 | 0.38 |
| | BianQue | 13.41 | 8.49 | 6.05 | 4.42 | 19.00 | 3.99 | 16.56 | 0.57 |
| MedDG | ChatGLM-6B | 4.76 | 2.31 | 1.34 | 0.81 | 7.35 | 0.56 | 5.06 | 0.47 |
| | DoctorGLM | 6.87 | 3.47 | 2.15 | 1.35 | 9.62 | 0.88 | 7.61 | 0.09 |
| | ChatGPT | 5.11 | 2.41 | 1.38 | 0.83 | 7.58 | 0.50 | 5.46 | 0.63 |
| | BianQue | 14.86 | 10.43 | 8.09 | 6.37 | 21.56 | 6.46 | 19.56 | 0.81 |

表 1: 评估结果

| Dataset | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | R-1 | R-2 | R-L | PQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MedDialog-CN | ChatGLM-6B | 7.28 | 3.72 | 2.10 | 1.23 | 10.86 | 0.92 | 7.43 | 0.20 |
| | DoctorGLM | 10.39 | 5.06 | 2.94 | 1.80 | 13.27 | 1.04 | 11.17 | 0.01 |
| | ChatGPT | 7.61 | 3.90 | 2.21 | 1.30 | 11.11 | 0.96 | 7.82 | 0.28 |
| | BianQue | 11.12 | 6.50 | 4.42 | 3.10 | 15.55 | 2.15 | 12.96 | 0.53 |
| IMCS-V2 | ChatGLM-6B | 6.83 | 3.61 | 2.12 | 1.30 | 10.24 | 1.03 | 7.26 | 0.36 |
| | DoctorGLM | 8.38 | 4.22 | 2.52 | 1.55 | 11.87 | 0.95 | 9.22 | 0.06 |
| | ChatGPT | 8.46 | 4.54 | 2.71 | 1.70 | 11.48 | 1.29 | 8.97 | 0.38 |
| | BianQue | 14.50 | 10.16 | 7.85 | 6.23 | 21.73 | 6.24 | 19.09 | 0.70 |
| CHIP-MDCFNPC | ChatGLM-6B | 6.22 | 3.11 | 1.81 | 1.10 | 9.62 | 0.85 | 0.67 | 0.35 |
| | DoctorGLM | 8.59 | 4.33 | 2.68 | 1.71 | 12.05 | 1.11 | 9.68 | 0.05 |
| | ChatGPT | 7.52 | 3.74 | 2.20 | 1.36 | 10.51 | 0.97 | 8.03 | 0.38 |
| | BianQue | 13.41 | 8.49 | 6.05 | 4.42 | 19.00 | 3.99 | 16.56 | 0.57 |
| MedDG | ChatGLM-6B | 4.76 | 2.31 | 1.34 | 0.81 | 7.35 | 0.56 | 5.06 | 0.47 |
| | DoctorGLM | 6.87 | 3.47 | 2.15 | 1.35 | 9.62 | 0.88 | 7.61 | 0.09 |
| | ChatGPT | 5.11 | 2.41 | 1.38 | 0.83 | 7.58 | 0.50 | 5.46 | 0.63 |
| | BianQue | 14.86 | 10.43 | 8.09 | 6.37 | 21.56 | 6.46 | 19.56 | 0.81 |

Acknowledgements

致谢

This work was supported by the Science and Technology Project of Guangzhou (202103010002), the Natural Science Foundation of Guangdong Province (2022A1515011588), the National Key R&D Program of China (2022YFB4500600), the Science and Technology Project of Guangdong (2022B0101010003), the National Natural Science Foundation of China under Grant U1801262, and the Guangdong Provincial Key Laboratory of Human Digital Twin (2022B1212010004).

本研究由广州市科技计划项目(202103010002)、广东省自然科学基金(2022A1515011588)、国家重点研发计划(2022YFB4500600)、广东省科技计划项目(2022B0101010003)、国家自然科学基金项目(U1801262)及广东省人体数字孪生重点实验室(2022B1212010004)资助。

References

参考文献

Wei Chen, Zhiwei Li, Hongyi Fang, Qianyuan Yao, Cheng Zhong, Jianye Hao, Qi Zhang, Xuanjing Huang, Jiajie Peng, and Zhongyu Wei. 2022. A benchmark for automatic medical consultation system: frameworks, tasks and datasets. Bioinformatics, 39(1). Btac817.

Wei Chen、Zhiwei Li、Hongyi Fang、Qianyuan Yao、Cheng Zhong、Jianye Hao、Qi Zhang、Xuanjing Huang、Jiajie Peng 和 Zhongyu Wei。2022。自动医疗咨询系统基准:框架、任务与数据集。Bioinformatics, 39(1)。Btac817。

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General language model pre-training with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, Dublin, Ireland. Association for Computational Linguistics.

Zhengxiao Du、Yujie Qian、Xiao Liu、Ming Ding、Jiezhong Qiu、Zhilin Yang 和 Jie Tang。2022。GLM:基于自回归空白填充的通用语言模型预训练。载于《第60届计算语言学协会年会论文集(第一卷:长论文)》,第320–335页,爱尔兰都柏林。计算语言学协会。

Tianyu Han, Lisa C. Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K. Bressem. 2023. Medalpaca – an open-source collection of medical conversational ai models and training data.

Tianyu Han, Lisa C. Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K. Bressem. 2023. Medalpaca - 一个开源的医疗对话AI模型及训练数据集合

Xuehai He, Shu Chen, Zeqian Ju, Xiangyu Dong, Hongchao Fang, Sicheng Wang, Yue Yang, Jiaqi Zeng, Ruisi Zhang, Ruoyu Zhang, Meng Zhou, Penghui Zhu, and Pengtao Xie. 2020. MedDialog: Two large-scale medical dialogue datasets.

Xuehai He、Shu Chen、Zeqian Ju、Xiangyu Dong、Hongchao Fang、Sicheng Wang、Yue Yang、Jiaqi Zeng、Ruisi Zhang、Ruoyu Zhang、Meng Zhou、Penghui Zhu 和 Pengtao Xie。2020。MedDialog:两个大规模医疗对话数据集。

Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, and You Zhang. 2023. Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge.

Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, and You Zhang. 2023. ChatDoctor: 基于Llama模型和医学领域知识微调的医疗对话模型

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: 自动摘要评估工具包. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

OpenAI. 2022. Introducing chatgpt.

OpenAI. 2022. 推出ChatGPT

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.

Long Ouyang、Jeffrey Wu、Xu Jiang、Diogo Almeida、Carroll Wainwright、Pamela Mishkin、Chong Zhang、Sandhini Agarwal、Katarina Slama、Alex Ray、John Schulman、Jacob Hilton、Fraser Kelton、Luke Miller、Maddie Simens、Amanda Askell、Peter Welinder、Paul F Christiano、Jan Leike 和 Ryan Lowe。2022。通过人类反馈训练语言模型遵循指令。载于《神经信息处理系统进展》第35卷,第27730–27744页。Curran Associates公司。

Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Kishore Papineni、Salim Roukos、Todd Ward和WeiJing Zhu。2002。BLEU:一种机器翻译自动评估方法。载于《第40届计算语言学协会年会论文集》,第311–318页,美国宾夕法尼亚州费城。计算语言学协会。

Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathanael Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan. 2022. Large language models encode clinical knowledge.

Karan Singhal、Shekoofeh Azizi、Tao Tu、S. Sara Mahdavi、Jason Wei、Hyung Won Chung、Nathan Scales、Ajay Tanwani、Heather Cole-Lewis、Stephen Pfohl、Perry Payne、Martin Seneviratne、Paul Gamble、Chris Kelly、Nathaneal Scharli、Aakanksha Chowdhery、Philip Mansfield、Blaise Agüera y Arcas、Dale Webster、Greg S. Corrado、Yossi Matias、Katherine Chou、Juraj Gottweis、Nenad Tomasev、Yun Liu、Alvin Rajkomar、Joelle Barral、Christopher Semturs、Alan Karthikesalingam 和 Vivek Natarajan。2022。大语言模型 (Large Language Model) 编码临床知识。

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.

Hugo Touvron、Thibaut Lavril、Gautier Izacard、Xavier Martinet、Marie-Anne Lachaux、Timothée Lacroix、Baptiste Rozière、Naman Goyal、Eric Hambro、Faisal Azhar、Aurelien Rodriguez、Armand Joulin、Edouard Grave 和 Guillaume Lample。2023。Llama: 开放高效的基础语言模型。

Guangyu Wang, Guoxing Yang, Zongxin Du, Longjun Fan, and Xiaohu Li. 2023a. ClinicalGPT: Large language models finetuned with diverse medical data and comprehensive evaluation.

Guangyu Wang、Guoxing Yang、Zongxin Du、Longjun Fan 和 Xiaohu Li。2023a。ClinicalGPT:基于多样化医疗数据微调及综合评估的大语言模型。

Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023b. Huatuo: Tuning llama model with chinese medical knowledge.

Haochun Wang、Chi Liu、Nuwa Xi、Zewen Qiang、Sendong Zhao、Bing Qin 和 Ting Liu。2023b。华佗(HuaTuo):基于中文医学知识调优的LLaMA模型。

Honglin Xiong, Sheng Wang, Yitao Zhu, Zihao Zhao, Yuxiao Liu, Linlin Huang, Qian Wang, and Dinggang Shen. 2023. Doctorglm: Fine-tuning your chinese doctor is not a herculean task.

Honglin Xiong、Sheng Wang、Yitao Zhu、Zihao Zhao、Yuxiao Liu、Linlin Huang、Qian Wang 和 Dinggang Shen。2023。DoctorGLM:微调你的中文医生并非艰巨任务。

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130b: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations (ICLR).

Aohan Zeng、Xiao Liu、Zhengxiao Du、Zihan Wang、Hanyu Lai、Ming Ding、Zhuoyi Yang、Yifan Xu、Wendi Zheng、Xiao Xia、Weng Lam Tam、Zixuan Ma、Yufei Xue、Jidong Zhai、Wenguang Chen、Zhiyuan Liu、Peng Zhang、Yuxiao Dong 和 Jie Tang。2023。GLM-130B:一个开放的双语预训练模型。第十一届国际学习表征会议 (ICLR)。

Ningyu Zhang, Mosha Chen, Zhen Bi, Xiaozhuan Liang, Lei Li, Xin Shang, Kangping Yin, Chuanqi Tan, Jian Xu, Fei Huang, Luo Si, Yuan Ni, Guotong Xie, Zhifang Sui, Baobao Chang, Hui Zong, Zheng Yuan, Linfeng Li, Jun Yan, Hongying Zan, Kunli Zhang, Buzhou Tang, and Qingcai Chen. 2022. CBLUE: A Chinese biomedical language understanding evaluation benchmark. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7888–7915, Dublin, Ireland. Association for Computational Linguistics.

Ningyu Zhang、Mosha Chen、Zhen Bi、Xiaozhuan Liang、Lei Li、Xin Shang、Kangping Yin、Chuanqi Tan、Jian Xu、Fei Huang、Luo Si、Yuan Ni、Guotong Xie、Zhi-fang Sui、Baobao Chang、Hui Zong、Zheng Yuan、Linfeng Li、Jun Yan、Hongying Zan、Kunli Zhang、Buzhou Tang 和 Qingcai Chen。2022。CBLUE:中文生物医学语言理解评估基准。载于《第60届计算语言学协会年会论文集(第一卷:长论文)》,第7888–7915页,爱尔兰都柏林。计算语言学协会。

A Reproducibility Checklist

A 可复现性检查清单

• Model and Data: The BianQue model and BianQueCorpus will be released upon the decision of the paper.

• 模型与数据: BianQue模型和BianQueCorpus将在论文确定发表后发布。

• System Hardware: BianQue is trained on an Ubuntu 20.04.6 LTS server that has 2 "Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz" CPUs, 8 NVIDIA A800-SXM4-80GB GPUs, and 1,024GB of memory.

• 系统硬件: BianQue 在 Ubuntu 20.04.6 LTS 服务器上训练,该服务器配备 2 颗 "Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz" 处理器、8 块 NVIDIA A800-SXM4-80GB GPU 以及 1,024GB 内存。

• Driver Version: The Nvidia driver version is "525.105.17", with CUDA 11.6 and cuDNN 8.4.0.27.

• 驱动版本:Nvidia 驱动版本为 "525.105.17",CUDA 11.6,cuDNN 8.4.0.27。

• Package version: python=3.8.16, torch=1.13.1+cu116, transformers=4.28.0, deepspeed=0.9.3, datasets=2.11.0, and jieba=0.42.1 are recommended. Other dependent packages and versions will be released in our open-source repository.

• 软件包版本:推荐使用 python=3.8.16、torch=1.13.1+cu116、transformers=4.28.0、deepspeed=0.9.3、datasets=2.11.0 以及 jieba=0.42.1。其他依赖包及版本将在我们的开源仓库中发布。

• Model Parameters: BianQue has 6.2B parameters with 28 layers and max sequence length of 2,048. During the inference phase, the model requires at least 14GB of GPU memory.

• 模型参数: BianQue 拥有 62 亿参数、28 层网络结构,最大序列长度为 2048。推理阶段至少需要 14GB 显存。

• Training Time: BianQue is trained with 25,000 global steps and a torch dtype of "float16" on 8 NVIDIA A800-SXM4-80GB GPUs. The training time is about 66 hours.

• 训练时间:BianQue 在 8 块 NVIDIA A800-SXM4-80GB GPU 上以 25,000 全局步数和 "float16" 的 torch dtype 进行训练,耗时约 66 小时。

B Sample Conversations of LLMs

B 大语言模型 (Large Language Model) 的对话示例

The following are examples of health conversation testing in ChatGPT (Figure 5), ChatGLM (Figure 6), and SparkDesk (Figure 7). These are three common Chinese LLMs, but none of them have CoQ capabilities. The above LLMs generally provide reasonable and universal suggestions based on the single-turn instruction provided by users. However, in the real world, doctors often need to conduct multiple turns of questioning with patients in order to provide targeted advice.

以下是ChatGPT (图5)、ChatGLM (图6)和SparkDesk (图7)中的健康对话测试示例。这三个都是常见的中文大语言模型,但均不具备提问链(CoQ)能力。上述大语言模型通常仅根据用户提供的单轮指令给出合理但泛化的建议,而现实场景中医生往往需要与患者进行多轮问诊才能提供针对性建议。

C Sample Conversations of BianQue

C BianQue 示例对话

Figure 8 shows an example of the BianQue model proactively asking questions.

图8展示了BianQue模型主动提问的示例。

孩子九个月,感冒引起的支气管炎和肺炎,现在住院输了两天液,没有一点效果,现在气喘加重,咳嗽,和鼻塞也没减轻,请问我现在应该采取什么样的治疗?


My child is nine months old and has suffered from bronchitis and pneumonia caused by a cold. He has been hospitalized for two days and has received fluid transfusions without any effect. His asthma has worsened, and his cough and nasal congestion have not alleviated. May I ask what treatment he should take now?

我的孩子九个月大,因感冒引发支气管炎和肺炎已住院两天,输液治疗未见效果。目前哮喘加重,咳嗽鼻塞症状未缓解,请问现在应该采取什么治疗方案?

我不是医生,但我可以给你一些建议。对于九个月大的孩子,如果感冒引起了支气管炎和肺炎,你应该尽快寻求医疗帮助。住院治疗是一个正确的决定,因为医生可以监测孩子的病情并提供适当的治疗。


孩子九个月,感冒引起的支气管炎和肺炎,现在住院输了两天液,没有一点效果,现在气喘加重,咳嗽,和鼻塞也没减轻,请问我现在应该采取什么样的治疗?


My child is nine months old and has suffered from bronchitis and pneumonia caused by a cold. He has been hospitalized for two days and has received fluid transfusions without any effect. His asthma has worsened, and his cough and nasal congestion have not alleviated. May I ask what treatment he should take now?

我的孩子九个月大,因感冒引发支气管炎和肺炎,已住院两天并接受输液治疗但未见效。目前哮喘加重,咳嗽和鼻塞症状未缓解,请问现在应采取什么治疗方案?

孩子九个月,感冒引起的支气管炎和肺炎,现在住院输了两天液,没有一点效果,现在气喘加重,咳嗽,和鼻塞也没减轻,请问我现在应该采取什么样的治疗?


My child is nine months old and has suffered from bronchitis and pneumonia caused by a cold. He has been hospitalized for two days and has received fluid transfusions without any effect. His asthma has worsened, and his cough and nasal congestion have not alleviated. May I ask what treatment he should take now?

我的孩子九个月大,因感冒引发了支气管炎和肺炎。已住院两天并接受输液治疗,但未见效果。哮喘症状加重,咳嗽和鼻塞也未缓解。请问现在应该采取什么治疗方案?



Figure 7: A case of a user confiding to SparkDesk.

图 7: 用户向SparkDesk倾诉的案例

I'm sorry to hear that your child is sick. According to the information I have found, bronchitis and pneumonia are both respiratory tract infections, but their symptoms and treatment methods are slightly different.

很遗憾听到您的孩子生病了。根据我查到的信息,支气管炎和肺炎都属于呼吸道感染,但症状和治疗方法略有不同。

We suggest that you consult a doctor in a timely manner to obtain a more accurate diagnosis and treatment plan. Meanwhile, you can pay attention to the following points:

建议您及时就医,以获得更准确的诊断和治疗方案。同时,您可以注意以下几点:

医生你好,我家宝宝感冒了 (Hello doctor, my baby has caught a cold)

您好,宝宝感冒了多久了?(Hello, how long has the baby had the cold?)

宝宝有鼻塞、流涕、咳嗽、咳痰吗?(Does the baby have nasal congestion, a runny nose, cough, or phlegm?)


Figure 8: A case of a user confiding to BianQue.

图 8: 用户向BianQue倾诉的案例
