BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT

BianQue: 通过ChatGPT优化的多轮健康对话平衡健康大语言模型的提问与建议能力

Abstract

摘要

Large language models (LLMs) have performed well in providing general and extensive health suggestions in single-turn conversations, exemplified by systems such as ChatGPT, ChatGLM, ChatDoctor, DoctorGLM, and etc. However, the limited information provided by users during single turn results in inadequate personalization and targeting of the generated suggestions, which requires users to independently select the useful part. It is mainly caused by the missing ability to engage in multi-turn questioning. In real-world medical consultations, doctors usually employ a series of iterative inquiries to comprehend the patient’s condition thoroughly, enabling them to provide effective and personalized suggestions subsequently, which can be defined as chain of questioning (CoQ) for LLMs. To improve the CoQ of LLMs, we propose BianQue, a ChatGLM-based LLM finetuned with the selfconstructed health conversation dataset BianQueCorpus that is consist of multiple turns of questioning and health suggestions polished by ChatGPT. Experimental results demonstrate that the proposed BianQue can simultaneously balance the capabilities of both questioning and health suggestions, which will help promote the research and application of LLMs in the field of proactive health.1

大语言模型(LLM)在单轮对话中提供通用广泛的健康建议方面表现优异，例如ChatGPT、ChatGLM、ChatDoctor、DoctorGLM等系统。然而单轮对话中用户提供的信息有限，导致生成建议的个性化和针对性不足，需要用户自行筛选有效部分。这主要源于模型缺乏多轮追问能力。在实际医疗问诊中，医生通常通过一系列迭代询问来全面了解患者状况，从而后续提供有效且个性化的建议，这种模式可定义为大语言模型的追问链(CoQ)。为提升大语言模型的追问链能力，我们提出基于ChatGLM微调的BianQue模型，其训练数据为自建的健康对话数据集BianQueCorpus，该数据集包含经ChatGPT优化的多轮追问和健康建议。实验结果表明，BianQue能同时平衡追问和健康建议生成能力，这将推动大语言模型在主动健康领域的研究与应用。[20]

1 Introduction

1 引言

Recently, Large language models (LLMs), e.g. ChatGPT (OpenAI, 2022), LLaMA (Touvron et al., 2023), ChatGLM (Zeng et al., 2023), have been extensively applied in various fields. Through high-quality instruction fine-tuning and reinforcement learning based on human feedback (RLHF) (Ouyang et al., 2022), LLMs already have possessed stunning language comprehension, generation, and knowledge reasoning abilities. Overall, users are amazed by the excellent suggestion ability of LLMs.

近年来，大语言模型（LLM）如ChatGPT (OpenAI, 2022)、LLaMA (Touvron et al., 2023)、ChatGLM (Zeng et al., 2023)已被广泛应用于各个领域。通过基于人类反馈的高质量指令微调与强化学习（RLHF）(Ouyang et al., 2022)，大语言模型已具备惊人的语言理解、生成和知识推理能力。总体而言，用户对其出色的建议能力感到惊叹。

However, LLMs are deficient in “questioning” which is an important way to pro actively understand users needs in medical, psychological, educational and other application scenarios. When we engage in healthcare conversations with these LLMs (ChatGPT2, ChatGLM3, SparkDesk4), they do not yet possess the ability to conduct multiple rounds of questioning, as presented in Appendix B. The above LLMs generally provide reasonable and universal suggestions based on the single-turn instruction provided by users. However, in the real world, doctors often need to conduct multiple turns of questioning with patients in order to provide targeted advice, as shown in Figure 1. During the user consultation, the doctor raised different questions in the first 9 turns of conversations to understand the specific situation of the baby. The above multiturn questioning process can be defined as Chain of Questioning (CoQ). We found that the current LLMs lack CoQ capabilities because LLMs lack training data for multiple rounds of questioning during the instruction fine-tuning stage and RLHF stage. When researchers construct instructions and answers, on the one hand, they ignore multiple rounds of conversation history, and on the other hand, answers are usually suggestions rather than questions.

然而，大语言模型在"提问"能力上存在不足，而这是医疗、心理、教育等应用场景中主动理解用户需求的重要方式。当我们与ChatGPT、ChatGLM、SparkDesk等大语言模型进行医疗健康对话时，它们尚不具备如附录B所示的多轮提问能力。上述模型通常仅根据用户单轮指令提供合理但通用的建议。但现实中医生往往需要如图1所示，通过多轮问诊才能给出针对性建议。在用户咨询过程中，医生前9轮对话通过不同提问来了解婴儿具体情况。这种多轮提问过程可定义为提问链(CoQ)。我们发现当前大语言模型缺乏CoQ能力，因为其在指令微调和RLHF阶段缺乏多轮提问的训练数据。研究者在构建指令和答案时，一方面忽略了多轮对话历史，另一方面答案通常为建议而非问题。

At present, research on LLMs in the health field

目前，健康领域大语言模型的研究

mainly focuses on evaluating the performance of existing models, constructing suitable datasets, and fine-tuning instructions. Singhal et al. (2022) proposed a medical Q&A benchmark MultiMedQA for evaluating the clinical knowledge QA abilities of LLMs. Li et al. (2023) constructed a real doctorpatient dialogue dataset HealthCare Magic- $100\mathrm{k}$ , and used it to fine-tune the ChatDoctor based on LLaMA. Similar health LLMs have been released one after another, e.g. BenTsao (本草) (Wang et al., 2023b), ChatGLM-6B-Med (Wang et al., 2023b), DoctorGLM (Xiong et al., 2023), MedAlpaca (Han et al., 2023), Clinical GP T (Wang et al., 2023a) and etc. These models are basically based on the assumption that “users can clearly describe their problems or situations”. Therefore, during the model construction phase, the questioning ability of the model was not considered. Although these models have achieved well performance in the field of medical QA, they do not have the ability to ask users questions.

主要聚焦于评估现有模型性能、构建合适的数据集以及微调指令。Singhal等人(2022)提出了医疗问答基准MultiMedQA，用于评估大语言模型的临床知识问答能力。Li等人(2023)构建了真实医患对话数据集HealthCare Magic-$100\mathrm{k}$，并基于LLaMA微调出ChatDoctor。类似医疗大语言模型相继发布，例如本草(BenTsao)(Wang等人,2023b)、ChatGLM-6B-Med(Wang等人,2023b)、DoctorGLM(Xiong等人,2023)、MedAlpaca(Han等人,2023)、ClinicalGPT(Wang等人,2023a)等。这些模型基本基于"用户能清晰描述自身问题或状况"的假设，因此在模型构建阶段未考虑模型的提问能力。尽管这些模型在医疗问答领域表现优异，但都不具备向用户提问的能力。

To enhance the questioning ability of LLMs, we constructed a multi-turn health conversation dataset named Bian Que Corpus, in which the targets consist of balanced proportional questions $(46.2%)$ and suggestions $(53.8%)$ , as shown in Figure 2. Meanwhile, we present BianQue, a health LLM that is specifically designed for balancing the questioning and suggestion ability. The results on multiturn health conversation dataset demonstrate that BianQue outperforms existing models and ChatGPT, especially in the ability to conduct chain of questioning (CoQ).

为提升大语言模型(LLM)的提问能力，我们构建了名为Bian Que Corpus的多轮健康对话数据集，其中目标问题包含均衡比例的问题$(46.2%)$和建议$(53.8%)$，如图2所示。同时，我们提出了BianQue——一个专为平衡提问与建议能力而设计的健康领域大语言模型。多轮健康对话数据集上的实验结果表明，BianQue在提问链(CoQ)能力方面显著优于现有模型及ChatGPT。

Figure 2: Proportion of questions and suggestions in answers of Bian Que Corpus.

图 2: 扁鹊医典回答中问题与建议的比例分布。

2 Methodology

2 方法论

2.1 Bian Que Corpus: Balancing Questioning and Suggestion

2.1 扁鹊语料库：平衡提问与建议

In the field of Chinese health conversational AI, there are already some multi-turn conversation datasets, e.g. MedDialog-CN (He et al., 2020), IMCS-V2 (Chen et al., 2022), CHIP- MDCFNPC (Zhang et al., 2022), MedDG (Zhang et al., 2022). However, these conversations are often crawled from internet consultation platforms, such as 好大夫5. These datasets are often mixed with a large amount of noise, such as missing content, missing images, reward information, privacy content, incomplete JSON content, website link, website tips, voice recording, text automatically replied by the system, etc. We first collected realworld multi-turn health conversations through data outsourcing services. Then, we performed a twostage data optimization process: (i) We constructed a data automatic cleaning process based on regularized expression to improve the quality of existing conversation datasets. (ii) We designed a polishing prompt (see Figure 4) and use ChatGPT to polish the doctors’ suggestion of multi-turn conversations, because doctors often respond very briefly through internet platforms, lacking detailed analysis and suggestions. The whole construction process of Bian Que Corpus is presented in Figure 3. We ultimately obtained a multi-turn health conversation dataset consisting of 2,437,190 samples, in which the questions accounted for $46.2%$ among the doctors’ answers.

在中文健康对话AI领域，已有一些多轮对话数据集，例如MedDialog-CN (He et al., 2020)、IMCS-V2 (Chen et al., 2022)、CHIP-MDCFNPC (Zhang et al., 2022)、MedDG (Zhang et al., 2022)。然而这些对话通常爬取自网络问诊平台（如好大夫），数据中常混杂大量噪声，包括内容缺失、图片缺失、打赏信息、隐私内容、不完整的JSON内容、网站链接、网站提示、录音文件、系统自动回复文本等。我们首先通过数据外包服务收集了真实世界的多轮健康对话，随后实施了两阶段数据优化流程：(i) 基于正则表达式构建自动化清洗流程以提升现有对话数据集质量；(ii) 设计优化提示词（见图4），利用ChatGPT对多轮对话中医生的建议进行润色，因为网络平台中医生的回复往往过于简略，缺乏详细分析和建议。扁鹊语料库的完整构建流程如图3所示。最终我们获得了包含2,437,190条样本的多轮健康对话数据集，其中患者提问占医生回答总量的$46.2%$。

2.2 BianQue Model

2.2 BianQue 模型

We chose the ChatGLM-6B (Du et al., 2022; Zeng et al., 2023) as the base LLM architecture to construct the BianQue, since it is open source and has excellent Chinese understanding and generation performance. The input of model is defined as:

我们选择ChatGLM-6B (Du et al., 2022; Zeng et al., 2023) 作为构建BianQue的基础大语言模型架构，因为其开源且具备出色的中文理解与生成性能。模型输入定义为:

$i n p u t=u_ {1}^{u}+^{\prime}\backslash n^{\prime}+u_ {1}^{p}+...+u_ {N}^{u}+^{\prime}\backslash n^{\prime}+u_ {N}^{p}$ where $u_ {i}^{u}{=}^{\ast}$ 病人：’ $^+$ utteranc ${{\mathrm{e}_ {i}^{u}}}$ , $u_ {i}^{p}{=}^{\ast}$ 医生： $^{\cdot}+$ utteranc ${\bf e}_ {i}^{p}$ $(i<N)$ , $u_ {N}^{p}=\$ 医生：’, $N$ is the num- ber of dialogue turn.

$i n p u t=u_ {1}^{u}+^{\prime}\backslash n^{\prime}+u_ {1}^{p}+...+u_ {N}^{u}+^{\prime}\backslash n^{\prime}+u_ {N}^{p}$ 其中 $u_ {i}^{u}{=}^{\ast}$ 病人：’ $^+$ 语句 ${{\mathrm{e}_ {i}^{u}}}$ , $u_ {i}^{p}{=}^{\ast}$ 医生： $^{\cdot}+$ 语句 ${\bf e}_ {i}^{p}$ $(i<N)$ , $u_ {N}^{p}=\$ 医生：’, $N$ 表示对话轮次数。

3 Experiments

3 实验

3.1 Baselines and Benchmarks

3.1 基线方法与基准测试

We select ChatGLM $\cdot6\mathbf{B}^{6}$ (Zeng et al., 2023), ChatGPT (gpt-3.5-turbo) (OpenAI, 2022), and DoctorGLM (Xiong et al., 2023) as the baseline models. Comparative experiments were conducted on test set of MedDialog-CN, IMCS-V2, CHIPMDCFNPC and MedDG respectively, since they are multi-turn conversation datasets that have both suggestions and questions in targets.

我们选择 ChatGLM $\cdot6\mathbf{B}^{6}$ (Zeng et al., 2023)、ChatGPT (gpt-3.5-turbo) (OpenAI, 2022) 和 DoctorGLM (Xiong et al., 2023) 作为基线模型。由于 MedDialog-CN、IMCS-V2、CHIPMDCFNPC 和 MedDG 是多轮对话数据集且目标中同时包含建议和问题，我们分别在它们的测试集上进行了对比实验。

3.2 Implementation details

3.2 实现细节

BianQue is finetuned on the proposed BianQueCorpus using the Warm up Decay LR learning rate scheduler with $w a r m u p_ s t e p s=1000$ and warm up max $\mathit{l r}=5e-5$ . During the training stage, the maximum input length is set to 1,536, while the maximum target length is set to 512. A batch size of 80 and global training steps of 25,000 are applied. The decoding algorithms of Top-p sampling with $p=0.75$ and temperature $\tau=0.95$ is applied in the inference stage.

BianQue 在提出的 BianQueCorpus 上进行了微调，采用 Warm up Decay LR 学习率调度器，其中 $warmup_ steps=1000$，最大预热学习率 $\mathit{l r}=5e-5$。训练阶段的最大输入长度设置为 1,536，最大目标长度设置为 512，批处理大小为 80，全局训练步数为 25,000。推理阶段采用 Top-p 采样解码算法，其中 $p=0.75$，温度 $\tau=0.95$。

3.3 Results and Analysis

3.3 结果与分析

Following the Clinical GP T (Wang et al., 2023a), We evaluated BianQue and other models with the metrics: BLEU-1/2/3/4 (Papineni et al., 2002) and ROUGE-1/2/L (Lin, 2004). In addition, we definea new metric to measure the model’s Proactive Questioning ability (PQA):

遵循Clinical GP T (Wang et al., 2023a) 的评估方法，我们采用BLEU-1/2/3/4 (Papineni et al., 2002) 和ROUGE-1/2/L (Lin, 2004) 指标对BianQue及其他模型进行评测。此外，我们定义了一项新指标来衡量模型的主动提问能力 (Proactive Questioning Ability, PQA)：

$$
\begin{array}{l}{{P Q A=\displaystyle\frac{2P_ {q}R_ {q}}{P_ {q}+R_ {q}},}}\ {{P_ {q}=\displaystyle\frac{Q_ {t p}}{Q_ {t p}+Q_ {t\overline{{{p}}}}},R_ {q}=\displaystyle\frac{Q_ {t p}}{Q_ {t p}+Q_ {\bar{t}\overline{{{p}}}}},}}\end{array}
$$

Figure 3: Construction process of Bian Que Corpus dataset and BianQue model.

图 3: Bian Que Corpus数据集与BianQue模型的构建流程。

Figure 4: The prompt used for polishing the suggestions of doctors based on real-world multi-turn conversation context.

图 4: 基于真实多轮对话场景优化医生建议所使用的提示词。

where $Q_ {t p}$ is the number of samples with both target and prediction are question, $Q_ {t\overline{{p}}}$ is the number of samples with question target and suggestion prediction, $Q_ {\bar{t}\bar{p}}$ is the number of samples with both target and prediction are suggestion.

其中 $Q_ {tp}$ 是目标和预测均为问题的样本数量，$Q_ {t\overline{p}}$ 是目标为问题但预测为建议的样本数量，$Q_ {\bar{t}\bar{p}}$ 是目标和预测均为建议的样本数量。

As shown in Table 1, BianQue demonstrates considerable performance on MedDialog-CN, IMCSV2, CHIP-MDCFNPC and MedDG, achieving better scores than other models across all metrics.

如表 1 所示，BianQue 在 MedDialog-CN、IMCSV2、CHIP-MDCFNPC 和 MedDG 数据集上展现出显著性能，所有指标均优于其他模型。

4 Conclusion and Future Work

4 结论与未来工作

In this study, we introduced BianQue, a health LLM with balanced questioning and suggestion ability, which is finetuned based on the proposed large-scale multi-turn health conversation dataset Bian Que Corpus, in which the targets consist of balanced proportional questions $(46.2%)$ and suggestions $(53.8%)$ . The empirical results highlight the superior mulit-turn questioning ability. Future work requires further focus on the conversion mechanism between questioning and suggestion.

本研究介绍了BianQue——一个具备均衡问答与建议能力的健康领域大语言模型，该模型基于我们提出的大规模多轮健康对话数据集Bian Que Corpus微调而成。该数据集中目标内容包含均衡比例的问题（46.2%）和建议（53.8%）。实证结果突显了其卓越的多轮问答能力。未来工作需要进一步聚焦问题与建议之间的转换机制。

Limitations

局限性

It must be emphasized that there are potential risks when using generative language models for health conversation. Doctors in the real world are rigorous in diagnosing diseases and providing medication guidance. However, the current state-of-the-art LLMs (e.g. ChatGPT) still cannot guarantee the accuracy of the text they generate. Therefore, it is necessary to set up inspection and error correction mechanisms for the health suggestions generated by LLMs. At the same time, when LLMs learn the ability to pro actively question, their usage risk also increases, as the models may ask users some questions related to privacy. For example, when users consult AI about cold-related issues, AI may proactively inquire about their age, gender, and other privacy information. Further privacy protection mechanisms need to be considered in the research and application of LLMs. Overall, the methods proposed in this article are still in the early research stage, and the questioning and suggestion mechanisms are not clear enough. The proposed model is limited to academic research and cannot be used in real-world deployment.

必须强调，使用生成式语言模型进行健康对话存在潜在风险。现实中的医生在诊断疾病和提供用药指导时非常严谨，然而当前最先进的大语言模型（如ChatGPT）仍无法保证其生成文本的准确性。因此，有必要对大语言模型生成的健康建议设置检查和纠错机制。同时，当大语言模型学会主动提问的能力时，其使用风险也会增加，因为模型可能会询问用户一些涉及隐私的问题。例如，当用户向AI咨询感冒相关问题时，AI可能会主动询问其年龄、性别等隐私信息。在大语言模型的研究和应用中，需要进一步考虑隐私保护机制。总体而言，本文提出的方法仍处于早期研究阶段，提问和建议机制不够明确。所提出的模型仅限于学术研究，无法在实际场景中部署。

Ethics Statement

伦理声明

The Bianque model is committed to improving the proactive questioning ability of LLMs, rather than providing very professional medical diagnosis or advice. The multi-turn conversation dataset used in this study is mainly based on the real world doctorpatient conversations, which has gone through a strict data cleansing process to eliminate private information and dirty text content. To this end, we constructed 50 regular expressions and used the re package for filtering. We compared the data quality before and after data cleansing, and the excellent rate increased from $82%$ to $93%$ . Due to the lack of human feedback during the model finetuning stage, the current version of the model may involve user privacy when asking questions, which is particularly important to note. On the other hand, the health recommendations generated by the model have not undergone rigorous examination and proofreading, and therefore cannot be used as a substitute for real-world doctors. We emphasize that this is an early research-oriented model, rather than a mature and directly applicable model. Therefore, future work needs to combine RLHF to improve the safety level of model generated questions or suggestions. Besides, when Bianque is applied to downstream scenarios, it is necessary to inform the users in advance that the answers they see are generated by health AI, which are for reference only.

扁鹊(Bianque)模型致力于提升大语言模型的主动提问能力，而非提供非常专业的医疗诊断或建议。本研究使用的多轮对话数据集主要基于真实世界医患对话，经过严格的数据清洗流程以消除隐私信息和脏文本内容。为此，我们构建了50个正则表达式并使用re包进行过滤。对比清洗前后的数据质量，优秀率从$82%$提升至$93%$。由于模型微调阶段缺乏人类反馈，当前版本模型在提问时可能涉及用户隐私，这一点需要特别注意。另一方面，模型生成的健康建议未经严格审查校对，因此不能替代真实医生。我们强调这是一个早期研究导向的模型，而非成熟可直接应用的模型。因此未来工作需要结合RLHF来提升模型生成问题或建议的安全水平。此外，当扁鹊应用于下游场景时，需提前告知用户所见回答由健康AI生成，仅供参考。

Table 1: Evaluation results.

Dataset	Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	R-1	R-2	R-L	PQA
MedDialog -CN	ChatGLM-6B	7.28	3.72	2.10	1.23	10.86	0.92	7.43	0.20
	DoctorGLM	10.39	5.06	2.94	1.80	13.27	1.04	11.17	0.01
	ChatGPT	7.61	3.90	2.21	1.30	11.11	0.96	7.82	0.28
	BianQue	11.12	6.50	4.42	3.10	15.55	2.15	12.96	0.53
IMCS-V2	ChatGLM-6B	6.83	3.61	2.12	1.30	10.24	1.03	7.26	0.36
	DoctorGLM	8.38	4.22	2.52	1.55	11.87	0.95	9.22	0.06
	ChatGPT	8.46	4.54	2.71	1.70	11.48	1.29	8.97	0.38
	BianQue	14.50	10.16	7.85	6.23	21.73	6.24	19.09	0.70
CHIP- MDCFNPC	ChatGLM-6B	6.22	3.11	1.81	1.10	9.62	0.85	0.67	0.35
	DoctorGLM	8.59	4.33	2.68	1.71	12.05	1.11	9.68	0.05
	ChatGPT	7.52	3.74	2.20	1.36	10.51	0.97	8.03	0.38
	BianQue	13.41	8.49	6.05	4.42	19.00	3.99	16.56	0.57
MedDG	ChatGLM-6B	4.76	2.31	1.34	0.81	7.35	0.56	5.06	0.47
	DoctorGLM	6.87	3.47	2.15	1.35	9.62	0.88	7.61	0.09
	ChatGPT	5.11	2.41	1.38	0.83	7.58	0.50	5.46	0.63
	BianQue	14.86	10.43	8.09	6.37	21.56	6.46	19.56	0.81

表 1: 评估结果

Dataset	Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	R-1	R-2	R-L	PQA
MedDialog-CN	ChatGLM-6B	7.28	3.72	2.10	1.23	10.86	0.92	7.43	0.20
	DoctorGLM	10.39	5.06	2.94	1.80	13.27	1.04	11.17	0.01
	ChatGPT	7.61	3.90	2.21	1.30	11.11	0.96	7.82	0.28
	BianQue	11.12	6.50	4.42	3.10	15.55	2.15	12.96	0.53
IMCS-V2	ChatGLM-6B	6.83	3.61	2.12	1.30	10.24	1.03	7.26	0.36
	DoctorGLM	8.38	4.22	2.52	1.55	11.87	0.95	9.22	0.06
	ChatGPT	8.46	4.54	2.71	1.70	11.48	1.29	8.97	0.38
	BianQue	14.50	10.16	7.85	6.23	21.73	6.24	19.09	0.70
CHIP-MDCFNPC	ChatGLM-6B	6.22	3.11	1.81	1.10	9.62	0.85	0.67	0.35
	DoctorGLM	8.59	4.33	2.68	1.71	12.05	1.11	9.68	0.05
	ChatGPT	7.52	3.74	2.20	1.36	10.51	0.97	8.03	0.38
	BianQue	13.41	8.49	6.05	4.42	19.00	3.99	16.56	0.57
MedDG	ChatGLM-6B	4.76	2.31	1.34	0.81	7.35	0.56	5.06	0.47
	DoctorGLM	6.87	3.47	2.15	1.35	9.62	0.88	7.61	0.09
	ChatGPT	5.11	2.41	1.38	0.83	7.58	0.50	5.46	0.63
	BianQue	14.86	10.43	8.09	6.37	21.56	6.46	19.56	0.81

Acknowledgements

致谢

This work was supported by the Science and Technology Project of Guangzhou (202103010002), the Natural Science Foundation of Guangdong Province (2022 A 1515011588), the National Key

本研究由广州市科技计划项目(202103010002)、广东省自然科学基金(2022A1

[论文翻译]BianQue: 通过ChatGPT优化的多轮健康对话平衡健康大语言模型的提问与建议能力

原文地址：https://arxiv.org/pdf/2310.15896