[论文翻译]用于精神障碍检测的少样本学习:一种结合医学知识注入的连续多提示工程方法


原文地址:https://arxiv.org/abs/2401.12988


ABSTRACT

摘要

This study harnesses state-of-the-art AI technology for detecting mental disorders through user-generated textual content. Existing studies typically rely on fully supervised machine learning, which presents challenges such as the labor-intensive manual process of annotating extensive training data for each research problem and the need to design specialized deep learning architectures for each task. We propose a novel method to address these challenges by leveraging large language models and continuous multi-prompt engineering, which offers two key advantages: (1) developing personalized prompts that capture each user's unique characteristics and (2) integrating structured medical knowledge into prompts to provide context for disease detection and facilitate predictive modeling. We evaluate our method using three widely prevalent mental disorders as research cases. Our method significantly outperforms existing methods, including feature engineering, architecture engineering, and discrete prompt engineering. Meanwhile, our approach demonstrates success in few-shot learning, i.e., requiring only a minimal number of training examples. Moreover, our method can be generalized to other rare mental disorder detection tasks with few positive labels. In addition to its technical contributions, our method has the potential to enhance the well-being of individuals with mental disorders and offer a cost-effective, accessible alternative for stakeholders beyond traditional mental disorder screening methods.

本研究利用前沿AI技术,通过用户生成文本内容检测精神障碍。现有研究通常依赖全监督机器学习,存在两大挑战:(1) 为每个研究问题标注大量训练数据需要耗费大量人工;(2) 需为每项任务设计专用深度学习架构。我们提出创新方法,通过大语言模型和连续多提示工程解决这些问题,其优势在于:(1) 开发能捕捉用户独特特征的个性化提示;(2) 将结构化医学知识融入提示,为疾病检测提供背景并辅助预测建模。我们选取三种高发精神障碍作为研究案例进行评估,本方法显著优于特征工程、架构工程和离散提示工程等现有方法。同时,该方法在少样本学习场景中表现优异,仅需极少量训练样本。此外,本方法可推广至其他阳性标签稀缺的罕见精神障碍检测任务。除技术贡献外,该方法有望提升精神障碍患者福祉,并为利益相关方提供比传统筛查更经济、可及的替代方案。

Keywords: prompt engineering; large language model; machine learning; computational design science; mental health management

关键词:提示工程 (Prompt Engineering);大语言模型 (Large Language Model);机器学习 (Machine Learning);计算设计科学 (Computational Design Science);心理健康管理 (Mental Health Management)

1. INTRODUCTION

1. 引言

Mental disorders are a major global health burden. There are over 150 recognized core mental health conditions (APA 2022), and approximately 1 in 8 people worldwide (970 million) live with a mental disorder (WHO 2022a). Recent studies indicate a significant 25% increase in anxiety and depression, two key mental disorders, since the onset of the COVID-19 pandemic in 2020 (WHO 2022a). More importantly, despite considerable research efforts, mental disorders are difficult to detect for reasons such as the lack of a reliable laboratory test for diagnosis and insufficient behavioral data in electronic health records (EHR) for effective detection (WHO 2022b). Moreover, mental disorders continue to be underdiagnosed due to factors such as lack of awareness of their symptoms, myths and misunderstandings, stigma leading individuals to hide their issues and delay seeking help, and barriers to healthcare access (Patel et al. 2018). Using user-generated content as a supplement to existing mental disorder screening methods is considered a promising approach to combating mental disorders and has far-reaching health and societal implications (Guntuku et al. 2017, Wongkoblap et al. 2017). Information systems (IS) research closely follows this direction and emphasizes the use of user-generated content for mental disorder detection (Chau et al. 2020, D. Zhang et al. 2024, W. Zhang et al. 2024).

精神疾病是全球主要的健康负担。目前公认的核心心理健康问题超过150种(APA 2022),全球约八分之一人口(9.7亿)患有精神疾病(WHO 2022a)。最新研究表明,自2020年新冠疫情爆发以来,焦虑症和抑郁症这两种关键精神障碍的发病率显著上升了25%(WHO 2022a)。更重要的是,尽管研究投入巨大,精神疾病仍难以检测——原因包括缺乏可靠的实验室诊断测试,以及电子健康档案(EHR)中行为数据不足导致有效检测困难(WHO 2022b)。此外,由于症状认知不足、误解与偏见使患者隐瞒问题并延误求助,以及医疗资源获取障碍等因素(Patel et al. 2018),精神疾病仍存在普遍漏诊现象。利用用户生成内容作为现有精神疾病筛查方法的补充手段,被视为对抗精神疾病的有效途径,具有深远的健康与社会意义(Guntuku et al. 2017, Wongkoblap et al. 2017)。信息系统(IS)研究密切关注这一方向,强调利用用户生成内容进行精神疾病检测(Chau et al. 2020, D. Zhang et al. 2024, W. Zhang et al. 2024)。

Previous work in data-driven healthcare studies demonstrates that patients with chronic diseases, including mental disorders, consistently share their symptoms, life events associated with their conditions, and details of their treatments through user-generated textual content online (Abbasi et al. 2019, Chau et al. 2020, Zhang and Ram 2020). Among the various healthcare studies that leverage user-generated content, research on mental disorders is particularly well-suited for utilizing this type of data because the current diagnosis of mental disorders relies on self-reported symptoms and life events in natural languages (APA 2022); and decades of scientific research have shown that user-generated text can reveal people's psychological states (e.g., emotions, moods) as well as behaviors and activities often associated with mental disorders (Tausczik and Pennebaker 2010, W. Zhang et al. 2024). Hence, using AI to analyze user-generated content holds great potential for enhancing the detection and management of mental disorders by extracting valuable insights from individuals firsthand experiences (Bardhan et al. 2020). For instance, online platforms can leverage mental disorder detection techniques to develop new services featuring personalized recommendations for users (e.g., encouraging individuals to seek help and treatments, promoting educational content and tools, offering treatment options, and fostering social support). Public administration can employ these techniques to strategically allocate resources to areas with high incidence rates, thereby enhancing the overall effectiveness of chronic disease programs. Policymakers can monitor large-scale user-generated textual content and facilitate the creation of evidence-based policies tailored to the specific needs of different patient cohorts.

先前在数据驱动的医疗健康研究中发现,慢性疾病(包括精神障碍)患者会持续通过在线用户生成文本内容分享症状、病情相关生活事件及治疗细节(Abbasi等2019,Chau等2020,Zhang和Ram2020)。在利用用户生成内容的各类医疗研究中,精神障碍研究特别适合采用此类数据,因为当前精神障碍诊断依赖于自然语言描述的自我报告症状和生活事件(APA2022);数十年科学研究表明,用户生成文本能揭示人们的心理状态(如情绪、心境)以及常与精神障碍相关的行为活动(Tausczik和Pennebaker2010,W.Zhang等2024)。因此,运用AI分析用户生成内容,通过从个体第一手经验中提取有价值洞见,对提升精神障碍检测与管理具有巨大潜力(Bardhan等2020)。例如,在线平台可运用精神障碍检测技术开发新服务,为用户提供个性化推荐(如鼓励寻求帮助和治疗、推送教育内容与工具、提供治疗方案、促进社会支持);公共管理部门可借此策略性地向高发地区分配资源,从而提升慢性病防治整体效能;政策制定者能通过监测大规模用户生成文本,推动制定基于证据、针对不同患者群体需求的精准政策。

However, existing methods for detecting mental disorders through user-generated content show limitations in real-world applicability and generalizability due to their heavy reliance on fully supervised learning. Each mental disorder has unique characteristics and features; these distinctions can include variations in the symptoms patients report, the life events that lead to the development or progression of the disease, and the specific treatments applied. As a result, researchers must follow a labor-intensive process to analyze and predict outcomes for these chronic diseases. For instance, they must collect and label data, creating a dataset in which examples are categorized according to the specific disease they relate to. However, a dataset built for one chronic disease is difficult to reuse for another under a fully supervised learning model, so multiple datasets must be created for different mental disorders. Furthermore, a customized machine learning model has to be designed and fine-tuned for each individual disease, which involves meticulously optimizing the algorithms, parameters, and features to make the model effective for that particular disease. This process is highly costly in terms of time, effort, and resources, significantly hampering the applicability of the resulting prediction model.

然而,现有通过用户生成内容研究心理障碍的方法因严重依赖全监督学习,在实际应用和泛化能力方面存在局限。每种心理障碍都具有独特的特征,这些差异可能包括患者报告的症状差异、导致疾病发生或进展的生活事件差异,以及所采用的具体治疗方式差异。因此,研究人员必须遵循劳动密集型流程来分析预测这些慢性疾病的结果。例如:需要收集标注数据,创建按相关疾病分类识别的数据集。但基于全监督学习模型难以将特定慢性疾病数据集复用于其他疾病,导致需要为不同心理障碍创建多个数据集。此外,必须为每种疾病单独设计定制机器学习模型并进行微调,这涉及精心优化算法、参数和特征以使模型对该疾病有效。这一过程在时间、人力和资源方面成本极高,严重阻碍了所生成预测模型的实际应用。

This study aims to address this research gap by developing a generalizable and adaptable method capable of detecting multiple mental disorders without constructing a large amount of training data or designing a customized model for each disease. With the emergence of LLMs in AI and their remarkable abilities across various downstream tasks, the learning paradigms in NLP-related tasks have evolved from traditional feature engineering and architecture engineering to more advanced learning paradigms, including fine-tuning and prompt engineering (Liu et al. 2023). The foundation for such an approach is grounded in LLMs that have already employed extensive training data, computing power, and algorithmic capabilities to achieve highly promising results across various domains, surpassing the performance of individually trained, task-specific models. Given these technological advances, leveraging the immense potential of LLMs and prompt engineering for mental disorder detection using user-generated content can minimize the cost of training and developing disease- or problem-specific models. This approach offers a more effective and efficient alternative than traditional methods.

本研究旨在通过开发一种通用且可适应的方法来填补这一研究空白,该方法能够检测多种精神障碍,而无需构建大量训练数据或为每种疾病设计定制模型。随着大语言模型(LLM)在人工智能领域的出现及其在各种下游任务中的卓越表现,自然语言处理相关任务的学习范式已从传统的特征工程和架构工程演变为更先进的学习范式,包括微调(fine-tuning)和提示工程(prompt engineering) (Liu et al. 2023)。这种方法的基础在于大语言模型已经通过大量训练数据、计算能力和算法能力,在各个领域取得了超越单独训练的特定任务模型的优异表现。鉴于这些技术进步,利用大语言模型和提示工程的巨大潜力,通过用户生成内容进行精神障碍检测,可以最大限度地降低训练和开发针对特定疾病或问题模型的成本。这种方法比传统方法提供了更有效、更高效的替代方案。

Nevertheless, challenges remain in utilizing prompt engineering and LLMs for mental disorder detection. Most current studies on prompt engineering in IS focus on discrete prompts: how to tweak the natural language used in prompts for downstream tasks. However, we argue that this approach is not the most effective in the context of mental disorder detection: a binary classification task where classification performance is paramount, and human comprehension of the prompt itself is not crucial. Moreover, detecting mental disorders using user-generated content presents two unique challenges. (1) Detecting heterogeneity among various diseases and how each individual presents their conditions poses significant challenges for prompt engineering. Different mental disorders exhibit distinct characteristics, including unique symptoms, risk factors, and treatment approaches, all of which shape the content of user-generated material. Additionally, each patient has a unique persona, characterized by individual patterns, habits, and disease progression, which collectively influence the creation of user-generated content. As a result, user-generated content can be lengthy, noisy, and highly complex, making it challenging for LLMs to efficiently detect various mental disorders at the subject level. This task extends beyond simply identifying explicit mentions of mental disorders and presents significant challenges for LLMs in comprehending nuances, extracting, understanding, and inferring the implicit information related to mental disorders. (2) Incorporating structured medical domain knowledge into prompt engineering to enhance an LLM's predictive performance remains under-explored. Within the medical domain, there is a wealth of medical knowledge that can be closely linked to the content reported in user-generated content and pertains to mental disorders, which can provide significant assistance in employing LLMs for mental disorder detection. However, existing medical knowledge is often organized in tree or network structures (e.g., ontologies, one of the most prevalent forms of domain knowledge). Current discrete prompt methods, which refine questions in natural language, lack effective mechanisms to leverage such structured knowledge.

然而,在利用提示工程和大语言模型进行精神障碍检测方面仍存在挑战。目前信息系统领域关于提示工程的研究大多集中于离散提示(discrete prompts)——即如何调整自然语言提示以适应下游任务。但我们认为,这种方法在精神障碍检测(二元分类任务)中并非最优解:当分类性能至关重要而人类对提示本身的理解无关紧要时,现有方法存在局限。此外,利用用户生成内容检测精神障碍面临两大独特挑战:(1) 疾病异质性与个体表现差异对提示工程构成重大挑战。不同精神障碍具有独特的症状特征、风险因素和治疗方式,这些都会影响用户生成内容。同时,每位患者的个人特质(包括行为模式、习惯和病程发展)会共同塑造其生成内容。这导致用户生成内容往往冗长、含噪且高度复杂,使得大语言模型难以在个体层面有效检测多种精神障碍——该任务不仅需要识别明确提及的精神障碍,更要求模型理解细微差异、提取并推断与精神障碍相关的隐含信息。(2) 如何将结构化医学知识融入提示工程以提升大语言模型预测性能仍待探索。医学领域存在大量与用户生成内容密切相关且涉及精神障碍的专业知识,这些知识能显著提升检测效果。但现有医学知识通常以树状或网状结构组织(如本体论这类典型领域知识形式),而当前基于自然语言问题优化的离散提示方法缺乏有效利用此类结构化知识的机制。

These two unique challenges in detecting mental disorders using user-generated content motivate us to propose a novel method that aims to: (1) maximize the performance of binary classification for mental disorder detection; (2) account for significant individual differences in both the disorder and the individuals during the prompt engineering process to improve predictive performance; and (3) enhance the effectiveness of LLM predictions by injecting medical knowledge in the form of ontologies during the prompting process. Specifically, our proposed framework utilizes continuous ensemble prompt engineering techniques to interact with LLMs and generate accurate mental disorder detection results. We incorporate prefix-tuning to create personalized prompts tailored to individual patients. Additionally, to account for the unique characteristics of each mental disorder and leverage medical knowledge, we integrate a novel rule-based prompting method that incorporates disease-related medical ontologies.

利用用户生成内容检测心理障碍时面临的这两大独特挑战,促使我们提出一种新方法,旨在:(1) 最大化心理障碍检测二元分类的性能;(2) 在提示工程过程中兼顾障碍与个体的显著个体差异,以提升预测性能;(3) 通过在提示过程中注入本体形式的医学知识,增强大语言模型预测的有效性。具体而言,我们提出的框架采用连续集成提示工程技术与大语言模型交互,生成准确的心理障碍检测结果。通过融入前缀调优技术,我们为个体患者创建个性化提示。此外,为兼顾每种心理障碍的独特性并利用医学知识,我们整合了一种新型基于规则的提示方法,该方法融合了疾病相关的医学本体。

Our key contributions are twofold. From the healthcare domain perspective, we propose a novel approach using prompt engineering and LLMs for the detection of mental disorders through user-generated textual content and achieve few-shot learning. The key advantage lies in eliminating the need for a substantial amount of labeled training data or customized architecture engineering for each specific disease or research problem. From the methodology perspective, we have two innovations. (1) We propose an ensemble prompt method, synergizing prefix tuning and rule-based prompt engineering to address challenges in healthcare: personalized prompts and medical knowledge injection, which enhance method accuracy and efficacy. (2) We propose a new rule-based prompt method that efficiently tackles complex detection problems, integrating ontology-format domain knowledge, and its design principles can be extended to other problem domains, maximizing the potential of LLMs for real-world problem-solving.

我们的核心贡献体现在两方面。从医疗领域视角,我们提出了一种创新方法,通过提示工程 (prompt engineering) 和大语言模型,基于用户生成文本内容实现精神障碍检测,并达成少样本学习。其核心优势在于无需为每种特定疾病或研究问题准备大量标注训练数据或定制架构工程。从方法论视角,我们有两项创新:(1) 提出集成提示方法,协同前缀调优 (prefix tuning) 与基于规则的提示工程,解决医疗领域两大挑战:个性化提示与医学知识注入,从而提升方法准确性与有效性;(2) 提出新型基于规则的提示方法,通过整合本体格式领域知识高效解决复杂检测问题,其设计原则可拓展至其他问题领域,最大化释放大语言模型解决现实问题的潜力。

We position our work as computational design science research (Gregor and Hevner 2013, Hevner et al. 2004). In the context of machine learning in IS research (Padmanabhan et al. 2022), our work represents a Type I contribution focused on method development. Specifically, we introduce a new continuous ensemble prompt engineering method for personalized context and medical knowledge injection. Given that mental disorders are one of the major contributors to the overall global disease burden, our approach addresses this societal challenge by providing a tailored machine learning framework along with accompanying algorithms (Padmanabhan et al. 2022). In line with design research pathways for artificial intelligence (Abbasi et al. 2024), our "artifact typology" is a new "predictive model". The "abstraction spectrum" of our work includes: (1) emphasizing individual differences in the prompt-based prediction process to enhance accuracy, and (2) incorporating existing domain knowledge in ontology format into the prompting process to significantly improve performance. Both contribute valuable "salient design insights" for future research.

我们将本研究定位为计算设计科学研究 (Gregor and Hevner 2013, Hevner et al. 2004)。在信息系统研究中机器学习应用的背景下 (Padmanabhan et al. 2022),我们的工作属于专注于方法开发的I类贡献。具体而言,我们提出了一种新型连续集成提示工程方法,用于个性化上下文和医学知识注入。鉴于精神障碍是全球疾病负担的主要诱因之一,我们的方法通过提供定制化机器学习框架及配套算法 (Padmanabhan et al. 2022) 来应对这一社会挑战。遵循人工智能领域的设计研究路径 (Abbasi et al., 2024),我们的"人工制品类型学"是一种新型"预测模型"。本研究的"抽象谱系"包括:(1) 在基于提示的预测过程中强调个体差异以提升准确性,(2) 将以本体形式存在的领域知识融入提示过程可显著提升性能。二者均为未来研究提供了宝贵的"关键设计洞见"。

Practically, our work has significant implications for mental disorder detection. It provides an accurate detection method that can provide complementary information to existing mental disorder screening procedures. For public health management, our method enables large-scale analyses of a population's mental health beyond what has previously been possible with traditional methods.

实践中,我们的研究对精神障碍检测具有重大意义。它提供了一种精准的检测方法,可为现有精神障碍筛查流程提供补充信息。在公共卫生管理层面,该方法实现了超越传统手段的大规模人群心理健康分析能力。

2. RELATED WORK

2. 相关工作

Our work aims to leverage user-generated text content for detecting mental disorders, framing it as a binary classification problem. We also propose a novel prompting method that utilizes LLMs and overcomes the limitations of current mental disorder detection approaches, which often rely heavily on large amounts of labeled training data. We begin by reviewing the evolution of supervised machine learning techniques in natural language processing (NLP) and explaining why prompt-based methods hold promise in this research area. Next, we provide an overview of existing prompting techniques, justifying the motivations behind our research design and emphasizing the novelty of our proposed method.

我们的工作旨在利用用户生成的文本内容检测心理障碍,并将其视为二元分类问题。我们还提出了一种新颖的提示方法,利用大语言模型克服当前心理障碍检测方法的局限性,这些方法通常严重依赖大量带标签的训练数据。首先,我们回顾了自然语言处理 (NLP) 中监督机器学习技术的演变,并解释了为什么基于提示的方法在该研究领域具有前景。接着,我们概述了现有的提示技术,论证了研究设计背后的动机,并强调了我们提出的方法的新颖性。

2.1. Paradigms in NLP-related Supervised Machine Learning

2.1. NLP相关监督式机器学习范式

Supervised learning, a subcategory of machine learning and AI, has found extensive applications across diverse domains, facilitating tasks such as classifications, detections, and predictions. It is characterized by its use of labeled training datasets to supervise algorithms that produce outcomes accurately. In NLP (i.e., textual content-related machine learning), supervised learning has its paradigms and has evolved through various stages (Figure 1): from feature engineering and architecture engineering to pre-training and fine-tuning, and finally, to pre-training and prompt engineering (Liu et al. 2023).

监督学习 (Supervised learning) 作为机器学习和人工智能的子领域,已在分类、检测和预测等任务中展现出广泛应用。其核心特征是通过标注训练数据集来指导算法生成准确结果。在自然语言处理 (NLP) (即与文本内容相关的机器学习)领域,监督学习形成了独特范式并经历了多个演进阶段 (图 1):从特征工程与架构工程,到预训练与微调,最终发展为预训练与提示工程 (Liu et al. 2023)。

Until recently, most studies have focused on fully supervised learning. Since fully supervised learning requires a substantial amount of labeled data to train high-performing models, and large-scale labeled data for specific NLP or healthcare-related tasks are limited, researchers primarily focused on feature engineering before the advent of deep learning. Feature engineering involves extracting meaningful features from data using domain knowledge. For instance, Chau et al. (2020) focus on identifying emotional distress in user-generated content by employing a combination of feature extraction, feature selection, rules derived from domain experts, and machine learning classification.

直到最近,大多数研究都集中在全监督学习上。由于全监督学习需要大量标注数据来训练高性能模型,而针对特定自然语言处理(NLP)或医疗健康相关任务的大规模标注数据有限,在深度学习兴起之前,研究人员主要专注于特征工程。特征工程指利用领域知识从数据中提取有意义的特征。例如,Chau等人(2020)通过结合特征提取、特征选择、领域专家推导的规则与机器学习分类,来识别用户生成内容中的情绪困扰。

With the emergence of deep learning, which has the capacity to automatically extract features from data without feature engineering, researchers shifted their focus to model architecture engineering. These approaches involve designing appropriate deep learning structures to introduce inductive biases into models, facilitating the learning of useful features. A notable work is Yang et al. (2022), in which the authors develop a deep learning architecture for personality detection using user-generated content. Their research design is deliberately crafted to incorporate advanced deep learning architecture engineering, including transfer learning and hierarchical attention network architectures, alongside concepts from relevant psycholinguistic theories.

随着深度学习(Deep Learning)的出现,其能够无需特征工程即可从数据中自动提取特征,研究人员将注意力转向了模型架构工程。这些方法通过设计合适的深度学习结构,为模型引入归纳偏置,从而促进有用特征的学习。Yang等人(2022)的研究是其中的代表性工作,作者利用用户生成内容开发了一种用于人格检测的深度学习架构。他们的研究设计精心结合了先进的深度学习架构工程,包括迁移学习和分层注意力网络架构,同时融合了相关心理语言学理论的概念。


Figure 1. Different NLP Supervised Learning Paradigms and Their Key Scholarly Contributions

图 1: 不同NLP监督学习范式及其关键学术贡献

Since 2018, NLP-related machine learning models have transitioned to a new paradigm known as pre-train and fine-tune (Devlin et al. 2018), in which a fixed-architecture language model (e.g., BERT, T5, and GPT) is pre-trained on a massive amount of text data. Pre-training typically involves tasks such as completing contextual sentences (e.g., fill-in-the-blank tasks), which do not require expert knowledge and can be performed directly on pre-existing large-scale data (i.e., self-supervised learning). The pre-trained model is then adapted to downstream tasks by fine-tuning (i.e., introducing additional parameters). This shift led researchers to focus on objective engineering, designing better objective functions for both pre-training and fine-tuning tasks (Sanh et al. 2020).

自2018年起,NLP相关的机器学习模型转向了一种称为预训练与微调(pre-train and fine-tune)的新范式(Devlin等人2018):采用固定架构的语言模型(例如BERT、T5和GPT)可以在海量文本数据上进行预训练。预训练通常涉及完成上下文句子等任务(例如填空任务),这些任务不需要专业知识,可以直接在已有的大规模数据上执行(即自监督学习)。随后通过微调(即引入额外参数)使预训练模型适配下游任务。这一转变促使研究者聚焦于目标工程,涉及为预训练和微调任务设计更好的目标函数 (Sanh et al. 2020)。
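The fill-in-the-blank style of self-supervised pre-training described above can be illustrated with a toy masking routine (a simplified sketch; real systems use subword tokenizers and refinements such as BERT's 80/10/10 replacement scheme, which are omitted here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace tokens with [MASK]; the originals become the
    self-supervised prediction targets (no expert labels required)."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # position -> original token the model must recover
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("i missed the bus today".split())
# `targets` pairs each masked position with the word to predict
```

Because the targets come from the text itself, this objective scales to arbitrarily large unlabeled corpora, which is what makes the pre-train step feasible.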

Table 1. Example of Prompts in NLP Tasks

| Task | Original input | Prompt | Input to an LLM | Feedback from an LLM |
| --- | --- | --- | --- | --- |
| Sentiment prediction | I missed the bus today. | [X] I felt so [mask]. | I missed the bus today. I felt so [mask]. | The LLM fills in the [mask] with an emotion word, e.g., "frustrating." |
| Translation | I missed the bus today. | English: [X] French: [mask] | English: I missed the bus today. French: [mask] | The LLM fills in the [mask] with the corresponding French sentence, e.g., "J'ai raté le bus aujourd'hui." |

表 1: NLP任务中的提示词示例

| 任务 | 原始输入 | 提示词 | 大语言模型输入 | 大语言模型反馈 |
| --- | --- | --- | --- | --- |
| 情感预测 | I missed the bus today. | [X] I felt so [mask]. | I missed the bus today. I felt so [mask]. | 模型用情感词(如 "frustrating")填充 [mask]。 |
| 翻译 | I missed the bus today. | English: [X] French: [mask] | English: I missed the bus today. French: [mask] | 模型用对应的法语句子(如 "J'ai raté le bus aujourd'hui")填充 [mask]。 |

During the process of objective engineering, using different prompts to frame the same input (templates $T$ surrounding the input) could facilitate various tasks and improve predictive performance. Table 1 provides examples using discrete prompts (natural language). However, it is important to note that the templates $T$ surrounding the input can also be continuous prompts (i.e., numeric vectors). Subsequently, it was discovered that even using different prompts for the same task can result in a variance in the prediction performance. That is, not only how the language model is trained but also how the prompt is designed can have a significant impact on the performance. Therefore, many researchers have shifted their focus to prompt engineering, exploring the design of effective prompts for downstream tasks (Liu et al. 2023).

在目标工程过程中,使用不同的提示词(围绕输入的模板$T$)可以促进多种任务并提升预测性能。表1展示了使用离散提示(自然语言)的示例。但需注意,围绕输入的模板$T$也可以是连续提示(即数值向量)。后续研究发现,即使针对同一任务使用不同提示词,也会导致预测性能的波动。这表明不仅大语言模型的训练方式,提示词的设计同样对性能有重大影响。因此,许多研究者将重心转向提示工程,探索针对下游任务的有效提示设计(Liu et al. 2023)。
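The templating step behind Table 1 can be sketched as a prompting function that wraps the original input with a task-specific template (a minimal illustration with hypothetical template strings; a continuous prompt would instead use numeric vectors in the embedding space):

```python
# Hypothetical discrete templates; "{x}" marks where the original input goes.
TEMPLATES = {
    "sentiment": "{x} I felt so [mask].",
    "translation": "English: {x} French: [mask]",
}

def f_prompt(x, task="sentiment"):
    """Wrap the original input x with the template T for the task,
    producing the new LLM input x' that contains an unfilled [mask] slot."""
    return TEMPLATES[task].format(x=x)

print(f_prompt("I missed the bus today."))
# -> I missed the bus today. I felt so [mask].
```

Swapping the template while keeping the input fixed is exactly what produces the prompt-to-prompt performance variance noted above, which motivates engineering the templates themselves.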

Although both fine-tuning and prompt engineering leverage the capabilities of LLMs for various downstream prediction tasks, researchers have found that prompt engineering can outperform fine-tuning in terms of predictions. This is because the objectives of pre-training of LLMs (e.g., masked language modeling) and fine-tuning (e.g., binary classification) are not always aligned, making it difficult to fully exploit the knowledge embedded in pre-trained LLMs, which can lead to suboptimal performance on downstream tasks (Wang et al. 2022). Moreover, empirical evidence suggests that prompt learning is computationally efficient, as it is well-suited for zero-shot or few-shot learning scenarios where data is limited (Gao et al. 2021).

尽管微调 (fine-tuning) 和提示工程 (prompt engineering) 都能利用大语言模型的能力完成各种下游预测任务,但研究人员发现提示工程在预测效果上可能优于微调。这是因为大语言模型的预训练目标(如掩码语言建模)与微调目标(如二元分类)并不总是一致,导致难以充分挖掘预训练模型中嵌入的知识,从而影响下游任务性能 (Wang et al. 2022)。此外,实证研究表明提示学习具有计算效率优势,尤其适合数据有限的零样本或少样本学习场景 (Gao et al. 2021)。

2.2. Prompt Engineering

2.2. 提示工程

We first elucidate the key differences between prompt engineering and other NLP-related supervised learning paradigms. Feature engineering, architecture engineering, and fine-tuning share a common pattern: training a machine learning model to process labeled training examples $(x, y)$ and predict an output $y$ as $p(\boldsymbol{y}|\boldsymbol{x})$. In contrast, prompt engineering follows a distinct learning process: using an LLM, it directly models the probability of an outcome $z$ (see Table 1: Feedback from an LLM). To leverage these models for prediction tasks, the original input $x$ undergoes modification using a template $T$ to create a new input $x'$ (i.e., the prompt). The template $T$ can be either discrete (natural language) or continuous (numeric vectors).

我们首先阐明提示工程 (prompt engineering) 与其他NLP相关监督学习范式之间的关键区别。特征工程、架构工程和微调遵循相同范式:通过训练机器学习模型处理带标签的训练样本 ( $x,y)$ 并预测输出 $y$ 作为 $p(\boldsymbol{y}|\boldsymbol{x})$ 。而提示工程采用截然不同的学习过程:利用大语言模型直接建模结果 $z$ 的概率 (见表1: LM反馈)。为将这些模型应用于预测任务,原始输入 $x$ 会通过模板 $T$ 修改为新输入 $x'$ (即提示)。模板 $T$ 可以是离散的 (自然语言) 或连续的 (数值向量)。

To clarify the operational mechanisms of prompt engineering, we employ a discrete, human-readable prompt as an illustrative example. This new input $x'$ to the LLM has unfilled slots $[mask]$, and the LLM is then utilized to probabilistically fill in the missing information $[mask]$, resulting in $z$, from which the ultimate output $y$ can be derived through $p(\hat{y}|z)$. Take the sentiment prediction task in Table 1 as an example. The original input $x$ is the text "I missed the bus today" or its vector representation. The corresponding label, denoted as $y$, is "negative (sentiment)." In learning paradigms other than prompt engineering, $y$ is directly derived through $p(y|x)$, with $x$ being "I missed the bus today." However, in prompt engineering, researchers first design a prompt denoted as $x'$ (i.e., a new input to an LLM): "I missed the bus today. I felt so [mask]." The LLM fills the unfilled slots of $x'$, resulting in $z$: "frustrating." The prediction result $y$ is determined by the value of $z$. As $z$ is closely associated with a negative sentiment, the prediction outcome $y$ is consequently classified as "negative." The determination of which words are more closely associated with "negative" or "positive" can be either pre-defined or learned automatically, which is referred to as the verbalizer V.

为阐明提示工程(prompt engineering)的运作机制,我们采用一个离散化、人类可读的提示作为示例。大语言模型的新输入$x'$包含未填充的槽位$[mask]$,随后利用大语言模型以概率方式填补缺失信息$[mask]$得到$z$,最终输出$y$可通过$p(\hat{y}|z)$推导得出。以表1中的情感预测任务为例,原始输入$x$是文本"I missed the bus today"或其向量表示,对应标签$y$为"negative"(情感)。在非提示工程的学习范式中,$y$直接通过$p(y|x)$推导获得;而在提示工程中,研究者首先设计提示$x'$(即大语言模型的新输入):"I missed the bus today. I felt so [mask]." 大语言模型填补$x'$的未填充槽位得到$z$:"frustrating"。预测结果$y$由$z$的值决定,由于$z$与负面情感高度关联,因此预测结果$y$被判定为"negative"。具体哪些词汇与"negative"或"positive"关联更紧密,可以预先定义或自动学习,这被称为言语化器(verbalizer)V。
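The verbalizer V described above can be sketched as a mapping from the LLM's [mask]-word probabilities to class scores (a toy illustration; the label-word sets and the probability numbers are hypothetical, and a real system would read these probabilities from the LLM's output distribution over $z$):

```python
# Hypothetical label-word sets defining the verbalizer V.
VERBALIZER = {
    "negative": {"frustrating", "sad", "terrible"},
    "positive": {"great", "happy", "relieved"},
}

def verbalize(mask_word_probs):
    """Aggregate the LLM's probabilities for z (the [mask] fill) into
    per-class scores p(y|z) and return the argmax label with the scores."""
    scores = {
        label: sum(mask_word_probs.get(w, 0.0) for w in words)
        for label, words in VERBALIZER.items()
    }
    return max(scores, key=scores.get), scores

label, scores = verbalize({"frustrating": 0.55, "sad": 0.20, "happy": 0.05})
# label == "negative", since the negative label words carry most of the mass
```

Whether the word sets are hand-crafted (as here) or learned automatically is exactly the pre-defined vs. learned distinction the text draws.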

Table 2. Representative Prompt Engineering Methods and Comparison with Our Method

(a) Classification based on prompt design. Methods are characterized along four dimensions: shape of prompts (cloze vs. prefix), manual vs. automated construction, discrete vs. continuous prompts, and static vs. dynamic prompts. Representative methods: LAMA (Petroni et al. 2019), TemplateNER (Cui et al. 2021), GPT-3 (Brown et al. 2020), Prefix-Tuning (Li and Liang 2021), Prompt tuning (Lester et al. 2021), AutoPrompt (Shin et al. 2020), and our method.

(b) Classification based on multi-prompt learning. Strategies include prompt ensemble, prompt augmentation, prompt composition, and prompt decomposition. Representative methods: BARTScore (Yuan et al. 2021), GPT-3 (Brown et al. 2020), PTR (Han et al. 2022), TemplateNER (Cui et al. 2021), and our method.

表 2. 代表性提示工程方法及与本文方法的对比

(a) 基于提示设计的分类:按提示形态(完形填空式 vs. 前缀式)、人工 vs. 自动构建、离散 vs. 连续、静态 vs. 动态四个维度划分。代表性方法包括 LAMA (Petroni et al. 2019)、TemplateNER (Cui et al. 2021)、GPT-3 (Brown et al. 2020)、Prefix-Tuning (Li and Liang 2021)、Prompt tuning (Lester et al. 2021)、AutoPrompt (Shin et al. 2020) 以及本文方法。

(b) 基于多提示学习的分类:策略包括提示集成、提示增强、提示组合与提示分解。代表性方法包括 BARTScore (Yuan et al. 2021)、GPT-3 (Brown et al. 2020)、PTR (Han et al. 2022)、TemplateNER (Cui et al. 2021) 以及本文方法。


The objective of prompt engineering is to develop a prompting function, denoted as $x'=f_{prompt}(x)$ , to achieve optimal performance in the subsequent task. Prompt engineering can significantly enhance the efficiency and effectiveness of the prediction process since it enables LLMs to undergo pre-training on vast amounts of pre-existing textual data. Moreover, by defining $f_{prompt}(x)$ , the model can facilitate few-shot or even zero-shot learning, seamlessly adapting to new scenarios with minimal or no labeled data.

提示工程的目标是开发一个提示函数,记为 $x'=f_{prompt}(x)$ ,以在后续任务中实现最佳性能。提示工程可以显著提高预测过程的效率和效果,因为它使大语言模型能够对大量现有文本数据进行预训练。此外,通过定义 $f_{prompt}(x)$ ,该模型可以促进少样本甚至零样本学习,只需极少或无需标注数据即可无缝适应新场景。

As the literature underscores, the design of a prompt can have substantial influence on the overall performance of a prompt-based method (Liu et al. 2023). Therefore, various prompt engineering methods have been proposed, which can be categorized based on the shape of prompts, manual/automated prompts, discrete/continuous prompts, and static/dynamic prompts, each with distinct characteristics and associated pros and cons. The choice of prompt engineering method depends on both the task at hand and the specific LLMs employed to address the task. We provide a summary of representative studies in Table 2. Recently, many studies have highlighted the significant improvement in the effectiveness of prompt engineering methods through the utilization of multiple prompts—a concept known as multi-prompt engineering. Several key strategies for multi-prompt learning have been identified, including prompt ensembling, prompt augmentation, prompt composition, and prompt decomposition (Liu et al. 2023).

文献研究表明,提示(prompt)设计对基于提示的方法整体性能具有显著影响(Liu et al. 2023)。因此,研究者提出了多种提示工程方法,这些方法可按提示形态、人工/自动提示、离散/连续提示、静态/动态提示等维度进行分类,各类方法具有不同特性及优缺点。提示工程方法的选择需同时考虑具体任务和所用大语言模型(LLM)。我们在表2中总结了代表性研究。近期许多研究表明,通过使用多重提示(即多提示工程)可显著提升提示工程方法的有效性。目前已识别出多提示学习的若干关键策略,包括提示集成(prompt ensembling)、提示增强(prompt augmentation)、提示组合(prompt composition)和提示分解(prompt decomposition)(Liu et al. 2023)。
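Prompt ensembling, one of the multi-prompt strategies listed above, can be sketched as averaging the class probabilities obtained from several templates (a toy illustration: `query_llm` is a hypothetical stand-in for a real LLM call and simply returns made-up probabilities keyed on the prompt):

```python
def query_llm(prompt):
    """Placeholder for an LLM call; returns a hypothetical probability
    of the positive class for the given prompt."""
    fake_scores = {
        "Template A: {x}": 0.70,
        "Template B: {x}": 0.62,
        "Template C: {x}": 0.81,
    }
    return fake_scores[prompt]

def ensemble_predict(prompts, threshold=0.5):
    """Average p(positive) across multiple prompts and threshold the mean,
    so no single badly-phrased template dominates the decision."""
    probs = [query_llm(p) for p in prompts]
    mean_p = sum(probs) / len(probs)
    return mean_p >= threshold, mean_p

positive, p = ensemble_predict(
    ["Template A: {x}", "Template B: {x}", "Template C: {x}"]
)
# p = (0.70 + 0.62 + 0.81) / 3 = 0.71, so the ensemble predicts positive
```

Averaging over templates is the simplest aggregation rule; weighted or learned combinations are also used in the multi-prompt literature.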

Although prompt engineering has shown significant potential across different tasks and scenarios, many challenges remain (Liu et al. 2023). Two of the most significant technical challenges in this field are as follows. (1) Prompt design for complex tasks: the formulation and design of prompts for complex tasks are not straightforward (Liu et al. 2023). In particular, prompt design in mental disorder detection using textual data is under-explored. Each patient possesses unique characteristics and patterns, including but not limited to linguistic styles (such as a tendency to complain and convey setbacks, or a tendency to endure and face challenges positively), habits of using social media, the extent to which one is willing to openly discuss their own illnesses, a unique course of progression in their illness, and so on. Moreover, different types of mental disorders exhibit distinct (but sometimes similar) symptoms, risk factors, and treatments. (2) Prompt engineering with structured domain knowledge (Han et al. 2022): in many NLP tasks, inputs may exhibit various structures (e.g., syntax trees or relational structures from relationship extraction); effectively expressing these structures in prompt engineering poses a significant challenge. In the realm of mental disorder management, chronic disease management, and healthcare in general, a substantial volume of medical knowledge exists in structured formats (e.g., ontologies, which are tree or network structures). Leveraging this existing domain knowledge can greatly enhance disease detection using textual data. However, this direction remains largely under-explored, presenting a potentially crucial and interesting avenue for research. These two challenges represent the core issues that this work aims to address. In the following sections, we will explore our proposed solutions within our research context.

尽管提示工程(prompt engineering)在不同任务和场景中展现出巨大潜力,但仍存在诸多挑战(Liu et al. 2023)。该领域最显著的两个技术挑战如下:(1) 复杂任务的提示设计:针对复杂任务的提示制定与设计并非易事(Liu et al. 2023)。特别是在利用文本数据进行精神障碍检测时,提示设计研究尚不充分。每位患者都具有独特特征和行为模式,包括但不限于:语言风格(如倾向于抱怨、诉说挫折;或倾向于忍耐、积极面对挑战)、社交媒体使用习惯、对公开讨论自身疾病的接受程度、独特的病情发展轨迹等。此外,不同类型的精神障碍表现出独特(但有时相似)的症状、风险因素和治疗方案。(2) 结合结构化领域知识的提示工程(Han et al. 2022):在许多NLP任务中,输入可能呈现多种结构(如句法树或关系抽取得到的关系结构);如何在提示工程中有效表达这些结构构成重大挑战。在精神障碍管理、慢性病管理及医疗健康领域,大量医学知识以结构化形式存在(如本体论中的树状或网状结构)。利用现有领域知识可显著提升基于文本数据的疾病检测效果。然而这一方向仍存在大量研究空白,可能成为关键而有趣的研究方向。这两大挑战正是本研究旨在解决的核心问题。在接下来的章节中,我们将在研究背景下探讨所提出的解决方案。

2.2.1. Mental Disorder Detection and Continuous Prompt Engineering

2.2.1. 精神障碍检测与连续提示工程

One method of classifying prompts is to categorize them as either continuous or discrete. A discrete prompt modifies the input to an LLM using natural language. In contrast, a continuous prompt operates directly in the model's embedding space, allowing it to (1) relax the constraint that template embeddings $T$ must correspond to natural language (e.g., English) words and (2) eliminate the restriction that the template $T$ is parameterized by the pre-trained LLM's parameters.

一种对提示进行分类的方法是将它们划分为连续型或离散型。离散型提示通过自然语言修改大语言模型(LLM)的输入。相比之下,连续型提示直接在模型的嵌入空间中操作,使其能够:(1) 放宽模板嵌入 $T$ 必须对应自然语言(如英语)单词的约束;(2) 消除模板 $T$ 由预训练大语言模型参数进行参数化的限制。

As mentioned earlier, this study focuses on detecting mental disorders using user-generated content. The problem is a binary classification task. We argue that continuous prompts are more effective than discrete ones for our research problem for the following reasons: (1) Enhanced expressiveness: continuous prompts leverage high-dimensional vector representations, enabling the model to learn latent semantic features tailored to the classification task (Lester et al. 2021). Unlike discrete prompts, continuous prompts capture nonlinear relationships that rigid textual templates (discrete prompts) often miss (Liu et al. 2024). (2) Flexibility via optimization: continuous prompts consist of trainable parameters optimized through backpropagation, allowing them to align directly with the prediction task's loss function (Lester et al. 2021). In contrast, discrete prompts rely on heuristic tuning, which may lead to misalignment with the LLMs' internal representations and the objectives of the prediction task (Shin et al. 2020). (3) Mitigating ambiguity in classifications: binary tasks, such as mental disorder detection (positive/negative), require probabilistic boundaries. Continuous prompts excel in these domains by providing probabilities as feedback from LLMs. In our work, we leverage these probabilities for result interpretation, which can offer significant benefits to stakeholders (see Section 4.5). In contrast, discrete prompts impose rigid mappings, only yielding binary outcomes (positive/negative) (Shin et al. 2020). (4) Computational efficiency and performance improvement: although discrete

如前所述,本研究聚焦于利用用户生成内容检测心理障碍。该问题属于二分类任务。我们认为连续提示(continuous prompts)比离散提示更适合本研究问题,原因如下:(1) 表现力增强:连续提示利用高维向量表示,使模型能够学习针对分类任务的潜在语义特征(Lester et al. 2021)。与离散提示不同,连续提示能捕捉刚性文本模板(离散提示)常忽略的非线性关系(Liu et al. 2024)。(2) 通过优化实现灵活性:连续提示由可训练参数组成,通过反向传播进行优化,使其能直接对齐预测任务的损失函数(Lester et al. 2021)。相比之下,离散提示依赖启发式调优,可能导致与大语言模型内部表征及预测任务目标不一致(Shin et al. 2020)。(3) 缓解分类模糊性:心理障碍检测(阳性/阴性)等二分类任务需要概率边界。连续提示通过提供大语言模型的概率反馈,在此类任务中表现优异。我们在工作中利用这些概率进行结果解释,可为利益相关者带来显著价值(参见第4.5节)。而离散提示采用刚性映射,仅产生二元结果(阳性/阴性)(Shin et al. 2020)。(4) 计算效率与性能提升:尽管离散

prompts may seem straightforward because they are easily understandable by humans, their heuristic tuning (e.g., word tweaking for better performance) can be time-consuming. In contrast, continuous prompts automate the tuning process with minimal computational overhead (adding only a few trainable parameters, e.g., prefix embeddings, and avoiding full model fine-tuning). This is critical for binary tasks with limited labeled data (Lester et al. 2021).

提示词看似简单易懂,但其启发式调优(如调整用词以提升效果)可能耗时。相比之下,连续提示通过最小计算开销(仅添加少量可训练参数,如前缀嵌入,避免全模型微调)实现了调优自动化。这对于标注数据有限的二元分类任务至关重要(Lester et al. 2021)。

Continuous prompts have one primary limitation: they are typically represented as high-dimensional numerical vectors, which makes it difficult to directly understand or interpret their specific meanings. This contrasts with the more intuitive nature of discrete prompts. However, this limitation is not critical in the context of our research (binary classification). The understandability of the prompt per se (the input) is not the ultimate goal. Our goal is to predict mental disorders (the output). Even with continuous prompts, we can still offer explainable insights into this prediction outcome, which we will show in Section 4.5.

连续提示有一个主要限制:它们通常表示为高维数值向量,这使得难以直接理解或解释其具体含义。这与离散提示更直观的特性形成对比。然而,这些限制在我们的研究——二元分类背景下并不关键。提示本身(输入)的可理解性并非最终目标,我们的目标是预测精神障碍(输出)。即使使用连续提示,我们仍可为该预测结果提供可解释的洞见,这将在第4.5节展示。
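To make the distinction concrete, the following is a minimal NumPy sketch of how a continuous prompt lives directly in embedding space rather than in the vocabulary. All sizes and values here are illustrative assumptions, not details from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: a frozen LLM embedding table with a vocabulary of
# 100 tokens, hidden size 8, and a continuous prompt of 4 virtual tokens.
vocab_size, hidden, n_prompt = 100, 8, 4
embedding_table = rng.normal(size=(vocab_size, hidden))   # frozen LLM weights
soft_prompt = rng.normal(size=(n_prompt, hidden))         # trainable vectors

def embed_with_continuous_prompt(token_ids):
    """Prepend trainable prompt vectors to the (frozen) token embeddings.

    Unlike a discrete prompt, the prepended rows need not correspond to any
    real word in the vocabulary; they are optimized freely in embedding space.
    """
    token_embs = embedding_table[token_ids]               # (len, hidden)
    return np.concatenate([soft_prompt, token_embs], axis=0)

inputs = embed_with_continuous_prompt([5, 17, 42])
print(inputs.shape)  # 4 prompt vectors + 3 tokens -> (7, 8)
```

During training, only `soft_prompt` would receive gradients, which is why this style of prompting adds minimal overhead compared to full fine-tuning.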

Moreover, there are two key research challenges in our research problem: (1) addressing heterogeneity among individuals and disorders, and (2) incorporating structured medical knowledge into prompts to enhance LLM prediction. We aim to tackle this research problem by enhancing the performance of LLMs in binary mental disorder detection tasks through a prompt ensemble approach that combines two continuous prompt engineering techniques: prefix tuning and rule-based prompting.

此外,我们的研究问题存在两个关键挑战:(1) 解决个体与疾病间的异质性问题,(2) 将结构化医学知识融入提示词以增强大语言模型预测性能。我们计划通过提示集成方法提升大语言模型在二元精神障碍检测任务中的表现,该方法结合了两种连续提示工程技术:前缀调优和基于规则的提示构建。

2.2.2. Prefix Tuning for Personalized Prompts

2.2.2. 个性化提示的前缀调优 (Prefix Tuning)

Prefix tuning is a continuous prompt method that optimizes a sequence of trainable vectors prepended to the input embeddings (Li and Liang 2021). The underlying intuition of this method is that providing an appropriate context to an LLM can influence the encoding of $x$ and direct the LLM on what information to extract from $x$. Therefore, the context can guide the LLM to effectively solve downstream tasks. Nevertheless, it is not clear whether such a context exists or how to identify such a context for each individual $x$. Therefore, the authors propose the prefix tuning method to automatically optimize continuous prefixes for inputs as the context. Formally, for a training example $(x,y)$, they define

前缀调优是一种连续提示方法,它优化了预置在输入嵌入前的可训练向量序列 (Li and Liang 2021)。该方法的核心思想在于:通过为大语言模型提供适当的上下文,可以影响 $x$ 的编码过程,并指导模型从 $x$ 中提取哪些信息。因此,这种上下文能引导大语言模型有效解决下游任务。然而,目前尚不清楚这种上下文是否存在,或如何为每个独立的 $x$ 识别这种上下文。为此,作者提出前缀调优方法来自动优化作为上下文输入的连续前缀。形式上,对于训练样本 $(x,y)$ ,他们定义

$$
f_{prompt}=[Prefix;x;Prefix';y]
$$

$$
f_{prompt}=[Prefix;x;Prefix';y]
$$

where $Prefix$ and $Prefix'$ are placeholders for values associated with the training example $(x,y)$. The $Prefix$ and $Prefix'$ for all training examples consist of a trainable matrix $P_{\theta}[i,:]$, where $i \in P_{idx}$ and $P_{idx}$ denotes the sequence of prefix indices. Therefore, the feedback from the LLM is

其中 Prefix 和 Prefix' 是与训练样本 $(x,y)$ 相关联的值的占位符。所有训练样本的 Prefix 和 $Prefix^{\prime}$ 由一个可训练矩阵 $P_{\theta}[i,:]$ 组成,其中 $i \in P_{idx}$ 且 $P_{idx}$ 表示前缀索引序列。因此,大语言模型 (LLM) 的反馈为
$$
z_i =\begin{cases} P_{\theta}[i,:], & \text{if } i \in P_{idx},\\ LM_{\phi}(f_{prompt}(x), z_{<i}), & \text{o.w.} \end{cases}
$$

$$
z_i =\begin{cases} P_{\theta}[i,:], & \text{if } i \in P_{idx},\\ LM_{\phi}(f_{prompt}(x), z_{<i}), & \text{o.w.} \end{cases}
$$

$z_{i}$ represents a function of the trainable $P_{\theta}$ (of dimension $\left|P_{idx}\right|\times dim(z_{i})$). When $i\in P_{idx}$, $z_{i}$ directly copies from $P_{\theta}$; when $i\notin P_{idx}$, $z_{i}$ still depends on $P_{\theta}$, as it is the prefix context, and the subsequent feedback from $LM_{\Phi}$ relies on the activations of the preceding feedback. Empirically, $P_{\theta}[i,:]$ is reparameterized from a smaller matrix $P'_{\theta}[i,:]$ (of dimension $\left|P_{idx}\right|\times k$, where $k$ is a hyper-parameter) using a feedforward neural network for stable training: $P_{\theta}[i,:]=MLP_{\theta}(P'_{\theta}[i,:])$. The learning goal is

$z_{i}$ 表示关于可训练变量 $P_{\theta}$(维度为 $\left|P_{idx}\right|\times dim(z_{i})$)的函数。当 $i\in P_{idx}$ 时,$z_{i}$ 直接从 $P_{\theta}$ 复制;当 $i\notin P_{idx}$ 时,$z_{i}$ 仍依赖于 $P_{\theta}$,因为它是前缀上下文,而 $LM_{\Phi}$ 的后续反馈依赖于先前反馈的激活。经验上,$P_{\theta}[i,:]$ 由较小矩阵 $P'_{\theta}[i,:]$(维度为 $\left|P_{idx}\right|\times k$,其中 $k$ 为超参数)经前馈神经网络重参数化得到,以实现稳定训练:$P_{\theta}[i,:]=MLP_{\theta}(P'_{\theta}[i,:])$。学习目标是

$$
max_{\theta} \log p_{\phi}(y|x) = \sum_{i \in y_{idx}} \log p_{\phi}\big(f_{prompt}(x, y)_{i} \mid z_{<i}\big)
$$

$$
max_{\theta} \log p_{\phi}(y|x) = \sum_{i \in y_{idx}} \log p_{\phi}\big(f_{prompt}(x, y)_{i} \mid z_{<i}\big)
$$

where $p_{\phi}$ is the LLM distribution and $\phi$ denotes the LLM parameters, which are fixed; only the prefix parameters $\theta$ are trained.

其中 $p_{\phi}$ 是大语言模型 (LLM) 的分布,$\phi$ 为固定的大语言模型参数;仅前缀参数 $\theta$ 参与训练。
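The reparameterization step described above, $P_{\theta}[i,:]=MLP_{\theta}(P'_{\theta}[i,:])$, can be sketched in a few lines of NumPy. All dimensions below are illustrative assumptions chosen for the sketch, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sketch of prefix-tuning reparameterization: a small trainable matrix P'
# (|P_idx| x k) is mapped by a one-hidden-layer feedforward network to the
# full prefix matrix P (|P_idx| x dim(z_i)) for stable training.
n_prefix, k, dim_z, hidden = 4, 3, 16, 8
P_prime = rng.normal(size=(n_prefix, k))            # trainable, low-dimensional
W1, b1 = rng.normal(size=(k, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, dim_z)), np.zeros(dim_z)

def mlp(P_small):
    """Feedforward reparameterization MLP_theta applied row-wise."""
    h = np.tanh(P_small @ W1 + b1)
    return h @ W2 + b2

P = mlp(P_prime)      # the actual prefix embeddings fed to the frozen LLM
print(P.shape)        # one dim(z_i)-sized vector per prefix position: (4, 16)
```

After training, only `P` needs to be stored; the frozen LLM parameters are shared across tasks, which is the main efficiency argument for prefix tuning.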

One significant advantage of this method is that it provides flexibility and adaptability to the individual input $x$. Given that the prepended vectors (i.e., $Prefix$ and $Prefix'$) are automatically updated during training, each prefix vector (i.e., $P_{\theta}[i,:]$) is customized for the individual input $x$ simultaneously. We exploit this feature of prefix tuning to generate personalized prompts for user textual data. In the context of mental disorder detection using user-generated content, it is desirable to provide a distinct prompt for each user for optimal performance, since different users have unique characteristics and underlying patterns. Thus, prefix tuning represents a promising research direction for developing continuous prompts optimized for each individual, thereby

这种方法的一个显著优势在于它为单个输入$x$提供了灵活性和适应性。由于前置向量(即Prefix和Prefix')在训练过程中会自动更新,每个前缀向量(即$P_{\theta}[i,:]$)都能同时为单个输入$x$进行定制。我们利用前缀调优的这一特性来为用户文本数据生成个性化提示。在使用用户生成内容进行精神障碍检测的场景中,由于不同用户具有独特的特征和潜在模式,为每个用户提供不同的提示以获得最佳性能是可取的。因此,前缀调优代表了一个有前景的研究方向,可以开发针对每个个体优化的连续提示,从而

enhancing the performance of mental disorder detection. Specifically, our intentions are twofold: (1) to refine the design of $f_{p r o m p t}$ (Eq. 1) to better fulfill the role of a personalized prompt tailored to individual users for mental disorder detection; and (2) to seamlessly integrate the learning objective of prefix tuning (Eq. 3) with other prompt learning goals through a multi-prompt approach (i.e., prompt ensemble) to more effectively address the challenges associated with mental disorder detection.

提升心理障碍检测的性能。具体而言,我们的目标有两个:(1) 改进 $f_{p r o m p t}$ (公式1) 的设计,使其更好地作为针对个体用户量身定制的个性化提示 (prompt) 用于心理障碍检测;(2) 通过多提示方法 (即提示集成) 将前缀调优 (prefix tuning) (公式3) 的学习目标与其他提示学习目标无缝结合,以更有效地应对心理障碍检测相关的挑战。

2.2.3. Knowledge Injection Through Rule-based Prompts

2.2.3. 基于规则提示的知识注入

The rule-based prompt is another continuous prompt method that incorporates logical rules (e.g., "if the text contains word $X$, it likely belongs to class $Y$") into the prompt tuning process. It uses continuous prompts as its foundation while constraining their optimization through rule-based losses (Han et al. 2022). The rule-based prompt is proposed to address the limitations of other widely used prompt engineering methods in addressing complex text classification tasks: (1) manual prompt design is both laborious and prone to errors, and (2) for auto-generated prompts, validating their efficacy is a resource-intensive and time-consuming process.

基于规则的提示 (rule-based prompt) 是另一种连续提示方法,它将逻辑规则 (例如"如果文本包含单词 $X$,则很可能属于类别 $Y$") 融入提示调优过程。该方法以连续提示为基础,同时通过基于规则的损失函数来约束其优化 (Han et al. 2022)。提出基于规则的提示是为了解决其他广泛使用的提示工程方法在处理复杂文本分类任务时的局限性:(1) 手动设计提示既费力又容易出错;(2) 对于自动生成的提示,验证其有效性是一个资源密集且耗时的过程。

The essence of the rule-based prompt method for solving challenging classification tasks is threefold. First, for a highly challenging text classification problem (i.e., given $(x,y)$, predict $p(y|x)$), the rule-based prompt method breaks down the classification question into several simpler sub-classification tasks, namely, breaking down $p(y|x)$ into $p(y^{1}|x)...p(y^{f}|x)...p(y^{k}|x)$, where $k$ indicates the number of subtasks. Then, the rule-based prompt method incorporates logical rules to compose task-specific prompts from several simpler sub-prompts and accomplish the complex classification task. Formally, for each sub-classification task $p(y^{f}|x)$, the rule-based prompt method sets a template $T^{f}(x)$ and a set of verbalizer words $V^{f}=\{v_{1},...,v_{n}\}$. The template $T^{f}(x)$ and verbalizer $V^{f}$ constitute the sub-prompting function $f_{prompt}^{f}(x)$. The logical rule is defined as

基于规则的提示方法解决复杂分类任务的核心可归纳为三点。首先,针对高难度文本分类问题(即给定$(x,y)$并预测$p(y|x)$),该方法将分类问题拆解为多个更简单的子分类任务,即将$p(y|x)$分解为$p(y^{1}|x)...p(y^{f}|x)...p(y^{k}|x)$,其中$k$表示子任务数量。随后,该方法通过逻辑规则将若干简单子提示组合为任务特定提示,最终完成复杂分类任务。形式上,对于每个子分类任务$p(y^{f}|x)$,规则提示方法会设定模板$T^{f}(x)$和对应的标签词集$V^{f}=\{v_{1},...,v_{n}\}$。模板$T^{f}(x)$与标签词集$V^{f}$共同构成子提示函数$f_{prompt}^{f}(x)$。逻辑规则定义为

$$
p(y^{1}|x)\wedge p(y^{2}|x)\wedge\cdots\wedge p(y^{f}|x)\wedge\cdots\wedge p(y^{k}|x)\to p(y|x)
$$

$$
p(y^{1}|x)\wedge p(y^{2}|x)\wedge\cdots\wedge p(y^{f}|x)\wedge\cdots\wedge p(y^{k}|x)\to p(y|x)
$$

Second, the rule-based prompt method incorporates prior knowledge for each sub-classification task, reducing the laborious and error-prone nature of manual prompt construction and mitigating the uncertainties associated with auto-generated prompts. Formally, when constructing each sub-prompt $f_{prompt}^{f}(x)$ for each sub-classification task $p(y^{f}|x)$, prior knowledge can be injected into both the design of $T^{f}(x)$ and the verbalizer $V^{f}$ to facilitate the prediction and performance of $p(y^{f}|x)$. For instance, consider a classical sub-classification problem in named entity recognition. Let $T^{f}(x)=$ "$x$ is the [mask] entity" and $V^{f}=\{$"person", "organization",...$\}$. For this subtask (i.e., named entity recognition), the templates and verbalizers can be meticulously customized to assist an LLM in accurately identifying the entity category. For a classical relation prediction problem, let $T^{f}(x)=$ "$x\ entity_{1}\ [mask]\ entity_{2}$" and $V^{f}=\{$"was born in", "is parent of",...$\}$. Again, the templates and verbalizers for this sub-classification problem can be tailored to assist an LLM in completing a relation prediction task.

其次,基于规则的提示方法为每个子分类任务融入了先验知识,既减少了人工构建提示的繁琐与易错性,也降低了自动生成提示的不确定性。具体而言,在构建每个子分类任务 $p(y^{f}|x)$ 的子提示 $f_{prompt}^{f}(x)$ 时,可通过设计 $T^{f}(x)$ 和标签词表 $V^{f}$ 来注入先验知识,从而提升 $p(y^{f}|x)$ 的预测效果。例如命名实体识别中的经典子分类问题,设 $T^{f}(x)=$ "$x$ 是[mask]实体" 且 $V^{f}=\{$"人物", "组织",...$\}$,针对该子任务(即命名实体识别),通过精心定制模板与标签词表可帮助大语言模型准确识别实体类别。对于经典关系预测问题,设 $T^{f}(x)=$ "$x\ entity_{1}\ [mask]\ entity_{2}$" 且 $V^{f}=\{$"was born in", "is parent of",...$\}$,同样可通过定制该子分类问题的模板与标签词表,辅助大语言模型完成关系预测任务。
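The template-plus-verbalizer pairing above can be sketched as a small data structure. The task names, templates, and verbalizer words here are illustrative assumptions echoing the named-entity and relation examples, not the paper's actual prompts.

```python
# Illustrative sketch: each sub-classification task f gets a template T^f(x)
# with a [mask] slot and a verbalizer set V^f of candidate label words that
# an LLM would score at the masked position.
sub_prompts = {
    "entity_type": {
        "template": "{x} is the [mask] entity",
        "verbalizer": ["person", "organization", "location"],
    },
    "relation": {
        "template": "{x} entity_1 [mask] entity_2",
        "verbalizer": ["was born in", "is parent of"],
    },
}

def render(task, x):
    """Fill a sub-template with the input x; an LLM would then rank the
    verbalizer words of that task at the [mask] position."""
    return sub_prompts[task]["template"].format(x=x)

print(render("entity_type", "Paris"))  # Paris is the [mask] entity
```

Injecting prior knowledge then amounts to choosing the template wording and the verbalizer vocabulary per subtask, rather than hand-tuning one monolithic prompt.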

Lastly, the rule-based prompt method composes sub-prompts of various sub-problems into a complete task-specific prompt,

最后,基于规则的提示方法将各种子问题的子提示组合成一个完整的任务特定提示。

$$
f_{prompt}(x)=\begin{cases} T(x)=[T^{1}(x);...;T^{f}(x);...;T^{k}(x)],\\ V_{[mask]_{1}}=\{v^{1}_{1},v^{1}_{2},...\},\ V_{[mask]_{2}}=\{v^{2}_{1},v^{2}_{2},...\},\ ...,\ V_{[mask]_{k}}=\{v^{k}_{1},v^{k}_{2},...\}.
\end{cases}
$$

$$
f_{prompt}(x)=\begin{cases} T(x)=[T^{1}(x);...;T^{f}(x);...;T^{k}(x)],\\ V_{[mask]_{1}}=\{v^{1}_{1},v^{1}_{2},...\},\ V_{[mask]_{2}}=\{v^{2}_{1},v^{2}_{2},...\},\ ...,\ V_{[mask]_{k}}=\{v^{k}_{1},v^{k}_{2},...\}.
\end{cases}
$$

where $[\cdot;\cdot;\cdot]$ is the aggregation function of sub-templates. The learning objective of the rule-based prompt method is

其中 $[\cdot;\cdot;\cdot]$ 是子模板的聚合函数。基于规则的提示方法的学习目标是

$$
max_{\Phi}\log p_{\Phi}(y|x)=\log\prod_{f=1}^{r}p_{\Phi}\Big([mask]_{f}=LM_{\Phi}(y)\,\Big|\,T(x)\Big)
$$

$$
max_{\Phi}\log p_{\Phi}(y|x)=\log\prod_{f=1}^{r}p_{\Phi}\Big([mask]_{f}=LM_{\Phi}(y)\,\Big|\,T(x)\Big)
$$

where $r$ is the number of masked positions in $T(x)$, and $[mask]_{f}=LM_{\Phi}(y)$ maps the label $y$ to the set of label words $V_{[mask]_{f}}$.

其中 $r$ 是 $T(x)$ 中掩码位置的数量,$[mask]_{f}=LM_{\Phi}(y)$ 将标签 $y$ 映射到标签词集合 $V_{[mask]_{f}}$。
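Numerically, the learning objective above is just the log of a product of per-mask probabilities, which equals a sum of log-probabilities. A toy illustration (the probability values are made-up numbers, not model outputs):

```python
import numpy as np

# Per-mask probabilities p([mask]_f = label word of y | T(x)) for f = 1..r,
# with r = 3 masked positions. Values are illustrative.
p_mask = np.array([0.9, 0.8, 0.7])
log_score = float(np.log(p_mask).sum())

# Summing logs is numerically safer than multiplying many small probabilities,
# but it is mathematically the same quantity as log(prod_f p(...)).
assert np.isclose(log_score, np.log(np.prod(p_mask)))
```

In training, this score would be maximized for the correct label's verbalizer words across all masked positions.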

In our research context, predicting whether an individual has a specific mental disorder by directly utilizing an LLM and ultra-long user-generated content as inputs (since the task is at the individual level) presents a significant challenge. As mentioned, various mental disorders exhibit distinct or sometimes similar symptoms, risk factors, and treatments. Therefore, the rule-based prompt method is an efficient way to design sub-prompts to capture different aspects of mental disorders (e.g., symptoms, risk factors, and treatments) to simplify the detection task using user-generated content. Furthermore, the rule-based prompt method is an ideal method to incorporate the existing domain knowledge which is widely available and essential in mental disorder diagnosis and healthcare. Hence, in this study, we attempt to encode and incorporate existing medical knowledge by proposing a new rule-based prompt engineering method for improved mental disorder detection performance. Specifically, our key innovations include: (1) modifying the logic rules implemented in the original method (Eq. 4) and the learning goal of the rule-based prompt method (Eq. 6) to transfer it to the mental disorder detection task; (2) exploring an effective mechanism to inject existing medical knowledge of mental disorder detection in the rule-based prompt engineering process (Eq. 5), and (3) seamlessly integrating the learning objective of the rule-based prompt method (Eq. 6) with other prompt learning goals through a multi-prompt approach (i.e., prompt ensemble and prompt composition) to more effectively address the challenges associated with mental disorder detection.

在我们的研究背景下,直接利用大语言模型(LLM)和超长用户生成内容作为输入(由于任务处于个体层面)来预测个体是否患有特定精神障碍,是一项重大挑战。如前所述,各类精神障碍表现出不同或有时相似的症状、风险因素和治疗方法。因此,基于规则的提示(prompt)方法能高效设计子提示来捕捉精神障碍的不同方面(如症状、风险因素和治疗方案),从而简化基于用户生成内容的检测任务。此外,基于规则的提示方法也是整合现有领域知识的理想方式——这些知识在精神障碍诊断和医疗保健中既广泛可得又至关重要。为此,本研究尝试通过提出新型基于规则的提示工程方法,对现有医学知识进行编码整合,以提升精神障碍检测性能。具体而言,我们的核心创新包括:(1) 改进原方法中的逻辑规则(公式4)和基于规则提示方法的学习目标(公式6),使其适配精神障碍检测任务;(2) 探索在基于规则的提示工程过程中注入现有精神障碍检测医学知识的有效机制(公式5);(3) 通过多提示方法(即提示集成和提示组合)将基于规则提示方法的学习目标(公式6)与其他提示学习目标无缝整合,以更有效应对精神障碍检测的相关挑战。

2.3. Key Novelties

2.3. 关键创新点

From the perspective of design science, we make three technical contributions with our main IT artifact developed for mental disorder detection using textual data. First, we present a novel framework grounded in LLMs and prompt engineering, facilitating the few-shot detection of multiple mental disorders through user-generated text content. Notably, this framework confers a significant advantage by obviating the necessity for an extensive volume of labeled training data or the intricate engineering of customized architectures for each distinct disease or research problem. The proposed framework can be extended to tasks related to detecting other mental disorders and chronic diseases, especially those exhibiting discernible characteristics within

从设计科学的角度出发,我们为基于文本数据的心理障碍检测开发了主要IT构件,并做出三项技术贡献。首先,我们提出一个基于大语言模型和提示工程的新框架,通过用户生成文本内容实现多种心理障碍的少样本检测。值得注意的是,该框架具有显著优势:既无需大量标注训练数据,也不必为每种特定疾病或研究问题复杂地定制架构。所提框架可扩展至其他心理障碍和慢性疾病的检测任务,尤其是那些在...

user-generated textual content. Second, within our framework, we propose a multi-prompt engineering approach, effectively synergizing various continuous prompt engineering techniques, including prefix tuning and rule-based prompt engineering. This strategic amalgamation is specifically tailored to address the unique technical challenges within the healthcare domain. It involves the utilization of personalized prompts and the integration of existing medical domain knowledge, thereby markedly enhancing the accuracy and efficacy of our method. Third, as an integral component of our framework, we propose a new rule-based prompt engineering method, adept at efficiently dissecting complex textual content-based detection problems. This method seamlessly integrates domain knowledge existing in the ontology format (one of the widely adopted formats for domain knowledge). The design principle extends to other research problems necessitating the decomposition of challenging tasks and maximizes the utilization of LLMs' potential to address real-world challenges.

用户生成的文本内容。其次,在我们的框架中,我们提出了一种多提示工程方法,有效协同了包括前缀调优和基于规则的提示工程在内的多种连续提示工程技术。这种策略性融合专门针对医疗领域的独特技术挑战而设计,涉及个性化提示的使用以及现有医学领域知识的整合,从而显著提升了我们方法的准确性和有效性。第三,作为我们框架的一个组成部分,我们提出了一种新的基于规则的提示工程方法,能够高效剖析基于复杂文本内容的检测问题。该方法无缝整合了以本体论格式(领域知识广泛采用的格式之一)存在的领域知识。其设计原则可延伸至其他需要分解复杂任务的研究问题,并最大限度地利用大语言模型的潜力来解决现实世界的挑战。

3. RESEARCH DESIGN

3. 研究设计


Figure 2. Research Design

图 2: 研究设计

In this study, we introduce a novel multi-prompt engineering method for detecting mental disorders through user-generated textual content. The innovative design of our multi-prompt engineering method aims to tackle two technical challenges: (1) personalized prompts for individual users and each mental disorder, capturing the unique characteristics and underlying patterns of each user, and (2) integrating prompts with structured medical knowledge to contextualize the task, which instructs the LLMs on the learning objectives and operationalizes prediction goals. Subsequently, the outcomes of the prompts serve as the input for an LLM, which determines whether the targeted user exhibits signs of a mental disorder. The flowchart of our method is shown in Figure 2.

本研究提出了一种新颖的多提示工程方法,用于通过用户生成的文本内容检测心理障碍。我们多提示工程方法的创新设计旨在解决两个技术挑战:(1) 为每位用户和每种心理障碍定制个性化提示,捕捉每位用户的独特特征和潜在模式;(2) 将提示与结构化医学知识相结合,使任务情境化,从而指导大语言模型的学习目标并操作化预测任务。随后,这些提示的结果将作为大语言模型的输入,用于判断目标用户是否表现出心理障碍迹象。图2展示了我们方法的流程图。

3.1. Problem Formulation

3.1. 问题表述

We focus on user-generated textual content on online platforms (e.g., Reddit, Twitter, etc.), which potentially encompasses each user's self-reported information relevant to mental disorder detection. Each textual post is denoted as $x_{i}$. We collect data from a user base $U$ with $l$ users. For a given period of time, we observe the user-generated content of the focal user $u\in U$ from $N_{u}$ text posts, denoted by $\boldsymbol{x}_{u}=\left(x_{1},x_{2},...,x_{N_{u}}\right)$, ordered in time, where $N_{u}\in\{1,2,3,...\}$ as each user has an arbitrary number of posts. Each user $u$ may suffer from one or more mental disorders in a disease set $D=\{d_{1},...,d_{j},...,d_{N}\}$. For each disease $d_{j}$, an ontology $O_{j}$ can be constructed to depict the symptoms, risk factors, and treatments of the disease $d_{j}$. Given a user's text posts $\left(x_{1},x_{2},...,x_{N_{u}}\right)$ and a target disease $d_{j}$, we aim to design a new multi-prompt function $f_{prompt}\left(x_{1},x_{2},...,x_{N_{u}}\right)$ to address two technical challenges: personalized prompts and prompts integrated with medical knowledge $O_{j}$. As the foundation of our approach, we build upon LLMs, denoted as $LM_{\Phi}$ with parameters $\Phi$. The prediction outcome is $y_{d_{j}}\in\{0,1\}$, where 1 suggests that the focal user $u$ suffers or will suffer from the target disease $d_{j}$. Formally, the mental disorder $d_{j}$ detection problem is a binary probabilistic classification problem (Eq. 7), which applies to all diseases in $D$.

我们关注在线平台(如Reddit、Twitter等)上的用户生成文本内容,这些内容可能包含每位用户自我报告的与精神障碍检测相关的信息。每个文本帖子表示为$x_{i}$。我们从包含$l$个用户的用户群$U$中收集数据。在给定时间段内,我们观察到焦点用户$u\in U$生成的$N_{u}$个文本帖子内容,按时间顺序记为$\boldsymbol{x}_{u}=\left(x_{1},x_{2},...,x_{N_{u}}\right)$,且$N_{u}\in\{1,2,3,...\}$表示每位用户的帖子数量不定。每位用户$u$可能患有疾病集合$D=\{d_{1},...,d_{j},...,d_{N}\}$中的一种或多种精神障碍。对于每种疾病$d_{j}$,可以构建本体$O_{j}$来描述该疾病$d_{j}$的症状、风险因素和治疗方法。给定用户的文本帖子$\left(x_{1},x_{2},...,x_{N_{u}}\right)$和目标疾病$d_{j}$,我们旨在设计一个新的多提示函数$f_{prompt}\left(x_{1},x_{2},...,x_{N_{u}}\right)$来解决两个技术挑战:个性化提示和整合医学知识$O_{j}$的提示。作为方法基础,我们基于参数为$\Phi$的大语言模型$LM_{\Phi}$构建方案。预测结果为$y_{d_{j}}\in\{0,1\}$,其中1表示焦点用户$u$患有或将患有目标疾病$d_{j}$。形式上,精神障碍$d_{j}$检测问题是一个二元概率分类问题(公式7),适用于集合$D$中的所有疾病。

$$
\widehat{y_{d_{j}}}=argmax\ p\biggl(y_{d_{j}}\,\bigg|\,LM_{\Phi}\Bigl(f_{prompt}\bigl(x_{1},x_{2},...,x_{N_{u}}\bigr)\Bigr)\biggr)
$$

$$
\widehat{y_{d_{j}}}=argmax\ p\biggl(y_{d_{j}}\,\bigg|\,LM_{\Phi}\Bigl(f_{prompt}\bigl(x_{1},x_{2},...,x_{N_{u}}\bigr)\Bigr)\biggr)
$$
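Since $y_{d_{j}}$ is binary, the argmax in Eq. 7 reduces to thresholding the LLM's probability output. A minimal sketch (the 0.5 threshold and the example probabilities are illustrative assumptions):

```python
# Sketch of Eq. 7: the predicted label for disease d_j is the argmax over
# y in {0, 1} of the probability the LLM returns for the prompted input.
def detect(p_positive, threshold=0.5):
    """argmax over {0, 1} of p(y_dj | LM(f_prompt(x_1, ..., x_Nu)))."""
    return 1 if p_positive >= threshold else 0

print(detect(0.83))  # 1
print(detect(0.12))  # 0
```

Keeping the raw probability (rather than only the hard label) is what later enables the probability-based interpretation discussed in Section 4.5.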

3.2. Multi-prompt Engineering with Personalization and Knowledge Injection

3.2 多提示工程与个性化及知识注入

As outlined in the literature review, the objective of the prompt engineering function, $f_{p r o m p t}\bigl(x_{1},x_{2},...,x_{N_{u}}\bigr)$ is to leverage the capabilities of LLMs while streamlining the complexity of disease- or problem-specific model design. The key scholarly contribution of this study lies in the development of $f_{p r o m p t}\bigl(x_{1},x_{2},...,x_{N_{u}}\bigr)$.

如文献综述所述,提示工程函数 $f_{p r o m p t}\bigl(x_{1},x_{2},...,x_{N_{u}}\bigr)$的目标是充分利用大语言模型的能力,同时简化针对特定疾病或问题的模型设计复杂性。本研究的关键学术贡献在于开发了 $f_{p r o m p t}\bigl(x_{1},x_{2},...,x_{N_{u}}\bigr)$。

3.2.1. Automated Continuous Dynamic Prefix Tuning for Personalized Prompts

3.2.1. 个性化提示的自动连续动态前缀调优

When patients describe their experience on social media, their patterns and personas are distinct from each other. This motivates us to design a personalized prompt for each user for the mental disorder detection problem. We leverage the attributes of prefix tuning (Li and Liang 2021), where each prefix vector is customized for the individual input simultaneously, and adapt it to our multi-prompt method to achieve this goal. Specifically, we designate a one-dimensional vector $v$ of length $k$ for each user $u$

当患者在社交媒体上描述他们的经历时,他们的行为模式和人物特征各不相同。这促使我们为每位用户设计个性化的提示(prompt)来解决心理健康障碍检测问题。我们借鉴了前缀调优(prefix tuning) (Li and Liang 2021) 的属性(每个前缀向量同时为单个输入定制),并将其适配到我们的多提示方法中以实现这一目标。具体而言,我们为每个用户$u$指定一个长度为$k$的一维向量$v$

$$
f_{prompt\_prefix}\left(x\right)=[v\oplus x]
$$

$$
f_{prompt\_prefix}\left(x\right)=[v\oplus x]
$$

where $\oplus$ denotes concatenation and $k$ is a hyper-parameter. In the user base $U$ with $l$ users, a trainable matrix $P$ (of dimension $l\times(k+LM_{\Phi}\text{-}tokenizing(x))$) will be parameterized using a feedforward neural network, $P_{\theta}=MLP_{\theta}$, during training, where $\theta$ denotes the trainable parameters of the $MLP$ and is used to create unseen users' $f_{prompt\_prefix}(x)$. Each row of $P$ can be trained (i.e., reparameterized) simultaneously during the training process to reflect each user's unique characteristics, allowing for personalized prompts. Consequently, the expected feedback from the LLM is

其中 $\oplus$ 表示拼接操作,$k$ 是一个超参数。在包含 $l$ 个用户的用户库 $U$ 中,一个可训练的矩阵 $P$(维度为 $l\times(k+LM_{\Phi}\text{-}tokenizing(x))$)将通过前馈神经网络进行参数化,即 $P_{\theta}=MLP_{\theta}$。在训练过程中,$\theta$ 是 $MLP$ 的可训练参数,用于生成未见用户的 $f_{prompt\_prefix}(x)$。矩阵 $P$ 的每一行都可以在训练过程中同时进行训练(即重新参数化),以反映每个用户的独特特征,从而实现个性化提示。因此,大语言模型的预期反馈为
$$
z_i =\begin{cases} P_{\theta}[:k], & \text{if } i \leq k, \\LM_{\phi}(f_{prompt}(x), z_{<i}), & \text{o.w.}
\end{cases}
$$

$$
z_i =\begin{cases} P_{\theta}[:k], & \text{if } i \leq k, \\LM_{\phi}(f_{prompt}(x), z_{<i}), & \text{o.w.}
\end{cases}
$$

where $i$ indexes the $i$-th position of $z$. When $i\leq k$, $z_{i}$ directly copies from $P_{\theta}$; when $i>k$, $z_{i}$ still depends on $P_{\theta}$, as $f_{prompt}(x)$ (Section 3.2.3) relies on the activations of the preceding feedback.

其中 $i$ 表示 $z$ 的第 $i$ 个位置。当 $i\leq k$ 时,$z_{i}$ 直接从 $P_{\theta}$ 复制;当 $i>k$ 时,$z_{i}$ 仍依赖于 $P_{\theta}$,因为 $f_{prompt}(x)$(第3.2.3节)依赖于前序反馈的激活。
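The per-user prefix idea can be sketched in embedding space: one trainable prefix slice per user, concatenated in front of that user's token embeddings as in Eq. 8. This is a simplified sketch with illustrative dimensions, not the paper's exact parameterization (which reparameterizes $P$ through an MLP).

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of personalized prefixes: one trainable block of P per user in the
# user base U. Sizes are illustrative assumptions: 5 users, k = 4 virtual
# prefix tokens, hidden size 8.
l_users, k, hidden = 5, 4, 8
P = rng.normal(size=(l_users, k, hidden))       # one k-token prefix per user

def personalized_input(user_id, token_embs):
    """f_prompt_prefix(x) = [v (+) x] with v = P[user_id] (Eq. 8)."""
    return np.concatenate([P[user_id], token_embs], axis=0)

x_embs = rng.normal(size=(6, hidden))           # 6 tokens from this user
z = personalized_input(3, x_embs)
print(z.shape)  # (k + 6, hidden) -> (10, 8)
```

Because every user's row of `P` is updated jointly during training, each user ends up with a distinct learned prefix while sharing the frozen LLM.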

3.2.2. A New Rule-Based Prompt for Injecting Structural Medical Knowledge

3.2.2 注入结构化医学知识的新型基于规则的提示

The mental disorder detection task is at the subject level, using all the posts in a time period. Therefore, compared to other NLP tasks, the unique challenge of this task lies in the input to a machine learning prediction model, $x=\left(x_{1},x_{2},...,x_{N_{u}}\right)$, which contains ultra-long user-generated unstructured text content with variable lengths (see Section 4.1, Table 4; around 1,583,227 ~ 44,077,018 tokens per user). However, extracting valid information from these data is challenging for traditional machine learning models that rely on feature engineering. On the other hand, conventional end-to-end deep learning models may not be able to remember and learn from ultra-long dependencies in $x=\left(x_{1},x_{2},...,x_{N_{u}}\right)$ (e.g., RNNs) or may be constrained by a fixed sequence length established during training (e.g., Transformers), making handling a large number of arbitrary text posts challenging.

精神障碍检测任务是在一段时间内使用所有帖子在受试者层面进行的。因此,与其他NLP任务相比,该任务的独特挑战在于机器学习预测模型的输入$x=\bigg(x_{_ 1},x_{_ 2},...,x_{_ {N_{u}}}\bigg)$包含超长的用户生成非结构化文本内容,且长度可变(参见第4.1节表4,每位用户约1,583,227~44,077,018个token)。然而,对于依赖特征工程的传统机器学习模型而言,从这些数据中提取有效信息具有挑战性。另一方面,传统的端到端深度学习模型可能无法记忆和学习$x=\bigg(x_{_ 1},x_{_ 2},...,x_{_ {N_{u}}}\bigg)$中的超长依赖关系(例如RNN),或受限于训练时建立的固定序列长度(例如Transformer),这使得处理大量任意数量的文本帖子具有挑战性。

For standard prompt engineering methods, even with the assistance of highly potent LLMs, it remains a challenging task. This is due to (1) the presence of a significant amount of noise in the user-generated content (e.g., the text content can be unrelated to mental disorders, or similar symptoms shared across various mental disorders), making the prediction task difficult. (2) Considering the limited memory capacity of LLMs based on the number of parameters, most LLMs are insufficient to handle the extensive input required for user-level mental disorder detection. Even if the LLMs can handle such inputs, these models typically have a large number of parameters, imposing significant costs in applications. (3) If we employ a method, such as a discrete prompt that utilizes natural language and expects only binary outcomes (e.g., 1/0), it

对于标准的提示工程方法,即使借助高性能大语言模型,这仍然是一项具有挑战性的任务。原因在于:(1) 用户生成内容中存在大量噪声(例如文本内容可能与精神障碍无关,或不同精神障碍表现出相似症状),导致预测任务困难。(2) 考虑到基于参数数量的大语言模型记忆容量有限,大多数大语言模型不足以处理用户级精神障碍检测所需的大量输入。即便模型能够处理此类输入,这些模型通常参数量庞大,会带来高昂的应用成本。(3) 若采用离散提示等自然语言处理方法且仅预期二元输出(如1/0)

becomes challenging for stakeholders to determine which specific part of the user-generated content leads the LLM to conclude that the subject has a particular mental disorder. Motivated by these challenges, we propose a new rule-based prompt engineering method with three design principles, differing from existing approaches.

对于利益相关者来说,确定用户生成内容中哪一部分导致大语言模型得出受试者患有特定精神障碍的结论变得具有挑战性。基于这些挑战,我们提出了一种新的基于规则的提示工程方法,该方法遵循三个设计原则,与现有方法不同。

Design principle 1. Instead of using $x=\left(x_{1},x_{2},...,x_{N_{u}}\right)$ as the input, we expand the individual elements $x_{i}$ within $x$ and concatenate them into a long list of tokens, $x=\{t_{1},t_{2},...,t_{|x|}\}$, where $t_{j}$ indicates a single token and $|x|$ is the magnitude of $x$. We then design a sliding window $\overline{x}$ and assume $m$ sliding windows in total. Therefore, a sliding window $\overline{x}$ of size $w$ can be represented as $x[i;\ i+w]$, for $i=\{0,w,2w,...,(m-1)w\}$. The number $m$ and the size $w$ of the moving windows are hyper-parameters correlated and constrained by $|x|$. Empirically, let $|\overline{x}|\ll|x|$, and the content in the sliding window $\overline{x}=(t_{i},...,t_{i+(w-1)})$ is used as the new input of $f_{prompt}(\overline{x})$ and $LM_{\Phi}$, significantly reducing the negative impacts caused by ultra-long sequences in user-level disease detection tasks.

设计原则1:我们不直接使用$x=\left(x_{1},x_{2},...,x_{N_{u}}\right)$作为输入,而是将$x$中的各个元素$x_{i}$展开并拼接成一个长token序列$x=\{t_{1},t_{2},...,t_{|x|}\}$,其中$t_{j}$表示单个token,$|x|$表示$x$的规模。随后设计一个滑动窗口$\overline{x}$,假设共有$m$个滑动窗口。因此,大小为$w$的滑动窗口$\overline{x}$可表示为$x[i;\ i+w]$,其中$i=\{0,w,2w,...,(m-1)w\}$。移动窗口的数量$m$和大小$w$是与$|x|$相关且相互约束的超参数。经验表明,当$|\overline{x}|\ll|x|$时,将滑动窗口内容$\overline{x}=(t_{i},...,t_{i+(w-1)})$作为$f_{prompt}(\overline{x})$和大语言模型的新输入,能显著降低用户级疾病检测任务中超长序列带来的负面影响。
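Design principle 1 amounts to chunking one long token list into fixed-size windows. A minimal sketch (keeping the shorter final remainder window is one reasonable choice among several, assumed here for illustration):

```python
# Sketch of Design principle 1: flatten a user's posts into one token list
# {t_1, ..., t_|x|} and cut it into non-overlapping windows of size w, each
# of which becomes a separate, manageable input for the LLM.
def sliding_windows(tokens, w):
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]

tokens = [f"t{j}" for j in range(10)]   # stand-in for the concatenated posts
windows = sliding_windows(tokens, w=4)
print([len(win) for win in windows])    # [4, 4, 2]
```

Each window is far shorter than the full user history, so the LLM never has to model dependencies across millions of tokens at once.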

Another significant advantage of this approach is that the $LM_{\Phi}$ returns a probability for each $\overline{x}$, which reflects Eq. 7, i.e., the probability that the LLM considers $\overline{x}$ to be related to a mental disorder. In Section 4.5, we leverage this feature to interpret our prediction results. Additionally, through Design principles 2 and 3 discussed below, we can further determine which aspects of existing medical knowledge are related to $\overline{x}$, thus facilitating stakeholders in utilizing our method and results.

该方法另一个显著优势在于 $L M_{\Phi}$ 会为每个 $x$ 返回反映式(7)的概率值,即大语言模型判定 $x$ 与精神障碍相关的概率。在第4.5节中,我们利用这一特性来解释预测结果。此外,通过下文讨论的设计原则2和3,还能进一步确定现有医学知识中哪些方面与x相关,从而帮助利益相关方运用我们的方法和成果。

Design principle 2. Instead of directly relying on $L M_{\Phi}$ to determine whether a user $u$ has a mental disorder, we can break down this task into three sub-tasks: assessing whether a focal user $u$ exhibits: (1) symptoms, such as anxiety, fatigue, low mood, reduced self-esteem, change in appetite or sleep, suicide attempt, etc. (APA 2022, Martin et al. 2006, Rush et al. 2003); (2) major life event changes, such as divorce, body shape change, violence, abuse, drug or alcohol use, and so on (Beck and Alford 2014); and (3) treatments, such as medication, therapy, or a combination of these two (Beck and Alford 2014). The construction of this ontology is supported by medical literature and has been rigorously validated through quantitative methods and expert evaluation, as detailed in Appendix 1.

设计原则2:我们不是直接依赖 $L M_{\Phi}$ 来判断用户 $u$ 是否存在精神障碍,而是将该任务分解为三个子任务:评估目标用户 $u$ 是否表现出 (1) 症状 (如焦虑、疲劳、情绪低落、自尊心下降、食欲或睡眠改变、自杀企图等) (APA 2022, Martin et al. 2006, Rush et al. 2003);(2) 重大生活事件变化 (如离婚、体型改变、暴力、虐待、吸毒或酗酒等) (Beck and Alford 2014);(3) 治疗方式 (如药物治疗、心理治疗或二者结合) (Beck and Alford 2014)。该本体的构建得到医学文献支持,并通过定量方法和专家评估进行严格验证 (详见附录1)。

We decompose the task of detecting subject-level mental disorders into three subtasks, relying on the following assumptions: if the subject $u$ displays an increased number of symptoms, self-reports a greater number of life events that may cause or exacerbate disease $d_{_ {j}},$ or discusses the use of treatments associated with the disease $d_{_ {j}},$ the accumulation of such mentions suggests a higher likelihood that the subject $u$ currently suffers from or will suffer from the target disease $d_{_ {j}}$.

我们将检测个体层面精神障碍的任务分解为三个子任务,基于以下假设:若受试者$u$表现出更多症状、自述更多可能引发或加剧疾病$d_{_ {j}}$的生活事件、或提及与疾病$d_{_ {j}}$相关的治疗手段,这些叙述的累积表明受试者$u$当前或未来罹患目标疾病$d_{_ {j}}$的可能性更高。

Meanwhile, the reason for decomposing the task into these three subtasks is that, in user-generated text, these three aspects are often self-reported by users with mental disorders and are detectable (Coppersmith et al. 2015, Nadeem 2016, W. Zhang et al. 2024). It is worth noting that there are other indicators for detecting mental disorders, such as family history, genetics, and poor nutrition. However, since our research context revolves around user-generated text content, we concentrate on factors that are detectable from text.

同时,将任务分解为这三个子任务的原因是,在用户生成的文本中,这三个方面通常由心理健康障碍患者自行报告且可被检测到 (Coppersmith et al. 2015, Nadeem 2016, W. Zhang et al. 2024)。值得注意的是,检测心理健康障碍还存在其他指标,如家族史、遗传因素和营养不良等。但由于我们的研究场景围绕用户生成的文本内容展开,因此重点关注可被文本检测的因素。

We aim to design an inclusive method that can (1) identify users with easily noticeable signs of disease $d_{j}$ who are currently undergoing treatment and (2) provide early detection for users at risk of disease $d_{j}$ in the future, which includes detecting symptoms and life events that may exacerbate depression. In the disease detection task, treatment entities play a significant role. This is because, within the population of individuals discussing disease $d_{j}$, there is a subset of users who have already received clinical diagnoses and have undergone various treatments. When a user openly discusses treatments of disease $d_{j}$, it strongly indicates that the user likely has disease $d_{j}$. To ensure the comprehensiveness of our mental disorder detection method, which aims to identify as many patients as possible, treatment-related entities can be highly effective. On the other hand, when it comes to the early disease detection task, our method leverages other disease-related factors for early detection, including symptoms and major life event changes.

我们旨在设计一种包容性方法,能够:(1) 识别当前正在接受治疗且具有明显疾病$d_{_ j{}}$症状的用户,(2) 为未来可能患$d_{_ j{}}$疾病风险的用户提供早期检测,包括检测可能加剧抑郁的症状和生活事件。在疾病检测任务中,治疗实体起着重要作用。这是因为在讨论疾病$d_{_ {j}}$的人群中,存在一部分已获得临床诊断并接受过各种治疗的用户。当用户公开讨论$d_{j}$疾病的治疗时,强烈表明该用户很可能患有$d_{_j}$疾病。为确保我们旨在识别尽可能多患者的精神障碍检测方法的全面性,治疗相关实体可以非常有效。另一方面,在早期疾病检测任务中,我们的方法利用其他疾病相关因素进行早期检测,包括症状和重大生活事件变化。

Formally, we define the following logic rule:

我们定义如下逻辑规则:

$$
p(y_{d_{j}}^{s y m p t o m}|\overline{{x}})\vee p(y_{d_{j}}^{l i f e_{-}e v e n t}|\overline{{x}})\vee p(y_{d_{j}}^{t r e a t m e n t}|\overline{{x}})\rightarrow p(y_{d_{j}}|\overline{{x}})
$$

$$
p(y_{d_{j}}^{s y m p t o m}|\overline{{x}})\vee p(y_{d_{j}}^{l i f e_{-}e v e n t}|\overline{{x}})\vee p(y_{d_{j}}^{t r e a t m e n t}|\overline{{x}})\rightarrow p(y_{d_{j}}|\overline{{x}})
$$

where $\vee$ is the logical connective "or". The logical "or" ($\vee$) is inclusive, meaning that at least one of the $p(y_{d_{j}}^{f}\mid\overline{x})$, where $f\in\{symptom, life\_event, treatment\}$, must be true for the compound proposition $p(y_{d_{j}}\mid\overline{x})=1$ to be true. Practically, $p(y_{d_{j}}^{f}\mid\overline{x})$ represents the probability feedback from $LM_{\Phi}$, indicating whether $LM_{\Phi}$ judges $\overline{x}$ to be associated with the sub-task $f$. In actual calculations, for a user $u$, if more $\overline{x}$ within $x$ are determined by $LM_{\Phi}$ to be a "symptom," "life event," or "treatment" of mental disorder $d_{j}$, our framework will predict a correspondingly higher probability that this focal user $u$ has mental disorder $d_{j}$.

其中 $\vee$ 是逻辑连接词"或"。逻辑"或"($\vee$)是包含性的,这意味着对于复合命题 $p(y_{d_{j}}\mid\overline{x})=1$ 为真,至少需要 $p(y_{d_{j}}^{f}\mid\overline{x})$ 中的一个为真,其中 $f\in\{症状(symptom)、生活事件(life_event)、治疗(treatment)\}$。实际上,$p(y_{d_{j}}^{f}\mid\overline{x})$ 表示来自 $LM_{\Phi}$ 的概率反馈,表明 $LM_{\Phi}$ 是否判断 $\overline{x}$ 与子任务 $f$ 相关联。在实际计算中,对于用户 $u$,如果在 $x$ 中有更多被 $LM_{\Phi}$ 判定为精神障碍 $d_{j}$ 的"症状(symptom)"、"生活事件(life event)"或"治疗(treatment)"的 $\overline{x}$,我们的框架预测该焦点用户 $u$ 患有精神障碍 $d_{j}$ 的概率相应会更高。
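The inclusive-or rule above can be sketched as a small helper; the 0.5 decision threshold is an assumption for illustration, not a value specified in the paper:

```python
def disorder_indicator(sub_task_probs, threshold=0.5):
    """Inclusive logical 'or' over the three sub-task probabilities returned
    by LM_phi for one window x_bar: the compound proposition p(y_dj | x_bar)
    is true if ANY of p(y^symptom), p(y^life_event), p(y^treatment) exceeds
    the (assumed) decision threshold."""
    return any(p >= threshold for p in sub_task_probs.values())

# Probabilities like those fed back by LM_phi for one window
probs = {"symptom": 0.596, "life_event": 0.8789, "treatment": 0.0001}
flagged = disorder_indicator(probs)  # life_event alone suffices
```

Because the "or" is inclusive, a window describing only a treatment, or only a life event, still contributes evidence toward the user-level prediction.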

Design principle 3. Having $LM_{\Phi}$ directly determine whether $\overline{x}$ is a "symptom," "life event," or "treatment" of a mental disorder $d_{j}$ remains a challenging task, as the features of $d_{j}$ can be highly specific and complex, yet certain features of different mental disorders can be remarkably similar. For instance, feelings of excessive guilt or self-blame are often linked with depression but can also manifest in other disorders, including PTSD, anorexia, and self-harm. Accurately distinguishing between different mental disorders and providing effective interventions is crucial. For depression, it is important to create a friendly environment and offer information related to treatment. For PTSD, it is essential to avoid triggering information related to trauma. In the case of anorexia nervosa and self-harm, users may be at risk of life-threatening situations, necessitating more immediate help and intervention.

设计原则3:让 $L M_{\Phi}$ 直接判断 $x$ 是精神障碍 $d_{_ j{}}$ 的"症状"、"生活事件"还是"治疗手段"仍具挑战性,因为 $d_{_j{}}$ 的特征可能高度特异且复杂,但某些精神障碍的特征有时会惊人地相似。例如,过度愧疚或自责感通常与抑郁症相关,但也可能出现在PTSD、厌食症和自残等其他障碍中。准确区分不同精神障碍并提供有效干预至关重要:针对抑郁症需营造友好环境并提供治疗相关信息;处理PTSD时必须避免触发创伤相关信息;对于神经性厌食症和自残行为,用户可能面临生命危险,需要更即时的帮助与干预。

We can further enhance the performance of $L M_{_ \phi}$ by employing prompt engineering to clearly instruct $L M_{\Phi}$ on the specific characteristics of the {symptom, life_event, treatment} associated with $d_{_ j}.$ It is noteworthy that the specificity of these three aspects in various mental disorders is well-documented in the medical literature. If appropriately integrated, such existing medical knowledge can significantly alleviate the challenges faced by $L M_{\Phi}$ and predictive models in detecting mental disorders at the user level using user-generated text content. The injection of medical knowledge into prompt design, therefore, is a significant and promising direction.

我们可以通过提示工程 (prompt engineering) 进一步优化 $L M_{_ \phi}$ 的表现,明确指导 $L M_{\Phi}$ 识别与 $d_{_ j}$ 相关的 {症状、生活事件、治疗方式} 具体特征。值得注意的是,这三种要素在不同精神障碍中的特异性已在医学文献中得到充分记载。若能恰当整合,这类现有医学知识将大幅缓解 $L M_{\Phi}$ 和预测模型在用户层级通过生成文本内容检测精神障碍时所面临的挑战。因此,将医学知识注入提示设计是一个重要且前景广阔的研究方向。

Accordingly, to leverage medical domain knowledge for mental disorder detection, we follow previous studies and adopt the established mental disorder ontology $O_{j}$ for each disease $d_{j}$ that explicitly explains the terminologies used in disease $d_{j}$'s diagnosis and treatments (W. Zhang et al. 2024). The ontology $O_{j}$ focuses on specific aspects of disease $d_{j}$, particularly the medical terminologies used in diagnosing disease $d_{j}$ that are possible to detect from user-generated textual content, formally denoted as $O_{j\_f}$, $f\in\{symptom, life\_event, treatment\}$. The purpose of the $O_{j\_f}$ ontology is to facilitate the detection of symptoms, major life events, and treatments from user-generated text content. Based on an extensive literature review (APA 2022, Beck and Alford 2014, Martin et al. 2006, Rush et al. 2003), a list of concepts $o_{j_i}$ related to $d_{j}$'s diagnosis (e.g., dejected mood, self-blame, fatal illness, psychotherapy, etc.) is compiled. Next, we organize the terminologies $o_{j_i}$ into three classes for each disease $d_{j}$: symptom ($O_{j\_symptom}$, a collection of symptoms), life event ($O_{j\_life\_event}$, major life event changes that may exacerbate $d_{j}$), or treatment ($O_{j\_treatment}$, medications and therapies). Meanwhile, we determine the relationships between terminologies $o_{j_i}$ and classes $O_{j\_k}$ as $o_{j_i}$ : relation $O_{j\_k}$ (e.g., for "depression" and one of its symptoms "dejected mood", dejected mood : is_a $O_{depression\_symptom}$).

为利用医学领域知识进行精神障碍检测,我们遵循先前研究,采用既定的精神障碍本体$O_{j}$来描述每种疾病$d_{j}$的诊断和治疗术语(W. Zhang et al. 2024)。该本体$O_{j}$聚焦于疾病$d_{j}$的特定方面,尤其是可从用户生成文本内容中检测到的诊断术语,形式化表示为$O_{j\_f}$,$f\in\{症状(symptom)、生活事件(life event)、治疗(treatment)\}$。$O_{j\_f}$本体的目的是从用户生成文本中识别症状、重大生活事件和治疗方案。

基于广泛文献综述(APA 2022, Beck and Alford 2014, Martin et al. 2006, Rush et al. 2003),我们汇编了与$d_{j}$诊断相关的概念列表$o_{j_i}$(如情绪低落、自责、绝症、心理治疗等)。随后将这些术语$o_{j_i}$按三类组织:症状$(O_{j\_symptom}$,症状集合)、生活事件$(O_{j\_life\_event}$,可能加剧$d_{j}$的重大生活事件变化)和治疗方案$(O_{j\_treatment}$,药物与疗法)。同时建立术语$o_{j_i}$与类别$O_{j\_k}$的关系:$o_{j_i}$ : relation $O_{j\_k}$(例如"抑郁症"与其症状"情绪低落"的关系表示为:dejected mood : is_a $O_{depression\_symptom}$)。
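The ontology $O_j$ described above can be sketched as a nested mapping; the concept lists are trimmed versions of the paper's own examples, not the full ontology:

```python
# Minimal sketch of the disease ontology O_j for d_j = "depression":
# each class f in {symptom, life_event, treatment} maps to concepts o_{j_i}.
ontology = {
    "depression": {
        "symptom": ["anxiety", "dejected mood", "self-blame"],
        "life_event": ["divorce", "domestic violence"],
        "treatment": ["supportive psychotherapy", "abilify"],
    }
}

def is_a(concept, disease, klass, O):
    """Check the ontology triple  o_{j_i} : is_a  O_{j_k},
    e.g. 'dejected mood' : is_a  O_{depression_symptom}."""
    return concept in O.get(disease, {}).get(klass, [])
```

In the prompting method, each class's concept list is what later populates the corresponding verbalizer $V_{d_j}[mask]_f$.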

Adhering to the three design principles, we formulate our new rule-based prompting function as follows:

遵循这三项设计原则,我们将新的基于规则的提示函数定义如下:

$$
f_{prompt_rule}(\bar{x}) =
\begin{cases}
T_{d_j}(\bar{x}) = [T_{d_j}^{symptom}(\bar{x});T_{d_j}^{life_event}(\bar{x});T_{d_j}^{treatment}(\bar{x})], \\
V_{d_j}[mask]_ {symptom} = {o_{j_i}}, o_{j_i} \in O_{j_symptom}, \text{ when } p(y_{d_j}^{symptom} = 1|\bar{x}), \\
V_{d_j}[mask]_ {life_event} = {o_{j_i}}, o_{j_i} \in O_{j_life_event}, \text{ when } p(y_{d_j}^{life_event} = 1|\bar{x}), \\
V_{d_j}[mask]_ {treatment} = {o_{j_i}}, o_{j_i} \in O_{j_treatment}, \text{ when } p(y_{d_j}^{treatment} = 1|\bar{x})
\end{cases}
$$

$$
f_{prompt_rule}(\bar{x}) =
\begin{cases}
T_{d_j}(\bar{x}) = [T_{d_j}^{symptom}(\bar{x});T_{d_j}^{life_event}(\bar{x});T_{d_j}^{treatment}(\bar{x})], \\
V_{d_j}[mask]_ {symptom} = {o_{j_i}}, o_{j_i} \in O_{j_symptom}, \text{ when } p(y_{d_j}^{symptom} = 1|\bar{x}), \\
V_{d_j}[mask]_ {life_event} = {o_{j_i}}, o_{j_i} \in O_{j_life_event}, \text{ when } p(y_{d_j}^{life_event} = 1|\bar{x}), \\
V_{d_j}[mask]_ {treatment} = {o_{j_i}}, o_{j_i} \in O_{j_treatment}, \text{ when } p(y_{d_j}^{treatment} = 1|\bar{x})
\end{cases}
$$

where $T^{f}(\overline{x})=$ "$\overline{x}$ : relation $[mask]_{f}$ of $d_{j}$ $f$", $V_{d_{j}}[mask]_{f}=\{o_{j_i}\}$ denotes all the concepts in the ontology $O_{j}$ that belong to class $f$, and $f\in\{symptom, life\_event, treatment\}$.

其中 $T^{f}(\overline{x})=$ "$\overline{x}$ : relation $[mask]_{f}$ of $d_{j}$ $f$",$V_{d_{j}}[mask]_{f}=\{o_{j_i}\}$ 表示本体 $O_{j}$ 中属于类 $f$ 的所有概念,且 $f\in\{症状, 生活事件, 治疗\}$。

Existing medical knowledge, represented as ontology $O_{j}$, is injected into $f_{prompt\_rule}(\overline{x})$ in two ways and aids $f_{prompt\_rule}(\overline{x})$ in instructing the $LM_{\Phi}$ to better accomplish the user-level mental disorder detection task: (1) the relation (: relation) between concept $o_{j_i}$ and concept class $O_{j\_k}$ is injected into the prompt template $T_{d_{j}}(\overline{x})$, therefore directly instructing the $LM_{\Phi}$ learning objective (i.e., filling the $[mask]$), connecting text $\overline{x}$, disease $d_{j}$, and the three aspects of disease $f$; (2) the concepts $o_{j_i}$ of ontology $O_{j}$ are injected into the verbalizer $V_{d_{j}}$, which projects the original prediction goals (i.e., $\overline{x}$ is a "symptom," "life event," or "treatment" of a mental disorder $d_{j}$) to a set of label words (i.e., $o_{j_i}$). As the prediction goal is a binary classification problem, the verbalizer words for negative examples are designed using manual verbalization methods, incorporating the most frequent words with the highest sentiment tendency.

现有医学知识以本体 $O_{j}$ 形式通过两种方式注入 $f_{prompt\_rule}(\overline{x})$,辅助其指导 $LM_{\Phi}$ 更好地完成用户级精神障碍检测任务:(1) 概念 $o_{j_i}$ 与概念类 $O_{j\_k}$ 的关系 (: relation) 被注入提示模板 $T_{d_{j}}(\overline{x})$,从而直接指导 $LM_{\Phi}$ 的学习目标 (即填充 $[mask]$),连接文本 $\overline{x}$、疾病 $d_{j}$ 及疾病三方面特征 $f$;(2) 本体 $O_{j}$ 的概念 $o_{j_i}$ 被注入词表器 $V_{d_{j}}$,将原始预测目标 (即 $\overline{x}$ 是精神障碍 $d_{j}$ 的"症状""生活事件"或"治疗") 映射到一组标签词 (即 $o_{j_i}$)。由于预测目标是二分类问题,负例的词表器词汇采用人工词表化方法设计,结合了情感倾向最高的高频词。

Take “depression” as an example,

以“抑郁症”为例,

$$
\begin{align}
T_{depression}(\bar{x}) &= "\bar{x} \text{ is a } [mask]_ {depression} \text{ of depression symptom}; \bar{x} \text{ is a } [mask]_ {life_event} \text{ of depression life event}; \bar{x} \text{ is a } [mask]_ {treatment} \text{ of depression treatment.}" \\
V_{depression}[mask]_ {symptom} &= {anxiety, dejected mood,...} \\
V_{depression}[mask]_ {life_event} &= {divorce, domestic violence,...} \\
V_{depression}[mask]_{treatment} &= {supportive psychotherapy, abilify,...}
\end{align}
$$

$$
\begin{align}
T_{depression}(\bar{x}) &= "\bar{x} \text{ is a } [mask]_ {depression} \text{ of depression symptom}; \bar{x} \text{ is a } [mask]_ {life_event} \text{ of depression life event}; \bar{x} \text{ is a } [mask]_ {treatment} \text{ of depression treatment.}" \\
V_{depression}[mask]_ {symptom} &= {anxiety, dejected mood,...} \\
V_{depression}[mask]_ {life_event} &= {divorce, domestic violence,...} \\
V_{depression}[mask]_{treatment} &= {supportive psychotherapy, abilify,...}
\end{align}
$$
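The depression template above can be sketched as plain string construction; the exact wording is illustrative and follows the example template, not a verbatim reproduction of the paper's prompt:

```python
def build_rule_template(x_bar, disease):
    """Concatenate the three sub-task templates T^f(x_bar), one [mask] slot
    per aspect f in {symptom, life_event, treatment}, as in T_dj(x_bar)."""
    aspects = ["symptom", "life_event", "treatment"]
    parts = [
        f'"{x_bar}" is a [mask]_{f} of {disease} {f.replace("_", " ")}.'
        for f in aspects
    ]
    return " ".join(parts)

prompt = build_rule_template("I feel so lost after my divorce", "depression")
```

The verbalizers $V_{depression}[mask]_f$ then restrict what the LLM may place in each `[mask]` slot (e.g., "dejected mood" for the symptom slot, "divorce" for the life-event slot).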

Essentially, $f_{prompt\_rule}(\overline{x})$ instructs the $LM_{\phi}$ to evaluate whether $\overline{x}$ discloses disease $d_{j}$'s symptoms, life events, or treatments. This task is much simpler compared to the original user-level mental disorder detection task and, therefore, aids the $LM_{\phi}$ in performance improvement. For instance,

本质上,$f_{prompt_rule}(\overline{{x}})$ 指示 $LM_{_ \phi}$ 评估 $x$ 是否披露了疾病 $d_{_ j{}}$ 的症状、生活事件或治疗方法。相比原始的用户级心理检测任务,这项任务要简单得多,因此有助于提升 $LM_{_\phi}$ 的性能。例如,

consider the case where $\overline{x}=$ "I feel so lost after my divorce ...". The $f_{prompt\_rule}(\overline{x})$ directs the $LM_{\Phi}$ to discern whether $\overline{x}$ represents a symptom, major life event, or a treatment associated with depression. The feedback from the $LM_{\Phi}$ yields probabilities: $p(y_{depression}^{symptom}|\overline{x})=0.596$, $p(y_{depression}^{life\_event}|\overline{x})=0.8789$, and $p(y_{depression}^{treatment}|\overline{x})=0.0001$.

考虑这样一个案例:$\overline{x}=$ "I feel so lost after my divorce ..."。$f_{prompt\_rule}(\overline{x})$ 指导 $LM_{\Phi}$ 判断 $\overline{x}$ 代表的是抑郁症的症状、重大生活事件还是治疗。$LM_{\Phi}$ 反馈的概率结果为:$p(y_{depression}^{symptom}|\overline{x})=0.596$,$p(y_{depression}^{life\_event}|\overline{x})=0.8789$,$p(y_{depression}^{treatment}|\overline{x})=0.0001$。

This approach significantly reduces the difficulty for the $LM_{\phi}$ in determining whether a given $\overline{x}$ is related to $d_{j}$'s symptoms, life events, or treatments, allowing the $LM_{\phi}$ to focus on key areas already summarized by existing medical knowledge, as the verbalizers are derived from the disease ontology, helping the $LM_{\Phi}$ transform inputs into prediction labels (Liu et al. 2023).

这种方法显著降低了 $L M_{_ \phi}$ 判断给定 $x$ 是否与 $d_{_ j}^{_ {\perp}}$ 的症状、生活事件或治疗相关的难度,使 $L M_{_ \phi}$ 能够专注于现有医学知识已总结的关键领域,因为表述词源自疾病本体,帮助 $L M_{\Phi}$ 将输入转化为预测标签 (Liu et al. 2023)。

3.2.3. Prompt Ensembling of Multi-prompt Engineering for Mental Disorder Detection

3.2.3. 多提示工程的心理障碍检测提示集成

The prompt engineering methods $f_{prompt\_prefix}(\cdot)$ and $f_{prompt\_rule}(\cdot)$ focused on constructing a single prompt, each with a different motivation, for the mental disorder detection task using user-generated textual content. We now employ the prompt ensemble method to generate our multi-prompt function, $f_{prompt}(\cdot)$, for two reasons: (1) both $f_{prompt\_prefix}(\cdot)$ and $f_{prompt\_rule}(\cdot)$ are crucial in the context of mental disorder detection, and we need to combine them to accomplish the task together; (2) a significant body of research has demonstrated that the use of multiple prompts can further improve the efficacy of prompting methods (Liu et al. 2023).

提示工程方法 $f_{prompt\_prefix}(\cdot)$ 和 $f_{prompt\_rule}(\cdot)$ 专注于为利用用户生成文本内容进行精神障碍检测任务的不同动机构建单一提示。我们现采用提示集成方法生成多提示函数 $f_{prompt}(\cdot)$,原因有二:(1) 在精神障碍检测场景中,$f_{prompt\_prefix}(\cdot)$ 和 $f_{prompt\_rule}(\cdot)$ 都至关重要,需要协同完成任务;(2) 大量研究表明,使用多重提示能进一步提升提示方法的效能 (Liu et al. 2023)。

While there are several methods for creating a multi-prompt function, we have opted for the prompt ensemble method, which involves utilizing multiple prompts for a given input during the inference phase to make predictions. It serves three purposes: (1) leveraging the complementary advantages of both $f_{prompt\_prefix}(\cdot)$ and $f_{prompt\_rule}(\cdot)$; (2) addressing the challenges of prompt engineering by eliminating the need to select a single best-performing prompt; and (3) stabilizing performance on downstream tasks (Liu et al. 2023). Formally,

虽然创建多提示函数有几种方法,但我们选择了提示集成方法,即在推理阶段对给定输入使用多个提示进行预测。该方法有三个目的:(1) 利用 $f_{_ {p r o m p t_{-p r e f i x}}}(\cdot)$ 和 $f_{p r o m p t_r u l e}(\cdot)$ 的互补优势;(2) 通过无需选择单一最佳性能提示来解决提示工程面临的挑战;(3) 稳定下游任务的性能 (Liu et al. 2023)。形式上,

$$
f_{prompt}(\bar{x}) =\begin{cases} T_{d_j}(\bar{x}) = [v \oplus T_{d_j}^{symptom}(\bar{x});T_{d_j}^{life_event}(\bar{x});T_{d_j}^{treatment}(\bar{x})] \\
V_{d_j} = {V_{d_j}[mask]_ {symptom}, V_{d_j}[mask]_ {life_event}, V_{d_j}[mask]_{treatment}}
\end{cases}
$$

$$
f_{prompt}(\bar{x}) =\begin{cases} T_{d_j}(\bar{x}) = [v \oplus T_{d_j}^{symptom}(\bar{x});T_{d_j}^{life_event}(\bar{x});T_{d_j}^{treatment}(\bar{x})] \\
V_{d_j} = {V_{d_j}[mask]_ {symptom}, V_{d_j}[mask]_ {life_event}, V_{d_j}[mask]_{treatment}}
\end{cases}
$$

Note: if the $\bar{x} = \{t_1, t_2, \cdots, t_{|\bar{x}|}\}$ instances originate from the same $x = \{t_1, t_2, \cdots, t_{|x|}\}$, these $\bar{x}$ share the same prefix vector $v$.

注意:若 $\overline{x}=\{t_{1},t_{2},...,t_{|\overline{x}|}\}$ 源自同一个 $x=\{t_{1},t_{2},...,t_{|x|}\}$,则这些 $\overline{x}$ 共享相同的前缀向量 $v$。
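The prefix sharing above can be sketched with toy embeddings; the random vectors below are placeholders standing in for the learned continuous prefix and the LLM's token embeddings, purely for illustration:

```python
import numpy as np

def ensemble_inputs(user_windows, prefix_len=4, dim=8, seed=0):
    """All windows x_bar of the same user share ONE continuous prefix vector v
    (here a random placeholder for the learned prefix), prepended to each
    window's (toy) token embeddings before feeding LM_phi."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(prefix_len, dim))          # shared per-user prefix
    inputs = []
    for win in user_windows:
        tok_emb = rng.normal(size=(len(win), dim))  # placeholder embeddings
        inputs.append(np.concatenate([v, tok_emb], axis=0))
    return v, inputs

v, inputs = ensemble_inputs([["t1", "t2"], ["t3", "t4", "t5"]])
```

Because $v$ is created once per user, every window's input begins with identical prefix rows, which is what makes the prompt personalized at the user level.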

The input to the $LM_{\Phi}$ is the numerical vector representation of $T_{d_{j}}(\overline{x})$, which depends on the tokenizing method of $LM_{\phi}$; namely,

输入到 $L M_{\Phi}$ 的是数值向量表示 $T_{d_{j}}(\overline{{x}})$ ,这取决于 $L M_{\phi}$ 的分词方法;即,

$[v\oplus L M_{-}t o k e n i z i n g(T_{d_{j}}^{s y m p t o m}(\overline{{{x}}});T_{d_{j}}^{l i f e_{-}e v e n t}(\overline{{{x}}});T_{d_{j}}^{t r e a t m e n t}(\overline{{{x}}}))].$ The expected feedback from the model is

$[v\oplus L M_{-}t o k e n i z i n g(T_{d_{j}}^{s y m p t o m}(\overline{{{x}}});T_{d_{j}}^{l i f e_{-}e v e n t}(\overline{{{x}}});T_{d_{j}}^{t r e a t m e n t}(\overline{{{x}}}))].$ 模型预期的反馈是

$$
z_i =\begin{cases}
P_{\theta}[:k], & \text{if } i \leq k, \\
[mask]_ i z_{<i}, & \text{o.w.}
\end{cases}
$$

$$
z_i =\begin{cases}
P_{\theta}[:k], & \text{if } i \leq k, \\
[mask]_ i z_{<i}, & \text{o.w.}
\end{cases}
$$

The information within the mask is contingent upon two key factors: (1) $P_{\theta}$, serving as the prefix context, where all subsequent feedback hinges on the activations from the preceding feedback; (2) the template $T_{d_{j}}(\overline{x})$, which provides direct instructions and context to the $LM_{\phi}$, constraining how $LM_{\Phi}$ fills in the content of [mask] in $T_{d_{j}}(\overline{x})$.

掩码内的信息取决于两个关键因素:(1) $P_{\theta}$ 作为前缀上下文,后续所有反馈都依赖于先前反馈的激活;(2) 模板 $T_{d_{j}}(\overline{x})$ 为 $LM_{\phi}$ 提供直接指令和上下文,约束 $LM_{\Phi}$ 如何填充 $T_{d_{j}}(\overline{x})$ 中 [mask] 的内容。

The learning goal of our multi-prompt learning method is:

我们多提示学习方法的学习目标是:

$$
\begin{array}{r}{p(\boldsymbol{y}_ {d_{j}}|\boldsymbol{x})=\frac{1}{\lambda m}\displaystyle\sum_{f=1}^{r}\sum_{i=1}^{m}p_{\phi}\bigg([m a s k]_ {f}=L M_{\phi}\bigg(\boldsymbol{y}_{d_{j}}\bigg)|T_{d_{j}}(\overline{{\boldsymbol{x}}})\bigg)}\end{array}
$$

$$
\begin{array}{r}{p(\boldsymbol{y}_ {d_{j}}|\boldsymbol{x})=\frac{1}{\lambda m}\displaystyle\sum_{f=1}^{r}\sum_{i=1}^{m}p_{\phi}\bigg([m a s k]_ {f}=L M_{\phi}\bigg(\boldsymbol{y}_{d_{j}}\bigg)|T_{d_{j}}(\overline{{\boldsymbol{x}}})\bigg)}\end{array}
$$

where $r$ is the number of masked positions in $f_{prompt}(\overline{x})$ (in our context $r=3$), $[mask]_{f}=LM_{\Phi}\big(y_{d_{j}}\big)$ maps the class $y_{d_{j}}$ to the set of label words $V_{d_{j}}[mask]_{f}$, and $m$ is the number of sliding windows $\overline{x}$ in $x$.

其中 $r$ 为 $f_{prompt}(\overline{x})$ 中掩码位置的数量 (在我们的上下文中 $r=3$),$[mask]_{f}=LM_{\Phi}\big(y_{d_{j}}\big)$ 的作用是将类别 $y_{d_{j}}$ 映射到标签词集合 $V_{d_{j}}[mask]_{f}$,而 $m$ 是 $x$ 中滑动窗口 $\overline{x}$ 的数量。

The normalization term $\frac{1}{\lambda m}$ is introduced in Eq. 14 for two reasons: (1) $p(\cdot)$ represents probability feedback from $LM_{\phi}$, and the sum of multiple probabilities could exceed 1. The upper limit of $\sum_{f=1}^{r}\sum_{i=1}^{m}p(\cdot)$ is $m+r$ (if each $p(\cdot)$ returns a value of 1). Since $r$ is a very small number in our setting, we simplify the upper limit to $m$; the lower limit of this summation of multiple

归一化项 $\frac{1}{\lambda m}$ 在公式 14 中引入有两个原因:(1) $p(\cdot)$ 表示来自 $LM_{\phi}$ 的概率反馈,多个概率之和可能超过 1。$\sum_{f=1}^{r}\sum_{i=1}^{m}p(\cdot)$ 的上限为 $m+r$ (若每个 $p(\cdot)$ 返回值为 1)。由于 $r$ 在我们的设置中是非常小的数值,我们将上限简化为 $m$;该多重求和的下限

probabilities is 0 (if each $p(\cdot)$ returns a value of 0). Consequently, if we normalize $\sum_{i=1}^{m}p(\cdot)$ to the range [0, 1], $\left(\sum_{i=1}^{m}p(\cdot)-lower\_limit\right)/\left(upper\_limit-lower\_limit\right)\cdot(1-0)+0=\frac{1}{m}\sum_{i=1}^{m}p(\cdot)$. (2) Additionally, since we employ a sliding window $\overline{x}$ to break down $x$ and thereby simplify the mental disorder detection task, a possibility arises: multiple sliding window $\overline{x}$ instances may be describing the same symptoms, life events, or treatment for a disease $d_{j}$. Therefore, we also use $\frac{1}{m}$ as a penalization factor. Overall, we incorporate the normalization term $\frac{1}{\lambda m}$ into our learning goal, where $\lambda$ is a hyperparameter.

个概率的下限为 0 (若每个 $p(\cdot)$ 返回值为 0)。因此,若将 $\sum_{i=1}^{m}p(\cdot)$ 归一化至 [0, 1] 范围,则 $\left(\sum_{i=1}^{m}p(\cdot)-lower\_limit\right)/\left(upper\_limit-lower\_limit\right)\cdot(1-0)+0=\frac{1}{m}\sum_{i=1}^{m}p(\cdot)$。(2) 此外,由于我们采用滑动窗口 $\overline{x}$ 来分解 $x$ 以简化精神障碍检测任务,可能出现多个滑动窗口 $\overline{x}$ 实例描述同一疾病 $d_{j}$ 的症状、生活事件或治疗的情况。因此,我们也将 $\frac{1}{m}$ 作为惩罚因子。总体而言,我们将归一化项 $\frac{1}{\lambda m}$ 纳入学习目标,其中 $\lambda$ 为超参数。
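The aggregation in Eq. 14 can be sketched as follows; the window probabilities are made-up numbers and $\lambda=1$ is an illustrative choice:

```python
def user_level_probability(window_probs, lam=1.0):
    """Eq. 14 sketch: p(y_dj | x) = (1 / (lam * m)) * sum over the r sub-tasks
    and m sliding windows of the LM's [mask] probabilities.
    `window_probs` is a list of dicts, one per sliding window x_bar,
    keyed by sub-task f in {symptom, life_event, treatment}."""
    m = len(window_probs)
    total = sum(p for win in window_probs for p in win.values())
    return total / (lam * m)

# Two windows from the same user; only the first shows strong evidence
windows = [
    {"symptom": 0.6, "life_event": 0.9, "treatment": 0.0},
    {"symptom": 0.1, "life_event": 0.0, "treatment": 0.0},
]
score = user_level_probability(windows)  # (0.6+0.9+0.1) / 2 = 0.8
```

Dividing by $m$ keeps the score in a comparable range regardless of how many windows a prolific user generates, which is exactly the penalization role described above.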

4. EVALUATION

4. 评估

To evaluate the performance of the proposed method, we conduct the following examinations: (1) Comparison with benchmarks: we demonstrate the advantages of LLM-based prompt engineering over other machine learning paradigms in our research context. (2) Comparison with other prompt strategies: we highlight the benefits of continuous prompts over discrete prompts in maximizing the capabilities of LLMs within our binary mental disorder detection/classification task. (3) Ablation studies: we analyze the contribution of each component in our ensemble prompt to overall performance. (4) We present three experiments to illustrate the unique advantages of prompt engineering: few-shot learning, early identification, and generalizability to scenarios with very limited labeled data. (5) Post-analysis to explain the effectiveness of our method.

为评估所提方法的性能,我们进行了以下实验:(1) 基准对比:在本文研究场景中,验证基于大语言模型(LLM)的提示工程相比其他机器学习范式的优势。(2) 提示策略对比:在二元精神障碍检测/分类任务中,证明连续提示相较离散提示更能充分释放大语言模型的潜力。(3) 消融实验:分析集成提示中各组件对整体性能的贡献度。(4) 通过少样本学习、早期识别和标签数据极稀缺场景下的泛化能力三项实验,展示提示工程的独特优势。(5) 后效分析以阐释本方法的有效性。

4.1. Experimental Setup

4.1. 实验设置

All prompt engineering methods rely on pre-trained LLMs to accomplish downstream tasks. Therefore, the major characteristics of pre-trained LLMs, including their main training objective, type of text noising, auxiliary training objective, attention mask, and typical architecture, can significantly influence the performance of downstream tasks in the pre-train and prompt engineering learning paradigm (Liu et al. 2023). Thus, we review mainstream LLM architectures and describe how we chose one for our experiments. Most widely used LLMs adopt the Transformer architecture, which can be classified into three types based on their characteristics (see Table 3): decoder models, encoder models, and encoder-decoder models.

所有提示工程方法都依赖于预训练的大语言模型来完成下游任务。因此,预训练大语言模型的主要特征——包括其核心训练目标、文本噪声类型、辅助训练目标、注意力掩码和典型架构——会显著影响预训练与提示工程学习范式中的下游任务性能 (Liu et al. 2023)。为此,我们回顾主流大语言模型架构,并说明实验中的选择依据。当前广泛采用的大语言模型基于Transformer架构,根据特性可分为三类(见表3): 解码器模型、编码器模型和编码器-解码器模型。

Table 3. Summary of Mainstream Transformer-based LLMs

Category | Training | Examples
Decoder models (auto-regressive models) | Use only the decoder of a Transformer model. At each stage, for a given word, the attention layers can only access the words positioned before it in the sentence. Pretraining: predicting the next word in the sentence. | CTRL, GPT, GPT-2, TransformerXL, DeepSeek-R1
Encoder models (auto-encoding models) | Use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence (i.e., bi-directional attention). Pretraining: corrupting a given sentence (e.g., masking random words) and tasking the model with finding or reconstructing the initial sentence. | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa
Encoder-decoder models (sequence-to-sequence models) | Use both parts of the Transformer architecture. At each stage, the encoder attention layers can access all the words in the initial sentence, whereas the decoder attention layers can only access the words positioned before a given word in the input. Pretraining: using the objectives of encoder or decoder models, or replacing random words and predicting the masked words. | BART, Marian, T5

表 3: 主流基于Transformer的大语言模型总结

类别 训练方式 示例
解码器模型 (自回归模型) 仅使用Transformer模型的解码器部分。在每个阶段,对于给定单词,注意力层只能访问句子中位于它之前的单词。预训练目标:预测句子中的下一个单词。 CTRL, GPT, GPT-2, TransformerXL, DeepSeek-R1
编码器模型 (自编码模型) 仅使用Transformer模型的编码器部分。在每个阶段,注意力层可以访问初始句子中的所有单词(即双向注意力)。预训练目标:破坏给定句子(如随机掩码单词)并让模型寻找或重建初始句子。 ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa
编码器-解码器模型 (序列到序列模型) 使用Transformer架构的两部分。在每个阶段,编码器注意力层可以访问初始句子中的所有单词,而解码器注意力层只能访问输入中位于给定单词之前的单词。预训练目标:使用编码器或解码器模型的目标,或替换随机单词并预测被掩码的单词。 BART, Marian, T5

Given the specific nature of our experimental environment, prompt design (continuous prompt, prefix tuning, and rule-based prompt), prediction task (binary classification for subject-level mental disorder detection), and research context (chronic disease predictive analytics), we employed an exclusion-based approach to select appropriate LLMs. (1) We exclude encoder models (e.g., BERT and RoBERTa) since our prompt design incorporates prefix-tuning to achieve personalized prompts. The implementation of prefix tuning requires LLMs to allow the injection of past key values of Hugging Face LLMs (caching and reusing the intermediate hidden states from the previous steps) into the model. Encoder models on Hugging Face do not permit such operations. (2) We exclude decoder models because previous studies show that for supervised learning tasks with input-label pair data (as opposed to self-supervised learning), encoder-decoder models offer advantages by requiring fewer parameters and achieving better results compared to decoder-only models (Jiang et al. 2023). (3) We exclude non-open-source LLMs (e.g., GPT-4o) since our approach relies on continuous prompt engineering, which requires open-source LLMs to operate in the embedding space (However, non-open-source LLMs with discrete prompt can still be used as benchmark models). Moreover, since healthcare analytics problems often involve user-generated data or fine-grained patient-level information, utilizing such models could pose potential privacy issues. As a result, we selected FLAN-T5 (one of the best-performing open-source encoder-decoder models) as the base LLM for our experiments.

鉴于实验环境的特殊性、提示设计(连续提示、前缀调优和基于规则的提示)、预测任务(针对主体层面精神障碍检测的二元分类)以及研究背景(慢性病预测分析),我们采用基于排除的方法来选择合适的LLM。(1)我们排除了编码器模型(如BERT和RoBERTa),因为我们的提示设计结合了前缀调优以实现个性化提示。前缀调优的实施要求LLM允许将Hugging Face LLM的过去键值(缓存并重用先前步骤的中间隐藏状态)注入模型中,而Hugging Face上的编码器模型不支持此类操作。(2)我们排除了仅解码器模型,因为先前研究表明,对于带有输入-标签对数据的监督学习任务(与自监督学习相对),编码器-解码器模型相比仅解码器模型具有参数需求更少且效果更好的优势(Jiang et al. 2023)。(3)我们排除了非开源LLM(如GPT-4o),因为我们的方法依赖于连续提示工程,这需要开源LLM在嵌入空间中运行(不过,使用离散提示的非开源LLM仍可作为基准模型)。此外,由于医疗健康分析问题常涉及用户生成数据或细粒度患者层面信息,使用此类模型可能带来潜在的隐私问题。因此,我们选择FLAN-T5(性能最佳的开源编码器-解码器模型之一)作为实验的基础LLM。

For evaluation, we mainly use three datasets (Table 4) from the eRisk database (Losada et al. 2018, Parapar et al. 2021). Specifically, we selected the detection of depression, anorexia, and pathological gambling as the primary tasks given their prevalence and broad societal impact.

在评估中,我们主要使用来自eRisk数据库 (Losada et al. 2018, Parapar et al. 2021) 的三个数据集 (表4)。具体而言,我们选择抑郁症、厌食症和病理性赌博检测作为主要任务,考虑到它们的普遍性和广泛社会影响。

To assess the usability of the methods to new users, we use 60% of the data for training, 20% for validation, and 20% for testing. The reported results are the average performances of 10 experiment runs. We also report the standard deviation of these experiments to show our results' statistical significance. In our evaluations, we report AUC, F1-score, precision, and recall. The goal of the proposed method is to achieve the highest AUC and F1-score. It is important to note that in our research context, a model with high precision but low recall, or vice versa, does not necessarily indicate good overall performance. When precision (exactness: correctly identifying true patients) is low, false-positive patients will suffer from unnecessary mental burden and diagnostic costs. When recall (completeness: capturing as many patients as possible) is low, the practical utility of the model is compromised.

为了评估这些方法对新用户的可用性,我们使用60%的数据进行训练,20%用于验证,20%用于测试。报告结果是10次实验运行的平均性能。我们还报告了这些实验的标准差,以展示结果的统计显著性。在评估中,我们报告了AUC、F1分数、精确率和召回率。所提方法的目标是实现最高的AUC和F1分数。需要注意的是,在我们的研究背景下,一个高精确率但低召回率的模型,或者相反,并不一定代表整体性能良好。当精确率(准确性:正确识别真实患者)较低时,假阳性患者将承受不必要的心理负担和诊断成本。当召回率(完整性:尽可能多地捕捉患者)较低时,模型的实用性就会受到影响。
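The precision/recall trade-off discussed above can be made concrete with a small metric helper; this is a standard textbook computation, not code from the paper:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision (exactness: correctly identifying true patients),
    recall (completeness: capturing as many patients as possible),
    and F1 for the binary user-level detection task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels: one true patient missed (fn) and one false alarm (fp)
p, r, f = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

A model that predicts every user as positive drives recall to 1.0 while precision collapses, which is why the paper targets AUC and F1 rather than either metric alone.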

Table 4. Datasets Summary

Dataset | Label | # of subjects | # of posts | # of words | Avg # of posts per subject | Avg # of days from first to last post
Depression | P | 214 | 90,222 | 2,480,216 | 421 | 676
Depression | N | 1,493 | 986,360 | 22,461,242 | 660 | 664
Anorexia | P | 134 | 42,493 | 1,583,227 | 317 | 679
Anorexia | N | 1,153 | 781,768 | 16,781,263 | 678 | 848
Pathological gambling | P | 245 | 69,301 | 2,119,872 | 282 | 545
Pathological gambling | N | 4,182 | 2,088,002 | 44,077,018 | 499 | 663

Note: P = positive examples; N = negative examples.

表 4: 数据集摘要

数据集 | 标签 | 受试者数量 | 帖子数量 | 单词数量 | 每受试者平均帖子数 | 首次到最后一次发帖平均天数
抑郁症 | P | 214 | 90,222 | 2,480,216 | 421 | 676
抑郁症 | N | 1,493 | 986,360 | 22,461,242 | 660 | 664
厌食症 | P | 134 | 42,493 | 1,583,227 | 317 | 679
厌食症 | N | 1,153 | 781,768 | 16,781,263 | 678 | 848
病理性赌博 | P | 245 | 69,301 | 2,119,872 | 282 | 545
病理性赌博 | N | 4,182 | 2,088,002 | 44,077,018 | 499 | 663

注: $\mathrm{P}=$ 正例; N= 负例。

4.2. Comparison with Benchmark Methods

4.2. 与基准方法的比较

4.2.1. Comparison with Existing Mental Disorder Detection Methods

4.2.1. 与现有精神障碍检测方法的对比

We begin by comparing our proposed method with state-of-the-art approaches from other machine learning paradigms other than prompt engineering. The goal is to highlight the advantages of LLM-based prompt engineering over these alternative paradigms in the context of our research, which focuses on detecting mental disorders through user-generated content. Table 5 reports our results and compares with state-of-the-art methods for benchmarking (Benton et al. 2017, Chau et al. 2020, Choudhury et al. 2013, Coppersmith et al. 2014, Khan et al. 2021, Lin et al. 2020, Malviya et al. 2021, Preotiuc-Pietro et al. 2015, Reece et al. 2017).

我们首先将提出的方法与除提示工程(prompt engineering)外其他机器学习范式的最新方法进行比较,旨在凸显基于大语言模型的提示工程在我们研究背景下的优势——该研究专注于通过用户生成内容检测心理障碍。表5报告了我们的实验结果,并与以下基准测试的最新方法进行了对比(Benton et al. 2017, Chau et al. 2020, Choudhury et al. 2013, Coppersmith et al. 2014, Khan et al. 2021, Lin et al. 2020, Malviya et al. 2021, Preotiuc-Pietro et al. 2015, Reece et al. 2017)。

Table 5. Comparison with State-of-the-art Mental Disorder Detection Studies

| Dataset | Category | Model | AUC | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| Depression | Traditional machine learning with feature engineering | Choudhury et al. (2013) | 0.569 ± 0.001 | 0.588 ± 0.002 | 0.716 ± 0.001 | 0.569 ± 0.003 |
| | | Coppersmith et al. (2014) | 0.685 ± 0.002 | 0.705 ± 0.001 | 0.735 ± 0.003 | 0.685 ± 0.002 |
| | | Preotiuc-Pietro et al. (2015) | 0.723 ± 0.003 | 0.760 ± 0.002 | 0.820 ± 0.001 | 0.720 ± 0.001 |
| | | Benton et al. (2017) | 0.716 ± 0.029 | 0.722 ± 0.026 | 0.730 ± 0.022 | 0.716 ± 0.029 |
| | | Reece et al. (2017) | 0.729 ± 0.012 | 0.717 ± 0.011 | 0.708 ± 0.013 | 0.729 ± 0.011 |
| | | Chau et al. (2020) | 0.623 ± 0.006 | 0.573 ± 0.005 | 0.570 ± 0.002 | 0.622 ± 0.005 |
| | Deep learning with representation learning | CNN-based (Lin et al. 2020) | 0.710 ± 0.005 | 0.711 ± 0.004 | 0.728 ± 0.018 | 0.710 ± 0.005 |
| | | LSTM-based (Khan et al. 2021) | 0.765 ± 0.004 | 0.756 ± 0.004 | 0.751 ± 0.008 | 0.765 ± 0.004 |
| | | Transformer-based (Malviya et al. 2021) | 0.751 ± 0.002 | 0.734 ± 0.005 | 0.724 ± 0.010 | 0.751 ± 0.002 |
| | Ours | Flan-T5 + Prefix Tuning + Rule-based prompt | 0.915 ± 0.006 | 0.913 ± 0.003 | 0.888 ± 0.005 | 0.939 ± 0.005 |
| Anorexia | Traditional machine learning with feature engineering | Choudhury et al. (2013) | 0.669 ± 0.007 | 0.708 ± 0.009 | 0.793 ± 0.005 | 0.669 ± 0.007 |
| | | Coppersmith et al. (2014) | 0.800 ± 0.001 | 0.812 ± 0.003 | 0.826 ± 0.001 | 0.800 ± 0.001 |
| | | Preotiuc-Pietro et al. (2015) | 0.748 ± 0.006 | 0.758 ± 0.007 | 0.770 ± 0.002 | 0.748 ± 0.006 |
| | | Benton et al. (2017) | 0.714 ± 0.005 | 0.759 ± 0.007 | 0.877 ± 0.007 | 0.714 ± 0.000 |
| | | Reece et al. (2017) | 0.704 ± 0.005 | 0.770 ± 0.007 | 0.963 ± 0.007 | 0.704 ± 0.005 |
| | | Chau et al. (2020) | 0.707 ± 0.009 | 0.733 ± 0.002 | 0.770 ± 0.005 | 0.707 ± 0.009 |
| | Deep learning with representation learning | CNN-based (Lin et al. 2020) | 0.481 ± 0.009 | 0.601 ± 0.009 | 0.488 ± 0.007 | 0.783 ± 0.001 |
| | | LSTM-based (Khan et al. 2021) | 0.500 ± 0.000 | 0.666 ± 0.007 | 0.500 ± 0.000 | 1.000 ± 0.000 |
| | | Transformer-based (Malviya et al. 2021) | 0.771 ± 0.001 | 0.743 ± 0.002 | 0.846 ± 0.002 | 0.662 ± 0.007 |
| | Ours | Flan-T5 + Prefix Tuning + Rule-based prompt | 0.886 ± 0.005 | 0.877 ± 0.006 | 0.824 ± 0.003 | 0.938 ± 0.004 |
| Pathological gambling | Traditional machine learning with feature engineering | Choudhury et al. (2013) | 0.637 ± 0.005 | 0.704 ± 0.012 | 0.637 ± 0.001 | 0.976 ± 0.003 |
| | | Coppersmith et al. (2014) | 0.534 ± 0.006 | 0.549 ± 0.003 | 0.534 ± 0.002 | 0.970 ± 0.002 |
| | | Preotiuc-Pietro et al. (2015) | 0.586 ± 0.005 | 0.633 ± 0.003 | 0.586 ± 0.005 | 0.978 ± 0.004 |
| | | Benton et al. (2017) | 0.547 ± 0.066 | 0.575 ± 0.109 | 0.585 ± 0.119 | 0.575 ± 0.109 |
| | | Reece et al. (2017) | 0.573 ± 0.005 | 0.551 ± 0.008 | 0.871 ± 0.001 | 0.917 ± 0.034 |
| | | Chau et al. (2020) | 0.551 ± 0.001 | 0.772 ± 0.048 | 0.717 ± 0.040 | 0.917 ± 0.034 |
| | Deep learning with representation learning | CNN-based (Lin et al. 2020) | 0.818 ± 0.002 | 0.779 ± 0.006 | 0.643 ± 0.002 | 0.989 ± 0.005 |
| | | LSTM-based (Khan et al. 2021) | 0.610 ± 0.002 | 0.713 ± 0.009 | 0.563 ± 0.009 | 0.972 ± 0.007 |
| | | Transformer-based (Malviya et al. 2021) | 0.687 ± 0.005 | 0.551 ± 0.004 | 0.976 ± 0.009 | 0.384 ± 0.001 |
| | Ours | Flan-T5 + Prefix Tuning + Rule-based prompt | 0.989 ± 0.000 | 0.985 ± 0.002 | 0.985 ± 0.002 | 0.990 ± 0.003 |

表 5: 与最先进的精神障碍检测研究对比

| 数据集 | 类别 | 模型 | AUC | F1 | 精确率 | 召回率 |
| --- | --- | --- | --- | --- | --- | --- |
| 抑郁症 | 基于特征工程的传统机器学习 | Choudhury et al. (2013) | 0.569 ± 0.001 | 0.588 ± 0.002 | 0.716 ± 0.001 | 0.569 ± 0.003 |
| | | Coppersmith et al. (2014) | 0.685 ± 0.002 | 0.705 ± 0.001 | 0.735 ± 0.003 | 0.685 ± 0.002 |
| | | Preotiuc-Pietro et al. (2015) | 0.723 ± 0.003 | 0.760 ± 0.002 | 0.820 ± 0.001 | 0.720 ± 0.001 |
| | | Benton et al. (2017) | 0.716 ± 0.029 | 0.722 ± 0.026 | 0.730 ± 0.022 | 0.716 ± 0.029 |
| | | Reece et al. (2017) | 0.729 ± 0.012 | 0.717 ± 0.011 | 0.708 ± 0.013 | 0.729 ± 0.011 |
| | | Chau et al. (2020) | 0.623 ± 0.006 | 0.573 ± 0.005 | 0.570 ± 0.002 | 0.622 ± 0.005 |
| | 基于表示学习的深度学习 | CNN-based (Lin et al. 2020) | 0.710 ± 0.005 | 0.711 ± 0.004 | 0.728 ± 0.018 | 0.710 ± 0.005 |
| | | LSTM-based (Khan et al. 2021) | 0.765 ± 0.004 | 0.756 ± 0.004 | 0.751 ± 0.008 | 0.765 ± 0.004 |
| | | Transformer-based (Malviya et al. 2021) | 0.751 ± 0.002 | 0.734 ± 0.005 | 0.724 ± 0.010 | 0.751 ± 0.002 |
| | 我们的方法 | Flan-T5 + Prefix Tuning + Rule-based prompt | 0.915 ± 0.006 | 0.913 ± 0.003 | 0.888 ± 0.005 | 0.939 ± 0.005 |
| 厌食症 | 基于特征工程的传统机器学习 | Choudhury et al. (2013) | 0.669 ± 0.007 | 0.708 ± 0.009 | 0.793 ± 0.005 | 0.669 ± 0.007 |
| | | Coppersmith et al. (2014) | 0.800 ± 0.001 | 0.812 ± 0.003 | 0.826 ± 0.001 | 0.800 ± 0.001 |
| | | Preotiuc-Pietro et al. (2015) | 0.748 ± 0.006 | 0.758 ± 0.007 | 0.770 ± 0.002 | 0.748 ± 0.006 |
| | | Benton et al. (2017) | 0.714 ± 0.005 | 0.759 ± 0.007 | 0.877 ± 0.007 | 0.714 ± 0.000 |
| | | Reece et al. (2017) | 0.704 ± 0.005 | 0.770 ± 0.007 | 0.963 ± 0.007 | 0.704 ± 0.005 |
| | | Chau et al. (2020) | 0.707 ± 0.009 | 0.733 ± 0.002 | 0.770 ± 0.005 | 0.707 ± 0.009 |
| | 基于表示学习的深度学习 | CNN-based (Lin et al. 2020) | 0.481 ± 0.009 | 0.601 ± 0.009 | 0.488 ± 0.007 | 0.783 ± 0.001 |
| | | LSTM-based (Khan et al. 2021) | 0.500 ± 0.000 | 0.666 ± 0.007 | 0.500 ± 0.000 | 1.000 ± 0.000 |
| | | Transformer-based (Malviya et al. 2021) | 0.771 ± 0.001 | 0.743 ± 0.002 | 0.846 ± 0.002 | 0.662 ± 0.007 |
| | 我们的方法 | Flan-T5 + Prefix Tuning + Rule-based prompt | 0.886 ± 0.005 | 0.877 ± 0.006 | 0.824 ± 0.003 | 0.938 ± 0.004 |
| 病态赌博 | 基于特征工程的传统机器学习 | Choudhury et al. (2013) | 0.637 ± 0.005 | 0.704 ± 0.012 | 0.637 ± 0.001 | 0.976 ± 0.003 |
| | | Coppersmith et al. (2014) | 0.534 ± 0.006 | 0.549 ± 0.003 | 0.534 ± 0.002 | 0.970 ± 0.002 |
| | | Preotiuc-Pietro et al. (2015) | 0.586 ± 0.005 | 0.633 ± 0.003 | 0.586 ± 0.005 | 0.978 ± 0.004 |
| | | Benton et al. (2017) | 0.547 ± 0.066 | 0.575 ± 0.109 | 0.585 ± 0.119 | 0.575 ± 0.109 |
| | | Reece et al. (2017) | 0.573 ± 0.005 | 0.551 ± 0.008 | 0.871 ± 0.001 | 0.917 ± 0.034 |
| | | Chau et al. (2020) | 0.551 ± 0.001 | 0.772 ± 0.048 | 0.717 ± 0.040 | 0.917 ± 0.034 |
| | 基于表示学习的深度学习 | CNN-based (Lin et al. 2020) | 0.818 ± 0.002 | 0.779 ± 0.006 | 0.643 ± 0.002 | 0.989 ± 0.005 |
| | | LSTM-based (Khan et al. 2021) | 0.610 ± 0.002 | 0.713 ± 0.009 | 0.563 ± 0.009 | 0.972 ± 0.007 |
| | | Transformer-based (Malviya et al. 2021) | 0.687 ± 0.005 | 0.551 ± 0.004 | 0.976 ± 0.009 | 0.384 ± 0.001 |
| | 我们的方法 | Flan-T5 + Prefix Tuning + Rule-based prompt | 0.989 ± 0.000 | 0.985 ± 0.002 | 0.985 ± 0.002 | 0.990 ± 0.003 |

Existing mental disorder detection methods that utilize user-generated content can be mainly classified into two categories: (1) traditional machine learning methods with feature engineering and (2) deep neural networks with representation learning. Our proposed method shows significant performance improvement compared to models in both categories. Compared with the best-performing model in traditional machine learning, our method improves the F1 score by 0.153 on the depression dataset (Preotiuc-Pietro et al. 2015), 0.065 on the anorexia dataset (Coppersmith et al. 2014), and 0.213 on the pathological gambling dataset (Chau et al. 2020). Compared with the best-performing deep learning model, our method improves the F1 score by 0.157 on the depression dataset (Khan et al. 2021), 0.134 on the anorexia dataset (Malviya et al. 2021), and 0.206 on the pathological gambling dataset (Lin et al. 2020).

现有基于用户生成内容的精神障碍检测方法主要分为两类:(1) 基于特征工程的传统机器学习方法 (2) 基于表征学习的深度神经网络。相比这两类模型,我们提出的方法均展现出显著性能提升。与传统机器学习中表现最佳的模型相比,我们的方法在抑郁症数据集 (Preotiuc-Pietro et al., 2015) 上 F1 值提升 0.153,在厌食症数据集 (Coppersmith et al., 2014) 上提升 0.065,在病态赌博数据集 (Chau et al. 2020) 上提升 0.213。与表现最佳的深度学习模型相比,我们的方法在抑郁症数据集 (Khan et al. 2021) 上 F1 值提升 0.157,在厌食症数据集 (Malviya et al. 2021) 上提升 0.134,在病态赌博数据集 (Lin et al. 2020) 上提升 0.206。

Our method consistently outperforms state-of-the-art mental disorder detection methods across various disorders and datasets for the following reasons: (1) Noisy data: Capturing valid information from noisy user-generated content is a significant challenge for traditional machine learning models that rely on feature engineering. Similarly, conventional deep learning models may struggle with text sequences that are exceptionally lengthy (e.g., LSTMs) or that vary widely in length (e.g., Transformers). In contrast, our method, using LLMs with prompt engineering, is capable of extracting task-specific information from noisy text data. (2) Class imbalance: Moreover, the dataset is highly imbalanced (around six times more negative samples than positive samples), which reflects the real-world distribution of depression patients. It is highly challenging for the baseline models to excel in this context, while our method can resolve this issue with personalized prompts and knowledge injection.

我们的方法在各种精神障碍和数据集上始终优于最先进的精神障碍检测方法,原因如下:(1) 噪声数据:从用户生成内容的噪声中捕获有效信息对依赖特征工程的传统机器学习模型构成重大挑战。同样,传统深度学习模型可能难以处理异常冗长的文本序列(如LSTM),或难以应对长度差异极大的输入(如Transformer)。相比之下,我们采用提示工程的大语言模型方法能够从噪声文本数据中提取任务特定信息。(2) 类别不平衡:此外,数据集存在高度不平衡(阴性样本约为阳性样本的六倍),这反映了抑郁症患者在现实世界中的分布。基线模型很难在此情境下取得优异表现,而我们的方法通过个性化提示和知识注入能够解决这一问题。

However, we acknowledge that standard machine learning and deep learning methods are trained using only the given dataset. Moreover, they have relatively fewer model parameters compared to the LLM used in our experiments (i.e., Flan-T5). Thus, the significant performance gain could be partially due to the superior capability of the LLM itself. Nonetheless, we would like to clarify that the inherent ability of LLMs is not our main contribution. Our contribution focuses on prompt engineering design, which can further enhance the capability of LLMs. To support this, we demonstrate the effectiveness of multi-prompt engineering by comparing it with other prompt strategies (Section 4.2.2) and presenting the results of ablation studies (Section 4.3).

然而,我们承认标准的机器学习和深度学习方法仅使用给定数据集进行训练。此外,与实验中使用的LLM(即Flan-T5)相比,它们的模型参数相对较少。因此,显著的性能提升可能部分归因于LLM本身的卓越能力。尽管如此,我们需要澄清的是,LLM的固有能力并非我们的主要贡献。我们的贡献集中在提示工程(prompt engineering)设计上,这可以进一步增强大语言模型的能力。为验证这一点,我们将通过对比多提示工程与其他提示策略(第4.2.2节)的效果,并展示消融实验结果(第4.3节)来论证其有效性。

4.2.2. Comparison with Other Prompt Engineering Strategies

4.2.2. 与其他提示工程策略的对比

To further validate the effectiveness of our prompt engineering method, we compare it with other prompt engineering strategies (Table 6). (1) Continuous prompting vs. discrete prompting: Our method employs a continuous prompting strategy to achieve optimal classification results. In contrast, another approach is to directly query the LLM using a discrete prompt (i.e., human-understandable language) for mental disorder detection. (2) Prefix tuning in continuous prompting vs. personalized context in discrete prompting: Within our ensemble prompt design, we incorporate prefix tuning to capture personalized context among social media users and patients. A similar motivation in discrete prompting is context tailored to the social media user (i.e., using a small set of labeled cases as examples to provide context before the LLM generates feedback). We compare the effects of these two approaches. (3) Rule-based prompting in continuous prompting vs. chain-of-thought reasoning in discrete prompting: Our ensemble prompt design includes rule-based prompting to break down the complex mental disorder detection problem into subtasks, including identifying symptoms, life events, and treatments. In discrete prompting, a similar motivation is achieved through chain-of-thought reasoning (i.e., prompting an LLM with examples that break down a task into several sub-components can enhance model performance) (Wei et al. 2022). Following this approach, we provided brief reasoning as a chain of thoughts within a discrete prompt, guided by high-quality examples of mental disorder diagnoses that include relevant symptoms, major life events, and treatment. Building on (2), we further compare the effects of these two approaches.

为了进一步验证我们提示工程方法的有效性,我们将其与其他提示策略进行了对比(表6)。(1) 连续提示 vs 离散提示:我们的方法采用连续提示策略来获得最优分类结果。相比之下,另一种方法是直接使用离散提示(即人类可理解的语言)查询大语言模型进行精神障碍检测。(2) 连续提示中的前缀调优 vs 离散提示中的个性化上下文:在我们的集成提示设计中,我们采用前缀调优来捕捉社交媒体用户和患者的个性化上下文。离散提示中类似的动机是为社交媒体用户定制上下文(即在生成反馈前使用少量标注案例作为上下文示例)。我们比较了这两种方法的效果。(3) 连续提示中的基于规则提示 vs 离散提示中的思维链推理:我们的集成提示设计包含基于规则的提示,将复杂的精神障碍检测问题分解为识别症状、生活事件和治疗等子任务。在离散提示中,类似的动机通过思维链推理实现(即通过将任务分解为多个子组件的示例来提示大语言模型可以提升性能)(Wei et al. 2022)。基于此方法,我们在离散提示中提供了由高质量精神障碍诊断示例(包含相关症状、重大生活事件和治疗方案)引导的简要推理思维链。在(2)的基础上,我们进一步比较了这两种方法的效果。

According to our results, we observed several important findings: When comparing the performance of continuous and discrete prompts using the same LLM, continuous prompts significantly outperformed discrete prompts across all three datasets in terms of F1 score (0.611 vs. 0.529 for depression, 0.727 vs. 0.551 for anorexia, and 0.705 vs. 0.600 for pathological gambling). This improvement can be attributed to the fundamental difference between the two approaches: discrete prompts rely on predefined, natural language-based templates to achieve the prediction goal, whereas continuous prompts leverage trainable parameters to adjust prompt embeddings (Liu et al. 2023). In our task, optimizing predictive performance is of primary importance. Because continuous prompts can adjust their embeddings through trainable parameters, they have a greater potential to minimize loss.

根据我们的结果,我们观察到几个重要发现:当使用相同大语言模型比较连续提示(continuous prompts)和离散提示(discrete prompts)的性能时,连续提示在所有三个数据集的F1分数上都显著优于离散提示(抑郁症0.611 vs. 0.529,厌食症0.727 vs. 0.551,病态赌博0.705 vs. 0.600)。这种改进可归因于两种方法的根本差异:离散提示依赖预定义的基于自然语言的模板来实现预测目标,而连续提示利用可训练参数来调整提示嵌入(Liu et al. 2023)。在我们的任务中,优化预测性能是最重要的。由于连续提示可以通过可训练参数调整其嵌入,因此它们具有更大的潜力来最小化损失。
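The mechanical difference between the two strategies can be made concrete in a few lines: a discrete prompt is a fixed, human-readable string, whereas a continuous prompt is a small set of trainable vectors prepended to the frozen input embeddings. The toy dimensions and the prompt template below are illustrative assumptions, not the paper's configuration.

```python
import random

random.seed(0)
EMBED_DIM = 4   # toy embedding width (real LLMs use hundreds of dimensions)
PREFIX_LEN = 2  # number of trainable continuous prompt vectors

# Discrete prompt: a fixed, human-readable template around the user's posts.
discrete_prompt = ("Below is a collection of messages from a single user. "
                   "Is this individual experiencing depression? "
                   "Messages: {posts}")

# Continuous prompt: trainable vectors prepended to the input embeddings.
# Only these vectors receive gradient updates; the LLM itself stays frozen.
prefix = [[random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
          for _ in range(PREFIX_LEN)]

def embed(tokens):
    """Stand-in for the LLM's frozen token-embedding layer."""
    return [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in tokens]

def continuous_input(tokens):
    """The model consumes [prefix vectors] + [token embeddings]."""
    return prefix + embed(tokens)

seq = continuous_input(["i", "feel", "hopeless"])
print(len(seq))  # prints 5: PREFIX_LEN + number of tokens
```

Because the prefix lives in embedding space rather than vocabulary space, it can be optimized directly against the classification loss, which is the source of the advantage discussed above.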

Table 6. Comparison with Other Prompt Engineering Strategies

| Dataset | Prompt Strategy | Model | AUC | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| Depression | Continuous | Flan-T5 | 0.673 ± 0.005 | 0.611 ± 0.002 | 0.578 ± 0.002 | 0.647 ± 0.006 |
| | Discrete prompt | Flan-T5 | 0.673 ± 0.002 | 0.529 ± 0.001 | 0.368 ± 0.009 | 0.940 ± 0.001 |
| Anorexia | Continuous | Flan-T5 | 0.783 ± 0.000 | 0.727 ± 0.003 | 0.578 ± 0.003 | 0.979 ± 0.006 |
| | Discrete prompt | Flan-T5 | 0.623 ± 0.007 | 0.551 ± 0.007 | 0.524 ± 0.006 | 0.581 ± 0.008 |
| Pathological gambling | Continuous | Flan-T5 | 0.772 ± 0.007 | 0.705 ± 0.009 | 0.545 ± 0.005 | 1.000 ± 0.000 |
| | Discrete prompt | Flan-T5 | 0.746 ± 0.003 | 0.600 ± 0.008 | 0.908 ± 0.005 | 0.448 ± 0.008 |
| Depression | Continuous | Flan-T5 + Prefix Tuning | 0.704 ± 0.007 | 0.625 ± 0.003 | 0.488 ± 0.004 | 0.869 ± 0.000 |
| | Discrete prompt | Flan-T5 + Personalized context | 0.601 ± 0.000 | 0.368 ± 0.009 | 0.236 ± 0.008 | 0.833 ± 0.003 |
| Anorexia | Continuous | Flan-T5 + Prefix Tuning | 0.787 ± 0.005 | 0.784 ± 0.008 | 0.775 ± 0.000 | 0.794 ± 0.009 |
| | Discrete prompt | Flan-T5 + Personalized context | 0.474 ± 0.005 | 0.310 ± 0.007 | 0.262 ± 0.003 | 0.381 ± 0.000 |
| Pathological gambling | Continuous | Flan-T5 + Prefix Tuning | 0.881 ± 0.008 | 0.868 ± 0.007 | 0.781 ± 0.008 | 0.977 ± 0.003 |
| | Discrete prompt | Flan-T5 + Personalized context | 0.722 ± 0.005 | 0.590 ± 0.002 | 0.658 ± 0.005 | 0.534 ± 0.007 |
| Depression | Continuous | Flan-T5 + Prefix Tuning + Rule-based prompt (Ours) | 0.915 ± 0.000 | 0.913 ± 0.000 | 0.888 ± 0.000 | 0.939 ± 0.000 |
| | Discrete prompt | Flan-T5 + Personalized context + Chain of thought | 0.614 ± 0.001 | 0.401 ± 0.006 | 0.263 ± 0.002 | 0.847 ± 0.005 |
| Anorexia | Continuous | Flan-T5 + Prefix Tuning + Rule-based prompt (Ours) | 0.886 ± 0.005 | 0.877 ± 0.006 | 0.824 ± 0.003 | 0.938 ± 0.004 |
| | Discrete prompt | Flan-T5 + Personalized context + Chain of thought | 0.560 ± 0.003 | 0.454 ± 0.005 | 0.409 ± 0.008 | 0.510 ± 0.002 |
| Pathological gambling | Continuous | Flan-T5 + Prefix Tuning + Rule-based prompt (Ours) | 0.989 ± 0.000 | 0.985 ± 0.002 | 0.985 ± 0.002 | 0.990 ± 0.003 |
| | Discrete prompt | Flan-T5 + Personalized context + Chain of thought | 0.634 ± 0.006 | 0.489 ± 0.001 | 0.682 ± 0.009 | 0.381 ± 0.000 |

表 6: 与其他提示工程策略的对比

| 数据集 | 提示策略 | 模型 | AUC | F1 | 精确率 | 召回率 |
| --- | --- | --- | --- | --- | --- | --- |
| 抑郁症 | 连续提示 | Flan-T5 | 0.673 ± 0.005 | 0.611 ± 0.002 | 0.578 ± 0.002 | 0.647 ± 0.006 |
| | 离散提示 | Flan-T5 | 0.673 ± 0.002 | 0.529 ± 0.001 | 0.368 ± 0.009 | 0.940 ± 0.001 |
| 厌食症 | 连续提示 | Flan-T5 | 0.783 ± 0.000 | 0.727 ± 0.003 | 0.578 ± 0.003 | 0.979 ± 0.006 |
| | 离散提示 | Flan-T5 | 0.623 ± 0.007 | 0.551 ± 0.007 | 0.524 ± 0.006 | 0.581 ± 0.008 |
| 病态赌博 | 连续提示 | Flan-T5 | 0.772 ± 0.007 | 0.705 ± 0.009 | 0.545 ± 0.005 | 1.000 ± 0.000 |
| | 离散提示 | Flan-T5 | 0.746 ± 0.003 | 0.600 ± 0.008 | 0.908 ± 0.005 | 0.448 ± 0.008 |
| 抑郁症 | 连续提示 | Flan-T5 + Prefix Tuning | 0.704 ± 0.007 | 0.625 ± 0.003 | 0.488 ± 0.004 | 0.869 ± 0.000 |
| | 离散提示 | Flan-T5 + 个性化上下文 | 0.601 ± 0.000 | 0.368 ± 0.009 | 0.236 ± 0.008 | 0.833 ± 0.003 |
| 厌食症 | 连续提示 | Flan-T5 + Prefix Tuning | 0.787 ± 0.005 | 0.784 ± 0.008 | 0.775 ± 0.000 | 0.794 ± 0.009 |
| | 离散提示 | Flan-T5 + 个性化上下文 | 0.474 ± 0.005 | 0.310 ± 0.007 | 0.262 ± 0.003 | 0.381 ± 0.000 |
| 病态赌博 | 连续提示 | Flan-T5 + Prefix Tuning | 0.881 ± 0.008 | 0.868 ± 0.007 | 0.781 ± 0.008 | 0.977 ± 0.003 |
| | 离散提示 | Flan-T5 + 个性化上下文 | 0.722 ± 0.005 | 0.590 ± 0.002 | 0.658 ± 0.005 | 0.534 ± 0.007 |
| 抑郁症 | 连续提示 | Flan-T5 + Prefix Tuning + 基于规则的提示 (我们的方法) | 0.915 ± 0.000 | 0.913 ± 0.000 | 0.888 ± 0.000 | 0.939 ± 0.000 |
| | 离散提示 | Flan-T5 + 个性化上下文 + 思维链 | 0.614 ± 0.001 | 0.401 ± 0.006 | 0.263 ± 0.002 | 0.847 ± 0.005 |
| 厌食症 | 连续提示 | Flan-T5 + Prefix Tuning + 基于规则的提示 (我们的方法) | 0.886 ± 0.005 | 0.877 ± 0.006 | 0.824 ± 0.003 | 0.938 ± 0.004 |
| | 离散提示 | Flan-T5 + 个性化上下文 + 思维链 | 0.560 ± 0.003 | 0.454 ± 0.005 | 0.409 ± 0.008 | 0.510 ± 0.002 |
| 病态赌博 | 连续提示 | Flan-T5 + Prefix Tuning + 基于规则的提示 (我们的方法) | 0.989 ± 0.000 | 0.985 ± 0.002 | 0.985 ± 0.002 | 0.990 ± 0.003 |
| | 离散提示 | Flan-T5 + 个性化上下文 + 思维链 | 0.634 ± 0.006 | 0.489 ± 0.001 | 0.682 ± 0.009 | 0.381 ± 0.000 |

Note.

注.

Example 1: "&lt;insert a post from an individual user&gt;." This individual is identified as having {depression, anorexia, pathological gambling}.

示例1:"<插入单个用户的一条帖子>"。该用户被识别为患有{抑郁症、厌食症、病态赌博}。

Below is a collection of messages from a single user on a social media platform. Based on the content of these messages, can it be determined whether this individual is experiencing {depression, anorexia, pathological gambling}? Messages: "&lt;insert messages from an individual user&gt;"

以下是一位社交媒体平台用户的留言集。根据这些留言的内容,能否判断此人是否正在经历 {抑郁症、厌食症、病态赌博}?留言:“<插入个人用户的留言>”

Discrete prompt $^+$ Personalized context $^+$ Chain of thought: Researchers utilize social media to detect whether individuals exhibit signs of {depression, anorexia, pathological gambling}. Below are examples:

离散提示 $^+$ 个性化上下文 $^+$ 思维链:研究人员利用社交媒体检测个体是否表现出{抑郁、厌食症、病态赌博}迹象。示例如下:

Next, we compare two different prompt strategies for providing personalized context: prefix tuning in continuous prompts and directly incorporating personalized context in discrete prompts. As shown in our results, prefix tuning in continuous prompts consistently outperformed directly incorporating personalized context in discrete prompts across all three datasets in terms of F1 score (0.625 vs. 0.368 for depression, 0.784 vs. 0.310 for anorexia, and 0.868 vs. 0.590 for pathological gambling). We argue that the performance improvements observed in continuous prompts can be attributed to several key factors. First, prefix tuning as a continuous prompt method leverages trainable parameters to adjust prompt embeddings, allowing each individual's personalized context to be effectively captured; additionally, the relationships between individuals are also embedded through the tuning of these trainable parameters, ultimately enhancing the prediction outcomes (Li and Liang 2021). In contrast, discrete prompts rely on pre-established templates to achieve the prediction goal. In our research context, a complex mental disorder detection task, the patient's context is incorporated into the prompt alongside the prediction task. This approach significantly increases the difficulty of comprehending the task for the LLM. Moreover, information cannot be shared across individuals, since LLMs do not retain memory of previous cases, further limiting predictive performance. Our experimental results show that adding personalized context to discrete prompts not only fails to improve prediction accuracy but actually degrades performance.

接下来,我们比较两种提供个性化上下文的提示策略:连续提示中的前缀调优与离散提示中直接融入个性化上下文。结果显示,在三个数据集的F1分数上,连续提示的前缀调优均优于离散提示的直接融入(抑郁症0.625 vs. 0.368,厌食症0.784 vs. 0.310,病态赌博0.868 vs. 0.590)。我们认为连续提示的性能提升可归因于几个关键因素:首先,前缀调优作为连续提示方法利用可训练参数调整提示嵌入,能有效捕捉每个个体的个性化上下文;同时,个体间关系也通过这些可训练参数的调优被嵌入,最终提升预测效果 (Li and Liang 2021)。相比之下,离散提示依赖预设模板实现预测目标。在我们的研究场景——复杂精神障碍检测任务中,患者上下文与预测任务同时被纳入提示,这大幅增加了大语言模型理解任务的难度。此外,由于大语言模型不会保留先前案例记忆,信息无法跨个体共享,进一步限制了预测性能。实验表明,在离散提示中添加个性化上下文不仅无法提升预测准确率,反而会降低表现。
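To illustrate how trainable prompt parameters can adapt a frozen model, here is a deliberately tiny numerical sketch of the prefix-tuning idea: a fixed linear scorer stands in for the LLM, the labels differ from the frozen scorer's decision rule by an unknown offset, and gradient descent updates only the prefix. Everything here is synthetic and greatly simplified relative to the paper's actual pipeline.

```python
import math
import random

random.seed(1)
DIM = 8

# Frozen "LLM": a fixed linear scorer standing in for the language model.
w = [random.uniform(-1, 1) for _ in range(DIM)]
dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))

# Toy users: labels follow the frozen scorer up to an unknown offset that
# the prefix must absorb, mimicking task adaptation without touching the LLM.
data = []
for _ in range(40):
    x = [random.uniform(-1, 1) for _ in range(DIM)]
    data.append((x, 1 if dot(w, x) > 0.5 else 0))

prefix = [0.0] * DIM  # the only trainable parameters

def predict_prob(x):
    z = dot(w, x) + dot(w, prefix)  # the model consumes [prefix + input]
    return 1 / (1 + math.exp(-z))

lr = 0.05
for _ in range(300):  # gradient descent on the prefix alone; w never changes
    for x, y in data:
        err = predict_prob(x) - y  # d(BCE)/d(prefix_i) = err * w[i]
        for i in range(DIM):
            prefix[i] -= lr * err * w[i]

acc = sum((predict_prob(x) > 0.5) == bool(y) for x, y in data) / len(data)
print(f"accuracy after tuning only the prefix: {acc:.2f}")
```

Only the prefix vector changes during training; the frozen weights `w` are never touched, which mirrors why continuous prompts are cheap to optimize relative to full fine-tuning.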

Next, we compare two different prompt strategies for incorporating domain knowledge: rule-based prompts in continuous prompting and chain-of-thought prompting in discrete prompting. As shown in our results, rule-based prompts in continuous prompting significantly outperformed chain-of-thought prompting in discrete prompting across all three datasets in terms of F1 score (0.913 vs. 0.401 for depression, 0.877 vs. 0.454 for anorexia, and 0.985 vs. 0.489 for pathological gambling). We argue that several factors contribute to this performance improvement. First, the essence of rule-based prompts lies in leveraging domain knowledge to break down

接下来,我们比较了两种融入领域知识的提示策略:连续提示(continuous prompting)中的基于规则的提示(rule-based prompts)与离散提示(discrete prompting)中的思维链提示(chain-of-thought prompting)。结果显示,在三个数据集的F1分数上,连续提示中的基于规则提示均显著优于离散提示中的思维链提示(抑郁症0.913 vs. 0.401,厌食症0.877 vs. 0.454,病态赌博0.985 vs. 0.489)。我们认为这种性能提升源于多个因素:首先,基于规则提示的核心在于利用领域知识分解

complex problems into smaller, more manageable ones (Han et al. 2022). In our research design, these subproblems, informed by a structured mental disorder ontology from medical knowledge, lead to clearer prediction targets, a more comprehensive understanding of medical concepts, and more efficient use of domain knowledge. Collectively, this enhances the model's predictive capabilities. In contrast, chain-of-thought prompting provides reasoning steps and examples to the LLM (e.g., "This patient is diagnosed with depression because they experienced life event B and exhibit symptom A."). However, this method introduces medical knowledge in a fragmented and incomplete manner, offering LLMs only limited assistance from domain knowledge. As a result, its contribution to performance improvement remains minimal.

将复杂问题分解为更小、更易处理的部分 (Han et al. 2022) 。我们的研究设计中,这些子问题基于医学知识构建的精神障碍本体论,形成了更清晰的预测目标、更全面的医学概念理解以及更高效的领域知识运用,从而整体提升了模型的预测能力。相比之下,思维链提示 (chain-of-thought prompting) 虽然为大语言模型提供了推理步骤和示例 (例如"该患者被诊断为抑郁症,因为他们经历了生活事件B并表现出症状A...") ,但这种方法以碎片化且不完整的方式引入医学知识,仅能为大语言模型提供有限的领域知识辅助,因此对性能提升的贡献微乎其微。
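A minimal sketch of how such a rule-based decomposition might be assembled from an ontology follows. The three categories match the paper's subtasks (symptoms, life events, treatments), but the ontology entries and the prompt template are hypothetical illustrations, not the actual medical ontology or prompts used.

```python
# Hypothetical ontology fragment: category names follow the paper's
# subtasks, but the concept lists are illustrative placeholders.
ONTOLOGY = {
    "depression": {
        "symptoms": ["persistent sadness", "loss of interest", "insomnia"],
        "life events": ["job loss", "bereavement"],
        "treatments": ["antidepressants", "cognitive behavioral therapy"],
    },
}

def rule_based_prompts(disorder, posts):
    """Decompose detection into one sub-prompt per ontology category."""
    prompts = []
    for category, concepts in ONTOLOGY[disorder].items():
        prompts.append(
            f"Considering {category} of {disorder} such as "
            f"{', '.join(concepts)}, do the following posts mention any of "
            f"them? Posts: {posts}"
        )
    return prompts

for p in rule_based_prompts("depression", "I can't sleep; nothing feels fun."):
    print(p)
```

Each sub-prompt targets one well-defined medical concept class, which is the sense in which the rule-based design turns one hard question into several easier ones.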


Note: Ours: (1) Continuous prompt: Prefix Tuning $^+$ Rule-based prompt; (2) Number of parameters: Flan-T5 (Base): 250 million. ChatGPT-4o: (1) Discrete prompt: Personalized context $+$ Chain of thought; (2) Number of parameters: 1,800,000 million (1.8 trillion).


注: 我们的方法: (1) 连续提示: Prefix Tuning $^+$ 基于规则的提示; (2) 参数量: Flan-T5 (Base): 2.5亿。ChatGPT-4o: (1) 离散提示: 个性化上下文 $+$ 思维链; (2) 参数量: 约1.8万亿 (18,000亿)。

To further highlight the advantages of our proposed continuous prompt-based strategy over discrete prompting, we compare our method using FLAN-T5 (Continuous Prompt: Prefix Tuning $^+$ Rule-based Prompt) with a discrete prompt approach using ChatGPT-4o (Discrete Prompt: Personalized Context $+$ Chain of Thought). The difference in the number of parameters between these two LLMs is substantial: FLAN-T5 has 250 million parameters, whereas ChatGPT-4o has 1.8 trillion. Generally speaking, the number of parameters reflects an LLM's baseline capability when no additional techniques are applied. Despite the significant disparity in model size, our approach outperforms ChatGPT-4o with discrete prompting across three datasets in most cases in terms of F1 score (e.g., 0.913 vs. 0.646 for depression and 0.985 vs. 0.898 for pathological gambling). On the anorexia dataset, the performance of our method is comparable to that of ChatGPT-4o with discrete prompting (0.877 vs. 0.885).

为进一步凸显我们提出的基于连续提示策略相较于离散提示的优势,我们将采用FLAN-T5(连续提示:前缀调优$^+$基于规则的提示)的方法与使用ChatGPT-4o(离散提示:个性化上下文$+$思维链)的离散提示方法进行对比。这两个大语言模型的参数量差异显著:FLAN-T5拥有2.5亿参数,而ChatGPT-4o则达到1.8万亿。一般而言,参数量反映了未应用额外技术时大语言模型的基准能力。尽管模型规模存在巨大差距,我们的方法在三个数据集的F1分数上大多数情况下优于采用离散提示的ChatGPT-4o(例如抑郁症检测0.913 vs. 0.646,病态赌博0.985 vs. 0.898)。在厌食症数据集上,我们方法的性能与采用离散提示的ChatGPT-4o相当(0.877 vs. 0.885)。

The experiments described above provide several insights. Our research focuses on subject-level, user-generated content-based mental disorder detection, which is a binary classification problem. Given this context, continuous prompts can fully leverage the capabilities of LLMs to achieve optimal classification and prediction performance. Additionally, user-generated content contains significant noise, and there is substantial individual variability among patients. Given this context, effectively incorporating existing medical knowledge and highlighting individual differences can enhance the performance of LLMs. Continuous prompts have advantages over discrete prompts in both aspects: In our research design, prefix tuning enables the effective capture of personalized context, and it also embeds relationships between individuals through parameter tuning. Moreover, our rule-based tuning approach systematically integrates medical knowledge, further improving the model's performance. Furthermore, unlike discrete prompts, which increase the model's workload when incorporating personalized context and chains of thought, our approach alleviates the burden of LLMs in understanding the prediction task, thereby improving predictive performance.

上述实验提供了几点重要发现。我们的研究聚焦于基于用户生成内容的主体级心理障碍检测,这是一个二分类问题。在此背景下,连续提示(continuous prompts)能充分发挥大语言模型的潜力,实现最优的分类预测性能。此外,用户生成内容存在显著噪声,且患者个体差异较大。这种情况下,有效整合现有医学知识并突出个体差异可提升大语言模型表现。连续提示在两方面优于离散提示(discrete prompts):在我们的研究设计中,前缀调优(prefix tuning)能有效捕捉个性化语境,同时通过参数调优嵌入个体间关联关系。此外,基于规则的调优方法系统整合了医学知识,进一步提升了模型性能。更重要的是,与离散提示不同(后者在融入个性化语境和思维链时会增加模型负担),我们的方法减轻了大语言模型理解预测任务的压力,从而提升了预测表现。

In the experiments above, we observed that continuous prompts demonstrate significant performance improvements over discrete prompts. At the same time, they are computationally efficient. Discrete prompts may initially appear "low-cost" in computational terms due to the absence of training requirements; however, optimizing them for performance demands substantial time investment. In contrast, continuous prompts automate this optimization process with negligible computational overhead. Compared to full fine-tuning of LLMs, continuous prompts provide a streamlined alternative: they involve training only small, parameterized vectors appended to the input embeddings, reducing the training time to approximately 10% of that required for full LLM fine-tuning. This efficiency-performance trade-off is consistent with prior findings on parameter-efficient prompt tuning (Li and Liang 2021).

在上述实验中,我们观察到连续提示(continuous prompts)相比离散提示(discrete prompts)展现出显著的性能提升,同时具备更高的计算效率。离散提示由于无需训练,在计算层面看似"低成本",但为优化其性能往往需要投入大量时间。相比之下,连续提示以可忽略的计算开销自动完成这一优化过程。与完整的大语言模型微调相比,连续提示提供了一种更高效的替代方案:仅需训练附加在输入嵌入层的小型参数化向量,将训练时间缩短至完整微调所需时间的约10%。这种效率与性能的平衡与参数高效提示调优的已有研究发现一致 (Li and Liang 2021)。
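The parameter-efficiency point can be checked with back-of-the-envelope arithmetic: during continuous prompt tuning only the prefix vectors are trainable, a tiny fraction of the full model. The prefix hyperparameters below (length 20, hidden size 768, 12 layers) are typical prefix-tuning settings assumed for illustration, not necessarily the paper's exact configuration.

```python
# Back-of-the-envelope parameter accounting for continuous prompt tuning.
# The model size matches Flan-T5 (Base); the prefix dimensions are assumed.
llm_params = 250_000_000                 # frozen during prefix tuning
prefix_len, hidden_dim, layers = 20, 768, 12
prefix_params = prefix_len * hidden_dim * 2 * layers  # key/value states per layer

fraction = prefix_params / llm_params
print(f"trainable: {prefix_params:,} parameters "
      f"= {fraction:.3%} of full fine-tuning")
```

Under these assumptions, well under 1% of the parameters receive gradients, which is consistent with the much shorter training time reported above.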

4.3. Ablation Studies

4.3. 消融研究

Since our prompt engineering approach employs an ensemble method with multiple prompt components, we conduct ablation studies to assess the relative impact of each component on mental disorder detection tasks.

由于我们的提示工程方法采用了包含多个提示组件的集成方法,我们进行了消融研究以评估每个组件在精神障碍检测任务中的相对影响。

We first assess the impact of our ensemble prompt engineering method compared to using an LLM alone for prediction (Table 7). With its sophisticated architecture and large number of parameters, Flan-T5 outperforms most baseline methods, as we expected (see Table 6). However, adding our prompt engineering method further improves the predictive performance by a significant margin in terms of F1 score (i.e., 0.913 vs. 0.529 for depression, 0.877 vs. 0.727 for anorexia, and 0.985 vs. 0.705 for pathological gambling). This result ensures that the performance gain of our method is not just achieved by the predictive capabilities of LLMs. The innovative aspects of the proposed prompting solution further enhance the predictive performance.

我们首先评估了集成提示工程方法相比仅使用大语言模型进行预测的效果(表7)。正如预期,Flan-T5凭借其更复杂的结构和大量参数,在大多数基线方法中表现更优(见表6)。然而,加入我们的提示工程方法后,预测性能在F1分数上有了显著提升(抑郁症:0.913 vs. 0.529,厌食症:0.877 vs. 0.724,病态赌博:0.985 vs. 0.705)。这一结果证明,我们方法的性能提升并非仅源自大语言模型的预测能力,所提出的提示解决方案的创新性进一步增强了预测性能。

Table 7. Ablation Studies

| Dataset | Prompt Strategy | Model | AUC | F1 | Recall | Precision |
| --- | --- | --- | --- | --- | --- | --- |
| Depression | Ours − Prefix tuning − Rule-based prompt | Flan-T5 | 0.672 ± 0.005 | 0.529 ± 0.002 | 0.368 ± 0.002 | 0.940 ± 0.006 |
| | Ours − Rule-based prompt | Flan-T5 + Prefix tuning | 0.704 ± 0.007 | 0.625 ± 0.003 | 0.488 ± 0.004 | 0.869 ± 0.000 |
| | Ours − Prefix tuning | Flan-T5 + Rule-based prompt | 0.757 ± 0.008 | 0.794 ± 0.007 | 0.938 ± 0.000 | 0.689 ± 0.005 |
| | Ours | Flan-T5 + Prefix tuning + Rule-based prompt | 0.915 ± 0.006 | 0.913 ± 0.003 | 0.888 ± 0.005 | 0.939 ± 0.005 |
| Anorexia | Ours − Prefix tuning − Rule-based prompt | Flan-T5 | 0.783 ± 0.000 | 0.727 ± 0.003 | 0.578 ± 0.003 | 0.979 ± 0.006 |
| | Ours − Rule-based prompt | Flan-T5 + Prefix tuning | 0.787 ± 0.005 | 0.784 ± 0.008 | 0.775 ± 0.000 | 0.794 ± 0.009 |
| | Ours − Prefix tuning | Flan-T5 + Rule-based prompt | 0.850 ± 0.000 | 0.858 ± 0.008 | 0.912 ± 0.005 | 0.811 ± 0.001 |
| | Ours | Flan-T5 + Prefix tuning + Rule-based prompt | 0.886 ± 0.005 | 0.877 ± 0.006 | 0.824 ± 0.003 | 0.938 ± 0.004 |
| Pathological gambling | Ours − Prefix tuning − Rule-based prompt | Flan-T5 | 0.772 ± 0.007 | 0.705 ± 0.009 | 0.545 ± 0.005 | 1.000 ± 0.000 |
| | Ours − Rule-based prompt | Flan-T5 + Prefix tuning | 0.881 ± 0.008 | 0.868 ± 0.007 | 0.781 ± 0.008 | 0.977 ± 0.003 |
| | Ours − Prefix tuning | Flan-T5 + Rule-based prompt | 0.935 ± 0.002 | 0.932 ± 0.009 | 0.900 ± 0.000 | 0.968 ± 0.002 |
| | Ours | Flan-T5 + Prefix tuning + Rule-based prompt | 0.989 ± 0.000 | 0.985 ± 0.002 | 0.985 ± 0.002 | 0.990 ± 0.003 |

表7: 消融实验

| 数据集 | 提示策略 | 模型 | AUC | F1 | 召回率 | 精确率 |
| --- | --- | --- | --- | --- | --- | --- |
| 抑郁症 | Ours − Prefix tuning − Rule-based prompt | Flan-T5 | 0.672 ± 0.005 | 0.529 ± 0.002 | 0.368 ± 0.002 | 0.940 ± 0.006 |
| | Ours − Rule-based prompt | Flan-T5 + Prefix tuning | 0.704 ± 0.007 | 0.625 ± 0.003 | 0.488 ± 0.004 | 0.869 ± 0.000 |
| | Ours − Prefix tuning | Flan-T5 + Rule-based prompt | 0.757 ± 0.008 | 0.794 ± 0.007 | 0.938 ± 0.000 | 0.689 ± 0.005 |
| | Ours | Flan-T5 + Prefix tuning + Rule-based prompt | 0.915 ± 0.006 | 0.913 ± 0.003 | 0.888 ± 0.005 | 0.939 ± 0.005 |
| 厌食症 | Ours − Prefix tuning − Rule-based prompt | Flan-T5 | 0.783 ± 0.000 | 0.727 ± 0.003 | 0.578 ± 0.003 | 0.979 ± 0.006 |
| | Ours − Rule-based prompt | Flan-T5 + Prefix tuning | 0.787 ± 0.005 | 0.784 ± 0.008 | 0.775 ± 0.000 | 0.794 ± 0.009 |
| | Ours − Prefix tuning | Flan-T5 + Rule-based prompt | 0.850 ± 0.000 | 0.858 ± 0.008 | 0.912 ± 0.005 | 0.811 ± 0.001 |
| | Ours | Flan-T5 + Prefix tuning + Rule-based prompt | 0.886 ± 0.005 | 0.877 ± 0.006 | 0.824 ± 0.003 | 0.938 ± 0.004 |
| 病态赌博 | Ours − Prefix tuning − Rule-based prompt | Flan-T5 | 0.772 ± 0.007 | 0.705 ± 0.009 | 0.545 ± 0.005 | 1.000 ± 0.000 |
| | Ours − Rule-based prompt | Flan-T5 + Prefix tuning | 0.881 ± 0.008 | 0.868 ± 0.007 | 0.781 ± 0.008 | 0.977 ± 0.003 |
| | Ours − Prefix tuning | Flan-T5 + Rule-based prompt | 0.935 ± 0.002 | 0.932 ± 0.009 | 0.900 ± 0.000 | 0.968 ± 0.002 |
| | Ours | Flan-T5 + Prefix tuning + Rule-based prompt | 0.989 ± 0.000 | 0.985 ± 0.002 | 0.985 ± 0.002 | 0.990 ± 0.003 |

Next, we assess the impact of removing the rule-based component from our proposed method. This removal significantly reduces prediction performance in terms of F1 score, as evidenced by the following results: depression $(0.913 \rightarrow 0.625)$, anorexia $(0.877 \rightarrow 0.784)$, and pathological gambling $(0.985 \rightarrow 0.868)$. Similarly, we evaluate the effect of removing the prefix-tuning

接下来,我们评估从所提方法中移除基于规则组件所产生的影响。如下结果所示,这一移除显著降低了F1分数方面的预测性能:抑郁症 $(0.913 \rightarrow 0.625)$、厌食症 $(0.877 \rightarrow 0.784)$ 和病态赌博 $(0.985 \rightarrow 0.868)$。同样地,我们评估了移除前缀调优

component, which also leads to a substantial decline in performance: depression $(0.913 \rightarrow 0.794)$, anorexia $(0.877 \rightarrow 0.858)$, and pathological gambling $(0.985 \rightarrow 0.932)$. Overall, removing either component hinders prediction performance, indicating the effectiveness of both components in our method.

组件,这也会导致性能大幅下降:抑郁症 $(0.913 \rightarrow 0.794)$、厌食症 $(0.877 \rightarrow 0.858)$ 以及病态赌博 $(0.985 \rightarrow 0.932)$。总体而言,移除任一组件都会损害预测性能,表明我们方法中两个组件均有效。
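The ablation protocol amounts to evaluating every subset of prompt components. The driver below hard-codes the depression-dataset F1 scores from Table 7 in place of the real training pipeline, so the loop structure, not the `evaluate` body, is the point.

```python
from itertools import combinations

COMPONENTS = ("prefix tuning", "rule-based prompt")

def evaluate(components):
    """Placeholder for training and evaluating Flan-T5 with the given prompt
    components; returns the depression-dataset F1 scores from Table 7 rather
    than running the actual pipeline."""
    table7_f1 = {
        (): 0.529,
        ("prefix tuning",): 0.625,
        ("rule-based prompt",): 0.794,
        ("prefix tuning", "rule-based prompt"): 0.913,
    }
    return table7_f1[tuple(components)]

for r in range(len(COMPONENTS) + 1):
    for subset in combinations(COMPONENTS, r):
        name = " + ".join(subset) if subset else "(no prompt components)"
        print(f"Flan-T5 + {name}: F1 = {evaluate(subset):.3f}")
```

Enumerating all component subsets in one loop makes it hard to accidentally skip an ablation condition as new prompt components are added.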

4.4. The Advantage of LLM-based Prompt Engineering

4.4. 基于大语言模型 (LLM) 的提示工程优势

In this section, we present three experiments that highlight the unique advantages of prompt engineering: few-shot learning, early identification, and generalizability to scenarios with limited labeled data. Due to page limits, we focus on the depression dataset for the few-shot learning and early identification tasks, as it is the most common mental disorder (WHO 2022b). Additionally, we include a self-harm dataset, which contains only 41 labeled positive cases, to demonstrate our method's ability to generalize to rare or emerging diseases with limited labeled data.

在本节中,我们通过三个实验展示提示工程 (prompt engineering) 的独特优势:少样本学习 (few-shot learning)、早期识别以及在标注数据有限场景下的泛化能力。由于篇幅限制,我们选择抑郁症数据集进行少样本学习和早期识别任务分析,因为这是最常见的精神障碍 (WHO 2022b)。此外,我们引入了一个仅包含41个标注阳性案例的自残数据集,以证明该方法在标注数据稀缺情况下对罕见或新兴疾病的泛化能力。

4.4.1. Few-shot Learning Capability

4.4.1. 少样本 (Few-shot) 学习能力

One salient benefit of using prompt engineering compared to existing mental disorder detection methods is the ability to perform few-shot learning with minimal labeled data, powered by general-purpose language understanding capabilities. Thus, in this section, we assess the efficacy of our proposed prompt engineering method in accomplishing few-shot learning on the depression detection task. To this end, we substantially decreased the number of training examples from the original training set of approximately 1,000 subjects (i.e., $1,707 \times 60\%$) to 100, 10, and 2 subjects. Figure 4 shows the results of this experiment.

与现有精神障碍检测方法相比,使用提示工程 (prompt engineering) 的一个显著优势是能够通过通用语言理解能力,以极少的标注数据实现少样本学习。因此,本节我们评估所提出的提示工程方法在抑郁症检测任务中实现少样本学习的有效性。为此,我们将训练样本数量从原始的1,000名受试者训练集(即原始训练集 $1{,}707\times60\%$)大幅减少至100、10和2名受试者。图4展示了该实验的结果。
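As a rough sketch of this setup, the few-shot subsets might be drawn as follows; `few_shot_subsample` and the balanced-sampling choice are illustrative assumptions, not the paper's exact protocol:

```python
import random

def few_shot_subsample(subjects, labels, k, seed=42):
    """Draw a few-shot training set of k subjects from the full training set.

    Illustrative sketch only: the paper reduces the depression training set
    (~1,000 subjects, i.e., 1,707 x 60%) to 100, 10, and 2 subjects; keeping
    the subset balanced per class is our assumption.
    """
    rng = random.Random(seed)
    pos = [s for s, y in zip(subjects, labels) if y == 1]
    neg = [s for s, y in zip(subjects, labels) if y == 0]
    # Even k=2 should contain one example per class.
    k_pos = max(1, k // 2)
    k_neg = k - k_pos
    return rng.sample(pos, k_pos) + rng.sample(neg, k_neg)

# Toy stand-in for the ~1,000-subject training set.
train_1000 = list(range(1000))
labels = [1 if i < 300 else 0 for i in train_1000]
for k in (100, 10, 2):
    subset = few_shot_subsample(train_1000, labels, k)
    assert len(subset) == k
```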


Figure 4. Few-shot Learning Performance on the Depression Dataset

图 4: 抑郁症数据集上的少样本学习性能

The results indicate that, in contrast to the baseline models, our prompt engineering approach can achieve satisfactory performance even with a significantly smaller training set. In comparison to the best-performing benchmarks, our method demonstrates significant improvement in terms of F1 scores. For instance, the best-performing deep learning model and machine learning model on the depression dataset showed F1-scores of 0.756 and 0.760, respectively, while our method achieved 0.874 with just two training samples. Moreover, as the training size increases, there is a corresponding enhancement in terms of F1 score.

结果表明,与基线模型相比,我们的提示工程方法即使在训练集显著缩小的情况下也能取得令人满意的性能。相较于表现最佳的基准模型,我们的方法在F1分数上展现出显著提升。例如,在抑郁症数据集上表现最优的深度学习模型和机器学习模型的F1分数分别为0.756和0.760,而我们的方法仅用两个训练样本就达到了0.874。此外,随着训练规模扩大,F1分数也呈现相应提升。

It is noteworthy that other models fail to converge or achieve satisfactory performance when the number of training examples is within the range of 2, 10, or 100. The superior performance of our prompt engineering method in few-shot learning has several important implications in the context of mental disorder detection. First, it reduces the demand for a large amount of labeled training data, significantly alleviating the burden on researchers and platforms and paving the way for large-scale, continuous monitoring of mental disorders. Furthermore, for emerging diseases with rare cases and scarce examples, our proposed method holds promise in quickly capturing the necessary predictive information through personalized prompts and medical knowledge injection with minimal training data, thus achieving satisfactory prediction outcomes (we also demonstrate the generalizability of our method to identifying rare mental disorders in a later section). It provides unique opportunities for flexible and rapid monitoring of the emergence and spread of new diseases in the population through user-generated content, which existing methods are unable to achieve.

值得注意的是,当训练样本数量在2、10或100的范围内时,其他模型无法收敛或取得令人满意的性能。我们的提示工程方法在少样本学习中的卓越表现,对精神障碍检测领域具有多重重要意义。首先,该方法降低了对大量标注训练数据的需求,显著减轻了研究者和平台的负担,为大规模持续监测精神障碍铺平了道路。此外,对于病例稀少的新发疾病,我们提出的方法有望通过个性化提示和最小训练数据下的医学知识注入,快速捕捉必要的预测信息(我们将在后续章节展示该方法在识别罕见精神障碍方面的泛化能力),从而获得理想的预测结果。这为通过用户生成内容灵活快速地监测人群中新发疾病的出现与传播提供了独特机遇,这是现有方法无法实现的。

4.4.2. Early Identification of Depression

4.4.2. 抑郁症早期识别

In addition to few-shot learning, early identification is essential when detecting mental disorders to provide timely warning cues to platforms, medical professionals, and potential patients. It facilitates timely reminders and proactive treatments, thereby preventing the exacerbation of mental disorders and the resulting burden of disease. Using depression as a research case, by taking the onset of depression as the anchor point, we incorporate data from various time frames before the onset of depression, specifically $x$ weeks before the onset of depression, where

除了少样本学习外,早期识别对于检测心理障碍至关重要,以便为平台、医疗专业人员和潜在患者提供及时预警信号。这有助于及时提醒和主动治疗,从而防止心理障碍恶化及其导致的疾病负担。以抑郁症为研究案例,我们将抑郁症发作作为锚点,纳入抑郁症发作前不同时间段的数据,具体为抑郁症发作前 $x$ 周,其中

$x=\{2,4,8,24\}$, while using a one-month (4-week) duration of data as input. A significant challenge in this early prediction lies in the fact that, because we only use one month of data as input, the average number of posts per user decreases significantly, and the information pertaining to depression becomes increasingly sparse (the input may even lack any depression-related information). The experiment results (see Figure 5) indicate that even in such a challenging context, our proposed prompt engineering method shows a relatively stable F1-score (ranging between 0.727 and 0.738). This result stands in contrast to traditional machine learning and deep learning models that rely on the entire available data for prediction.

$x=\{2,4,8,24\}$,同时使用一个月(4周)的数据时长作为输入。这种早期预测的一个重大挑战在于,我们仅使用一个月时长的数据作为输入,每位用户的平均发帖量显著下降,与抑郁相关的信息也愈发稀疏(甚至可能缺失与抑郁相关的信息)。实验结果表明(见图5),即便在这种极具挑战性的情境下,我们提出的提示工程方法仍能保持相对稳定的F1分数(介于0.727至0.738之间)。这一结果与依赖全部可用数据进行预测的传统机器学习和深度学习模型形成了对比。
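The onset-anchored input windows described above (a 4-week window ending $x$ weeks before onset) can be sketched as follows; the post/timestamp layout is an illustrative assumption:

```python
from datetime import datetime, timedelta

def onset_window_posts(posts, onset, weeks_before, window_weeks=4):
    """Select posts in a window of `window_weeks` weeks ending
    `weeks_before` weeks before the onset of depression."""
    end = onset - timedelta(weeks=weeks_before)
    start = end - timedelta(weeks=window_weeks)
    return [p for p in posts if start <= p["time"] < end]

# Toy user history: one post per week for 29 weeks before onset.
onset = datetime(2023, 6, 1)
posts = [{"time": onset - timedelta(weeks=w), "text": f"post {w}"}
         for w in range(1, 30)]

for x in (2, 4, 8, 24):
    window = onset_window_posts(posts, onset, x)
    # Every selected post predates the cutoff x weeks before onset.
    assert all(p["time"] < onset - timedelta(weeks=x) for p in window)
```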


Figure 5. Early Prediction Analyses

图 5: 早期预测分析

We also examine how far back in time our framework's input needs to be traced for prediction. For real-world applications, utilizing the user's entire historical data for prediction can result in lengthy sequences and substantial data, posing challenges in data gathering and storage. We utilize data from $x$ weeks before the prediction until the prediction time as input ($x=\{2,4,8,24\}$). The challenge in this prediction task also lies in that, as the input period shortens, the number of posts decreases and signals related to depression weaken. The experimental results (Figure 6) show that the performance of our method is not significantly affected by the shortened time window. It achieves F1 scores over 0.8 using data from 12-24 weeks (i.e., 3-6 months) before the prediction ($\mathrm{F1}=0.826$), and it shows an F1 score close to 0.8 even when

我们还研究了预测时需要回溯框架输入的时间范围。对于实际应用而言,使用用户全部历史数据进行预测会导致序列过长、数据量庞大,给数据收集和存储带来挑战。我们采用预测前$x$周至预测时刻的数据作为输入($x=\{2,4,8,24\}$)。该预测任务的难点还在于:随着输入周期缩短,发帖数量减少,与抑郁相关的信号也会减弱。实验结果(图6)表明,我们的方法性能未受时间窗口缩短的显著影响,使用预测前12-24周(即3-6个月)数据时F1分数仍超过0.8($\mathrm{F1}=0.826$),甚至在...

using only two weeks' data ($\mathrm{F1}=0.793$). These results indicate that our method can yield a satisfactory prediction performance when there is data available from a specific period before the prediction rather than the entire historical data, enhancing the applicability of our method.

仅使用两周数据($\mathrm{F1}=0.793$)。这些结果表明,当预测前特定时间段有可用数据而非全部历史数据时,我们的方法仍能取得令人满意的预测性能,从而增强了该方法的适用性。


Figure 6. Time Window of Input Posts

图 6: 输入帖子的时间窗口

4.4.3. Generalizability to Rare Diseases

4.4.3. 对罕见病的泛化能力

We further demonstrate how our method, leveraging its few-shot learning capability, can be generalized to emerging or rare diseases using the self-harm dataset from the eRisk database, where labeled positive data records are scarce (i.e., 41 subjects).

我们进一步展示了如何利用我们方法的少样本学习能力,将其推广至eRisk数据库中自残数据集上的新兴或罕见疾病,其中标记的阳性数据记录稀缺(即41名受试者)。

Table 8. Rare Disease Dataset Summary

| Dataset | | # of subjects | # of posts | # of words | Avg # of posts per subject | Avg # of days from first to last post |
| --- | --- | --- | --- | --- | --- | --- |
| Self-harm | P | 41 | 6,927 | 171,789 | 169 | 495 |
| | N | 299 | 163,506 | 3,073,912 | 546 | 500 |

Note: $\mathrm{P}=$ positive examples; N= negative examples.

表 8: 罕见疾病数据集统计

| 数据集 | | 受试者数量 | 帖子数量 | 单词总数 | 每受试者平均帖子数 | 首末帖平均间隔天数 |
| --- | --- | --- | --- | --- | --- | --- |
| Self-harm | P(正例) | 41 | 6,927 | 171,789 | 169 | 495 |
| | N(负例) | 299 | 163,506 | 3,073,912 | 546 | 500 |

注: $\mathrm{P}=$ 正例; N= 负例。

We report the evaluation results in comparison to other studies on mental disorder detection (Table 9). Our method consistently outperforms others in terms of F1 score (ours: 0.917 vs. the best traditional machine learning with feature engineering: 0.755 (Reece et al., 2017) and the best deep learning with representation learning: 0.715 (Malviya et al., 2021)).

我们报告了与其他精神障碍检测研究相比的评估结果(表9)。我们的方法在F1分数上始终优于其他方法(我们的方法:0.917 vs. 最佳传统机器学习特征工程方法:0.755 (Reece et al., 2017), 以及最佳深度学习表征学习方法:0.715 (Malviya et al., 2021))。

We also conducted ablation studies to validate the contribution of each component in our ensemble prompt method to the prediction results. Removing either prefix tuning or the rule-based prompt significantly degrades performance in terms of F1 score ($0.917\rightarrow0.849$ and $0.917\rightarrow0.886$, respectively). Removing both components, relying solely on the LLM (Flan-T5), leads to an even greater performance drop ($0.917\rightarrow0.789$). These ablation studies demonstrate that our ensemble prompt method effectively leverages prefix tuning to capture personalized context, even when the number of labeled samples is extremely limited. Additionally, rule-based tuning ensures the stable incorporation of existing medical knowledge, thereby maximizing the LLM's capabilities and enhancing its ability to detect mental disorders using user-generated content.

我们还进行了消融实验,以验证集成提示方法中每个组件对预测结果的贡献。移除前缀调优或基于规则的提示都会显著降低F1分数性能($0.917\rightarrow0.849$ 和 $0.917\rightarrow0.886$)。若同时移除这两个组件,仅依赖大语言模型(Flan-T5),性能下降更为明显($0.917\rightarrow0.789$)。这些消融研究表明,即使在标注样本数量极其有限的情况下,我们的集成提示方法也能有效利用前缀调优捕捉个性化上下文。此外,基于规则的调优确保了现有医学知识的稳定整合,从而最大化大语言模型的能力,并提升其利用用户生成内容检测精神障碍的水平。
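To make the ablated prefix-tuning component concrete, the sketch below illustrates its core idea: trainable prefix vectors prepended to frozen token embeddings before they enter the LLM. The toy dimensions, vocabulary, and embedding table are our assumptions; Flan-T5's actual internals are not reproduced here.

```python
# Conceptual sketch of prefix tuning: a small matrix of trainable prefix
# vectors is prepended to the (frozen) token embeddings. During training,
# gradients would flow only into `prefix`, leaving the LLM untouched.

EMBED_DIM = 4
PREFIX_LEN = 3

# Trainable, per-task (or per-user) prefix: PREFIX_LEN x EMBED_DIM parameters.
prefix = [[0.01 * (i + j) for j in range(EMBED_DIM)] for i in range(PREFIX_LEN)]

# Frozen toy embedding table standing in for the LLM's input embeddings.
vocab = {"i": 0, "feel": 1, "hopeless": 2}
table = [[float(i == j) for j in range(EMBED_DIM)] for i in range(len(vocab))]

def with_prefix(tokens):
    """Return the embedding sequence seen by the LLM: prefix + token embeddings."""
    token_embs = [table[vocab[t]] for t in tokens]
    return prefix + token_embs

seq = with_prefix(["i", "feel", "hopeless"])
assert len(seq) == PREFIX_LEN + 3
assert len(seq[0]) == EMBED_DIM
```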

Table 9. Comparison with Other Methods on Rare Disease Dataset

| Model | AUC | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| **Other Mental Disorder Detection Studies** | | | | |
| *Traditional Machine Learning with Feature Engineering* | | | | |
| Choudhury et al. (2013) | 0.739 ± 0.003 | 0.710 ± 0.002 | 0.690 ± 0.005 | 0.739 ± 0.003 |
| Coppersmith et al. (2014) | 0.711 ± 0.005 | 0.727 ± 0.001 | 0.748 ± 0.002 | 0.711 ± 0.001 |
| Preotiuc-Pietro et al. (2015) | 0.789 ± 0.007 | 0.671 ± 0.002 | 0.647 ± 0.001 | 0.789 ± 0.003 |
| Benton et al. (2017) | 0.639 ± 0.024 | 0.662 ± 0.018 | 0.725 ± 0.003 | 0.639 ± 0.024 |
| Reece et al. (2017) | 0.783 ± 0.071 | 0.755 ± 0.027 | 0.746 ± 0.002 | 0.783 ± 0.071 |
| Chau et al. (2020) | 0.847 ± 0.001 | 0.748 ± 0.001 | 0.708 ± 0.008 | 0.847 ± 0.001 |
| *Deep Learning with Representation Learning* | | | | |
| CNN-based (Lin et al. 2020) | 0.699 ± 0.043 | 0.653 ± 0.042 | 0.778 ± 0.075 | 0.563 ± 0.023 |
| LSTM-based (Khan et al. 2021) | 0.534 ± 0.012 | 0.582 ± 0.054 | 0.529 ± 0.006 | 0.668 ± 0.146 |
| Transformer-based (Malviya et al. 2021) | 0.761 ± 0.032 | 0.715 ± 0.041 | 0.883 ± 0.035 | 0.601 ± 0.042 |
| **LLM with Prompt Engineering (Ablation Studies)** | | | | |
| Flan-T5 | 0.806 ± 0.008 | 0.789 ± 0.003 | 0.723 ± 0.005 | 0.868 ± 0.002 |
| Flan-T5 + Prefix Tuning | 0.892 ± 0.000 | 0.886 ± 0.002 | 0.840 ± 0.009 | 0.936 ± 0.007 |
| Flan-T5 + Rule-based prompt | 0.833 ± 0.003 | 0.849 ± 0.003 | 0.939 ± 0.004 | 0.775 ± 0.000 |
| Flan-T5 + Prefix Tuning + Rule-based prompt (Ours) | 0.924 ± 0.000 | 0.917 ± 0.008 | 0.848 ± 0.001 | 1.000 ± 0.000 |

表 9: 罕见病数据集上与其他方法的比较

| 模型 | AUC | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| **其他精神障碍检测研究** | | | | |
| *传统机器学习与特征工程* | | | | |
| Choudhury et al. (2013) | 0.739 ± 0.003 | 0.710 ± 0.002 | 0.690 ± 0.005 | 0.739 ± 0.003 |
| Coppersmith et al. (2014) | 0.711 ± 0.005 | 0.727 ± 0.001 | 0.748 ± 0.002 | 0.711 ± 0.001 |
| Preotiuc-Pietro et al. (2015) | 0.789 ± 0.007 | 0.671 ± 0.002 | 0.647 ± 0.001 | 0.789 ± 0.003 |
| Benton et al. (2017) | 0.639 ± 0.024 | 0.662 ± 0.018 | 0.725 ± 0.003 | 0.639 ± 0.024 |
| Reece et al. (2017) | 0.783 ± 0.071 | 0.755 ± 0.027 | 0.746 ± 0.002 | 0.783 ± 0.071 |
| Chau et al. (2020) | 0.847 ± 0.001 | 0.748 ± 0.001 | 0.708 ± 0.008 | 0.847 ± 0.001 |
| *深度学习与表征学习* | | | | |
| CNN-based (Lin et al. 2020) | 0.699 ± 0.043 | 0.653 ± 0.042 | 0.778 ± 0.075 | 0.563 ± 0.023 |
| LSTM-based (Khan et al. 2021) | 0.534 ± 0.012 | 0.582 ± 0.054 | 0.529 ± 0.006 | 0.668 ± 0.146 |
| Transformer-based (Malviya et al. 2021) | 0.761 ± 0.032 | 0.715 ± 0.041 | 0.883 ± 0.035 | 0.601 ± 0.042 |
| **大语言模型与提示工程(消融研究)** | | | | |
| Flan-T5 | 0.806 ± 0.008 | 0.789 ± 0.003 | 0.723 ± 0.005 | 0.868 ± 0.002 |
| Flan-T5 + Prefix Tuning | 0.892 ± 0.000 | 0.886 ± 0.002 | 0.840 ± 0.009 | 0.936 ± 0.007 |
| Flan-T5 + Rule-based prompt | 0.833 ± 0.003 | 0.849 ± 0.003 | 0.939 ± 0.004 | 0.775 ± 0.000 |
| Flan-T5 + Prefix Tuning + Rule-based prompt (Ours) | 0.924 ± 0.000 | 0.917 ± 0.008 | 0.848 ± 0.001 | 1.000 ± 0.000 |
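As a quick consistency check on Table 9's ablation rows, the reported precision/recall/F1 triples (means only) satisfy F1 = 2PR/(P+R):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Mean precision, recall, and reported F1 for the ablation rows of Table 9.
rows = {
    "Flan-T5": (0.723, 0.868, 0.789),
    "Flan-T5 + Prefix Tuning": (0.840, 0.936, 0.886),
    "Flan-T5 + Rule-based prompt": (0.939, 0.775, 0.849),
    "Flan-T5 + Prefix Tuning + Rule-based prompt (Ours)": (0.848, 1.000, 0.917),
}
for name, (p, r, reported_f1) in rows.items():
    # Agreement within rounding error of the three-decimal reporting.
    assert abs(f1(p, r) - reported_f1) < 0.002, name
```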

4.5. Explaining Prediction Results

4.5. 预测结果解释

To explain our prediction results and demonstrate how our prompt engineering method enhances model performance, we conduct two post hoc analyses.

为了解释我们的预测结果并展示提示工程方法如何提升模型性能,我们进行了两项事后分析。

In the first analysis, we randomly select the predicted probabilities for certain segments of user-generated content (denoted as $x$ in our method). These probabilities represent our method's assessment of the relevance of $x$ to a particular mental disorder. We present the content $x$ and the corresponding probabilities as bar charts in Figure 7. For example, in the case of depression, the model assigns a relevance score of 0.9916 to the statement "I don't have many friends." Our model not only determines the relevance between $x$ and a mental disorder but also identifies whether $x$ represents symptoms, life events that may exacerbate the disorder, or a social media user self-reporting ongoing treatment. To validate whether our method's judgment of these statements aligns with medical professionals' evaluations, we invited two psychiatrists (both having an MD and more than 4 years of clinical experience) to review the randomly selected set of $x$. Each psychiatrist independently rated these $x$ based on their medical expertise. The ratings were on a scale from 1 to 5: a score of 1 indicated that the content was unlikely to have been made by a person with a particular mental disorder, while a score of 5 indicated that the content was highly likely to have been made by such a person. We used the Pearson correlation coefficient to measure the inter-rater reliability between the two psychiatrists: their assessments showed a high level of agreement ($r=0.9636$) (Table 10). Next, we evaluated the alignment between the two psychiatrists' ratings and the predictions generated by our model. The Pearson correlation coefficients indicate that the correlation between our method's results and Expert 1's ratings is 0.8467, while the correlation with Expert 2's ratings is 0.8688. Additionally, the correlation between our method's results and the psychiatrists' average ratings is 0.8656. These findings demonstrate that our method closely aligns with expert assessments in detecting the relevance of user-generated content to a particular mental disorder, thereby confirming the reliability of our approach.

在首次分析中,我们随机选取了用户生成内容特定片段(即方法中表示为$x$)的预测概率。这些概率反映了我们的方法对$x$与特定精神障碍相关性的评估。如图7所示,我们将内容$x$及对应概率以条形图形式呈现。例如针对抑郁症案例,模型对陈述"我朋友不多"给出了0.9916的相关性评分。我们的模型不仅能判定$x$与精神障碍的关联程度,还能识别$x$是否表征症状、可能加剧病情的生活事件,或是社交媒体用户的治疗自述。为验证方法对这些陈述的判断是否符合医学专家评估,我们邀请两位精神科医师(均拥有医学博士学位及4年以上临床经验)对随机选取的$x$集合进行评审。每位医师基于专业认知独立为$x$评分,量表范围为1-5分:1分表示内容极不可能来自特定精神障碍患者,5分表示极可能来自患者。我们采用皮尔逊相关系数衡量医师间评分者信度:其评估结果呈现高度一致性($r=0.9636$)(表10)。随后我们评估了医师评分与模型预测的吻合程度:皮尔逊系数显示,本方法结果与专家1评分的相关性为0.8467,与专家2评分的相关性为0.8688,与医师平均评分的相关性达0.8656。这些发现证明,在检测用户生成内容与特定精神障碍相关性方面,本方法与专家评估高度吻合,从而验证了方法的可靠性。
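The reliability and alignment statistics above rest on the plain Pearson correlation coefficient, which can be computed as below; the toy ratings are illustrative only, since the study's rating data is not public:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences,
    as used for inter-rater reliability and model-expert alignment."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 1-5 ratings for two raters (invented for illustration).
expert1 = [5, 4, 2, 1, 5, 3]
expert2 = [5, 5, 2, 1, 4, 3]
assert abs(pearson_r(expert1, expert1) - 1.0) < 1e-9
assert pearson_r(expert1, expert2) > 0.9  # high agreement, as in the study
```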

In practical applications, the identified set of x with high probabilities can be provided to stakeholders (e.g., medical professionals or platforms), along with an explanation of why a social media user may be at risk for a particular mental disorder, thereby enabling appropriate interventions. Over a longer time period and at the population level, a collection of x with high probabilities can be aggregated, enabling researchers to examine mental disorder status across different platforms or regions with varying user profiles. This analysis can reveal the most common mental health issues (e.g., symptoms) and track changes in the prevalence of mental disorders within a population group (e.g., changes in symptoms before and after major events such as the COVID-19 pandemic, presidential elections, or wars). Such capabilities cannot be achieved with simple discrete prompts using LLMs, as these models cannot provide probabilities or explain why they identify a user as having a mental disorder.

在实际应用中,可将识别出的高概率特征x集合提供给利益相关方(如医疗专业人员或平台),并解释社交媒体用户可能面临特定精神障碍风险的原因,从而促成适当干预。在更长时间跨度和群体层面,可聚合高概率特征x集合,使研究人员能够考察不同平台或用户画像差异地区的精神障碍状况。该分析能揭示最常见心理健康问题(如症状表现),并追踪特定人群(如COVID-19疫情、总统选举或战争等重大事件前后)精神障碍患病率的变化。此类能力无法通过简单的大语言模型离散提示实现,因为这些模型既无法提供概率值,也不能解释其判定用户患有精神障碍的依据。
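A minimal sketch of this population-level aggregation follows, assuming a hypothetical 0.9 threshold and symptom/life-event/treatment tags; the 0.9916 score for "I don't have many friends" is from Figure 7, while the other entries are invented for illustration:

```python
from collections import Counter

# Segments of user-generated content scored by the model, tagged by the
# category the model identified (symptom, life event, or ongoing treatment).
THRESHOLD = 0.9  # hypothetical cutoff for flagging a segment

scored = [
    ("I don't have many friends", "symptom", 0.9916),   # example from Figure 7
    ("just lost my job", "life_event", 0.95),            # invented
    ("started medication last week", "treatment", 0.93), # invented
    ("great weather today", "other", 0.02),              # invented
]

# Flag high-probability segments and aggregate by category at population level.
flagged = [(text, tag) for text, tag, p in scored if p >= THRESHOLD]
by_tag = Counter(tag for _, tag in flagged)
assert by_tag == {"symptom": 1, "life_event": 1, "treatment": 1}
```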

Figure 7. Model's Predicted Probability of a Post Being Related to a Mental Disorder

Table 10. Alignment Between Model Prediction and Expert Ratings

| | Pearson correlation coefficient |
| --- | --- |
| Inter-rater reliability a | 0.9636 |

| | Prediction vs. Experts' average | Prediction vs. Expert 1 c | Prediction vs. Expert 2 d |
| --- | --- | --- | --- |
| Alignment of prediction and expert ratings b | 0.8656 | 0.8467 | 0.8688 |

Note: a Pearson's correlation coefficient is used to measure the pairwise correlation between the two raters. The rating scale ranges from 1 to 5, with 5 indicating the highest relevance to a disorder and 1 indicating the lowest; the scale is ordinal and assumed to be continuous. b Pearson's correlation coefficient is used to measure the alignment between the model's predicted probability of a post being related to the disorder and the expert ratings. c The rater is a psychiatrist with an MD and four years of clinical experience. d The rater is a psychiatrist with an MD, a PhD, and eight years of clinical experience.

图 7: 模型预测帖子与心理障碍相关的概率

表 10: 模型预测与专家评分的对齐性

| | Pearson相关系数 |
| --- | --- |
| 评分者间信度 a | 0.9636 |

| | 预测 vs 专家平均分 | 预测 vs 专家1 c | 预测 vs 专家2 d |
| --- | --- | --- | --- |
| 预测与专家评分的对齐性 b | 0.8656 | 0.8467 | 0.8688 |

注: a Pearson相关系数用于衡量两位评分者之间的两两相关性。评分范围为1至5分,5分表示与障碍相关性最高,1分表示最低;评分量表为有序量表并假设为连续变量。b Pearson相关系数用于衡量模型预测帖子与障碍相关的概率与专家评分之间的对齐性。c 评分者为拥有医学博士学位和四年临床经验的精神科医生。d 评分者为拥有医学博士学位、哲学博士学位和八年临床经验的精神科医生。

In the second analysis, we visualize and compare the vector representations of both the original user-generated content and the content transformed by our prompt method. The visualization results are presented in Figure 8. Through visualization, we can easily observe that the vector representations of user-generated content for both positive and negative examples (with or without a mental disorder) are intermingled. This aligns with our intuition, as user-generated content online typically covers a wide range of topics, with only a small portion related to mental disorders. Consequently, mental disorder detection faces a high level of noise. By contrast, when we visualize the vector representations of user-generated content processed through our prompt method, we observe a significantly increased separation between positive and negative examples. For the task of mental disorder detection, our prompt engineering method effectively reduces the complexity of classification/detection. This analysis explains why our prompt engineering method optimizes LLM performance and improves prediction accuracy.

在第二次分析中,我们对原始用户生成内容和经过提示方法转换后的内容进行向量表示的可视化比较。可视化结果如图8所示。通过可视化可以直观观察到,无论正例(存在心理障碍)还是负例(不存在心理障碍)的用户生成内容,其向量表示都相互混杂。这与我们的直觉一致,因为网络上的用户生成内容通常涵盖广泛话题,仅有一小部分与心理障碍相关,因此心理障碍检测面临着较高的噪声干扰。相比之下,当我们可视化经过提示方法处理的用户生成内容向量表示时,观察到正负样本之间的分离度显著增加。对于心理障碍检测任务而言,我们的提示工程方法有效降低了分类/检测的复杂度。这一分析解释了为何我们的提示工程方法能够优化大语言模型性能并提升预测准确率。
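The increased separation described above can be quantified with a simple between-class vs. within-class distance ratio; the 2-D vectors below are toy stand-ins for the real embeddings visualized in Figure 8:

```python
import math

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def class_separation(pos, neg):
    """Ratio of between-class centroid distance to mean within-class spread.
    A simple numeric stand-in for the qualitative separation in Figure 8."""
    cp, cn = centroid(pos), centroid(neg)
    spread = (sum(dist(v, cp) for v in pos) / len(pos)
              + sum(dist(v, cn) for v in neg) / len(neg)) / 2
    return dist(cp, cn) / spread

# Toy 2-D "representations": raw posts are intermingled, prompt-transformed
# ones are pulled apart (illustrative numbers, not real embeddings).
raw_pos, raw_neg = [[0.1, 0.2], [0.3, 0.1]], [[0.2, 0.15], [0.15, 0.25]]
tx_pos, tx_neg = [[1.0, 1.0], [1.1, 0.9]], [[-1.0, -1.0], [-0.9, -1.1]]
assert class_separation(tx_pos, tx_neg) > class_separation(raw_pos, raw_neg)
```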

Figure 8. Vector Representations of Original Social Media Posts vs. Posts Transformed by Our Prompt Method (Depression, Anorexia, Pathological Gambling, and Self-harm Datasets)
图 8: 原始社交媒体帖子与经提示方法转换后帖子的向量表示(抑郁症、厌食症、病态赌博、自残数据集)

5. DISCUSSION

5. 讨论

In this work, we adapt advanced AI techniques, including large language models and prompt engineering, to explore how AI can contribute to mental disorder detection on user-generated textual content. The significant advantages of our research lie in its pioneering approach to overcoming long-standing barriers in this field: eliminating the need to collect large amounts of labeled training data for each specific disease or to design specialized supervised learning model architectures for each research problem. At the same time, we place a strong emphasis on addressing the technical challenges of using user-generated text data in the healthcare domain. This includes developing personalized methods to represent each user's uniqueness, as well as seamlessly integrating and leveraging disease-related medical knowledge to provide context for the task, instruct learning objectives, and operationalize prediction goals for various types of mental disorders.

在本研究中,我们采用包括大语言模型和提示工程在内的高级AI技术,探索AI如何助力于用户生成文本内容中的心理障碍检测。我们研究的显著优势在于其开创性地解决了该领域的长期障碍:无需为每种特定疾病收集大量标注训练数据,也无需为每个研究问题设计专门的监督学习模型架构。同时,我们高度重视解决医疗领域使用用户生成文本数据的技术挑战。这包括开发个性化方法来表征每个用户的独特性,以及无缝整合并利用疾病相关的医学知识,为任务提供背景、指导学习目标,并为各类心理障碍实现可操作的预测目标。

We evaluate the effectiveness of our research design by employing multiple mental disorders as research cases. The experimental results reveal several key findings. First, the performance of our approach, which combines prompt engineering with LLMs, significantly outperforms other supervised learning paradigms, including feature engineering and architecture engineering. While LLMs inherently are more powerful than many supervised learning models, even using the same LLM as the base model, our designed prompt engineering approach still outperforms alternative prompting strategies, which is the main focus of our work. Second, our method successfully accomplishes few-shot learning for various mental disorders. In other words, we only need to provide the AI model with a small number of examples, and it can effectively identify potential users at risk of mental disorders within user-generated textual content. Third, by breaking down the original complex prediction task into subtasks and intentionally incorporating medical knowledge into prompt engineering, our approach not only significantly reduces the prediction

我们通过采用多种精神障碍作为研究案例来评估研究设计的有效性。实验结果揭示了几个关键发现。首先,我们结合提示工程与大语言模型的方法,其性能显著优于其他监督学习范式,包括特征工程和架构工程。虽然大语言模型本质上比许多监督学习模型更强大,但即使使用相同的大语言模型作为基础模型,我们设计的提示工程方法仍然优于其他提示策略,这是我们工作的主要重点。其次,我们的方法成功实现了针对各种精神障碍的少样本学习。换句话说,我们只需向AI模型提供少量示例,它就能有效识别用户生成文本内容中潜在的精神障碍风险用户。第三,通过将原始复杂预测任务分解为子任务,并有意将医学知识融入提示工程,我们的方法不仅显著降低了预测...

difficulty and improves the performance of LLMs but also enhances the interpretability of our prediction results, making it easier for stakeholders to understand and utilize them.

难度并提升大语言模型的性能,还能增强预测结果的可解释性,使利益相关者更容易理解和利用预测结果。

From a design science perspective, our contributions are threefold. First, we introduce a novel framework rooted in LLMs and prompt engineering, which enables the few-shot detection of multiple mental disorders through user-generated text content. Notably, it bestows a significant advantage by eliminating the need for an extensive volume of labeled training data or the intricate engineering of customized architectures for each distinct disease or research problem. Second, within our framework, we employ a multi-prompt engineering approach, effectively synergizing various prompt engineering techniques, specifically prefix tuning and rule-based prompt engineering. This strategy is tailored to tackle the distinctive technical challenges within the healthcare domain, entailing the use of personalized prompts and the integration of existing medical domain knowledge, which significantly elevates the accuracy and efficacy of our methods. Third, as part of our framework, we propose a new rule-based prompt engineering method, which efficiently breaks down complex textual content-based detection problems and seamlessly integrates domain knowledge existing in ontology format (one of the widely adopted formats for domain knowledge). The versatility of this new rule-based prompt engineering method extends to other research problems that require the decomposition of challenging tasks and can maximize the utilization of LLMs' potential to address real-world issues.

从设计科学的角度来看,我们的贡献有三方面。首先,我们提出了一种基于大语言模型(LLM)和提示工程的新框架,通过用户生成的文本内容实现少样本检测多种心理障碍。值得注意的是,该框架无需大量标注训练数据或为每种特定疾病或研究问题定制复杂架构,具有显著优势。其次,我们在框架中采用多提示工程方法,有效协同多种提示工程技术——特别是前缀调优和基于规则的提示工程。该策略专门针对医疗领域的独特技术挑战,通过使用个性化提示并整合现有医学领域知识,显著提升了方法的准确性和有效性。第三,作为框架的一部分,我们提出了一种新的基于规则的提示工程方法,能高效分解基于文本内容的复杂检测问题,并无缝集成以本体论格式(领域知识广泛采用的格式之一)存在的领域知识。这种新型基于规则的提示工程方法具有普适性,可扩展到其他需要分解复杂任务的研究问题,并能最大限度发挥大语言模型解决实际问题的潜力。

In addition to technical contributions, our study has implications for IS such as computational design science, healthcare IS, and business analytics and intelligence. Our research also has significant practical implications. We have demonstrated the application of prompt engineering and the utilization of LLMs in the realm of detecting mental disorders through user-generated text content. User-generated content platforms and medical professionals can utilize similar methods to incorporate medical knowledge, enabling the large-scale discovery and monitoring of other chronic diseases in a low-cost way, which can facilitate early identification and intervention of

除了技术贡献外,我们的研究对信息系统学科(如计算设计科学、医疗信息系统、商业分析与智能)具有重要启示。本研究还具有显著的实践意义:我们展示了提示工程(prompt engineering)和大语言模型在通过用户生成文本内容检测精神障碍领域的应用。用户生成内容平台和医疗专业人员可采用类似方法整合医学知识,以低成本方式实现其他慢性疾病的大规模发现与监测,从而促进早期识别与干预。

chronic diseases. With minor adjustments to the applied medical knowledge, our approach can also be expanded to the management of more chronic diseases. Taking diabetes as an example, we can adapt the medical domain knowledge to observe, track, and intervene in users' lifestyles, diets, physical activity, and sleep patterns; we can not only identify chronic diseases but also improve their management and potentially prevent them. Ultimately, this can enhance the well-being of individuals with chronic conditions, control healthcare costs, and ensure a more efficient and accessible healthcare system.

慢性疾病。通过对应用的医学知识进行微调,我们的方法还可以扩展到更多慢性疾病的管理。以糖尿病为例,我们可以调整医学领域知识来观察、跟踪和干预用户的生活方式、饮食、体力活动和睡眠模式;不仅能识别慢性疾病,还能改善其管理并可能预防这些疾病。最终,这将提升慢性病患者的健康水平,控制医疗成本,并确保医疗系统更高效、更可及。

Our work comes with limitations. First, there are ethical and privacy considerations. The implementation of AI-based technology in healthcare necessitates a careful consideration of ethical and privacy concerns. It is imperative to ensure the responsible and ethical use of AI in the healthcare domain, safeguarding the privacy and well-being of patients and individuals. Our approach takes into account personalized prompts, which can to some extent address the issue of user privacy. This direction is still worthy of further research. Second, hallucination issues can arise with LLMs (e.g., fake, random, or irresponsible responses). In healthcare applications, where accuracy in detection is of utmost importance, the need for careful scrutiny of detection results cannot be overstated. By integrating medical knowledge into our work, we have substantially reduced the occurrence of such issues. Exploring how the introduction of domain-specific knowledge can effectively curb irresponsible responses from LLMs is a promising avenue for future research. Another future research direction involves the exploration of multimodal large pre-trained models capable of processing text, images, or videos. Our framework can be extended to user-generated multimodal data, encompassing images and videos that capture users' body shapes, food consumption, exercises, living environments, and more. Consequently, this extension enables the identification of chronic diseases associated with more lifestyle factors, which may be implicitly manifested in images and videos. Such insights can be harnessed for the detection and management of chronic diseases.

我们的工作存在一些局限性。首先,伦理和隐私考量。在医疗健康领域实施基于AI (Artificial Intelligence) 的技术需要审慎考虑伦理和隐私问题,必须确保AI在医疗领域的负责任和符合伦理的使用,保护患者及个人的隐私与福祉。我们的方法考虑了个性化提示,这在一定程度上可以解决用户隐私问题。这一方向仍值得进一步研究。其次,大语言模型可能出现幻觉问题(例如产生虚假、随机或不负责任的回答)。在检测准确性至关重要的医疗应用中,对检测结果进行仔细核查的重要性怎么强调都不为过。通过将医学知识融入我们的工作,这类问题的发生率已大幅降低。探索如何通过引入领域专业知识有效遏制大语言模型的不负责任回答,是未来研究的一个有前景的方向。另一个未来研究方向涉及探索能够处理文本、图像或视频的多模态大型预训练模型。我们的框架可扩展到用户生成的多模态数据,包括记录用户体型、饮食摄入、运动、生活环境等信息的图像和视频。因此,这一扩展能够识别与更多生活方式因素相关的慢性疾病,这些因素可能隐含地体现在图像和视频中。此类洞察可用于慢性疾病的检测和管理。

6. CONCLUSION

6. 结论

We explore whether state-of-the-art AI technologies can make a difference in chronic disease management within the context of user-generated content. Our study demonstrates the immense potential and promise of AI in the healthcare domain: it not only enhances chronic disease detection accuracy but also reduces the need for a large number of labels in machine learning and the necessity for architecture engineering for each specific research question.

我们探讨了在用户生成内容的背景下,最先进的AI技术能否改变慢性病管理的现状。研究表明AI在医疗健康领域展现出巨大潜力:不仅能提升慢性病检测准确率,还可减少机器学习所需的大量标注数据,并降低针对每个研究问题设计专用架构的需求。

AI's capacity to perform tasks effectively holds great potential for enhancing healthcare management. AI offers the advantages of (1) tailoring solutions to specific healthcare needs, enabling customization, and (2) automating a multitude of tasks, alleviating the workload on healthcare professionals and enhancing overall efficiency. The integration of AI into healthcare management has the potential to significantly enhance the precision and efficiency of administrative processes. By optimizing chronic disease management and streamlining healthcare procedures, AI can facilitate interventions and more effective treatments for patients. Through increased efficiency and early disease recognition, AI can aid healthcare institutions in cost savings, an especially noteworthy benefit given the resource-intensive nature of the healthcare sector, with any funds saved being available for reinvestment in the enhancement of patient care.

AI高效执行任务的能力为提升医疗管理带来巨大潜力。AI具备两大优势:(1) 可根据具体医疗需求定制解决方案,实现个性化服务;(2) 能自动化处理大量任务,减轻医护人员负担并提升整体效率。AI与医疗管理的融合将显著提升行政流程的精确性和效率。通过优化慢性病管理和简化医疗程序,AI能促进干预措施并为患者提供更有效的治疗方案。凭借效率提升和早期疾病识别能力,AI可帮助医疗机构节省成本——鉴于医疗行业资源密集的特性,这项优势尤为显著,节省的资金可再投资于改善患者护理。

REFERENCE

参考文献

APPENDIX 1

附录1

Table A1. The Coverage Rate of Ontologies to Mental Disorder Diagnosis Scales

| Mental disorder | DSM-5-TR | PHQ-9 | QIDS-SR |
| --- | --- | --- | --- |
| Depression | 85.6% | 95.2% | 93.8% |
| Anorexia | 84.8% | | |
| Pathological gambling | 96.3% | | |
| Self-harm | 93.5% | | |

表 A1: 本体对精神障碍诊断量表的覆盖率

| 精神障碍 | DSM-5-TR | PHQ-9 | QIDS-SR |
| --- | --- | --- | --- |
| 抑郁症 | 85.6% | 95.2% | 93.8% |
| 厌食症 | 84.8% | | |
| 病理性赌博 | 96.3% | | |
| 自残 | 93.5% | | |

Medical terms such as "angry" and "little energy" are not included in the ontologies. These terms are commonly used outside mental disorder populations and are therefore unsuitable for determining whether a person has a mental disorder outside the context of the mental disorder diagnosis scales. Overall, the coverage calculation results demonstrate that our ontology comprehensively covers the widely used depression scales, indicating that it is suitable as the knowledge base for the proposed method.

"angry"和"little energy"等医学术语未被纳入本体。这些术语常见于非精神障碍人群,因此不适合在精神障碍诊断量表之外判断个体是否患有精神障碍。总体而言,覆盖率计算结果表明我们的本体能全面覆盖广泛使用的抑郁量表,证明其适合作为所提方法的知识库。
