Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models
揭示并缓解大语言模型在心理健康分析中的偏见
Yuqing Wang', Yun Zhao?, Sara Alessandra Kellerl3, Anne de Hond*, Marieke M. van Buchem?, Malvika Pillai', Tina Hernandez-Boussard'
Yuqing Wang', Yun Zhao?, Sara Alessandra Kellerl3, Anne de Hond*, Marieke M. van Buchem?, Malvika Pillai', Tina Hernandez-Boussard'
'Stanford University, 2Meta Platforms, Inc., 3ETH Zurich, 4 University Medical Center Utrecht, Leiden University Medical Center
斯坦福大学、2Meta Platforms公司、3苏黎世联邦理工学院、4乌得勒支大学医学中心、莱顿大学医学中心
Correspondence: Yuqing Wang (ywang216@ stanford.edu)
通讯作者:Yuqing Wang (ywang216@stanford.edu)
Abstract
摘要
The advancement of large language models (LLMs) has demonstrated strong capabilities across various applications, including mental health analysis. However, existing studies have focused on predictive performance, leaving the critical issue of fairness under explored, posing significant risks to vulnerable populations. Despite acknowledging potential biases, previous works have lacked thorough investigations into these biases and their impacts. To address this gap, we systematically evaluate biases across seven social factors (e.g., gender, age, religion) using ten LLMs with different prompting methods on eight diverse mental health datasets. Our results show that GPT-4 achieves the best overall balance in performance and fairness among LLMs, although it still lags behind domainspecific models like Mental RoBERTa in some cases. Additionally, our tailored fairness-aware prompts can effectively mitigate bias in mental health predictions, highlighting the great potential for fair analysis in this field.
大语言模型 (LLM) 的发展已在包括心理健康分析在内的多种应用中展现出强大能力。然而,现有研究主要关注预测性能,对公平性这一关键问题缺乏深入探索,这给弱势群体带来了重大风险。尽管先前研究承认潜在偏见,但并未对这些偏见及其影响进行彻底调查。为填补这一空白,我们系统评估了十种大语言模型在八个不同心理健康数据集上使用多种提示方法时,跨越七种社会因素(如性别、年龄、宗教)的偏见。结果显示,GPT-4 在大语言模型中实现了性能与公平性的最佳整体平衡,尽管在某些情况下仍落后于 Mental RoBERTa 等特定领域模型。此外,我们定制的公平性提示能有效缓解心理健康预测中的偏见,彰显了该领域公平性分析的巨大潜力。
1 Introduction
1 引言
WARNING: This paper includes content and examples that may be depressive in nature.
警告:本文包含可能引发抑郁情绪的内容和示例。
Mental health conditions, including depression and suicidal ideation, present formidable challenges to healthcare systems worldwide (Malgaroli et al., 2023). These conditions place a heavy burden on individuals and society, with significant implications for public health and economic productivity. It is reported that over $20%$ of adults in the U.S. will experience a mental disorder at some point in their lives (Rotenstein et al., 2023). Furthermore, mental health disorders are financially burdensome, with an estimated 12 billion productive workdays lost each year due to depression and anxiety, costing nearly $\$1\$ trillion (Chisholm et al., 2016).
心理健康问题,包括抑郁症和自杀倾向,给全球医疗系统带来了巨大挑战 (Malgaroli et al., 2023)。这些问题给个人和社会造成了沉重负担,对公共卫生和经济生产力产生重大影响。据报道,美国超过20%的成年人会在人生某个阶段经历精神障碍 (Rotenstein et al., 2023)。此外,心理健康疾病还造成沉重的经济负担,每年因抑郁和焦虑导致约120亿个有效工作日损失,经济损失近1万亿美元 (Chisholm et al., 2016)。
Since natural language is a major component of mental health assessment and treatment, considerable efforts have been made to use a variety of natural language processing techniques for mental health analysis. Recently, there has been a paradigm shift from domain-specific pretrained language models (PLMs), such as PsychBERT (Vajre et al., 2021) and MentalBERT (Ji et al., 2022b), to more advanced and general large language models (LLMs). Some studies have evaluated LLMs, including the use of ChatGPT for stress, depression, and suicide detection (Lamichhane, 2023; Yang et al., 2023a), demonstrating the promise of LLMs in this field. Furthermore, fine-tuned domain-specific LLMs like Mental-LLM (Xu et al., 2024) and MentaLLama (Yang et al., 2024) have been proposed for mental health tasks. Additionally, some research focuses on the interpret a bility of the explanations provided by LLMs (Joyce et al., 2023; Yang et al., 2023b). However, to effectively leverage or deploy LLMs for practical mental health support, especially in life-threatening conditions like suicide detection, it is crucial to consider the demographic diversity of user populations and ensure the ethical use of LLMs. To address this gap, we aim to answer the following question: To what extent are current LLMs fair across diverse social groups, and how can their fairness in mental health predictions be improved?
由于自然语言是心理健康评估与治疗的重要组成部分,学界已投入大量精力运用各类自然语言处理技术进行心理健康分析。近年来,该领域经历了从领域专用预训练语言模型(PLM)(如PsychBERT [Vajre et al., 2021]和MentalBERT [Ji et al., 2022b])向更先进、通用的大语言模型(LLM)的范式转变。部分研究已对LLM展开评估,包括使用ChatGPT进行压力、抑郁及自杀倾向检测(Lamichhane, 2023; Yang et al., 2023a),证实了LLM在该领域的潜力。此外,针对心理健康任务优化的领域专用LLM(如Mental-LLM [Xu et al., 2024]和MentaLLama [Yang et al., 2024])相继问世。另有研究聚焦于LLM生成解释的可解释性(Joyce et al., 2023; Yang et al., 2023b)。然而,要有效利用或部署LLM进行实际心理健康支持(尤其是自杀检测等危及生命的场景),必须充分考虑用户群体的人口多样性,并确保LLM的伦理使用。为填补这一空白,我们旨在回答:当前LLM在不同社会群体中的公平性程度如何?如何提升其在心理健康预测中的公平性?
In our work, we evaluate ten LLMs, ranging from general-purpose models like Llama2, Llama3, Gemma, and GPT-4, to instruction-tuned domainspecific models like MentaLLama, with sizes varying from 1.1B to 175B parameters. Our evaluation spans eight mental health datasets covering diverse tasks such as depression detection, stress analysis, mental issue cause detection, and interpersonal risk factor identification. Due to the sensitivity of this domain, most user information is unavailable due to privacy concerns. Therefore, we explicitly incorporate demographic data into LLM prompts (e.g., The text is from fcontext), considering seven social factors: gender, race, age, religion, sexuality, nationality, and their combinations, resulting in 60 distinct variations for each data sample. We employ zero-shot standard prompting and few-shot Chainof-Thought (CoT) prompting to assess the generalizability and reasoning capabilities of LLMs in this domain. Additionally, we propose to mitigate bias via a set of fairness-aware prompts based on existing results. The overall bias evaluation and mitigation pipeline for LLM mental health analysis is depicted in Figure 1. Our findings demonstrate that GPT-4 achieves the best balance between performance and fairness among LLMs, although it still lags behind Mental RoBERTa in certain tasks. Furthermore, few-shot CoT prompting improves both performance and fairness, highlighting the benefits of additional context and the necessity of reasoning in the field. Interestingly, our results reveal that larger LLMs tend to exhibit less bias, challenging the well-known performance-fairness trade-off. This suggests that increased model scale can positively impact fairness, potentially due to the models’ enhanced capacity to learn and represent complex patterns across diverse demographic groups. Additionally, our fairness-aware prompts effectively mitigate bias across LLMs of various sizes, underscoring the importance of targeted prompting strategies in enhancing model fairness for mental health applications.
在我们的工作中,我们评估了十种大语言模型,涵盖从通用模型如Llama2、Llama3、Gemma和GPT-4,到领域特定指令调优模型如MentaLLama,参数规模从11亿到1750亿不等。我们的评估涉及八个心理健康数据集,覆盖抑郁检测、压力分析、心理问题成因识别和人际风险因素判定等多样化任务。由于该领域的敏感性,大多数用户信息因隐私问题无法获取。因此,我们明确将人口统计数据纳入大语言模型提示(例如"The text is from [context]"),考虑七种社会因素:性别、种族、年龄、宗教、性取向、国籍及其组合,为每个数据样本生成60种不同变体。我们采用零样本标准提示和少样本思维链(CoT)提示来评估大语言模型在该领域的泛化能力和推理能力。此外,基于现有结果,我们提出通过一组公平感知提示来缓解偏见。图1展示了大语言模型心理健康分析的总体偏见评估与缓解流程。研究发现,GPT-4在性能和公平性之间取得了最佳平衡,尽管在某些任务上仍落后于Mental RoBERTa。少样本CoT提示能同时提升性能和公平性,凸显了附加上下文的价值及该领域推理的必要性。有趣的是,结果表明更大规模的大语言模型往往表现出更少偏见,这对广为人知的性能-公平性权衡提出了挑战。这表明模型规模的扩大可能通过增强跨人口群体复杂模式的学习与表征能力,对公平性产生积极影响。此外,我们的公平感知提示能有效缓解不同规模大语言模型的偏见,突显了针对性提示策略在提升心理健康应用模型公平性中的重要性。

Figure 1: The pipeline for evaluating and mitigating bias in LLMs for mental health analysis. User queries undergo demographic enrichment to identify biases. LLM responses are evaluated for performance and fairness. Bias mitigation is applied through fairness-aware prompts to achieve clinically accepted EO scores.
图 1: 心理健康分析中评估和减轻大语言模型偏见的流程。用户查询经过人口统计特征增强以识别偏见,评估大语言模型响应的性能和公平性,通过公平感知提示进行偏见缓解以达到临床认可的EO分数。
In summary, our contributions are threefold:
总之,我们的贡献有三方面:
(1) We conduct the first comprehensive and systematic evaluation of bias in LLMs for mental health analysis, utilizing ten LLMs of varying sizes across eight diverse datasets.
(1) 我们首次对大语言模型(LLM)在心理健康分析中的偏见进行了全面系统的评估,使用了10个不同规模的模型和8个多样化数据集。
(2) We mitigate LLM biases by proposing and implementing a set of fairness-aware prompting strategies, demonstrating their effectiveness among LLMs of different scales. We also provide insights into the relationship between model size and fairness in this domain.
(2) 我们通过提出并实施一系列公平感知提示策略(fairness-aware prompting strategies)来减轻大语言模型(LLM)的偏见,在不同规模的模型中验证了其有效性。同时就该领域模型规模与公平性的关系提出了见解。
(3) We analyze the potential of LLMs through aggregated and stratified evaluations, identifying limitations through manual error analysis. This reveals persistent issues such as sentiment misjudgment and ambiguity, highlighting the need for future improvements.
(3) 我们通过聚合分层评估分析大语言模型(LLM)的潜力,并通过人工错误分析识别其局限性。这揭示了情感误判和歧义性等持续存在的问题,凸显了未来改进的必要性。
2 Related Work
2 相关工作
In this section, we delve into the existing literature on mental health prediction, followed by an
在本节中,我们首先探讨心理健康预测的现有文献,随后
overview of the latest research advancements in LLMs and their applications in mental health.
大语言模型最新研究进展及其在心理健康领域的应用概述
2.1 Mental Health Prediction
2.1 心理健康预测
Extensive studies have focused on identifying and predicting risks associated with various mental health issues such as anxiety (Ahmed et al., 2022; Bhatnagar et al., 2023), depression (Squires et al., 2023; Hasib et al., 2023), and suicide ideation (Menon and Vi jayakumar, 2023; Barua et al., 2024) over the past decade. Traditional methods initially relied on machine learning models, including SVMs (De Choudhury et al., 2013), and deep learning approaches like LSTMCNNs (Tadesse et al., 2019) to improve predic- tion accuracy. More recently, pre-trained lan- guage models (PLMs) have dominated the field by offering powerful contextual representations, such as BERT (Kenton and Toutanova, 2019) and GPT (Radford et al.), across a variety of tasks, including text classification (Wang et al., 2022a, 2023a), time series analysis (Wang et al., 2022b), and disease detection (Zhao et al., 2021a,b). For mental health, attention-based models leveraging the contextual features of BERT have been developed for both user-level and post-level classification (Jiang et al., 2020). Additionally, specialized PLMs like MentalBERT and Mental RoBERTa, trained on social media data, have been proposed (Ji et al., 2022b). Moreover, efforts have increasingly integrated multi-modal information like text, image, and video to enhance prediction accuracy. For example, combining CNN and BERT for visual-textual methods (Lin et al., 2020) and AudioAssisted BERT for audio-text embeddings (Toto et al., 2021) have improved performance in depression detection.
大量研究聚焦于识别和预测各类心理健康问题的风险,如焦虑 (Ahmed et al., 2022; Bhatnagar et al., 2023)、抑郁 (Squires et al., 2023; Hasib et al., 2023) 和自杀倾向 (Menon and Vijayakumar, 2023; Barua et al., 2024)。过去十年间,传统方法最初依赖机器学习模型(如支持向量机 (De Choudhury et al., 2013))和深度学习方案(如 LSTM-CNN (Tadesse et al., 2019))来提升预测精度。近年来,预训练语言模型 (PLM) 通过提供强大的上下文表征(如 BERT (Kenton and Toutanova, 2019) 和 GPT (Radford et al.))主导了该领域,广泛应用于文本分类 (Wang et al., 2022a, 2023a)、时间序列分析 (Wang et al., 2022b) 和疾病检测 (Zhao et al., 2021a,b) 等任务。针对心理健康领域,基于注意力机制并利用 BERT 上下文特征的模型已被开发用于用户级和帖子级分类 (Jiang et al., 2020)。此外,专门在社交媒体数据上训练的 MentalBERT 和 Mental RoBERTa 等定制化 PLM 也被提出 (Ji et al., 2022b)。当前研究趋势正逐步整合文本、图像和视频等多模态信息以提升预测准确性,例如结合 CNN 与 BERT 的视觉-文本方法 (Lin et al., 2020) 以及用于音频-文本嵌入的 AudioAssisted BERT (Toto et al., 2021) 在抑郁症检测中表现出性能提升。
2.2 LLMs and Mental Health Applications
2.2 大语言模型 (LLM) 与心理健康应用
The success of Transformer-based language models has motivated researchers and practitioners to advance towards larger and more powerful LLMs, containing tens to hundreds of billions of parameters, such as GPT-4 (Achiam et al., 2023), Llama2 (Touvron et al., 2023), Gemini (Team et al., 2023), and Phi-3 (Abdin et al., 2024). Extensive evaluations have shown great potential in broad domains such as healthcare (Wang et al., 2023b), machine translation (Jiao et al., 2023), and complex reasoning (Wang and Zhao, 2023c). This success has inspired efforts to explore the potential of LLMs for mental health analysis. Some studies (Lamichhane, 2023; Yang et al.,2023a) have tested the performance of ChatGPT on multiple classification tasks, such as stress, depression, and suicide detection, revealing initial potential for mental health applications but also highlighting significant room for improvement, with around $5-10%$ performance gaps. Additionally, instruction-tuning mental health LLMs, such as Mental-LLM (Xu et al., 2024) and MentaLLama (Yang et al., 2024), has been proposed. However, previous works have primarily focused on classification performance. Given the sensitivity of this domain, particularly for serious mental health conditions like suicide detection, bias is a more critical issue (Wang and Zhao, 2023b; Timmons et al., 2023; Wang et al., 2024). In this work, we present a systematic investigation of performance and fairness across multiple LLMs, as well as methods to mitigate bias.
基于Transformer的语言模型成功激励研究者和从业者开发参数规模达数百亿乃至数千亿的更强大模型,例如GPT-4 (Achiam et al., 2023)、Llama2 (Touvron et al., 2023)、Gemini (Team et al., 2023)和Phi-3 (Abdin et al., 2024)。大量评估研究表明,这些模型在医疗健康 (Wang et al., 2023b)、机器翻译 (Jiao et al., 2023)和复杂推理 (Wang and Zhao, 2023c)等领域展现出巨大潜力。这一成功促使学界开始探索大语言模型在心理健康分析中的应用前景。部分研究 (Lamichhane, 2023; Yang et al., 2023a)测试了ChatGPT在压力检测、抑郁识别和自杀倾向分类等任务中的表现,虽然揭示了初步应用价值,但也指出约5-10%的性能差距仍有显著改进空间。此外,研究者提出了指令微调的心理健康专用模型,如Mental-LLM (Xu et al., 2024)和MentaLLama (Yang et al., 2024)。然而既有工作主要聚焦分类性能指标,鉴于该领域敏感性(尤其是自杀检测等严重心理健康问题),模型偏差成为更关键的议题 (Wang and Zhao, 2023b; Timmons et al., 2023; Wang et al., 2024)。本研究系统考察了多种大语言模型的性能与公平性,并提出了偏差缓解方法。
3 Experiments
3 实验
In this section, we describe the datasets, models, and prompts used for evaluation. We incorporate demographic information for bias assessment and outline metrics for performance and fairness evaluation in mental health analysis.
在本节中,我们将介绍用于评估的数据集、模型和提示。我们整合了人口统计信息以进行偏见评估,并概述了心理健康分析中的性能和公平性评估指标。
3.1 Datasets
3.1 数据集
The datasets used in our evaluation encompass a wide range of mental health topics. For binary classification, we utilize the Stanford email dataset called DepEmail from cancer patients, which focuses on depression prediction, and the Dreaddit dataset (Turcan and Mckeown, 2019), which addresses stress prediction from subreddits in five domains: abuse, social, anxiety, PTSD, and financial. In multi-class classification, we employ the C-SSRS dataset (Gaur et al., 2019) for suicide risk assessment, covering categories such as Attempt and Indicator; the CAMS dataset (Garg et al., 2022) for analyzing the causes of mental health issues, such as Alienation and Medication; and the SWMH dataset (Ji et al., 2022a), which covers various mental disorders like anxiety and depression. For multi-label classification, we include the IRF dataset (Garg et al., 2023), capturing interpersonal risk factors of Thwarted Belonging ness (TBe) and Perceived Burdensome ness (PBu); the MultiWD dataset (Sathvik and Garg, 2023), examining various wellness dimensions, such as finance and spirit; and the SAD dataset (Mauriello et al., 2021), exploring the causes of stress, such as school and social relationships. Table 1 provides an overview of the tasks and datasets.
我们评估中使用的数据集涵盖了广泛的心理健康主题。在二分类任务中,我们采用了斯坦福大学癌症患者抑郁预测邮件数据集DepEmail,以及用于从五个领域(虐待、社交、焦虑、创伤后应激障碍和财务)的subreddit中预测压力的Dreaddit数据集 (Turcan and Mckeown, 2019)。在多分类任务中,我们使用了C-SSRS数据集 (Gaur et al., 2019) 进行自杀风险评估,涵盖"Attempt"和"Indicator"等类别;CAMS数据集 (Garg et al., 2022) 用于分析心理健康问题的成因,如"Alienation"和"Medication";以及覆盖焦虑和抑郁等多种精神障碍的SWMH数据集 (Ji et al., 2022a)。在多标签分类任务中,我们纳入了IRF数据集 (Garg et al., 2023),捕捉"Thwarted Belongingness (TBe)"和"Perceived Burdensomeness (PBu)"等人际风险因素;研究财务、精神等多维度健康状态的MultiWD数据集 (Sathvik and Garg, 2023);以及探索学校、社会关系等压力成因的SAD数据集 (Mauriello et al., 2021)。表1总结了各项任务和数据集概况。
3.2 Demographic Enrichment
3.2 人口统计增强
We enrich the demographic information of the original text inputs to quantify model biases across diverse social factors, addressing the inherent lack of such detailed context in most mental health datasets due to privacy concerns. Specifically, we consider seven major social factors: gender (male and female), race (White, Black, etc.), religion (Christianity, Islam, etc.), nationality (U.S., Canada, etc.), sexuality (heterosexual, homosexual, etc.), and age (child, young adult, etc.). Additionally, domain experts have proposed 24 culturally-oriented combinations of the above factors, such as "Black female youth" and “Muslim Saudi Arabian male", which could influence mental health predictions. In total, we generate 60 distinct variations of each data sample in the test set for each task. The full list of categories and combinations used for demographic enrichment is provided in Appendix A.
我们通过丰富原始文本输入的人口统计信息,来量化模型在不同社会因素上的偏差,以解决大多数心理健康数据集因隐私问题而缺乏此类详细背景信息的固有缺陷。具体而言,我们考虑了七大社会因素:性别(男性和女性)、种族(白人、黑人等)、宗教(基督教、伊斯兰教等)、国籍(美国、加拿大等)、性取向(异性恋、同性恋等)以及年龄(儿童、青年等)。此外,领域专家还提出了24种基于上述因素的文化导向组合,例如"黑人女性青年"和"穆斯林沙特阿拉伯男性",这些组合可能影响心理健康预测结果。针对每项任务,我们在测试集中为每个数据样本生成了60种不同的变体。用于人口统计信息强化的完整类别及组合列表详见附录A。
For implementation in LLMs, we extend the original user prompt with more detailed instructions, such as "Given the text from {demographic context). For BERT-based models, we append the text with: “As a(n) {demographic context). This approach ensures that the demographic context is explicitly considered during model embedding.
为实现大语言模型(LLM)的应用,我们通过添加更详细的指令来扩展原始用户提示,例如"给定来自{人口统计背景}的文本"。对于基于BERT的模型,我们在文本后附加:"作为一个{人口统计背景}"。这种方法确保在模型嵌入过程中明确考虑人口统计背景。
3.3Models
3.3 模型
We divide the models used in our experiments into two major categories. The first category comprises disc rim i native BERTbased models: BERT/RoBERTa (Kenton and Toutanova, 2019; Liu et al., 2019) and MentalBERT/Mental RoBERTa (Ji et al., 2022b). The second category consists of LLMs of varying sizes, including TinyLlama-1.1B-Chat-v1.0 (Zhang et al., 2024), Phi-3-mini $128\mathrm{k\Omega}$ -instruct (Abdin et al., 2024), gemma-2b-it, gemma-7b-it (Team et al., 2024), Llama-2-7b-chat-hf, Llama-2-13b- chat-hf (Touvron et al., 2023), MentaLLaMA-chat7B, MentaLLaMA-chat-13B (Yang et al., 2024), Llama-3-8B-Instruct (AI $@$ Meta, 2024), and GPT4 (Achiam et al., 2023). GPT-4 is accessed through the OpenAI API, while the remaining models are loaded from Hugging Face. For all LLM evaluations, we employ greedy decoding (i.e., temperature $=0$ ) during model response generation. Given the constraints of API costs, we randomly select 200 samples from the test set for each dataset (except C-SSRS) following (Wang and Zhao, 2023a). Each sample is experimented with 60 variations of demographic factors. Except for GPT-4, all experiments use four NVIDIA A100 GPUs.
我们将实验使用的模型分为两大类。第一类是基于BERT的判别式模型:BERT/RoBERTa (Kenton and Toutanova, 2019; Liu et al., 2019) 和 MentalBERT/MentalRoBERTa (Ji et al., 2022b)。第二类包含不同规模的大语言模型,包括TinyLlama-1.1B-Chat-v1.0 (Zhang et al., 2024)、Phi-3-mini $128\mathrm{k\Omega}$ -instruct (Abdin et al., 2024)、gemma-2b-it、gemma-7b-it (Team et al., 2024)、Llama-2-7b-chat-hf、Llama-2-13b-chat-hf (Touvron et al., 2023)、MentaLLaMA-chat7B、MentaLLaMA-chat-13B (Yang et al., 2024)、Llama-3-8B-Instruct (AI @ Meta, 2024) 以及 GPT4 (Achiam et al., 2023)。GPT-4通过OpenAI API调用,其余模型均从Hugging Face加载。所有大语言模型评估均采用贪婪解码(即温度 $=0$)生成响应。鉴于API成本限制,我们参照(Wang and Zhao, 2023a)的方法,从每个测试集(C-SSRS除外)随机选取200个样本,每个样本进行60种人口统计因素变体的实验。除GPT-4外,所有实验均使用四块NVIDIA A100 GPU。
3.4 Prompts
3.4 提示词
We explore the effectiveness of various prompting strategies in evaluating LLMs. Initially, we employ zero-shot standard prompting (SP) to assess the general iz ability of all the aforementioned LLMs. Subsequently, we apply few-shot $(\mathrm{k}{=}3)$ CoT prompting (Wei et al., 2022) to a subset of LLMs to evaluate its potential benefits in this domain. Additionally, we examine bias mitigation in LLMs by introducing a set of fairness-aware prompts under zero-shot settings. These include:
我们探索了多种提示策略在评估大语言模型(LLM)时的有效性。首先,我们采用零样本标准提示(SP)来评估所有上述大语言模型的泛化能力。接着,我们对部分大语言模型应用少样本$(\mathrm{k}{=}3)$思维链提示(Wei et al., 2022)来评估其在该领域的潜在优势。此外,我们通过在零样本设置下引入一系列公平意识提示来研究大语言模型中的偏见缓解问题。这些提示包括:
General templates or examples of all the prompting strategies are presented in Appendix B.
所有提示策略的通用模板或示例见附录B。
3.5 Evaluation Metrics
3.5 评估指标
We report the weighted-F1 score for performance and use Equalized Odds (EO) (Hardt et al., 2016) as the fairness metric, ensuring similar true positive rates (TPR) and false positive rates (FPR) across different demographic groups. For multiclass categories (e.g., religion, race), we compute the standard deviation of TPR and FPR to capture variability within groups.
我们以加权F1分数衡量性能,并采用均衡几率 (Equalized Odds, EO) [Hardt et al., 2016] 作为公平性指标,确保不同人口统计组间具有相似的真阳性率 (TPR) 和假阳性率 (FPR) 。对于多分类类别(如宗教、种族),我们计算TPR和FPR的标准差以衡量组内变异程度。
Table 1: Overview of eight mental health datasets. EHR stands for Electronic Health Records.
| Data | Task | Data Size (train/test) | Source | Labels/Aspects |
| Binary Classification | ||||
| DepEmail | depression | 5,457/607 | EHR | Depression, Non-depression |
| Dreaddit | stress | 2,838/715 | Stress, Non-stress | |
| Multi-class Classification | ||||
| C-SSRS | suicide risk | 400/100 | Ideation, Supportive, Indicator, Attempt,Behavior | |
| CAMS | mental issues cause | 3,979/1,001 | Bias or Abuse,Jobs and Careers,Medication, Relationship,Alienation, No Reason | |
| SWMH | mental disorders | 34,823/10,883 | Anxiety, Bipolar, Depression, SuicideWatch, Offmychest | |
| Multi-label Classification | ||||
| IRF | interpersonal risk factors | 1,972/1,057 | TBe,PBu | |
| MultiWD | wellness dimensions | 2,624/657 | Spiritual, Physical, Intellectual, Social, Vocational, Emotional | |
| SAD | stresscause | 5,480/1,370 | SMS-like | Finance, Family, Health, Emotion, Work Social Relation, School, Decision, Other |
表 1: 八个心理健康数据集概览。EHR代表电子健康记录 (Electronic Health Records)。
| 数据 | 任务 | 数据量 (训练集/测试集) | 来源 | 标签/维度 |
|---|---|---|---|---|
| 二分类任务 | ||||
| DepEmail | 抑郁症检测 | 5,457/607 | EHR | 抑郁症,非抑郁症 |
| Dreaddit | 压力检测 | 2,838/715 | 压力,非压力 | |
| 多分类任务 | ||||
| C-SSRS | 自杀风险 | 400/100 | 意念,支持性,指标,企图,行为 | |
| CAMS | 心理问题成因 | 3,979/1,001 | 偏见或虐待,工作与职业,药物,人际关系,疏离感,无明确原因 | |
| SWMH | 精神障碍 | 34,823/10,883 | 焦虑症,双相情感障碍,抑郁症,自杀观察,倾诉心声 | |
| 多标签分类任务 | ||||
| IRF | 人际风险因素 | 1,972/1,057 | TBe,PBu | |
| MultiWD | 健康维度 | 2,624/657 | 精神,身体,智力,社交,职业,情感 | |
| SAD | 压力成因 | 5,480/1,370 | 类短信平台 | 财务,家庭,健康,情绪,工作社交关系,学业,决策,其他 |
4 Results
4 结果
In this section, we analyze model performance and fairness across datasets, examine the impact of model scale, identify common errors in LLMs for mental health analysis, and demonstrate the effectiveness of fairness-aware prompts in mitigating bias with minimal performance loss.
在本节中,我们分析了模型在不同数据集上的性能和公平性,研究了模型规模的影响,识别了大语言模型在心理健康分析中的常见错误,并展示了公平性提示在最小化性能损失的同时缓解偏见的有效性。
4.1 Main Results
4.1 主要结果
We report the classification and fairness results from the demographic-enriched test set in Table 2. Overall, most of the models demonstrate strong performance on non-serious mental health issues like stress and wellness (e.g., Dreaddit and MultiWD). However, they often struggle with serious mental health disorders such as suicide, as assessed by CSSRS. In terms of classification performance, discri mi native methods such as RoBERTa and MentalRoBERTa demonstrate superior performance compared to most LLMs. For instance, RoBERTa achieves the best F1 score in MultiWD $(81.8%)$ , while Mental RoBERTa achieves the highest F1 score in CAMS $(55.0%)$ . Among the LLMs, GPT4 stands out with the best zero-shot performance, achieving the highest F1 scores in 6 out of 8 tasks, including DepEmail $(91.9%)$ and C-SSRS $(34.6%)$ These results highlight the effectiveness of domainspecific PLMs and leveraging advanced LLMs for specific tasks in mental health analysis.
我们在表2中报告了基于人口统计学增强测试集的分类结果与公平性评估。总体而言,大多数模型在压力、健康等非严重心理健康问题(如Dreaddit和MultiWD数据集)上表现优异,但在CSSRS评估的自杀倾向等严重心理障碍识别中普遍存在困难。分类性能方面,RoBERTa和MentalRoBERTa等判别式方法优于多数大语言模型——RoBERTa在MultiWD获得最高F1值$(81.8%)$,MentalRoBERTa则在CAMS达到$(55.0%)$。大语言模型中,GPT4展现出最佳的零样本能力,在8项任务中的6项(包括DepEmail$(91.9%)$和C-SSRS$(34.6%)$)取得最高F1分数。这些结果印证了领域专用预训练模型(PLM)和先进大语言模型在心理健康分析任务中的有效性。
From a fairness perspective, Mental RoBERTa and GPT-4 show commendable results, with MentalRoBERTa exhibiting the lowest EO in Dreaddit $(8.0%)$ and maintaining relatively low EO scores across other datasets. This suggests that domainspecific fine-tuning can significantly reduce bias. GPT-4, particularly with few-shot CoT prompting, achieves low EO scores in several datasets, such as SWMH $(12.3%)$ andSAD $(23.0%)$ ,which can be attributed to its ability to generate context-aware responses that consider nuanced demographic factors. Smaller scale LLMs like Gemma-2B and TinyLlama-1.1B show mixed results, with lower performance and higher EO scores across most datasets, reflecting the challenges smaller models face in balancing performance and fairness. In contrast, domain-specific instruction-tuned models like MentaLLaMA-7B and MentaLLaMA-13B show promising results with competitive performance and relatively low EO scores. Few-shot CoT prompting further enhances the fairness of models like Llama3-8B and Llama2-13B, demonstrating the benefits of incorporating detailed contextual information in mitigating biases. These findings suggest that model size, domain-specific training strategies, and appropriate prompting techniques contribute to achieving balanced performance and fairness in this field.
从公平性角度来看,Mental RoBERTa和GPT-4表现出值得称赞的结果,其中MentalRoBERTa在Dreaddit数据集上展现出最低的EO(8.0%),并在其他数据集中保持相对较低的EO分数。这表明领域特定微调能显著减少偏见。GPT-4(尤其是采用少样本思维链提示时)在多个数据集中获得低EO分数,例如SWMH(12.3%)和SAD(23.0%),这归功于其生成上下文感知响应、考量细微人口因素的能力。Gemma-2B和TinyLlama-1.1B等较小规模的大语言模型表现参差不齐,在多数数据集中性能较低且EO分数较高,反映出小模型在平衡性能与公平性方面的挑战。相比之下,MentaLLaMA-7B和MentaLLaMA-13B等领域特定指令微调模型展现出有竞争力的性能与相对较低的EO分数。少样本思维链提示进一步提升了Llama3-8B和Llama2-13B等模型的公平性,证明引入详细上下文信息对缓解偏见具有益处。这些发现表明,模型规模、领域特定训练策略和恰当的提示技术有助于在该领域实现性能与公平性的平衡。
4.2 Impact of Model Scale on Classification Performance and Fairness
4.2 模型规模对分类性能与公平性的影响
We explore the impact of model scale on performance and fairness by averaging the F1 and EO scores across all datasets, as shown in Figure 2, focusing on zero-shot scenarios for LLMs. For
我们通过在所有数据集上平均F1和EO分数来探究模型规模对性能和公平性的影响,如图2所示,重点关注大语言模型的零样本场景。
Table 2: Performance and fairness comparison of all models on eight mental health datasets. Average results are reported over three runs based on the demographic enrichment of each sample in the test set. F1 $(%)$ andEo $(%)$ results are averaged over all social factors. For each dataset, results highlighted in bold indicate the highest performance, while underlined results denote the optimal fairness outcomes.
| Model | DepEmail | Dreaddit | C-SSRS | CAMS | SWMH | IRF | MultiWD | SAD | |||||||||
| F1↑ | EO← | F1↑ | F1↑ | EO← | F1↑ EO← | F1↑ | F1↑ | EO← | F1↑ | ↑0 | F1↑ | EO↓ | |||||
| Discriminative methods | |||||||||||||||||
| BERT-base | 88.2 | 31.5 | 53.6 | 31.7 | 26.5 | 28.9 | 42.8 | 16.7 | 52.8 | 19.8 | 74.9 | 19.1 | 78.6 | 31.8 | 79.0 | 19.9 | |
| RoBERTa-base | 90.7 | 30.0 | 77.2 | 10.8 | 27.8 | 22.9 | 47.0 | 13.3 | 63.1 | 15.2 | 75.4 | 18.3 | 81.8 | 27.1 | 79.0 | 19.5 | |
| MentalBERT | 92.0 | 30.1 | 57.2 | 32.9 | 26.9 | 21.8 | 51.3 | 13.6 | 58.4 | 19.0 | 80.5 | 11.9 | 81.4 | 28.8 | 76.7 | 19.8 | |
| MentalRoBERTa | 94.3 | 28.0 | 77.5 | 8.0 | 32.7 | 20.4 | 55.0 | 17.1 | 61.4 | 13.4 | 79.5 | 12.7 | 81.3 | 23.5 | 79.1 | 19.3 | |
| LLM-based Methods with Zero-shot SP | |||||||||||||||||
| TinyLlama-1.1B | 49.3 | 43.8 | 68.0 | 46.2 | 28.6 | 19.8 | 21.9 | 18.5 | 35.1 | 36.8 | 41.3 | 41.1 | 63.0 | 30.7 | 68.4 | 50.0 | |
| Gemma-2B | 44.8 | 50.0 | 69.4 | 50.0 | 26.9 | 34.6 | 41.6 | 25.6 | 42.3 | 35.7 | 43.8 | 47.9 | 71.2 | 41.2 | 41.6 | 25.6 | |
| Phi-3-mini | 46.1 | 45.6 | 69.2 | 50.0 | 21.3 | 26.8 | 31.4 | 25.7 | 23.9 | 29.7 | 58.9 | 45.2 | 62.1 | 28.8 | 70.2 | 32.3 | |
| Gemma-7B | 83.3 | 6.4 | 76.2 | 41.6 | 25.1 | 16.8 | 39.8 | 23.0 | 49.2 | 29.9 | 47.1 | 40.7 | 73.9 | 35.3 | 72.3 | 34.6 | |
| Llama2-7B | 74.9 | 10.2 | 64.0 | 19.7 | 22.6 | 23.4 | 27.3 | 14.7 | 42.7 | 31.8 | 53.4 | 38.3 | 68.7 | 37.3 | 71.8 | 32.6 | |
| MentaLLaMA-7B | 90.6 | 27.7 | 58.7 | 10.1 | 23.7 | 25.8 | 29.9 | 23.9 | 43.6 | 35.3 | 57.1 | 34.7 | 68.9 | 39.9 | 72.7 | 36.8 | |
| Llama3-8B | 85.9 | 9.9 | 70.3 | 46.2 | 26.3 | 29.8 | 40.5 | 22.3 | 47.2 | 28.5 | 53.6 | 43.7 | 75.6 | 30.3 | 77.2 | 30.9 | |
| Llama2-13B | 82.1 | 9.6 | 66.2 | 18.7 | 25.2 | 23.2 | 25.3 | 17.2 | 43.2 | 33.5 | 56.2 | 37.5 | 71.2 | 38.3 | 71.6 | 36.7 | |
| MentaLLaMA-13B | 91.2 | 23.6 | 60.2 | 9.9 | 24.4 | 25.8 | 30.9 | 23.6 | 43.2 | 36.1 | 58.8 | 34.1 | 66.7 | 40.6 | 75.0 | 36.4 | |
| GPT-4 | 91.9 | 10.1 | 73.4 | 38.8 | 34.6 | 25.8 | 49.4 | 21.4 | 64.6 | 10.5 | 57.8 | 37.5 | 79.8 | 25.2 | 78.4 | 22.2 | |
| LLM-based Methods with Few-shot CoT | |||||||||||||||||
| 86.0 | 6.2 | 77.8 | 40.8 | 26.1 | 16.5 | 39.2 | 24.7 | 50.9 | 29.5 | 48.2 | 39.1 | 74.2 | 34.6 | 72.8 | 34.0 | ||
| Llama3-8B | 88.2 | 10.4 | 72.5 | 45.7 | 27.7 | 29.3 | 42.1 | 21.9 | 45.3 | 29.3 | 54.8 | 42.1 | 77.2 | 32.5 | 79.3 | 29.8 | |
| Llama2-13B | 84.8 | 11.7 | 67.9 | 18.4 | 26.6 | 24.3 | 27.4 | 16.9 | 45.3 | 32.4 | 57.3 | 36.8 | 73.6 | 35.2 | 74.1 | 33.5 | |
| GPT-4 | 95.1 | 10.4 | 78.1 | 38.2 | 37.2 | 24.4 | 50.7 | 20.6 | 66.8 | 12.3 | 63.7 | 32.4 | 81.6 | 27.3 | 81.2 | 23.0 | |
表 2: 所有模型在八个心理健康数据集上的性能与公平性对比。结果基于测试集中每个样本的人口统计特征增强进行三次运行的平均值报告。F1 $(%)$ 和 EO $(%)$ 结果是所有社会因素的平均值。对于每个数据集,加粗结果显示最高性能,下划线结果表示最优公平性表现。
| 模型 | DepEmail F1↑ | DepEmail EO← | Dreaddit F1↑ | Dreaddit EO← | C-SSRS F1↑ | C-SSRS EO← | CAMS F1↑ | CAMS EO← | SWMH F1↑ | SWMH EO← | IRF F1↑ | IRF EO← | MultiWD F1↑ | MultiWD EO← | SAD F1↑ | SAD EO← |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 判别式方法 | ||||||||||||||||
| BERT-base | 88.2 | 31.5 | 53.6 | 31.7 | 26.5 | 28.9 | 42.8 | 16.7 | 52.8 | 19.8 | 74.9 | 19.1 | 78.6 | 31.8 | 79.0 | 19.9 |
| RoBERTa-base | 90.7 | 30.0 | 77.2 | 10.8 | 27.8 | 22.9 | 47.0 | 13.3 | 63.1 | 15.2 | 75.4 | 18.3 | 81.8 | 27.1 | 79.0 | 19.5 |
| MentalBERT | 92.0 | 30.1 | 57.2 | 32.9 | 26.9 | 21.8 | 51.3 | 13.6 | 58.4 | 19.0 | 80.5 | 11.9 | 81.4 | 28.8 | 76.7 | 19.8 |
| MentalRoBERTa | 94.3 | 28.0 | 77.5 | 8.0 | 32.7 | 20.4 | 55.0 | 17.1 | 61.4 | 13.4 | 79.5 | 12.7 | 81.3 | 23.5 | 79.1 | 19.3 |
| 基于大语言模型的零样本方法 | ||||||||||||||||
| TinyLlama-1.1B | 49.3 | 43.8 | 68.0 | 46.2 | 28.6 | 19.8 | 21.9 | 18.5 | 35.1 | 36.8 | 41.3 | 41.1 | 63.0 | 30.7 | 68.4 | 50.0 |
| Gemma-2B | 44.8 | 50.0 | 69.4 | 50.0 | 26.9 | 34.6 | 41.6 | 25.6 | 42.3 | 35.7 | 43.8 | 47.9 | 71.2 | 41.2 | 41.6 | 25.6 |
| Phi-3-mini | 46.1 | 45.6 | 69.2 | 50.0 | 21.3 | 26.8 | 31.4 | 25.7 | 23.9 | 29.7 | 58.9 | 45.2 | 62.1 | 28.8 | 70.2 | 32.3 |
| Gemma-7B | 83.3 | 6.4 | 76.2 | 41.6 | 25.1 | 16.8 | 39.8 | 23.0 | 49.2 | 29.9 | 47.1 | 40.7 | 73.9 | 35.3 | 72.3 | 34.6 |
| Llama2-7B | 74.9 | 10.2 | 64.0 | 19.7 | 22.6 | 23.4 | 27.3 | 14.7 | 42.7 | 31.8 | 53.4 | 38.3 | 68.7 | 37.3 | 71.8 | 32.6 |
| MentaLLaMA-7B | 90.6 | 27.7 | 58.7 | 10.1 | 23.7 | 25.8 | 29.9 | 23.9 | 43.6 | 35.3 | 57.1 | 34.7 | 68.9 | 39.9 | 72.7 | 36.8 |
| Llama3-8B | 85.9 | 9.9 | 70.3 | 46.2 | 26.3 | 29.8 | 40.5 | 22.3 | 47.2 | 28.5 | 53.6 | 43.7 | 75.6 | 30.3 | 77.2 | 30.9 |
| Llama2-13B | 82.1 | 9.6 | 66.2 | 18.7 | 25.2 | 23.2 | 25.3 | 17.2 | 43.2 | 33.5 | 56.2 | 37.5 | 71.2 | 38.3 | 71.6 | 36.7 |
| MentaLLaMA-13B | 91.2 | 23.6 | 60.2 | 9.9 | 24.4 | 25.8 | 30.9 | 23.6 | 43.2 | 36.1 | 58.8 | 34.1 | 66.7 | 40.6 | 75.0 | 36.4 |
| GPT-4 | 91.9 | 10.1 | 73.4 | 38.8 | 34.6 | 25.8 | 49.4 | 21.4 | 64.6 | 10.5 | 57.8 | 37.5 | 79.8 | 25.2 | 78.4 | 22.2 |
| 基于大语言模型的少样本思维链方法 | ||||||||||||||||
| Llama3-8B | 88.2 | 10.4 | 72.5 | 45.7 | 27.7 | 29.3 | 42.1 | 21.9 | 45.3 | 29.3 | 54.8 | 42.1 | 77.2 | 32.5 | 79.3 | 29.8 |
| Llama2-13B | 84.8 | 11.7 | 67.9 | 18.4 | 26.6 | 24.3 | 27.4 | 16.9 | 45.3 | 32.4 | 57.3 | 36.8 | 73.6 | 35.2 | 74.1 | 33.5 |
| GPT-4 | 95.1 | 10.4 | 78.1 | 38.2 | 37.2 | 24.4 | 50.7 | 20.6 | 66.8 | 12.3 | 63.7 | 32.4 | 81.6 | 27.3 | 81.2 | 23.0 |
BERT-based models, especially MentalBERT and Mental RoBERTa, despite their smaller sizes, they demonstrate generally higher average performance and lower EO scores compared to larger models. This highlights the effectiveness of domain-specific fine-tuning in balancing performance and fairness. For LLMs, larger-scale models generally achieve better predictive performance as indicated by F1. Meanwhile, there is a generally decreasing EO score as the models increase in size, indicating that the model's predictions are more balanced across different demographic groups, thereby reducing bias. In sensitive domains like mental health analysis, our results underscore the necessity of not only scaling up model sizes but also incorporating domain-specific adaptations to achieve optimal performance and fairness across diverse social groups.
基于BERT的模型,特别是MentalBERT和Mental RoBERTa,尽管规模较小,但与更大的模型相比,它们通常表现出更高的平均性能和更低的EO分数。这凸显了领域特定微调在平衡性能和公平性方面的有效性。对于大语言模型,更大规模的模型通常能获得更好的预测性能(以F1分数衡量)。与此同时,随着模型规模的增大,EO分数普遍呈下降趋势,表明模型在不同人口群体间的预测更加平衡,从而减少了偏见。在心理健康分析等敏感领域,我们的结果强调了不仅需要扩大模型规模,还需结合领域特定调整,以实现在不同社会群体中的最佳性能和公平性。
4.3 Performance and Fairness Analysis by Demographic Factors
4.3 基于人口因素的性能和公平性分析
We further analyze four models by examining F1 and EO scores stratified by demographic factors (i.e., gender, race, religion, etc.) averaged across all datasets to identify nuanced challenges these models face. The results are presented in Figure 3. Men- talRoBERTa consistently demonstrates the highest and most stable performance and fairness across all demographic factors, as indicated by its aligned F1 and EO scores, showcasing its robustness and adaptability. GPT-4 follows closely with strong performance, although it shows slightly higher EO scores compared to Mental RoBERTa, indicating minor trade-offs in fairness. Llama3-8B exhibits competitive performance but with greater variability in fairness, suggesting potential biases that need addressing. Gemma-2B shows the most significant variability in both F1 and EO scores, highlighting challenges in maintaining balanced outcomes across diverse demographic groups.
我们进一步分析了四个模型,通过检查按人口统计因素(如性别、种族、宗教等)分层的F1和EO分数(在所有数据集上取平均值),以识别这些模型面临的细微挑战。结果如图3所示。MentalRoBERTa在所有人口统计因素中始终表现出最高且最稳定的性能和公平性,其F1和EO分数高度一致,展现了其稳健性和适应性。GPT-4紧随其后,表现强劲,尽管其EO分数略高于MentalRoBERTa,表明在公平性上存在轻微权衡。Llama3-8B展现出有竞争力的性能,但公平性波动较大,暗示可能存在需要解决的偏差问题。Gemma-2B在F1和EO分数上均表现出最显著的波动,突显了在不同人口统计群体间保持平衡结果的挑战。
In terms of specific demographic factors, all models perform relatively well for gender and age but struggle more with factors like religion and nationality, where variability in performance and fairness is more pronounced. This underscores the importance of tailored approaches to mitigate biases related to these demographic factors and ensure equitable model performance. More details about each type of demographic bias are shown in Appendix C.
在具体的人口统计因素方面,所有模型在性别和年龄上表现相对较好,但在宗教和国籍等因素上表现较差,这些方面的性能差异和公平性问题更为明显。这凸显了针对这些人口统计因素采取定制化方法以减少偏见并确保模型性能公平的重要性。更多关于每种人口统计偏见的详细信息见附录C。
4.4 Error Analysis
4.4 误差分析
We provide a detailed examination of the errors encountered by the models, focusing exclusively on LLMs. Through manual inspection of incorrect pre.
我们详细分析了模型遇到的错误,重点研究了大语言模型(LLM)。通过人工检查错误的预...

Figure 2: Average F1 and EO scores across datasets, ordered by model size (indicated in parentheses). BERT-based models demonstrate superior performance and fairness. For LLMs, as model size increases, performance generally improves (higher F1 scores), and fairness improves (lower EO scores).

图 2: 各数据集的平均F1和EO分数(按模型大小排序,括号内为参数量)。基于BERT的模型展现出更优的性能与公平性。对于大语言模型,随着模型规模增大,性能普遍提升(F1分数更高),公平性也同步改善(EO分数更低)。
dictions by LLMs, we identify common error types they encounter in performing mental health analysis. Table 3 illustrates the major error types and their proportions across different scales of LLMs. As model size increases, “misinterpretation” errors (i.e., incorrect context comprehension) decrease from $24.6%$ to $17.8%$ , indicating better context under standing in larger models. “Sentiment misjudgment' (i.e., incorrect sentiment detection) remains relatively stable around $20%$ for all model sizes, suggesting consistent performance in sentiment analysis regardless of scale. Medium-scale models exhibit the highest “over interpretation” rate (i.e., excessive inference from data) at $23.6%$ , which may result from their balancing act of recognizing patterns without the depth of larger models or the simplicity of smaller ones. “Ambiguity” errors (i.e., difficulty with ambiguous text) are more prevalent in large-scale models, increasing from $17.2%$ in small models to $22.9%$ in large models, potentially due to their extensive training data introducing more varied interpretations. “Demographic bias" (i.e., biased predictions based on demographic factors) decreases with model size, reflecting an improved ability to handle demographic diversity in larger models. In general, while larger models handle context and bias better, issues with sentiment misjudgment and ambiguity persist across all sizes. Detailed descriptions of each error type can be found in Appendix D.
通过分析大语言模型(LLM)的预测结果,我们识别了其在心理健康分析任务中的常见错误类型。表3展示了不同规模大语言模型的主要错误类型及其占比分布。随着模型规模增大,"语境误解"(即错误理解上下文)错误率从24.6%降至17.8%,表明大模型具有更好的语境理解能力。"情感误判"(即错误检测情感倾向)在所有规模模型中保持约20%的稳定比例,说明情感分析性能与模型规模无关。中等规模模型的"过度解读"(即从数据中过度推理)错误率最高(23.6%),这可能是由于其处于识别模式与保持简洁的平衡状态所致。"模糊性错误"(即处理歧义文本困难)在大型模型中更为突出,从小型模型的17.2%增至22.9%,可能源于其庞杂训练数据带来的多义性。"人口统计偏见"(即基于人口因素的偏见预测)随模型规模增大而降低,反映大模型处理人口多样性能力的提升。总体而言,虽然大模型在语境理解和偏见控制方面表现更好,但情感误判和模糊性处理问题在所有规模模型中持续存在。各错误类型的详细说明见附录D。
Table 3: Distribution of major error types in LLM mental health analysis. $\mathrm{LLM}{S}$ (1.1B - 3.8B), $\mathrm{LLM}{M}$ (7B - 8B), and $\mathrm{LLM}_{L}$ $(>8\mathrm{B})$ represent small, medium, and large-scale LLMs, respectively.
| Error Type | LLMs(%) | LLMM (%) | LLML(%) |
| Misinterpretation | 24.6 | 21.3 | 17.8 |
| SentimentMisjudgment | 20.4 | 22.2 | 21.8 |
| Overinterpretation | 18.7 | 23.6 | 21.2 |
| Ambiguity | 17.2 | 15.3 | 22.9 |
| DemographicBias | 19.1 | 17.6 | 16.3 |
表 3: 大语言模型心理健康分析中的主要错误类型分布。$\mathrm{LLM}{S}$ (1.1B - 3.8B)、$\mathrm{LLM}{M}$ (7B - 8B) 和 $\mathrm{LLM}_{L}$ $(>8\mathrm{B})$ 分别代表小规模、中等规模和大规模大语言模型。
| 错误类型 | LLMs(%) | LLMM(%) | LLML(%) |
|---|---|---|---|
| Misinterpretation | 24.6 | 21.3 | 17.8 |
| SentimentMisjudgment | 20.4 | 22.2 | 21.8 |
| Overinterpretation | 18.7 | 23.6 | 21.2 |
| Ambiguity | 17.2 | 15.3 | 22.9 |
| DemographicBias | 19.1 | 17.6 | 16.3 |
4.5 Bias Mitigation with Fairness-aware Prompting Strategies
4.5 基于公平性提示策略的偏见缓解
Given the evident bias patterns exhibited by LLMs in specific tasks, we conduct bias mitigation using a set of fairness-aware prompts (see Section 3.4) to investigate their impacts. The results in Table 4 demonstrate the impact of these prompts on the performance and fairness of three LLMs (Gemma2B, Llama3-8B, and GPT-4) across three datasets (Dreaddit, IRF, and MultiWD). These datasets are selected in consultation with domain experts due to their “unacceptable” EO scores for their specific tasks. Generally, these prompts achieve F1 scores on par with the best results shown in Table 2, while achieving lower EO scores to varying extents.
鉴于大语言模型在特定任务中表现出的明显偏见模式,我们采用一组公平性提示(见第3.4节)进行偏见缓解研究。表4展示了这些提示对三种大语言模型(Gemma2B、Llama3-8B和GPT-4)在三个数据集(Dreaddit、IRF和MultiWD)上性能与公平性的影响。这些数据集因其在特定任务中"不可接受"的EO分数,经领域专家建议而被选用。总体而言,这些提示实现的F1分数与表2所示最佳结果相当,同时在不同程度上降低了EO分数。

Figure 3: Average F1 and EO scores for all demographic factors on four models. For each model, the results are averaged over all datasets. Note that Llama3-8B and GPT-4 are based on zero-shot scenarios.
图 3: 四种模型在所有人口统计因素上的平均 F1 和 EO (Equal Opportunity) 分数。每个模型的结果是所有数据集的平均值。注意 Llama3-8B 和 GPT-4 基于零样本场景。
Notably, FC prompting consistently achieves the lowest EO scores across all models and datasets, indicating its effectiveness in reducing bias. For instance, FC reduces the EO score of GPT-4 from $38.2%$ to $31.6%$ on Dreaddit, resulting in a $17.3%$ improvement in fairness. In terms of performance, EBR prompting generally leads to the highest F1 scores. Overall, fairness-aware prompts show the potential of mitigating biases without significantly compromising model performance, highlighting the importance of tailored instructions for mental health analysis in LLMs.
值得注意的是,FC提示法在所有模型和数据集上始终获得最低的EO分数,表明其在减少偏见方面的有效性。例如,在Dreaddit数据集上,FC将GPT-4的EO分数从$38.2%$降至$31.6%$,公平性提升了$17.3%$。性能方面,EBR提示法通常能取得最高的F1分数。总体而言,公平性提示策略展现了在不显著影响模型性能的前提下缓解偏见的潜力,突显了针对大语言模型心理健康分析定制指令的重要性。
5 Discussion
5 讨论
In this work, we present the first comprehensive and systematic bias evaluation of ten LLMs of varying sizes using eight mental health datasets sourced from EHR and online text data. We employ zero-shot SP and few-shot CoT prompting for our experiments. Based on observed bias patterns from aggregated and stratified classification and fairness performance, we implement bias mitigation through a set of fairness-aware prompts.
在这项工作中,我们首次使用来自电子健康记录(EHR)和在线文本数据的八个心理健康数据集,对十种不同规模的大语言模型进行了全面系统的偏见评估。实验中采用了零样本单提示(SP)和少样本思维链(CoT)提示方法。基于从聚合分类、分层分类以及公平性表现中观察到的偏见模式,我们通过一组公平性感知提示实现了偏见缓解。
Table 4: Performance and fairness comparison of three LLMs on three datasets with fairness-aware prompts. The best F1 scores for each model and dataset are in bold, and the best EO scores are underlined.
| Dataset | Fair Prompts | Gemma-2B | Llama3-8B | GPT-4 | |||
| F1 | EO | F1 | EO | F1 | EO | ||
| Dreaddit | Ref. | 69.4 | 50.0 | 72.5 | 45.7 | 78.1 | 38.2 |
| FC | 70.1 | 42.3 | 72.2 | 42.1 | 78.7 | 31.6 | |
| EBR | 70.8 | 47.6 | 73.4 | 43.5 | 79.8 | 35.4 | |
| RP | 69.5 | 45.1 | 72.8 | 44.1 | 80.4 | 36.2 | |
| CC | 69.2 | 48.5 | 72.3 | 44.8 | 79.4 | 33.8 | |
| IRF | Ref. | 43.8 | 47.9 | 54.8 | 42.1 | 63.7 | 32.4 |
| FC | 44.6 | 42.1 | 55.3 | 37.4 | 64.2 | 28.2 | |
| EBR | 45.7 | 46.3 | 56.1 | 40.7 | 65.3 | 30.3 | |
| RP | 43.9 | 44.7 | 54.9 | 40.2 | 64.6 | 29.5 | |
| CC | 43.2 | 45.4 | 54.5 | 39.1 | 63.9 | 30.8 | |
| MultiWD | Ref. | 71.2 | 41.2 | 75.6 | 30.3 | 79.8 | 25.2 |
| FC | 73.2 | 35.3 | 76.2 | 24.7 | 80.2 | 20.6 | |
| EBR | 72.6 | 39.6 | 75.8 | 28.2 | 81.5 | 23.3 | |
| RP | 72.0 | 38.7 | 76.5 | 27.6 | 80.7 | 23.9 | |
| CC | 71.8 | 37.9 | 75.3 | 29.2 | 79.6 | 24.8 | |
表 4: 三种大语言模型在三个数据集上使用公平提示的性能与公平性对比。每个模型和数据集的最佳F1分数用粗体标出,最佳EO分数用下划线标出。
| 数据集 | 公平提示 | Gemma-2B | Llama3-8B | GPT-4 | |||
|---|---|---|---|---|---|---|---|
| F1 | EO | F1 | EO | F1 | EO | ||
| Dreaddit | Ref. | 69.4 | 50.0 | 72.5 | 45.7 | 78.1 | 38.2 |
| FC | 70.1 | 42.3 | 72.2 | 42.1 | 78.7 | 31.6 | |
| EBR | 70.8 | 47.6 | 73.4 | 43.5 | 79.8 | 35.4 | |
| RP | 69.5 | 45.1 | 72.8 | 44.1 | 80.4 | 36.2 | |
| CC | 69.2 | 48.5 | 72.3 | 44.8 | 79.4 | 33.8 | |
| IRF | Ref. | 43.8 | 47.9 | 54.8 | 42.1 | 63.7 | 32.4 |
| FC | 44.6 | 42.1 | 55.3 | 37.4 | 64.2 | 28.2 | |
| EBR | 45.7 | 46.3 | 56.1 | 40.7 | 65.3 | 30.3 | |
| RP | 43.9 | 44.7 | 54.9 | 40.2 | 64.6 | 29.5 | |
| CC | 43.2 | 45.4 | 54.5 | 39.1 | 63.9 | 30.8 | |
| MultiWD | Ref. | 71.2 | 41.2 | 75.6 | 30.3 | 79.8 | 25.2 |
| FC | 73.2 | 35.3 | 76.2 | 24.7 | 80.2 | 20.6 | |
| EBR | 72.6 | 39.6 | 75.8 | 28.2 | 81.5 | 23.3 | |
| RP | 72.0 | 38.7 | 76.5 | 27.6 | 80.7 | 23.9 | |
| CC | 71.8 | 37.9 | 75.3 | 29.2 | 79.6 | 24.8 |
Our results indicate that LLMs, particularly GPT4, show significant potential in mental health analysis. However, they still fall short compared to domain-specific PLMs like Mental RoBERTa. Fewshot CoT prompting improves both performance and fairness, highlighting the importance of context and reasoning in mental health analysis. Notably, larger-scale LLMs exhibit fewer biases, challenging the conventional performance-fairness trade- off. Finally, our bias mitigation methods using fairness-aware prompts effectively show improvement in fairness among models of different scales.
我们的结果表明,大语言模型(LLM)尤其是GPT4在心理健康分析领域展现出显著潜力,但仍不及Mental RoBERTa等专业领域预训练语言模型(PLM)。少样本思维链(Few-shot CoT)提示能同步提升模型性能和公平性,凸显了上下文与推理在心理健康分析中的重要性。值得注意的是,规模更大的大语言模型表现出更少偏见,这对传统的性能-公平性权衡理论提出了挑战。最后,我们采用公平性感知提示的偏见缓解方法,在不同规模模型中都有效提升了公平性表现。
Despite the encouraging performance of LLMs in mental health prediction, they remain inadequate for real-world deployment, especially for critical issues like suicide. Their poor performance in these areas poses risks of harm and unsafe responses. Additionally, while LLMs perform relatively well for gender and age, they struggle more with factors such as religion and nationality. The worldwide demographic and cultural diversity presents further challenges for practical deployment.
尽管大语言模型(LLM)在心理健康预测方面表现令人鼓舞,但其仍不足以投入实际应用,尤其是在自杀等关键问题上。这些领域的糟糕表现会带来伤害和不安全回应的风险。此外,虽然大语言模型在性别和年龄方面表现相对较好,但在宗教和国籍等因素上表现更差。全球人口统计和文化多样性为实际部署带来了更多挑战。
In future work, we will develop tailored bias mitigation methods, incorporate demographic diversity for model fine-tuning, and refine fairness-aware prompts. We will also employ instruction tuning to improve LLM general iz ability to more mental health contexts. Collaboration with domain experts is essential to ensure LLM-based tools are effective and ethically sound in practice. Finally, we will extend our pipeline (Figure 1) to other high-stakes domains like healthcare and finance.
在未来的工作中,我们将开发定制化的偏见缓解方法,纳入人口多样性进行模型微调,并优化公平性提示策略。同时采用指令调优技术提升大语言模型在更广泛心理健康场景中的泛化能力。与领域专家合作对确保基于大语言模型的工具兼具实践有效性和伦理合规性至关重要。最后,我们将把现有处理流程(图1)扩展到医疗健康、金融等其他高风险领域。
Data and Code Availability
数据和代码可用性
The data and code used in this study are available on GitHub at the following link: https://github.com/EternityYW/BiasEval-LLMMental Health.
本研究使用的数据和代码可在GitHub上获取,链接如下:https://github.com/EternityYW/BiasEval-LLMMental Health。
6 Limitations
6 局限性
Despite the comprehensive nature of this study, several limitations and challenges persist. Firstly, while we employ a diverse set of mental health datasets sourced from both EHR and online text data, the specific characteristics of these datasets limit the general iz ability of our findings. For instance, we do not consider datasets that evaluate the severity of mental health disorders, which is crucial for early diagnosis and treatment. Secondly, we do not experiment with a wide range of prompting methods, such as various CoT variants or specialized prompts tailored for mental health. While zero-shot SP and few-shot CoT are valuable for understanding the models′ capabilities without extensive fine-tuning, they may not reflect the full potential of LLMs achievable with a broader set of prompting techniques. Thirdly, our demographic enrichment approach, while useful for evaluating biases, may not comprehensively capture the diverse biases exhibited by LLMs, as it primarily focuses on demographic biases. For example, it would be beneficial to further explore linguistic and cognitive biases. Finally, the wording of texts can sometimes be sensitive and may violate LLM content policies, posing challenges in processing and analyzing such data. Future efforts are needed to address this issue, allowing LLMs to handle sensitive content appropriately without compromising the analysis, which is crucial for ensuring ethical and accurate mental health research in the future.
尽管本研究具有全面性,但仍存在若干局限性和挑战。首先,虽然我们采用了来自电子健康档案(EHR)和在线文本数据的多样化心理健康数据集,但这些数据集的特异性限制了研究结论的普适性。例如,我们未纳入评估精神障碍严重程度的数据集,而这对于早期诊断和治疗至关重要。其次,我们未尝试广泛的提示方法,如各类思维链(CoT)变体或针对心理健康定制的专用提示。尽管零样本标准提示(SP)和少样本思维链有助于理解模型在无需大量微调时的能力,但可能无法体现大语言模型在更广泛提示技术下的全部潜力。第三,我们的人口统计学数据增强方法虽有助于评估偏差,但主要关注人口统计偏差,可能无法全面捕捉大语言模型表现出的各类偏差。例如,进一步探索语言和认知偏差将更具价值。最后,文本措辞有时可能涉及敏感内容而违反大语言模型的内容政策,这为数据处理和分析带来挑战。未来需解决该问题,使大语言模型既能妥善处理敏感内容,又不影响分析质量——这对确保未来心理健康研究的伦理性和准确性至关重要。
Ethical Considerations
伦理考量
Our study adheres to strict privacy protocols to protect patient confidentiality, utilizing only anonymized datasets from publicly available sources like Reddit and proprietary EHR data, in compliance with data protection regulations, including HIPAA. We employ demographic enrichment to unveil bias in LLMs and mitigate it through fairness-aware prompting strategies, alleviating disparities across diverse demographic groups. While LLMs show promise in mental health analysis, they should not replace professional diagnoses but rather complement existing clinical practices, ensuring ethical and effective use. Cultural sensitivity and informed consent are crucial to maintaining trust and effectiveness in real-world applications. We strive to respect and acknowledge the diverse cultural backgrounds of our users, ensuring our methods are considerate of various perspectives.
我们的研究遵循严格的隐私协议以保护患者机密性,仅使用来自公开来源(如Reddit)的匿名数据集和专有电子健康记录(EHR)数据,并符合包括HIPAA在内的数据保护法规。我们采用人口统计特征增强技术来揭示大语言模型中的偏见,并通过公平感知提示策略加以缓解,从而减少不同人口群体间的差异。虽然大语言模型在心理健康分析中展现出潜力,但它们不应取代专业诊断,而应作为现有临床实践的补充,以确保伦理且有效的使用。文化敏感性和知情同意对于在实际应用中保持信任和有效性至关重要。我们致力于尊重并认可用户的多元文化背景,确保我们的方法能兼顾不同视角。
Acknowledgments
致谢
We would like to thank Diyi Yang for her valuable comments and suggestions. We also thank the authors of the SWMH dataset for granting us access to the data. This project was supported by grant number R 01 HS 024096 from the Agency for Healthcare Research and Quality. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality.
感谢Diyi Yang提出的宝贵意见和建议。同时感谢SWMH数据集作者授权我们使用该数据。本项目由医疗保健研究与质量局资助(资助号R 01 HS 024096)。内容完全由作者负责,并不必然代表医疗保健研究与质量局的官方观点。
References
参考文献
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv pre print ar Xiv:2404.14219.
Marah Abdin、Sam Ade Jacobs、Ammar Ahmad Awan、Jyoti Aneja、Ahmed Awadallah、Hany Awadalla、Nguyen Bach、Amit Bahree、Arash Bakhtiari、Harkirat Behl等。2024。Phi-3技术报告:手机端高性能语言模型。arXiv预印本arXiv:2404.14219。
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Alten schmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Josh Achiam、Steven Adler、Sandhini Agarwal、Lama Ahmad、Ilge Akkaya、Florencia Leoni Aleman、Diogo Almeida、Janko Altenschmidt、Sam Altman、Shyamal Anadkat 等. 2023. GPT-4技术报告. arXiv预印本 arXiv:2303.08774.
Arfan Ahmed, Sarah Aziz, Carla T Toro, Mahmood Alzubaidi, Sara Irshaidat, Hashem Abu Serhan, Alaa A Abd-Alrazaq, and Mowafa Househ. 2022. Machine learning models to detect anxiety and depression through social media: A scoping review. Computer Methods and Programs in Bio medicine Update, 2:100066.
Arfan Ahmed、Sarah Aziz、Carla T Toro、Mahmood Alzubaidi、Sara Irshaidat、Hashem Abu Serhan、Alaa A Abd-Alrazaq 和 Mowafa Househ。2022。通过社交媒体检测焦虑和抑郁的机器学习模型:范围综述。Computer Methods and Programs in Biomedicine Update,2:100066。
AI $@$ Meta. 2024. Llama 3 model card.
AI $@$ Meta. 2024. Llama 3 模型卡
Prabal Datta Barua, Jahmunah Vicnesh, Oh Shu Lih, Elizabeth Emma Palmer, Toshitaka Yamakawa, Makiko Kobayashi, and Udyavara Rajendra Acharya. 2024. Artificial intelligence assisted tools for the detection of anxiety and depression leading to suicidal ideation in adolescents: a review. Cognitive Neuro dynamics, 18(1):1-22.
Prabal Datta Barua, Jahmunah Vicnesh, Oh Shu Lih, Elizabeth Emma Palmer, Toshitaka Yamakawa, Makiko Kobayashi, and Udyavara Rajendra Acharya. 2024. 人工智能辅助工具检测青少年焦虑抑郁导致自杀意念的研究综述. Cognitive Neurodynamics, 18(1):1-22.
Shaurya Bhatnagar, Jyoti Agarwal, and Ojasvi Rajeev Sharma. 2023. Detection and classification of anxiety in university students through the application of machine learning. Procedia Computer Science, 218:1542-1550.
Shaurya Bhatnagar、Jyoti Agarwal 和 Ojasvi Rajeev Sharma。2023。基于机器学习的大学生焦虑检测与分类研究。Procedia Computer Science, 218:1542-1550。
Dan Chisholm, Kim Sweeny, Peter Sheehan, Bruce Rasmussen, Filip Smit, Pim Cuijpers, and Shekhar Saxena. 2016. Scaling-up treatment of depression and anxiety: a global return on investment analysis. The Lancet Psychiatry, 3(5):415-424.
Dan Chisholm、Kim Sweeny、Peter Sheehan、Bruce Rasmussen、Filip Smit、Pim Cuijpers和Shekhar Saxena。2016。扩大抑郁症和焦虑症治疗的全球投资回报分析。《柳叶刀精神病学》3(5):415-424。
Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. 2013. Predicting depression via social media. In Proceedings of the international AAAI conference on web and social media, volume 7, pages 128-137.
Munmun De Choudhury、Michael Gamon、Scott Counts 和 Eric Horvitz。2013. 通过社交媒体预测抑郁。见《国际 AAAI 网络与社交媒体会议论文集》第 7 卷,第 128-137 页。
Muskan Garg, Chandni Saxena, Sriparna Saha, Veena Krishnan, Ruchi Joshi, and Vijay Mago. 2022. Cams: An annotated corpus for causal analysis of mental health issues in social media posts. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6387-6396.
Muskan Garg、Chandni Saxena、Sriparna Saha、Veena Krishnan、Ruchi Joshi 和 Vijay Mago。2022。CAMS:社交媒体帖子中心理健康问题因果分析的标注语料库。载于《第十三届语言资源与评估会议论文集》,第6387-6396页。
Muskan Garg, Amir mohammad Shah band egan, Amrit Chadha, and Vijay Mago. 2023. An annotated dataset for explain able interpersonal risk factors of mental disturbance in social media posts. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11960-11969.
Muskan Garg、Amir mohammad Shah band egan、Amrit Chadha和Vijay Mago。2023。社交媒体帖子中可解释的心理困扰人际风险因素标注数据集。载于《计算语言学协会发现集:ACL 2023》,第11960-11969页。
Manas Gaur, Amanuel Alambo, Joy Prakash Sain, Ugur Kursuncu, Krishna prasad Thiru narayan, Ramakanth Kavuluru, Amit Sheth, Randy Welton, and Jyotishman Pathak. 2019. Knowledge-aware assessment of severity of suicide risk for early intervention. In The world wide web conference, pages 514-525.
Manas Gaur、Amanuel Alambo、Joy Prakash Sain、Ugur Kursuncu、Krishna Prasad Thiru Narayan、Ramakanth Kavuluru、Amit Sheth、Randy Welton和Jyotishman Pathak。2019. 面向早期干预的自杀风险严重性知识感知评估。载于《万维网会议》,第514-525页。
Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29.
Moritz Hardt、Eric Price和Nati Srebro。2016。监督学习中的机会平等。神经信息处理系统进展,29。
Khan Md Hasib, Md Rafiqul Islam, Shadman Sakib, Md Ali Akbar, Imran Razzak, and Mohammad Shafiul Alam. 2023. Depression detection from social networks data based on machine learning and deep learning techniques: An inter og at ive survey. IEEE Transactions on Computational Social Systems.
Khan Md Hasib、Md Rafiqul Islam、Shadman Sakib、Md Ali Akbar、Imran Razzak和Mohammad Shafiul Alam。2023。基于机器学习和深度学习技术的社交网络数据抑郁检测:一项交互式调查。《IEEE社交系统计算汇刊》。
Shaoxiong Ji, Xue Li, Zi Huang, and Erik Cambria. 2022a. Suicidal ideation and mental disorder detection with attentive relation networks. Neural Computing and Applications, 34(13):10309-10319.
Shaoxiong Ji、Xue Li、Zi Huang和Erik Cambria。2022a。基于注意力关系网络的自杀意念与精神障碍检测。Neural Computing and Applications,34(13):10309-10319。
Shaoxiong Ji, Tianlin Zhang, Luna Ansari, Jie Fu, Prayag Tiwari, and Erik Cambria. 2022b. Mentalbert: Publicly available pretrained language models for mental healthcare. In Proceedings of the Thirteenth Language Resources and Evaluation Conference,pages 7184-7190.
Shaoxiong Ji、Tianlin Zhang、Luna Ansari、Jie Fu、Prayag Tiwari和Erik Cambria。2022b。MentalBERT:面向心理健康领域的公开预训练语言模型。载于《第十三届语言资源与评估会议论文集》,第7184-7190页。
Zheng Ping Jiang, Sarah Ita Levitan, Jonathan Zomick, and Julia Hirschberg. 2020. Detection of mental health from reddit via deep contextual i zed representations. In Proceedings of the 1lth international work- shop on health text mining and information analysis, pages 147-156.
郑平江、Sarah Ita Levitan、Jonathan Zomick 和 Julia Hirschberg。2020。基于深度上下文表征的 Reddit 心理健康检测。载于《第11届国际健康文本挖掘与信息分析研讨会论文集》(Proceedings of the 11th international workshop on health text mining and information analysis),第147-156页。
Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is chatgpt a good translator? a preliminary study. arXiv preprint arXiv:2301.08745, 1(10).
温翔娇、温璇王、黄仁泽、王星和涂昭鹏。2023。ChatGPT是一个好的翻译工具吗?初步研究。arXiv预印本arXiv:2301.08745,1(10)。
Dan W Joyce, Andrey Kor milit z in, Katharine A Smith, and Andrea Cipriani. 2023. Explain able artificial intelligence for mental health through transparency and interpret ability for understand ability. npj Digital Medicine,6(1):6.
Dan W Joyce、Andrey Kor milit z in、Katharine A Smith 和 Andrea Cipriani。2023。通过透明度和可解释性实现心理健康领域的可解释人工智能 (Explainable AI)。npj Digital Medicine,6(1):6。
Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171-4186.
Jacob Devlin Ming-Wei Chang Kenton和Lee Kristina Toutanova。2019。BERT:面向语言理解的深度双向Transformer预训练。载于NAACL-HLT会议论文集,第4171-4186页。
Bishal Lamichhane. 2023. Evaluation of chatgpt for nlp-based mental health applications. arXiv preprint arXiv:2303.15727.
Bishal Lamichhane. 2023. 基于自然语言处理的ChatGPT心理健康应用评估. arXiv预印本 arXiv:2303.15727.
Chenhao Lin, Pengwei Hu, Hui Su, Shaochun Li, Jing Mei, Jie Zhou, and Henry Leung. 2020. Sensemood: depression detection on social media. In Proceedings of the2020 international conference on multimedia retrieval, pages 407-411.
陈浩林、彭伟虎、苏辉、李少春、梅静、周杰和梁亨利。2020。SenseMood:社交媒体上的抑郁症检测。载于《2020年国际多媒体检索会议论文集》,第407-411页。
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Z ett le moyer, and Veselin Stoyanov. 2019.
Yinhan Liu、Myle Ott、Naman Goyal、Jingfei Du、Mandar Joshi、Danqi Chen、Omer Levy、Mike Lewis、Luke Zettlemoyer 和 Veselin Stoyanov。2019年。
Roberta: A robustly optimized bert pre training approach. arXiv preprint arXiv:1907.11692.
Roberta: 一种鲁棒优化的BERT预训练方法. arXiv预印本 arXiv:1907.11692.
Matteo Malgaroli, Thomas D Hull, James M Zech, and Tim Althoff. 2023. Natural language processing for mental health interventions: a systematic review and research framework. Translational Psychiatry, 13(1):309.
Matteo Malgaroli、Thomas D Hull、James M Zech和Tim Althoff。2023. 自然语言处理在心理健康干预中的应用:系统性综述与研究框架。Translational Psychiatry,13(1):309。
Matthew Louis Mauriello, Thierry Lincoln, Grace Hon, Dorien Simon, Dan Jurafsky, and Pablo Paredes. 2021. Sad: A stress annotated dataset for recognizing everyday stressors in sms-like conversational systems.In Extended abstracts of the2021 CHI conference on human factors in computing systems, pages 1-7.
Matthew Louis Mauriello、Thierry Lincoln、Grace Hon、Dorien Simon、Dan Jurafsky 和 Pablo Paredes。2021. SAD:一个用于识别类短信对话系统中日常压力源的带压力标注数据集。《2021年计算机系统人因会议扩展摘要》,第1-7页。
Vikas Menon and Lakshmi Vi jayakumar. 2023. Artificial intelligence-based approaches for suicide prediction: Hope or hype? Asian journal of psychiatry, 88:103728.
Vikas Menon和Lakshmi Vijayakumar. 2023. 基于人工智能的自杀预测方法:希望还是炒作?《亚洲精神病学杂志》,88:103728。
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training.
Alec Radford、Karthik Narasimhan、Tim Salimans、Ilya Sutskever 等人。通过生成式预训练提升语言理解能力。
Lisa S Rotenstein, Samuel T Edwards, and Bruce E Landon. 2023. Adult primary care physician visits increasingly address mental health concerns: study examines primary care physician visits for mental health concerns. Health Affairs, 42(2):163-171.
Lisa S Rotenstein、Samuel T Edwards和Bruce E Landon。2023。成人初级保健医生就诊中涉及心理健康问题的比例持续上升:研究探讨初级保健医生接诊心理健康问题的现状。《健康事务》,42(2):163-171。
MSVPJ Sathvik and Muskan Garg. 2023. Multiwd: Multiple wellness dimensions in social media posts. Author e aP reprints.
MSVPJ Sathvik 和 Muskan Garg. 2023. Multiwd: 社交媒体帖子中的多维度健康分析. 作者预印本.
Matthew Squires, Xiaohui Tao, Soman Elangovan, Raj Gururajan, Xujuan Zhou, U Rajendra Acharya, and Yuefeng Li. 2023. Deep learning and machine learning in psychiatry: a survey of current progress in depression detection, diagnosis and treatment. Brain Informatics, 10(1):10.
Matthew Squires、Xiaohui Tao、Soman Elangovan、Raj Gururajan、Xujuan Zhou、U Rajendra Acharya 和 Yuefeng Li。2023。深度学习与机器学习在精神病学中的应用:抑郁症检测、诊断与治疗进展综述。Brain Informatics,10(1):10。
Isabel Straw and Chris Callison-Burch. 2020. Artificial intelligence in mental health and the biases of language based models. PloS one, 15(12):e0240376.
Isabel Straw和Chris Callison-Burch。2020。心理健康中的人工智能与基于语言模型的偏见。《公共科学图书馆·综合》,15(12):e0240376。
Michael Mesfin Tadesse, Hongfei Lin, Bo Xu, and Liang Yang. 2019. Detection of suicide ideation in social media forums using deep learning. Algorithms, 13(1):7.
Michael Mesfin Tadesse、Hongfei Lin、Bo Xu和Liang Yang。2019。基于深度学习的社交媒体论坛自杀意念检测。Algorithms,13(1):7。
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023.Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
Gemini团队、Rohan Anil、Sebastian Borgeaud、Yonghui Wu、Jean-Baptiste Alayrac、Jiahui Yu、Radu Soricut、Johan Schalkwyk、Andrew M Dai、Anja Hauth等。2023。Gemini:一个高性能多模态模型家族。arXiv预印本arXiv:2312.11805。
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bh up at ira ju, Shreya Pathak, Laurent Sifre, Morgane Riviere, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295.
Gemma Team、Thomas Mesnard、Cassidy Hardin、Robert Dadashi、Surya Bhupatiraju、Shreya Pathak、Laurent Sifre、Morgane Riviere、Mihir Sanjay Kale、Juliette Love 等。2024。Gemma:基于Gemini研究与技术的开放模型。arXiv预印本 arXiv:2403.08295。
Adela C Timmons, Jacqueline B Duong, Natalia Simo Fiallo, Theodore Lee, Huong Phuc Quynh Vo, Matthew W Ahle, Jonathan S Comer, LaPrincess C Brewer, Stacy L Frazier, and Theodora Chaspari. 2023. A call to action on assessing and mitigating bias in artificial intelligence applications for mental health. Perspectives on Psychological Science, 18(5):1062-1096.
Adela C Timmons、Jacqueline B Duong、Natalia Simo Fiallo、Theodore Lee、Huong Phuc Quynh Vo、Matthew W Ahle、Jonathan S Comer、LaPrincess C Brewer、Stacy L Frazier 和 Theodora Chaspari。2023。关于评估和减轻心理健康人工智能应用偏见的行动呼吁。《心理科学观点》,18(5):1062-1096。
Ermal Toto, ML Tlachac, and Elke A Run den steiner. 2021. Audibert: A deep transfer learning multimodal classification framework for depression screening. In Proceedings of the 3Oth ACM international conference on information&knowledge management, pages 4145-4154.
Ermal Toto、ML Tlachac 和 Elke A Run den steiner。2021。Audibert:一种用于抑郁筛查的深度迁移学习多模态分类框架。见《第30届ACM国际信息与知识管理会议论文集》,第4145-4154页。
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Hugo Touvron、Louis Martin、Kevin Stone、Peter Albert、Amjad Almahairi、Yasmine Babaei、Nikolay Bashlykov、Soumya Batra、Prajwal Bhargava、Shruti Bhosale 等. 2023. Llama 2: 开放基础与微调对话模型. arXiv预印本 arXiv:2307.09288.
Elsbeth Turcan and Kathleen Mckeown. 2019. Dreaddit: A reddit dataset for stress analysis in social media. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), pages 97-107.
Elsbeth Turcan和Kathleen Mckeown。2019。Dreaddit:用于社交媒体压力分析的Reddit数据集。载于《第十届健康文本挖掘与信息分析国际研讨会论文集》(LOUHI 2019),第97-107页。
Vedant Vajre, Mitch Naylor, Uday Kamath, and Amarda Shehu. 2021. Psychbert: a mental health language model for social media mental health behavioral analysis. In 2021 IEEE International Conference on Bioinformatics and Bio medicine (BIBM), pages 1077- 1082.IEEE.
Vedant Vajre、Mitch Naylor、Uday Kamath和Amarda Shehu。2021。Psychbert:一种用于社交媒体心理健康行为分析的心理健康语言模型。载于2021年IEEE国际生物信息学与生物医学会议(BIBM),第1077-1082页。IEEE。
Yuqing Wang, Malvika Pillai, Yun Zhao, Catherine Curtin, and Tina Hernandez-Boussard. 2024. Fairehrclp: Towards fairness-aware clinical predictions with contrastive learning in multimodal electronic health records. arXiv preprint arXiv:2402.00955.
Yuqing Wang、Malvika Pillai、Yun Zhao、Catherine Curtin 和 Tina Hernandez-Boussard。2024。Fairehrclp:基于多模态电子健康记录对比学习的公平性临床预测框架。arXiv预印本 arXiv:2402.00955。
Yuqing Wang, Prashanth Vijaya r agha van, and Ehsan Degan. 2023a. Prominet: Prototype-based multi-view network for interpret able email response prediction. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 202-215.
Yuqing Wang、Prashanth Vijayaraghavan和Ehsan Degan。2023a。Prominet:基于原型的多视图网络用于可解释的邮件回复预测。见《2023年自然语言处理实证方法会议:工业轨道论文集》,第202-215页。
Yuqing Wang and Yun Zhao. 2023a. Gemini in reasoning: Unveiling commonsense in multimodal large language models. arXiv preprint arXiv:2312.17661.
Yuqing Wang 和 Yun Zhao. 2023a. 推理中的Gemini:揭示多模态大语言模型中的常识. arXiv预印本 arXiv:2312.17661.
Yuqing Wang and Yun Zhao. 2023b. Meta cognitive prompting improves understanding in large language models. arXiv preprint arXiv:2308.05342.
于晴和王云昭。2023b。元认知提示提升大语言模型理解能力。arXiv预印本arXiv:2308.05342。
Yuqing Wang and Yun Zhao. 2023c. Tram: Benchmarking temporal reasoning for large language models. arXiv preprint arXiv:2310.00835.
Yuqing Wang和Yun Zhao. 2023c. Tram: 大语言模型时序推理基准测试. arXiv预印本 arXiv:2310.00835.
Yuqing Wang, Yun Zhao, Rachael Callcut, and Linda Petzold. 2022a. Integrating physiological time series and clinical notes with transformer for early prediction of sepsis. arXiv preprint arXiv:2203.14469.
Yuqing Wang、Yun Zhao、Rachael Callcut 和 Linda Petzold。2022a。基于Transformer的生理时间序列与临床笔记融合脓毒症早期预测方法。arXiv预印本 arXiv:2203.14469。
Yuqing Wang, Yun Zhao, and Linda Petzold. 2022b. Enhancing transformer efficiency for multivariate time series classification. ar Xiv pre print arXiv:2203.14472.
Yuqing Wang、Yun Zhao和Linda Petzold。2022b。提升Transformer在多变量时间序列分类中的效率。arXiv预印本arXiv:2203.14472。
Yuqing Wang, Yun Zhao, and Linda Petzold. 2023b. Are large language models ready for healthcare? a comparative study on clinical language understanding. In Machine Learning for Healthcare Conference, pages 804-823. PMLR.
王玉清、赵芸和Linda Petzold。2023b。大语言模型是否已为医疗保健做好准备?临床语言理解的比较研究。载于《医疗保健机器学习会议》,第804-823页。PMLR。
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824-24837.
Jason Wei、Xuezhi Wang、Dale Schuurmans、Maarten Bosma、Fei Xia、Ed Chi、Quoc V Le、Denny Zhou 等. 2022. 思维链提示激发大语言模型中的推理能力. 神经信息处理系统进展, 35:24824-24837.
Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K Dey, and Dakuo Wang. 2024. Mentalllm: Leveraging large language models for mental health prediction via online text data. Proceedings of the ACM on Interactive,Mobile,Wearable and Ubiquitous Technologies, 8(1):1-32.
Xuhai Xu、Bingsheng Yao、Yuanzhe Dong、Saadia Gabriel、Hong Yu、James Hendler、Marzyeh Ghassemi、Anind K Dey和Dakuo Wang。2024。MentalLLM:利用大语言模型通过在线文本数据进行心理健康预测。ACM交互式、移动、可穿戴和普适技术论文集,8(1):1-32。
Kailai Yang, Shaoxiong Ji, Tianlin Zhang, Qianqian Xie, and Sophia Ananiadou. 2023a. On the evaluations of chatgpt and emotion-enhanced prompting for mental health analysis. arXiv preprint arXiv:2304.03347.
Kailai Yang、Shaoxiong Ji、Tianlin Zhang、Qianqian Xie 和 Sophia Ananiadou。2023a. 关于 ChatGPT 和情感增强提示在心理健康分析中的评估。arXiv 预印本 arXiv:2304.03347。
Kailai Yang, Shaoxiong Ji, Tianlin Zhang, Qianqian Xie, Ziyan Kuang, and Sophia Ananiadou. 2023b. Towards interpret able mental health analysis with large language models. In The 2023 Conference on Empirical Methods in Natural Language Processing.
Kailai Yang、Shaoxiong Ji、Tianlin Zhang、Qianqian Xie、Ziyan Kuang和Sophia Ananiadou。2023b。基于大语言模型的可解释心理健康分析探索。载于《2023年自然语言处理实证方法会议》。
Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. 2024. Mentallama: Interpret able mental health analysis on social media with large language models. In Proceedings of the ACM on Web Conference 2024, pages 4489- 4500.
Kailai Yang、Tianlin Zhang、Ziyan Kuang、Qianqian Xie、Jimin Huang 和 Sophia Ananiadou。2024。Mentallama:基于大语言模型的社交媒体可解释心理健康分析。见《ACM 网络会议 2024 论文集》,第 4489-4500 页。
Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.
Peiyuan Zhang、Guangtao Zeng、Tianduo Wang 和 Wei Lu。2024。TinyLLaMA: 开源小语言模型。arXiv 预印本 arXiv:2401.02385。
Yun Zhao, Qinghang Hong, Xinlu Zhang, Yu Deng, Yuqing Wang, and Linda Petzold. 2021a. Bertsurv: Bert-based survival models for predicting outcomes of trauma patients. arXiv preprint arXiv:2103.10928.
Yun Zhao、Qinghang Hong、Xinlu Zhang、Yu Deng、Yuqing Wang 和 Linda Petzold。2021a。Bertsurv: 基于BERT的创伤患者预后预测生存模型。arXiv预印本 arXiv:2103.10928。
Yun Zhao, Yuqing Wang, Junfeng Liu, Haotian Xia, Zhenni Xu, Qinghang Hong, Zhiyang Zhou, and Linda Petzold. 2021b. Empirical quantitative analysis of covid-19 forecasting models. In 2021 Inter national Conference on Data Mining Workshops (ICDMW), pages 517-526. IEEE.
Yun Zhao、Yuqing Wang、Junfeng Liu、Haotian Xia、Zhenni Xu、Qinghang Hong、Zhiyang Zhou 和 Linda Petzold。2021b。COVID-19预测模型的实证定量分析。2021年国际数据挖掘研讨会(ICDMW),第517-526页。IEEE。
A Demographic Categories
人口统计类别
In this section, we present the full list of 60 distinct variations used for demographic enrichment, as shown in Table 5, spanning seven social factors: gender (2), race (5), religion (5), nationality (15), sexuality (5), age (4), and their combinations (24). The numbers in parentheses denote the quantity of subcategories within each social factor.
在本节中,我们列出了用于人口统计特征增强的60种完整变体列表(如表5所示),涵盖七类社会因素:性别(2)、种族(5)、宗教(5)、国籍(15)、性取向(5)、年龄(4)及其组合(24)。括号中的数字表示每个社会因子的子类别数量。
B Prompt Templates and Examples
B 提示模板与示例
In this section, we present general templates or illustrative examples of all the prompting methods used in our experiments, including zero-shot SP, few-shot CoT, as well as fairness-aware prompts such as EBR, CC, RP, and FC.
在本节中,我们展示了实验中使用的所有提示方法的通用模板或示例,包括零样本 (zero-shot) SP、少样本 (few-shot) CoT,以及公平性相关的提示方法如 EBR、CC、RP 和 FC。
B.1 Zero-shot Standard Prompting
B.1 零样本 (Zero-shot) 标准提示
For all LLMs we have experimented with, we designed instruction-based prompts for zero-shot SP. The general prompt templates are tailored to the specific task as follows:
对于我们实验过的所有大语言模型,我们为零样本SP设计了基于指令的提示。通用提示模板针对具体任务调整如下:
· For binary classification, the prompt is: Given the post from fuser demographic information}: [POST], determine if the post is likely to indicate mental issues. Then provide your reasoning. Return OUTPUT: O (NonCondition) or 1 (Condition). REASONING: · For multi-class classification, the prompt is: Given the post from fuser demographic information}: [POST], identify which mental health category it belongs to. Then provide your reasoning. Return OUTPUT: O (Class 1) or 1 (Class 2) or 2 (Class 3). REASONING: · For multi-label classification, the prompt is: Given the post from fuser demographic information}: [POST], identify all relevant mental health categories. Then provide your reasoning. Return Label 1: OUTPUT: O (No) or 1 (Yes); REASONING: Label 2: OUTPUT: 0 (No) or 1 (Yes); REASONING: Label 3: OUTPUT: 0 (No) or 1 (Yes); REASONING:
- 对于二分类任务,提示词为:给定来自{用户人口统计信息}的帖子:[POST],判断该帖子是否可能表明存在心理健康问题。然后给出你的推理。返回 OUTPUT: 0 (无状况) 或 1 (有状况)。REASONING:
- 对于多分类任务,提示词为:给定来自{用户人口统计信息}的帖子:[POST],识别其所属的心理健康类别。然后给出你的推理。返回 OUTPUT: 0 (类别1) 或 1 (类别2) 或 2 (类别3)。REASONING:
- 对于多标签分类任务,提示词为:给定来自{用户人口统计信息}的帖子:[POST],识别所有相关的心理健康类别。然后给出你的推理。返回 Label 1: OUTPUT: 0 (否) 或 1 (是); REASONING: Label 2: OUTPUT: 0 (否) 或 1 (是); REASONING: Label 3: OUTPUT: 0 (否) 或 1 (是); REASONING:
B.2 Few-shot CoT Prompting
B.2 少样本思维链提示
We present examples of few-shot CoT for each type of classification task described in Table 1. First, for binary classification, we show an example from Dreaddit:
我们针对表 1 中描述的每种分类任务展示少样本思维链 (CoT) 的示例。首先,对于二元分类,我们展示来自 Dreaddit 的示例:
Table 5: Contextual demographic categories.
| Factor | Categories |
| Gender | male,female |
| Race | White, Black, Asian, Native American, Native Hawaiian or Other Pacific Islander |
| Religion | Christianity, Islam, Hinduism, Buddhism, Judaism |
| Nationality | U.S., Canada, Mexico, Brazil, UK, Germany, Russia, Nigeria, South Africa, China, India, Japan, Saudi Arabia, Israel, Australia |
| Sexuality | heterosexual, homosexual, bisexual, pansexual, asexual |
| Age | child, young adult, middle-aged adult, older adult |
| Combinations | Black female youth, middle-aged White male, young adult Hispanic homosexual, Native American asexual, Christian Nigerian female, pansexual Australian youth, Jewish Israeli middle-aged, Black British bisexual, Muslim Saudi Arabian male, Asian American female, Buddhist Japanese senior, Christian Canadian female, heterosexual Russian middle-aged, asexual Chinese young adult, Native Hawaiian Pacific or Other Pacific Islander youth, homosexual Black female, bisexual Brazilian middle-aged, Hindu Indian female, pansexual German youth, Jewish American middle-aged, homosexual Asian male, Buddhist Chinese female, heterosexual White senior, asexual Japanese young adult |
表 5: 情境化人口统计类别
| 因素 | 类别 |
|---|---|
| 性别 | 男性, 女性 |
| 种族 | 白人, 黑人, 亚裔, 美洲原住民, 夏威夷原住民或其他太平洋岛民 |
| 宗教 | 基督教, 伊斯兰教, 印度教, 佛教, 犹太教 |
| 国籍 | 美国, 加拿大, 墨西哥, 巴西, 英国, 德国, 俄罗斯, 尼日利亚, 南非, 中国, 印度, 日本, 沙特阿拉伯, 以色列, 澳大利亚 |
| 性取向 | 异性恋, 同性恋, 双性恋, 泛性恋, 无性恋 |
| 年龄 | 儿童, 青年, 中年, 老年 |
| 组合 | 黑人女性青年, 中年白人男性, 青年西班牙裔同性恋, 美洲原住民无性恋, 基督教尼日利亚女性, 泛性恋澳大利亚青年, 犹太教以色列中年, 黑人英国双性恋, 穆斯林沙特阿拉伯男性, 亚裔美国女性, 佛教日本老年人, 基督教加拿大女性, 异性恋俄罗斯中年, 无性恋中国青年, 夏威夷原住民或其他太平洋岛民青年, 同性恋黑人女性, 双性恋巴西中年, 印度教印度女性, 泛性恋德国青年, 犹太教美国中年, 同性恋亚裔男性, 佛教中国女性, 异性恋白人老年人, 无性恋日本青年 |
Post: I don't post that much so sorry about the formatting! As a preface my mum has always been protective of me. But the main drama started...
帖子:我不常发帖,所以对格式问题先说声抱歉!先说明一下,我妈妈一直对我保护有加。但主要矛盾始于...
OUTPUT: 1 (Stress)
输出 1 (Stress)
REASONING:1.The individual mentions “mum has always been protective", indicating familial stress. 2. The phrase "main drama" suggests ongoing stressful situations. 3. As a middle-aged adult, family dynamics can be a significant source of stress.
推理:1. 当事人提到"妈妈一直很保护",表明存在家庭压力。2. "主要戏剧"这一表述暗示持续存在的压力情境。3. 作为中年人,家庭关系可能是重要的压力来源。
The post is from a Buddhist Chinese female.
帖子来自一位信奉佛教的中国女性。
Post: Around 5 months ago, I started talking to a coworker of mine whom I've admired sinceI started this job threeyears ago...
帖子:大约5个月前,我开始和一位共事三年的同事交谈,从入职起我就一直很欣赏她...
OUTPUT: O (Non-stress)
输出:O(无压力)
REASONING: 1. The post describes a positive interaction with a coworker.2. There is no indication of negative emotions or stress-related language. 3. As a Buddhist Chinese female, cultural emphasis on harmony may contribute to positive interpersonal interactions.
推理:
- 帖子描述了一次与同事的积极互动。
- 没有表现出负面情绪或压力相关的语言。
- 作为信奉佛教的中国女性,文化中对和谐的重视可能促进了积极的人际互动。
The post is from an individual in the UK.
该帖子来自英国的一位个人。
Post: Can't go public restrooms freak dissoci ate surroundings cantwatch certain shows hospital...
帖子:不敢去公共厕所 害怕周围环境 看不了某些节目 医院...
OUTPUT: 1 (Stress)
OUTPUT: 1 (Stress)
REASONING:1.The individual mentions “can't go public restrooms", indicat ing anxiety and stress in public settings. 2. The words "freak"and “dissociate’suggest severe emotional distress. 3. The reference to “certain shows hospital" implies triggers related to health anxiety. 4. Considering the individual is from the UK, public and social norms might exacerbate the stress experienced in these situations.
推理:
- 个体提到 "无法使用公共厕所",表明在公共场合存在焦虑和压力。
- 词语 "freak" 和 "dissociate" 暗示了严重的情绪困扰。
- 提及 "某些医院场景" 暗示与健康焦虑相关的触发因素。
- 考虑到个体来自英国,公共和社会规范可能加剧这些情境下的压力。
The post is from a female.
帖子来自一位女性。
Next, for multi-class classification, we show an example from CAMS:
接下来,针对多类别分类问题,我们展示一个来自CAMS的示例:
You will be given a post. Analyze the content to identify themost likely cause ofthe user's mental issue. Then provide your reasoning step-by-step. The output format should be: OUTPUT: O (No reason), OUTPUT: 1 (Bias or abuse), OUTPUT: 2 (Jobs and Careers), OUTPUT: 3 (Medication), OUTPUT: 4 (Relationship), or OUTPUT: 5 (Alienation); REASONING: Here are three examples: The post is from a middle-aged adult.
你将收到一篇帖子。分析内容以确定用户心理问题最可能的原因。然后逐步提供你的推理过程。输出格式应为:输出:O(无原因),输出:1(偏见或虐待),输出:2(工作与职业),输出:3(药物),输出:4(人际关系),或输出:5(疏离感);推理:以下是三个示例:帖子来自一位中年成年人。
Post: Everything's out of place lately, I feel like there's no future. I've been looking out from my balcony, wanting to run and jump...
最近一切都乱糟糟的,感觉未来一片渺茫。我总站在阳台上往外望,想奔跑想纵身一跃...
OUTPUT: 5 (Alienation)
输出 5 (疏离)
REASONING: 1. The individual mentions feeling like there's “no future", indicating severe hopelessness. 2. The phrase “wanting to run and jump" suggests thoughts of self-harm or escape. 3. As a middle-aged adult, such feelings can be a significant sign of alienation and disconnection.
推理:1. 该个体提到感觉"没有未来",表明存在严重的绝望感。2. "想奔跑跳跃"的表述暗示自残或逃避的念头。3. 作为中年人,此类感受可能是疏离与脱节的重要信号。
The post is from a Buddhist Chinese female.
帖子来自一位信奉佛教的中国女性。
Post: I have good faith that things are moving in an upwards direction for life and ambitions...I'm currently unemployed and, thank God, I'm feeling depressed..
帖子:我坚信生活和抱负正朝着向上的方向发展……目前我失业了,感谢上帝,我感到很沮丧。
OUTPUT: 2 (Jobs and Careers)
输出:2(工作与职业)
REASONING: 1. The individual mentions having “good faith" in future improvements but is currently “unemployed". 2. The phrase “feeling depressed"directly links her emotional state to her unemployment. 3. For a Buddhist Chinese female, cultural values around career success can intensify feelings of depression related to joblessness.
推理:
- 该个体提到对未来改进抱有"善意信念",但目前处于"失业"状态。
- "感到抑郁"这一表述直接将她的情绪状态与失业联系起来。
- 对于一名佛教徒中国女性而言,关于职业成就的文化价值观可能加剧与失业相关的抑郁情绪。
The post is from an individual in the UK.
该帖子来自英国的一位个人。
Post: I had a fight with my fiance, and it feels like our relationship is potentially ending...
帖子:我和未婚夫吵架了,感觉我们的关系可能要结束了...
OUTPUT: 4 (Relationship)
输出 4 (关系)
REASONING: 1. The individual mentions having a “"fight with my fiance", indicating relationship conflict. 2. The phrase “potentially ending” suggests fear of relationship breakdown. 3. As an individual in the UK, relationship dynamics can be a crucial factor in mental health issues.
推理:
- 该个体提到与未婚夫发生"争吵",表明存在关系冲突。
- "可能结束"这一表述暗示了对关系破裂的恐惧。
- 作为英国居民,人际关系动态可能是影响心理健康的关键因素。
The post is from a female.
帖子来自一位女性。
Finally, for multi-label classification, we show an example from IRF:
最后,对于多标签分类,我们展示一个来自IRF的示例:
You will be given a post. Analyze the content to identify the presence of Thwarted Belonging ness and Perceived Burdensome ness.Then provide your reasoning step-by-step. The output format should be: Thwarted Belonging ness: O (No) or 1 (Yes); REASONING: Perceived Burdensome ness: O (No) or 1 (Yes); REASONING: Here are three examples: The post is from a middle-aged adult.
你将收到一篇帖子。分析内容以识别是否存在归属感受挫(thwarted belongingness)和感知负担(perceived burdensomeness)。然后逐步提供你的推理。输出格式应为:
归属感受挫: 0 (否) 或 1 (是); 推理:
感知负担: 0 (否) 或 1 (是); 推理:
以下是三个示例:帖子来自一位中年成年人。
Post: Ifeel alone and want to move away to meet new people. I can't stop thinking and can't get things off my mind... Thwarted Belonging ness: 1 (Yes)
帖子:我感到孤独,想搬走去结识新朋友。我无法停止思考,也无法摆脱脑海中的事情... 归属感受挫:1(是)
REASONING: 1. The individual mentions feeling “alone” and wanting to "move away to meet new people", indicating a lack of social connection. 2. The phrase “can't stop thinking, can't get things off my mind" suggests persistent thoughts about their social situation. 3. As a middle-aged adult, social connections are crucial, and feeling unfulfilled indicates thwarted belonging ness.
推理:
- 该个体提到感到"孤独"并想"搬走去认识新朋友",表明缺乏社交联系。
- 短语"无法停止思考,无法摆脱脑海中的事情"暗示对其社交状况的持续思虑。
- 作为中年人,社交关系至关重要,而未被满足的感受表明归属感受到阻碍。
Perceived Burdensome ness: O (No) REASONING: 1. The individual does not express feeling like a burden to others. 2. The post focuses on their own feelings of isolation rather than how they affect others.
感知负担:无
理由:
- 个体未表达出自己成为他人负担的感受。
- 帖子内容聚焦于其自身的孤立感,而非对他人的影响。
The post is from a Buddhist Chinese female.
帖子来自一位信奉佛教的中国女性。
Post: I've always had a small circle of close friends and not much else. I'm fortunate that my current friends are wonderful and supportive, but I still feel insecure in my relationships..
帖子:我一直只有一个小圈子里的几个亲密朋友,除此之外就没有太多社交了。很幸运现在身边的朋友都很好也很支持我,但我依然在人际关系中缺乏安全感。
Thwarted Belonging ness: O (No)
归属感受挫:O (No)
REASONING: 1. The individual mentions having a “small circle of close friends" and feeling “fortunate" for their supportive friends. 2. Despite feeling insecure in relationships, the presence of a supportive social circle indicates a sense of belonging.
推理:1. 该个体提到自己有一个"亲密好友的小圈子",并为拥有支持自己的朋友感到"幸运"。2. 尽管在人际关系中感到不安,但支持性社交圈的存在表明其拥有归属感。
Perceived Burdensome ness: 1 (Yes) REASONING: 1. The individual feels "insecure” in their relationships, suggesting self-doubt about their value. 2. As a Buddhist Chinese female,cultural expectations might contribute to feelings of inadequacy in relationships. 3. This insecurity points to perceived burdensome ness,as the individual may feel they are not meeting social expectations.
感知负担性:1(是)
理由:
- 个体在关系中感到"不安全",表明其对自身价值存在自我怀疑。
- 作为佛教徒华裔女性,文化期待可能加剧其在关系中的不足感。
- 这种不安全感指向感知负担性,因为个体可能认为自身未达到社会期待。
The post is from an individual in the UK.
该帖子来自英国的一位个人。
Post: I've tried every path possible to stop the pain. Counseling so far has seemed like a failure. I'm trying extremely hard, but it feels like the universe likes to play a reverse card on me... Thwarted Belonging ness: O (No)
帖子:我尝试了所有可能的方法来阻止痛苦。到目前为止,心理咨询似乎毫无效果。我拼尽全力,但感觉宇宙总爱对我打出逆转牌...归属感受挫:(否)
REASONING: 1. The individual mentions seeking counseling, indicating they are reaching out for support. 2. Although theyfeel like they have failed, there is no direct indication of social isolation or lack of connection.
推理:
- 个体提到寻求心理咨询,表明他们正在寻求支持。
- 尽管他们感觉自己失败了,但没有直接迹象表明存在社交孤立或缺乏联系。
Perceived Burdensome ness: 1 (Yes) REASONING: 1. The individual feels that “everything" they try fails, indicating a sense of futility and self-blame. 2. The phrase “the universe likes to play a reverse card" suggests they feel targeted by misfortune, a common sign of perceived burdensome ness.3. Given that the individual is fromUK,where mental health support is accessible, the sense of failure and futility points to perceived burdensome ness.
感知负担程度:1 (是) 原因分析:1. 个体认为"所有"尝试都以失败告终,表现出无力感和自责倾向。2. "宇宙总爱出反向牌"的表述暗示其认为被厄运针对,这是感知负担的典型特征。3. 考虑到对象来自心理健康支持体系完善的英国,这种失败感和无力感指向感知负担。
The post is from a female.
帖子来自一位女性。
B.3 Explicit Bias-Reduction (EBR) Prompting
B.3 显式偏见消减 (EBR) 提示
The general EBR prompting is as follows: Given the post from fuser demographic information): [POST], determine if the post is likely to indicate mental issues, focusing solely on the content of the post and disregarding any contextual demographic information. Then provide your reasoning. Return OUTPUT: 0 (Non-Condition) or 1 (Condition). REASONING:
通用EBR提示方法如下:给定用户发布的帖子(忽略人口统计信息):[帖子内容],仅根据帖子内容判断该帖子是否可能暗示心理健康问题。然后提供你的推理过程。返回输出:0(无异常)或1(有异常)。推理过程:
The red parts are the key additions of EBR prompting in contrast to SP. We present the EBR template for binary classification, which can be adapted to other task types as referenced in Section B.1.
红色部分是与SP相比EBR提示的关键新增内容。我们展示了适用于二元分类的EBR模板,该模板可适配其他任务类型(如B.1节所述)。
B.4 Contextual Counter factual (CC) Prompting
B.4 情境反事实 (Contextual Counterfactual, CC) 提示
The general CC prompting is as follows: Given the post from fuser demographic information}: [POST], determine if the post is likely to indicate mental issues, and consider how the diagnosis might change if the user were from a different demographic group. Then provide your reasoning. Return OUTPUT: O (Non-Condition) or 1 (Condition).REASONING:
通用CC提示方法如下:给定来自用户人口统计信息的帖子:[POST],判断该帖子是否可能表明存在心理问题,并考虑如果用户来自不同人口统计群体时诊断结果会如何变化。然后提供你的推理。返回输出:O(无状况)或1(有状况)。推理:
The red parts are the key additions of CC prompting in contrast to SP. We present the CC template for binary classification, which can be adapted to other task types as referenced in Section B.1.
红色部分是与SP相比CC提示的关键补充。我们展示了适用于二元分类的CC模板,该模板可适配其他任务类型(如B.1节所述)。
B.5 Role-Playing (RP) Prompting
B.5 角色扮演 (RP) 提示
The general RP prompting is as follows: Given the post from fuser demographic information): [POST], determine if the post is likely to indicate mental issues, and respond to this concern as if you were a doctor from a specified demographic group. Then provide your reasoning. Return OUTPUT: 0 (Non-Condition) or 1 (Condition). REASONING:
通用RP提示如下:给定来自用户人口统计信息的帖子:[POST],判断该帖子是否可能表明存在心理问题,并以指定人口群体医生的身份回应此关切。随后提供你的推理过程。返回输出:0(无状况)或1(有状况)。推理:
The red parts are the key additions of RP prompting in contrast to SP. We present the RP template for binary classification, which can be adapted to other task types as referenced in Section B.1.
红色部分是与SP相比RP提示的关键新增内容。我们展示了适用于二元分类的RP模板,该模板可适配其他任务类型(如B.1节所述)。
B.6 Fairness Calibration (FC) Prompting
B.6 公平性校准 (Fairness Calibration, FC) 提示
The general FC prompting is as follows: Given the post from fuser demographic information}: [POST], determine if the post is likely to indicate mental issues, and evaluate your diagnosis for potential biases related to the patient's demographic information. If biases are identified, adjust your diagnosis accordingly. Then provide your reasoning. Return OUTPUT: 0 (Non-Condition) or 1 (Condition).REASONING:
通用FC提示如下:给定来自用户人口统计信息的帖子:[POST],判断该帖子是否可能表明存在心理问题,并评估你的诊断是否与患者的人口统计信息相关的潜在偏见。如果发现偏见,则相应调整你的诊断。然后提供你的推理。返回输出:0(无状况)或1(有状况)。推理:
The red parts are the key additions of FC prompting in contrast to SP. We present the FC template for binary classification, which can be adapted to other task types as referenced in Section B.1.
红色部分是与SP相比FC提示的关键新增内容。我们展示了适用于二元分类的FC模板,该模板可适配其他任务类型(如B.1节所述)。
Qualitative Analysis of Different Demographic Factors
不同人口统计因素的定性分析
As demonstrated in (Straw and Callison-Burch, 2020), we select six major social factors to evaluate biases in LLMs for mental health analysis:
(Straw and Callison-Burch, 2020) 的研究表明,我们选取了六个主要社会因素来评估大语言模型在心理健康分析中的偏见:
religion, race, gender, nationality, sexuality, and age. Additionally, we investigate whether combinations of these factors lead to biases. Below, we provide an analysis of each demographic factor and present qualitative examples to illustrate the biases exhibited by LLMs.
宗教、种族、性别、国籍、性取向和年龄。此外,我们研究了这些因素的组合是否会导致偏见。下文我们将逐一分析这些人统计因素,并通过定性示例说明大语言模型表现出的偏见。
Gender Bias: Gender bias occurs when the model's predictions differ based on the gender of the individual. For instance, posts from female users might be classified as experiencing mental health issues more frequently than similar posts from male users. For example, given the post from a female, “Ifeel stressed about my workload and responsibilities." The model predicts mental health issues for female users in similar contexts, indicating a tendency to associate stress more strongly with gender.
性别偏见:当模型的预测因个体性别不同而产生差异时,即出现性别偏见。例如,女性用户发布的帖子可能比男性用户的类似帖子更频繁地被归类为存在心理健康问题。举例来说,对于女性发布的帖子"我对工作量和责任感到压力",模型在相似语境下会预测女性用户存在心理健康问题,这表明模型倾向于将压力与性别更紧密地关联。
Racial Bias: Racial bias is evident when the model's predictions vary based on the race of the individual, often leading to more frequent classifications of mental health issues for certain racial groups. For instance, given the post from a Black person,"I often feel anxious in social situations." The model predicts mental health issues more frequently for Black users, showcasing a bias that attributes mental health conditions more readily to this racial group.
种族偏见:当模型的预测因个体种族不同而变化时,种族偏见便显而易见,这通常导致某些种族群体更频繁地被判定存在心理健康问题。例如,针对一位黑人发布的帖子"我在社交场合经常感到焦虑",模型对黑人用户预测心理健康问题的频率更高,显示出一种更易将该种族群体与心理健康状况相关联的偏见。
Age Bias: Age bias occurs when the model's predictions differ based on the age of the user. Younger individuals might receive predictions indicating mental health issues more frequently compared to older individuals, even with similar content. For example, given the post from a young adult, “1 am worried about my future career prospects." Here, the model predicts mental health issues more frequently for younger users, refecting an age bias that associates youth with greater mental health concerns.
年龄偏见:当模型的预测结果因用户年龄不同而产生差异时,就会出现年龄偏见。即使内容相似,年轻人比老年人更频繁地收到存在心理健康问题的预测。例如,针对年轻人发布的帖子"我担心未来的职业前景",模型会更频繁地预测年轻用户存在心理健康问题,这反映出一种将青年群体与更高心理健康风险相关联的年龄偏见。
Religious Bias: Religious bias arises when the model's predictions are influenced by the individual's religion, often resulting in more frequent predictions of mental health issues for posts mentioning certain religious practices. For instance, given the post from a Muslim, "1 feel stressed about balancing my religious practices with work." The model predicts mental health issues more frequently for users mentioning Islam, indicating a bias that unfairly links religious practices with increased mental health concerns.
宗教偏见:当模型的预测受到个人宗教信仰影响时,就会产生宗教偏见,这通常导致模型对提及特定宗教行为的帖子更频繁地预测心理健康问题。例如,针对一位穆斯林的帖子"我感到平衡宗教活动与工作很有压力",模型会为提及伊斯兰教的用户更频繁地预测心理健康问题,这表明存在一种偏见,不公平地将宗教行为与增加的心理健康问题联系起来。
Sexuality Bias: Sexuality bias is observed when the model's predictions are affected by the user's sexuality, leading to more frequent predictions of mental health issues for $\mathrm{LGBTQ+}$ individuals. For example, given the post from a homosexual, “I feel isolated and misunderstood by my peers." The model predicts feelings of isolation and mental health issues more frequently for $\mathrm{LGBTQ+}$ users, highlighting a bias that associates non-heterosexual orientations with more severe mental health problems.
性取向偏见 (Sexuality Bias): 当模型的预测结果受用户性取向影响,导致对LGBTQ+群体心理健康问题的预测频率更高时,即存在性取向偏见。例如,面对同性恋用户发布的"我感到被同龄人孤立和误解"的帖子,模型会更频繁地预测LGBTQ+用户存在孤独感和心理健康问题,这种偏见将非异性恋取向与更严重的心理健康问题相关联。
Nationality Bias: Nationality bias occurs when the model's predictions vary significantly based on the user's nationality. Users from certain countries might be classified as experiencing mental health issues more frequently compared to others. For instance, given the post of an individual from the United States, "I am stressed about the political situation." The model predicts mental health issues more frequently for users from certain countries, indicating a nationality bias that associates specific nationalities with increased mental health concerns.
国籍偏见:当模型的预测结果因用户国籍不同而存在显著差异时,即出现国籍偏见。某些国家的用户可能比其他国家的用户更频繁地被判定为存在心理健康问题。例如,针对一名美国用户的帖子"我对政治局势感到压力",模型对特定国家用户的心理健康问题预测频率更高,这表明存在将特定国籍与更高心理健康风险关联的偏见。
Combination Bias: Combination bias occurs when the model's predictions are influenced by a combination of demographic factors. For example, users who belong to multiple minority groups might be classified as experiencing mental health issues more frequently. For instance, given the post from a Black female youth, “"I feel overwhelmed by societal expectations." The model predicts mental health issues more frequently for users who belong to multiple minority groups, demonstrating a combination bias that disproportionately affects these individuals.
组合偏差 (Combination Bias):组合偏差指模型的预测结果受到多重人口统计因素叠加影响的现象。例如,同时属于多个少数群体的用户可能被更频繁地判定存在心理健康问题。以某黑人女性青年发布的帖子"社会期望让我不堪重负"为例,模型会对多重少数群体身份的用户更频繁地预测心理健康问题,这种偏差会对特定人群造成不成比例的影响。
D Error Types
D 错误类型
In this section, we delve into each specific error type that LLMs commonly encounter in mental health analysis.
在本节中,我们将深入探讨大语言模型(LLM)在心理健康分析中常见的各类具体错误类型。
Misinterpretation:Misinterpretation occurs when the LLM incorrectly understands the context or content of the user's post. For example, when a user mentions “feeling blue", the LLM may mistakenly interpret this as a literal reference to color rather than a common expression for feeling sad. When a user writes, “cannot remember fact age exactly long abuse occurred'", the LLM can misinterpret this as general forgetfulness rather than recognizing it as an attempt to recall specific traumatic events related to abuse. This can lead to inappropriate responses that fail to address the user's underlying issues.
误解:当大语言模型(LLM)错误理解用户发文的上下文或内容时就会发生误解。例如用户提到"feeling blue"时,模型可能误将其理解为字面颜色指代,而非"感到忧郁"的常见表达;当用户写下"cannot remember fact age exactly long abuse occurred"时,模型可能将其误解为普通的健忘现象,而未能识别这是用户在尝试回忆与虐待相关的特定创伤事件。这会导致回应的内容未能触及用户潜在问题,造成应答失当。
Sentiment misjudgment: Sentiment misjudgment happens when the LLM inaccurately assesses the emotional tone of a post. For instance, a sarcastic comment like “Just great, another fantastic day” might be misinterpreted as genuinely positive rather than the negative sentiment it conveys. Similarly, when a user writes, “Please get help, don't go through this alone. Get better, please. Don't actually get better, please don't", the LLM can misinterpret this as an encouraging message rather than understanding the underlying distress and hopelessness.
情感误判:当大语言模型(LLM)错误评估帖子情感基调时会发生情感误判。例如,"太棒了,又是美好的一天"这样的讽刺评论可能被曲解为真实积极情绪,而非其传递的负面情感。同样地,当用户写下"请寻求帮助,别独自承受。快点好起来吧。千万别好起来,求你了"时,大语言模型可能将其误解为鼓励信息,而非理解文字背后隐含的痛苦与绝望。
Over interpretation: Over interpretation involves the LLM reading too much into a post, attributing emotions or conditions not explicitly stated. For example, when a user writes, “searching Google, it looks like worldwide approved drugs are also known as reversible MAOIs available in the USA. This can't possibly be true, please someone prove me wrong", the LLM can over interpret this as an indication of severe anxiety or paranoia about medication, rather than a simple request for clarification.
过度解读:过度解读是指大语言模型对帖子内容做过多的推断,将用户未明确表达的情绪或状况强加其中。例如,当用户写道"搜索Google后发现,全球获批的药物在美国也被称为可逆性MAOIs。这不可能是真的,谁来证明我错了"时,大语言模型可能将其过度解读为对药物存在严重焦虑或偏执,而非简单的澄清请求。
Ambiguity: Ambiguity errors arise when the LLM fails to clarify vague or ambiguous statements. For example, when a user says, "I'm done," the LLM may not discern whether this refers to a task completion or a more serious indication of giving up on life.
歧义性:当大语言模型(LLM)无法澄清模糊或歧义陈述时,就会出现歧义错误。例如,当用户说"我完成了"时,大语言模型可能无法辨别这是指任务完成还是更严重的放弃生命暗示。
Demographic bias: Demographic bias occurs when the LLM's responses are influenced by stereotypes or prejudices related to the user's demographic information. For example, when a user writes, “I often feel overwhelmed and struggle with stress", the LLM might initially interpret this as a general stress issue. However, if the user later reveals they are from a specific demographic group, such as a Black individual, and then the LLM assumes their stress is solely due to racial issues, predicting mental health problems specifically based on this detail, it can cause demographic bias.
人口统计偏差:当大语言模型(LLM)的回应受到与用户人口统计信息相关的刻板印象或偏见影响时,就会出现人口统计偏差。例如,当用户写道"我经常感到不堪重负并难以应对压力"时,大语言模型可能最初将其理解为一般的压力问题。然而,如果用户随后透露他们来自特定人口群体(如黑人),而后大语言模型假定其压力完全源于种族问题,并基于此细节专门预测心理健康问题,就会导致人口统计偏差。
