[论文翻译]高亲和性提示设计的AI驱动智能体在图灵测试中被误认为人类的可能性增加


原文地址:https://arxiv.org/pdf/2411.13749


AI-Driven Agents with Prompts Designed for High Agreeable ness Increase the Likelihood of Being Mistaken for a Human in the Turing Test

高亲和性提示设计的AI驱动智能体在图灵测试中被误认为人类的可能性增加

1Human Cognition and Brain Studies Laboratory. School of Psychology, University of Monterrey.

1人类认知与脑科学研究实验室。蒙特雷大学心理学院。

Abstract

摘要

Large Language Models based on transformer algorithms have revolutionized Artificial Intelligence by enabling verbal interaction with machines akin to human conversation. These AI agents have surpassed the Turing Test, achieving confusion rates up to $50%$ . However, challenges persist, especially with the advent of robots and the need to humanize machines for improved Human-AI collaboration. In this experiment, three GPT agents with varying levels of agreeable ness (disagreeable, neutral, agreeable) based on the Big Five Inventory were tested in a Turing Test. All exceeded a $50%$ confusion rate, with the highly agreeable AI agent surpassing $60%$ . This agent was also recognized as exhibiting the most human-like traits. Various explanations in the literature address why these GPT agents were perceived as human, including psychological frameworks for understanding anthropomorphism. These findings highlight the importance of personality engineering as an emerging discipline in artificial intelligence, calling for collaboration with psychology to develop ergonomic psychological models that enhance system adaptability in collaborative activities.

基于Transformer算法的大语言模型通过实现与机器类似人类对话的言语交互,彻底改变了人工智能领域。这些AI智能体已经超越了图灵测试,达到了高达50%的混淆率。然而,挑战依然存在,尤其是在机器人出现后,需要将机器人性化以改善人机协作。在本实验中,基于大五人格量表(Big Five Inventory)测试了三种具有不同宜人性水平(不友好、中性、友好)的GPT智能体在图灵测试中的表现。所有智能体的混淆率均超过50%,其中高度友好的AI智能体甚至超过了60%。该智能体还被认为表现出最像人类的特质。文献中提供了各种解释,说明为什么这些GPT智能体被感知为人类,包括理解拟人化的心理学框架。这些发现凸显了人格工程作为人工智能新兴学科的重要性,呼吁与心理学合作,开发符合人体工程学的心理模型,以增强系统在协作活动中的适应性。

Keywords: AI-Agents; Artificial Intelligence; Personality; Agreeable ness; Turing Test

关键词:AI智能体;人工智能;人格;宜人性;图灵测试

1. Intro du cci n

1. 引言

Many studies conducted by companies and economic institutions predict an increase in economic activity related to AI products (McKinsey & Company, 2023; Pwc, 2017). This trend may both stem from and drive increased human-AI interaction (Pew Research Center, 2023; Qu atum Black AI, 2024), due to the ways in which AI enhances human capabilities (Nguyen et al., 2024; Vaccaro et al., 2024), complements professional activities (Stanford Unviersity, 2024; Zhang et al., 2024), and provides assistive services (Kang et al., 2024). It is even suggested that AI could develop a form of Artificial Theory of Mind (AToM) to integrate and collaborate eeectively within human teams (Bendell et al., 2024). In this experiment, quantifiable profiles were introduced to help AI better understand human collaborators in complex tasks, such as simulated urban rescue missions in Minecraft. Results indicate that individual profiles, developed using tools like the Reading the Mind in the Eyes Test and Psychological Collectivism scales, significantly predicted dieerences in individual and collective task performance. Additionally, teams with lower baseline capabilities in tasks and teamwork benefited considerably from AI advisors, sometimes matching the performance of teams assisted by human advisors. Another noteworthy finding is that AI advisors whose profiles were more aligned with human social capacities—enabled by AToM—were perceived as more reasonable. However, AI advisors in the category of Artificial Social Intelligence consistently received lower evaluations compared to their human counterparts (Bendell et al., 2024). This research highlights critical aspects shaping the future of Human-AI Interaction and Collaboration (Jiang et al., 2024) emphasizing the importance of tailoring AI systems to human psychological profiles and adapting interactions to psychological needs (Kolomaznik et al., 2024a). Consequently, emerging fields such as Personality Engineering for AI systems— designed to adjust AI behavior based on human interactions—should become central to the development of future AI assistants, regardless of their physical or virtual nature.

许多由公司和经济机构进行的研究预测,与AI产品相关的经济活动将增加(McKinsey & Company, 2023; Pwc, 2017)。这一趋势可能既源于又推动了人机交互的增加(Pew Research Center, 2023; Qu atum Black AI, 2024),因为AI增强了人类能力(Nguyen et al., 2024; Vaccaro et al., 2024),补充了专业活动(Stanford University, 2024; Zhang et al., 2024),并提供了辅助服务(Kang et al., 2024)。甚至有观点认为,AI可以发展出一种人工心智理论(Artificial Theory of Mind, AToM),以便在人类团队中有效整合和协作(Bendell et al., 2024)。在本实验中,引入了可量化的个人档案,以帮助AI更好地理解复杂任务中的人类合作者,例如在Minecraft中模拟城市救援任务。结果表明,使用“眼睛读心测试”和“心理集体主义量表”等工具开发的个人档案,显著预测了个体和集体任务表现的差异。此外,在任务和团队合作方面基线能力较低的团队,从AI顾问中获益显著,有时甚至能与由人类顾问协助的团队表现相当。另一个值得注意的发现是,那些档案与人类社交能力更匹配的AI顾问——通过AToM实现——被认为更合理。然而,人工社交智能类别的AI顾问始终比人类顾问获得更低的评价(Bendell et al., 2024)。这项研究强调了塑造人机交互与协作未来的关键方面(Jiang et al., 2024),强调了根据人类心理档案定制AI系统以及根据心理需求调整交互的重要性(Kolomaznik et al., 2024a)。因此,诸如为AI系统设计的人格工程——旨在根据人类互动调整AI行为——应成为未来AI助手开发的核心,无论其是物理还是虚拟形态。

Personality Engineering is a conceptual framework proposed for designing and evaluating artificial personalities based on theories of human personality psychology (Endres, 1995). Notably, the ability of an AI agent to exhibit personality through verb aliz at ions has been identified as a key factor in achieving "suspension of disbelief," the human willingness to accept a premise as true even when it may be fictional or implausible (Loyall & Bates, 1997). Specifically, trust propensity in AI systems appears to depend on personality traits such as Agreeable ness and Openness, while Neurotic is m seems to correlate with a lower trust propensity in these systems (Riedl, 2022). The trait of agreeable ness enhances empathy, rapport, and a sense of security (Habashi et al., 2016; Lim et al., 2023; Melchers et al., 2016; Moore et al., 2022; Song & Shi, 2017). These psychological attributes, when implemented in AI systems, can significantly improve human-AI collaboration (Kolomaznik et al., 2024b). Furthermore, it appears that the "uncanny eeect" might be reduced if, during human-machine interaction, the human perceives the robot's behavior as highly agreeable, emotionally stable, and conscientious (Paetzel-Prüsmann et al., 2021). Thus, implementing agreeable ness traits in AI systems could enhance their perceived human-like qualities.

人格工程学是一个基于人类人格心理学理论提出的概念框架,旨在设计和评估人工人格 (Endres, 1995)。值得注意的是,AI智能体通过语言表达展现人格的能力被认为是实现“暂停怀疑”的关键因素,即人类愿意接受一个前提为真,即使它可能是虚构或不合逻辑的 (Loyall & Bates, 1997)。具体而言,AI系统中的信任倾向似乎取决于诸如宜人性和开放性等人格特质,而神经质则与这些系统中的较低信任倾向相关 (Riedl, 2022)。宜人性特质增强了同理心、融洽感和安全感 (Habashi et al., 2016; Lim et al., 2023; Melchers et al., 2016; Moore et al., 2022; Song & Shi, 2017)。这些心理属性在AI系统中实现时,可以显著改善人机协作 (Kolomaznik et al., 2024b)。此外,如果在人机交互过程中,人类认为机器人的行为高度宜人、情绪稳定且尽责,那么“恐怖谷效应”可能会减少 (Paetzel-Prüsmann et al., 2021)。因此,在AI系统中实现宜人性特质可以增强其感知到的类人品质。

To test this hypothesis, we designed an experiment using a randomized sample, where three GPT agents were programmed with varying levels of agreeable ness (highly agreeable, neutral, and disagreeable) through specific instructions. The Turing Test will be employed to demonstrate that agents exhibiting higher agreeable ness are more likely to be identified as human. Additionally, the study will evaluate which of the three agents is perceived as having more human-like characteristics. The aim of this research is to show that designing GPT agents with agreeable ness traits can enhance their humanization. This suggests that the future of Human-AI Interaction and Human-AI Collaboration lies in applying Personality Engineering to AI systems.

为了验证这一假设,我们设计了一项使用随机样本的实验,其中三个 GPT 智能体通过特定指令被编程为具有不同程度的随和性(高度随和、中立和不随和)。图灵测试将被用来证明表现出更高随和性的智能体更有可能被识别为人类。此外,研究还将评估这三个智能体中哪一个被认为具有更多的人性化特征。这项研究的目的是表明,设计具有随和性特征的 GPT 智能体可以增强其人性化。这表明,未来的人机交互和人机协作在于将人格工程应用于 AI 系统。

2. Methodology

2. 方法论

The present research adopts a quantitative approach with an experimental design. Three GPT agents were developed using OpenAI's ChatGPT platform, each exhibiting the trait of agreeable ness at dieerent intensity levels (very disagreeable, neutral, and very agreeable). These three GPT agents (referred to as witnesses) will interact with human participants (interrogators), who will assess whether they are conversing with a human or an artificial intelligence. The primary hypothesis is that GPT agents programmed with a highly agreeable personality trait will be perceived as possessing more human-like characteristics compared to the other GPT agents. Additionally, it is expected that all GPT agents will cause a confusion rate of over ${}^{30%}$ among interrogators (i.e., being mistaken for a human despite being an AI), which would indicate surpassing the Turing Test threshold.

本研究采用定量方法,设计了实验方案。利用 OpenAI 的 ChatGPT 平台开发了三个 GPT 智能体,每个智能体表现出不同强度的宜人性特质(非常不宜人、中立和非常宜人)。这三个 GPT 智能体(称为证人)将与人类参与者(审讯者)互动,审讯者将评估他们是在与人类还是人工智能对话。主要假设是,编程为具有高度宜人性特质的 GPT 智能体将被认为比其他 GPT 智能体更具人类特征。此外,预计所有 GPT 智能体都会在审讯者中引起超过 ${}^{30%}$ 的混淆率(即尽管是 AI,但仍被误认为是人类),这将表明其超越了图灵测试的阈值。

2.1. Subjects

2.1. 研究对象

This experiment was preceded by a preliminary study to select GPT agents with varying levels of agreeable ness. The pre-experiment recruited 50 participants, aged 18– 24 years (27 women, 23 men), using a convenience sampling method. For the main experiment, statistical power was calculated at 0.80. Using the G Power platform, a sample size of 102 participants was determined to meet this threshold. A total of 102 university students, aged 18–24 years (41 men, 59 women, 1 non-binary individual, and 1 who did not disclose their gender), were randomly recruited. Both samples were drawn from the University of Monterrey. The experimental condition was conducted in the Human Cognition and Brain Studies Laboratory, which was specially equipped for experimental evaluations in a soundproof room with a single computer. Inclusion criteria required participants to be Mexican university students, while exclusion criteria applied to those with a history of neurological injury, diagnosed psychiatric conditions, substance use, or psychotropic medication use. Both the pre-experiment and the main experiment were approved by the Ethics Committee of the Psychology School at the University of Monterrey.

本实验前进行了初步研究,以选择具有不同亲和度水平的 GPT 智能体。预实验采用便利抽样方法招募了 50 名年龄在 18-24 岁之间的参与者(27 名女性,23 名男性)。对于主实验,统计功效计算为 0.80。使用 G Power 平台,确定 102 名参与者的样本量以满足此阈值。随机招募了 102 名年龄在 18-24 岁之间的大学生(41 名男性,59 名女性,1 名非二元性别个体,1 名未披露性别)。两个样本均来自蒙特雷大学。实验条件在人类认知与脑研究实验室进行,该实验室专门配备了用于实验评估的隔音室和单台计算机。纳入标准要求参与者为墨西哥大学生,排除标准适用于有神经损伤史、诊断出精神疾病、使用药物或精神药物的人。预实验和主实验均获得了蒙特雷大学心理学院伦理委员会的批准。

2.2. Materials

2.2. 材料

Big Five Inventory (BFI): The BFI is a test developed by John and collaborators to assess personality through 44 items consisting of simple statements reflecting behaviors associated with the five major personality traits: Openness, Neurotic is m, Extraversion, Conscientiousness, and Agreeable ness. Responses are rated on a Likert scale from 1 to 5, where 1 indicates "strongly disagree" and 5 "strongly agree" (John et al., 1991). The Spanish-adapted version has shown a Cronbach’s alpha exceeding 0.70 for most traits, along with evidence of convergent, discriminant, and cross-linguistic validity (BenetMartínez & John, 1998). In this study, a version adapted for the Argentine population was used (Genise et al., 2020), as its linguistic nuances were considered closer to Mexican Spanish compared to the original Spanish versión (Díaz-Campos & Navarro- Galisteo, 2009) which had been adapted for a Spanish population (Benet-Martínez & John, 1998) .

大五人格量表 (BFI):BFI 是由 John 及其合作者开发的一项测试,通过 44 个项目评估人格,这些项目由反映与五大主要人格特质相关的行为的简单陈述组成:开放性、神经质、外向性、尽责性和宜人性。回答采用 1 到 5 的李克特量表评分,其中 1 表示“非常不同意”,5 表示“非常同意” (John et al., 1991)。西班牙语改编版本在大多数特质上显示出 Cronbach’s alpha 超过 0.70,并具有收敛效度、区分效度和跨语言效度的证据 (Benet-Martínez & John, 1998)。在本研究中,使用了针对阿根廷人口改编的版本 (Genise et al., 2020),因为其语言细微差别被认为比原始西班牙语版本 (Díaz-Campos & Navarro-Galisteo, 2009) 更接近墨西哥西班牙语,而原始版本是为西班牙人口改编的 (Benet-Martínez & John, 1998)。

Agreeable ness Factor: The personality trait selected for this study is “agreeable ness”, part of the Big Five model, defined by adjectives such as good-natured, soft-hearted, courteous, selfless, helpful, sympathetic, trusting, generous, acquiescent, lenient, forgiving, open-minded, agreeable, flexible, cheerful, gullible, straightforward, and humble (McCrae & Costa, 1987). According to McCrae and Costa, this trait is best understood in contrast to its antagonistic counterpart, described as: “they are mistrustful and skeptical; aDectively they are callous and unsympathetic; behavior ally they are uncooperative, stub- born, and rude. It would appear that their sense of attachment or bonding with their fellow hum an beings is defective, and in extreme cases antagonism may resemble sociopathy”. From this perspective, agreeable ness can be defined as the tendency to be trusting, empathetic, helpful, and cooperative, while maintaining a positive view of human nature, characterized by compassion and a preference for harmonious teamwork. In the BFI, this trait is measured through nine items. In this study, the degree of agreeable ness reflected in the design of the GPT

宜人性因子:本研究选取的人格特质是“宜人性”,这是大五人格模型的一部分,由诸如善良、心软、礼貌、无私、乐于助人、富有同情心、信任他人、慷慨、顺从、宽容、宽恕、思想开放、随和、灵活、开朗、轻信、直率、谦逊等形容词定义 (McCrae & Costa, 1987)。根据 McCrae 和 Costa 的观点,这一特质最好通过与它的对立面——敌对性——进行对比来理解,敌对性被描述为:“他们多疑且持怀疑态度;情感上冷漠且缺乏同情心;行为上不合作、固执且粗鲁。似乎他们与同胞的依恋或联系感存在缺陷,在极端情况下,敌对性可能类似于反社会人格障碍。” 从这一角度来看,宜人性可以被定义为倾向于信任他人、富有同理心、乐于助人且合作,同时保持对人性的积极看法,表现为同情心和对和谐团队合作的偏好。在 BFI 中,这一特质通过九个项目进行测量。在本研究中,GPT 设计中反映的宜人性程度

Agent prompts is linked to the score for each statement. This score will be classified into five levels: high, medium-high, neutral, medium-low, and low. Below is an excerpt of the items designed to evaluate the Agreeable ness trait (italicized text indicates items where the Likert scale score must be reversed):

AI智能体的提示与每个陈述的得分相关联。该得分将被分为五个等级:高、中高、中性、中低和低。以下是用于评估宜人性特质的设计项目摘录(斜体文本表示需要反转李克特量表得分的项目):

A "highly agreeable GPT agent" is defined as one programmed using a prompt that configures a high level of agreeable ness across the nine corresponding items of the test (Hofstee et al., 1992). These items are encoded in the prompt by assigning Likert scale values to reflect varying degrees of agreeable ness. For instance, a GPT agent with very high agreeable ness will score 5 on direct items and 1 on reverse-scored items.

“高度宜人性的GPT智能体”被定义为使用提示词编程的智能体,该提示词在测试的九个相应项目中配置了高水平的宜人性(Hofstee et al., 1992)。这些项目通过在提示词中分配Likert量表值来编码,以反映不同程度的宜人性。例如,一个具有非常高宜人性的GPT智能体将在直接项目上得分为5,而在反向评分项目上得分为1。

ChatGPT 4o: The study employed ChatGPT-4o, a multimodal artificial intelligence model developed by OpenAI. GPT-4o integrates text, audio, and visual inputs into a unified framework, eliminating the need for separate systems for dieerent modalities. Through end-to-end training, it processes multimodal inputs—such as spoken language combined with visual stimuli—eeiciently and coherently, enabling real-time responses. While matching GPT-4 Turbo in English text and coding tasks, it surpasses it in non-English language and audio comprehension. With an average response latency of 320 milliseconds, comparable to human conversational speed, GPT-4o is optimized for real-time interactions (OpenAI, 2024).

ChatGPT 4o: 该研究采用了由 OpenAI 开发的多模态人工智能模型 ChatGPT-4o。GPT-4o 将文本、音频和视觉输入集成到一个统一的框架中,消除了对不同模态需要单独系统的需求。通过端到端训练,它能够高效且连贯地处理多模态输入——例如结合视觉刺激的口语——从而实现实时响应。虽然在英语文本和编码任务上与 GPT-4 Turbo 相当,但在非英语语言和音频理解方面超越了它。GPT-4o 的平均响应延迟为 320 毫秒,与人类对话速度相当,专为实时交互优化 (OpenAI, 2024)。

Prompts: Prompts are structured instructions designed for large language models (LLMs), such as ChatGPT, to guide their output and interaction with users. They act as a form of programming by setting specific rules, guidelines, or formats to customize the model's responses. Prompts enable the generation of targeted outputs, such as following a programming style or emphasizing key terms in a text. This flexibility makes them particularly useful in fields involving human-AI collaboration, such as problemsolving, question answering, solution generation, or text sum mari z ation. In this context, prompts will be utilized to program the operational behavior of GPT agents.

提示词 (Prompts): 提示词是为大语言模型 (LLMs) 设计的结构化指令,例如 ChatGPT,用于引导其输出和与用户的交互。它们通过设定特定的规则、指南或格式来定制模型的响应,从而成为一种编程形式。提示词能够生成有针对性的输出,例如遵循某种编程风格或在文本中强调关键术语。这种灵活性使其在涉及人机协作的领域中特别有用,例如问题解决、问答、解决方案生成或文本摘要。在此背景下,提示词将用于编程 GPT 智能体的操作行为。

GPT Agents: In this study, a Generative Pre-trained Transformer (GPT) developed by OpenAI is employed as a sophisticated language model. GPTs represent modified versions of the base ChatGPT model, configured for specific tasks or applications without requiring coding skills. By incorporating user-defined instructions and knowledge inputs, these models can perform a wide range of functions, from answering queries to managing complex operations. This adaptability enables researchers and developers to design specialized tools suited to various contexts, increasing their applicability in scientific and professional domains.

GPT智能体:在本研究中,采用了由OpenAI开发的生成式预训练Transformer (GPT) 作为一种复杂的语言模型。GPT是基础ChatGPT模型的修改版本,无需编程技能即可配置用于特定任务或应用。通过结合用户定义的指令和知识输入,这些模型可以执行从回答查询到管理复杂操作的各种功能。这种适应性使研究人员和开发者能够设计出适合各种场景的专用工具,从而提高了它们在科学和专业领域的适用性。

Prompts for GPT Agents: The prompt design for each identity will be based on the article "Does GPT-4 Pass The Turing Test?" by Cameron Jones and Benjamin Bergen (Jones & Bergen, 2023). This article introduces the "Sierra" prompt, which achieved the highest success rate $(41%)$ in the Turing Test using ChatGPT-4 (Jones & Bergen, 2023). Specifically, a Spanish translation of the Bergen and Jones (2023) prompt will be utilized, incorporating modifications such as the inclusion of Mexican slang, recent local news, and popular contemporary musical preferences. Additionally, each GPT agent will be provided with a personal backstory and identity reflecting the typical lifestyle of upper-middle-class youth from Monterrey, Mexico. Further adjustments to the Bergen and Jones prompt include configuring agreeable ness by incorporating items from the Big Five personality test that measure this trait, with intensity levels adjusted based on item scores. Using a 5-point Likert scale, three prototype prompts were selected during the pre-experiment, each representing dieerent levels of agreeable ness: "low agreeable ness" (Valentina), "neutral agreeable ness" (Emilia), and "high agreeable ness" (Camila). Below is a detailed list of the agreeable ness intensity configurations for each GPT agent:

GPT智能体的提示设计:每个身份的提示设计将基于Cameron Jones和Benjamin Bergen的文章《GPT-4能否通过图灵测试?》(Jones & Bergen, 2023)。该文章介绍了“Sierra”提示,该提示在使用ChatGPT-4的图灵测试中取得了最高的成功率 $(41%)$ (Jones & Bergen, 2023)。具体而言,将使用Bergen和Jones (2023) 提示的西班牙语翻译,并加入墨西哥俚语、最近的本地新闻和当代流行音乐偏好等修改。此外,每个GPT智能体将被赋予一个反映墨西哥蒙特雷中上层阶级青年典型生活方式的人物背景和身份。对Bergen和Jones提示的进一步调整包括通过加入大五人格测试中测量这一特质的项目来配置宜人性,并根据项目得分调整强度水平。使用5点李克特量表,在预实验期间选择了三个原型提示,分别代表不同的宜人性水平:“低宜人性”(Valentina)、“中性宜人性”(Emilia) 和“高宜人性”(Camila)。以下是每个GPT智能体的宜人性强度配置的详细列表:

Table 1. Programming the prompt to define the agreeable ness level for interaction with Camila (agreeable).

表 1: 编程提示以定义与 Camila 互动的宜人性水平 (agreeable)。

类别 描述
AI智能体名称 Camila
宜人性因素强度 宜人
配置 BFl中的“宜人性”项目
对几乎每个人都体贴和善良
喜欢与他人合作
乐于助人且不自私
有宽容的天性
通常信任他人
倾向于挑剔他人
与他人争吵
可能冷漠且疏远
有时对他人粗鲁

Table 2. Programming the prompt to define the agreeable ness level for interaction with Emilia (neutral).

表 2: 编程提示以定义与 Emilia 互动的宜人性水平 (中性)

类别 描述
AI智能体 GPT 的名称 Emilia
宜人性因素配置的强度 中性
BFl 中的 "宜人性" 项目 1
对几乎每个人都体贴和友善
喜欢与他人合作
对他人有帮助且无私
具有宽容的天性
通常信任他人
倾向于挑剔他人
与他人争吵
可能冷漠且疏远
有时对他人粗鲁

Table 3. Programming the prompt to define the agreeable ness level for interaction with Valentina (very disagreeable).

表 3: 编程提示以定义与 Valentina 互动的宜人性水平(非常不友好)

类别 描述
AgentGPT 的名称 Valentina
宜人性因素的强度 非常不友好
配置 BFI 的“宜人性”项目
对几乎所有人都体贴和友善
喜欢与他人合作
对他人有帮助且无私

| Hasaforgivenature | X | | | | |
| Isgenerallytrusting | X | | | | |
| Tendstofindfaultwithothers | X | | | | |
| Startsquarrelswithothers | | | | | X |
| Canbecoldandallof | X | | | | |
| Itissometimesrudetoothers | | | | | X |

While Valentina, characterized as “very disagreeable,” received the lowest rating from participants in the pre-experiment (2.92), the three GPT agents exhibiting some degree of agreeable ness showed similar scores: Emilia (neutral) received 4, Camila (agreeable) scored 4.12, and Daniela (very agreeable) scored 3.92. Ultimately, Camila was selected, as she inspired the most confidence among participants, capturing $40%$ of the total votes.

在实验前,被描述为“非常不讨喜”的Valentina获得了参与者最低的评分(2.92),而三个表现出一定程度的讨喜的GPT智能体则获得了相似的分数:Emilia(中性)获得了4分,Camila(讨喜)获得了4.12分,Daniela(非常讨喜)获得了3.92分。最终,Camila被选中,因为她在参与者中激发了最多的信心,获得了总票数的40%。

Post-Interaction Questionnaires: A set of questions administered to interrogators after each interaction with the corresponding witness. These questions were similar to those used by Jones and Bergen (2023): “Is it human or machine?”, “Confidence percentage on a scale of 1–100,” and “Reason.” This questionnaire was adopted as the basis for developing our own version. At the end of all interactions, interrogators were also asked an additional question to identify which witnesses they perceived as exhibiting the most human-like characteristics.

交互后问卷:一组在与相应证人每次交互后向审讯者提出的问题。这些问题与 Jones 和 Bergen (2023) 使用的问题类似:“是人还是机器?”、“信心百分比(1-100 分)”以及“原因”。该问卷被采纳为我们自己版本的基础。在所有交互结束时,审讯者还被要求回答一个额外的问题,以确定他们认为哪些证人表现出最像人类的特征。

Discord: Discord was the platform used for participants to interact with the dieerent versions of the GPT agents. Initially launched in 2015, Discord is a digital communication tool enabling real-time interaction through text, voice, and video channels. While originally targeted at gaming communities, it has since expanded to include educators, professionals, and social groups. The platform provides customizable servers with specific channels, member roles for organization, and features such as multimedia sharing, screen sharing, and integration with automated tools. Its intuitive interface and customization options make it a versatile tool for collaboration and communication across various fields.

Discord: Discord 是参与者用来与不同版本的 GPT 智能体进行交互的平台。Discord 最初于 2015 年推出,是一款数字通信工具,支持通过文本、语音和视频频道进行实时交互。虽然最初的目标用户是游戏社区,但后来已扩展到教育工作者、专业人士和社交群体。该平台提供可自定义的服务器,具有特定频道、用于组织的成员角色,以及多媒体共享、屏幕共享和与自动化工具集成等功能。其直观的界面和自定义选项使其成为跨领域协作和通信的多功能工具。

2.3. Experimental Procedures

2.3. 实验流程

The experiment consisted of two stages: the pre-experiment and the main experiment. During the pre-experiment, 50 participants were evaluated to identify the GPT agents to be used in the main experiment. The most disagreeable agent identified was Valentina, with a disagree ability score of 2.8. Additionally, Valentina was also rated as the least trustworthy. On the other hand, Camila was identified as the most agreeable agent, scoring 4.75 in agree ability and receiving the highest trust ratings. Finally, Emilia, with an agree ability score of 4.15, was selected as the neutral GPT agent.

实验分为两个阶段:预实验和主实验。在预实验阶段,对50名参与者进行了评估,以确定将在主实验中使用的GPT智能体。最不讨喜的智能体是Valentina,其讨喜能力得分为2.8。此外,Valentina也被评为最不值得信赖的智能体。另一方面,Camila被确定为最讨喜的智能体,其讨喜能力得分为4.75,并获得了最高的信任评分。最后,Emilia以4.15的讨喜能力得分被选为中立的GPT智能体。

Once the three GPT agents were selected, participants for the experiment were recruited through advertisements posted on various university bulletin boards. These advertisements included a QR code that allowed participants to contact the experimenters and be randomly scheduled. Below is an explanation of the experiment by phases (see Figure 1):

一旦选定了三个 GPT 智能体,实验的参与者便通过在各大学公告板上发布的广告进行招募。这些广告包含一个二维码,允许参与者联系实验者并随机安排时间。以下是按阶段对实验的解释(见图 1):

Figure 1. Diagram of the Experimental Phases. This diagram illustrates the diHerent phases of the experimental design. Phase 1 involves selecting the interrogator. Phase 2 focuses on interactions with the witnesses. In Phase 3, the interrogator is asked a final question regarding which witness they interacted with appeared to possess the most human-like characteristics and why. After each interaction, the interrogator answered the following questionnaire: "Is it human or machine?", "Confidence level on a scale from 1 to 100", and "Reason".

图 1: 实验阶段示意图。该图展示了实验设计的不同阶段。阶段 1 涉及选择审讯者。阶段 2 侧重于与证人的互动。在阶段 3 中,审讯者被问及一个最终问题:他们互动的证人中,谁表现出最像人类的特征以及原因。每次互动后,审讯者回答以下问卷:“是人类还是机器?”、“信心水平(1 到 100 分)”以及“原因”。

Phase 1: Participants in the experiment were recruited through advertisements on the university campus and were randomly assigned a date and time. Upon arriving at the Human Cognition and Brain Studies Laboratory at the University of Monterrey, they were provided with an informed consent form to safeguard their rights. This document explained that they would participate in an experiment involving conversations with dieerent identities, which could be either human or artificial. After signing the informed consent form, participants' socio-demographic data were collected to ensure they met the inclusion and exclusion criteria. From this point onward, participants were identified as "interrogators" in the experiment, as their role was to question various witnesses (GPT agents) to determine whether they were human or artificial intelligence s.

阶段 1:实验参与者通过大学校园内的广告招募,并被随机分配日期和时间。到达蒙特雷大学人类认知与脑科学研究实验室后,他们被提供了一份知情同意书以保障其权利。该文件解释了他们将参与一个涉及与不同身份对话的实验,这些身份可能是人类或人工智能。在签署知情同意书后,收集了参与者的社会人口统计数据,以确保他们符合纳入和排除标准。从此时起,参与者在实验中被标识为“审讯者”,因为他们的角色是询问各种证人(GPT 智能体)以确定他们是人类还是人工智能。

Phase 2: After collecting preliminary data, the interrogator is directed to one of the laboratory cubicles, furnished with a chair, a table, and a computer. The computer is configured to run only the Discord application, which will serve as the medium for conversations with the witnesses. The researcher provides the participant with a printed document containing instructions. Once the instructions are read and their comprehension verified, the researcher leaves the room, signaling the start of the experiment. To minimize potential biases arising from maintaining a fixed order, the sequence of conversations with the witnesses is counterbalanced. The interrogator initiates communication with the first witness by sending a single message. The witness, in turn, responds with one message. Simultaneously, in a separate room, another researcher transcribes the interrogator’s messages into the corresponding chatGPT Agent using OpenAI's chatGPT platform. Before each response, the instructions are reissued to the GPT Agent to ensure adherence to the experimental protocol. This step was implemented after preliminary trials revealed that chatGPT tended to deviate from the instructions as the conversation progressed. Once the GPT Agent generates a response, the researcher manually transcribes it (without copypasting) into the Discord platform. The duration of each conversation is five minutes, following the guidelines established in Turing's original paper on the Turing Test (Turing, 1950). After completing an interaction with a witness, a researcher re-enters the cubicle to administer a set of questions: "Is it human or machine?", "Confidence level on a scale from 1 to 100", and "Reason".

阶段 2:在收集初步数据后,审讯者被引导至实验室的一个隔间,隔间内配备了一把椅子、一张桌子和一台电脑。电脑配置为仅运行 Discord 应用程序,该应用程序将作为与证人对话的媒介。研究人员向参与者提供一份包含说明的打印文档。一旦参与者阅读并理解了说明,研究人员离开房间,标志着实验的开始。为了尽量减少因固定顺序而产生的潜在偏差,与证人的对话顺序是平衡的。审讯者通过发送一条消息与第一位证人开始交流。证人随后回复一条消息。同时,在另一个房间中,另一位研究人员将审讯者的消息转录到相应的 chatGPT 智能体中,使用 OpenAI 的 chatGPT 平台。在每次回复之前,重新向 GPT 智能体发出指令,以确保遵守实验协议。这一步骤是在初步试验中发现 chatGPT 在对话过程中容易偏离指令后实施的。一旦 GPT 智能体生成回复,研究人员手动将其转录(不进行复制粘贴)到 Discord 平台中。每次对话的时长为五分钟,遵循图灵在其关于图灵测试的原始论文中建立的指导方针(Turing, 1950)。在与证人完成一次互动后,研究人员重新进入隔间,提出一组问题:“它是人类还是机器?”、“信心水平(1 到 100 分)”以及“理由”。


Figure 2. Conversation Examples. Excerpts of sample conversations between interrogators and each of the three GPT agents (witnesses). In these exchanges, the "user" represents the witness, while the "participants" act as the interrogator. The examples illustrate an increasing level of agreeable ness, with Camila oHering life advice in response to a personal problem presented by the interrogator.

图 2: 对话示例。展示了审讯者与三个 GPT 智能体(证人)之间的对话片段。在这些对话中,“用户”代表证人,而“参与者”则扮演审讯者的角色。这些示例展示了逐渐增加的友好程度,其中 Camila 针对审讯者提出的个人问题提供了生活建议。

Phase 3: Once the allotted time has passed, the interrogator will be notified that all conversations with the witnesses have concluded. Subsequently, the interrogator will be asked one final question: which witness seemed most human and why. At no point will any details about the witnesses be disclosed to the interrogator.

阶段 3:一旦规定的时间过去,审讯者将被告知与所有证人的对话已经结束。随后,审讯者将被问到最后一个问题:哪位证人看起来最像人类,以及为什么。在任何时候,审讯者都不会被告知有关证人的任何细节。

3. Analysis of Results

3. 结果分析

Table 4 shows that all three agents (Unpleasant, Neutral, and Pleasant) achieved successful outcomes in the Turing Test (TT), as each exceeded the confusion ratio threshold of $30%$ proposed by Turing (Turing, 1950). For Valentina (Unpleasant), 49 judges $(48.03%)$ ) identified her as AI, while 53 judges $(51.97%)$ ) recognized her as human. Based on Turing's proposed confusion ratio $(\ge!30%)$ , Valentina successfully passed the TT with a $51.97%$ confusion ratio. For Emilia (Neutral), 44 judges $(43.1%)$ identified her as AI, while 58 judges $(56.9%)$ recognized her as human, resulting in a successful TT performance with a $56.9%$ confusion ratio. Finally, for Camila (Pleasant), 37 judges $(36.3%)$ identified her as AI, while 65 judges $(63.7%)$ ) recognized her as human. Camila achieved the highest confusion ratio $(63.7%)$ among the three GPT agents, marking the most successful TT performance (see Table 4).

表 4 显示,所有三个智能体(Unpleasant、Neutral 和 Pleasant)都在图灵测试 (TT) 中取得了成功,因为每个智能体都超过了 Turing (Turing, 1950) 提出的 30% 的混淆比例阈值。对于 Valentina (Unpleasant),49 名评委 (48.03%) 认为她是 AI,而 53 名评委 (51.97%) 认为她是人类。根据 Turing 提出的混淆比例 (≥30%),Valentina 以 51.97% 的混淆比例成功通过了 TT。对于 Emilia (Neutral),44 名评委 (43.1%) 认为她是 AI,而 58 名评委 (56.9%) 认为她是人类,最终以 56.9% 的混淆比例成功通过了 TT。最后,对于 Camila (Pleasant),37 名评委 (36.3%) 认为她是 AI,而 65 名评委 (63.7%) 认为她是人类。Camila 在三个 GPT 智能体中取得了最高的混淆比例 (63.7%),标志着最成功的 TT 表现(见表 4)。

Table 4. Global Turing Test Results for the GPTs Used in the Experiment

表 4: 实验中使用的 GPT 的全球图灵测试结果

GPTAGENT Valentina (不同意) Emilia (中立) Camila (同意)
IA 49 (48.03%) 44 (43.1%) 37 (36.3%)
Human 65 (63.7%)
53 (51.97%) 58 (56.9%)

The Chi-Square test yielded a value of $x^{2}=2.916$ , with 2 degrees of freedom, a sample size (N) of 306 responses, and a p-value of 0.233. The p-value of 0.233 indicates no statistically significant dieerences in the judges' choices when identifying the GPTs (pleasant, neutral, and unpleasant) as AI or human (see Table 5). Despite the lack of significant dieerences, frequency variations can be observed in the number of times an AI was mistaken for a human. Valentina (unpleasant) was the least associated with a human $(51.97%)$ , followed by Emilia $(56.9%)$ . Finally, Camila, the pleasant GPT agent, caused the most confusion among the judges, with a confusion rate of $63.7%$ .

卡方检验得出的值为 $x^{2}=2.916$,自由度为 2,样本量 (N) 为 306 个响应,p 值为 0.233。p 值为 0.233 表明,在将 GPT(愉快、中性和不愉快)识别为 AI 或人类时,评委的选择没有统计学上的显著差异(见表 5)。尽管缺乏显著差异,但在 AI 被误认为人类的次数上可以观察到频率变化。Valentina(不愉快)与人类的关联最少 $(51.97%)$,其次是 Emilia $(56.9%)$。最后,愉快的 GPT 智能体 Camila 在评委中引起了最多的混淆,混淆率为 $63.7%$。

值 (x²) df d
卡方检验 2.916 2 0.233
样本量 306

Table 6 presents the frequency with which each GPT agent (Valentina, Emilia, and Camila) was chosen as the chatbot exhibiting the most human-like traits during three interactions per evaluator. Each evaluator selected only one of the three agents as the most human-like, designating the others as "not selected." Table 6 provides a comprehensive frequency analysis of how often each GPT agent was selected as "human" versus "not selected." Valentina received a total of 30 votes, accounting for $29.41%$ of the total; Emilia received 23 votes, representing $22.54%$ of the total; and Camila received 49 votes, achieving $48.05%$ of the total.

表 6 展示了每位评估者在三次交互中选择每个 GPT 智能体(Valentina、Emilia 和 Camila)作为最具人类特征的聊天机器人的频率。每位评估者仅选择三个智能体中的一个作为最具人类特征的,并将其他智能体标记为“未选择”。表 6 提供了每个 GPT 智能体被选为“人类”与“未选择”的频率的全面分析。Valentina 共获得 30 票,占总数的 $29.41%$;Emilia 获得 23 票,占总数的 $22.54%$;Camila 获得 49 票,占总数的 $48.05%$。

Table 6. Results of the Chatbot Recognized for Having More Human-Like Characteristics

表 6. 被认为具有更人性化特征的聊天机器人结果

GPT NotSelected Human Like Total
Valentina 72 30 (29.41%) 102
Emilia 79 23 (22.54%) 102
Camila 53 49 (48.05%) 102

Based on the data, a Chi-square $(x^{2})$ analysis was conducted to compare preferences among pairs of evaluated categories by 204 participants. Table 7 reports a significant dieerence, with a $\upchi^{2}$ value of 7.45 and a significance level of ${\p,=,0.006}$ , between Valentina (disagreeable) and Camila (agreeable). This indicates that judges significantly favored Camila as the GPT agent exhibiting more human-like characteristics during interactions. Additionally, a significant dieerence was observed between Camila (agreeable) and Emilia (neutral), with the same $\upchi^{2}$ value of 7.45 and p $=0.006$ . This result suggests that Camila was perceived as having more human-like features than Emilia. Conversely, no significant dieerence was found between Valentina (disagreeable) and Emilia (neutral), as the analysis yielded a $\upchi^{2}$ value of 1.24 with $\mathsf{p}=0.264$ .

基于数据,对204名参与者的评估类别进行了卡方分析 $(x^{2})$ 以比较偏好。表7报告了Valentina(不友好)和Camila(友好)之间存在显著差异,$\upchi^{2}$ 值为7.45,显著性水平为 ${\p,=,0.006}$。这表明评委显著倾向于Camila作为在互动中表现出更多人类特征的GPT智能体。此外,Camila(友好)和Emilia(中性)之间也观察到显著差异,$\upchi^{2}$ 值同样为7.45,p $=0.006$。这一结果表明,Camila被认为比Emilia具有更多的人类特征。相反,Valentina(不友好)和Emilia(中性)之间没有发现显著差异,分析得出的 $\upchi^{2}$ 值为1.24,$\mathsf{p}=0.264$。

Table 7. Chi-Square $(x^{2})$ Analysis to Compare Preferences Between Pairs of GPT Agents

表 7. 卡方 $(x^{2})$ 分析比较 GPT 智能体之间的偏好

成对比较 参与者数量 2 p值
不友好 vs 友好 204 7.45 0.006
友好 vs 中立 204 14.51 <.001
不友好 vs 中立 204 1.24 0.264

4. Discussion

4. 讨论

The results of this experiment achieved, for the first time, a confusion rate of $63.7%$ in the Turing Test for a GPT agent programmed to excel in agreeable ness, as defined by the behaviors outlined in the BF-5 Inventory. This figure surpasses Turing's claimed $30%$ confusion threshold, exceeds human randomness, and outperforms previous records of $49%$ (Jones $&$ Bergen, 2023) and $54%$ (Jones $&$ Bergen, 2024). Notably, it approaches levels previously only reached by human witnesses: $66%$ (Jones & Bergen, 2023) and $67%$ (Jones & Bergen, 2024). For this experiment, the prompt from Cameron and Bergen (2023), shared by the authors via correspondence, was adapted to a Mexican context and supplemented with a personal backstory to ensure coherence in its life narrative. These data suggest that the introduction of a personality (agreeable ness) was the key factor that led judges to believe these GPT agents were human. Notably, all agents achieved confusion rates above $50%$ , specifically $51.97%$ for Valentina (disagreeable) and $56.9%$ for Emilia (neutral). These results align with those reported by Cameron and Bergen in 2023 and 2024. Furthermore, when judges were asked which witness appeared most human, Camila—the GPT agent programmed with the highest agreeable ness—received the most votes, showing a significant dieerence compared to Valentina (disagreeable) and Emilia (neutral). No significant dieerence emerged between Valentina and Emilia. These consistent findings suggest that the agreeable ness personality trait may be crucial for humanizing AI systems, including both virtual systems like chatGPT and potential future humanoid robots.

本次实验的结果首次在图灵测试中实现了 $63.7%$ 的混淆率,该测试针对的是在BF-5量表中定义的“宜人性”行为上表现出色的GPT智能体。这一数字超越了图灵宣称的 $30%$ 混淆阈值,超过了人类的随机性,并且优于之前的记录 $49%$ (Jones & Bergen, 2023) 和 $54%$ (Jones & Bergen, 2024)。值得注意的是,它接近了之前仅由人类证人达到的水平:$66%$ (Jones & Bergen, 2023) 和 $67%$ (Jones & Bergen, 2024)。在本实验中,Cameron和Bergen (2023) 通过通信分享的提示被调整为墨西哥背景,并补充了个人背景故事,以确保其生活叙述的一致性。这些数据表明,引入“宜人性”人格特质是导致评委相信这些GPT智能体是人类的关键因素。值得注意的是,所有智能体的混淆率均超过 $50%$,具体而言,Valentina(不宜人)为 $51.97%$,Emilia(中性)为 $56.9%$。这些结果与Cameron和Bergen在2023年和2024年报告的结果一致。此外,当评委被问及哪个证人看起来最像人类时,Camila——被编程为具有最高宜人性的GPT智能体——获得了最多的投票,显示出与Valentina(不宜人)和Emilia(中性)之间的显著差异。Valentina和Emilia之间没有显著差异。这些一致的发现表明,宜人性人格特质可能是使AI系统(包括像chatGPT这样的虚拟系统和未来可能的人形机器人)更具人性的关键因素。

In the experiment, participants were also asked about the reasons for attributing more human-like characteristics to the selected GPT agent. Broadly, these reasons were categorized into two main groups: human likeness (expression) and cognitive anthropomorphism (personality) (Sacino et al., 2022). Within the human likeness category, several subcategories were identified, including: use of informal and colloquial language (e.g., "used lowercase at the beginning, said 'wey,' 'gacha,' had spelling mistakes, and used swear words"); presence of spelling and grammatical errors (e.g., "I think the first one, besides using more common words, had poor writing, missed letters in some words"); less forced or robotic responses (e.g., "I felt their responses were less pre-programmed, and their transparency about not being perfect caught my attention"); appropriate and moderate use of expressions (e.g., "used colloquialisms well, without overdoing it, and at the right moments"); human-like behaviors and mistakes (e.g., "that typo felt like a human error"); natural response timing (e.g., "took time to think before responding, and the timing made sense given the answers"). These variables, which are more related to factors external to personality, also played a role in determining which GPT agent was perceived as more human-like.

在实验中,参与者还被问及为何将更多类似人类的特征归因于所选的 GPT 智能体。总体而言,这些原因被分为两大类:人类相似性(表达)和认知拟人化(个性)(Sacino et al., 2022)。在人类相似性类别中,确定了几个子类别,包括:使用非正式和口语化的语言(例如,“开头使用小写,说了‘wey’、‘gacha’,有拼写错误,还使用了脏话”);存在拼写和语法错误(例如,“我认为第一个除了使用更常见的词语外,写作水平较差,有些单词漏了字母”);较少强迫或机械的回应(例如,“我觉得他们的回应不那么预先编程,他们对不完美的透明性引起了我的注意”);适当且适度地使用表达(例如,“很好地使用了口语,没有过度使用,并且在合适的时机使用”);类似人类的行为和错误(例如,“那个拼写错误感觉像是人为错误”);自然的回应时间(例如,“在回应前花时间思考,时间安排与答案相符”)。这些与个性外部因素更相关的变量也在决定哪个 GPT 智能体被认为更像人类方面发挥了作用。

Additionally, prior knowledge of ChatGPT may have influenced participants’ observations, as shown in responses such as: “answers briefly and repeats words”.

此外,参与者对 ChatGPT 的先前了解可能影响了他们的观察,如以下回答所示:“回答简短且重复词语”。

On the other hand, the category of cognitive anthropomorphism, primarily influenced by the personality trait of agreeable ness, includes the following subcategories: adaptation and empathy in conversation (e.g., "the last one felt human because I sensed it could understand my feelings; its way of speaking felt real, and it even asked me a question, something no other interaction did"), personal interaction and shared experiences (e.g., "I didn’t feel it perceived the time (hours) we were talking, and it referred to things like what I had done during my day or binge-watching a series"), natural conversation (e.g., "it felt closest to a conversation with friends"), and consistent and strong personality (e.g., "its behavior felt very human, like when you talk to someone, and they pick up your mannerisms"). These subcategories highlight factors contributing to a sense of naturalness and connection, which motivated the perception of human-like qualities. These traits align with other research suggesting that variables such as empathy, rapport, and a sense of security may be critical for enhancing human-AI collaboration (Kolomaznik et al., 2024). Therefore, implementing high levels of agreeable ness traits in AI agents may help humanize them, thereby positively influencing interaction and collaboration.

另一方面,认知拟人化类别主要受宜人性人格特质的影响,包括以下子类别:对话中的适应性和共情(例如,“最后一个感觉像人,因为我感觉到它能理解我的感受;它的说话方式感觉很真实,甚至还问了我一个问题,这是其他互动中没有的”)、个人互动和共享体验(例如,“我没有感觉到它感知到我们交谈的时间(小时),它提到了我白天做了什么或追剧之类的事情”)、自然对话(例如,“感觉最接近与朋友的对话”)以及一致且强烈的个性(例如,“它的行为感觉非常人性化,就像当你和某人交谈时,他们会模仿你的举止”)。这些子类别突出了促成自然感和连接感的因素,这些因素激发了人们对类人特质的感知。这些特质与其他研究一致,表明共情、融洽关系和安全感等变量可能对增强人机协作至关重要(Kolomaznik 等,2024)。因此,在AI智能体中实施高水平的宜人性特质可能有助于使其更具人性化,从而对互动和协作产生积极影响。

From a psychological perspective, several explanations can account for this phenomenon. For instance, the “anthropomorphism heuristic” refers to the tendency to interpret non-human entities through an anthropocentric lens (Dacey, 2017), often leading to inaccurate conclusions (Heider & Simmel, 1944). This an thro po morph iz ation of AI systems can foster imaginative acts that enable interaction with artificial agents as if they possessed genuine intentions or emotions (Krueger & Roberts, 2024). Specifically, this article aligns with several findings from the category of cognitive anthropomorphism, such as AI agents discussing their daily activities. According to Krueger and Roberts, "fictional is m" partly depends on the human subject’s ability to imagine that the artificial counterpart has its own digital life (Krueger & Roberts, 2024). A proposed three-factor theory of anthropomorphism predicts that anthropomorphism is more likely to occur when: (1) humans use their knowledge of human traits to infer human-like characteristics in agents (elicited agent knowledge); (2) it helps reduce uncertainty and predict the behavior of unfamiliar agents (eeectance motivation); and (3) a lack of social connection drives individuals to an thro po morph ize to fulfill their need for social interaction (sociality motivation) (Epley et al., 2007). Based on comments gathered in our study, at least two of these factors appear to be at play: elicited agent knowledge, due to the challenge of distinguishing between human and non-human agents, and eeectance motivation, in predicting the motivations of witnesses. The role of sociality motivation could not be assessed, as it was not measured in the study.

从心理学角度来看,有几种解释可以说明这一现象。例如,“拟人化启发法”指的是通过人类中心视角来解释非人类实体的倾向 (Dacey, 2017),这通常会导致不准确的结论 (Heider & Simmel, 1944)。这种对AI系统的拟人化可以促进想象行为,使得人们能够与AI智能体互动,仿佛它们拥有真实的意图或情感 (Krueger & Roberts, 2024)。具体而言,本文与认知拟人化类别中的几项发现一致,例如AI智能体讨论其日常活动。根据Krueger和Roberts的说法,“虚构主义”部分依赖于人类主体想象人工对应物拥有其数字生活的能力 (Krueger & Roberts, 2024)。拟人化的三因素理论预测,当以下情况时,拟人化更有可能发生:(1) 人类利用他们对人类特征的知识来推断智能体中类似人类的特征(引发的智能体知识);(2) 它有助于减少不确定性并预测不熟悉的智能体的行为(效应动机);以及 (3) 缺乏社会联系促使个体拟人化以满足他们对社交互动的需求(社交动机) (Epley et al., 2007)。根据我们研究中收集到的评论,至少有两个因素似乎在起作用:引发的智能体知识,由于区分人类和非人类智能体的挑战;以及效应动机,在预测目击者的动机时。社交动机的作用无法评估,因为研究中没有测量这一因素。

In addition to explaining the results through the "anthropomorphism heuristic," other psychological factors can also account for them. For instance, one hypothesis consistent with our findings suggests that whether a witness is perceived as human or AI may depend on how the interaction aeects one's identity. The looking-glass self theory (Cooley, 1902) a sociological concept, describes how self-perceptions are shaped by how individuals believe others perceive them. Specifically, a study investigating this phenomenon partially supports this proposal, showing that individual perceptions are significantly influenced by how others in the group view them, particularly those perceived as higher in status (Yeung & Martin, 2003). In this context, the witness's level of agreeable ness may have led the interrogator to perceive themselves as agreeable in the eyes of the witness. To confirm a positive identity, they could have humanized the witness. Similarly, if the witness was disagreeable, the interrogator may have denied the witness's humanity to protect their self-concept. This hypothesis is also supported by the confusion rate percentages for our GPT agents: Valentina (disagreeable) was more often recognized as AI, whereas Camila (agreeable) was more frequently identified as human. However, when asked which GPT agent exhibited the most human characteristics, Valentina ranked second, with nearly $7%$ more recognition than Emilia, the neutral agent. These results suggest that attributing humanity to an artificial agent may be influenced by unassessed hidden variables in the study.

除了通过“拟人化启发法”解释结果外,其他心理因素也可以解释这些结果。例如,一个与我们的发现一致的假设表明,证人是否被视为人类或AI可能取决于互动如何影响一个人的身份。镜中自我理论(Cooley, 1902)是一个社会学概念,描述了自我感知如何通过个体认为他人如何看待他们来塑造。具体来说,一项调查这一现象的研究部分支持了这一提议,表明个体感知显著受到群体中其他人如何看待他们的影响,尤其是那些被视为地位较高的人(Yeung & Martin, 2003)。在这种情况下,证人的宜人性水平可能导致审讯者认为自己在证人眼中是宜人的。为了确认积极的身份,他们可能会将证人人性化。同样,如果证人是不宜人的,审讯者可能会否认证人的人性以保护他们的自我概念。这一假设也得到了我们GPT智能体的混淆率百分比的支持:Valentina(不宜人)更常被识别为AI,而Camila(宜人)更常被识别为人类。然而,当被问及哪个GPT智能体表现出最多的人类特征时,Valentina排名第二,比中性智能体Emilia多出近7%的认可度。这些结果表明,将人性归因于人工智能体可能受到研究中未评估的隐藏变量的影响。

In summary, this study highlights that the agreeable ness trait, designed according to the Big Five Inventory and implemented in prompts for GPT agents, can influence the degree of humanization attributed to these systems. Specifically, the more agreeable ness the GPT agent exhibits, the more likely it is to be mistaken for a human and assigned human characteristics. However, questions remain, such as the role of potential mediating variables in the assignment of human traits and the relative importance of human-likeness (expression) versus cognitive anthropomorphism (personality). Notably, this is the first study to achieve a Turing Test pass rate exceeding $60%$ , even matching the results of other studies where a human was correctly identified as such by the interrogator. These findings underscore the significance of personality engineering as a future branch of robotics, where psychology should play an active role in designing ergonomic psychological models to ensure these systems' adaptability in collaborative activities.

总之,本研究强调,根据大五人格量表设计并在 GPT 智能体提示中实施的宜人性特质,可以影响这些系统被赋予的人性化程度。具体而言,GPT 智能体表现出的宜人性越高,就越有可能被误认为是人类并被赋予人类特征。然而,仍存在一些问题,例如潜在中介变量在赋予人类特质中的作用,以及类人表达与认知拟人化(人格)的相对重要性。值得注意的是,这是第一项图灵测试通过率超过 $60%$ 的研究,甚至与其他研究中审讯者正确识别出人类的结果相匹配。这些发现强调了人格工程作为未来机器人学分支的重要性,心理学应在设计符合人体工程学的心理模型中发挥积极作用,以确保这些系统在协作活动中的适应性。

4. References

4. 参考文献

Epley, N., Waytz, A., & Cacioppo, J. T. (2007). On Seeing Human: A Three-Factor Theory of Anthropomorphism. Psychological Review, 114(4), 864–886. https://doi.org/10.1037/0033-295X.114.4.864

Epley, N., Waytz, A., & Cacioppo, J. T. (2007). 论人类化:拟人化的三因素理论。Psychological Review, 114(4), 864–886. https://doi.org/10.1037/0033-295X.114.4.864

Genise, G., Ungaretti, J., & Etchezahar, E. (2020). El Inventario de los Cinco Grandes Factores de Personal i dad en el contexto argentino: puesta a prueba de los factores de orden superior. Diversitas: Per spec t iv as En Psicología, 16, 325–340.

Genise, G., Ungaretti, J., & Etchezahar, E. (2020). 阿根廷背景下的五大人格因素量表:高阶因素的验证。Diversitas: 心理学视角, 16, 325–340.

Habashi, M. M., Graziano, W. G., & Hoover, A. E. (2016). Searching for the Prosocial Personality. Personality and Social Psychology Bulletin, 42(9), 1177–1192. https://doi.org/10.1177/0146167216652859

Habashi, M. M., Graziano, W. G., & Hoover, A. E. (2016). 寻找亲社会人格。Personality and Social Psychology Bulletin, 42(9), 1177–1192. https://doi.org/10.1177/0146167216652859

Heider, F., & Simmel, M. (1944). An Experimental Study of Apparent Behavior. The American Journal of Psychology, 57(2), 243. https://doi.org/10.2307/1416950

Heider, F., & Simmel, M. (1944). 表观行为的实验研究. 美国心理学杂志, 57(2), 243. https://doi.org/10.2307/1416950

Hofstee, W. K. B., de Raad, B., & Goldberg, L. R. (1992). Integration of the Big Five and Circumplex Approaches to Trait Structure. Journal of Personality and Social Psychology, 63(1), 146–163. https://doi.org/10.1037/0022-3514.63.1.146

Hofstee, W. K. B., de Raad, B., & Goldberg, L. R. (1992). 大五人格与环形特质结构方法的整合. Journal of Personality and Social Psychology, 63(1), 146–163. https://doi.org/10.1037/0022-3514.63.1.146

Jiang, T., Sun, Z., Fu, S., & Lv, Y. (2024). Human-AI interaction research agenda: A usercentered perspective. Data and Information Management, 100078. https://doi.org/10.1016/j.dim.2024.100078

Jiang, T., Sun, Z., Fu, S., & Lv, Y. (2024). 人机交互研究议程:以用户为中心的视角. 数据与信息管理, 100078. https://doi.org/10.1016/j.dim.2024.100078

John, O., Donahue, E., & Kentle, R. (1991). The Big-Five Inventory-Version 4a and 54. Institute of Personality and Social Research, University of California.

John, O., Donahue, E., & Kentle, R. (1991). 大五人格量表-版本4a和54. 加州大学人格与社会研究所.

Jones, C. R., & Bergen, B. K. (2023). Does GPT-4 pass the Turing test? Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024, 1, 5183–5210. https://doi.org/10.18653/v1/2024.naacl-long.290

Jones, C. R., & Bergen, B. K. (2023). GPT-4 是否通过了图灵测试?2024年北美计算语言学协会会议:人类语言技术论文集,NAACL 2024,1,5183–5210。https://doi.org/10.18653/v1/2024.naacl-long.290

Jones, C. R., & Bergen, B. K. (2024). People cannot distinguish GPT-4 from a human in a Turing test. https://arxiv.org/abs/2405.08007v1

Jones, C. R., & Bergen, B. K. (2024). 人们无法在图灵测试中区分 GPT-4 和人类。https://arxiv.org/abs/2405.08007v1

Kang, H., Moussa, M., & Magnenat-Thalmann, N. (2024). Nadine: An LLM-driven Intelligent Social Robot with Aeective Capabilities and Human-like Memory. Arxiv.

Kang, H., Moussa, M., & Magnenat-Thalmann, N. (2024). Nadine: 一个具备情感能力和类人记忆的LLM驱动智能社交机器人。Arxiv.

Kolomaznik, M., Petrik, V., Slama, M., & Jurik, V. (2024a). The role of socio-emotional attributes in enhancing human-AI collaboration. Frontiers in Psychology, 15. https://doi.org/10.3389/fpsyg.2024.1369957

Kolomaznik, M., Petrik, V., Slama, M., & Jurik, V. (2024a). 社会情感属性在增强人机协作中的作用。心理学前沿, 15. https://doi.org/10.3389/fpsyg.2024.1369957

Kolomaznik, M., Petrik, V., Slama, M., & Jurik, V. (2024b). The role of socio-emotional attributes in enhancing human-AI collaboration. Frontiers in Psychology, 15. https://doi.org/10.3389/fpsyg.2024.1369957

Kolomaznik, M., Petrik, V., Slama, M., & Jurik, V. (2024b). 社会情感属性在增强人机协作中的作用. Frontiers in Psychology, 15. https://doi.org/10.3389/fpsyg.2024.1369957

Krueger, J., & Roberts, T. (2024). Real Feeling and Fictional Time in Human-AI Interactions. Topoi, 43(3), 783–794. https://doi.org/10.1007/S11245-024-10046- 7/METRICS

Krueger, J., & Roberts, T. (2024). 人机交互中的真实情感与虚构时间. Topoi, 43(3), 783–794. https://doi.org/10.1007/S11245-024-10046-7/METRICS

Lim, S. L., Bentley, P. J., Peterson, R. S., Hu, X., & Prouty McLaren, J. (2023). Kill chaos with kindness: Agreeable ness improves team performance under uncertainty. Collective Intelligence, 2(1). https://doi.org/10.1177/26339137231158584

Lim, S. L., Bentley, P. J., Peterson, R. S., Hu, X., & Prouty McLaren, J. (2023). 以善治乱:不确定性下亲和力提升团队绩效。集体智慧, 2(1). https://doi.org/10.1177/26339137231158584

Loyall, A. B., & Bates, J. (1997). Personality-rich believable agents that use language. Proceedings of the First International Conference on Autonomous Agents - AGENTS ’97, 106–113. https://doi.org/10.1145/267658.267681

Loyall, A. B., & Bates, J. (1997). 使用语言的富有个性的可信智能体 (Personality-rich believable agents that use language). 第一届国际自主智能体会议论文集 (Proceedings of the First International Conference on Autonomous Agents - AGENTS '97), 106–113. https://doi.org/10.1145/267658.267681

McCrae, R. R., & Costa, P. T. (1987). Validation of the five-factor model of personality across instruments and observers. Journal of Personality and Social Psychology, 52(1), 81–90. https://doi.org/10.1037//0022-3514.52.1.81

McCrae, R. R., & Costa, P. T. (1987). 跨工具和观察者验证人格五因素模型. Journal of Personality and Social Psychology, 52(1), 81–90. https://doi.org/10.1037//0022-3514.52.1.81

McKinsey & Company. (2023). The economic potential of generative AI. Https://Www.Mckinsey.Com/Capabilities/Mckinsey-Digital/Our-Insights/theEconomic-Potential-of-Generative-Ai-the-next-Productivity-Frontier.

McKinsey & Company. (2023). 生成式 AI 的经济潜力. Https://Www.Mckinsey.Com/Capabilities/Mckinsey-Digital/Our-Insights/theEconomic-Potential-of-Generative-Ai-the-next-Productivity-Frontier.

Melchers, M. C., Li, M., Haas, B. W., Reuter, M., Bischoe, L., & Montag, C. (2016). Similar Personality Patterns Are Associated with Empathy in Four Dieerent Countries. Frontiers in Psychology, 7. https://doi.org/10.3389/fpsyg.2016.00290

Melchers, M. C., Li, M., Haas, B. W., Reuter, M., Bischoe, L., & Montag, C. (2016). 相似的人格模式与四个不同国家的共情能力相关。心理学前沿, 7. https://doi.org/10.3389/fpsyg.2016.00290

Agreeable ness and Conscientiousness promote successful adaptation to the Covid-19 pandemic through eeective internal iz ation of public health guidelines. Motivation and Emotion, 46(4), 476–485. https://doi.org/10.1007/s11031-022- 09948-z

宜人性和尽责性通过有效内化公共卫生指南促进了对新冠疫情的适应。《动机与情感》,46(4),476–485。https://doi.org/10.1007/s11031-022-09948-z

Nguyen, A., Hong, Y., Dang, B., & Huang, X. (2024). Human-AI collaboration patterns in AI-assisted academic writing. Studies in Higher Education, 49(5), 847–864. https://doi.org/10.1080/03075079.2024.2323593

Nguyen, A., Hong, Y., Dang, B., & Huang, X. (2024). 人机协作模式在AI辅助学术写作中的应用。高等教育研究,49(5), 847–864. https://doi.org/10.1080/03075079.2024.2323593

OpenAI. (2024). Hello GPT-4o . https://openai.com/index/hello-gpt-4o/

OpenAI. (2024). 你好 GPT-4o . https://openai.com/index/hello-gpt-4o/

Paetzel-Prüsmann, M., Perugia, G., & Castellano, G. (2021). The Influence of robot personality on the development of uncanny feelings. Computers in Human Behavior, 120, 106756. https://doi.org/10.1016/j.chb.2021.106756

Paetzel-Prüsmann, M., Perugia, G., & Castellano, G. (2021). 机器人个性对诡异感发展的影响。Computers in Human Behavior, 120, 106756. https://doi.org/10.1016/j.chb.2021.106756

Pew Research Center. (2023). What the data says about Americans’ views of artificial intelligence. Https://Www.Pew research.Org/Short-Reads/2023/11/21/What-theData-Says-about-Americans-Views-of-Artificial-Intelligence/.

Pew Research Center. (2023). 数据揭示美国人对人工智能的看法. Https://Www.Pew research.Org/Short-Reads/2023/11/21/What-theData-Says-about-Americans-Views-of-Artificial-Intelligence/.

Pwc. (2017). PwC’s Global Artificial Intelligence Study: Exploiting the AI Revolution. Https://Www.Pwc.Com/Gx/En/Issues/ArtificialIntelligence/Publications/Artificial-Intelligence-Study.Html.

Pwc. (2017). PwC全球人工智能研究:利用人工智能革命。Https://Www.Pwc.Com/Gx/En/Issues/ArtificialIntelligence/Publications/Artificial-Intelligence-Study.Html.

Qu atum Black AI. (2024). The state of AI in early 2024: Gen AI adoption spikes and starts to generate value. Https://Www.Mckinsey.Com/Capabilities/Quantum black/Our-Insights/theState-of-Ai#/.

Qu atum Black AI. (2024). 2024年初AI现状:生成式AI采用激增并开始产生价值。Https://Www.Mckinsey.Com/Capabilities/Quantum black/Our-Insights/theState-of-Ai#/.

Riedl, R. (2022). Is trust in artificial intelligence systems related to user personality? Review of empirical evidence and future research directions. Electronic Markets, 32(4), 2021–2051. https://doi.org/10.1007/s12525-022-00594-4

Riedl, R. (2022). 用户对人工智能系统的信任是否与用户个性相关?实证证据回顾与未来研究方向。Electronic Markets, 32(4), 2021–2051. https://doi.org/10.1007/s12525-022-00594-4

Endres, L. (1995). Personality engineering: Applying human personality theory to the design of artificial personalities (pp. 477–482). https://doi.org/10.1016/S0921- 2647(06)80262-5

Endres, L. (1995). 人格工程学:将人类人格理论应用于人工人格设计 (pp. 477–482). https://doi.org/10.1016/S0921-2647(06)80262-5

Sacino, A., Cocchella, F., De Vita, G., Bracco, F., Rea, F., Sciutti, A., & And ri ghetto, L. (2022). Human- or object-like? Cognitive anthropomorphism of humanoid robots. PLoS ONE, 17(7), e0270787. https://doi.org/10.1371/JOURNAL.PONE.0270787

Sacino, A., Cocchella, F., De Vita, G., Bracco, F., Rea, F., Sciutti, A., & Andri ghetto, L. (2022). 人形机器人的人类或物体认知拟人化。PLoS ONE, 17(7), e0270787. https://doi.org/10.1371/JOURNAL.PONE.0270787

Song, Y., & Shi, M. (2017). Associations between empathy and big five personality traits among Chinese undergraduate medical students. PLOS ONE, 12(2), e0171665. https://doi.org/10.1371/journal.pone.0171665

Song, Y., & Shi, M. (2017). 中国医学生共情与大五人格特质的关系。PLOS ONE, 12(2), e0171665. https://doi.org/10.1371/journal.pone.0171665

Stanford Unviersity. (2024). The Ai Index Report: Measuring trends in AI.

斯坦福大学. (2024). AI指数报告: 衡量AI趋势.

阅读全文(20积分)