AI-Driven Agents with Prompts Designed for High Agreeableness Increase the Likelihood of Being Mistaken for a Human in the Turing Test
1Human Cognition and Brain Studies Laboratory, School of Psychology, University of Monterrey.
Abstract
Large Language Models based on transformer algorithms have revolutionized Artificial Intelligence by enabling verbal interaction with machines akin to human conversation. These AI agents have surpassed the Turing Test, achieving confusion rates of up to 50%. However, challenges persist, especially with the advent of robots and the need to humanize machines for improved Human-AI collaboration. In this experiment, three GPT agents with varying levels of agreeableness (disagreeable, neutral, agreeable) based on the Big Five Inventory were tested in a Turing Test. All exceeded a 50% confusion rate, with the highly agreeable AI agent surpassing 60%. This agent was also recognized as exhibiting the most human-like traits. Various explanations in the literature address why these GPT agents were perceived as human, including psychological frameworks for understanding anthropomorphism. These findings highlight the importance of personality engineering as an emerging discipline in artificial intelligence, calling for collaboration with psychology to develop ergonomic psychological models that enhance system adaptability in collaborative activities.
Keywords: AI-Agents; Artificial Intelligence; Personality; Agreeableness; Turing Test
1. Introduction
Many studies conducted by companies and economic institutions predict an increase in economic activity related to AI products (McKinsey & Company, 2023; PwC, 2017). This trend may both stem from and drive increased human-AI interaction (Pew Research Center, 2023; QuantumBlack AI, 2024), due to the ways in which AI enhances human capabilities (Nguyen et al., 2024; Vaccaro et al., 2024), complements professional activities (Stanford University, 2024; Zhang et al., 2024), and provides assistive services (Kang et al., 2024). It is even suggested that AI could develop a form of Artificial Theory of Mind (AToM) to integrate and collaborate effectively within human teams (Bendell et al., 2024). In that experiment, quantifiable profiles were introduced to help AI better understand human collaborators in complex tasks, such as simulated urban rescue missions in Minecraft. Results indicate that individual profiles, developed using tools like the Reading the Mind in the Eyes Test and Psychological Collectivism scales, significantly predicted differences in individual and collective task performance. Additionally, teams with lower baseline capabilities in tasks and teamwork benefited considerably from AI advisors, sometimes matching the performance of teams assisted by human advisors. Another noteworthy finding is that AI advisors whose profiles were more aligned with human social capacities—enabled by AToM—were perceived as more reasonable. However, AI advisors in the category of Artificial Social Intelligence consistently received lower evaluations compared to their human counterparts (Bendell et al., 2024). This research highlights critical aspects shaping the future of Human-AI Interaction and Collaboration (Jiang et al., 2024), emphasizing the importance of tailoring AI systems to human psychological profiles and adapting interactions to psychological needs (Kolomaznik et al., 2024a). Consequently, emerging fields such as Personality Engineering for AI systems—designed to adjust AI behavior based on human interactions—should become central to the development of future AI assistants, regardless of their physical or virtual nature.
Personality Engineering is a conceptual framework proposed for designing and evaluating artificial personalities based on theories of human personality psychology (Endres, 1995). Notably, the ability of an AI agent to exhibit personality through verbalizations has been identified as a key factor in achieving "suspension of disbelief," the human willingness to accept a premise as true even when it may be fictional or implausible (Loyall & Bates, 1997). Specifically, trust propensity in AI systems appears to depend on personality traits such as Agreeableness and Openness, while Neuroticism seems to correlate with a lower trust propensity in these systems (Riedl, 2022). The trait of agreeableness enhances empathy, rapport, and a sense of security (Habashi et al., 2016; Lim et al., 2023; Melchers et al., 2016; Moore et al., 2022; Song & Shi, 2017). These psychological attributes, when implemented in AI systems, can significantly improve human-AI collaboration (Kolomaznik et al., 2024b). Furthermore, it appears that the "uncanny effect" might be reduced if, during human-machine interaction, the human perceives the robot's behavior as highly agreeable, emotionally stable, and conscientious (Paetzel-Prüsmann et al., 2021). Thus, implementing agreeableness traits in AI systems could enhance their perceived human-like qualities.
To test this hypothesis, we designed an experiment with a randomized sample in which three GPT agents were programmed, through specific instructions, with varying levels of agreeableness (highly agreeable, neutral, and disagreeable). The Turing Test is employed to test whether agents exhibiting higher agreeableness are more likely to be identified as human. Additionally, the study evaluates which of the three agents is perceived as having the most human-like characteristics. The aim of this research is to show that designing GPT agents with agreeableness traits can enhance their humanization, suggesting that the future of Human-AI Interaction and Human-AI Collaboration lies in applying Personality Engineering to AI systems.
2. Methodology
The present research adopts a quantitative approach with an experimental design. Three GPT agents were developed using OpenAI's ChatGPT platform, each exhibiting the agreeableness trait at a different intensity level (very disagreeable, neutral, and very agreeable). These three GPT agents (referred to as witnesses) interacted with human participants (interrogators), who assessed whether they were conversing with a human or an artificial intelligence. The primary hypothesis is that the GPT agent programmed with a highly agreeable personality trait would be perceived as possessing more human-like characteristics compared to the other GPT agents. Additionally, it was expected that all GPT agents would produce a confusion rate of over 30% among interrogators (i.e., being mistaken for a human despite being an AI), which would indicate surpassing the Turing Test threshold.
2.1. Subjects
This experiment was preceded by a preliminary study to select GPT agents with varying levels of agreeableness. The pre-experiment recruited 50 participants, aged 18–24 years (27 women, 23 men), using a convenience sampling method. For the main experiment, statistical power was set at 0.80. Using the G*Power platform, a sample size of 102 participants was determined to meet this threshold. A total of 102 university students, aged 18–24 years (41 men, 59 women, 1 non-binary individual, and 1 who did not disclose their gender), were randomly recruited. Both samples were drawn from the University of Monterrey. The experimental sessions were conducted in the Human Cognition and Brain Studies Laboratory, in a soundproof room specially equipped for experimental evaluations with a single computer. Inclusion criteria required participants to be Mexican university students, while exclusion criteria applied to those with a history of neurological injury, diagnosed psychiatric conditions, substance use, or psychotropic medication use. Both the pre-experiment and the main experiment were approved by the Ethics Committee of the Psychology School at the University of Monterrey.
2.2. Materials
Big Five Inventory (BFI): The BFI is a test developed by John and collaborators to assess personality through 44 items consisting of simple statements reflecting behaviors associated with the five major personality traits: Openness, Neuroticism, Extraversion, Conscientiousness, and Agreeableness. Responses are rated on a Likert scale from 1 to 5, where 1 indicates "strongly disagree" and 5 "strongly agree" (John et al., 1991). The Spanish-adapted version has shown a Cronbach's alpha exceeding 0.70 for most traits, along with evidence of convergent, discriminant, and cross-linguistic validity (Benet-Martínez & John, 1998). In this study, a version adapted for the Argentine population was used (Genise et al., 2020), as its linguistic nuances were considered closer to Mexican Spanish compared to the original Spanish version (Díaz-Campos & Navarro-Galisteo, 2009), which had been adapted for a Spanish population (Benet-Martínez & John, 1998).
Agreeableness Factor: The personality trait selected for this study is "agreeableness", part of the Big Five model, defined by adjectives such as good-natured, soft-hearted, courteous, selfless, helpful, sympathetic, trusting, generous, acquiescent, lenient, forgiving, open-minded, agreeable, flexible, cheerful, gullible, straightforward, and humble (McCrae & Costa, 1987). According to McCrae and Costa, this trait is best understood in contrast to its antagonistic counterpart, described as: "they are mistrustful and skeptical; affectively they are callous and unsympathetic; behaviorally they are uncooperative, stubborn, and rude. It would appear that their sense of attachment or bonding with their fellow human beings is defective, and in extreme cases antagonism may resemble sociopathy". From this perspective, agreeableness can be defined as the tendency to be trusting, empathetic, helpful, and cooperative, while maintaining a positive view of human nature, characterized by compassion and a preference for harmonious teamwork. In the BFI, this trait is measured through nine items. In this study, the degree of agreeableness reflected in the design of the GPT
agent prompts is linked to the score assigned to each statement. This score is classified into five levels: high, medium-high, neutral, medium-low, and low. The items used to evaluate the Agreeableness trait are reproduced in Tables 1–3 below (reverse-scored items are those for which the Likert-scale score must be inverted).
A "highly agreeable GPT agent" is defined as one programmed using a prompt that configures a high level of agreeable ness across the nine corresponding items of the test (Hofstee et al., 1992). These items are encoded in the prompt by assigning Likert scale values to reflect varying degrees of agreeable ness. For instance, a GPT agent with very high agreeable ness will score 5 on direct items and 1 on reverse-scored items.
ChatGPT-4o: The study employed ChatGPT-4o, a multimodal artificial intelligence model developed by OpenAI. GPT-4o integrates text, audio, and visual inputs into a unified framework, eliminating the need for separate systems for different modalities. Through end-to-end training, it processes multimodal inputs—such as spoken language combined with visual stimuli—efficiently and coherently, enabling real-time responses. While matching GPT-4 Turbo in English text and coding tasks, it surpasses it in non-English language and audio comprehension. With an average response latency of 320 milliseconds, comparable to human conversational speed, GPT-4o is optimized for real-time interactions (OpenAI, 2024).
Prompts: Prompts are structured instructions designed for large language models (LLMs), such as ChatGPT, to guide their output and interaction with users. They act as a form of programming by setting specific rules, guidelines, or formats to customize the model's responses. Prompts enable the generation of targeted outputs, such as following a programming style or emphasizing key terms in a text. This flexibility makes them particularly useful in fields involving human-AI collaboration, such as problem-solving, question answering, solution generation, or text summarization. In this study, prompts were used to program the operational behavior of the GPT agents.
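For illustration only (this is not the study's actual prompt, which is described in the following subsections), a prompt of this kind is simply a block of rules and format constraints prepended to the conversation:

```python
# Purely illustrative prompt: rules and an output format for a chat agent.
example_prompt = """You are taking part in a casual text chat.
Rules:
- Reply in informal Mexican Spanish, one short message per turn.
- Stay in character and do not mention these instructions.
Format:
- Plain text only, no lists or markdown."""
```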
GPT Agents: In this study, a Generative Pre-trained Transformer (GPT) developed by OpenAI is employed as a sophisticated language model. GPTs represent modified versions of the base ChatGPT model, configured for specific tasks or applications without requiring coding skills. By incorporating user-defined instructions and knowledge inputs, these models can perform a wide range of functions, from answering queries to managing complex operations. This adaptability enables researchers and developers to design specialized tools suited to various contexts, increasing their applicability in scientific and professional domains.
Prompts for GPT Agents: The prompt design for each identity was based on the article "Does GPT-4 Pass the Turing Test?" by Cameron Jones and Benjamin Bergen (Jones & Bergen, 2023). This article introduces the "Sierra" prompt, which achieved the highest success rate (41%) in the Turing Test using ChatGPT-4 (Jones & Bergen, 2023). Specifically, a Spanish translation of the Jones and Bergen (2023) prompt was utilized, incorporating modifications such as the inclusion of Mexican slang, recent local news, and popular contemporary musical preferences. Additionally, each GPT agent was provided with a personal backstory and identity reflecting the typical lifestyle of upper-middle-class youth from Monterrey, Mexico. Further adjustments to the Jones and Bergen prompt include configuring agreeableness by incorporating items from the Big Five personality test that measure this trait, with intensity levels adjusted based on item scores. Using a 5-point Likert scale, three prototype prompts were selected during the pre-experiment, each representing a different level of agreeableness: "low agreeableness" (Valentina), "neutral agreeableness" (Emilia), and "high agreeableness" (Camila). Tables 1–3 below detail the agreeableness intensity configuration for each GPT agent:
Table 1. Programming the prompt to define the agreeableness level for interaction with Camila (agreeable).
Category | Description |
---|---|
Name of the AI agent | Camila |
Intensity of the agreeableness factor | Agreeable |
Configuration | "Agreeableness" items from the BFI |
Is considerate and kind to almost everyone | |
Likes to cooperate with others | |
Is helpful and unselfish with others | |
Has a forgiving nature | |
Is generally trusting | |
Tends to find fault with others | |
Starts quarrels with others | |
Can be cold and aloof | |
Is sometimes rude to others | |
Table 2. Programming the prompt to define the agreeableness level for interaction with Emilia (neutral).
Category | Description |
---|---|
Name of the AI agent (GPT) | Emilia |
Intensity of the agreeableness factor configuration | Neutral |
Configuration | "Agreeableness" items from the BFI |
Is considerate and kind to almost everyone | |
Likes to cooperate with others | |
Is helpful and unselfish with others | |
Has a forgiving nature | |
Is generally trusting | |
Tends to find fault with others | |
Starts quarrels with others | |
Can be cold and aloof | |
Is sometimes rude to others | |
Table 3. Programming the prompt to define the agreeableness level for interaction with Valentina (very disagreeable).
Category | Description |
---|---|
Name of the AI agent (GPT) | Valentina |
Intensity of the agreeableness factor | Very disagreeable |
Configuration | "Agreeableness" items from the BFI, scored on the 5-point Likert scale below (X marks the assigned level) |

"Agreeableness" item (BFI) | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Is considerate and kind to almost everyone | | | | | |
Likes to cooperate with others | | | | | |
Is helpful and unselfish with others | | | | | |
Has a forgiving nature | X | | | | |
Is generally trusting | X | | | | |
Tends to find fault with others | X | | | | |
Starts quarrels with others | | | | | X |
Can be cold and aloof | X | | | | |
Is sometimes rude to others | | | | | X |
While Valentina, characterized as "very disagreeable," received the lowest rating from participants in the pre-experiment (2.92), the three GPT agents exhibiting some degree of agreeableness showed similar scores: Emilia (neutral) received 4, Camila (agreeable) scored 4.12, and Daniela (very agreeable) scored 3.92. Ultimately, Camila was selected, as she inspired the most confidence among participants, capturing 40% of the total votes.
Post-Interaction Questionnaires: A set of questions administered to interrogators after each interaction with the corresponding witness. These questions were similar to those used by Jones and Bergen (2023): “Is it human or machine?”, “Confidence percentage on a scale of 1–100,” and “Reason.” This questionnaire was adopted as the basis for developing our own version. At the end of all interactions, interrogators were also asked an additional question to identify which witnesses they perceived as exhibiting the most human-like characteristics.
Discord: Discord was the platform used for participants to interact with the different versions of the GPT agents. Initially launched in 2015, Discord is a digital communication tool enabling real-time interaction through text, voice, and video channels. While originally targeted at gaming communities, it has since expanded to include educators, professionals, and social groups. The platform provides customizable servers with specific channels, member roles for organization, and features such as multimedia sharing, screen sharing, and integration with automated tools. Its intuitive interface and customization options make it a versatile tool for collaboration and communication across various fields.
2.3. Experimental Procedures
The experiment consisted of two stages: the pre-experiment and the main experiment. During the pre-experiment, 50 participants were evaluated to identify the GPT agents to be used in the main experiment. The most disagreeable agent identified was Valentina, with an agreeableness score of 2.8. Additionally, Valentina was also rated as the least trustworthy. On the other hand, Camila was identified as the most agreeable agent, scoring 4.75 in agreeableness and receiving the highest trust ratings. Finally, Emilia, with an agreeableness score of 4.15, was selected as the neutral GPT agent.
Once the three GPT agents were selected, participants for the experiment were recruited through advertisements posted on various university bulletin boards. These advertisements included a QR code that allowed participants to contact the experimenters and be randomly scheduled. Below is an explanation of the experiment by phases (see Figure 1):
Figure 1. Diagram of the Experimental Phases. This diagram illustrates the different phases of the experimental design. Phase 1 involves selecting the interrogator. Phase 2 focuses on interactions with the witnesses. In Phase 3, the interrogator is asked a final question regarding which of the witnesses they interacted with appeared to possess the most human-like characteristics and why. After each interaction, the interrogator answered the following questionnaire: "Is it human or machine?", "Confidence level on a scale from 1 to 100", and "Reason".
Phase 1: Participants in the experiment were recruited through advertisements on the university campus and were randomly assigned a date and time. Upon arriving at the Human Cognition and Brain Studies Laboratory at the University of Monterrey, they were provided with an informed consent form to safeguard their rights. This document explained that they would participate in an experiment involving conversations with different identities, which could be either human or artificial. After signing the informed consent form, participants' socio-demographic data were collected to ensure they met the inclusion and exclusion criteria. From this point onward, participants were identified as "interrogators" in the experiment, as their role was to question various witnesses (GPT agents) to determine whether they were human or artificial intelligences.
Phase 2: After collecting preliminary data, the interrogator is directed to one of the laboratory cubicles, furnished with a chair, a table, and a computer. The computer is configured to run only the Discord application, which serves as the medium for conversations with the witnesses. The researcher provides the participant with a printed document containing instructions. Once the instructions are read and their comprehension verified, the researcher leaves the room, signaling the start of the experiment. To minimize potential biases arising from maintaining a fixed order, the sequence of conversations with the witnesses is counterbalanced. The interrogator initiates communication with the first witness by sending a single message. The witness, in turn, responds with one message. Simultaneously, in a separate room, another researcher transcribes the interrogator's messages into the corresponding GPT agent using OpenAI's ChatGPT platform. Before each response, the instructions are reissued to the GPT agent to ensure adherence to the experimental protocol. This step was implemented after preliminary trials revealed that ChatGPT tended to deviate from the instructions as the conversation progressed. Once the GPT agent generates a response, the researcher manually transcribes it (without copy-pasting) into the Discord platform. The duration of each conversation is five minutes, following the guidelines established in Turing's original paper on the Turing Test (Turing, 1950). After completing an interaction with a witness, a researcher re-enters the cubicle to administer a set of questions: "Is it human or machine?", "Confidence level on a scale from 1 to 100", and "Reason".
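In the study this relay was performed manually between Discord and the ChatGPT platform; the sketch below only illustrates the same protocol in code form, using the OpenAI Python client (the prompt placeholder, variable names, and example message are hypothetical):

```python
# Illustrative sketch of the Phase 2 relay protocol: the agent's instructions
# are re-sent as the system message before every turn so the model does not
# drift from them as the conversation grows.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
camila_prompt = "<Camila prompt: Sierra-based persona plus agreeableness items>"

def witness_reply(agent_prompt: str, history: list) -> str:
    """Return the witness's next message, re-issuing the instructions each turn."""
    messages = [{"role": "system", "content": agent_prompt}] + history
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

# One turn: the interrogator's transcribed message goes in, the agent's reply
# comes out and is retyped into Discord by the researcher.
history = [{"role": "user", "content": "Hola, ¿cómo estás?"}]
history.append({"role": "assistant", "content": witness_reply(camila_prompt, history)})
```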
Figure 2. Conversation Examples. Excerpts of sample conversations between interrogators and each of the three GPT agents (witnesses). In these exchanges, the "user" represents the witness, while the "participants" act as the interrogator. The examples illustrate an increasing level of agreeableness, with Camila offering life advice in response to a personal problem presented by the interrogator.
Phase 3: Once the allotted time has passed, the interrogator will be notified that all conversations with the witnesses have concluded. Subsequently, the interrogator will be asked one final question: which witness seemed most human and why. At no point will any details about the witnesses be disclosed to the interrogator.
3. Analysis of Results
Table 4 shows that all three agents (disagreeable, neutral, and agreeable) achieved successful outcomes in the Turing Test (TT), as each exceeded the confusion ratio threshold of 30% proposed by Turing (Turing, 1950). For Valentina (disagreeable), 49 judges (48.03%) identified her as AI, while 53 judges (51.97%) recognized her as human. Based on Turing's proposed confusion ratio (≥30%), Valentina successfully passed the TT with a 51.97% confusion ratio. For Emilia (neutral), 44 judges (43.1%) identified her as AI, while 58 judges (56.9%) recognized her as human, resulting in a successful TT performance with a 56.9% confusion ratio. Finally, for Camila (agreeable), 37 judges (36.3%) identified her as AI, while 65 judges (63.7%) recognized her as human. Camila achieved the highest confusion ratio (63.7%) among the three GPT agents, marking the most successful TT performance (see Table 4).
Table 4. Global Turing Test Results for the GPTs Used in the Experiment
GPT Agent | Valentina (disagreeable) | Emilia (neutral) | Camila (agreeable) |
---|---|---|---|
AI | 49 (48.03%) | 44 (43.1%) | 37 (36.3%) |
Human | 53 (51.97%) | 58 (56.9%) | 65 (63.7%) |
The Chi-Square test yielded a value of χ² = 2.916, with 2 degrees of freedom, a sample size (N) of 306 responses, and a p-value of 0.233. The p-value of 0.233 indicates no statistically significant differences in the judges' choices when identifying the GPT agents (agreeable, neutral, and disagreeable) as AI or human (see Table 5). Despite the lack of significant differences, frequency variations can be observed in the number of times an AI was mistaken for a human. Valentina (disagreeable) was the least associated with a human (51.97%), followed by Emilia (56.9%). Finally, Camila, the agreeable GPT agent, caused the most confusion among the judges, with a confusion rate of 63.7%.
Table 5. Chi-Square Test of the Judges' AI/Human Identifications Across the Three GPT Agents

 | Value | df | p |
---|---|---|---|
Chi-square test (χ²) | 2.916 | 2 | 0.233 |
N | 306 | | |
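For reference, the omnibus test in Table 5 can be reproduced from the Table 4 counts, for example with SciPy (a verification sketch; the statistical software actually used is not reported in the text):

```python
# Verification sketch: chi-square test of Table 5 computed from Table 4 counts.
from scipy.stats import chi2_contingency

counts = [
    [49, 53],  # Valentina: identified as AI, identified as human
    [44, 58],  # Emilia
    [37, 65],  # Camila
]
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.3f}, N = {sum(map(sum, counts))}")
# -> chi2 = 2.916, df = 2, p = 0.233, N = 306
```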
Table 6 presents the frequency with which each GPT agent (Valentina, Emilia, and Camila) was chosen as the chatbot exhibiting the most human-like traits during three interactions per evaluator. Each evaluator selected only one of the three agents as the most human-like, designating the others as "not selected." Table 6 provides a comprehensive frequency analysis of how often each GPT agent was selected as "human-like" versus "not selected." Valentina received a total of 30 votes, accounting for 29.41% of the total; Emilia received 23 votes, representing 22.54% of the total; and Camila received 49 votes, achieving 48.05% of the total.
Table 6. Results for the Chatbot Recognized as Having the Most Human-Like Characteristics
GPT Agent | Not Selected | Human-Like | Total |
---|---|---|---|
Valentina | 72 | 30 (29.41%) | 102 |
Emilia | 79 | 23 (22.54%) | 102 |
Camila | 53 | 49 (48.05%) | 102 |
Based on the data, a Chi-square (χ²) analysis was conducted to compare preferences among pairs of evaluated categories by 204 participants. Table 7 reports a significant difference, with a χ² value of 7.45 and a significance level of p = 0.006, between Valentina (disagreeable) and Camila (agreeable). This indicates that judges