[Paper translation] One Does Not Simply Meme Alone: Evaluating Co-Creativity Between LLMs and Humans in the Generation of Humor


Original paper: https://arxiv.org/pdf/2501.11433


One Does Not Simply Meme Alone: Evaluating Co-Creativity Between LLMs and Humans in the Generation of Humor

Zhikun Wu (KTH Royal Institute of Technology, Stockholm, Sweden, zhikun@kth.se), Thomas Weber (LMU Munich, Munich, Germany, thomas.weber@ifi.lmu.de), Florian Müller (TU Darmstadt, Darmstadt, Germany, florian.mueller@tu-darmstadt.de)


Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration Across Humor, Creativity, and Shareability Metrics.

Abstract

Collaboration has been shown to enhance creativity, leading to more innovative and effective outcomes. While previous research has explored the abilities of Large Language Models (LLMs) to serve as co-creative partners in tasks like writing poetry or creating narratives, the collaborative potential of LLMs in humor-rich and culturally nuanced domains remains an open question. To address this gap, we conducted a user study to explore the potential of LLMs in co-creating memes, a humor-driven and culturally specific form of creative expression. The study comprised three groups of 50 participants each: a human-only group creating memes without AI assistance, a human-AI collaboration group interacting with a state-of-the-art LLM, and an AI-only group where the LLM autonomously generated memes. We assessed the quality of the generated memes through crowdsourcing, with each meme rated on creativity, humor, and shareability. Our results showed that LLM assistance increased the number of ideas generated and reduced the effort participants felt. However, it did not improve the quality of the memes when humans collaborated with the LLM. Interestingly, memes created entirely by AI performed better on average than both human-only and human-AI collaborative memes in all areas. However, when looking at the top-performing memes, human-created ones were better in humor, while human-AI collaborations stood out in creativity and shareability. These findings highlight the complexities of human-AI collaboration in creative tasks. While AI can boost productivity and create content that appeals to a broad audience, human creativity remains crucial for content that connects on a deeper level.

CCS Concepts

• Human-centered computing $\rightarrow$ Empirical studies in HCI.

Keywords

Human-AI Collaboration, LLM, Co-Creativity, Memes, Humor

ACM Reference Format:

Zhikun Wu, Thomas Weber, and Florian Müller. 2025. One Does Not Simply Meme Alone: Evaluating Co-Creativity Between LLMs and Humans in the Generation of Humor. In 30th International Conference on Intelligent User Interfaces (IUI ’25), March 24–27, 2025, Cagliari, Italy. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3708359.3712094

1 Introduction

From co-authoring articles to discussing vacation plans or pair programming, collaborating with other people is a core part of our daily lives in many ways. While we could perform all these activities on our own, the diverse perspectives and problem-solving strategies [22], increased motivation in groups [22] and continuous feedback [52] help to support creative processes [2]. Prior work showed that such collaborative work increases the performance of project teams [12], improves quality [53], and, in particular, enhances creativity [38] in numerous domains. With the rise of Large Language Models (LLMs), there is a growing trend of replacing human collaboration partners with such systems in typically creative activities in areas such as art [30, 58], music [16, 17, 21], and literature [28, 45, 50, 56]. In these areas, LLMs have demonstrated remarkable capabilities in generating content, often matching or even surpassing human performance in tasks like divergent thinking, which involves producing numerous and varied ideas [29].

However, much of the prior work on the creative aspect of LLMs focuses on their outputs as standalone creations, often neglecting their potential as true co-creative partners. In these studies, the researchers presented LLMs with various tasks that typically require creativity and evaluated the results produced by the LLM [23, 25, 26, 29]. These assessments focus on attributes like originality and fluency, where LLMs demonstrate strong performance on metrics such as the Torrance Test of Creative Thinking [25, 29]. While these works offer important insights into the possibilities of such models for creative tasks, such approaches fail to capture the iterative nature of human co-creativity. Unlike simple task delegation, co-creativity involves iterative co-creation, where humans and AI systems actively refine ideas through dialog and feedback loops, aligning with established frameworks of collaborative creativity [7, 38]. Recently, research has started to investigate this co-creative process in the joint creative work of humans and LLMs in a number of domains [27, 28, 35, 48] and found that LLMs can contribute novel suggestions, enrich human ideas and enhance the joint creative output [4, 36].

Additionally, one important aspect has not yet been investigated in the area of co-creative collaboration with LLMs: humor. Humor is an interesting area because it is one of the most sophisticated and complex forms of human creativity. Humor strengthens social bonds, addresses difficult topics, and offers new perspectives on everyday situations [3]. It relies on surprise, contrast, cultural context, and emotional resonance [44]. What we think is funny depends on our personal cultural, linguistic and political background [15]. With the rise of online social networks and increasing globalization, parts of this context such as pop culture are aligning across borders, resulting in humor that incorporates elements that are personal and local as well as elements that are globally valid [43]. A well-known example of such humor, which incorporates global and local contexts, is the internet meme [1], often in the form of a captioned image. Such memes, as a cultural phenomenon, have emerged as a universal language of the internet and are used to express emotions, convey messages, or appropriate and recontextualize familiar elements [54]. This has also made them relevant for research as a means of evaluating creative humor [47]. However, while prior studies have explored the autonomous generation of memes by LLMs [49], no prior work has examined how human-AI collaboration with a multimodal LLM affects the creativity, humor, and shareability of memes.

In this paper, we add to the body of work on human-AI co-creativity by exploring the potential of LLMs as co-creative partners for generating humor. While “collaboration” in HCI traditionally entails shared goals and mutual interdependence between multiple human collaborators [7, 38], we follow prior work that defines co-creativity in terms of iterative, dialog-based refinement of ideas [24, 58]. In such collaborative systems, the AI can serve as a valuable, but not necessarily equal, partner in the creative process. Consequently, we investigate how people interact with a “humor assistant” in the creation of internet memes and how the availability of such an assistant affects productivity. Additionally, we evaluate how memes generated with such an assistant compare to memes that were created purely by a human and purely by AI in terms of their humor and shareability. For this, we conducted two user studies. In the first, we asked participants to generate ideas for memes, either collaboratively, using an LLM, or without any assistance, and to rate the experience. This study showed that participants who worked with the LLM assistant generated more ideas during the meme creation process than those who worked independently, while perceiving the process as less arduous. It also yielded 335 images from the human-only group and 307 from the collaborative group, which serve as the basis for the subsequent evaluation. For that evaluation, we sampled 150 images from each group, as well as 150 fully AI-generated images, and asked a second group of participants to rate them in terms of how funny and how creative they considered them, and how likely they would be to share them. This evaluation showed that memes generated entirely by AI surpassed both human and human-AI collaborations in all three dimensions.

These findings indicate that while LLM assistance can significantly boost productivity and reduce perceived effort in creative tasks, it does not necessarily enhance the quality of creative output. This suggests that AI models, by drawing from vast datasets, are quite adept at producing content that appeals to a wide audience. However, many of the top-rated memes were created with human involvement, suggesting that AI models primarily produce solid but average-quality content, while human input can help to iterate on and curate this content to lift it to a higher level of quality.

These results emphasize how AI can be a tool to quickly and easily produce large volumes of ideas that often already have a broad, average appeal. However, they also demonstrate a need for better methods, tools and processes for integrating AI into an iterative creative process in which the AI produces quantity while the human acts as a curator who selectively pushes the AI towards better results. Designing smarter and more human-centered AI systems that can simplify the challenging steps of the creative process while enriching and amplifying humans’ unique creative abilities will continue to be an important challenge in the future.

2 Related Work

Our research is informed by prior work on human co-creativity, the complexity of humor, human-AI collaboration in creativity, LLMs in creative content generation and the evaluation of creative outputs.

2.1 Human Co-Creativity and the Complexity of Humor

Collaborative creativity between humans has long been recognized as a powerful means of enhancing the creative process [7]. Co-creativity allows individuals to combine diverse perspectives, skills, and ideas, leading to more innovative and high-quality outcomes [2, 38]. In group settings, social interactions can stimulate creativity by fostering motivation, providing immediate feedback, and encouraging risk-taking [22, 52].

Humor, as a sophisticated and complex form of human creativity, plays a significant role in social bonding and communication [39]. Creating humor is particularly challenging because it relies on timing, cultural context, shared knowledge, and the ability to subvert expectations [51]. What individuals find humorous is deeply influenced by their personal experiences, cultural backgrounds, and social environments [15]. The complexity of humor makes it a rich area for exploring the dynamics of co-creativity, as collaborators must navigate these nuances to produce content that resonates with others [32].


Figure 2: Mapping of Meme Templates to Topics (Work, Food, Sports) in the Study.

2.2 Human-AI Collaboration in Creativity

Human-AI collaboration involves humans working alongside AI systems to co-create content by leveraging the strengths of both [46, 55]. In creative fields, this collaboration can enhance human creativity by providing novel ideas and alternative perspectives [24]. For instance, in collaborative writing, AI tools generate alternative text suggestions, acting as a "second mind" to stimulate divergent thinking [48]. In design, systems like the Creative Sketching Partner offer visual stimuli to help designers overcome fixation and explore new directions [31].

In the realm of meme creation, however, there is limited research on human-AI collaborative processes. While LLMs can assist in generating meme content, the interplay between human creativity and AI-generated suggestions remains underexplored. Studies in other creative domains suggest that human-AI collaboration can lead to more creative outputs than either humans or AI alone [28]. Yet, challenges persist, such as managing the AI’s lack of contextual sensitivity and ensuring that the collaboration enhances rather than hinders the creative process [42].

Moreover, ethical considerations arise in human-AI co-creation, including issues of authorship and bias introduced during AI training [11, 42]. Understanding how humans interact with AI systems in creative tasks is crucial for designing tools that effectively support and enhance human creativity.

2.3 LLMs in Creative Meme Generation

Large Language Models (LLMs) have demonstrated capabilities in generating human-like text, enabling applications in various creative domains [41]. Recent studies have explored the use of LLMs in autonomous creative content generation, including narratives [45, 50, 56], humor [57], and particularly memes [49].

In the context of meme generation, MemeCraft [49] utilizes LLMs to produce stance-driven memes with minimal human intervention, showcasing the models’ ability to create contextually rich, multimodal content. However, while LLMs can generate humorous and contextually appropriate memes, they often face challenges in capturing nuanced cultural references and emotional subtleties inherent in human creativity [20, 33]. Studies have indicated that LLMs may produce homogenized content, lacking diversity and originality compared to human-generated content [6, 18].

Despite these limitations, LLMs have been found to outperform humans in certain divergent thinking tasks, exhibiting remarkable originality and fluency [25, 29]. In humor generation, some LLMs produce jokes rated comparably to human-written humor [23, 26], although they may still fall short in capturing the depth of human humor in various contexts [20]. These findings highlight both the potential and the challenges of using LLMs for autonomous creative tasks such as meme generation.

2.4 Evaluation Metrics for Creative Outputs

Evaluating creative outputs, such as memes, involves assessing aspects like creativity, humor, and shareability [47, 54]. Memes are unique cultural artifacts that blend visual and textual elements to convey messages resonating with diverse audiences [10]. The shareability of a meme reflects its potential to be widely circulated, influenced by factors like humor, relatability, and relevance to current cultural topics [34, 40].

Humor is a key driver of engagement and virality in memes, often relying on incongruity and the juxtaposition of unexpected elements [37, 44]. Memes that effectively utilize humor can facilitate social bonding and amplify sociopolitical discourse [3]. Creativity, encompassing originality and novelty, is critical in making memes stand out in the vast online content landscape [14].

Prior studies have employed both qualitative and quantitative methods to evaluate memes. Qualitative analyses involve content analysis of themes and cultural relevance [19, 44], while quantitative approaches use machine learning models and sentiment analysis to predict meme virality based on textual and visual attributes [13, 34]. However, evaluating creative outputs poses challenges due to the subjective nature of creativity and humor, and the dynamic, context-dependent nature of memes [8].


Figure 3: User Interface Overview: Baseline Ideation, Ideation with Chat Interface, Favorite Selection, and Final Image Creation.

3 Methodology

To explore the impact of human collaboration with LLMs on creative meme generation, we conducted a between-subject user study with three experimental groups. The following section presents the methodology of the user study.

3.1 Task

The participants’ task in the study was to generate captions for memes. More specifically, the task consisted of three steps:

Ideation. In the first step, we displayed one of six background images of popular memes (Figure 2) to the participants and asked them to come up with as many captions as they could within five minutes. We asked participants to focus their ideas on one of three topics: work, food, and sports. The goal was to keep the ideas relatively constrained, both for comparability and to avoid overwhelming users with having to come up with arbitrary ideas. The interface (Figure 3) displayed a blank meme template as well as the instructions to the user. Users could then enter any ideas for image captions, which were displayed in a list next to the instructions. Once the user had created ideas, they were also able to mark them as favorites, edit them, or remove them. For users in the treatment condition, who had access to an LLM, this part of the interface also featured a chat interface where they could prompt the LLM (Figure 3). Any responses by the LLM were additionally processed to automatically determine whether they contained any ideas. If so, the ideas were automatically extracted and added to the idea list.

Favorite Selection. Once participants had completed the ideation step, they moved on to an overview of all the ideas they had come up with (Figure 3). From this full list, they had to select their top three ideas. These three ideas were then used in the last step.

Image Creation. In the last step, we asked our participants to add their ideas as captions to the meme template. The meme editor allowed users to add text to the image in arbitrary chunks (Figure 3). Each chunk could then be positioned and resized, edited or removed.

Each experimental group used different methods for creating memes, comparing the effects of creativity driven solely by humans, human-AI collaboration, and entirely AI-driven creation.

In the first group (baseline), participants independently generated ideas and created memes without assistance from an AI tool or any other external source. The second group also consisted of human participants who had to come up with memes, but they had access to a conversational interface through which they could prompt an LLM to support them in generating ideas. In the third group, ideas were generated fully autonomously by the LLM.

Following the main study, we evaluated the memes generated by the three groups in terms of their funniness, shareability, and creativity using a second online survey.

3.2 Procedure

For generating memes, after recording their informed consent, we asked participants to spend at least four and at most five minutes on coming up with captions using our UI (Figure 4). Following this ideation phase, they selected their three favorite ideas and could then edit the image to add their ideas as captions. After generating and downloading their creation, they moved on to the next image with a new topic. Each participant had to produce memes for three different combinations of images and topics. The permutation of image and topic was selected randomly, but in a way that each participant used each image and each topic at most once. After generating ideas for three different topics and images, participants completed a survey recording feedback on their experience. The overall process of ideation, selecting favorites, editing the images, and completing the survey was scheduled to take no more than 40 minutes. For their work, participants received compensation equivalent to 15 USD.
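
The randomized pairing of images and topics described above can be sketched as follows. This is a minimal illustration, not the study's server code (which ran on NodeJS): the template identifiers and the `assign_tasks` helper are hypothetical, and the sketch only enforces the stated constraint that no image and no topic repeats for a participant.

```python
import random

# Hypothetical identifiers; the study used six meme templates and three topics.
IMAGES = [f"template_{i}" for i in range(1, 7)]
TOPICS = ["work", "food", "sports"]


def assign_tasks(rng: random.Random) -> list[tuple[str, str]]:
    """Pair three distinct images with the three topics in random order."""
    images = rng.sample(IMAGES, k=len(TOPICS))  # sampling without replacement: no image repeats
    topics = rng.sample(TOPICS, k=len(TOPICS))  # a random permutation of the topics
    return list(zip(images, topics))


if __name__ == "__main__":
    for image, topic in assign_tasks(random.Random(42)):
        print(f"{image}: {topic}")
```

Passing an explicit `random.Random` instance keeps each participant's assignment reproducible from a seed, which is one way to make such a randomization auditable afterwards.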

Following this first phase, we continued with rating the generated ideas (Figure 5). Since each participant had to select three favorite ideas for each of the three images they captioned, we gathered 882 ideas marked as favorites across both study conditions with human involvement. Due to technical problems, only 415 for the baseline condition and all 441 for the collaborative condition were usable.

For each idea, we re-generated the captioned image to ensure consistent placement of the text. We then curated the images, excluding all those where the participants clearly entered a caption not matching the task or where the length of the caption obscured the majority of the image. Since we had to exclude more than two-thirds of the images for one of the templates, we decided to fully exclude that template from the study. This left us with 335 images from the baseline and 307 images from the collaborative condition.


Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration, and AI-Driven Creation.


Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation process of memes created by humans, human-AI collaboration, and AI-driven approaches.

For rating the quality of these images, we then randomly sampled 10 images for each combination of background picture and topic, which, at five remaining images and three topics, left us with 150 images from the baseline and 150 images from the collaborative condition.
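
The per-combination sampling above (10 images for each of 5 templates × 3 topics, giving 150 images per condition) amounts to a stratified random sample. The sketch below shows one way to draw it; the record layout, the `stratified_sample` helper, and the demo pool are assumptions made for illustration and do not come from the study's data.

```python
import random
from collections import defaultdict


def stratified_sample(memes, per_cell, rng):
    """Draw `per_cell` memes at random from every (image, topic) cell."""
    cells = defaultdict(list)
    for meme in memes:
        cells[(meme["image"], meme["topic"])].append(meme)
    sample = []
    for key in sorted(cells):  # fixed iteration order keeps the draw reproducible per seed
        if len(cells[key]) < per_cell:
            raise ValueError(f"cell {key} has fewer than {per_cell} memes")
        sample.extend(rng.sample(cells[key], per_cell))
    return sample


def make_demo_pool(n_images=5, topics=("work", "food", "sports"), per_cell=20):
    """Build a made-up pool; a real pool would come from the curated study images."""
    return [
        {"image": f"template_{i}", "topic": topic, "caption": f"caption {k}"}
        for i in range(1, n_images + 1)
        for topic in topics
        for k in range(per_cell)
    ]


if __name__ == "__main__":
    picked = stratified_sample(make_demo_pool(), per_cell=10, rng=random.Random(0))
    print(len(picked))  # 5 templates x 3 topics x 10 memes per cell
```

Raising an error for under-filled cells, rather than silently sampling fewer items, makes it obvious when a template and topic combination cannot supply the required number of images.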

We then leveraged the LLM to create fully AI-generated captions for the third study condition. To this end, we prompted the model to generate captions for each combination of image and topic, giving us an additional 150 images.

For assessing the subjective quality of these images, we asked a second group of participants to complete an online survey for rating the images. In the survey, we displayed a random sample of 50 images to each participant. For each image, the participant provided feedback along three dimensions: humor, creativity and shareability. These categories were selected based on prior work.

We estimated that each rating would take 10–15 seconds, so participants should complete the task in about 10-15 minutes. For their participation, they received compensation equivalent to 10 USD.

3.3 Prompting

For conducting the study, we used the LLM in two roles. First, it was part of the UI, where participants could generate ideas with the assistance of a conversational interface. In this interface, participants were free to enter any prompt into the system. However, we set a system prompt to constrain the functionality and output of the system. This system prompt set the context for the LLM, including the fact that the goal of the system was to help users in creating meme ideas and that the tone of the interaction should be helpful and polite, and it constrained the system to produce at most three ideas in a single response. Additionally, we always sent the current image to the LLM before any user prompt. The full prompts are available in the supplementary material.

Second, we used the LLM to generate the image captions for the memes in the pure AI condition. For this, we again sent the image first and then instructed the model to “generate 20 meme captions for this [image description] about the topic of [topic]”, where [topic] was one of the three topics and [image description] was a brief description of the image of no more than 10 words. A full list of the generated captions is also part of the supplementary material.
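
As a rough illustration of the caption prompt described above, the following sketch fills a prompt template for every image and topic combination. The placeholder names and their positions in `TEMPLATE` are an assumption (the exact wording is in the paper's supplementary material), the image descriptions are invented, and no API call is made here.

```python
# Assumed reconstruction of the instruction; placeholder names are hypothetical.
TEMPLATE = "generate 20 meme captions for this {description} about the topic of {topic}"
TOPICS = ["work", "food", "sports"]

# Invented descriptions, each at most 10 words as the paper requires.
IMAGE_DESCRIPTIONS = {
    "template_1": "man pointing at a whiteboard",
    "template_2": "cat looking confused at a dinner table",
}

MAX_DESCRIPTION_WORDS = 10


def build_prompts(descriptions, topics):
    """Return one prompt string per (image, topic) combination."""
    prompts = {}
    for image_id, description in descriptions.items():
        if len(description.split()) > MAX_DESCRIPTION_WORDS:
            raise ValueError(f"description for {image_id} exceeds {MAX_DESCRIPTION_WORDS} words")
        for topic in topics:
            prompts[(image_id, topic)] = TEMPLATE.format(description=description, topic=topic)
    return prompts


if __name__ == "__main__":
    for key, prompt in build_prompts(IMAGE_DESCRIPTIONS, TOPICS).items():
        print(key, "->", prompt)
```

In the study the image itself was sent before this text instruction; the sketch covers only the text part.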

3.4 Apparatus

The user interface for the study was implemented using React, while any data collection and the interaction with the OpenAI API for GPT-4o was performed by a NodeJS server. All processing of the prompts, randomization of tasks, etc. was performed on the server to ensure the integrity of the data.

Both parts of the study were conducted fully online, using our implementation of the meme-creation interface for the first part of the study and a commercial survey platform for any subsequent surveys.

3.5 Data Collection

While participants created memes, we recorded all ideas they came up with, both text and images, as well as a full log of their interaction with the LLM and its responses on the server. For recording the subjective perception, we used a commercial survey platform. The survey included questions for the participants to self-assess their creativity, as well as the NASA-TLX and general questions about the interface and the ideation process. We used the same platform for rating the generated ideas as well. Demographic data was provided via Prolific, which we used for participant recruiting.

3.6 Participants

For this first part of the study, we recruited 124 participants using the online platform Prolific. 26 participants were excluded due to not completing the task. The number of participants was determined after an initial power analysis for the study design. Given that the success of humor can be highly dependent on language skill, we selected only participants with good English skills. Additionally, we required participants to have used an LLM interface at least once before, to ensure they would be familiar with the concepts and interactions. This resulted in a diverse participant sample from 30 different countries. Of the participants, 63 identified as male and 35 as female. The average age was 28.8 years (sd: 8.7).

For the second phase of the study, we similarly recruited a second set of $N = 100$ participants with the same prerequisites for language skills, but knowledge of LLMs was not a requirement. 98 of these completed the task, rating at least 50 images. Participants in this group were equally split between identifying as male and female, with an average age of 32.6 (sd: 11.1), and originated from 29 different countries.

4 Results

The following section will describe the quantitative findings and their statistical analysis.

4.1 Meme Creation

4.1.1 Idea Generation. During ideation, participants created an average of 6.1 ideas (sd: 3.2), with one participant managing to come up with a total of 21 ideas for one of the images. As seen in Figure 7, participants who were able to use the LLM created noticeably more ideas than the participants in the baseline group. To gain further insight into how the presence of the chat affected the ideation process, we conducted statistical hypothesis tests on the number of ideas per participant. Following a Shapiro-Wilk test to determine non-normality for both the absolute number of ideas ($W = 0.811$, $p < 0.001$) and the average number of ideas per participant ($W = 0.820$, $p < 0.001$), we used the Mann-Whitney U test. This test indicated significant differences for the absolute count ($W = 12652$, $p < 0.001$) and the average number of ideas ($W = 1519.5$, $p < 0.001$).
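
To make the group comparison above more concrete, here is a minimal pure-Python sketch of a two-sided Mann-Whitney U test using the normal approximation with tie correction. It is not the authors' analysis code (the paper does not state which software was used), the idea counts are made up for illustration, and the returned U statistic for the first sample is what some statistics packages report as W.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no continuity correction).

    Returns (U statistic of sample x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: each group of t tied values reduces the variance.
    tie_term = sum(t**3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:  # all observations identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u1, p_value


if __name__ == "__main__":
    # Made-up idea counts per participant, for illustration only.
    baseline = [3, 4, 4, 5, 5, 6, 6, 7]
    with_llm = [6, 7, 8, 8, 9, 10, 12, 15]
    u, p = mann_whitney_u(baseline, with_llm)
    print(f"U = {u}, p = {p:.4f}")
```

The normal approximation is reasonable for group sizes like those in the study (about 49 participants per condition); for very small samples an exact test would be preferable.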


Figure 6: Participants using the LLM were able to produce significantly more ideas than participants who had no external support, according to the Mann-Whitney U test (***: $p < 0.001$).
