One Does Not Simply Meme Alone: Evaluating Co-Creativity Between LLMs and Humans in the Generation of Humor

一个人无法独自创作梗图：评估大语言模型与人类在幽默生成中的协同创造力

Zhikun Wu Thomas Weber Florian Müller KTH Royal Institute of Technology LMU Munich TU Darmstadt Stockholm, Sweden Munich, Germany Darmstadt, Germany zhikun@kth.se thomas.weber@ifi.lmu.de florian.mueller@tu-darmstadt.de

吴志坤 Thomas Weber Florian Müller
瑞典斯德哥尔摩德国慕尼黑德国达姆施塔特
皇家理工学院慕尼黑大学达姆施塔特工业大学
zhikun@kth.se thomas.weber@ifi.lmu.de florian.mueller@tu-darmstadt.de

Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration Across Humor, Creativity, and Share ability Metric.

图 1: AI、人类以及人机协作在幽默、创意和可分享性指标上生成的前 4 个表情包。

Abstract

摘要

Collaboration has been shown to enhance creativity, leading to more innovative and effective outcomes. While previous research has explored the abilities of Large Language Models (LLMs) to serve as co-creative partners in tasks like writing poetry or creating narratives, the collaborative potential of LLMs in humor-rich and culturally nuanced domains remains an open question. To address this gap, we conducted a user study to explore the potential of LLMs in co-creating memes—a humor-driven and culturally specific form of creative expression. We conducted a user study with three groups of 50 participants each: a human-only group creating memes without AI assistance, a human-AI collaboration group interacting with a state-of-the-art LLM model, and an AI-only group where the LLM autonomously generated memes. We assessed the quality of the generated memes through crowd sourcing, with each meme rated on creativity, humor, and share ability. Our results showed that LLM assistance increased the number of ideas generated and reduced the effort participants felt. However, it did not improve the quality of the memes when humans were collaborated with LLM. Interestingly, memes created entirely by AI performed better than both human-only and human-AI collaborative memes in all areas on average. However, when looking at the top-performing memes, human-created ones were better in humor, while humanAI collaborations stood out in creativity and share ability. These findings highlight the complexities of human-AI collaboration in creative tasks. While AI can boost productivity and create content that appeals to a broad audience, human creativity remains crucial for content that connects on a deeper level.

协作已被证明能够增强创造力，从而产生更具创新性和有效性的成果。尽管之前的研究已经探索了大语言模型 (LLMs) 在写诗或创作叙事等任务中作为共同创作伙伴的能力，但 LLMs 在幽默丰富且文化细微的领域中的协作潜力仍然是一个开放性问题。为了解决这一差距，我们进行了一项用户研究，以探索 LLMs 在共同创作模因（一种以幽默驱动且具有文化特定性的创意表达形式）中的潜力。我们进行了用户研究，将参与者分为三组，每组 50 人：一组是仅由人类组成的组，在没有 AI 协助的情况下创作模因；一组是人类与 AI 协作组，与最先进的 LLM 模型互动；最后一组是仅由 AI 组成的组，LLM 自主生成模因。我们通过众包评估了生成的模因的质量，每个模因在创造力、幽默感和可分享性方面进行了评分。我们的结果表明，LLM 的协助增加了生成的想法数量，并减少了参与者感受到的努力。然而，当人类与 LLM 协作时，它并没有提高模因的质量。有趣的是，完全由 AI 创作的模因在所有方面的平均表现都优于仅由人类创作和人类与 AI 协作的模因。然而，当观察表现最佳的模因时，人类创作的模因在幽默感方面表现更好，而人类与 AI 协作的模因在创造力和可分享性方面脱颖而出。这些发现突显了人类与 AI 在创意任务中协作的复杂性。虽然 AI 可以提高生产力并创作出吸引广泛受众的内容，但人类的创造力对于在更深层次上产生共鸣的内容仍然至关重要。

CCS Concepts

CCS 概念

• Human-centered computing $\rightarrow$ Empirical studies in HCI.

• 以人为中心的计算 (Human-centered computing) $\rightarrow$ 人机交互中的实证研究 (Empirical studies in HCI)。

Keywords

关键词

Human-AI Collaboration, LLM, Co-Creativity, Memes, Humor

人机协作、大语言模型、共创、表情包、幽默

ACM Reference Format:

ACM 参考格式:

Zhikun Wu, Thomas Weber, and Florian Müller. 2025. One Does Not Simply Meme Alone: Evaluating Co-Creativity Between LLMs and Humans in the Generation of Humor. In 30th International Conference on Intelligent User Interfaces (IUI ’25), March 24–27, 2025, Cagliari, Italy. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3708359.3712094

Zhikun Wu, Thomas Weber 和 Florian Müller. 2025. 一个人不会简单地独自创作表情包：评估大语言模型与人类在幽默生成中的共创能力。在第30届国际智能用户界面会议 (IUI ’25) 上，2025年3月24日至27日，意大利卡利亚里。ACM，纽约，美国，11页。https://doi.org/10.1145/3708359.3712094

1 Introduction

1 引言

From co-authoring articles to discussing vacation plans or pair programming, collaborating with other people is a core part of our daily lives in many ways. While we could perform all these activities on our own, the diverse perspectives and problem-solving strategies [22], increased motivation in groups [22] and continuous feedback [52] help to support creative processes [2]. Prior work showed that such collaborative work increases the performance of project teams [12], improves quality [53], and, in particular, enhances creativity [38] in numerous domains. With the rise of Large Language Models (LLMs), there is a growing trend of replacing human collaboration partners with such systems in typically creative activities in areas such as art [30, 58], music [16, 17, 21], and literature [28, 45, 50, 56]. In these areas, LLMs have demonstrated remarkable capabilities in generating content, often matching or even surpassing human performance in tasks like divergent thinking, which involves producing numerous and varied ideas [29].

从合著文章到讨论假期计划或结对编程，与其他人合作在许多方面是我们日常生活的核心部分。虽然我们可以独自完成所有这些活动，但多样化的视角和问题解决策略 [22]、团队中增加的动机 [22] 以及持续的反馈 [52] 有助于支持创造性过程 [2]。先前的研究表明，这种协作工作提高了项目团队的绩效 [12]，改善了质量 [53]，特别是在多个领域中增强了创造力 [38]。随着大语言模型 (LLMs) 的兴起，在艺术 [30, 58]、音乐 [16, 17, 21] 和文学 [28, 45, 50, 56] 等领域的典型创造性活动中，越来越多地使用这些系统替代人类合作伙伴。在这些领域中，大语言模型在生成内容方面展示了显著的能力，通常在发散思维等任务中匹配甚至超越人类表现，发散思维涉及产生大量多样化的想法 [29]。

However, much of the prior work on the creative aspect of LLMs focuses on their outputs as standalone creations, often neglecting their potential as true co-creative partners. In these studies, the researchers presented LLMs with various tasks that typically require creativity and evaluated the result produced by the LLM [23, 25, 26, 29]. These assessments focus on attributes like originality and fluency, where LLMs demonstrate strong performance on metrics such as the Torrance Test of Creative Thinking [25, 29]. While these works offer important insights into the possibilities of such models for creative tasks, such approaches fail to capture the iterative nature of human co-creativity. Unlike simple task delegation, co-creativity involves iterative co-creation, where humans and AI systems actively refine ideas through dialog and feedback loops, aligning with established frameworks of collaborative creativity[7, 38]. Recently, research has started to investigate this co-creative process in the joint creative work of humans and LLMs in a number of domains [27, 28, 35, 48] and found that LLMs can contribute novel suggestions, enrich human ideas and enhance the joint creative output [4, 36].

然而，之前关于大语言模型 (LLM) 创意方面的研究大多集中在将其输出视为独立的创作，往往忽视了它们作为真正共创伙伴的潜力。在这些研究中，研究人员为大语言模型提供了各种通常需要创造力的任务，并评估了大语言模型产生的结果 [23, 25, 26, 29]。这些评估侧重于原创性和流畅性等属性，大语言模型在诸如托兰斯创造性思维测试 (Torrance Test of Creative Thinking) 等指标上表现出色 [25, 29]。虽然这些研究为这类模型在创意任务中的可能性提供了重要见解，但这种方法未能捕捉到人类共创的迭代性质。与简单的任务委派不同，共创涉及迭代的共同创作，人类和 AI 系统通过对话和反馈循环积极完善想法，这与既定的协作创造力框架相一致 [7, 38]。最近，研究开始探讨人类和大语言模型在多个领域中的联合创意工作中的这一共创过程 [27, 28, 35, 48]，并发现大语言模型可以提供新颖的建议，丰富人类的想法，并增强联合创意输出 [4, 36]。

Additionally, one important aspect has not yet been investigated in the area of co-creative collaboration with LLMs: humor. Humor is an interesting area because it is one of the most sophisticated and complex forms of human creativity. Humor strengthens social bonds, addresses difficult topics, and offers new perspectives on everyday situations [3]. It relies on surprise, contrast, cultural context, and emotional resonance [44]. What we think is funny depends on our personal cultural, linguistic and political background [15]. With the rise of online social networks and increasing globalization, parts of this context such as pop culture are aligning across borders, resulting in humor that incorporates elements that are personal and local as well as elements that are globally valid [43]. A wellknown example of such humor, that incorporates global and local contexts, are internet memes [1], often in the form of captioned images. Such memes, as a cultural phenomenon, have emerged as a universal language of the internet and are used to express emotions, convey messages or appropriation and re contextual iz ation familiar elements [54]. This also made them relevant for research as a means of evaluating creative humor [47]. However, while prior studies have explored the autonomous generation of memes by LLMs [49], there exists no prior work in examining how human-AI collaboration with a multimodal LLM affects the creativity, humor, and share ability of memes.

此外，在与大语言模型（LLM）的共创协作领域中，幽默这一重要方面尚未被研究。幽默是一个有趣的领域，因为它是最复杂和精妙的人类创造力形式之一。幽默能够加强社会纽带，处理困难话题，并为日常情境提供新的视角 [3]。它依赖于惊喜、对比、文化背景和情感共鸣 [44]。我们认为有趣的事物取决于个人的文化、语言和政治背景 [15]。随着在线社交网络的兴起和全球化的加剧，部分背景如流行文化正在跨国界趋同，从而产生了既包含个人和本地元素，又包含全球通用元素的幽默 [43]。互联网迷因 [1] 是这种结合全球和本地背景的幽默的著名例子，通常以带标题的图片形式出现。这种迷因作为一种文化现象，已成为互联网的通用语言，用于表达情感、传递信息或重新语境化熟悉的元素 [54]。这也使得它们成为评估创造性幽默的研究手段 [47]。然而，尽管先前的研究探索了大语言模型自主生成迷因的能力 [49]，但目前还没有研究探讨人类与多模态大语言模型的协作如何影响迷因的创造力、幽默感和可分享性。

In this paper, we add to the body of work on human-AI cocreativity by exploring the potential of LLMs as co-creative partners for generating humor.While ’collaboration’ in HCI traditionally entails shared goals and mutual interdependence between multiple human collaborators[7, 38], we follow prior work that defines co-creativity in terms of iterative, dialog-based refinement of ideas [24, 58]. In such collaborative systems, the AI can serve as a valuable, but not necessarily equal, partner in the creative process. Consequently, we investigate how people interact with a “humor assistant” in the creation of internet memes and how the availability of such an assistant affects productivity. Additionally, we evaluate how memes generated with such an assistant compare to memes that were created purely by a human and purely by AI in terms of their humor and share ability. For this, we conducted two user studies: in the first, we asked participants to generate ideas for memes, either collaborative ly, using an LLM, or without any assistance, and rate the experience. Our study showed that participants who worked with the LLM assistant generated more ideas during the meme creation process than those who worked independently while perceiving the process as less arduous. This first study also yielded 335 images from the human-only group and 307 from the collaborative group which serve as the basis for the subsequent evaluation. For it, we sampled 150 images from each group, as well as 150 fully AI generated images, and asked a second group of participants to rate then in terms of how funny they considered them, how creative, and how likely it would be that they would shared them. This evaluation showed that memes generated entirely by AI surpassed both human and human-AI collaborations in all three dimensions.

在本文中，我们通过探索大语言模型（LLM）作为生成幽默的共创伙伴的潜力，丰富了人机共创的研究领域。尽管在人机交互（HCI）中，“协作”传统上意味着多个人类协作者之间的共同目标和相互依赖 [7, 38]，但我们遵循先前的研究，将共创定义为基于对话的迭代式想法精炼 [24, 58]。在这种协作系统中，AI 可以作为创意过程中一个有价值的、但不一定平等的伙伴。因此，我们研究了人们在与“幽默助手”共同创作网络迷因时的互动方式，以及这种助手的可用性如何影响生产力。此外，我们评估了由这种助手生成的迷因与完全由人类和完全由 AI 生成的迷因在幽默性和可分享性方面的比较。为此，我们进行了两项用户研究：在第一项研究中，我们要求参与者生成迷因的想法，要么使用大语言模型进行协作，要么在没有任何帮助的情况下独立完成，并对体验进行评分。我们的研究表明，与独立工作的参与者相比，使用大语言模型助手的参与者在迷因创作过程中产生了更多的想法，同时认为这一过程不那么费力。第一项研究还从纯人类组中获得了 335 张图片，从协作组中获得了 307 张图片，这些图片为后续评估提供了基础。为此，我们从每组中抽取了 150 张图片，以及 150 张完全由 AI 生成的图片，并要求第二组参与者根据他们认为这些图片有多有趣、多有创意以及他们分享这些图片的可能性进行评分。评估结果显示，完全由 AI 生成的迷因在所有三个维度上都超过了人类和人类-AI 协作生成的迷因。

These findings indicate that while LLM assistance can significantly boost productivity and reduce perceived effort in creative tasks, it does not necessarily enhance the quality of creative output. This suggests that AI models, by drawing from vast datasets, are quite adept at producing content that appeals to a wide audience. However, many of the top-rated memes were created with human involvement, suggesting that AI models primarily produce solid but average quality, while human input can help to iterate and curate on it to lift it to a higher level of quality.

这些发现表明，虽然大语言模型 (LLM) 的辅助可以显著提高生产力并减少创意任务中的感知努力，但它并不一定能提高创意输出的质量。这表明，AI 模型通过从庞大的数据集中提取信息，能够非常熟练地生成吸引广泛受众的内容。然而，许多评分最高的表情包都是在人类参与的情况下创作的，这表明 AI 模型主要生成的是扎实但平均质量的内容，而人类的输入可以帮助迭代和筛选，从而将其提升到更高的质量水平。

These results emphasize how AI can be tool to quickly and easily produce large volumes of ideas that can often already meed a broad, average appeal. However, it also demonstrates a need for better methods, tools and processes for integrating AI into an iterative creative process where the AI may produce quantity while the humans acts as a curator that selectively pushes towards the AI towards better results. Designing smarter and more human-centered AI systems that can simplify the challenging steps of the creative process while enriching and amplifying humans’ unique creative abilities will continue to be a important challenge in the future.

这些结果强调了 AI 如何作为一种工具，能够快速且轻松地产生大量想法，这些想法通常已经能够满足广泛的、平均的吸引力。然而，它也表明需要更好的方法、工具和流程，将 AI 整合到一个迭代的创意过程中，在这个过程中，AI 可以产生数量，而人类则充当策展人，有选择地推动 AI 朝着更好的结果发展。设计更智能、更以人为中心的 AI 系统，能够简化创意过程中的挑战性步骤，同时丰富和放大人类独特的创造能力，这将继续是未来的一个重要挑战。

2 相关工作

Our research is informed by prior work on human co-creativity, the complexity of humor, human-AI collaboration in creativity, LLMs in creative content generation and the evaluation of creative outputs.

我们的研究参考了关于人类共创性、幽默的复杂性、人类与AI在创造力中的协作、大语言模型在创意内容生成中的应用以及创意输出评估的先前工作。

2.1 Human Co-Creativity and the Complexity of Humor

2.1 人类共创性与幽默的复杂性

Collaborative creativity between humans has long been recognized as a powerful means of enhancing the creative process[7]. Co-creativity allows individuals to combine diverse perspectives, skills, and ideas, leading to more innovative and high-quality outcomes [2, 38]. In group settings, social interactions can stimulate creativity by fostering motivation, providing immediate feedback, and encouraging risk-taking [22, 52].

人类之间的协作创造力长期以来被认为是增强创意过程的有力手段[7]。共同创造力使个体能够结合多样化的视角、技能和想法，从而产生更具创新性和高质量的成果[2, 38]。在团队环境中，社交互动可以通过激发动力、提供即时反馈和鼓励冒险来促进创造力[22, 52]。

Humor, as a sophisticated and complex form of human creativity, plays a significant role in social bonding and communication [39]. Creating humor is particularly challenging because it relies on timing, cultural context, shared knowledge, and the ability to subvert expectations [51]. What individuals find humorous is deeply influenced by their personal experiences, cultural backgrounds, and social environments [15]. The complexity of humor makes it a rich area for exploring the dynamics of co-creativity, as collaborators must navigate these nuances to produce content that resonates with others [32].

幽默作为一种复杂且精巧的人类创造力形式，在社会联结和沟通中扮演着重要角色 [39]。创造幽默尤其具有挑战性，因为它依赖于时机、文化背景、共享知识以及颠覆预期的能力 [51]。个人对幽默的感受深受其个人经历、文化背景和社会环境的影响 [15]。幽默的复杂性使其成为探索共创动态的丰富领域，因为合作者必须驾驭这些细微差别，以产生能引起他人共鸣的内容 [32]。

Figure 2: Mapping of Meme Templates to Topics (Work, Food, Sports) in the Study.

图 2: 研究中 Meme 模板到主题（工作、食物、运动）的映射。

2.2 Human-AI Collaboration in Creativity

2.2 人类与AI在创造力中的协作

Human-AI collaboration involves humans working alongside AI systems to co-create content by leveraging the strengths of both [46, 55]. In creative fields, this collaboration can enhance human creativity by providing novel ideas and alternative perspectives [24]. For instance, in collaborative writing, AI tools generate alternative text suggestions, acting as a "second mind" to stimulate divergent thinking [48]. In design, systems like the Creative Sketching Partner offer visual stimuli to help designers overcome fixation and explore new directions [31].

人机协作 (Human-AI collaboration) 是指人类与 AI 系统共同工作，通过结合双方的优势来共同创造内容 [46, 55]。在创意领域，这种协作可以通过提供新颖的想法和替代视角来增强人类的创造力 [24]。例如，在协作写作中，AI 工具生成替代文本建议，充当“第二大脑”以激发发散性思维 [48]。在设计领域，像 Creative Sketching Partner 这样的系统提供视觉刺激，帮助设计师克服固定思维并探索新方向 [31]。

In the realm of meme creation, however, there is limited research on human-AI collaborative processes. While LLMs can assist in generating meme content, the interplay between human creativity and AI-generated suggestions remains under explored. Studies in other creative domains suggest that human-AI collaboration can lead to more creative outputs than either humans or AI alone [28]. Yet, challenges persist, such as managing the AI’s lack of contextual sensitivity and ensuring that the collaboration enhances rather than hinders the creative process [42].

在表情包创作领域，关于人类与AI协作过程的研究却十分有限。尽管大语言模型能够协助生成表情包内容，但人类创造力与AI生成建议之间的相互作用仍未被充分探索。其他创意领域的研究表明，人类与AI的协作能够产生比单独由人类或AI完成的更具创意的成果 [28]。然而，挑战依然存在，例如如何应对AI缺乏情境敏感性的问题，以及确保这种协作能够增强而非阻碍创意过程 [42]。

Moreover, ethical considerations arise in human-AI co-creation, including issues of authorship and bias introduced during AI training [11, 42]. Understanding how humans interact with AI systems in creative tasks is crucial for designing tools that effectively support and enhance human creativity.

此外，人机共创中出现了伦理问题，包括作者身份和AI训练过程中引入的偏见问题 [11, 42]。了解人类在创造性任务中如何与AI系统互动，对于设计有效支持和增强人类创造力的工具至关重要。

2.3 LLMs in Creative Meme Generation

2.3 大语言模型在创意表情包生成中的应用

Large Language Models (LLMs) have demonstrated capabilities in generating human-like text, enabling applications in various creative domains [41]. Recent studies have explored the use of LLMs in autonomous creative content generation, including narratives [45, 50, 56], humor [57], and particularly memes [49].

大语言模型 (LLMs) 已经展示了生成类人文本的能力，使其能够在各种创意领域中得到应用 [41]。最近的研究探索了大语言模型在自主创意内容生成中的使用，包括叙事 [45, 50, 56]、幽默 [57]，尤其是表情包 [49]。

In the context of meme generation, MemeCraft [49] utilizes LLMs to produce stance-driven memes with minimal human intervention, showcasing the models’ ability to create con textually rich, multimodal content. However, while LLMs can generate humorous and con textually appropriate memes, they often face challenges in capturing nuanced cultural references and emotional subtleties inherent in human creativity [20, 33]. Studies have indicated that LLMs may produce homogenized content, lacking diversity and originality compared to human-generated content [6, 18].

在表情包生成的背景下，MemeCraft [49] 利用大语言模型 (LLM) 以最少的人工干预生成立场驱动的表情包，展示了模型创建内容丰富、多模态内容的能力。然而，尽管大语言模型能够生成幽默且符合上下文的表情包，但它们通常在捕捉人类创造力中固有的微妙文化参考和情感细节方面面临挑战 [20, 33]。研究表明，与人类生成的内容相比，大语言模型可能会产生同质化的内容，缺乏多样性和原创性 [6, 18]。

Despite these limitations, LLMs have been found to outperform humans in certain divergent thinking tasks, exhibiting remarkable originality and fluency [25, 29]. In humor generation, some LLMs produce jokes rated comparably to human-written humor [23, 26], although they may still fall short in capturing the depth of human humor in various contexts [20]. These findings highlight both the potential and the challenges of using LLMs for autonomous creative tasks such as meme generation.

尽管存在这些限制，大语言模型在某些发散性思维任务中表现优于人类，展现出显著的原创性和流畅性 [25, 29]。在幽默生成方面，一些大语言模型产生的笑话被评为与人类编写的幽默相当 [23, 26]，尽管它们在不同情境下捕捉人类幽默深度方面可能仍有不足 [20]。这些发现突显了使用大语言模型进行自主创意任务（如表情包生成）的潜力和挑战。

2.4 Evaluation Metrics for Creative Outputs

2.4 创意输出的评估指标

Evaluating creative outputs, such as memes, involves assessing aspects like creativity, humor, and share ability [47, 54]. Memes are unique cultural artifacts that blend visual and textual elements to convey messages resonating with diverse audiences [10]. The share ability of a meme reflects its potential to be widely circulated, influenced by factors like humor, rel at ability, and relevance to current cultural topics [34, 40].

评估创意输出（如表情包）涉及评估创造力、幽默感和可分享性等方面 [47, 54]。表情包是独特的文化产物，融合了视觉和文本元素，以传达与不同受众产生共鸣的信息 [10]。表情包的可分享性反映了其广泛传播的潜力，受幽默感、相关性和与当前文化话题的关联性等因素影响 [34, 40]。

Humor is a key driver of engagement and virality in memes, often relying on incongruity and the juxtaposition of unexpected elements [37, 44]. Memes that effectively utilize humor can facilitate social bonding and amplify sociopolitical discourse [3]. Creativity, encompassing originality and novelty, is critical in making memes stand out in the vast online content landscape [14].

幽默是表情包参与度和传播性的关键驱动力，通常依赖于不协调和意外元素的并置 [37, 44]。有效利用幽默的表情包可以促进社会联系并放大社会政治话语 [3]。创造力，包括原创性和新颖性，对于使表情包在庞大的在线内容中脱颖而出至关重要 [14]。

Prior studies have employed both qualitative and quantitative methods to evaluate memes. Qualitative analyses involve content analysis of themes and cultural relevance [19, 44], while quantitative approaches use machine learning models and sentiment analysis to predict meme virality based on textual and visual attributes [13, 34]. However, evaluating creative outputs poses challenges due to the subjective nature of creativity and humor, and the dynamic, context-dependent nature of memes [8].

先前的研究采用了定性和定量方法来评估模因。定性分析涉及主题和文化相关性的内容分析 [19, 44]，而定量方法则使用机器学习模型和情感分析，基于文本和视觉属性预测模因的传播性 [13, 34]。然而，由于创造力和幽默的主观性，以及模因的动态性和上下文依赖性，评估创意输出存在挑战 [8]。

Figure 3: User Interface Overview:Baseline Ideation, Ideation with Chat Interface, Favorite Selection, and Final Image Creation.

图 3: 用户界面概览：基线构思、聊天界面构思、收藏选择以及最终图像生成。

3 Methodology

3 方法论

To explore the impact of human collaboration with LLMs on creative meme generation, we conducted a between-subject user study with three experimental groups. The following section presents the methodology of the user study.

为了探索人类与大语言模型（LLM）协作对创意表情包生成的影响，我们进行了一项三组实验的受试者间用户研究。以下部分介绍了用户研究的方法论。

3.1 Task

3.1 任务

The participants’ task in the study was to generate captions for memes. More specifically, the task consisted of three steps:

研究参与者的任务是为表情包生成标题。具体来说，任务包括三个步骤：

Ideation In the first step, we displayed one of six background images of popular memes(Figure 2)to the participants and asked them to come up with as many captions as they could within five minutes. We asked participants to focus their ideas on one of three topics: work, food, and sports. The goal was to keep the ideas relatively constrained, for comparability but also not to overwhelm users with having to come up with arbitrary ideas. The interface(Figure 3) displayed a blank meme template as well as the instructions to the user. Users could then enter any ideas for image captions and they were then displayed in a list next to the instructions. Once the user had created ideas, they were also able to mark them as favorite, edit or remove them. For users in the treatment condition that had access to a LLM, this part of the interface also featured a chat interface where they could prompt the LLM(Figure 3). Any responses by the LLM were additionally processed to automatically determine whether the response contained any ideas. If this was the case, they were automatically extracted and added to the idea list.

构思
在第一步中，我们向参与者展示了六张流行梗图背景图片中的一张（图 2），并要求他们在五分钟内想出尽可能多的标题。我们要求参与者将想法集中在三个主题之一：工作、食物和运动。目的是让想法相对受限，以便进行比较，同时也不让用户因需要提出任意想法而感到压力。界面（图 3）显示了一个空白的梗图模板以及给用户的说明。用户可以输入任何关于图片标题的想法，这些想法随后会显示在说明旁边的列表中。一旦用户创建了想法，他们还可以将其标记为收藏、编辑或删除。对于处于实验条件且可以访问大语言模型的用户，界面的这一部分还包含一个聊天界面，他们可以在其中提示大语言模型（图 3）。大语言模型的任何响应都会经过额外处理，以自动确定响应是否包含任何想法。如果是这样，它们会自动提取并添加到想法列表中。

Favorite Selection Once participants had completed the ideation step, they moved on to an overview of all the ideas they had come up with(Figure 3). From this full list, they had to select their top three ideas. These three ideas were then used in the last step.

最喜欢的创意选择

一旦参与者完成了创意构思步骤，他们就会进入对所有想法的概览阶段（图 3）。从这份完整列表中，他们需要选出自己最喜欢的三个创意。这三个创意将在最后一步中使用。

Image Creation In the last step, we asked our participants to add their ideas as captions to the meme template. The meme editor allowed users to add text to the image in arbitrary chunks(Figure 3). Each chunk could then be positioned and resized, edited or removed.

图像创作
在最后一步，我们要求参与者将他们的想法作为标题添加到表情包模板中。表情包编辑器允许用户在图像上任意位置添加文本块（图 3）。每个文本块都可以进行定位、调整大小、编辑或删除。

Each experimental group used different methods for creating memes, comparing the effects of creativity driven solely by humans, human-AI collaboration, and entirely AI-driven creation.

每个实验组使用不同的方法创建模因，比较了仅由人类驱动的创造力、人机协作以及完全由AI驱动的创作效果。

The first group (baseline) participants independently generated ideas and created memes without external assistance by an AI tool or otherwise. The second group also involved human participants who had to come up with memes but had access to a conversational interface. Through it, they were able to prompt an LLM to support them with generating ideas. In the third group, ideas were generated fully autonomously by the LLM.

第一组（基线）参与者独立生成想法并创建表情包，没有借助AI工具或其他外部帮助。第二组同样由人类参与者组成，他们需要构思表情包，但可以通过对话界面访问大语言模型 (LLM) 来支持他们的想法生成。第三组中，想法完全由大语言模型自主生成。

Following the main study, we evaluated the memes generated by the three groups in terms of their funniness, share ability, and creativity using a second online survey.

在主研究之后，我们通过第二次在线调查评估了三组生成的模因在趣味性、可分享性和创造力方面的表现。

3.2 Procedure

3.2 流程

For generating memes, after recording for their informed consent, we asked participants to spend at least four and at most five minutes on coming up with captions using our UI (Figure 4). Following this ideation phase, they selected their three favorite ideas and could then edit the image to add their idea as captions. After generating and downloading their creation, they moved on to the next image with a next topic. Each participant had to produce memes for three different combinations of images and topics. The permutation of image and topic was selected randomly but in a way that each participant used each image and each topic at most once. After generating ideas for three different topics and images, participants completed an survey recording feedback on their experience. The overall process of ideation, selecting favorites, editing the images, and completing the survey was scheduled to take no more than 40 minutes. For their work, participants received compensation equivalent to 15 USD.

在生成表情包的过程中，我们在获得参与者的知情同意后，要求他们使用我们的用户界面（图 4）花费至少四分钟、最多五分钟的时间来构思标题。在构思阶段结束后，他们选择自己最喜欢的三个想法，并可以编辑图像以将他们的想法添加为标题。生成并下载他们的创作后，他们继续处理下一个主题的图像。每位参与者需要为三种不同的图像和主题组合生成表情包。图像和主题的排列是随机选择的，但确保每位参与者最多只使用一次每个图像和每个主题。在为三个不同的主题和图像生成想法后，参与者完成了一项调查，记录他们对体验的反馈。整个构思、选择最喜欢的想法、编辑图像和完成调查的过程预计不超过40分钟。参与者完成工作后，获得了相当于15美元的报酬。

Following this first phase, we continued with rating the generated ideas (Figure 5). Since each participant had to selected three favorite ideas for each of the three images they captioned, we gathered 882 ideas marked as favorites across both study conditions with human involvement. Due to technical problems, only 415 for the baseline conditions and all 441 for the collaborative condition were usable.

在第一阶段之后，我们继续对生成的想法进行评分（图 5）。由于每位参与者需要为他们标注的三张图片各选择三个最喜欢想法，我们在两种研究条件下共收集了 882 个被标记为最喜欢想法。由于技术问题，基线条件下只有 415 个想法可用，而协作条件下的 441 个想法全部可用。

For each idea, we re-generated the captioned image to ensure consistent placement of the text. We then curated the images, excluding all those were the participants clearly entered a caption not matching the task or where the length of the caption obscured the majority of the image. Since for one of the images, we had to exclude more than two thirds of the images, we decided to fully exclude it from the study. This left us with 335 images from the baseline and 307 images from the collaborative condition.

对于每个想法，我们重新生成了带标题的图像，以确保文本位置的一致性。然后我们筛选了图像，排除了所有参与者明显输入了与任务不匹配的标题或标题长度遮挡了大部分图像的图片。由于其中一张图像我们不得不排除超过三分之二的图片，我们决定将其完全排除在研究之外。这使我们最终保留了基线条件下的335张图像和协作条件下的307张图像。

Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration, and AI-Driven Creation.

图 4: 表情包生成流程：人类（基线）、人机协作和 AI 驱动创作。

Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation process of memes created by humans, human-AI collaboration, and AI-driven approaches.

图 5: 表情包评估流程：该图展示了由人类、人机协作和 AI 驱动方法创建的表情包的评估过程。

For rating the quality of these images, we then randomly sampled 10 images for each combination of background picture and topic, which, at five remaining images and three topics, left us with 150 images from the baseline and 150 images from the collaborative condition.

为了评估这些图像的质量，我们随后从每个背景图片和主题的组合中随机抽取了10张图像，这样在剩余5张图像和3个主题的情况下，我们得到了150张基线图像和150张协作条件下的图像。

We then leveraged LLM to create fully AI generated captions for the third study condition. To this end, we prompted the model to generate captions for each combination of image and topic, giving us additional 150 images.

然后，我们利用大语言模型 (LLM) 为第三种研究条件创建完全由 AI 生成的标题。为此，我们提示模型为每张图片和主题的组合生成标题，从而获得了额外的 150 张图片。

For assessing the subjective quality of these images, we asked a second group of participants to complete an online survey for rating the images. In the survey, we displayed a random sample of 50 images to each participant. For each image, the participated provided feedback along three dimensions: humor, creativity and share ability. These categories were selected based on prior work.

为了评估这些图像的主观质量，我们邀请了第二组参与者完成了一项在线调查，对图像进行评分。在调查中，我们向每位参与者展示了随机抽取的50张图像。对于每张图像，参与者从三个维度提供反馈：幽默感、创造性和可分享性。这些类别的选择基于先前的研究。

We estimated that each rating would take 10–15 seconds, so participants should complete the task in about 10-15 minutes. For their participation, they received compensation equivalent to 10 USD.

我们估计每次评分需要 10–15 秒，因此参与者应在约 10-15 分钟内完成任务。作为参与奖励，他们获得了相当于 10 美元的报酬。

3.3 Prompting

3.3 提示 (Prompting)

For conducting the study, we used LLM in two functions: first, as part of the UI where participants could generate ideas with the assistance of a conversational UI. In this interface, participants were free to enter any prompt into the system. However, we set a system prompt to constrain the functionality and output of the system. This system prompt set the context for the LLM, including the fact that the goal of the system was to help users in creating meme ideas, the tone of the interaction to be helpful and polite, and it constrained the system to produce at most three ideas with a single response. Additionally, we always sent the current image to the LLM before any user prompt. The full prompts are available in the supplementary material.

为了进行研究，我们在两个功能中使用了LLM：首先，作为用户界面的一部分，参与者可以通过对话式界面生成想法。在这个界面中，参与者可以自由地向系统输入任何提示。然而，我们设置了一个系统提示来限制系统的功能和输出。这个系统提示为LLM设定了上下文，包括系统的目标是帮助用户创建表情包想法，交互的语气是友好和礼貌的，并且限制系统在单次响应中最多生成三个想法。此外，我们在任何用户提示之前都会将当前图像发送给LLM。完整的提示可以在补充材料中找到。

Secondly, we used the LLM to generate image captions for generating the memes for the pure AI condition. For this, we again sent the image first and then instructed the model to “generate 20 meme captions for this about the topic of ”, where was one of the three topics and $_{}$ was a brief description of the image of no more than 10 words. A full list of the generated captions is also part of the supplementary material.

其次，我们使用大语言模型生成图像描述，以便为纯AI条件生成模因。为此，我们再次先发送图像，然后指示模型“为这张生成20个关于主题的模因标题”，其中是三个主题之一，$_{}$是对图像的不超过10个词的简要描述。生成的标题完整列表也是补充材料的一部分。

3.4 Apparatus

3.4 装置

The user interface for the study was implemented using React while any data collection and the interaction with the OpenAI API for GPT-4o was performed by a NodeJS server. All processing of the prompts, random iz ation of tasks, etc. was performed on the server to ensure the integrity of the data.

该研究的用户界面使用 React 实现，而所有数据收集以及与 OpenAI API 的 GPT-4o 交互则由 NodeJS 服务器执行。所有提示的处理、任务的随机化等都在服务器上执行，以确保数据的完整性。

Both parts of the study were conducted fully online using our implementation of the meme-creation interface for the first part of the study and a commercial survey platforms for any subsequent surveys.

研究的两个部分均完全在线进行，第一部分使用了我们实现的 meme 创建界面，后续调查则使用了商业调查平台。

3.5 Data Collection

3.5 数据收集

While participants created memes, we recorded all ideas they came up with, both text and images, as well as a full log of their interaction with the LLM and its responses on the server. For recording the subjective perception, we used a commercial survey platform. The survey included questions for the participants to self-assess their creativity, as well as the NASA-TLX and general questions about the interface and the ideation process. We used the same platform for rating the generated ideas as well. Demographic data was provided via Prolific, which we used for participant recruiting.

在参与者创作表情包时，我们记录了他们在服务器上提出的所有想法，包括文本和图像，以及他们与大语言模型交互的完整日志及其响应。为了记录主观感受，我们使用了一个商业调查平台。调查包括参与者自我评估创造力的问题，以及NASA-TLX量表和关于界面及构思过程的常规问题。我们使用相同的平台对生成的想法进行评分。人口统计数据通过Prolific提供，我们使用该平台进行参与者招募。

3.6 Participants

3.6 参与者

For this first part of the study, we recruited 124 participants using the online platform Prolific. 26 participants were excluded due to not completing the task. The number of participants was determined after an initial power analysis for the study design. Given how the success of humor can be highly dependent on language skill, we selected only participants with good English skills. Additionally, we required participants to have used a LLM interface before at least once, to ensure they would be familiar with the concepts and interactions. This resulted in a diverse participant sample from 30 different countries. Of the participant, 63 indicated to identify as male and 35 as female. The average age was 28.8 years (sd: 8.7).

在本研究的第一部分，我们通过在线平台 Prolific 招募了 124 名参与者。其中 26 名参与者因未完成任务而被排除。参与者的数量是根据研究设计的初步功效分析确定的。鉴于幽默的成功与否高度依赖于语言能力，我们仅选择了英语能力较好的参与者。此外，我们要求参与者至少使用过一次大语言模型 (LLM) 界面，以确保他们熟悉相关概念和交互方式。最终，我们获得了来自 30 个不同国家的多样化参与者样本。在这些参与者中，63 人表示自己为男性，35 人表示自己为女性。平均年龄为 28.8 岁（标准差：8.7）。

For the second phase of the study, we simiarly recruited a second set of $\mathtt{N=}100$ participants with the same prerequisites for language skills but knowledge of LLMs was not a requirement. 98 of these completed the task, rating at least 50 images. Participants in this group were equally split between identifying as male and female with an average age of 32.6 (sd: 11.1) and originating in 29 different countries.

在研究的第二阶段，我们同样招募了第二组 $\mathtt{N=}100$ 名参与者，他们具备相同的语言技能前提条件，但不要求具备大语言模型 (LLM) 的知识。其中 98 人完成了任务，至少对 50 张图片进行了评分。该组的参与者在性别上平均分配，平均年龄为 32.6 岁（标准差：11.1），来自 29 个不同的国家。

4 Results

4 结果

The following section will describe the quantitative findings and their statistical analysis.

以下部分将描述定量研究结果及其统计分析。

4.1 Meme Creation

4.1 表情包创作

4.1.1 Idea Generation. During ideation, participants created an average of 6.1 ideas (sd: 3.2) with one participant managing to come with a total of 21 ideas for one of the images. As seen in Figure 7, participants that were able to use the LLM created noticeably more ideas than the participants in the baseline group. To get further insights how the presence of the chat affected the ideation process, we conducted statistical hypothesis tests on the number of ideas per participant. Following a Shapiro-Wilk test to determine nonnormality for both the absolute number of ideas $\langle W=0.811$ , $\mathcal{p}<$ 0.001) and the average number of ideas per participant $\langle W=0.820$ , $p<0.001\$ ), we used the Mann-Whitney-U test. This test indicated significant differences for the absolute count $^W=12652,p<0.001)$ and the average number of ideas (1519.5, $p<0.001]$ ).

4.1.1 创意生成。在创意生成阶段，参与者平均生成了6.1个创意（标准差：3.2），其中一位参与者为其中一张图片生成了多达21个创意。如图7所示，能够使用大语言模型的参与者明显比基线组的参与者生成了更多的创意。为了进一步了解聊天功能的存在如何影响创意生成过程，我们对每位参与者的创意数量进行了统计假设检验。通过Shapiro-Wilk检验确定绝对创意数量（$\langle W=0.811$，$\mathcal{p}<$ 0.001）和每位参与者的平均创意数量（$\langle W=0.820$，$p<0.001\$）的非正态性后，我们使用了Mann-Whitney-U检验。该检验表明绝对数量（$^W=12652,p<0.001$）和平均创意数量（1519.5，$p<0.001]$）存在显著差异。

Figure 6: Participants using the LLM were able to produce significantly more ideas than participants who had no external support, according to the Mann-Whitney-U test $(^{\ast\ast\ast}$ : $p<0.001)$ )

图 6: 根据 Mann-Whitney-U 检验，使用大语言模型 (LLM) 的参与者能够比没有外部支持的参与者产生显著更多的想法 $(^{\ast\ast\ast}$ : $p<0.001)$ )

Figure 7: While there were no significant differences in overall workload, the “Effort” subscale of the NASA TLX was significantly different according to the Mann-Whitney-U test $(^{*}{:}p<0.05)$

图 7: 虽然总体工作量没有显著差异，但根据 Mann-Whitney-U 检验，NASA TLX 的“Effort”子量表存在显著差异 $(^{*}{:}p<0.05)$

Figure 8: Pairwise comparison of how participants rated the memes with respect to the three scales “funny”, “creative”, and “shareable”. $\mathrm{~~\ddot{~~}~}{:p<0.05}$ , **: $p<0.01$ , $^{\star\star}\colon p<0.001$ , pairwise t-test/Mann-Whitney-U test, Bonferroni adjusted)

图 8: 参与者在“有趣”、“创意”和“可分享”三个维度上对表情包进行的两两比较。$\mathrm{~~\ddot{~~}~}{:p<0.05}$，**: $p<0.01$，$^{\star\star}\colon p<0.001$，两两t检验/Mann-Whitney-U检验，Bonferroni校正）

4.1.2 Workload. While there are significant differences for the number of created ideas, there is no evidence that this also affected the workload that was required to achieve this result. Statistical analysis of the Raw TLX showed no significant differences (ShapiroWilk test: $W=0.980$ , $p=0.1632$ , t-test: $t=-0.955$ , $d f=88.811$ , $\textstyle p=0.342\rangle$ ). We found the same to be true for each of the six TLX subscales, except for the question “How hard did you have to work to accomplish your level of performance?”, where participants using the LLM entered significantly lower values (Shapiro-Wilk test: $W=$ 0.934, $p<0.001$ , Mann-Whitney-U test: $W=755$ , $p=0.027$ ).

4.1.2 工作量。尽管在生成想法的数量上存在显著差异，但没有证据表明这也会影响实现这一结果所需的工作量。Raw TLX 的统计分析显示没有显著差异 (Shapiro-Wilk 检验: $W=0.980$, $p=0.1632$, t 检验: $t=-0.955$, $d f=88.811$, $\textstyle p=0.342\rangle$)。我们发现，除了“你为了达到你的表现水平付出了多少努力？”这个问题外，六个 TLX 子量表中的每一个都显示出相同的结果。在使用大语言模型的参与者中，该问题的得分显著较低 (Shapiro-Wilk 检验: $W=0.934$, $p<0.001$, Mann-Whitney-U 检验: $W=755$, $p=0.027$)。

4.1.3 General Feedback. Similarly, of the general questions about the user’s experience while creating the memes and the ideation process, we received responses that indicated no significant differences except for two questions.

4.1.3 总体反馈。同样，在关于用户在创作表情包和构思过程中的体验的总体问题中，我们收到的回复表明，除了两个问题外，没有显著差异。

For the first of these was the question whether participants felt that they created a lot of ideas. Results here match the actual idea count, with the LLM-supported users also subjectively noting that they created sign ici ant ly more ideas (Shapiro-Wilk test: $W=0.909$ , $p<0.001$ , Mann-Whitney-U test: $W=1308.5$ , $\begin{array}{r}{p=0.043)}\end{array}$ ), although the difference is less stark than with the actual number of ideas.

第一个问题是参与者是否感到他们创造了很多想法。这里的结果与实际的想法数量相符，使用大语言模型 (LLM) 支持的用户主观上也认为他们创造了显著更多的想法 (Shapiro-Wilk 检验: $W=0.909$, $p<0.001$, Mann-Whitney-U 检验: $W=1308.5$, $\begin{array}{r}{p=0.043)}\end{array}$)，尽管差异不如实际想法数量那么明显。

For the question on perceived ownership “The generated captions are my ideas”, participants that did not use the LLM perceived a higher degree of ownership for the generated ideas (Shapiro-Wilk test: $W,=,0.0.766$ , $\textit{p}<\ 0.001$ , Mann-Whitney-U test: $W,=,562$ , $p<0.001]$ ). However, even when using the chat, participants still generally felt ownership for the ideas.

对于感知所有权的问题“生成的标题是我的想法”，未使用大语言模型的参与者对生成的想法感知到更高的所有权（Shapiro-Wilk 检验：$W,=,0.0.766$，$\textit{p}<\ 0.001$，Mann-Whitney-U 检验：$W,=,562$，$p<0.001]$）。然而，即使在使用聊天工具时，参与者通常仍然对这些想法感到所有权。

4.2 Meme Rating

4.2 表情包评分

In the second phase of our experiments, we had a group of people rate the memes according to three criteria: how funny they thought they were, how creative they considered them, and how likely it was that they would share them. Along with the memes from the two conditions before, we had an additional condition with memes created exclusively using AI with no human input.

在我们实验的第二阶段，我们让一组人根据三个标准对表情包进行评分：他们认为表情包有多有趣、他们认为表情包有多有创意，以及他们有多大可能会分享这些表情包。除了之前两种条件下的表情包外，我们还增加了一个条件，即完全使用 AI 生成且没有任何人类输入的表情包。

Considering the fact that the data from the “funny” and “creative” scale were likely normally distributed (Shapiro-Wilk test: $W\ =$ 0.994, $p=0.062$ and $W=0.995$ , $p=0.155$ respectively), we used the ANOVA and pairwise t-tests, Bonferroni adjusted, to compare these two. The third scale, “share ability” was likely not not normally distributed (Shapiro-Wilk test: $W=0.988$ , $p<0.001]$ ), we used the Kruskal-Wallis test instead as well as pairwise Mann-Whitney-U tests, also Bonferroni adjusted.

考虑到“有趣”和“创意”量表的数据可能呈正态分布（Shapiro-Wilk 检验：$W\ =$ 0.994，$p=0.062$ 和 $W=0.995$，$p=0.155$），我们使用 ANOVA 和成对 t 检验（Bonferroni 校正）来比较这两个量表。第三个量表“分享能力”可能不呈正态分布（Shapiro-Wilk 检验：$W=0.988$，$p<0.001]$），因此我们改用 Kruskal-Wallis 检验以及成对 Mann-Whitney-U 检验（同样经过 Bonferroni 校正）。

According to these tests, each condition showed significant differences, as shown in Table 1. The pairwise comparison highlighted that it were consistently the memes generated by the LLM alone that were rated more positive than those where created with human involvement. The only exception to this was “share ability” where the comparison between the cooperative and pure AI creation was not significant. Memes created by humans with the help of the LLM were not rated significantly different than those from the baseline, i.e. without AI, for any of the three dimensions.

根据这些测试，每种条件都显示出显著差异，如表 1 所示。成对比较表明，由大语言模型单独生成的 meme 始终比人类参与创建的 meme 获得更积极的评价。唯一的例外是“分享能力”，其中合作创作与纯 AI 创作之间的比较并不显著。在大语言模型的帮助下由人类创建的 meme 在三个维度中的任何一个方面与基线（即没有 AI 的情况）相比都没有显著差异。

To ensure any unintended side-effects by of the image or topic selection, we performed the same statistical analysis to determine whether the image or the topic had any notable influence on the rating. This showed that the ratings across the images are relatively consistent, with only one pairing of images showing significantly different ratings for the question how funny the memes were perceived. The topic, on the other hand, seems to have had an impact on the rating of the memes, since work related memes were consistently rated significantly more funny, creative, and shareable.

为了确保图像或主题选择不会产生任何意外的副作用，我们进行了相同的统计分析，以确定图像或主题是否对评分有显著影响。结果显示，不同图像的评分相对一致，只有一对图像在“表情包有多有趣”这一问题的评分上存在显著差异。另一方面，主题似乎对表情包的评分有影响，因为与工作相关的表情包在有趣、创意和可分享性方面始终获得显著更高的评分。

We therefore analyzed the three scales from the survey again for each topic individually (see Table 1), which demonstrated that the previously found significant differences do not come evenly from across all data but seem to stem primarily from the memes about the topic of “work”.

因此，我们再次针对每个主题分别分析了调查中的三个量表（见表 1），结果表明，之前发现的显著差异并非均匀地来自所有数据，而是主要源于与“工作”主题相关的模因。

5 Discussion

5 讨论

The results of our study provide some early insights into how the availability of LLM support influences people’s creative process. Further, we investigated how the output of human-LLM co-creation is viewed, compare to purely LLM-generated content. In the following, we discuss our results with regard to the research questions.

我们的研究结果初步揭示了LLM支持的可获得性如何影响人们的创作过程。此外，我们还探讨了与纯LLM生成内容相比，人类-LLM共创的输出是如何被看待的。接下来，我们将围绕研究问题讨论我们的结果。

Table 1: Results of the statistical analysis of the meme ratings using ANOVA and Kruska-Wallis test (underlined).

表 1: 使用 ANOVA 和 Kruskal-Wallis 检验（下划线）对表情包评分进行统计分析的结果。

Shapiro-Wilk 检验所有主题 W JP F/x2 P ANOVA/ Kruskal-Wallis 检验

有趣创意 0.994 0.062 2

可分享 0.995 0.988 0.155 2

关于“工作”的表情包关于“运动”的表情包 0.001 2 11.761

Shapiro-Wilk ANOVA/KW Shapiro-Wilk ANOVA/KW Shapiro-Wilk 关于“食物”的表情包 ANOVA/KW

W P JP F/x2 P

有趣 0.984 0.075 2 7.1

创意 0.981 0.034 2 11.22

可分享 0.981 0.033 2 8.470

5.1 LLM support increases content output without increasing effort but might diminish the feeling of ownership

5.1 大语言模型支持在不增加努力的情况下增加内容输出，但可能会削弱所有权感

In our study, participants who worked with the LLM assistant came up with significantly more ideas when creating memes compared to those participants working alone. Interestingly, even though they generated more ideas, they did not feel like the task was more demanding. The NASA-TLX results showed that the overall workload was not much different between the two groups, but participants in the LLM-assisted group did report that they had to put in less effort.

在我们的研究中，与单独工作的参与者相比，使用大语言模型 (LLM) 助手的参与者在创作表情包时提出了显著更多的想法。有趣的是，尽管他们生成了更多的想法，但他们并不觉得任务变得更加繁重。NASA-TLX 结果显示，两组之间的总体工作量没有太大差异，但使用 LLM 助手的参与者确实报告说他们需要付出的努力更少。

These results suggest that using the LLM can make the creative process of crafting humorous content more efficient by helping people generate more ideas without feeling overwhelmed. The AI assistant seemed to help them explore more options without putting in extra effort. This is consistent with previous studies, which suggest that AI tools can support users in generating more creative ideas by reducing obstacles associated with brainstorming and creative development[24, 31].

这些结果表明，使用大语言模型 (LLM) 可以使创作幽默内容的过程更加高效，帮助人们产生更多想法而不会感到不知所措。AI 助手似乎帮助他们探索更多选项，而无需付出额外的努力。这与之前的研究一致，表明 AI 工具可以通过减少与头脑风暴和创意开发相关的障碍来支持用户产生更多创意想法 [24, 31]。

Additionally, participants who used the LLM reported a slightly reduced sense of ownership over their creations, indicating that AI assistance might affect the user’s connection to their work. However, participants generally still felt that they owned the idea. We attribute this finding to the fact that, in our online testing environment, AI assistance mainly contributed during the idea generation stage, whereas in the stages of idea screening and creating the meme image, no AI assistance or suggestions were provided. The final decision was always left up to the participants. Since feeling a sense of ownership and personal investment plays a significant role in creative motivation and satisfaction[5], it is essential to think about how we can balance the involvement of AI in the creative process.

此外，使用大语言模型（LLM）的参与者报告称，他们对创作的归属感略有下降，这表明 AI 辅助可能会影响用户与其作品的连接。然而，参与者通常仍然认为他们拥有创意。我们将这一发现归因于在我们的在线测试环境中，AI 辅助主要在创意生成阶段发挥作用，而在创意筛选和制作表情包图像的阶段，没有提供 AI 辅助或建议。最终的决定权始终掌握在参与者手中。由于归属感和个人投入感在创造动机和满意度中起着重要作用 [5]，因此思考如何平衡 AI 在创作过程中的参与至关重要。

5.2 The increased productivity of human-AI teams does not lead to better results - just to more results

5.2 人机协作生产力的提升并未带来更好的结果，只是产生了更多的结果

The participants in our study developed more ideas in collaboration with the AI than when they worked alone. However, this did not translate into higher quality in the memes selected by our participants - which always happened without LLM support - in terms of the metrics we collected. This raises questions about the link between quantity and quality in human-AI collaboration. Although coming up with more ideas could increase the chances of producing something high-quality, our study did not find a significant difference.

在我们的研究中，参与者与AI合作时比单独工作时产生了更多的想法。然而，这并没有转化为参与者选择的模因（meme）在质量上的提升——这些选择总是在没有大语言模型支持的情况下进行的——根据我们收集的指标。这引发了关于人机协作中数量与质量之间关系的问题。尽管产生更多想法可能会增加产生高质量内容的机会，但我们的研究并未发现显著差异。

In relation to our findings, several prior studies suggest similar outcomes in the context of human-AI co-creativity. For example, Wan et al. [48] found that participants who used an LLM during prewriting activities produced more creative ideas. However, these ideas were not significantly better in quality compared to those created by participants working alone. Similarly, Rezwana and Maher[42]pointed out ethical and practical challenges in humanAI co-creativity. They noted that while AI can help generate more content, it does not always improve the depth or quality of the work because it struggles with understanding the context and subtleties of creativity. These studies align with our results, showing that while human-AI collaboration can increase productivity and the number of ideas, it does not always lead to higher-quality output.

与我们的发现相关，几项先前的研究在人机共创的背景下提出了类似的结果。例如，Wan 等人 [48] 发现，在预写活动中使用大语言模型的参与者产生了更多创意。然而，这些创意在质量上并没有显著优于那些单独工作的参与者所创造的创意。同样，Rezwana 和 Maher [42] 指出了人机共创中的伦理和实践挑战。他们指出，虽然 AI 可以帮助生成更多内容，但它并不总能提高工作的深度或质量，因为它在理解创意背景和细微差别方面存在困难。这些研究与我们的结果一致，表明虽然人机协作可以提高生产力和创意数量，但并不总能带来更高质量的产出。

From this, we conclude that the use of AI support in the context of the metrics investigated in this study does not lead to better results in terms of humor. Conversely, we can also assume that users with AI support achieve a consistent result faster and with more variations, without this representing an additional mental burden for the users.

由此我们得出结论，在本研究调查的指标背景下，使用AI支持并不会在幽默方面带来更好的结果。相反，我们也可以假设，获得AI支持的用户能够更快且以更多变化达到一致的结果，而不会给用户带来额外的心理负担。

5.3 LLMs appeal to a broad taste in humor, but humans can be wittier still

5.3 大语言模型 (LLM) 对幽默的广泛吸引力，但人类仍能更机智

Over all assessed metrics, we found that memes created solely by AI performed better than memes created solely by humans or in collaboration between humans and AI. We attribute this initially quite surprising result to the following: The LLM used was trained on large data sets with many cultural references and different types of humor. During the training process, the LLM was most likely to learn the types of humor that it saw most frequently during the training process, i.e. those that resonated best with the crowd. As a result, such LLMs are good at creating content that appeals to a wide audience. The AI picks up on general trends in its training data, which helps it to produce content that most people find appealing. On the other hand, human-created content tends to draw from personal experiences and specific cultural backgrounds[9]. This broad appeal of AI-generated memes, however, may also be influenced by the composition of the evaluators. The people in the online study, as well as those who evaluated the memes, were random users from a crowd sourcing platform. While this approach brings in a variety of perspectives, it might not capture the subtle differences in humor appreciation across specific demographic groups. Humor is personal and influenced by factors like life experiences, cultural background, and social norms. Therefore, understanding these subtleties would likely require a more targeted evaluation across specific audience segments, which could provide deeper insights into how different groups perceive humor. Even when people work with AI, they often rely on their own experiences when choosing ideas, which creates a similar challenge. It is hard to compete with the AI’s ability to cater to popular tastes.

在所有评估的指标中，我们发现完全由 AI 生成的模因（meme）表现优于完全由人类生成或由人类与 AI 协作生成的模因。我们将这一最初令人惊讶的结果归因于以下原因：使用的大语言模型（LLM）是在包含大量文化参考和不同类型幽默的数据集上训练的。在训练过程中，LLM 最有可能学习到在训练过程中出现频率最高的幽默类型，即那些最能引起大众共鸣的类型。因此，这类 LLM 擅长创建能够吸引广泛受众的内容。AI 能够捕捉其训练数据中的普遍趋势，这有助于它生成大多数人觉得有吸引力的内容。另一方面，人类创作的内容往往源于个人经历和特定的文化背景[9]。然而，AI 生成模因的这种广泛吸引力可能也受到评估者构成的影响。在线研究中的参与者以及评估模因的人都是来自众包平台的随机用户。虽然这种方法引入了多样化的视角，但它可能无法捕捉到特定人群在幽默欣赏上的细微差异。幽默是个人化的，受到生活经历、文化背景和社会规范等因素的影响。因此，理解这些细微差别可能需要针对特定受众群体进行更有针对性的评估，这可以提供关于不同群体如何感知幽默的更深层次见解。即使人类与 AI 合作，他们在选择创意时也常常依赖自己的经验，这带来了类似的挑战。很难与 AI 迎合大众口味的能力竞争。

This interpretation is further supported by the analysis of the top-performing memes (Figure 1). The funniest memes were mainly created by humans, while those rated highest for creativity and share ability were the result of human-AI collaborations. Among the top 4 humor memes, human creators claimed most of the top positions. In terms of creativity, humans still took half of the rankings, with the rest coming from human-AI teams. For shareability, human-AI collaborations made up half of the highest-ranked memes.

这一解释通过对表现最佳的表情包分析得到了进一步支持（图 1）。最有趣的表情包主要由人类创作，而在创意性和可分享性方面评分最高的则是人类与 AI 协作的结果。在最幽默的 4 个表情包中，人类创作者占据了大部分高位。在创意性方面，人类仍然占据了一半的排名，其余来自人类与 AI 团队。在可分享性方面，人类与 AI 协作占据了最高排名表情包的一半。

These results highlight that while AI is effective at creating broadly appealing content on average, the individual human touch resonates most deeply on certain dimensions. The top-ranked human creators likely brought in personal experiences, cultural nuances, or innovative ideas that AI, limited to patterns from existing data, cannot fully replicate. At the same time, human-AI collaboration showed real potential, particularly in creativity and share ability, suggesting that AI can offer new ideas or perspectives that, combined with human creativity, lead to content that’s both original and widely appealing.

这些结果表明，虽然AI在平均意义上能够有效创作出广泛吸引人的内容，但在某些维度上，人类的个人触感最能引起深刻共鸣。排名靠前的人类创作者可能融入了个人经历、文化细微差别或创新思维，这些都是AI仅凭现有数据模式无法完全复制的。同时，人机协作展现出了真正的潜力，特别是在创造力和分享性方面，这表明AI能够提供新的想法或视角，与人类的创造力相结合，从而产生既具原创性又广受欢迎的内容。

6 Limitations and Future Work

6 局限性与未来工作

We are convinced that the presented user study provides valuable insights into the creativity process when generating humorous content together with LLMs and the perceived funniness of AI-generated content compared to human (co-) authored content. However, the design and results of our study imply a number of limitations and directions for future work, which we discuss below.

我们相信，所呈现的用户研究为与LLMs共同生成幽默内容时的创意过程以及AI生成内容与人类（共同）创作内容相比的趣味性感知提供了宝贵的见解。然而，我们的研究设计和结果暗示了一些局限性和未来工作的方向，我们将在下面讨论。

6.1 Short-term vs. long-term interaction with the AI

6.1 与AI的短期与长期交互

Our study only investigated short-term interactions with the system in a single session. We did not explore how prolonged use of such an AI support system might affect creativity, satisfaction, or the development of new or improved skills. This short-term focus limits our understanding of how users’ creative strategies evolve or how they rely on AI over time, which could lead to a decline in quality of the output. Future studies could therefore give participants the opportunity to use AI tools over a longer period of time. By following changes in creative strategies over time, reliance on AI and sense of ownership of generated content, this could provide a better understand the long-term effects of AI collaboration in creative tasks.

我们的研究仅调查了单次会话中与系统的短期互动。我们没有探讨长期使用此类AI支持系统可能如何影响创造力、满意度或新技能的发展与提升。这种短期关注限制了我们理解用户的创意策略如何演变或他们如何随时间依赖AI，这可能导致输出质量的下降。因此，未来的研究可以为参与者提供更长时间使用AI工具的机会。通过跟踪创意策略随时间的变化、对AI的依赖以及对生成内容的所有权感，这可以更好地理解AI在创意任务中合作的长期影响。

6.2 Limited collaboration between humans and the LLM

6.2 人类与大语言模型的有限协作

We found that many participants did not fully utilize the potential of the LLM. Some users limited themselves to fulfilling the minimum requirements without revising their ideas in collaboration with the system. Less than half of the participants interacted with the LLM multiple times, and only six participants had more than eight interactions, possibly affecting the quality of their results. A possible reason for this might be our implementation of the study interface. Although the chatbox interface used during the idea generation phase was familiar and easy to use, it lacked structure to guide the creative process. This open-ended approach resulted in varied interactions depending on participants’ individual backgrounds and ideation strategies. Further reasons could include the setting of un ambitious goals for the participants, or even an inherent problem with crowd sourcing platforms for conducting such studies [? ]. Future work could integrate more structured prompts or more collaborative tools into AI systems to encourage deeper engagement and iterative idea development.

我们发现许多参与者并未充分利用大语言模型的潜力。一些用户仅限于满足最低要求，而没有与系统协作修改他们的想法。不到一半的参与者与大语言模型进行了多次互动，只有六位参与者进行了超过八次互动，这可能影响了他们的结果质量。一个可能的原因是我们研究界面的实现方式。尽管在创意生成阶段使用的聊天框界面熟悉且易于使用，但它缺乏引导创意过程的结构。这种开放式方法导致了根据参与者个人背景和创意策略的不同互动。进一步的原因可能包括为参与者设定的目标不够雄心勃勃，甚至是众包平台进行此类研究的固有问题[?]。未来的工作可以将更多结构化的提示或更多协作工具集成到AI系统中，以鼓励更深入的参与和迭代的创意开发。

6.3 Cultural and Social Influence on Humor and Creativity

6.3 文化和社会对幽默与创造力的影响

What we find funny is heavily influenced by personal backgrounds such as social and cultural factors. In our study, the participants who contributed content and those who rated content came from a wide range of different cultural and social backgrounds. This approach is likely to have led to a wide range of interpretations and preferences for what is considered humorous or creative, thus impacting the outcomes. However, we did not systematically collect detailed demographic information (e.g., ethnicity or language proficiency), limiting our ability to fully explore how varying cultural backgrounds shape humor perception and meme creation. Future work should gather richer demographic and qualitative data—for instance, through open-ended survey questions or interviews—to capture the nuances in how different cultural, linguistic, or social backgrounds perceive and produce humor. Such qualitative insights could also reveal how participants interpret creativity in the context of meme-making, providing a more holistic picture of why certain ideas resonate (or fail to resonate) with different audiences.

我们认为有趣的事物很大程度上受到个人背景的影响，如社会和文化因素。在我们的研究中，贡献内容的参与者和评价内容的参与者来自广泛不同的文化和社会背景。这种方法可能导致了对幽默或创意的广泛解释和偏好，从而影响了结果。然而，我们并未系统地收集详细的人口统计信息（例如，种族或语言能力），这限制了我们全面探索不同文化背景如何塑造幽默感知和表情包创作的能力。未来的工作应收集更丰富的人口统计和定性数据——例如，通过开放式调查问题或访谈——以捕捉不同文化、语言或社会背景如何感知和产生幽默的细微差别。这些定性见解还可以揭示参与者在表情包制作背景下如何解释创意，从而更全面地了解为什么某些想法能够（或未能）引起不同受众的共鸣。

7 Conclusion

7 结论

In this paper, we examined the role of LLMs as co-creators in generating humorous content, focusing on the creation of internet memes. Our findings demonstrated that participants who collaborated with an LLM assistant produced a significantly higher number of ideas without reporting an increase in perceived workload, suggesting improvements in both productivity and efficiency. However, this increase in ideas did not consistently lead to higher-quality content when humans were involved. Memes created through human-AI collaboration were rated about the same as those made by humans alone in terms of humor, creativity, and share ability. Interestingly, memes generated entirely by AI scored better, on average, than both human-only and human-AI collaborative memes. But when we looked at the top-performing memes, human-created content was strongest in humor, while human-AI collaborations excelled in creativity and share ability.

在本文中，我们探讨了大语言模型 (LLM) 作为共同创作者在生成幽默内容中的作用，重点关注互联网迷因的创作。我们的研究结果表明，与LLM助手合作的参与者产生了显著更多的想法，且没有报告感知工作量的增加，这表明生产力和效率都有所提高。然而，当人类参与时，这种想法的增加并不总是导致更高质量的内容。通过人类与AI合作创作的迷因在幽默性、创造性和可分享性方面与仅由人类创作的迷因评分相当。有趣的是，完全由AI生成的迷因平均得分高于仅由人类创作和人类与AI合作的迷因。但当我们观察表现最佳的迷因时，人类创作的内容在幽默性方面最强，而人类与AI合作在创造性和可分享性方面表现突出。

These findings show that human-AI collaboration in creative tasks is complex. While AI can increase productivity and produce content that appeals to a wide audience, human creativity is still key for creating content that connects more deeply in certain areas. Participants working with the LLM reported feeling less ownership over their work, suggesting that integrating AI into the creative process needs to be done carefully to keep users connected to their creations. Also, the short-term nature of our study, the limited use of the AI’s full potential due to the open-ended interface, and the similar backgrounds of our participants suggest that more research is needed to understand the long-term effects of AI assistance on creativity and collaboration.

这些发现表明，在创意任务中，人类与AI的合作是复杂的。虽然AI可以提高生产力并创作出吸引广泛受众的内容，但在某些领域，人类创造力仍然是创作出更深层次连接内容的关键。与大语言模型合作的参与者报告称，他们对作品的归属感较低，这表明在创意过程中整合AI需要谨慎进行，以保持用户与其创作的连接。此外，我们研究的短期性质、由于开放式界面而未能充分利用AI的全部潜力，以及参与者的相似背景，都表明需要更多的研究来理解AI辅助对创造力和合作的长期影响。

Looking forward, future studies should explore how long-term use of AI affects creative strategies, satisfaction, and skill development. AI interfaces could be improved by providing more structured guidance and encouraging deeper engagement, helping users make better use of AI’s capabilities while maintaining ownership over their work. By addressing these challenges, we can develop smarter, more human-centered AI systems that not only boost productivity but also enhance human creativity.

展望未来，未来的研究应探索长期使用 AI 如何影响创意策略、满意度和技能发展。AI 界面可以通过提供更有条理的指导和鼓励更深入的参与来改进，帮助用户更好地利用 AI 的能力，同时保持对其工作的所有权。通过应对这些挑战，我们可以开发出更智能、更以人为中心的 AI 系统，这些系统不仅能提高生产力，还能增强人类的创造力。

Shapiro-Wilk 检验	所有主题	W	JP F/x2 P	ANOVA/ Kruskal-Wallis 检验
有趣创意		0.994	0.062	2
可分享		0.995 0.988	0.155	2
关于“工作”的表情包	关于“运动”的表情包	0.001	2	11.761
Shapiro-Wilk ANOVA/KW	Shapiro-Wilk ANOVA/KW	Shapiro-Wilk	关于“食物”的表情包	ANOVA/KW
W	P	JP	F/x2	P
有趣	0.984	0.075	2	7.1
创意	0.981	0.034	2	11.22
可分享	0.981	0.033	2	8.470

[论文翻译]一个人无法独自创作梗图：评估大语言模型与人类在幽默生成中的协同创造力

原文地址：https://arxiv.org/pdf/2501.11433

One Does Not Simply Meme Alone: Evaluating Co-Creativity Between LLMs and Humans in the Generation of Humor

一个人无法独自创作梗图：评估大语言模型与人类在幽默生成中的协同创造力

Abstract

摘要

CCS Concepts

Keywords

关键词

ACM Reference Format:

ACM 参考格式:

1 Introduction

1 引言

2 Related Work

2 相关工作

2.1 Human Co-Creativity and the Complexity of Humor

2.1 人类共创性与幽默的复杂性

2.2 Human-AI Collaboration in Creativity

2.2 人类与AI在创造力中的协作

2.3 LLMs in Creative Meme Generation

2.3 大语言模型在创意表情包生成中的应用

2.4 Evaluation Metrics for Creative Outputs

2.4 创意输出的评估指标

3 Methodology

3 方法论

3.1 Task

3.1 任务

最喜欢的创意选择

3.2 Procedure

3.2 流程

3.3 Prompting

3.3 提示 (Prompting)

3.4 Apparatus

3.4 装置

3.5 Data Collection

3.5 数据收集

3.6 Participants

3.6 参与者

4 Results

4 结果

4.1 Meme Creation

4.1 表情包创作

4.2 Meme Rating

4.2 表情包评分

5 Discussion

5 讨论

5.1 LLM support increases content output without increasing effort but might diminish the feeling of ownership

5.1 大语言模型支持在不增加努力的情况下增加内容输出，但可能会削弱所有权感

5.2 The increased productivity of human-AI teams does not lead to better results - just to more results

5.2 人机协作生产力的提升并未带来更好的结果，只是产生了更多的结果

5.3 LLMs appeal to a broad taste in humor, but humans can be wittier still

5.3 大语言模型 (LLM) 对幽默的广泛吸引力，但人类仍能更机智

6 Limitations and Future Work

6 局限性与未来工作

6.1 Short-term vs. long-term interaction with the AI

6.1 与AI的短期与长期交互

6.2 Limited collaboration between humans and the LLM

6.2 人类与大语言模型的有限协作

6.3 Cultural and Social Influence on Humor and Creativity

6.3 文化和社会对幽默与创造力的影响

7 Conclusion

7 结论