Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles

Abstract

Assessing the effectiveness of large language models (LLMs) in performing different tasks is crucial for understanding their strengths and weaknesses. This paper presents the Hierarchical Prompting Taxonomy (HPT), grounded in human cognitive principles and designed to assess LLMs by examining the cognitive demands of various tasks. The HPT utilizes the Hierarchical Prompting Framework (HPF), which arranges five unique prompting strategies in a hierarchy based on the cognitive demand they place on LLMs relative to human mental capabilities. It assesses the complexity of tasks with the Hierarchical Prompting Index (HPI), which demonstrates the cognitive competencies of LLMs across diverse datasets and offers insights into the cognitive demands that datasets place on different LLMs. This approach enables a comprehensive evaluation of an LLM’s problem-solving abilities and the intricacy of a dataset, offering a standardized metric for task complexity. Extensive experiments with multiple datasets and LLMs show that HPF enhances LLM performance by 2%-63% compared to baseline performance. GSM8k is the most cognitively complex task among reasoning and coding tasks, with an average HPI of 3.20, confirming the effectiveness of HPT. To support future research and reproducibility in this domain, the implementations of HPT and HPF are publicly available.

Code: https://github.com/devichand579/HPT

1 Introduction

Large Language Models (LLMs) have revolutionized natural language processing (NLP), enabling significant advancements in a wide range of applications. Conventional evaluation frameworks often apply a standard prompting approach to assess different LLMs, regardless of the complexity of the task, which may result in biased and suboptimal outcomes. Moreover, applying the same prompting approach across all samples within a dataset, without considering each sample’s relative complexity, compounds this unfairness. To achieve a more balanced evaluation framework, it is essential to account for both the task-solving ability of LLMs and the varying cognitive complexities of the dataset samples. This limitation highlights the need for more sophisticated evaluation methods that can adapt to varying levels of task complexity. Within this study, complexity is defined as the cognitive demands associated with solving a task or the cognitive load introduced by a prompting strategy on LLMs. Henceforth, the term "complexity" will be applied solely in this context.

Task complexity, within the realm of human cognition, pertains to the cognitive requirements that a task imposes, which include the diverse levels of mental effort necessary for processing, analyzing, and synthesizing information. According to Sweller (1988), tasks become more complex as they require greater cognitive resources, engaging working memory in more demanding processes such as reasoning and problem-solving. Similarly, Anderson et al. (2014) highlight that human cognitive abilities span a continuum from basic recall to higher-order thinking, with increasing difficulty correlating to tasks that demand analysis, synthesis, and evaluation. When applied to LLMs, the complexity of prompting strategies can be systematically evaluated by mapping them onto this human cognitive hierarchy. This alignment allows for an assessment of how LLMs perform tasks that reflect varying degrees of cognitive load, thereby providing a structured framework for understanding the cognitive demands associated with various tasks. By imposing this cognitive complexity framework on LLMs, this paper establishes a universal evaluation method, grounded in human cognitive principles, that enables more precise comparisons of model performance across tasks with varying levels of difficulty.

Figure 1: Hierarchical Prompting Framework includes five distinct prompting strategies, each designed for different levels of task complexity to ensure the appropriate prompt is selected for the given task. A $\checkmark$ indicates task completion, while a $\times$ signifies that the task was not completed.

This paper introduces the HPT, a set of rules that maps human cognitive principles onto the assessment of the complexity of different prompting strategies. It employs the HPF shown in Figure 1, a prompt selection framework that selects the prompt imposing the optimal cognitive load on the LLM for solving the task. HPF enhances the interaction with LLMs and improves performance across various tasks by ensuring prompts resonate with human cognitive principles. The main contributions of this paper are as follows:

HPF can be effectively compared to an "Open Book" examination as shown in Figure 2, where questions represent tasks and textbooks serve as prompting strategies. In this analogy, the exam questions vary in complexity, from simple factual recall to intricate analytical problems, analogous to the tasks in HPT that are evaluated based on their cognitive demands. Similarly, textbooks provide structured guidance for solving these questions, much like the HPF organizes prompts in increasing levels of complexity to support LLMs. For example, a straightforward glossary lookup corresponds to a low-complexity task, while solving a multi-step analytical problem requiring synthesis of concepts represents a high-complexity task. The effort a student invests in answering a question mirrors the HPI, which measures the cognitive load placed on the LLM. Just as students perform better with structured resources like textbooks, LLMs improve with well-designed hierarchical prompting strategies, enabling them to tackle progressively complex tasks effectively.

The remainder of the paper is structured as follows: Section 2 reviews the related work on prompting and evaluation in LLMs. Section 3 details the HPT and its associated frameworks. Section 4 outlines the experimental setup, results, and ablation studies. Section 5 concludes the paper. Section 6 discusses the ethical impact of the work.

Figure 2: Analogical framework comparing the HPF with the "Open Book" examination methodology. The diagram illustrates how HPF components (below) mirror traditional educational assessment elements (above), with parallel relationships between task complexity levels, resource utilization (prompts/textbooks), and performance metrics (HPI/student effort). This comparison demonstrates how LLM task complexity scales similarly to educational assessment complexity, from simple lookup tasks to complex synthesis problems.

2 Related Work

The advent of LLMs has revolutionized NLP by demonstrating significant improvements in few-shot and zero-shot learning capabilities. Brown et al. (2020) introduced GPT-3, a 175-billion-parameter autoregressive model, showcasing its ability to perform a wide range of tasks such as question-answering, reading comprehension, translation, and natural language inference without fine-tuning. This study highlighted the potential of very large models for in-context learning while also identifying limitations in commonsense reasoning and specific comprehension tasks. Similarly, Liu et al. (2021) surveyed prompt-based learning, emphasizing the role of prompt engineering in leveraging pre-trained models for few-shot and zero-shot adaptation to new tasks with minimal labeled data.

2.1 Prompt Engineering

Prompting plays a vital role in unlocking the full potential of LLMs. By designing specific input prompts, the LLM’s responses can be guided, significantly influencing the quality and relevance of the output. Effective prompting strategies have enhanced LLM performance on tasks ranging from simple question-answering to complex reasoning and problem-solving. Recent research has explored various approaches to prompting and reasoning evaluation in LLMs. Chain-of-Thought (CoT) prompting (Wei et al. 2022b) elicits step-by-step reasoning, improving performance on complex tasks. Specializing smaller models (Fu et al. 2023) and using large models as reasoning teachers (Ho, Schmid, and Yun 2022) have demonstrated the potential for enhancing reasoning capabilities. Emergent abilities in LLMs, which appear suddenly at certain scale thresholds, have also been a topic of interest. Wei et al. (2022a) examined these abilities in few-shot prompting, discussing the underlying factors and implications for future scaling. Complementing this, Kojima et al. (2022) demonstrated that LLMs could exhibit multi-step reasoning capabilities in a zero-shot setting by simply modifying the prompt structure, thus highlighting their potential as general reasoning engines. Yao et al. (2023) introduced the Tree-of-Thoughts framework, enabling LLMs to deliberate over coherent text units and perform heuristic searches for complex reasoning tasks. This approach generalizes over chain-of-thought prompting and has shown significant performance improvements in tasks requiring planning and search, such as creative writing and problem-solving games. Kong et al. (2024) introduced role-play prompting to improve zero-shot reasoning by constructing role-immersion interactions, which implicitly trigger chain-of-thought processes and enhance performance across diverse reasoning benchmarks. Progressive-hint prompting (Zheng et al. 2023) has been proposed to conceptualize answer generation and guide LLMs toward correct responses. Metacognitive prompting (Wang and Zhao 2024) incorporates self-aware evaluations to enhance understanding abilities.

These works collectively highlight the advancements in leveraging LLMs through innovative prompting techniques, addressing their emergent abilities, reasoning capabilities, interaction strategies, robustness, and evaluation methodologies. Despite significant advancements, the current LLM research reveals several limitations, particularly in terms of prompt design, handling complex reasoning tasks, and evaluating model performance across diverse scenarios. While promising, the emergent abilities of LLMs often lack predictability and control, and the robustness of these LLMs in the face of misleading prompts remains a concern.

2.2 Prompt Optimization and Selection

The challenge of optimizing prompts for LLMs has been addressed in several key studies, each contributing unique methodologies to enhance model performance and efficiency. Shen et al. (2023) introduce PFLAT, a metric utilizing flatness regularization to quantify prompt utility, which leads to improved results in classification tasks. Do et al. (2024) propose a structured three-step methodology comprising data clustering, prompt generation, and evaluation, effectively balancing generality and specificity in prompt selection. ProTeGi (Pryzant et al. 2023) offers a non-parametric approach inspired by gradient descent, leveraging natural language "gradients" to iteratively refine prompts. Wang et al. (2024) present PromISe, which transforms prompt optimization into an explicit chain of thought, employing self-introspection and refinement techniques. Zhou et al. (2023b) proposed DYNAICL, a framework for efficient prompting that dynamically allocates in-context examples based on a meta-controller’s predictions, achieving better performance-efficiency tradeoffs compared to uniform example allocation.

These studies aim to automate prompt design, moving away from traditional manual trial-and-error methods while emphasizing efficiency and scalability across various tasks and models. They report significant improvements in LLM performance, with enhancements ranging from 5% to 31% across different benchmarks. This body of work underscores the increasing importance of prompt optimization and selection in unlocking the potential of LLMs and points toward future research avenues, such as exploring theoretical foundations, integrating multiple optimization techniques, and distinguishing between task-specific and general-purpose strategies.

2.3 Evaluation Benchmarks

To facilitate the evaluation and understanding of LLM capabilities, Zhu et al. (2024) introduced PromptBench, a unified library encompassing a variety of LLMs, datasets, evaluation protocols, and adversarial prompt attacks. This modular and extensible tool aims to support collaborative research and advance the comprehension of LLM strengths and weaknesses. Further exploring reasoning capabilities, Qiao et al. (2023) categorized various prompting methods and evaluated their effectiveness across different model scales and reasoning tasks, identifying key open questions for achieving robust and generalizable reasoning. Wang et al. (2021) introduced a multi-task benchmark for robustness evaluation of LLMs that extends the original GLUE benchmark (Wang et al. 2018) to assess model robustness against adversarial inputs. It incorporates perturbed versions of existing GLUE tasks, using perturbations such as paraphrasing, negation, and noise, to test models’ abilities with challenging data. The study highlights that despite their success on clean datasets, state-of-the-art models often struggle with adversarial examples, underscoring the importance of robustness evaluations in model development.

3 Hierarchical Prompting Taxonomy

3.1 Governing Rules

Figure 3 illustrates the HPT, a taxonomy that systematically reflects human cognitive functions as outlined in Bloom (1956). Each rule embodies complex cognitive processes based on established principles from learning and psychology.

Figure 3: Hierarchical Prompting Taxonomy: A taxonomy designed to assess the complexity of prompting strategies based on the criteria: Basic Recall and Reproduction, Understanding and Interpretation, Analysis and Reasoning, and Application of Knowledge and Execution.

than mere understanding because it requires examining structure and identifying patterns and connections.

Application of Knowledge and Execution: This mirrors the application and evaluation stages of Bloom (1956), where individuals must not only understand and analyze but also use knowledge to perform multi-step tasks, solve complex problems, and execute decisions. It represents the most cognitively complex tasks, which require synthesis of information and practical decision-making, highlighting the critical leap from understanding theory to executing it in practice.

In HPT, the progression from basic recall to application of knowledge reflects increasing cognitive complexity, consistent with educational and cognitive frameworks, where more advanced cognitive processes build on foundational ones, demanding deeper engagement and mental effort.

3.2 Hierarchical Prompting Framework

The HPF consists of five prompting strategies, each assigned a complexity level. These levels are determined by the degree to which the strategies are shaped by the four principles of the HPT. The complexity levels of the prompting strategies are assigned based on human assessment of their relative cognitive loads over a set of 7 different tasks, guaranteeing that the cognitive abilities of LLMs are in harmony with those of humans. This approach enables the assessment of tasks in terms of their complexity and the cognitive load they impose on both humans and LLMs by utilizing HPI. Section 4.4 examines the hierarchical structure of the HPF in conjunction with the LLM-as-a-Judge framework, validating that the cognitive demands on LLMs can be aligned with those of humans.

The set of five prompting strategies was chosen from a diverse range of existing strategies to populate the framework, guided by a human judgment policy that prioritizes comprehensiveness in cognitive demands rather than the sheer number of strategies. See Appendix A for more details. Consequently, the HPF can be replicated or expanded with other relevant prompting strategies that exhibit similar cognitive demands, making the framework adaptable. The five prompting strategies, listed from least to most complex, are as follows:

This approach requires recalling prior prompts, interpreting previous responses, and analyzing them to effectively solve the task, resulting in a highly cognitively demanding strategy.

Generated Knowledge Prompting (GKP) (Liu et al. 2022): Prompts that require integrating external knowledge to generate relevant information represent the most complex and cognitively demanding strategy. This approach is strongly influenced by rules 2, 3, and 4, as it involves correlating knowledge with the prompt and applying and analyzing external information, making it the most cognitively demanding within the HPT framework. In the experiments, Llama-3 8B is used to generate external knowledge.
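
The escalation that the HPF performs on a single task instance can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Strategy` interface, the `solve_with_hpf` function, and the toy strategies are hypothetical stand-ins, and a real strategy would prompt an LLM and check its answer against a reference.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interface: a strategy takes a task and returns True when the
# model's answer under that prompting strategy is judged correct.
Strategy = Callable[[str], bool]


def solve_with_hpf(task: str, strategies: Sequence[Strategy]) -> Optional[int]:
    """Try strategies in order from least to most complex.

    Returns the 1-based complexity level of the first strategy that
    resolves the task, or None if every level fails.
    """
    for level, strategy in enumerate(strategies, start=1):
        if strategy(task):
            return level
    return None


# Toy stand-ins for five strategies ordered by complexity: each higher
# level "solves" progressively longer tasks.
toy_strategies = [
    lambda task: len(task) < 5,   # level 1
    lambda task: len(task) < 10,  # level 2
    lambda task: len(task) < 15,  # level 3
    lambda task: len(task) < 20,  # level 4
    lambda task: len(task) < 25,  # level 5
]

easy_level = solve_with_hpf("2+2", toy_strategies)      # resolved at level 1
hard_level = solve_with_hpf("x" * 18, toy_strategies)   # resolved at level 4
unsolved = solve_with_hpf("x" * 40, toy_strategies)     # no level succeeds
```

Stopping at the first successful level is what ties the recorded level to the least cognitively demanding prompt that suffices for that instance.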

3.3 Hierarchical Prompting Index

HPI is an evaluation metric for assessing the task complexity of LLMs over different datasets, which is influenced by the HPT rules. A lower HPI for a dataset suggests that the corresponding LLM is more adept at solving the task with fewer cognitive processes. For each dataset instance, we begin with the least complex prompting strategy and progressively move through the HPF prompting strategies until the instance is resolved. The HPI of an instance corresponds to the complexity level of the prompting strategy at which the LLM first resolves it.

Algorithm 1: HPI Metric

    HPI_List = []
    for sample i in the evaluation dataset do
        for level l in the HPF do
            if LLM resolves the task then
                HPI_List[i] = l
                break
            end if
        end for
        if LLM failed to resolve the task then
            HPI_List[i] = m + HPI_Dataset
        end if
    end for
    HPI = (1/n) * sum_{j=1}^{n} HPI_List[j]

$m$ is the total number of levels in the HPF, and $n$ is the total number of samples in the evaluation dataset. $\mathrm{HPI}_{Dataset}$ represents the penalty introduced into the framework by human assessments. For further details on human annotation, see Appendix A.
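
As a worked illustration of the aggregation step in Algorithm 1, the sketch below averages per-sample scores and applies the penalty $m + \mathrm{HPI}_{Dataset}$ to unresolved samples. The function name, its parameters, and the penalty value 2.0 are assumptions made for this example; the actual penalties come from the human annotation described in Appendix A.

```python
from typing import Optional, Sequence


def hierarchical_prompting_index(
    resolved_levels: Sequence[Optional[int]],
    num_levels: int,
    dataset_penalty: float,
) -> float:
    """Average per-sample HPI over an evaluation dataset.

    resolved_levels[i] is the level at which sample i was first resolved,
    or None if no level resolved it. Unresolved samples are scored as
    num_levels + dataset_penalty (m + HPI_Dataset in the text).
    """
    if not resolved_levels:
        raise ValueError("need at least one sample")
    scores = [
        level if level is not None else num_levels + dataset_penalty
        for level in resolved_levels
    ]
    return sum(scores) / len(scores)


# Four samples: three resolved at levels 1, 2, and 5, one never resolved.
# m = 5 levels; the penalty 2.0 is a made-up value for illustration.
hpi = hierarchical_prompting_index(
    [1, 2, 5, None], num_levels=5, dataset_penalty=2.0
)
# (1 + 2 + 5 + 7) / 4 = 3.75
```

Because unresolved samples contribute more than $m$, a dataset-level HPI above the number of levels signals that many samples were not solved at any level.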

4 Results

4.1 Experimental Setup

Datasets

The experiments utilized a diverse set of datasets, including MMLU, GSM8k, HumanEval, BoolQ, CSQA, SamSum, and IWSLT en-fr, covering areas such as reasoning, coding, mathematics, question-answering, summarization, and machine translation, to evaluate the framework’s robustness and applicability. For further details on evaluation dataset sizes, see Appendix A.

Reasoning: MMLU (Hendrycks et al. 2021) includes multiple-choice questions across 57 subjects, covering areas like humanities, social sciences, physical sciences, basic mathematics, U.S. history, computer science, and law. CommonSenseQA (CSQA) (Talmor et al. 2019) contains 12,000 questions to assess commonsense reasoning.

Coding: HumanEval (Chen et al. 2021a) features 164 coding challenges, each with a function signature, docstring, body, and unit tests, designed to avoid training data overlap with LLMs.

Mathematics: Grade School Math 8K (GSM8k) (Cobbe et al. 2021) comprises 8.5K diverse math problems for multi-step reasoning, focusing on basic arithmetic and early algebra.

Question-Answering: BoolQ (Clark et al. 2019) consists of 16,000 True/False questions based on Wikipedia passages for binary reading comprehension.

Summarization: SamSum (Gliwa et al. 2019) features 16,000 human-generated chat logs with summaries for dialogue summarization.

Machine Translation: IWSLT-2017 en-fr (IWSLT) (Cettolo et al. 2017) is a parallel corpus with thousands of English-French sentence pairs from TED Talks for translation tasks.

Large Language Models

For the evaluation, leading open-source LLMs with parameter sizes ranging from 7 billion to 12 billion, together with leading proprietary models, were selected to determine the effectiveness of the proposed framework across varied parameter scales and architectures.

Additional Evaluation Metrics

• Coding: The Pass@k (Chen et al. 2021b) metric measures the probability that at least one of the top k outputs is a correct solution; it is used for evaluating code generation.
• Summarization: ROUGE-L (Lin 2004) evaluates the longest common subsequence between generated text and reference, focusing on sequence-level similarity for summaries and translations.
• Machine Translation: BLEU (Papineni et al. 2002) is a precision-based metric that assesses machine-generated text quality by comparing n-grams with reference texts.
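
To make the Pass@k definition concrete, the sketch below uses the unbiased estimator commonly associated with Chen et al. (2021), $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ solutions are sampled and $c$ of them pass the unit tests. This section does not state which estimator the experiments use, so the function name `pass_at_k` and the choice of estimator should be read as assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset
        # contains at least one correct solution.
        return 1.0
    # Probability that a random size-k subset contains no correct solution,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 3 correct, pass@1 is about 0.3 (the fraction correct),
# and pass@5 is higher because any of five draws may succeed.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

For k = 1 the estimate reduces to the fraction of correct samples, which is why Pass@1 behaves like an accuracy.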

In the experiments, thresholds of 0.15 and 0.20 were established for summarization and machine translation tasks to define the conditions required for task completion at each complexity level of the HPF. These thresholds allowed for iterative refinement of HPF prompting strategies.
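
A threshold-based completion check of this kind can be sketched with a simplified ROUGE-L. The code below computes an F1 score from the longest common subsequence of whitespace tokens and treats a sample as resolved when the score reaches the threshold. The tokenization, the use of F1, the `>=` comparison, and all function names are assumptions made for illustration; the paper's exact scoring setup is not given in this section.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_resolved(score: float, threshold: float) -> bool:
    """Assumed completion rule: the score must reach the threshold."""
    return score >= threshold


reference = "Amanda baked cookies and will bring some"
good = rouge_l_f1("Amanda baked cookies", reference)   # LCS of 3 tokens
bad = rouge_l_f1("the meeting is tomorrow", reference)  # no shared tokens
resolved_good = is_resolved(good, threshold=0.15)
resolved_bad = is_resolved(bad, threshold=0.15)
```

Under this rule, a sample that misses the threshold at one level would be passed to the next, more complex prompting strategy.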

4.2 Results on Standard Benchmarks: MMLU, GSM8k, and HumanEval

The evaluation of HPF effectiveness as shown in Figure 4 spans three standard benchmarks: MMLU, GSM8k, and HumanEval. On the MMLU benchmark, which tests general knowledge across multiple domains, all models showed notable improvements over their baseline performance. Mistral-Nemo 12B demonstrated the most substantial MMLU enhancement (+21.8%), while Claude 3.5 Sonnet achieved a consistent improvement of 3.5%. In mathematical reasoning, assessed through GSM8k, the results revealed a correlation with model scale. Larger models like GPT-4o and Claude 3.5 Sonnet showed modest gains (+4.4% and +1.3%, respectively), while smaller models exhibited more variable performance. The HumanEval benchmark, which assesses code generation capabilities, revealed the most dramatic improvements across all models. Mistral 7B achieved an exceptional 62.5% improvement in HumanEval scores, followed by Mistral-Nemo 12B with a 51.4% improvement and Gemma-2 9B with a 50.8% enhancement. These findings indicate that HPF improves performance across all benchmarks for most of the LLMs; its impact is particularly pronounced in programming tasks, suggesting that the technique may be especially valuable for enhancing code-related capabilities.

Table 1 highlights the improved performance of various LLMs on MMLU, with all models showing an HPI below three. This indicates that reasoning over most MMLU samples requires minimal cognitive effort for these models, compared to baseline multi-shot CoT methods (5-shot), which typically require more than five examples and are more cognitively demanding according to HPT. Interestingly, while Claude 3.5 Sonnet achieves the highest MMLU accuracy, GPT-4o records the best HPI score, showing that minimal cognitive effort does not necessarily equate to the best performance. The enhancement in GSM8k is relatively smaller compared to MMLU, with decreased performance for both Mistral 7B and Gemma 7B. The high HPI values for Gemma 7B and Mistral 7B indicate that none of the five prompting strategies in HPF posed significant cognitive challenges for these LLMs, highlighting a limitation of the HPF. As shown in Table 2, Claude 3.5 Sonnet achieves a perfect Pass@1 of 1.00 with low HPI values, outperforming GPT-4o, which scores 0.95 but has a higher HPI. Gemma 7B struggles with the lowest Pass@1 of 0.79 and the highest HPI of 3.71, indicating a need for a more complex prompting strategy.

Interestingly, HPF significantly enhanced the performance of most LLMs across three benchmark datasets, even when the HPI difference was less than 1 relative to the best performing LLMs. This highlights that tailoring the prompting strategy to align with the complexity of each dataset instance can lead to substantial improvements, achieving performance levels comparable to state-of-the-art LLMs such as GPT-4o and Claude 3.5 Sonnet on these benchmarks.

4.3 Results on Other Datasets

Table 1 presents the performance of LLMs on the BoolQ and CSQA datasets. Notably, no significant insights emerge from the results, aside from GPT-4o performing unexpectedly poorly, which contrasts with its typical performance. With most LLMs achieving near-perfect scores, the BoolQ dataset appears to lack the complexity needed to serve as an effective benchmark for modern LLMs, as they perform exceptionally well even with minimal cognitive prompting strategies. This underscores the utility of HPF in evaluating dataset complexities relative to an LLM, offering researchers valuable insights for designing more challenging and robust benchmarks.

Figure 4: Performance Comparison of HPT-based Evaluation vs. Standard Evaluation: Performance improvements (in %) when using HPT-based evaluation compared to standard evaluation across three benchmarks: MMLU, GSM8k, and HumanEval. Positive values indicate performance gains with HPT, while negative values indicate performance decreases. The baseline standard evaluation scores are sourced from the Hugging Face leaderboard and official research reports.

Figure 5: Hierarchy of prompting strategies with the LLM-as-a-Judge framework, with GPT-4o as the judge.

图 4: 基于HPT的评估与标准评估性能对比：在MMLU、GSM8k和HumanEval三个基准测试中，使用基于HPT的评估相比标准评估的性能提升(以%计)。正值表示HPT带来性能提升，负值表示性能下降。基线标准评估分数来源于Hugging Face排行榜和官方研究报告。
图 5: 采用GPT-4o作为评判者的大语言模型即评判框架(LMM-as-aJudge)的提示策略层级结构。

Table 3 presents the performance of LLMs on the IWSLT and SamSum datasets at varying thresholds. GPT-4o consistently achieved the highest scores across all thresholds, while most models, except Gemma 7B, performed similarly. Interestingly, Claude 3.5 Sonnet, which excelled in reasoning tasks, did not perform as strongly in summarization and translation tasks. The threshold selection is guided by the observed performance plateau across most LLMs as the threshold increases. For a detailed explanation of the threshold selection process, please refer to Appendix B.

表 3 展示了不同阈值下大语言模型(LLM)在IWSLT和SamSum数据集上的表现。GPT-4o在所有阈值下均保持最高分，而除Gemma 7B外的大多数模型表现相近。值得注意的是，在推理任务中表现出色的Claude 3.5 Sonnet在摘要和翻译任务中表现相对较弱。阈值选择基于观察到大多数大语言模型性能随阈值增加而趋于平缓的现象。关于阈值选择过程的详细说明，请参阅附录B。

Table 1: HPI (lower is better) and accuracy of LLMs across MMLU, GSM8K, BoolQ, and CSQA datasets. Blue indicates datasets where the LLM with the best HPI does not achieve the best performance. Green indicates the LLM with the best performance over the maximum number of datasets.

表 1: HPI (越低越好) 和大语言模型在 MMLU、GSM8K、BoolQ 和 CSQA 数据集上的准确率。蓝色表示 HPI 最佳的大语言模型未取得最佳性能的数据集，绿色表示在最多数据集上取得最佳性能的大语言模型。

Models	MMLU HPI	MMLU Accuracy	GSM8k HPI	GSM8k Accuracy	BoolQ HPI	BoolQ Accuracy	CSQA HPI	CSQA Accuracy
GPT-4o	1.81	91.61	1.71	96.43	1.32	96.82	1.65	92.54
Claude 3.5 Sonnet	1.84	92.16	1.35	97.72	1.20	99.81	2.01	86.15
Mistral-Nemo 12B	2.45	89.75	3.01	86.80	1.75	99.87	2.06	90.17
Gemma-2 9B	2.34	87.28	2.17	91.28	1.30	98.28	1.94	88.86
Llama-3 8B	2.84	82.63	2.34	86.20	1.37	99.30	2.43	84.76
Gemma 7B	2.93	83.31	6.70	27.88	1.45	99.42	2.50	83.78
Mistral 7B	2.89	81.45	5.11	46.93	1.41	98.07	2.49	82.06
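
As a quick consistency check, the per-dataset average HPI can be recomputed from the HPI columns of Table 1. The sketch below uses only the values listed in the table; the variable names are illustrative and not taken from the released code. Under this calculation GSM8k has the highest average HPI of the four datasets, about 3.20, which agrees with the figure quoted in the abstract.

```python
from statistics import mean

# HPI columns copied from Table 1, in row order:
# GPT-4o, Claude 3.5 Sonnet, Mistral-Nemo 12B, Gemma-2 9B,
# Llama-3 8B, Gemma 7B, Mistral 7B
hpi_by_dataset = {
    "MMLU":  [1.81, 1.84, 2.45, 2.34, 2.84, 2.93, 2.89],
    "GSM8k": [1.71, 1.35, 3.01, 2.17, 2.34, 6.70, 5.11],
    "BoolQ": [1.32, 1.20, 1.75, 1.30, 1.37, 1.45, 1.41],
    "CSQA":  [1.65, 2.01, 2.06, 1.94, 2.43, 2.50, 2.49],
}

# Average HPI per dataset across the seven evaluated LLMs.
average_hpi = {name: mean(values) for name, values in hpi_by_dataset.items()}

# The dataset with the highest average HPI is the most demanding one here.
hardest = max(average_hpi, key=average_hpi.get)

for name, value in average_hpi.items():
    print(f"{name}: {value:.2f}")
print("highest average HPI:", hardest)
```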

Table 2: HPI (lower is better) and Pass $@1$ of LLMs on the HumanEval dataset. Blue indicates datasets where the LLM with the best HPI does not achieve the best performance. Green indicates the LLMs with the best performance over the dataset.

表 2: HumanEval 数据集上各LLM的HPI (数值越低越好) 和Pass@1指标。蓝色标注表示HPI最佳的大语言模型在该数据集上未取得最佳性能，绿色标注表示在该数据集上性能最优的大语言模型。

Models	HumanEval HPI	HumanEval Pass@1
GPT-4o	2.25	0.95
Claude 3.5 Sonnet	1.04	1.00
Mistral-Nemo 12B	2.07	0.96
Gemma-2 9B	1.01	0.91
Llama-3 8B	1.03	1.00
Gemma 7B	3.71	0.79
Mistral 7B	1.10	0.93

4.4 Complexity Levels with LLM-as-a-Judge

4.4 基于大语言模型 (LLM) 评判的复杂度分级

This study evaluated prompting strategies by assessing how GPT-4o, as the LLM judge, replicates the hierarchical complexity levels of these strategies using a systematic scoring approach across tasks. Figure 5 shows a consistent hierarchy with less variability than that of human judges, indicating a strong alignment between LLM and human judgment. These results validate the proposed framework and demonstrate the correspondence between human cognitive principles and LLM behavior. Figure 6 shows the scoring distribution across the four HPT rules for each strategy. Further details on the dataset specifications and scoring method are given in Appendix C.

本研究通过评估GPT-4o作为大语言模型裁判，如何采用系统性评分方法在不同任务中复现提示策略的层次复杂度水平来验证其有效性。图5显示其层级一致性高于人类裁判，表明大语言模型与人类判断具有高度吻合性。这些结果验证了所提框架的有效性，并揭示了人类认知原则与大语言模型行为之间的对应关系。图6展示了各策略在四条HPT规则下的评分分布。关于数据集规范与评分方法的详细信息见附录C。

4.5 Parallels with System 1 and System 2 Thinking

4.5 与系统1和系统2思维的类比

HPF aligns closely with the principles of System 1 and System 2 thinking from dual-process cognitive theories (Booch et al. 2021). HPT categorizes tasks and HPF structures prompts based on their cognitive complexity, mirroring how humans allocate cognitive resources. For tasks with low cognitive demands, HPF employs simple prompts that parallel System 1 thinking. These tasks, like fact recall or basic classification, require minimal reasoning, allowing the LLM to respond quickly and efficiently without extensive computation. For instance, asking an LLM to “identify the capital of a country” is analogous to a person retrieving a familiar fact using System 1.

HPF与双过程认知理论中的系统1和系统2思维原则高度契合 (Booch et al. 2021)。HPT根据认知复杂度对任务进行分类，HPF则基于该分类构建提示结构，这种机制模拟了人类分配认知资源的方式。对于低认知需求任务，HPF采用类似系统1思维的简单提示。这类任务（如事实回忆或基础分类）只需极少推理，使得大语言模型无需大量计算即可快速高效响应。例如，让大语言模型"识别某国首都"就类似于人类通过系统1提取熟悉信息。

Figure 6: Scoring distribution for each of the four rules of the HPT for the prompting strategies in the HPF.

图 6: HPF中不同提示策略在HPT四项规则下的得分分布。

In contrast, tasks with high cognitive demands involve prompts that guide the LLM through complex reasoning, abstraction, or multi-step problem-solving—analogous to System 2 thinking. Examples include generating logical arguments or solving intricate problems, where deliberate and resource-intensive processes are necessary. Just as System 2 engages when a problem exceeds the capacity of System 1, higher levels of HPF are invoked for tasks requiring deeper analysis.

相比之下，认知需求高的任务需要引导大语言模型进行复杂推理、抽象或多步骤问题解决的提示，类似于系统2思维。例如生成逻辑论证或解决复杂问题，这些任务需要深思熟虑且消耗资源的过程。正如当问题超出系统1能力范围时系统2会介入，需要深度分析的任务也会调用更高层次的高阶提示框架(HPF)。

HPF explicitly measures this transition with HPI, assessing the cognitive load required for each task. By tailoring prompting strategies to task complexity, HPF optimizes LLM performance, much like humans adaptively switch between System 1 and System 2 based on the situation. This parallel highlights how HPT bridges computational strategies with human-like cognitive models, enabling more nuanced task evaluation and resource allocation.

HPF通过HPI(HPI) 明确测量这种转换，评估每项任务所需的认知负荷。通过根据任务复杂度定制提示策略，HPF优化了大语言模型(LLM) 性能，就像人类根据情境在系统1和系统2之间自适应切换一样。这种相似性凸显了HPT如何将计算策略与类人认知模型相连接，从而实现更细致的任务评估和资源分配。

Table 3: HPI (lower is better), BLEU score for IWSLT, and ROUGE-L score for SamSum of LLMs at thresholds of 0.15 and 0.20.

Dataset	IWSLT	IWSLT	SamSum	SamSum
	HPI	BLEU	HPI	ROUGE-L
Model	0.15	0.20	0.15	0.20
GPT-4o	2.66	3.08	0.32	0.32
Claude 3.5 Sonnet	4.63	4.87	0.20	0.20
Mistral-Nemo 12B	2.87	3.40	0.27	0.27
Gemma-2 9B	4.40	4.75	0.21	0.20
Llama-3 8B	3.40	3.92	0.24	0.23
Gemma 7B	5.39	5.84	0.08	0.09
Mistral 7B	3.52	4.14	0.22	0.22

表 3: 大语言模型在 IWSLT 数据集上的 HPI (越低越好) 和 BLEU 分数，以及在 SamSum 数据集上的 ROUGE-L 分数，阈值分别为 0.15 和 0.20

4.6 Adaptive HPF

4.6 自适应HPF (Adaptive HPF)

The Adaptive HPF automates the selection of the optimal complexity level in the HPF using a prompt-selector, Llama-3 8B in a zero-shot setting, bypassing iterative steps. Figure 7 shows that Adaptive HPF yields higher HPI but lower evaluation scores than the standard HPF. This suggests that Adaptive HPF struggles to select the optimal complexity level, likely due to hallucinations by the prompt-selector when choosing the prompting strategy. For more results and ablation studies, see Appendix E.

自适应HPF通过提示选择器在零样本设置下使用Llama $38\mathrm{B}$自动选择HPF中的最佳复杂度级别，省去了迭代步骤。图7显示，自适应HPF比标准HPF产生了更高的HPI但更低的评估分数。这表明自适应HPF难以选择最佳复杂度级别，可能是由于提示选择器在选择提示策略时产生了幻觉。更多结果和消融研究见附录E。

Figure 7: HPI of datasets for LLMs in Adaptive HPF.

图 7: 自适应HPF中大语言模型数据集的HPI

5 Conclusion

5 结论

The HPT provides a strong and efficient approach for assessing LLMs by focusing solely on the cognitive demands of different tasks. The results emphasize that the HPF is effective in evaluating diverse datasets, using the most cognitively effective prompting strategies tailored to task complexity, which results in improved LLM performance across multiple datasets. This method not only offers in-depth insight into an LLM’s problem-solving abilities but also suggests that dynamically choosing suitable prompting strategies can enhance LLM performance, setting the stage for future improvements in LLM evaluation methods. This study paves the way for formulating and designing evaluation methods grounded in human cognitive principles, aligning them with the problem-solving capabilities of LLMs. Additionally, it facilitates the development of more robust benchmarks and in-context learning methodologies to effectively assess LLM performance across various tasks.

HPT通过专注于不同任务的认知需求，为评估大语言模型(LLM)提供了一种强大而高效的方法。研究结果表明，HPF能有效评估多样化数据集，通过采用根据任务复杂度定制的认知最优提示策略，从而提升LLM在多个数据集上的表现。该方法不仅深入揭示了LLM的解题能力，还表明动态选择合适提示策略可增强LLM性能，为未来改进LLM评估方法奠定了基础。本研究为基于人类认知原理制定评估方法开辟了道路，使其与大语言模型的解题能力相匹配。此外，它还促进了更健壮的基准测试和情境学习方法的开发，以有效评估LLM在不同任务中的表现。

6 Ethical Statement

6 伦理声明

The $\mathsf{HPI}_{Dataset}$ assigned by experts to the datasets MMLU, GSM8k, HumanEval, BoolQ, CSQA, IWSLT, and SamSum may introduce bias into the comparative analysis. This potential bias stems from the subjective nature of expert scoring, which can be influenced by individual experience and perspective. Despite this, the datasets utilized in this study are publicly available and widely recognized in the research community, thereby minimizing the risk of unanticipated ethical issues. Nevertheless, it is crucial to acknowledge the possibility of scoring bias to ensure transparency and integrity in the analysis.

专家为MMLU、GSM8K、Humaneval、BoolQ、CSQA、IWSLT和SamSum等数据集分配的$\mathsf{HPI}_{Dataset}$可能会在对比分析中引入偏差。这种潜在偏差源于专家评分的主观性，可能受到个人经验和视角的影响。尽管如此，本研究使用的数据集均为公开资源且在学术界广泛认可，从而将意外伦理风险降至最低。但必须承认评分偏差的可能性，以确保分析过程的透明性与严谨性。

References

参考文献

AI@Meta. 2024. Llama 3 Model Card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

AI@Meta. 2024. Llama 3 模型卡。https://github.com/meta-llama/llama3/blob/main/MODEL CARD.md。

Anderson, L.; Krathwohl, D.; Cruikshank, K.; Airasian, P.; Raths, J.; Pintrich, P.; Mayer, R.; and Wittrock, M. 2014. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s. Always learning. Pearson. ISBN 9781292042848.
Anthropic. 2024. Claude 3.5 Sonnet. https://www.anthropic.com/claude-3-5-sonnet. Accessed: 2024-09-16.
Bloom, B. 1956. Taxonomy of Educational Objectives: The Classification of Educational Goals. Number v. 1 in Taxonomy of Educational Objectives: The Classification of Educational Goals. Longmans, Green. ISBN 9780679302094.
Booch, G.; Fabiano, F.; Horesh, L.; Kate, K.; Lenchner, J.; Linck, N.; Loreggia, A.; Murgesan, K.; Mattei, N.; Rossi, F.; et al. 2021. Thinking fast and slow in AI. In Proceedings of

Anderson, L.; Krathwohl, D.; Cruikshank, K.; Airasian,P.; Raths, J.; Pintrich, P.; Mayer, R.; and Wittrock, M. 2014. 学习、教学和评估的分类学: 布鲁姆分类法的修订版. Always learning. Pearson. ISBN 9781292042848.
Anthropic. 2024. Claude 3.5 Sonnet. https://www.anthropic. com/claude-3-5-sonnet. 访问日期: 2024-09-16.
Bloom, B. 1956. 教育目标分类学: 教育目标的分类. Number v. 1 in Taxonomy of Educational Objectives: The Classification of Educational Goals. Longmans, Green. ISBN 9780679302094.
Booch, G.; Fabiano, F.; Horesh, L.; Kate, K.; Lenchner, J.; Linck, N.; Loreggia, A.; Murgesan, K.; Mattei, N.; Rossi, F.; et al. 2021. AI中的快与慢思考. In Proceedings of

A Human Annotation and Judgement Policy

A.1 Human Annotation Policy

人工标注与评判策略

A.1 人工标注策略

$\mathsf{HPI}_{Dataset}$ is introduced to penalize the HPI of tasks or samples unsolvable by the LLM, aligning the framework more closely with human cognitive demands and enhancing its comprehensiveness. We implemented a rigorous human annotation process to ensure the quality of the $\mathsf{HPI}_{Dataset}$ scored by human experts for the datasets. Human annotators were tasked with calculating the HPI for each sample in a given dataset. The HPI quantifies the cognitive demands that completing a task places on a human expert, based on the HPT, where higher values indicate greater cognitive demands. Each sample was scored on a scale from 1 (lowest complexity level) to 5 (highest complexity level) for the following criteria:

$\mathsf{H P I}{数据集}$ 用于惩罚大语言模型无法解决的任务或样本的HPI值，使框架更贴近人类认知需求并提升其全面性。我们实施了严格的人工标注流程来确保由人类专家为数据集评定的$\mathsf{H P I}_{数据集}$质量。标注人员需计算给定数据集中每个样本的HPI值，该指标基于HPT理论量化了完成任务对人类专家认知能力的要求，数值越高代表认知需求越大。每个样本按以下标准进行1级（最低复杂度）至5级（最高复杂度）评分：

Higher scores for the four rules signify a stronger influence of the respective rules, indicating that completing the task requires greater cognitive effort. The $\mathsf{HPI}_{Dataset}$ for each dataset, as shown in Table 4, was calculated by taking the mean of the values from these four criteria, acknowledging the challenge of estimating or computing the individual weights of the influence of each rule.

四条规则的得分越高，表明相应规则的影响越大，意味着完成任务需要更多的认知努力。如表 4 所示，每个数据集的 $\mathsf{H P I}_{D a t a s e t}$ 是通过这四个标准的平均值计算得出的，同时承认估算或计算每条规则影响权重的挑战性。
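
The averaging step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' annotation tooling: the rule names are borrowed from the judge criteria listed in Appendix C.1 on the assumption that they match the four HPT rules, and both the function names and the final averaging over samples are hypothetical choices.

```python
from statistics import mean

# Assumed names for the four HPT rules (taken from the criteria in Appendix C.1).
RULES = (
    "Basic Recall and Reproduction",
    "Understanding and Interpretation",
    "Analysis and Reasoning",
    "Application of Knowledge and Execution",
)


def sample_hpi(scores):
    """Unweighted mean of the four rule scores (each 1 to 5) for one sample."""
    for rule in RULES:
        if rule not in scores:
            raise ValueError(f"missing score for rule: {rule}")
        if not 1 <= scores[rule] <= 5:
            raise ValueError(f"score for {rule} must lie between 1 and 5")
    return float(mean(scores[rule] for rule in RULES))


def dataset_hpi(samples):
    """Mean per-sample HPI over the annotated representative set (assumed aggregation)."""
    return float(mean(sample_hpi(scores) for scores in samples))


# Two hypothetical annotated samples, for illustration only.
easy = dict(zip(RULES, (2, 2, 3, 1)))
hard = dict(zip(RULES, (4, 3, 4, 4)))
print(sample_hpi(easy), sample_hpi(hard), dataset_hpi([easy, hard]))
```

Equal weights are used because the text notes that the individual weights of the rules are hard to estimate.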

The Representative Set Size in Table 4 refers to the subset of the dataset evaluated by human annotators, ensuring that the assessment reflects the overall task. Human annotation, while time-consuming and costly, provides a gold standard for calibrating the evaluation process of this paper. Selecting 5% of the dataset as the representative set size balances quality assessment and feasibility, capturing the dataset’s diversity and ensuring that human annotations encompass a broad range of cases without needing to annotate every sample.

表4中的代表集大小是指由人工标注员评估的数据集子集，确保评估能反映整体任务。人工标注虽然耗时且成本高昂，但为本文的评估过程提供了黄金标准。选择数据集的$5%$作为代表集大小，在质量评估和可行性之间取得了平衡，既能捕捉数据集的多样性，又无需标注每个样本就能确保人工标注涵盖广泛案例。

Table 4: $\mathsf{HPI}_{Dataset}$ scores across datasets evaluated by human annotators. The table lists the evaluation set size, representative set size, and $\mathsf{HPI}_{Dataset}$ for various datasets. $\mathsf{HPI}_{Dataset}$ scores provide a measure of task complexity relative to human annotators.

表 4: ${\sf H P I}{D a t a s e t}$ 在不同数据集上由人工标注者评估的得分。该表列出了各数据集的评估集大小、代表集大小和 $\mathsf{H P I}{D a t a s e t}$ 。 $\mathsf{H P I}_{D a t a s e t}$ 得分提供了相对于人工标注者的任务复杂度度量。

Dataset	Evaluation Set Size	Representative Set Size	HPI_Dataset
MMLU	14500	725	3.03
GSM8k	1320	66	2.14
HumanEval	160	8	4.68
BoolQ	3270	162	1.71
CSQA	1221	60	2.52
IWSLT	890	45	1.92
SamSum	819	40	2.23

A.2 Human Judgement Policy

A.2 人类判断策略

To populate the HPF with relevant strategies drawn from a wide range of prompting strategies, the human annotators who followed the annotation policy for assessing $\mathsf{HPI}_{Dataset}$ were also instructed to follow a judgment policy for a predefined set of prompting strategies. They evaluated the influence of the four rules of the HPT on solving the annotated tasks using each prompting strategy, rating their influence as

为了在HPF中填充涵盖广泛策略的相关提示策略，遵循$\mathsf{H P I}_{数据集}$评估标注政策的人类标注员被要求按照预定义提示策略集的判断政策执行。他们需评估HPT四条规则对使用每种提示策略解决标注任务的影响程度，并将其影响评级为

High (H), Moderate (M), or Low (L). Note that a high rating on rule 4 has a greater influence than a high rating on rule 3, and similarly for the other two rules. Considering the ratings shown in Table 5 and the varying influence of these rules, five prompting strategies were selected to populate the HPF; they prioritize comprehensive coverage of cognitive demands while ensuring that the set optimally widens the variation across complexity levels.

高 (H)、中 (M) 或低 (L)。需要注意的是，规则4的高评级比规则3的高评级影响更大，其他两条规则也类似。根据表5所示的评级以及这些规则的不同影响，选择了五种提示策略来填充HPF，这些策略优先考虑全面覆盖认知需求，同时确保集合能最优地扩大复杂度级别的变化范围。

Prompting Strategy	Rule 1	Rule 2	Rule 3	Rule 4
Role Prompting	L	L	L	L
Emotion Prompting	L	L	M	L
Zero-shot CoT	L	L	M	L
Meta Prompting	M	H	M	L
Three-shot CoT	H	H	M	L
Five-shot CoT	H	H	H	L
Chain-of-Verification	H	H	H	H
Least-to-Most	H	H	H	L
Self-Consistency	H	H	H	M
Generated Knowledge Prompting (GKP)	L	H	H	H

Table 5: Human judgment of influence of the rules of taxonomy on different prompting strategies in solving the tasks of the representative set. The ratings are provided based on a voting system involving all human annotators. Green represents the prompting strategies selected for populating the complexity levels of the HPF.

表 5: 人类对分类规则在不同提示策略解决代表性任务集时影响力的评判。评分基于所有人工标注员参与的投票系统得出。绿色表示被选入填充HPF复杂度等级的提示策略。

B Threshold Selection for Summarization and Translation Tasks

B 摘要和翻译任务的阈值选择

In addition to the 0.15 and 0.20 thresholds presented in the main paper, extended evaluations were conducted on the IWSLT and SamSum datasets using thresholds of 0.25 and 0.30 with GPT-4o, Mistral-Nemo 12B, and Llama-3 to assess the impact of varying thresholds on LLM performance.

除了主论文中提到的0.15和0.20阈值外，还在IWSLT和SamSum数据集上使用GPT-4o、Mistral-Nemo 12B和Llama-3进行了0.25和0.30阈值的扩展评估，以研究不同阈值对大语言模型性能的影响。

B.1 Summarization Task

B.1 摘要任务

In the summarization task, increasing the threshold evaluates an LLM’s ability to condense content while retaining key information. Higher thresholds like 0.25 and 0.30 reveal the trade-offs between conciseness and informativeness. However, as shown in Figure 8, there was no significant improvement in ROUGE-L, except for a slight increase with GPT-4o. The experiments showed a sharp rise in HPI, reflecting the increased task complexity. These results suggest that LLM performance has plateaued, with no further gains at higher thresholds. This confirms that the 0.15 and 0.20 thresholds used in the main paper are sufficient for optimal LLM performance.

在摘要任务中，提高阈值可以评估大语言模型在压缩内容的同时保留关键信息的能力。0.25和0.30等高阈值揭示了简洁性与信息量之间的权衡。然而，如图8所示，除了GPT-4o略有提升外，ROUGE-L指标并未出现显著改善。实验数据显示HPI值急剧上升，反映出任务复杂度的增加。这些结果表明大语言模型的性能已趋于稳定，在更高阈值下无法获得额外收益。这验证了主论文采用0.15和0.20阈值对于实现大语言模型最佳性能已经足够。

Figure 8: Comparison of HPI and ROUGE-L score across different threshold values in the summarization task.

图 8: 摘要任务中不同阈值下 HPI 与 ROUGE-L 分数的对比。

B.2 Translation Task

B.2 翻译任务

In machine translation, higher thresholds (0.25 and 0.30) impose stricter evaluations, assessing how well models capture the nuances of the source text. Lower thresholds (0.15 and 0.20) focus on general adequacy, while higher ones test performance under more challenging conditions. As shown in Figure 9, no BLEU improvements were observed across any LLMs, with models either reaching saturation or showing decreased performance alongside a rapid rise in HPI. This validates the selection of 0.15 and 0.20 thresholds in the main paper as sufficient for optimal LLM performance.

在机器翻译中，较高的阈值（0.25和0.30）会施加更严格的评估，衡量模型对源文本细微差别的捕捉能力。较低的阈值（0.15和0.20）侧重于整体充分性，而较高阈值则测试模型在更具挑战性条件下的表现。如图9所示，所有大语言模型均未观察到BLEU分数提升，模型要么达到饱和状态，要么在HPI快速上升的同时表现出性能下降。这验证了主论文选择0.15和0.20阈值对于大语言模型最优性能已足够充分。
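
The interaction between the threshold and the HPI can be illustrated with a small sketch. It rests on two assumptions that this appendix does not spell out: a sample counts as solved at a given level once its metric (BLEU or ROUGE-L) reaches the threshold, and a sample that no level solves receives the number of levels plus $\mathsf{HPI}_{Dataset}$ as its score. The function name and the per-level scores are hypothetical; only the IWSLT value 1.92 is taken from Table 4.

```python
def hpf_sample_score(scores_by_level, threshold, hpi_dataset):
    """Return the first HPF level whose metric meets the threshold.

    scores_by_level[k] is the metric obtained with the level-(k+1) prompt.
    If no level reaches the threshold, an assumed penalty of
    (number of levels + hpi_dataset) is returned instead.
    """
    for level, score in enumerate(scores_by_level, start=1):
        if score >= threshold:
            return float(level)
    return len(scores_by_level) + hpi_dataset


# Hypothetical BLEU scores for one IWSLT sample at levels 1 to 5.
bleu_by_level = [0.10, 0.18, 0.22, 0.27, 0.31]
IWSLT_HPI_DATASET = 1.92  # Table 4

for threshold in (0.15, 0.20, 0.25, 0.30, 0.35):
    score = hpf_sample_score(bleu_by_level, threshold, IWSLT_HPI_DATASET)
    print(f"threshold={threshold:.2f} -> score={score:.2f}")
```

With the same model outputs, a stricter threshold can only push the per-sample score up, which is in line with the rise in HPI reported at 0.25 and 0.30.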

Figure 9: Comparison of HPI and BLEU score across different threshold values in the translation task.

图 9: 翻译任务中不同阈值下 HPI 和 BLEU 分数的对比。

C LLM-as-a-Judge

C 大语言模型即裁判

C.1 Scoring Prompt Template

C.1 评分提示模板

The system prompt is designed to guide the LLM judge in evaluating different prompting strategies based on four specific criteria: Basic Recall and Reproduction, Understanding and Interpretation, Analysis and Reasoning, and Application of Knowledge and Execution. Each criterion is scored on a scale of 1-5. The evaluation uses GPT-4o as a judge, with the following system prompt:

该系统提示(prompt)旨在引导大语言模型(LLM)评委基于四项具体标准评估不同提示策略：基础记忆与复现、理解与阐释、分析与推理、知识应用与执行。每项标准按1-5分制评分。评估采用GPT-4o作为评委，系统提示如下：

You are a judge evaluating different prompting strategies and you need to score these prompting strategies based on pre-defined criteria. Different prompting strategies leverage varied amounts of intelligence from the model to achieve the required answer. So, assign the scores very carefully based on your analysis of the prompt and its effect on your intelligence to achieve the given answer as well as the number of multi-step prompts which increases the complexity of execution.

你是一位评委，正在评估不同的提示策略，需要根据预设标准对这些提示策略进行评分。不同的提示策略会利用模型不同程度的智能来获得所需答案。因此，请根据你对提示及其对智能影响的分析，以及增加执行复杂度的多步提示数量，谨慎分配分数。

score1: Basic Recall and Reproduction
Definition: The need of the model to remember and reproduce factual information without interpretation or analysis to answer the prompt
Range: 1-5

score1: 基础记忆与复现
定义: 模型需要在不进行解释或分析的情况下记忆并复现事实信息以回答提示
范围: 1-5

C.2 Hybrid Dataset

C.2 混合数据集

The hybrid dataset is composed of 1106 samples uniformly distributed over seven different task-specific datasets, covering a wide range of language understanding and generation tasks. This diversity allows for a comprehensive evaluation of the prompting strategies across various problem types. Each dataset contributes specific types of tasks:

混合数据集由1106个样本组成，均匀分布在七个不同的任务特定数据集上，涵盖了广泛的语言理解和生成任务。这种多样性使得能够全面评估各种问题类型的提示策略。评估使用的混合数据集由来自不同任务特定数据集的样本组成，每个数据集贡献特定类型的任务：

C.3 Scoring Method

C.3 评分方法

For each prompting strategy (Role Prompting, Zero-shot CoT, Three-shot CoT, Least to Most Prompting, Generated Knowledge Prompting), the system:

针对每种提示策略（角色提示、零样本思维链、少样本思维链、逐步提示、生成知识提示），系统：

This study ensured that both the human judge and the LLM judge utilized the same scoring methodology to eliminate any potential bias in the comparison.

本研究确保人类评审员和大语言模型评审员采用相同的评分方法，以消除比较中的潜在偏差。

D Limitations

D 限制

Human Annotation Constraints: A limitation of this study is the reliance on human evaluation for inducing the $\mathsf{HPI}_{Dataset}$ penalty into the HPF. While this study assessed 5% of the datasets, expanding coverage would offer a more comprehensive analysis. However, due to constraints in human resources for manual annotation, we could not include a larger portion. Future work could address this by increasing manpower or automating parts of the evaluation process.

人工标注限制：本研究的局限性在于依赖人工评估将 $\mathsf{H P I}_{D a t a s e t}$ 惩罚引入HPF。虽然本研究评估了 $5%$ 的数据集，但扩大覆盖范围将提供更全面的分析。然而，由于人工标注的人力资源限制，我们无法纳入更大比例的数据。未来工作可通过增加人力或自动化部分评估流程来解决这一问题。

HPF Optimization: The effectiveness of the HPF heavily relies on the quality of the prompts used at each level of the taxonomy. Crafting high-quality prompts that accurately reflect the subtleties of each level demands considerable expertise and repeated refinement. This study only investigated a limited set of prompting strategies within the HPF, indicating a need for further research into creating diverse structural frameworks and incorporating additional prompting strategies.

HPF优化：HPF的有效性很大程度上依赖于分类体系中每一层级所使用的提示词(prompt)质量。要制作出准确反映各层级细微差别的高质量提示词，需要大量专业知识和反复优化。本研究仅探讨了HPF框架内有限的提示策略，这表明未来需要进一步研究创建多样化的结构框架并整合更多提示策略。

Zero-shot Prompt Selection: HPF dynamically determines the optimal cognitive complexity level by iterating through the framework’s levels, which leads to increased inference time. While this study investigated Adaptive HPF for zero-shot prompt selection, it faced considerable hallucinations. Future research should focus on automating HPF using fine-tuning or reinforcement learning-based approaches to select the appropriate complexity level without manual iteration. This strategy would optimize inference time and improve overall performance.

零样本提示选择：HPF通过迭代框架层级动态确定最佳认知复杂度，这会导致推理时间增加。虽然本研究探索了自适应HPF在零样本提示选择中的应用，但仍存在显著幻觉问题。未来研究应聚焦于通过微调或基于强化学习的方法实现HPF自动化，无需手动迭代即可选择合适复杂度层级。该策略将优化推理时间并提升整体性能。

E Adaptive HPF

E 自适应HPF (Adaptive HPF)

E.1 HPI for Adaptive HPF

E.1 自适应HPF的HPI

The prompt-selector dynamically selects the most suitable prompting strategy for a given task’s complexity from the HPF’s hierarchy of complexity levels. To determine the most effective prompting strategy to complete the task, the prompt-selector is given a maximum number of iterations equal to the number of levels in the manual HPF. The score for the $i$-th iteration is $i + x$, where $x$ is the level of the HPF selected by the prompt-selector at the $i$-th iteration, at which the task is addressed. If the LLM fails to complete the task after all iterations, it is assigned a penalty, $\mathsf{HPI}_{Dataset}$. Here $m$ represents the total number of levels in the HPF, and $n$ denotes the total number of samples in the evaluation set.

提示选择器(prompt-selector)能够根据任务复杂度从手动构建的层次化提示框架(HPF)中动态选择最合适的提示策略。为确定完成任务的最有效提示策略，系统设定最大迭代次数等于HPF的层级数。第i次迭代的得分为$i+x$，其中$x$为提示选择器判定的复杂度等级。若大语言模型(LLM)在所有迭代后仍未能完成任务，则施加惩罚项$\mathsf{H P I}_{D a t a s e t}$。变量$x$表示第i次迭代时提示选择器选定的HPF层级，$m$表示HPF总层数，$n$代表评估集样本总数。
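
The scoring rule described above can be sketched as follows. This is an illustrative reading, not the released implementation: the function names are hypothetical, a selected level outside 1 to $m$ is treated as a failed iteration, and an unsolved sample is assumed to receive $m + \mathsf{HPI}_{Dataset}$. The two traces mirror the BoolQ and CSQA cases described in Appendix E.2.

```python
def adaptive_sample_score(attempts, hpi_dataset, m=5):
    """Score one sample under Adaptive HPF.

    attempts is a list of (selected_level, solved) pairs, one per iteration.
    The score is i + x for the first iteration i that solves the task at a
    valid level x; otherwise the assumed penalty m + hpi_dataset is returned.
    """
    for i, (level, solved) in enumerate(attempts[:m], start=1):
        if 1 <= level <= m and solved:
            return float(i + level)
    return m + hpi_dataset


def adaptive_hpi(all_attempts, hpi_dataset, m=5):
    """Mean per-sample score over the n samples of an evaluation set."""
    scores = [adaptive_sample_score(a, hpi_dataset, m) for a in all_attempts]
    return sum(scores) / len(scores)


# BoolQ-like trace: three failed Level 4 selections, solved at Level 2 on iteration 4.
boolq_trace = [(4, False), (4, False), (4, False), (2, True)]
# CSQA-like trace: Level 4, then an invalid Level 0, solved at Level 2 on iteration 3.
csqa_trace = [(4, False), (0, False), (2, True)]

print(adaptive_sample_score(boolq_trace, hpi_dataset=1.71))
print(adaptive_sample_score(csqa_trace, hpi_dataset=2.52))
```

Because the iteration index is added to the selected level, every wasted iteration raises the score, which is one way to read the higher HPI values reported for Adaptive HPF.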

Algorithm 2: Prompt-Selector for Adaptive HPF

算法 2: 自适应HPF提示选择器

E.2 Hallucination in Adaptive HPF

E.2 自适应HPF中的幻觉现象

Hallucinations in the prompt-selector refer to instances where the LLM generates incorrect or misleading prompting levels or nonsensical information that is not supported by the HPF. These hallucinations can occur across various tasks, including question answering, multiple-choice questions, translation, and summarization.

提示选择器中的幻觉 (hallucination) 是指大语言模型生成与层级提示框架 (HPF) 不符的错误提示级别或无意义信息的现象。这类幻觉可能出现在问答、选择题、翻译和摘要等多种任务场景中。

For the BoolQ task, the prompt-selector initially struggles, as indicated by the iterations in which it selects Level 4 with hallucinations (represented by '...'). However, by the fourth iteration, Llama-3 8B answers correctly at Level 2. For the CSQA task, the prompt-selector initially hallucinates, producing Level 4 and Level 0 (not included in the HPF) responses. It corrects itself by the third iteration, providing the correct answer at Level 2. For the IWSLT task, the prompt-selector hallucinates consistently across multiple iterations. Even though Llama-3 8B attempts the translation at Level 2 multiple times, it ultimately fails to provide a correct translation, indicating a persistent hallucination. For the SamSum task, the prompt-selector hallucinates in the first three iterations (Level 4). By the fourth and fifth iterations, however, it starts producing lower levels, and Llama-3 8B reaches the correct answer at Level 2 in the last iteration.

在BoolQ任务中，提示选择器(prompt-selector)最初表现挣扎，表现为带有幻觉(用"..."表示)的4级响应。但到第四次迭代时，Llama-3 8B成功在2级给出正确答案。对于CSQA任务，提示选择器初期出现幻觉现象，表现为4级和0级(未包含在HPF中)响应，最终在第三次迭代时自我修正，在2级提供正确答案。在IWSLT任务中，提示选择器表现出持续的多轮次幻觉模式，尽管Llama-3 8B多次尝试2级翻译，但始终未能输出正确译文，显示顽固性幻觉问题。SamSum任务中，提示选择器在前三次迭代(4级)呈现初始幻觉，但在第四、五次迭代开始产生更低级别响应，最终Llama-3 8B在最后一次迭代中以2级达成正确答案。

The results in Table 6 and Table 7 indicate that the prompt-selector hallucinates when selecting complexity levels across various tasks and iterations, resulting in higher HPI for Adaptive HPF, with performance varying significantly. While the LLM can eventually produce correct answers, as seen in the BoolQ and SamSum tasks, it often requires multiple attempts and may still fail in tasks like IWSLT translation.

表6和表7的结果表明，提示选择器(promptselector)在不同任务和迭代中选择复杂度级别时会出现幻觉，导致自适应HPF(Adaptive HPF)的HPI更高，且性能差异显著。虽然大语言模型最终能生成正确答案(如BoolQ和SamSum任务所示)，但通常需要多次尝试，在IWSLT翻译等任务中仍可能失败。

Table 6: HPI (lower is better) of LLMs across datasets (with thresholds) for Adaptive HPF.

表 6: 大语言模型在不同数据集上的自适应HPF HPI值(越低越好) (含阈值)

Model	BoolQ	CSQA	IWSLT (0.15)	IWSLT (0.20)	SamSum (0.15)	SamSum (0.20)
Llama-3 8B	5.2173	5.9136	6.2006	6.2841	5.0316	5.5756
Mistral 7B	5.0483	5.9073	6.2478	6.4604	4.7423	5.1336
Phi-3 3.8B	5.1386	5.6793	6.3955	6.4936	5.0961	5.7778
Gemma 7B	5.1514	5.5771	6.5947	6.6605	5.7229	6.4347

Table 7: Performance scores of LLMs across datasets for Adaptive HPF.

Dataset	Metric	Threshold	Llama-3 8B	Phi-3 3.8B	Mistral 7B	Gemma 7B
BoolQ	Accuracy		0.88577	0.91115	0.91752	0.91166
CSQA	Accuracy		0.59451	0.68019	0.60111	0.68549
IWSLT	BLEU	0.15	0.21140	0.15557	0.20000	0.08447
IWSLT	BLEU	0.20	0.21146	0.15354	0.20568	0.07730
SamSum	ROUGE-1	0.15	0.24407	0.20586	0.26910	0.16023
SamSum	ROUGE-1	0.20	0.24981	0.21580	0.28335	0.16179

表 7: 大语言模型在自适应 HPF 各数据集上的性能得分。

E.3 Prompt Template for Prompt-Selector

E.3 提示选择器 (Prompt-Selector) 的提示模板

The prompt-selector in Adaptive HPF selects the prompting level based on the task complexity to address the task. Llama-3 8B serves as the prompt-selector in the experiments. The prompt template was meticulously designed to ensure maximum clarity, aiming to reduce hallucinations and select the most effective prompting strategy.

自适应HPF中的提示选择器(prompt-selector)根据任务复杂度选择提示级别来处理任务。实验中采用Llama $38\mathbf{B}$作为提示选择器。提示模板经过精心设计以确保最大清晰度，旨在减少幻觉并选择最有效的提示策略。

Prompt Template: Choose the most effective prompting strategy among five available strategies for the task. Begin with the lowest indexed strategy and progress to higher indexed strategies if the earlier ones are not effective. For a given task, the prompting strategies are:

提示模板：从五种可用策略中选择最高效的提示策略执行任务。从编号最低的策略开始尝试，若无效则逐步尝试更高编号策略。给定任务的提示策略包括：

Select only the index (do not provide the name) of the most effective prompting strategy.

仅选择最有效提示策略的索引（不要提供名称）。
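
A minimal sketch of how such a template might be assembled and how the selector's reply might be validated is given below. The strategy names come from the five HPF levels in Appendix H; the numbered-list layout, the trailing task line, and the helper names are assumptions made for illustration. Replies that contain no index, or an index outside 1 to 5 (such as the Level 0 output mentioned in Appendix E.2), are treated as hallucinations and rejected.

```python
import re

# The five HPF levels, in order of increasing complexity (Appendix H).
STRATEGIES = (
    "Role Prompting",
    "Zero-shot Chain-of-Thought Prompting",
    "Three-shot Chain-of-Thought Prompting",
    "Least-to-Most Prompting",
    "Generated Knowledge Prompting",
)


def build_selector_prompt(task):
    """Assemble the selector prompt; the list layout and task line are assumed."""
    listing = "\n".join(f"{i}. {name}" for i, name in enumerate(STRATEGIES, start=1))
    return (
        "Choose the most effective prompting strategy among five available "
        "strategies for the task. Begin with the lowest indexed strategy and "
        "progress to higher indexed strategies if the earlier ones are not "
        "effective. For a given task, the prompting strategies are:\n"
        f"{listing}\n"
        "Select only the index (do not provide the name) of the most "
        "effective prompting strategy.\n"
        f"Task: {task}"
    )


def parse_level(response):
    """Return the selected level (1 to 5), or None if the reply is unusable."""
    match = re.search(r"\d+", response)
    if match is None:
        return None  # no index at all, e.g. a reply of '...'
    level = int(match.group())
    return level if 1 <= level <= len(STRATEGIES) else None


print(build_selector_prompt("Answer True/False: is the sky green?"))
print(parse_level("2"), parse_level("0"), parse_level("..."))
```

Rejecting out-of-range indices keeps a hallucinated level from being scored as if it were a valid HPF level.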

F Computational Budget

F 计算预算

All evaluation experiments and ablation studies were conducted on V100 GPUs (16GB and 32GB variants), using around 9,000 computation hours in total, which equates to approximately 1.125 petaflop-hours of computational resources.

所有评估实验和消融研究均在V100 GPU(16GB和32GB版本)上进行，总计使用了约9,000计算小时，相当于约1.125 petaflop-hour的计算资源。

G Large Language Models Used for Evaluation

用于评估的大语言模型

The HPF supports leading open source and proprietary LLMs and includes mechanisms for optimizing performance through advanced quantization techniques. The experiments were conducted on the following instruction-tuned LLMs, and the model description and licenses are discussed in Table 8.

HPF支持领先的开源和专有大语言模型，并通过先进的量化技术提供性能优化机制。实验在以下经过指令调优的大语言模型上进行，模型描述和许可协议详见表8。

The LLMs were loaded in 4-bit precision format, with a maximum generation limit of 1024 tokens per run to ensure concise outputs. The temperature was set to 0.6 to control prediction randomness, while top-p sampling ($p = 0.9$) enabled the exploration of diverse continuations. Additionally, a repetition penalty was applied to discourage the generation of repeated phrases, promoting coherent and varied text output.

大语言模型以4位精度格式加载，每次运行最多生成1024个token以确保输出简洁。温度参数设为0.6以控制预测随机性，同时采用top-p采样 $(p=0.9)$ 来探索多样化的续写可能。此外，应用重复惩罚机制避免生成重复短语，从而提升文本输出的连贯性与多样性。
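
To make the effect of these two decoding settings concrete, the sketch below applies temperature scaling followed by top-p (nucleus) filtering to a toy set of logits. It is a generic, library-free illustration of the technique rather than the inference code used in the experiments; the function name and the logits are hypothetical, and the repetition penalty is left out because its value is not stated.

```python
import math


def sampling_distribution(logits, temperature=0.6, top_p=0.9):
    """Return {token_index: probability} after temperature and top-p filtering."""
    # Temperature below 1 sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens so the result sums to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


dist = sampling_distribution([2.0, 1.0, 0.0, -1.0])
print(dist)
```

For these toy logits only the two most likely tokens survive the filter, so sampling stays varied without drawing from the low-probability tail.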

H Prompt Templates

H 提示模板

H.1 Level 1: Role Prompting

H.1 第一级：角色提示

Role prompting represents the most basic interaction with an LLM, assigning it a specific role or task without additional context or examples. This approach relies solely on the initial instruction to guide responses. For instance, asking the LLM to “act as a translator” prompts it to translate text based on its training data. While straightforward, this method may lack depth, resulting in less accurate or nuanced outputs. Table 9 shows the prompt templates used for role prompting across all datasets in the experiments.

角色提示 (Role prompting) 是与大语言模型最基本的交互方式，仅为其分配特定角色或任务，不提供额外上下文或示例。该方法完全依赖初始指令引导回答。例如，要求大语言模型"扮演翻译员"会促使其基于训练数据翻译文本。虽然简单直接，但这种方法可能缺乏深度，导致输出准确性或细致度不足。表9展示了实验中所有数据集使用的角色提示模板。

H.2 Level 2: Zero-shot Chain-of-Thought Prompting

H.2 第二级：零样本思维链提示 (Zero-shot Chain-of-Thought Prompting)

Zero-shot Chain-of-Thought (CoT) prompting enhances basic role prompting by requiring the LLM to generate a reasoning process for a task, despite not being explicitly trained on similar examples. This method encourages the LLM to break down the problem and solve it step-by-step using its internal knowledge, improving response quality through logical progression and coherence. Table 10 displays the prompt templates used for Zero-CoT across all datasets in the experiments.

零样本思维链 (Zero-shot Chain-of-Thought, CoT) 提示通过要求大语言模型为任务生成推理过程来增强基础角色提示，即使没有针对类似示例进行明确训练。这种方法鼓励大语言模型分解问题并利用其内部知识逐步解决，通过逻辑推进和连贯性提高响应质量。表 10 展示了实验中所有数据集使用的 Zero-CoT 提示模板。

Table 8: License information for LLMs used in the experiments.

表 8: 实验中使用的LLM许可信息

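Model	License Type	Usage Restrictions
GPT-4o	Proprietary	Commercial use requires paid API access; subject to OpenAI terms of service
Claude 3.5 Sonnet	Proprietary	Commercial use requires paid API access; subject to Anthropic terms of service
Mistral-Nemo 12B	Proprietary	Use may be limited to licensed partners or specific use cases
Gemma-2 9B	Research	Non-commercial use only, for research purposes
Llama-3 8B	Research	Specific restrictions may apply; generally for non-commercial research use
Mistral 7B	Open source	Broad use permitted; the original license and notices must be included
Gemma 7B	Open source / Research	Depending on the license type, may carry non-commercial restrictions or permit broad use
Phi-3 3.8B	Open source	Broad use permitted; the original license and notices must be included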

Table 9: Prompt templates of different datasets for Role Prompting.

表 9: 角色提示(Role Prompting)不同数据集的提示模板。

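Dataset	Prompt
BoolQ	As an omniscient person, based on the passage: "passage", answer True/False to the question: "question".
CSQA	As a critical thinker, choose the answer: "question", A. "option 1", B. "option 2", C. […]
IWSLT	As a translator, translate "english text" into French.
SamSum	As a summarizer, summarize the dialogue: "dialogue".
GSM8k	As an expert mathematician, based on the question: "question", calculate the numerical answer to the question.
HumanEval	[…] an expert programmer.
MMLU	As a critical thinker, choose the answer: "question", A. "option 1", B. "option 2", C. "option 3", D. "option 4".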

H.3 Level 3: Three-Shot Chain-of-Thought Prompting

H.3 第三级：三样本思维链提示 (Three-Shot Chain-of-Thought Prompting)

Three-shot Chain-of-Thought (CoT) prompting builds on the zero-shot approach by providing the LLM with three task examples, including the reasoning steps used to reach the solution. These examples help the LLM grasp the required structure and logic, enabling it to better replicate the problem-solving process and produce more accurate, contextually relevant responses. Table 11 shows the prompt templates used for 3-CoT across all datasets in the experiments.

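
The three-shot structure can be sketched as follows: three solved examples are concatenated ahead of the unanswered query. The wording is an English rendering of the GSM8k row of Table 11; the function name and the `(question, worked_answer)` pair layout are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Sketch only: instruction wording follows the GSM8k row of Table 11
# (rendered in English); names and data layout are hypothetical.
INSTRUCTION = 'Based on the question: "{q}", calculate the numerical answer to the question.'


def build_three_shot_prompt(examples: Sequence[Tuple[str, str]], question: str) -> str:
    """Concatenate three worked examples, then the query left unanswered."""
    if len(examples) != 3:
        raise ValueError("three-shot prompting expects exactly three examples")
    # Each example pairs a question with an answer that includes its reasoning.
    parts = [f'{INSTRUCTION.format(q=q)} Answer: "{a}".' for q, a in examples]
    parts.append(INSTRUCTION.format(q=question))
    return " ".join(parts)


demo = [
    ("What is 1 + 1?", "1 + 1 = 2. The answer is 2"),
    ("What is 2 + 3?", "2 + 3 = 5. The answer is 5"),
    ("What is 4 - 1?", "4 - 1 = 3. The answer is 3"),
]
print(build_three_shot_prompt(demo, "What is 5 + 5?"))
```

Keeping the query in the same form as the examples, minus the answer, is what lets the model continue the pattern.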

H.4 Level 4: Least-to-Most Prompting


Least-to-most prompting is an advanced technique that gradually increases prompt complexity, starting with simpler tasks and progressing to more complex challenges. This method allows the LLM to build confidence and leverage insights from easier prompts to tackle harder ones, enhancing its ability to generalize from straightforward examples to intricate scenarios. Table 12 displays the prompt templates used for Least-to-Most Prompting across all datasets in the experiments.

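
A rough sketch of the least-to-most chain, using the four SamSum prompts from Table 12 rendered in English. Each step receives the previous response. The `llm` callable, the function name, and the `{previous}` slot in the fourth prompt (which is cut off in the table) are assumptions, not the authors' implementation.

```python
from typing import Callable, List

# Sketch only: English renderings of the SamSum prompts in Table 12.
# The "{previous}" slot in the last step is an assumption; that template
# is truncated in the table.
SAMSUM_STEPS = [
    'List the main points or key ideas in this dialogue: "{dialogue}".',
    'Elaborate on the following key points, providing more detail or context: "{previous}".',
    'Using the listed key points and their elaborations, draft a concise summary of this text: "{dialogue}".',
    'Refine this draft summary to make it more concise: "{previous}".',
]


def run_least_to_most(dialogue: str, llm: Callable[[str], str]) -> List[str]:
    """Run the prompts in order, feeding each response into the next step."""
    responses: List[str] = []
    previous = ""
    for template in SAMSUM_STEPS:
        prompt = template.format(dialogue=dialogue, previous=previous)
        previous = llm(prompt)  # hypothetical model call supplied by the caller
        responses.append(previous)
    return responses


# Usage with a stand-in model that just labels each call.
counter = {"n": 0}


def echo_llm(prompt: str) -> str:
    counter["n"] += 1
    return f"response {counter['n']}"


print(run_least_to_most("A: Lunch at noon? B: Sure.", echo_llm)[-1])
```

The last element of the returned list is the final answer; earlier elements are the intermediate sub-task outputs.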

H.5 Level 5: Generated Knowledge Prompting


Generated Knowledge prompting is one of the most complex techniques in HPF: the LLM not only addresses the task but also integrates relevant additional information to enhance its response. This method prompts another LLM to produce auxiliary knowledge, creating a richer context for understanding and solving the problem. By leveraging these generated insights, the LLM can deliver more detailed, accurate, and nuanced answers. Table 13 shows the prompt templates used for Generated Knowledge Prompting across all datasets in the experiments.

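
The two-stage flow can be sketched as below, using the SamSum prompts from Table 13 rendered in English. One model generates auxiliary knowledge and a second model answers with that knowledge in context. Both callables, the function name, and the trailing `Explanation:` slot are assumptions for illustration; the table does not show how the knowledge is spliced into the inference prompt.

```python
from typing import Callable


def generated_knowledge_summary(
    dialogue: str,
    knowledge_llm: Callable[[str], str],
    task_llm: Callable[[str], str],
) -> str:
    """Stage 1: generate knowledge. Stage 2: solve the task using it."""
    # Knowledge generation prompt (SamSum row of Table 13, English rendering).
    knowledge = knowledge_llm(f'Generate an explanation of the dialogue: "{dialogue}".')
    # Inference prompt; the "Explanation:" slot is an assumed way to pass
    # the generated knowledge to the task model.
    inference_prompt = (
        f'Summarize the dialogue using the explanation of the dialogue: "{dialogue}". '
        f'Explanation: "{knowledge}".'
    )
    return task_llm(inference_prompt)


# Usage with stand-in models.
print(
    generated_knowledge_summary(
        "A: Lunch at noon? B: Sure.",
        knowledge_llm=lambda p: "Two people are arranging lunch.",
        task_llm=lambda p: "A and B agree to have lunch at noon.",
    )
)
```

Separating the two callables makes it easy to use a different model for knowledge generation than for the final answer.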

Table 10: Prompt templates of different datasets for Zero-shot Chain-of-Thought Prompting.

Dataset	Prompt template
BoolQ	[…] "question". Let's think step by step.
CSQA	[…] "option 4", E. "option 5". Let's think step by step.
IWSLT	[…]
SamSum	[…]
GSM8k	Based on the question: "question", calculate the numerical answer to the question. Let's think step by step.
HumanEval	[…] Let's think step by step.
MMLU	[…] "option 3", D. "option 4". Let's think step by step.

Table 11: Prompt templates of different datasets for Three-Shot Chain-of-Thought Prompting.

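Dataset	Prompt template
BoolQ	[…] Based on the passage: "passage2", answer True/False to the question: "question2". Answer: "answer2". Explanation: "explaination2". Based on the passage: "passage3", answer […]
CSQA	Choose the answer: "question1", A. "option1-1", B. "option2-1", C. "option3-1", D. "option4-1", E. "option5-1". Answer: "answer1". Explanation: "explaination1". Choose the answer: "question2", A. "option1-2", B. "option2-2", C. "option3-2", D. "option4-2", E. "option5-2". Answer: "answer2". Explanation: "explaination2". Choose the answer: "question3", A. "option1-3", B. "option2-3", C. "option3-3", D. "option4-3", E. "option5-3". Answer: "answer3". Explanation: "explaination3". Choose the answer: "question", A. "option 1", B. "option 2", C. "option 3", D. "option 4", E. "option 5".
IWSLT	[…] Translate "english text2" into French. French: "french text2". Translate "english text3" into French. French: "french text3". Translate "english text" into French.
SamSum	Summarize the dialogue: "dialogue1". Summary: "summary1". Summarize the dialogue: "dialogue2". Summary: "summary2". Summarize the dialogue: "dialogue3". Summary: "summary3". Summarize the dialogue: "dialogue".
GSM8k	Based on the question: "gsm8k_question1", calculate the numerical answer to the question. Answer: "gsm8k_ans1". Based on the question: "gsm8k_question2", calculate the numerical answer to the question. Answer: "gsm8k_ans2". Based on the question: "gsm8k_question3", calculate the numerical answer to the question. Answer: "gsm8k_ans3". Based on the question: "question", calculate the numerical answer to the question.
HumanEval	Complete the code according to the given constraints: "humaneval_code1". Code: "humaneval_sol1". Complete the code according to the given constraints: "humaneval_code2". […] Complete the code according to the given constraints: "humaneval_code3". Code: "humaneval_sol3". […]
MMLU	[…] "option 3", D. "option 4". […] was involuntary. B. The defendant's statement was voluntary. C. The defendant was not in custody when the statement was made. D. […] "mmlu_ques3". A. False, False. B. False, True. C. True, False. D. True, True. Answer: B. Explanation: "mmlu_exp3". Choose […]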

Table 12: Prompt templates of different datasets for Least-to-Most Prompting.

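Dataset	Prompt
BoolQ	prompt 1: Summarize the main points of this passage: "passage". prompt […] "question", passage: "passage". prompt 4: Based on the passage, answer this question: "question", relevant information: "previous response".
CSQA	prompt 1: Analyze this question: "question". […] "option 5". prompt 3: Based on the analysis: "previous response", eliminate the incorrect options among: A. "option 1", B. "option 2", C. "option 3", D. "option 4", E. "option 5". prompt 4: Choose the correct answer from the options: A. "option 1", B. […]
IWSLT	prompt 2: Identify and list the key phrases or terms in this text: "english text". […] response". Translation of the key phrases: "previous response".
SamSum	prompt 1: List the main points or key ideas in this dialogue: "dialogue". prompt 2: Elaborate on the following key points, providing more detail or context: "previous response". prompt 3: Using the listed key points and their elaborations, draft a concise summary of this text: "dialogue". prompt 4: Refine this draft summary to make it more concise […]
GSM8k	Analyze the question: "question". Break the question down into sub-questions: "question". Calculate the answers to the sub-questions: "pred". Based on the previous calculations, calculate the numerical answer to this question: "question": "pred"
HumanEval	[…] the code based on the stated constraints: "code", based on the previous calculations: "pred"
MMLU	Question: "question", options: A. "option 1" B. "option 2" C. "option 3" D. "option 4". Based on the analysis: "question", eliminate the incorrect options […] D. "option 4".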

Table 13: Prompt templates of different datasets for Generated Knowledge Prompting.

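Dataset	Prompt
BoolQ	Knowledge generation prompt: Generate knowledge about the passage: "passage".
CSQA	[…] knowledge about the question: "knowledge". Knowledge generation prompt: Generate knowledge about the question: "question".
IWSLT	[…] Keyword definitions: "knowledge". Knowledge generation prompt: Generate a French definition for each word in the text: "english text".
SamSum	Inference prompt: Summarize the dialogue using the explanation of the dialogue: "dialogue". Knowledge generation prompt: Generate an explanation of the dialogue: "dialogue".
GSM8k	Based on the question: "question", calculate the numerical answer to the question using the explanation of the question: "pred"
HumanEval	Complete the code based on the mentioned constraints: "code", using the knowledge of the constraints: "pred"
MMLU	Choose the answer. "question", options: A. "option 1" B. "option 2" C. […]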

Original paper: https://arxiv.org/pdf/2406.12644v4