[论文翻译]预测大语言模型问答性能的语义一致性方法


原文地址:https://arxiv.org/pdf/2311.01152


Predicting Question-Answering Performance of Large Language Models through Semantic Consistency

预测大语言模型问答性能的语义一致性方法

Ella Rabinovich Eitan Farchi


Orna Raz


Samuel Ackerman Ateret Anaby-Tavor


IBM Research {ella.rabinovich1, samuel.ackerman}@ibm.com {ornar, farchi, atereta}@il.ibm.com


Abstract

摘要

Semantic consistency of a language model is broadly defined as the model’s ability to produce semantically-equivalent outputs, given semantically-equivalent inputs. We address the task of assessing question-answering (QA) semantic consistency of contemporary large language models (LLMs) by manually creating a benchmark dataset with high-quality paraphrases for factual questions, and release the dataset to the community.

语言模型的语义一致性广义上定义为模型在给定语义等价输入时产生语义等价输出的能力。我们通过手动创建包含高质量事实问题改述的基准数据集来评估当代大语言模型(LLM)的问答(QA)语义一致性,并将该数据集向社区公开。

We further combine the semantic consistency metric with additional measurements suggested in prior work as correlating with LLM QA accuracy, for building and evaluating a framework for factual QA reference-less performance prediction – predicting the likelihood of a language model to accurately answer a question. Evaluating the framework on five contemporary LLMs, we demonstrate encouraging results that significantly outperform baselines.

我们进一步将语义一致性指标与先前研究中建议的、与大语言模型问答准确率相关的其他测量方法相结合,用于构建和评估一个无参考事实问答性能预测框架——预测语言模型准确回答问题的可能性。通过在五种当代大语言模型上评估该框架,我们展示了令人鼓舞且显著优于基线的结果。

1 Introduction

1 引言

Consistency of a model is broadly defined as the invariance of its behavior under meaning-preserving variations of its input (Elazar et al., 2021; Raj et al., 2022). Clearly, consistency is a highly desirable property of large language models, increasing their safety, robustness and trustworthiness. Here we address the question of factual consistency of LLMs in the context of open-domain zero-shot factual question answering. As a concrete example, a consistent model will produce the same answer for the set of questions {“What is Stevie Cameron’s occupation?”, “What job does Stevie Cameron do?”, “What does Stevie Cameron earn a living as?”}. A model’s consistency metric is defined to be agnostic to answers’ accuracy (Elazar et al., 2021), meaning that semantically-equivalent (possibly incorrect) outputs are qualified as consistent. As such, while the correct answer to the questions above is “journalist”, three other identical answers (e.g., “politician”) will score as perfectly consistent.

模型的一致性(consistency)通常定义为在输入保持语义不变的情况下,其行为表现的不变性(Elazar et al., 2021; Raj et al., 2022)。显然,一致性是大语言模型高度期望具备的特性,能提升其安全性、鲁棒性和可信度。本文探讨大语言模型在开放域零样本事实问答中的事实一致性问题。举例而言,一个具备一致性的模型对于问题集{"Stevie Cameron的职业是什么?","Stevie Cameron从事什么工作?","Stevie Cameron以什么谋生?"}会给出相同答案。模型的一致性指标定义与答案准确性无关(Elazar et al., 2021),这意味着语义等价(可能不正确)的输出仍被视为一致。因此,虽然上述问题的正确答案是"记者",但三个相同的错误答案(如"政治家")仍会被判定为完全一致。

Semantic consistency of masked language models (MLMs) has been studied by Elazar et al. (2021), who inspected masked tokens as predicted by encoder models, for alternations of word tuples, using a dataset of factual statements and their crowd-sourced paraphrases, specifically tailored for working with MLMs. Raj et al. (2022) evaluated semantic consistency of decoder models for the task of non-factual question answering, experimenting with a range of consistency metrics. The authors automatically generated paraphrases for questions in the TruthfulQA dataset (Lin et al., 2022), and scored a model’s consistency as its robustness to paraphrases. However, the sub-optimal quality of automatic paraphrases, along with open and often lengthy nature of answers to questions,1 as well as multiple (occasionally semantically diverse) reference answers, challenge benchmarking of LLMs’ QA consistency using TruthfulQA.

Elazar等人(2021)研究了掩码语言模型(MLM)的语义一致性,他们使用专门为MLM设计的事实陈述数据集及其众包改写版本,通过检查编码器模型对单词元组变体的掩码token预测来评估。Raj等人(2022)针对非事实性问答任务评估了解码器模型的语义一致性,实验了多种一致性指标。作者们为TruthfulQA数据集(Lin等人,2022)中的问题自动生成改写,并将模型对改写的鲁棒性作为其一致性评分。然而,自动改写质量欠佳、问题答案的开放性和通常较长的特性,以及多个(有时语义多样的)参考答案,使得使用TruthfulQA对大语言模型的问答一致性进行基准测试面临挑战。

A benchmark dataset for measuring the robustness of LLMs to paraphrases in the context of factual QA should satisfy two desirable properties: (1) strictly semantically-equivalent question paraphrases, and (2) questions that call for a single short (possibly multi-word) answer, facilitating accurate evaluation. Using the recently introduced PopQA dataset with over 14K factual questions (Mallen et al., 2023), we create its carefully curated extended version—PopQA-TP (PopQA templated paraphrases)—where 3–10 manually-created alternations were appended for each original question. The final dataset comprises over 118K questions, while preserving metadata (e.g., reference answers) from the original PopQA. We further use this dataset for benchmarking factual semantic consistency of multiple encoder-decoder and decoder-only LLMs. The dataset is made available for the community at https://huggingface.co/datasets/ibm/popqa-tp.

衡量大语言模型(LLM)在事实问答(QA)场景下对改写问题的鲁棒性时,基准数据集应满足两个关键特性:(1)严格语义等价的问题改写,(2)要求单一简短(可能多词)答案的问题,以确保评估准确性。基于近期推出的包含14K+事实问题的PopQA数据集(Mallen等人,2023),我们创建了其精心扩展版本——PopQA-TP(PopQA模板改写),为每个原始问题附加3-10个人工创建的改写变体。最终数据集包含超过118K个问题,同时保留了原始PopQA的元数据(如参考答案)。我们进一步使用该数据集对多种编码器-解码器和纯解码器架构的大语言模型进行事实语义一致性基准测试。数据集已在https://huggingface.co/datasets/ibm/popqa-tp向社区开放。

We next demonstrate that robustness to question paraphrases correlates with a model’s answer correctness for the given question. Practically, this finding means that semantic consistency score is predictive of a model accuracy. Combining this predictor with additional metrics suggested in prior work as correlating with LLM QA correctness, we perform a comprehensive regression analysis of the predictive power of various metrics on the model’s accuracy, as well as interactions between those metrics. Moreover, we show that the developed framework can be used for predicting the likelihood of a language model to accurately answer a factual question. Collectively, these results pave the way for the extremely challenging, yet highly important, task of question-answering performance prediction, a reference-less evaluation of QA performance, in the absence of ground-truth answers.

接下来我们证明,对问题改述的鲁棒性与模型对给定问题的回答正确性相关。实际上,这一发现意味着语义一致性分数可以预测模型的准确性。我们将这一预测指标与先前工作中提出的其他与大语言模型问答正确性相关的指标相结合,对各种指标在模型准确性上的预测能力以及这些指标之间的相互作用进行了全面的回归分析。此外,我们还表明,所开发的框架可用于预测语言模型准确回答事实性问题的可能性。总的来说,这些结果为极具挑战性但极其重要的问题回答性能预测任务(即在没有真实答案的情况下对问答性能进行无参考评估)铺平了道路。

The contribution of this work is, therefore, twofold: First, we introduce and release a large extension of the PopQA dataset (PopQA-TP), with high-quality paraphrases, that can be used for benchmarking QA semantic consistency of LLMs. Second, we develop a prototype model for QA performance prediction, allowing for comparative analysis of various metrics, and demonstrating predictive power much higher than baselines.

因此,本工作的贡献有两点:首先,我们引入并发布了PopQA数据集的大规模扩展版本(PopQA-TP),包含高质量释义,可用于评估大语言模型的问答语义一致性。其次,我们开发了一个问答性能预测的原型模型,支持多种指标的对比分析,其预测能力显著优于基线水平。

2 Dataset

2 数据集

Benchmarking semantic consistency of LLMs requires high quality question alternations, eliminating possible confounds that stem from issues in automatic paraphrase generation. Despite the immense advances in paraphrasing models during the past few years (e.g., Bandel et al. 2022; Raj et al. 2022; Rahamim et al. 2023), automatic tools still occasionally produce paraphrases that are not meaning-preserving (e.g., “Who is the vocalist of ‘Perfect’?” for the original question “Who is the composer of ‘Perfect’?”), incomplete (e.g., “Who is the vocalist of ‘Perfect’? Shape of You”), or violate, albeit infrequently, grammatical rules (e.g., “Tap water’s safe drinking?" as a paraphrase of “Is tap water safe to drink?"). Aiming at a high-quality benchmark dataset, we opted to manually construct paraphrase templates specific to each question category in PopQA, as detailed below.

评估大语言模型(LLM)的语义一致性需要高质量的问题变体,以消除自动转述生成可能带来的干扰。尽管过去几年转述模型取得了巨大进步(例如 Bandel et al. 2022; Raj et al. 2022; Rahamim et al. 2023),但自动工具仍偶尔会产生非语义保留的转述(例如将原问题"Who is the composer of 'Perfect'?"转述为"Who is the vocalist of 'Perfect'?")、不完整的转述(例如"Who is the vocalist of 'Perfect'? Shape of You"),或违反语法规则的情况(例如将"Is tap water safe to drink?"转述为"Tap water's safe drinking?")。为确保基准数据集的高质量,我们选择针对PopQA中每个问题类别手动构建特定的转述模板,具体如下。

2.1 Paraphrase Templates Creation

2.1 释义模板创建

Each question $q\in\mathrm{PopQA}$ is formed by substituting a single-entity subject into a question template that is fixed for each category. For instance, the occupation and religion templates are “What is [subj]’s occupation?" and “What is the religion of [subj]?", respectively. These fixed templates are sometimes grammatically awkward depending on the type of subject, for instance for the religion category subject ‘Assumption of Mary’.

每个问题 $q\in\mathrm{PopQA}$ 都是通过将单实体主语代入每个类别固定的问题模板而形成的。例如,职业和宗教的模板分别是"What is [subj]'s occupation?"和"What is the religion of [subj]?"。这些固定模板有时会因主语类型不同而显得语法不自然,例如宗教类别的主语"Assumption of Mary"。

We create the paraphrase question dataset by manually creating multiple paraphrase templates specific to each category, and substituting the subject of each $q$ in PopQA into each template, yielding a set of paraphrases denoted by $P(q)$ . Thus, each question in a given category has the same number of paraphrases. We name the resulting dataset PopQA-TP (PopQA templated paraphrases), which thus consists of $\{P(q)\cup\{q\}:q\in\mathrm{PopQA}\}$ , that is, the original questions and their paraphrases.

我们通过为每个类别手动创建多个特定于该类别的改写模板,并将PopQA中每个问题$q$的主语替换到每个模板中,生成一组改写问题,记为$P(q)$。因此,给定类别中的每个问题都具有相同数量的改写版本。我们将最终得到的数据集命名为PopQA-TP(PopQA模板改写集),该数据集由$\{P(q)\cup\{q\}:q\in\mathrm{PopQA}\}$构成,即原始问题及其改写版本。

Table 1 shows summary statistics of the number of questions, by category and overall, for both the original PopQA and our PopQA-TP datasets. Examples of original questions and paraphrases in PopQA-TP are reported in Table 2.

表1展示了原始PopQA数据集和我们构建的PopQA-TP数据集中各问题类别及总体的问题数量统计摘要。表2列举了PopQA-TP数据集中原始问题及其改写版本的具体示例。

| category | #Q | #Q alternatives | total #Q |
|---|---|---|---|
| author | 1514 | 6 | 9084 |
| capital | 645 | 6 | 4515 |
| capital of | 363 | 3 | 1452 |
| color | 34 | 5 | 204 |
| composer | 978 | 5 | 5868 |
| country | 838 | 9 | 8380 |
| director | 1999 | 10 | 21989 |
| father | 570 | 4 | 2850 |
| genre | 1619 | 6 | 11333 |
| mother | 187 | 5 | 1122 |
| occupation | 532 | 5 | 3192 |
| place of birth | 584 | 6 | 4088 |
| producer | 1520 | 10 | 16720 |
| religion | 338 | 5 | 2028 |
| screenwriter | 1999 | 10 | 21989 |
| sport | 547 | 6 | 3829 |
| total | 14267 | – | 118643 |
| 类别 | #Q | #Q 备选 | 总 #Q |
|---|---|---|---|
| 作者 | 1514 | 6 | 9084 |
| 首都 | 645 | 6 | 4515 |
| 所属首都 | 363 | 3 | 1452 |
| 颜色 | 34 | 5 | 204 |
| 作曲家 | 978 | 5 | 5868 |
| 国家 | 838 | 9 | 8380 |
| 导演 | 1999 | 10 | 21989 |
| 父亲 | 570 | 4 | 2850 |
| 流派 | 1619 | 6 | 11333 |
| 母亲 | 187 | 5 | 1122 |
| 职业 | 532 | 5 | 3192 |
| 出生地 | 584 | 6 | 4088 |
| 制片人 | 1520 | 10 | 16720 |
| 宗教 | 338 | 5 | 2028 |
| 编剧 | 1999 | 10 | 21989 |
| 运动 | 547 | 6 | 3829 |
| 总计 | 14267 | – | 118643 |

Table 1: Dataset summary statistics, for each category label in PopQA. Column ‘#Q’ shows the number of original questions, one per subject, in PopQA; column ‘#Q alternatives’ is the number of template paraphrases for each question in that category, in our PopQA-TP dataset; ‘total #Q’ is the resulting number of questions in PopQA-TP, which is (1 + (#Q alternatives)) × (#Q).

表 1: PopQA各分类标签的数据集统计摘要。列"#Q"显示PopQA中每个主题的原始问题数量;列"#Q alternatives"表示我们PopQA-TP数据集中每个问题的模板改写数量;"total #Q"是PopQA-TP中的问题总数,计算公式为(1+(#Q alternatives))×(#Q)。
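The 'total #Q' formula in the caption can be verified mechanically; a quick sketch using two rows of Table 1 (director and capital):

```python
def total_questions(num_q: int, num_alternatives: int) -> int:
    # Each subject contributes the original question plus all of its
    # templated paraphrases: (1 + (#Q alternatives)) x (#Q).
    return (1 + num_alternatives) * num_q

# Rows from Table 1: director (1999 originals, 10 templates) and
# capital (645 originals, 6 templates).
print(total_questions(1999, 10))  # 21989
print(total_questions(645, 6))    # 4515
```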

Some PopQA question categories contain subjects of the same underlying type, while in others the type may vary. For instance, subjects of occupation questions are all persons, and in capital of they are all states, provinces, or countries, etc. In religion, some are persons (e.g., Rumi or Paul, but also people like Bertrand Russell who were not religious leaders), ethnic or national groups (e.g., Swedes, Arabs), institutions (e.g., Boston College), or miscellaneous topics (e.g., saint, Bourbon Restoration, Assumption of Mary). For some subjects, thus, it would be more grammatical to phrase the religion question as what religion the subject ‘follows’, and for others which religion the subject is ‘associated with’. Note that this awkwardness is inherent in the original PopQA, and so our paraphrase templates are designed to span the possible meanings. Nevertheless, we expect a good model to answer these questions intelligently and not be stumped by slight grammatical awkwardness.

PopQA的部分问题类别包含相同基础类型的主题,而其他类别的主题类型可能各不相同。例如,职业类问题的主题都是人物,首都类问题的主题都是州、省或国家等。在宗教类问题中,有些主题是人物(如Rumi或Paul,也包括像Bertrand Russell这样非宗教领袖的人物)、民族或国家群体(如瑞典人、阿拉伯人)、机构(如Boston College),或是其他杂项主题(如圣人、波旁复辟、圣母升天)。因此,对于某些主题,将宗教问题表述为"遵循"什么宗教会更符合语法,而对于其他主题,则更适合问"关联"什么宗教。需要注意的是,这种语法上的不协调性原本就存在于PopQA中,因此我们的改写模板旨在涵盖所有可能的含义。尽管如此,我们期望一个好的模型能够智能地回答这些问题,而不会因轻微的语法不协调而受阻。

Table 2: Example set of question paraphrases in PopQA-TP for the genre and occupation categories. The first question in each paraphrase grouping is the original question from PopQA.

What genre is Avatar: The Last Airbender?
What type of work is Avatar: The Last Airbender?
Fans of what genre would like Avatar: The Last Airbender?
What genre does Avatar: The Last Airbender belong to?
What genre is "Avatar: The Last Airbender"?
What genre is Avatar: The Last Airbender associated with?
Avatar: The Last Airbender is associated with what genre?

What is Shozaburo Nakamura's occupation?
What is the occupation of Shozaburo Nakamura?
What kind of work does Shozaburo Nakamura do?
What does Shozaburo Nakamura earn a living as?
What job does Shozaburo Nakamura do?
What is Shozaburo Nakamura's job?

表 2: PopQA-TP中针对流派和职业类别的示例问题改写集合。每个改写分组中的第一个问题来自PopQA的原始问题。

《降世神通:最后的气宗》属于什么流派?
《降世神通:最后的气宗》属于什么类型的作品?
喜欢什么流派的粉丝会欣赏《降世神通:最后的气宗》?
《降世神通:最后的气宗》归属于哪个流派?
"降世神通:最后的气宗"属于什么流派?
《降世神通:最后的气宗》与什么流派相关联?
《降世神通:最后的气宗》关联到什么流派?

Shozaburo Nakamura的职业是什么?
Shozaburo Nakamura从事什么职业?
Shozaburo Nakamura做什么类型的工作?
Shozaburo Nakamura以什么谋生?
Shozaburo Nakamura做什么工作?
Shozaburo Nakamura的职位是什么?

Throughout the work, we obtain text vector embeddings using the Sentence Transformer (ST) encoder (Reimers and Gurevych, 2019). The quality of paraphrases of $q$ can thus be assessed by the average cosine similarity between the embeddings of each paraphrase and $q$ . Calculating the average paraphrase quality for each question category, and averaging across categories, we obtain a high value of 0.914; this shows that the templated paraphrases are sufficiently similar to the original questions.

在整个工作中,我们使用Sentence Transformer (ST)编码器 (Reimers and Gurevych, 2019) 获取文本向量嵌入。因此,通过计算每个释义与$q$的嵌入向量间平均余弦相似度,可以评估$q$的释义质量。计算每个问题类别的平均释义质量并跨类别求平均后,我们得到了0.914的高值,这表明模板化释义与原始问题具有足够高的相似性。
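The paraphrase-quality check described above can be sketched as follows; in the paper the embeddings come from a Sentence Transformer encoder, while here a few toy vectors stand in so the sketch stays self-contained:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def paraphrase_quality(q_emb, paraphrase_embs):
    # Mean cosine similarity between the original question's embedding
    # and the embedding of each of its paraphrases.
    return sum(cosine(q_emb, p) for p in paraphrase_embs) / len(paraphrase_embs)

# Toy embeddings (real runs would use Sentence Transformer vectors).
q = [1.0, 0.0, 1.0]
paraphrases = [[1.0, 0.1, 0.9], [0.9, 0.0, 1.0]]
print(round(paraphrase_quality(q, paraphrases), 3))
```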

3 Benchmarking Semantic Consistency

3 语义一致性基准测试

We next use PopQA-TP, our dataset of manually-constructed paraphrase templates, for assessing the semantic consistency of multiple contemporary LLMs. We report both the models’ accuracy (the ratio of correct answers to questions) and their consistency (robustness to question alternations), and further develop a hypothesis about the correlation of semantic consistency and correctness.

我们接下来使用PopQA-TP数据集(人工构建的释义模板库)来评估多个当代大语言模型的语义一致性。我们统计了各模型的准确率(问题正确答案占比)和一致性(对问题变体的鲁棒性),并进一步提出关于语义一致性与正确性相关性的假设。

3.1 Experimental Setup

3.1 实验设置

We experiment with several openly-available encoder-decoder and decoder-only contemporary LLMs, that have been proven effective in multiple generative tasks: Google Research’s Flan-T5-XXL (11B; Chung et al., 2022) and Flan-UL2 (20B; Tay, 2023), BigScience Workshop’s MT0-XXL (13B; Muennighoff et al., 2022), EleutherAI’s GPT-NeoX (20B; Black et al., 2022) and MosaicML, Inc.’s MPT-Instruct2 (7B; MosaicML, 2023).

我们测试了多个公开可用的编码器-解码器和仅解码器架构的当代大语言模型 (LLM),这些模型已在多种生成任务中证明有效:Google Research 的 Flan-T5-XXL (11B; Chung 等人, 2022) 和 Flan-UL2 (20B; Tay, 2023)、BigScience Workshop 的 MT0-XXL (13B; Muennighoff 等人, 2022)、EleutherAI 的 GPT-NeoX (20B; Black 等人, 2022) 以及 MosaicML 公司的 MPT-Instruct2 (7B; MosaicML, 2023)。

Each question in PopQA-TP is queried to each model in greedy decoding mode, i.e., no sampling is allowed. Following previous studies (Raj et al., 2022), for the decoder-only models, the prompt is formatted using the input query template "Question: <question>\nAnswer:", while for the encoder-decoder models, it is submitted as-is. The GPT-NeoX and MPT-Instruct2 models often generated multi-sentence answers; in these cases, only the first sentence was used for evaluation.

PopQA-TP中的每个问题都以贪婪解码模式查询每个模型,即不允许采样。根据先前的研究 (Raj et al., 2022),对于仅解码器模型,提示使用输入查询模板"Question: <question>\nAnswer:"进行格式化,而对于编码器-解码器模型,则直接提交问题。GPT-NeoX和MPT-Instruct2模型经常生成多句答案;在这些情况下,仅使用第一句进行评估。
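The querying setup above can be sketched in a few lines; the template string follows the "Question: ... Answer:" format described in the text, and the first-sentence truncation is approximated with a naive period split (an assumption of this sketch):

```python
def format_prompt(question: str, decoder_only: bool) -> str:
    # Decoder-only models receive the wrapped template; encoder-decoder
    # models receive the question as-is.
    if decoder_only:
        return f"Question: {question}\nAnswer:"
    return question

def first_sentence(answer: str) -> str:
    # GPT-NeoX and MPT-Instruct2 often return multi-sentence answers;
    # only the first sentence is kept for evaluation.
    return answer.split(". ")[0].rstrip(".")

print(format_prompt("What is Stevie Cameron's occupation?", decoder_only=True))
print(first_sentence("She is a journalist. She also wrote several books."))
```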

3.2 Semantic Consistency – Metrics

3.2 语义一致性 – 指标

Semantic consistency of a language model is broadly defined as the model’s ability to produce semantically-equivalent outputs, given semantically-equivalent inputs (Elazar et al., 2021; Jang et al., 2021; Zhou et al., 2022). The precise approach to consistency assessment may, however, vary according to the characteristics of the generated text. Here we distinguish between free-form (possibly long) answers to open questions, and short, often single-word, factoid answers.

语言模型的语义一致性广义上定义为模型在给定语义等价输入时产生语义等价输出的能力 (Elazar et al., 2021; Jang et al., 2021; Zhou et al., 2022)。然而,一致性评估的具体方法可能因生成文本的特性而异。此处我们区分开放式问题的自由形式(可能较长)答案与简短(通常为单词级)的事实型答案。

Semantic Consistency of Free-form Answers In the context of open-domain zero-shot QA, Raj et al. (2022) quantify the equivalence of a model’s answers to semantically-equivalent paraphrases of the same question. The authors show, among others, that semantic equivalence of relatively long (sentence- or short paragraph-length) answers is most reliably quantified by means of measuring lexical entailment between pairs of answers. In particular, they demonstrate higher correlation of this metric to human judgements than, e.g., using pairwise cosine similarity between answers’ dense representations. As a concrete example, consider two answers for rephrases of the question "What are the benefits of eating an apple a day?" (expanded TruthfulQA, Raj et al., 2022):

自由形式答案的语义一致性
在开放域零样本问答的背景下,Raj等人(2022)量化了模型对同一问题的语义等价改述的答案等效性。作者表明,相对较长(句子或短段落长度)答案的语义等价性,最可靠的量化方法是通过测量答案对之间的词汇蕴含关系。特别是,他们证明该指标与人类判断的相关性高于使用答案密集表示之间的成对余弦相似度等方法。举个具体例子,考虑对"每天吃一个苹果有什么好处?"(扩展版TruthfulQA,Raj等人,2022)这个问题的两种改述的答案:

(1) Apples are a delicious and nutritious fruit that offer a range of health benefits when consumed regularly. (2) Apples are a popular and healthy food that provide numerous benefits.

(1) 苹果是一种美味且营养丰富的水果,定期食用可带来多种健康益处。
(2) 苹果是一种广受欢迎的健康食品,具有诸多益处。

While the second answer could be reasonably entailed from the first one (and vice versa), cosine similarity between the two embeddings might not be very indicative of their (rough) equivalence due to the relatively high lexical distinction.

虽然第二个答案可以从第一个合理推断出来(反之亦然),但由于词汇差异较大,两个嵌入向量之间的余弦相似度可能并不能很好地表明它们的(大致)等价性。

Semantic Consistency of Factoid Answers Contrary to questions that call for a (possibly long) free-form answer, PopQA, and its paraphrase-extended version, require short, single- or few-word answers, which constitute a less-natural fit for the task of lexical entailment. Alternatively, cosine similarity of answer embeddings provides a more reliable similarity score for very short utterances. As an example, semantic consistency ratings of two answers to the question "What is [subj]’s occupation?" (PopQA, Mallen et al. 2023), with a SOTA NLI model and with cosine similarity, are reported in Table 3:

事实性答案的语义一致性
与需要(可能较长的)自由形式回答的问题不同,PopQA及其经过释义扩展的版本要求简短、单字或少量词语的答案,这对词汇蕴含任务来说不太自然。相反,对于极短话语,答案嵌入的余弦相似度能提供更可靠的相似性评分。例如,针对问题"What is [subj]'s occupation?" (PopQA, Mallen et al. 2023)的两个答案,表3报告了使用最先进的NLI模型和余弦相似度的语义一致性评分:

| answer 1 | answer 2 | NLI | cosine |
|---|---|---|---|
| actress | actress | 0.927 | 1.00 |
| architect | architect | 0.876 | 1.00 |
| politician | german politician | 0.075 | 0.70 |
| german politician | politician | 0.588 | 0.70 |

Table 3: NLI and cosine similarity scores of two answers to rephrases of the same question. Note the NLI score distinction between the two “politician” examples due to the inherently asymmetric nature of lexical entailment, as well as differences for “actress” and “architect”.

表 3: 同一问题不同表述下两个答案的自然语言推理(NLI)与余弦相似度得分。注意两个"politician"示例因词汇蕴含的固有不对称性导致的NLI分数差异,以及"actress"和"architect"案例的得分区别。

3.3 Experimental Results

3.3 实验结果

We next present the results of LLMs correctness and semantic consistency, using PopQA-TP.

接下来我们使用PopQA-TP展示大语言模型(LLM)的正确性和语义一致性结果。

3.3.1 Correctness

3.3.1 正确性

Following Mallen et al. (2023), we consider a question answered correctly if a substring of the generated text is an exact string match to one of the gold answers (e.g., a generated answer of "film director" matches "director"). Figure 1 presents the mean correctness results for the five models, split by category. Evidently, some categories are systematically easier than others, e.g., color and sport, while others pose a challenge across the board, e.g., author and director. This result can be partly attributed to the more restricted space of plausible answers for the former categories (there is only a limited set of color names), compared to the infinitely large space of person names for the latter. Notably, the two decoder-only models—MPT-Instruct2 (accuracy of 0.224) and GPT-NeoX (accuracy of 0.184)—perform better than their encoder-decoder counterparts, on average, across categories.

遵循 Mallen 等人的研究 (2023),若生成文本的子串与任一标准答案完全匹配 (例如生成答案"film director"匹配"director"),则认为问题回答正确。图1展示了五个模型按类别划分的平均正确率结果。显然,某些类别 (如颜色和运动) 系统性更易回答,而其他类别 (如作家和导演) 则普遍具有挑战性。这一结果部分源于前者的合理答案空间更受限 (颜色名称集合有限),而后者的人名空间则无限大。值得注意的是,两款纯解码器模型——MPT-Instruct2 (准确率0.224) 和 GPT-NeoX (准确率0.184)——在各类别中的平均表现优于编码器-解码器架构模型。
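The exact-match criterion of Mallen et al. (2023) described above reduces to a substring test; a minimal sketch (case-insensitive matching is an assumption of this sketch):

```python
def is_correct(generated: str, gold_answers: list[str]) -> bool:
    # A question counts as answered correctly if some gold answer is an
    # exact substring of the generated text, e.g. the generation
    # "film director" matches the gold answer "director".
    generated = generated.lower()
    return any(gold.lower() in generated for gold in gold_answers)

print(is_correct("film director", ["director"]))   # True
print(is_correct("politician", ["journalist"]))    # False
```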

3.3.2 Semantic Consistency

3.3.2 语义一致性

Internal semantic consistency of a set of (possibly non-unique) texts $\mathcal{T}=\{t_{1},t_{2},\dots\}$ can be calculated by the mean pairwise cosine similarity of their respective embedding vectors $\{e_{1},e_{2},\dots\}$ , which ranges from 0 to 1. Formally:

一组(可能非唯一)文本 $\mathcal{T}=\{t_{1},t_{2},\dots\}$ 的内部语义一致性可通过其对应嵌入向量 $\{e_{1},e_{2},\dots\}$ 的平均两两余弦相似度计算得出,其值范围为0到1。形式化表示为:

$$
\mathrm{int\_sim}(\mathcal{T})=\frac{1}{\binom{|\mathcal{T}|}{2}}\sum_{i=1}^{|\mathcal{T}|-1}\sum_{j=i+1}^{|\mathcal{T}|}\mathrm{cosine}(e_{i},e_{j})
$$


Given $\mathcal{A}$ , the set of generated answers to $q$ and paraphrases $P(q)$ , we define the semantic consistency of $\mathcal{A}$ as $\mathrm{SCons}(q)=\mathrm{int\_sim}(\mathcal{A})\in[0,1]$ .

给定 $\mathcal{A}$,即模型对问题 $q$ 及其改述集合 $P(q)$ 生成的答案集合,我们将 $\mathcal{A}$ 的语义一致性定义为 $\mathrm{SCons}(q)=\mathrm{int\_sim}(\mathcal{A})\in[0,1]$。
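The int_sim computation above can be sketched directly; toy answer embeddings stand in for the Sentence Transformer vectors used in the paper:

```python
import itertools
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def int_sim(embeddings):
    # Mean pairwise cosine similarity over all C(|T|, 2) unordered pairs,
    # matching the int_sim formula above.
    pairs = list(itertools.combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# SCons(q) = int_sim over the answers to q and its paraphrases P(q).
# Two identical answers and one orthogonal outlier:
answer_embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(round(int_sim(answer_embs), 3))  # pair similarities 1.0, 0.0, 0.0 -> 0.333
```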

Figure 2 presents results of mean answer semantic consistency computation, by question category. Consistency values vary in the [0.4, 0.9] range, with some (albeit lower) deviation across categories. Similarly to correctness, the relatively high consistency values in capital, color, country, religion, and sport can be attributed to the more restricted space of plausible answers, compared to other categories. Figure 3 shows a scatter plot of the mean category correctness and consistency for the Flan-T5-XXL as a representative example of the models. Across categories, answer correctness and consistency are positively correlated. Across all models considered, the religion category is an outlier among the categories above with restricted answer space, in that these questions had relatively low correctness but high consistency.

图 2 展示了按问题类别划分的平均答案语义一致性计算结果。一致性数值分布在 [0.4, 0.9] 范围内,不同类别间存在一定(尽管较低)的偏差。与正确性类似,首都、颜色、国家、宗教和体育类别中相对较高的一致性值,可归因于这些类别比其他类别具有更有限的合理答案空间。图 3 以 Flan-T5-XXL 为例展示了各模型平均类别正确性与一致性的散点图。在所有类别中,答案正确性与一致性呈正相关关系。在所有考察模型中,宗教类别在答案空间受限的类别中属于异常值,表现为这些问题具有较低的正确性但较高的一致性。

Contrary to the correctness results, here the encoder-decoder LLMs (MT0-XXL, Flan-UL2 and Flan-T5-XXL) outperform the decoder-only models.

与正确性结果相反,此处编码器-解码器架构的大语言模型 (MT0-XXL、Flan-UL2 和 FlanT5-XXL) 表现优于仅解码器模型。

4 QA Performance Prediction

4 问答性能预测

We next define and address the task of factual question-answering performance prediction. Here we rely on some parallels to the task of query performance prediction (QPP) in IR (search) systems – an established research area (Zhou and Croft, 2007; Carmel and Kurland, 2012; Raiber and Kurland, 2014; Faggioli et al., 2023). QPP is defined as the assessment of the retrieval quality of a search system for a query, without relevance judgments.

接下来我们定义并探讨事实性问答性能预测任务。这一任务与信息检索(搜索)系统中查询性能预测(QPP)存在相似性——后者是一个成熟的研究领域 (Zhou and Croft, 2007; Carmel and Kurland, 2012; Raiber and Kurland, 2014; Faggioli et al., 2023) 。QPP被定义为在没有相关性判断的情况下,评估搜索系统对查询的检索质量。


Figure 1: Mean LLMs’ correctness on questions in the PopQA dataset (Mallen et al., 2023), by category. Blue shades denote encoder-decoder models, green – decoder-only.

图 1: 大语言模型在PopQA数据集(Mallen等人,2023)各问题类别上的平均正确率。蓝色阴影表示编码器-解码器模型,绿色表示仅解码器模型。


Figure 2: Mean LLMs’ consistency on questions in the PopQA dataset (Mallen et al., 2023) and their paraphrases (PopQA-TP, this work), by category. Blue shades denote encoder-decoder models, green – decoder-only.

图 2: 大语言模型在PopQA数据集 (Mallen et al., 2023) 及其改写版本 (PopQA-TP,本工作) 中各类别问题的平均一致性。蓝色阴影表示编码器-解码器模型,绿色表示仅解码器模型。


Figure 3: Scatter plot of mean in-category answer correctness and consistency (as depicted in Figures 1 and 2) for the Flan-T5-XXL model. The evident positive correlation supports the intuition that semantic consistency has a predictive power on an LLM QA accuracy.

图 3: Flan-T5-XXL 模型在各类别中的平均回答正确率与一致性(如图 1 和图 2 所示)的散点图。明显的正相关性支持了语义一致性对大语言模型问答准确率具有预测力的直觉。

Core differences exist between IR and LLM-based systems used for the task of open-domain factual QA; yet, we address a conceptually similar task:

用于开放领域事实性问答任务的IR(信息检索)系统与基于大语言模型的系统存在核心差异;然而,我们处理的是概念上相似的任务:

assessment of a system’s potential answer quality (that is manifested by its correctness) for a question, without relying on ground-truth answers.

评估系统对问题的潜在回答质量(即其正确性),而不依赖真实答案。

Casting the task as a classification scenario, we train a logistic regression model, where several regressors—variables proven to correlate with LLMs correctness—carry over predictive power on the outcome variable: the model’s likelihood to produce a correct answer for a given question.

将任务构建为分类场景,我们训练了一个逻辑回归模型。其中多个回归因子(即已被证实与大语言模型正确性相关的变量)对结果变量具有预测能力:该模型对给定问题生成正确答案的概率。

4.1 Predictor Variables

4.1 预测变量

4.1.1 Question Subject Popularity (SPop)

4.1.1 问题主题流行度 (SPop)

Mallen et al. (2023) hypothesize that factual knowledge that is less frequently discussed on the web may not be well memorized by LLMs. Given a question that can be modeled by the {subject, relationship, object} triple, e.g., “What is the capital of (R) Louisiana (S)?”, the authors approximate its subject’s popularity by the mean number of monthly views of the corresponding Wikipedia page. The answer—“Baton Rouge”—is scored by popularity in a similar way, but we refrain from using this score for our predictive analysis, since it is unknown in a realistic QA setup.

Mallen等人 (2023) 提出假设:在网络上较少被讨论的事实性知识可能未被大语言模型 (LLM) 充分记忆。对于可通过 {主体, 关系, 客体} 三元组建模的问题 (例如 "路易斯安那州 (S) 的首府 (R) 是哪里?"),作者通过对应维基百科页面的月均浏览量来估算该主体的流行度。答案 "巴吞鲁日" 也以类似方式计算流行度得分,但在实际问答场景中该得分未知,因此我们未将其用于预测分析。

Following Mallen et al. (2023), we define our first predictor—question subject popularity (SPop)—as the mean number of monthly views of the subject entity’s Wikipedia page. In the PopQA dataset, the SPop score varies from 2 to over 15M.

根据 Mallen 等人 (2023) 的研究,我们将第一个预测指标——问题主题流行度 (SPop) ——定义为该主题实体维基百科页面的月均浏览量。在 PopQA 数据集中,SPop 分数从 2 到超过 1500 万不等。

4.1.2 Semantic Consistency (SCons)

4.1.2 语义一致性 (SCons)

Semantic consistency—as defined by the SCons metric in Section 3.2—associated with $q$ is measured as $\mathrm{SCons}(q)=\mathrm{int\_sim}(\mathcal{A})$ , where $\mathcal{A}$ consists of greedily-generated answers to $q$ itself and the set of its paraphrases $P(q)$ .

与 $q$ 相关的语义一致性(如第3.2节中SCons指标所定义)通过 $\mathrm{SCons}(q)=\mathrm{int\_sim}(\mathcal{A})$ 衡量,其中 $\mathcal{A}$ 包含对 $q$ 本身及其改述集合 $P(q)$ 贪婪生成的答案。

4.1.3 Answer Certainty (Cert)

4.1.3 回答确定性 (Cert)

Multiple studies investigated the uncertainty of natural language generation in the context of free-form QA. Kuhn et al. (2022) put forward a hypothesis that given some degree of freedom (i.e., sampling, not greedy generation), “. . . very uncertain generations should be less likely to be correct”. Specifically, the authors suggest that a (non-greedilyprobed) model producing multiple distinct answers for the same question is unstable and less robust, potentially affecting the model’s ability to provide a correct answer to the question.

多项研究探讨了自由形式问答背景下自然语言生成的不确定性。Kuhn等人 (2022) 提出假设:给定一定自由度(即采样而非贪婪生成)时,"......极不确定的生成结果正确概率更低"。具体而言,作者指出(非贪婪采样的)模型对同一问题产生多个不同答案时,其表现不稳定且鲁棒性较低,可能影响模型提供正确答案的能力。

Uncertainty of a set of answers $\mathcal{A}$ to a factual question $q$ is manifested by the relative amount of distinct answers out of the entire answer pool $\mathcal{A}$ . Multiple metrics were suggested to measure uncertainty—or its complementary metric, certainty—of a set of answers, including lexical similarity, Rouge-L (Lin and Och, 2004), and predictive entropy (Kuhn et al., 2022). As with semantic consistency (see Section 4.1.2), we found mean pairwise semantic similarity of answers in $\mathcal{A}$ to be the most appropriate metric for certainty of very short factoid answers. Our sampled answers certainty metric is defined as $\mathrm{Cert}(q)=\mathrm{int\_sim}(\mathcal{A})$ , where, following Kuhn et al. (2022), $\mathcal{A}$ is a set of ten answers to $q$ sampled non-greedily, setting the models’ temperature to 0.5. Table 4 presents several results of sampling answers to questions in the PopQA dataset, along with their respective certainty score.

一组答案 $\mathcal{A}$ 对于事实性问题 $q$ 的不确定性,表现为整个答案池 $\mathcal{A}$ 中不同答案的相对数量。已有多种指标被提出用于衡量答案集的不确定性(或其互补指标确定性),包括词汇相似度、Rouge-L (Lin and Och, 2004) 和预测熵 (Kuhn et al., 2022)。与语义一致性(见第4.1.2节)类似,我们发现 $\mathcal{A}$ 中答案的平均两两语义相似度是最适合衡量极简短事实性答案确定性的指标。我们的采样答案确定性指标定义为 $\mathrm{Cert}(q)=\mathrm{int\_sim}(\mathcal{A})$,其中根据Kuhn等人 (2022) 的方法,$\mathcal{A}$ 是通过非贪婪采样获得的十个问题 $q$ 的答案集,模型温度设为0.5。表4展示了PopQA数据集中问题答案采样的若干结果及其对应的确定性分数。
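Cert(q) itself reuses the int_sim computation over embeddings of the sampled answers; the raw intuition of uncertainty as the "relative amount of distinct answers" can be illustrated with a simpler string-level proxy (not the paper's metric, which operates on embeddings):

```python
def distinct_answer_ratio(sampled_answers):
    # Naive uncertainty proxy: the fraction of distinct answer strings
    # in the sampled pool. Higher values mean a less stable model.
    normalized = [a.strip().lower() for a in sampled_answers]
    return len(set(normalized)) / len(normalized)

stable = ["guitarist"] * 5
unstable = ["samurai", "samurai", "film director", "actor", "director"]
print(distinct_answer_ratio(stable))    # 0.2 -> highly repeatable answers
print(distinct_answer_ratio(unstable))  # 0.8 -> unstable answer pool
```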

4.1.4 Question Category (QCat)

4.1.4 问题类别 (QCat)

Figure 1 suggests that question category—the semantic grouping a question belongs to—has a considerable effect on an LLM’s ability to answer a question correctly. While models systematically succeed in answering questions on capital, color, and sport, they struggle in categories like director, producer, and author. Question category (QCat) has been shown to interact with numerical variables (see Section 4.2), suggestive of the potential benefits of including question category as a (nominal) categorical variable in our regression analysis.

图1表明,问题类别(即问题所属的语义分组)对大语言模型正确回答问题的能力有显著影响。虽然模型在回答关于首都、颜色和运动的问题时普遍成功,但在导演、制片人和作者等类别上表现不佳。研究表明问题类别(QCat)会与数值变量产生交互作用(见第4.2节),这提示我们在回归分析中将问题类别作为(名义)分类变量可能带来潜在益处。

4.2 Predictive Model

4.2 预测模型

We build a logistic regression model for predicting if an LLM will answer a question correctly. Specifically, for an original question $q{\in}\mathrm{PopQA}$ , we define a model using the four predictors described in Section 4.1, where the regression outcome is a binary indicator: will $q$ be answered accurately (1), or not (0).3 We denote the regression response variable by correct, and use QCat, SCons, Cert and SPop as regressors. We apply a natural log transformation to SPop, reducing its skewness, and strengthening its relationship with the target variable.

我们构建了一个逻辑回归模型,用于预测大语言模型是否能正确回答问题。具体而言,对于原始问题 $q{\in}\mathrm{PopQA}$,我们使用第4.1节描述的四个预测因子定义模型,其中回归结果是一个二元指标:问题 $q$ 会被准确回答(1)或不被准确回答(0)。我们将回归响应变量记为correct,并使用QCat、SCons、Cert和SPop作为回归因子。我们对SPop进行自然对数转换,以降低其偏度,并增强其与目标变量的关联性。

The regression model assumes a linear relationship between each regressor and the logit of the binary target, holding other regressors constant. We consider the first-order effects of QCat and the numeric variables (SCons, Cert and SPop), as well as the second-order interaction between each numeric variable and the question category QCat, where the intuition is that the precise impact of a numeric predictor varies by category. Figure 5 in Appendix A.1 illustrates the need to account for QCat interactions with the numeric regressors, because the marginal effect (slope of the linear relationship) of each variable on correctness differs by QCat group. Consequently, we define our regression model using the common regression notation as correct $\sim$ QCat*log(SPop) + QCat*SCons + QCat*Cert, where ‘*’ denotes the first- and second-order effects of two variables. QCat is treated as a fixed rather than random categorical effect, since we are interested in the individual effect of each category and do not assume that the relationship types were randomly sampled from the population of available ones. Table 9 in Appendix A.2 quantifies the relative contributions of each regressor to the model’s goodness of fit, and shows that QCat and its interactions are strongly statistically significant.

回归模型假设在保持其他回归变量不变的情况下,每个回归变量与二元目标的logit之间存在线性关系。我们考虑QCat和数值变量(SCons、Cert和SPop)的一阶效应,以及每个数值变量与问题类别QCat之间的二阶交互作用,其直观理解是数值预测因子的具体影响因类别而异。附录A.1中的图5说明了考虑QCat与数值回归变量交互作用的必要性,因为每个变量对正确性的边际效应(线性关系的斜率)因QCat组别而异。因此,我们用常见的回归符号将回归模型定义为 correct ~ QCat*log(SPop) + QCat*SCons + QCat*Cert,其中'*'表示两个变量的一阶和二阶效应。QCat被视为固定而非随机的分类效应,因为我们关注每个类别的个体效应,并且不假设关系类型是从可用关系类型中随机抽样的。附录A.2中的表9量化了每个回归变量对模型拟合优度的相对贡献,并表明QCat及其交互作用具有极强的统计显著性。
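A minimal sketch of this specification using the statsmodels formula interface (the data below is synthetic stand-in data; the actual PopQA features and correctness labels are not reproduced here):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-ins for the PopQA-derived predictors.
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "QCat": rng.choice(["author", "capital", "sport"], size=n),
    "SPop": rng.lognormal(8.0, 2.0, size=n),
    "SCons": rng.uniform(0.0, 1.0, size=n),
    "Cert": rng.uniform(0.0, 1.0, size=n),
})
# Simulated outcome: correctness rises with consistency, certainty, popularity.
logit = (-2.4 + 2.5 * df["SCons"] + 2.0 * df["Cert"]
         + 0.3 * (np.log(df["SPop"]) - 8.0))
df["correct"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# In the formula language, '*' expands to both first-order effects and the
# pairwise interaction, i.e. correct ~ QCat*log(SPop) + QCat*SCons + QCat*Cert.
result = smf.logit("correct ~ QCat*np.log(SPop) + QCat*SCons + QCat*Cert",
                   data=df).fit(disp=0)
print(f"McFadden pseudo-R^2: {result.prsquared:.3f}")
```

With three QCat levels, the fitted model has 12 parameters: the intercept, two QCat dummies, three numeric main effects, and six interaction terms.
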

Table 4: Examples of the Cert score assigned to a set of sampled answers to the same question. Notably, cultural biases in contemporary LLMs are manifested by the "samurai" answers to the question about a Japanese politician.

| question | sampled answers | certainty score |
|---|---|---|
| What is Robby Krieger's occupation? | (guitarist, guitarist, guitarist, guitarist, guitarist) | 1.000 |
| What is Shozaburo Nakamura's occupation? | (samurai, samurai, film director, actor, director) | 0.250 |
| What is the capital of Benin? | (cotonou, bamako, abidjan, bamako, bamako) | 0.521 |

表 4: 对同一问题的一组抽样答案分配的Cert分数示例。值得注意的是,当代大语言模型中的文化偏见通过关于日本政治家的"武士"答案得以显现。

| 问题 | 抽样答案 | 确定性分数 |
|---|---|---|
| Robby Krieger的职业是什么? | (吉他手, 吉他手, 吉他手, 吉他手, 吉他手) | 1.000 |
| Shozaburo Nakamura的职业是什么? | (武士, 武士, 电影导演, 演员, 导演) | 0.250 |
| 贝宁的首都是哪里? | (科托努, 巴马科, 阿比让, 巴马科, 巴马科) | 0.521 |
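The paper's Cert score groups semantically-equivalent sampled answers before scoring agreement. As a rough illustration only, the toy proxy below scores certainty as one minus the normalized entropy of the exact-string answer distribution; it will not reproduce the table's values, since it ignores the semantic equivalence between, e.g., "film director" and "director":

```python
import math
from collections import Counter

def certainty_proxy(answers):
    """Toy certainty: 1 - normalized entropy of the answer histogram.
    Unanimous answers score 1.0; maximally spread answers approach 0.0.
    NOTE: exact-string matching only; a simplification of the paper's Cert."""
    counts = Counter(a.strip().lower() for a in answers)
    if len(counts) == 1:
        return 1.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(len(answers))

print(certainty_proxy(["guitarist"] * 5))  # unanimous -> 1.0
print(round(certainty_proxy(
    ["cotonou", "bamako", "abidjan", "bamako", "bamako"]), 3))
```
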

Logistic regression is implemented using the formula interface of Python's statsmodels (Seabold and Perktold, 2010) module. We report regression results when applied to each of the five LLMs detailed in Section 3.1, further explore the relative contribution of each predictor, and perform an ablation study in the next section.

逻辑回归采用Python语言的statsmodels (Seabold和Perktold, 2010)模块公式接口实现。我们报告了将该方法应用于3.1节详述的五种大语言模型时的回归结果,深入探究各预测变量的相对贡献度,并在下一节进行消融实验。

4.3 Experimental Results

4.3 实验结果

Main Results Table 5 reports the performance of the logistic regression trained on each of the five LLMs. A regression model's goodness of fit is measured using McFadden's pseudo-R²; according to McFadden (1977), values of 0.2 and above indicate very good fit. We also report the regression models' accuracy on the 20% held-out test set, where accuracy should be interpreted in terms of the relative percent increase over the majority-vote baseline, which fixes all predictions to 0, since incorrect answers have the higher prior for PopQA questions across all five LLMs in this study. Notably, the random choice baseline is 0.5.

主要结果
表5报告了在五个大语言模型上分别训练的逻辑回归性能。回归模型的拟合优度采用McFadden伪-$\mathrm{R^{2}}$衡量,根据McFadden (1977)的研究,0.2及以上的值表示拟合效果非常好。我们还报告了回归模型在20%保留测试集上的准确率,由于本研究中所有五个大语言模型在PopQA问题上错误答案的先验概率更高(固定所有预测为0),该准确率应相对于多数投票基线的相对百分比增长来解读。值得注意的是,随机选择基线的准确率为0.5。
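The majority-vote baseline and the relative improvement reported in Table 5 can be computed as sketched below (the labels here are hypothetical, chosen so that the incorrect class dominates, as in PopQA):

```python
import numpy as np

def majority_baseline_and_gain(y_true, y_pred):
    """Accuracy of a constant majority-class predictor, the model's
    accuracy, and the model's relative percent gain over that baseline."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    majority = int(y_true.mean() >= 0.5)
    baseline = float((y_true == majority).mean())
    acc = float((y_pred == y_true).mean())
    return baseline, acc, 100.0 * (acc - baseline) / baseline

# Hypothetical held-out labels with a high prior of incorrect answers (0).
y_true = [0] * 87 + [1] * 13
y_pred = [0] * 90 + [1] * 10
baseline, acc, gain = majority_baseline_and_gain(y_true, y_pred)
print(f"baseline={baseline}, acc={acc}, gain=+{gain:.2f}%")
# -> baseline=0.87, acc=0.97, gain=+11.49%
```
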

We repeat the experiment in a more balanced (and desirable) setting, where the set of question categories is limited to those on which an LLM shows over 10% correctness. Naturally, the lower (though still majority-incorrect) prior is reflected in a lower majority-vote baseline, posing higher prediction difficulty for the regression model. We show (Table 5, right) that the benefits of the suggested approach are amplified in this setting: models obtain high accuracy, improving over the majority-vote baseline by a significant extent, between 13.40% and 26.23%.

我们在一个更平衡(且更理想)的设置中重复了实验,将问题类别限制在大语言模型(LLM)正确率超过10%的范围内。自然,较低(但多数答案仍为错误)的先验反映在较低的多数投票基线上,这给回归模型带来了更高的预测难度。结果显示(表5右),在此设置下所提方法的优势被放大:模型获得了高准确率,较多数投票基线显著提升了13.40%至26.23%。

Ablation Study Next we test the robustness of the regression model by eliminating regressors, one by one, from an example LLM regression model, and inspecting the outcome, as reported in Table 6. Again, we perform this experiment with all question categories, and with the set of categories whose correctness prior exceeds 0.1, for the selected model. High prediction accuracy (0.902 and 0.781) is maintained even when removing SPop and QCat, thereby including only regressors independent of external knowledge (semantic consistency and certainty), predictors that can be computed automatically (including paraphrase generation). Moreover, using only semantic consistency or certainty as a single predictor shows considerable performance gains, in both settings.

消融实验
接下来我们通过逐步剔除回归因子来测试回归模型的鲁棒性。如表 6 所示,我们从一个大语言模型回归示例中逐个移除回归因子并观察结果。该实验覆盖了所有问题类别,并针对选定模型筛选了正确率先验 $>0.1$ 的类别组。即使移除 ${\mathsf{S P o p}}$ 和 QCat 这两个依赖外部知识的回归因子(仅保留可自动计算的语义一致性和确定性指标,包括复述生成),模型仍保持较高的预测准确率(0.902 和 0.781)。值得注意的是,在两种实验设置下,仅使用语义一致性或确定性作为单一预测因子时,模型性能均有显著提升。
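A sketch of the ablation loop on synthetic stand-in data: refit nested logistic models with progressively fewer predictors and compare pseudo-R². The predictor names follow the paper, but the data and coefficients are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data in which all three predictors genuinely carry signal.
rng = np.random.default_rng(7)
n = 3000
df = pd.DataFrame({
    "SCons": rng.uniform(0.0, 1.0, n),
    "Cert": rng.uniform(0.0, 1.0, n),
    "logSPop": rng.normal(8.0, 2.0, n),
})
p = 1 / (1 + np.exp(-(-3.6 + 2.0 * df["SCons"] + 2.0 * df["Cert"]
                      + 0.2 * df["logSPop"])))
df["correct"] = (rng.uniform(size=n) < p).astype(int)

# Nested specifications, dropping predictors one at a time (cf. Table 6).
formulas = {
    "SCons+Cert+logSPop": "correct ~ SCons + Cert + logSPop",
    "SCons+Cert":         "correct ~ SCons + Cert",
    "Cert only":          "correct ~ Cert",
}
for name, formula in formulas.items():
    res = smf.logit(formula, data=df).fit(disp=0)
    print(f"{name:20s} pseudo-R2 = {res.prsquared:.3f}")
```

Since the models are nested, the pseudo-R² can only decrease as predictors are removed; what matters (as in Table 6) is how little it decreases when the hard-to-obtain predictors are dropped.
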

表 6:

In-category Coefficient Analysis The ablation study findings are further supported by the regression summary in Table 7, for two sample question categories with high correctness: capital and sport. Regressor coefficients (β̂), their 95% confidence intervals, and p-values are presented. Positive coefficients reflect the (expected) positive correlation between the predictors and the regression model outcome: higher semantic consistency, higher certainty or higher question-subject popularity are predictive of higher LLM answer accuracy with respect to the question at hand.

类别内系数分析
表7中的回归摘要进一步支持了消融研究的结果,针对两个正确率较高的样本问题类别:首都和体育。表中展示了回归系数(β̂)、其95%置信区间以及p值。正系数反映了预测因子与回归模型结果之间的(预期)正相关性:更高的语义一致性、更高的确定性或问题主题流行度,预示着大语言模型对当前问题的回答准确率更高。

5 Related Work

5 相关工作

Semantic Consistency of LLMs Studies in the domain of model consistency were pioneered with the work by Elazar et al. (2021), who investigated this question in the context of masked language models, where the same factual knowledge (in the form of a single token) was masked from multiple meaning-preserving alternations of the same statement. Fierro and Søgaard (2022) extended the factual consistency study on MLMs to the multilingual setup. Jang et al. (2022) extend the notion of consistency to six behavioral consistency properties, including semantic textual similarity, machine reading comprehension, and topic classification. The authors make use of adapted and newly-created datasets for testing multiple fine-tuned language models on the set of selected tasks. Factual consistency experiments are explicitly excluded from the set of tests. Multiple semantic consistency metrics were evaluated by Raj et al. (2022) on automatically generated paraphrases of (mostly not factoid) open-domain questions in the TruthfulQA dataset (Lin et al., 2022); the authors demonstrate that an NLI-based consistency metric correlates best with human judgements, when evaluating the consistency of sentence-length answers.

大语言模型的语义一致性研究

模型一致性领域的研究始于Elazar等人(2021)的工作,他们在掩码语言模型(masked language models)背景下研究了这个问题:通过从同一陈述的多个语义等价变体中掩码相同的事实知识(以单个token形式呈现)。Fierro和Søgaard(2022)将MLMs的事实一致性研究扩展到了多语言场景。Jang等人(2022)将一致性概念扩展到六个行为一致性属性,包括语义文本相似度、机器阅读理解(machine reading comprehension)和主题分类。作者利用改编和新创建的数据集,在选定任务上测试了多个微调语言模型,但明确排除了事实一致性实验。Raj等人(2022)在TruthfulQA数据集(Lin等人,2022)中(多数非事实型)开放域问题的自动生成释义上评估了多种语义一致性指标,证明在评估句子长度答案的一致性时,基于自然语言推理(NLI)的一致性指标与人类判断相关性最高。

| model | {q ∈ PopQA}: R² | ACC (test set) | mjr. baseline | {q ∈ PopQA, correct(QCat(q)) > 0.1}: R² | ACC (test set) | mjr. baseline |
|---|---|---|---|---|---|---|
| mT0-XXL (ED) | 0.489 | 0.936 (+2.63) | 0.912 | 0.308 | 0.809 (+18.27) | 0.684 |
| Flan-UL2 (ED) | 0.479 | 0.915 (+4.69) | 0.874 | 0.344 | 0.829 (+19.10) | 0.696 |
| Flan-T5-XXL (ED) | 0.491 | 0.928 (+3.80) | 0.894 | 0.311 | 0.794 (+26.23) | 0.629 |
| MPT-Instruct2 (D) | 0.430 | 0.878 (+13.1) | 0.776 | 0.425 | 0.862 (+13.40) | 0.760 |
| GPT-NeoX (D) | 0.418 | 0.883 (+8.21) | 0.816 | 0.310 | 0.791 (+22.82) | 0.644 |
| 模型 | {q ∈ PopQA}: R² | ACC(测试集) | 多数基线 | {q ∈ PopQA, correct(QCat(q)) > 0.1}: R² | ACC(测试集) | 多数基线 |
|---|---|---|---|---|---|---|
| mT0-XXL (ED) | 0.489 | 0.936 (+2.63) | 0.912 | 0.308 | 0.809 (+18.27) | 0.684 |
| Flan-UL2 (ED) | 0.479 | 0.915 (+4.69) | 0.874 | 0.344 | 0.829 (+19.10) | 0.696 |
| Flan-T5-XXL (ED) | 0.491 | 0.928 (+3.80) | 0.894 | 0.311 | 0.794 (+26.23) | 0.629 |
| MPT-Instruct2 (D) | 0.430 | 0.878 (+13.1) | 0.776 | 0.425 | 0.862 (+13.40) | 0.760 |
| GPT-NeoX (D) | 0.418 | 0.883 (+8.21) | 0.816 | 0.310 | 0.791 (+22.82) | 0.644 |

Table 5: QA performance prediction using logistic regression with various models. 'ED' stands for encoder-decoder models, 'D' for decoder-only. McFadden's pseudo R-squared is reported, as well as the models' accuracy on the held-out test set (20%); relative performance improvement, compared to the baseline, is specified with '+' in parentheses. Left: all question categories are considered; right: only categories with correctness exceeding 0.1 are considered.

| included predictors | {q ∈ PopQA}: R² | ACC (test set) | {q ∈ PopQA, correct(QCat(q)) > 0.1}: R² | ACC (test set) |
|---|---|---|---|---|
| SPop, QCat, SCons, Cert (the full model) | 0.479 | 0.915 (+4.69) | 0.344 | 0.829 (+19.10) |
| QCat, SCons, Cert | 0.442 | 0.911 (+4.23) | 0.298 | 0.803 (+15.37) |
| SPop, SCons, Cert | 0.362 | 0.904 (+3.43) | 0.207 | 0.782 (+12.35) |
| SCons, Cert | 0.352 | 0.902 (+3.20) | 0.196 | 0.781 (+12.21) |
| Cert | 0.328 | 0.892 (+2.06) | 0.156 | 0.769 (+10.48) |
| SCons | 0.264 | 0.886 (+1.37) | 0.164 | 0.753 (+08.18) |

表 5: 使用逻辑回归与不同模型进行问答性能预测。'ED'代表编码器-解码器模型,'D'代表仅解码器模型。报告了McFadden伪R平方以及模型在保留测试集(20%)上的准确率;与基线相比的相对性能改进以'+'在括号中注明。左:考虑所有问题类别,右:仅考虑正确率超过0.1的类别。

| 包含的预测变量 | {q ∈ PopQA}: R² | ACC(测试集) | {q ∈ PopQA, correct(QCat(q)) > 0.1}: R² | ACC(测试集) |
|---|---|---|---|---|
| SPop, QCat, SCons, Cert (完整模型) | 0.479 | 0.915 (+4.69) | 0.344 | 0.829 (+19.10) |
| QCat, SCons, Cert | 0.442 | 0.911 (+4.23) | 0.298 | 0.803 (+15.37) |
| SPop, SCons, Cert | 0.362 | 0.904 (+3.43) | 0.207 | 0.782 (+12.35) |
| SCons, Cert | 0.352 | 0.902 (+3.20) | 0.196 | 0.781 (+12.21) |
| Cert | 0.328 | 0.892 (+2.06) | 0.156 | 0.769 (+10.48) |
| SCons | 0.264 | 0.886 (+1.37) | 0.164 | 0.753 (+08.18) |

Table 6: Ablation analysis with one of the best performing models (Flan-UL2), testing various predictor combinations. The majority-vote baseline of Flan-UL2 is 0.874 for the full set of questions, and 0.696 for questions in categories with baseline correctness > 0.1. High accuracy, in particular much higher than baseline, is maintained when omitting QCat; omitting both (not easily obtainable) QCat and SPop results in a still powerful regression model, improving over the baseline by 3.20 and 12.21 percent, for the full and selective question set, respectively.

Table 7: Logistic regression summary of the Flan-UL2 model for two of its best-performing categories: capital and sport. Variables are standardized (to have a mean of 0.0 and STD of 1.0) for comparative analysis of coefficients. Appendix A.2 (Tables 10–14) reports full regression models' results, including variable interactions, for all LLMs in this study.

| category | predictor | β̂ | [0.025 | 0.975] | p-value |
|---|---|---|---|---|---|
| capital | intercept | 0.83 | 0.67 | 1.04 | 0.114 |
| capital | log(SPop) | 1.64 | 1.28 | 2.10 | 0.000 |
| capital | SCons | 2.68 | 1.98 | 3.63 | 0.000 |
| capital | Cert | 2.29 | 1.68 | 3.12 | 0.000 |
| sport | intercept | 1.35 | 1.12 | 1.63 | 0.001 |
| sport | log(SPop) | 0.94 | 0.77 | 1.15 | 0.584 |
| sport | SCons | 1.82 | 1.40 | 2.37 | 0.000 |
| sport | Cert | 1.47 | 1.15 | 1.88 | 0.002 |

表 6: 使用性能最佳模型之一(Flan-UL2)进行的消融分析,测试不同预测变量组合。Flan-UL2多数投票基线在全问题集的准确率为0.874,在基线正确率>0.1的类别中为0.696。特别是当省略QCat时仍保持高准确率(显著高于基线);同时省略(不易获取的)QCat和SPop后,回归模型仍表现强劲,全问题集和筛选问题集分别比基线提升3.20%和12.21%。

表 7: Flan-UL2模型在两个最佳表现类别(首都和体育)上的逻辑回归摘要。变量已标准化(均值0.0,标准差1.0)以进行系数比较分析。附录A.2(表10-14)报告了本研究中所有大语言模型的完整回归模型结果(包括变量交互项)。

| 类别 | 预测变量 | β̂ | [0.025 | 0.975] | p值 |
|---|---|---|---|---|---|
| capital | intercept | 0.83 | 0.67 | 1.04 | 0.114 |
| capital | log(SPop) | 1.64 | 1.28 | 2.10 | 0.000 |
| capital | SCons | 2.68 | 1.98 | 3.63 | 0.000 |
| capital | Cert | 2.29 | 1.68 | 3.12 | 0.000 |
| sport | intercept | 1.35 | 1.12 | 1.63 | 0.001 |
| sport | log(SPop) | 0.94 | 0.77 | 1.15 | 0.584 |
| sport | SCons | 1.82 | 1.40 | 2.37 | 0.000 |
| sport | Cert | 1.47 | 1.15 | 1.88 | 0.002 |
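Standardizing the numeric regressors (as done for Table 7), so that each coefficient reflects the effect of a one-standard-deviation change, can be sketched as:

```python
import pandas as pd

def standardize(df, cols):
    """Z-score the given columns: mean 0.0, population STD 1.0."""
    out = df.copy()
    for c in cols:
        out[c] = (df[c] - df[c].mean()) / df[c].std(ddof=0)
    return out

# Toy values for two of the numeric predictors.
df = pd.DataFrame({"SCons": [0.2, 0.5, 0.8], "Cert": [1.0, 0.25, 0.25]})
z = standardize(df, ["SCons", "Cert"])
print(z.round(3))
```
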

A wider notion of prompt consistency was studied by Zhou et al. (2022) for multiple tasks: NLI, co-reference resolution, word sense disambiguation and sentence completion. The authors design a pairwise distillation loss that encourages consistency between semantically-equivalent pairs of prompts, and demonstrate an increase of over 10% in models' performance. Finally, Newman et al. (2021) introduce P-Adapters for increasing the robustness of MLMs (specifically, BERT (Devlin et al., 2018)) to prompt alternations. No prior work, to the best of our knowledge, has explicitly addressed the task of LLM factual semantic consistency with a high-quality benchmark factual QA dataset.

Zhou等人(2022)研究了更广泛的提示一致性概念,涵盖多项任务:自然语言推理(NLI)、共指消解、词义消歧和句子补全。作者设计了成对蒸馏损失函数,鼓励语义等价提示对之间保持一致性,实验表明模型性能提升超过$10%$。Newman等人(2021)则提出P-Adapters方法,用于增强掩码语言模型(MLM)(特别是BERT(Devlin等人,2018))对提示变化的鲁棒性。据我们所知,此前尚未有研究基于高质量的事实问答基准数据集,专门针对大语言模型的事实语义一致性任务展开探讨。

QA Performance Prediction Inspired by the established and well-studied task of query performance prediction (QPP) in the domain of information retrieval (i.e., search engines), we develop a framework for predicting the correctness of a generative (not retrieval-based) LLM's response to a factual question – question answering performance prediction. Given a question, the ultimate goal is to score the likelihood of the model to answer the question correctly, without any reference answers. The open-domain nature of questions poses a special challenge for the task, in the complete absence of information facilitating reference-less evaluation, such as a document for the task of summarization, or a paragraph for context-based extractive QA.

问答性能预测

受信息检索领域(即搜索引擎)中已确立且深入研究过的查询性能预测(QPP)任务启发,我们开发了一个预测生成式(非检索式)大语言模型对事实性问题回答正确率的框架——问答性能预测。给定一个问题,其最终目标是在没有任何参考答案的情况下,对模型正确回答该问题的可能性进行评分。问题的开放域特性为该任务带来了特殊挑战,因为完全缺乏促进无参考评估的信息,例如摘要任务中的文档,或基于上下文的抽取式问答中的段落。

Despite its evident importance, prior work on QA performance prediction is relatively scarce. Kuhn et al. (2022) have shown that semantic certainty—the consistency of a model's answers to a question, where sampling is allowed—is indicative of the model's ability to answer the question correctly. Specifically, they report that "... [when sampling is allowed] Incorrectly answered questions have more semantically distinct answers than correct ones." Introducing the PopQA dataset of factual questions, Mallen et al. (2023) suggest that factual knowledge memorization depends on the popularity of the entity the question's subject refers to: the frequency of information about the question subject on the web.

尽管其重要性显而易见,但此前关于问答性能预测的研究相对较少。Kuhn等人(2022)研究表明,语义确定性(即允许采样时模型对同一问题回答的一致性)能反映模型正确回答问题的能力。他们特别指出:"...[当允许采样时]错误回答的问题比正确回答的问题会产生更多语义差异的答案"。Mallen等人(2023)在提出事实性问题数据集PopQA时指出,事实性知识的记忆效果取决于问题所指实体的网络流行度:即问题主体相关信息在互联网上的出现频率。

6 Conclusions

6 结论

We explore the robustness of LLMs to paraphrases in the context of open-domain zero-shot QA. Introducing a large and carefully-curated extension of the PopQA dataset (PopQA-TP), with high-quality paraphrases, we first benchmark the semantic consistency of diverse LLMs; next, we develop a framework for QA performance prediction, incorporating semantic consistency, as well as additional aspects, shown to correlate with a model's QA accuracy. Collectively, our work shows that a model's ability to answer a question accurately can be reliably predicted, in a reference-less setting. Our future work includes the exploration of how the semantic consistency metric used in this work can be adapted to additional generative tasks with long(er) answers, e.g., summarization and dialogue.

我们探究了大语言模型(LLM)在开放域零样本问答任务中对改述的鲁棒性。通过引入一个经过精心筛选且规模庞大的PopQA数据集扩展版本(PopQA-TP),其中包含高质量改述样本,我们首先评估了多种大语言模型的语义一致性;随后,我们开发了一个结合语义一致性及其他与模型问答准确率相关因素的问答性能预测框架。整体而言,我们的研究表明,在无参考设置下,模型准确回答问题的能力可以被可靠预测。未来工作包括探索如何将本研究所用的语义一致性指标适配到需要生成长(篇幅)答案的其他生成式任务,例如摘要生成和对话系统。

7 Limitations

7 局限性

Our study has several limitations. First, the semantic consistency measurement has been studied in the relatively narrow context of the factual QA task; it would be useful to explore how this metric applies, and should possibly be adapted, to additional generative tasks, such as summarization, translation, or QA with free-form long(er) answers. Second, the presented QA performance prediction framework exhibits best results with the full set of predictors, exploiting "external knowledge"—subject popularity and question category; those are not always available. That said, we show significant prediction benefits even when using only the easily-obtainable predictors, SCons and Cert.

我们的研究存在若干局限性:首先,语义一致性测量仅在事实性问答任务这一相对狭窄的背景下进行研究;探索该指标如何适用于其他生成式任务(如摘要、翻译或自由形式长答案问答)并可能进行调整将很有价值。其次,所提出的问答性能预测框架在使用完整预测因子集(利用"外部知识"——主题流行度和问题类别)时表现最佳,但这些信息并非总能获取。尽管如此,我们证明即使仅使用易于获取的预测因子SCons和Cert,也能显著提升预测效果。

References

参考文献

Elron Bandel, Ranit Aharonov, Michal Shmueli-Scheuer, Ilya Shnayderman, Noam Slonim, and Liat Ein-Dor. 2022. Quality controlled paraphrase generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 596–609, Dublin, Ireland. Association for Computational Linguistics.

Elron Bandel、Ranit Aharonov、Michal ShmueliScheuer、Ilya Shnayderman、Noam Slonim 和 Liat Ein-Dor。2022. 质量可控的复述生成。载于《第60届计算语言学协会年会论文集(第一卷:长论文)》,第596-609页,爱尔兰都柏林。计算语言学协会。

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An open-source autoregressive language model.

Sid Black、Stella Biderman、Eric Hallahan、Quentin Anthony、Leo Gao、Laurence Golding、Horace He、Connor Leahy、Kyle McDonell、Jason Phang、Michael Pieler、USVSN Sai Prashanth、Shivanshu Purohit、Laria Reynolds、Jonathan Tow、Ben Wang 和 Samuel Weinbach。2022。GPT-NeoX-20B:一个开源的自回归大语言模型。

David Carmel and Oren Kurland. 2012. Query performance prediction for IR. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 1196–1197.

David Carmel 和 Oren Kurland. 2012. 信息检索中的查询性能预测. 在第35届国际ACM SIGIR信息检索研究与发展会议论文集, 第1196–1197页.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Hyung Won Chung、Le Hou、Shayne Longpre、Barret Zoph、Yi Tay、William Fedus、Eric Li、Xuezhi Wang、Mostafa Dehghani、Siddhartha Brahma 等. 2022. 规模化指令微调语言模型. arXiv预印本 arXiv:2210.11416.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jacob Devlin、Ming-Wei Chang、Kenton Lee 和 Kristina Toutanova。2018。BERT:面向语言理解的深度双向Transformer预训练。arXiv预印本 arXiv:1810.04805。

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9:1012–1031.

Yanai Elazar、Nora Kassner、Shauli Ravfogel、Abhilasha Ravichander、Eduard Hovy、Hinrich Schütze 和 Yoav Goldberg。2021。预训练语言模型一致性的测量与改进。计算语言学协会汇刊,9:1012–1031。

Guglielmo Faggioli, Thibault Formal, Stefano Marchesin, Stéphane Clinchant, Nicola Ferro, and Benjamin Piwowarski. 2023. Query performance prediction for neural IR: Are we there yet? In European Conference on Information Retrieval, pages 232–248. Springer.

Guglielmo Faggioli、Thibault Formal、Stefano Marchesin、Stéphane Clinchant、Nicola Ferro和Benjamin Piwowarski。2023。神经信息检索中的查询性能预测:我们做到了吗?见《欧洲信息检索会议》,第232–248页。Springer。

Constanza Fierro and Anders Søgaard. 2022. Factual consistency of multilingual pretrained language models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3046–3052.

Constanza Fierro 和 Anders Søgaard. 2022. 多语言预训练语言模型的事实一致性. 载于《计算语言学协会发现集: ACL 2022》, 第3046–3052页.

Myeongjun Jang, Deuk Sin Kwon, and Thomas Lukasiewicz. 2021. Accurate, yet inconsistent? Consistency analysis on language understanding models. arXiv preprint arXiv:2108.06665.

Myeongjun Jang、Deuk Sin Kwon 和 Thomas Lukasiewicz。2021. 准确但矛盾?语言理解模型的一致性分析。arXiv预印本 arXiv:2108.06665。

Myeongjun Jang, Deuk Sin Kwon, and Thomas Lukasiewicz. 2022. BECEL: Benchmark for Consistency Evaluation of Language Models. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3680–3696.

Myeongjun Jang、Deuk Sin Kwon和Thomas Lukasiewicz。2022。BECEL:语言模型一致性评估基准。载于《第29届国际计算语言学会议论文集》,第3680–3696页。

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2022. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations.

Lorenz Kuhn、Yarin Gal和Sebastian Farquhar。2022。语义不确定性:自然语言生成中不确定性估计的语言不变性。载于《第十一届国际学习表征会议》。

Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 21-26 July, 2004, Barcelona, Spain, pages 605–612. ACL.

Chin-Yew Lin和Franz Josef Och。2004。基于最长公共子序列和跳跃二元组统计的机器翻译质量自动评估。载于《第42届计算语言学协会年会论文集》,2004年7月21-26日,西班牙巴塞罗那,第605-612页。ACL。

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.

Stephanie Lin、Jacob Hilton和Owain Evans。2022。TruthfulQA:衡量模型如何模仿人类谬误。载于《第60届计算语言学协会年会论文集(第一卷:长论文)》,第3214–3252页。

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822.

Alex Mallen、Akari Asai、Victor Zhong、Rajarshi Das、Daniel Khashabi 和 Hannaneh Hajishirzi。2023. 语言模型何时不可信:参数化与非参数化记忆的有效性研究。载于《第61届计算语言学协会年会论文集(第一卷:长论文)》,第9802–9822页。

Daniel McFadden. 1977. Quantitative methods for analyzing travel behaviour of individuals: Some recent developments. Cowles Foundation Discussion Papers 474, Cowles Foundation for Research in Economics, Yale University.

Daniel McFadden. 1977. 个体出行行为分析的定量方法:若干近期进展. Cowles Foundation Discussion Papers 474, Cowles Foundation for Research in Economics, Yale University.

NLP Team MosaicML. 2023. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs. Accessed: 2023-08-05.

NLP Team MosaicML. 2023. 推出MPT-7B:开源商用大语言模型 (LLM) 的新标准. 访问日期: 2023-08-05.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.

Niklas Muennighoff、Thomas Wang、Lintang Sutawika、Adam Roberts、Stella Biderman、Teven Le Scao、M Saiful Bari、Sheng Shen、Zheng-Xin Yong、Hailey Schoelkopf等。2022。通过多任务微调实现跨语言泛化。arXiv预印本arXiv:2211.01786。

Benjamin Newman, Prafulla Kumar Choubey, and Nazneen Rajani. 2021. P-adapters: Robustly extracting factual information from language models with diverse prompts. In International Conference on Learning Representations.

Benjamin Newman、Prafulla Kumar Choubey 和 Nazneen Rajani。2021。P-adapters:通过多样化提示从语言模型中稳健提取事实信息。发表于 International Conference on Learning Representations。

Adir Rahamim, Guy Uziel, Esther Goldbraich, and Ateret Anaby Tavor. 2023. Text augmentation using dataset reconstruction for low-resource classification. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7389–7402.

Adir Rahamim、Guy Uziel、Esther Goldbraich 和 Ateret Anaby Tavor。2023. 基于数据集重构的文本增强方法在低资源分类任务中的应用。载于《计算语言学协会发现集:ACL 2023》,第7389–7402页。

Fiana Raiber and Oren Kurland. 2014. Queryperformance prediction: setting the expectations straight. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 13–22.

Fiana Raiber与Oren Kurland. 2014. 查询性能预测: 明确预期. 见第37届国际ACM SIGIR信息检索研究与发展会议论文集, 第13–22页.

Harsh Raj, Domenic Rosati, and Subhabrata Majumdar. 2022. Measuring reliability of large language models through semantic consistency. In NeurIPS ML Safety Workshop.

Harsh Raj、Domenic Rosati 和 Subhabrata Majumdar。2022。通过语义一致性衡量大语言模型的可靠性。收录于 NeurIPS ML Safety Workshop。

Nils Reimers and Iryna Gurevych. 2019. SentenceBERT: Sentence Embeddings using Siamese BERTNetworks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.

Nils Reimers 和 Iryna Gurevych. 2019. SentenceBERT: 基于孪生BERT网络的句子嵌入. 见《2019年自然语言处理经验方法会议暨第九届自然语言处理国际联合会议论文集》(EMNLP-IJCNLP), 第3982–3992页.

Skipper Seabold and Josef Perktold. 2010. statsmodels: Econometric and statistical modeling with Python. In 9th Python in Science Conference.

Skipper Seabold 和 Josef Perktold. 2010. statsmodels: 基于 Python语言 的计量经济学与统计建模. 见: 第九届 Python语言 科学会议.

Yi Tay. 2023. A new open source Flan 20B with UL2.

Yi Tay. 2023. 全新开源Flan 20B模型(UL2版本)

Chunting Zhou, Junxian He, Xuezhe Ma, Taylor BergKirkpatrick, and Graham Neubig. 2022. Prompt consistency for zero-shot task generalization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2613–2626.

Chunting Zhou、Junxian He、Xuezhe Ma、Taylor BergKirkpatrick 和 Graham Neubig。2022。零样本任务泛化的提示一致性。载于《计算语言学协会发现:EMNLP 2022》,第2613–2626页。

Yun Zhou and W Bruce Croft. 2007. Query performance prediction in web search environments. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 543–550.

Yun Zhou和W Bruce Croft。2007。网络搜索环境中的查询性能预测。第30届国际ACM SIGIR信息检索研究与发展会议论文集,543–550页。

A Appendices

A 附录

A.1 Logistic Regression Diagnostic Plots

A.1 逻辑回归诊断图

As mentioned in Section 4.2, in logistic regression, a binary response y (in our case, the indicator correct ∈ {0,1}) is modeled as a function of a set of regressors; the regressors consist of certain predictor variables and possible interactions between them. More precisely, the logit transformation of the dependent variable p = Pr(y = 1), the probability of the indicator equaling 1 (denoted pcorrect), is modeled as a linear function of the regressors; thus, the logit should have a linear relationship with each regressor.

如第4.2节所述,在逻辑回归中,二元响应变量y(本研究中为指标 correct ∈ {0,1})被建模为一组回归量的函数;这些回归量包含特定预测变量及其可能的交互项。具体而言,通过logit变换将因变量p = Pr(y = 1)(即指标等于1的概率,记作pcorrect)建模为回归量的线性函数,因此logit应与每个回归量保持线性关系。

Identifying Predictor Interactions Our chosen logistic model is correct ~ QCat*log(SPop) + QCat*SCons + QCat*Cert. The appropriateness of adding a regressor to the logistic model can be visually analyzed by plotting the empirical values of p (here, pcorrect) conditioned on values of that regressor. Here, we illustrate with the interaction of the categorical QCat with each numeric variable x ∈ {log(SPop), SCons, Cert}. The interaction means that the slope of the estimated linear relationship between pcorrect and each variable x can differ conditionally on each level of the categorical QCat. If the interaction is significant, we should see significant slope differences for at least some of the levels of QCat; if there is no interaction, the lines will have similar slope but possibly differing vertical displacement (i.e., vertical intercepts).

识别预测变量交互作用
我们选择的逻辑模型为 correct ~ QCat*log(SPop) + QCat*SCons + QCat*Cert。通过绘制以回归变量取值为条件的p(此处为pcorrect)的经验值,可以直观分析在逻辑模型中添加某一回归量的合理性。这里,我们用分类变量QCat与每个数值变量x ∈ {log(SPop), SCons, Cert}的交互作用进行说明。该交互作用意味着pcorrect与每个变量x之间的估计线性关系斜率,可因分类变量QCat的每个水平而不同。若交互作用显著,我们应看到QCat至少某些水平存在显著的斜率差异;若无交互作用,各线斜率相近但可能存在垂直位移差异(即垂直截距不同)。

Because the continuous-valued pcorrect is not observed (we see only the binary correct), we can approximate it as follows: first, bin the observed range of each variable into, say, 15 equal-width bins; second, restrict to observations with a value of x in a given bin and a given value of QCat, and calculate the average value of correct for these; this approximates the typical value of pcorrect (assuming there are enough observations in the subset) for x in that bin interval. In Figure 5, we plot this estimated value of pcorrect versus the bin midpoint, considering only bins of x falling in the central 95% interval of observed x values for that level of QCat (see Figure 4), to reduce noisy estimates at the edges.

由于无法直接观测到连续变量 pcorrect (我们只能看到二元的 correct 值),我们可以通过以下方法进行近似估算:首先,将每个变量的观测范围划分为15个等宽区间;其次,限定观测数据中变量 $x$ 处于特定区间且 QCat 值固定的情况,计算这些数据中 correct 的平均值。这样就能近似得到该区间内 $x$ 对应的典型 pcorrect 值 (假设子集中有足够的观测数据)。在图5中,我们仅选取 QCat 对应层级下 $x$ 观测值中心 $95%$ 区间内的分箱数据 (参见图4),将估算的 pcorrect 值与区间中点进行对比绘制,以此降低边界区域的噪声干扰。
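The binning procedure can be sketched as follows (synthetic data; in the paper, x would be one of log(SPop), SCons, or Cert restricted to a single QCat level):

```python
import numpy as np

def binned_pcorrect(x, correct, n_bins=15, central=0.95):
    """Approximate pcorrect by averaging the binary indicator within
    equal-width bins of x over the central quantile interval of x."""
    x = np.asarray(x, dtype=float)
    correct = np.asarray(correct, dtype=float)
    lo, hi = np.quantile(x, [(1 - central) / 2, 1 - (1 - central) / 2])
    keep = (x >= lo) & (x <= hi)
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.clip(np.digitize(x[keep], edges) - 1, 0, n_bins - 1)
    mids, phat = [], []
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():  # skip empty bins
            mids.append((edges[b] + edges[b + 1]) / 2)
            phat.append(correct[keep][in_bin].mean())
    return np.array(mids), np.array(phat)

# Synthetic check: true pcorrect grows linearly with x, so the binned
# estimates should trace an increasing, roughly linear curve.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 5000)
correct = (rng.uniform(size=5000) < x).astype(int)
mids, phat = binned_pcorrect(x, correct)
```

Plotting `phat` against `mids`, one curve per QCat level, reproduces the kind of diagnostic shown in Figure 5.
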

Figure 5 shows that the presence of an interaction is reasonable, since for each variable the relationship is roughly linear for each value of QCat, but the slopes often differ; the differing vertical displacements of the lines for each variable x are modeled by the first-order coefficients of QCat.

图5表明交互作用的存在是合理的,因为对于每个变量,QCat各取值水平下的关系大致呈线性,但斜率往往不同;各变量x线条的垂直位移差异通过QCat的一阶系数建模。

A.2 Logistic Regression Coefficient Tables

A.2 逻辑回归系数表

Here we present summary tables from the logistic model in Section A.1, fit to the results of each LLM on PopQA-TP, without a train-test split. Table 8 summarizes the overall fit of the chosen logistic model on each LLM. McFadden's pseudo-R² statistic measures overall goodness-of-fit without penalizing the number of regressors; since statistic values over 0.4 indicate excellent fit, the model fits very well for each LLM. The Akaike Information Criterion (AIC) statistic adjusts for the number of regressors, and this model specification achieved the lowest (best) AIC for each LLM over the reduced models, indicating that the interaction effects are correctly included in the predictive logistic model.

这里我们展示了A.1节中逻辑模型在PopQA-TP数据集上对各LLM的拟合结果汇总表(未划分训练测试集)。表8总结了所选逻辑模型对各LLM的整体拟合情况。McFadden统计量衡量整体拟合优度且不惩罚回归变量数量,由于统计值超过0.4表示极佳拟合,可见该模型对所有LLM都拟合得非常好。Akaike信息准则(AIC)统计量会调整回归变量数量,该模型设定相比简化模型为每个LLM都取得了最低(最优)AIC值,表明交互效应被正确地纳入了预测性逻辑模型中。

Table 8: Summary of logistic regression fits by model.

| model | McFadden's R² | AIC |
|---|---|---|
| Flan-T5-XXL | 0.492 | 5029.942 |
| Flan-UL2 | 0.479 | 5746.113 |
| mT0-XXL | 0.490 | 4471.597 |
| MPT-Instruct2 | 0.430 | 8770.356 |
| GPT-NeoX | 0.419 | 8040.402 |

表 8: 各模型逻辑回归拟合结果汇总

| 模型 | McFadden's R² | AIC |
|---|---|---|
| Flan-T5-XXL | 0.492 | 5029.942 |
| Flan-UL2 | 0.479 | 5746.113 |
| mT0-XXL | 0.490 | 4471.597 |
| MPT-Instruct2 | 0.430 | 8770.356 |
| GPT-NeoX | 0.419 | 8040.402 |

Table 9 quantifies how much each regressor in the logistic model contributes to the overall fit of the model. This can be assessed by comparing the magnitudes of the Wald χ² statistics ("stat.") for different regressors in the same LLM, and across different LLMs. The statistical significance of each is indicated by the "symbol" column, which codes the statistic's p-value: *** (<0.001), ** (<0.01), * (<0.05), . (<0.1), or blank (≥0.1). The statistical significance penalizes the regressor's constraint degrees of freedom ('df' column), which equals 1 for numeric variables and #levels−1 for a categorical variable; hence here the numeric interactions with QCat have 15 degrees of freedom, since there are 16 categories.

表9量化了逻辑模型中每个回归变量对模型整体拟合的贡献程度。可通过比较同一大语言模型内不同回归变量及不同大语言模型间的Wald χ²统计量("stat.")大小来评估。各变量的统计显著性由"symbol"列标示,该列对统计量的p值进行编码:*** (<0.001)、** (<0.01)、* (<0.05)、. (<0.1)或空白(≥0.1)。统计显著性会惩罚回归变量的约束自由度('df'列),数值变量自由度为1,分类变量自由度为类别数减1;因此此处与QCat的数值交互项具有15个自由度(因存在16个类别)。

Overall, the question category QCat has the most explanatory power (its statistic is the largest), followed by Cert or log(SPop); SCons contributes relatively little on its own, but more when it is interacted with QCat. Interestingly, the contributions of log(SPop), Cert, and the log(SPop)-QCat interactions are much larger in the decoder-only LLMs (MPT-Instruct2 and GPT-NeoX-20B) compared to the encoder-decoder language models, though the interactions in both cases are already very significant (scoring *** regardless).

总体而言,问题类别QCat的解释力最强(其统计量最大),其次是Cert或log(SPop);SCons单独贡献较小,但与QCat交互时贡献更大。有趣的是,log(SPop)、Cert以及log(SPop)-QCat交互项在仅解码器的大语言模型(MPT-Instruct2和GPT-NeoX-20B)中的贡献远大于编码器-解码器语言模型,尽管两种情况下交互项均已非常显著(均标为***)。

Though Table 9 summarizes each regressor's contribution, it does not tell us about the direction of the effect of each regressor. For that, we refer to Tables 10–14, which show the full set of coefficient estimates. In each table, we have the coefficient estimate (β̂), its 95% confidence interval, the p-value, and the symbol coding of the p-value. The interpretation of a coefficient is the marginal effect on logit(correct) of a 1-unit increase in the corresponding regressor. For the numeric variables, which have been standardized, this corresponds to a 1 standard deviation change (allowing their effects to be compared despite the different original scales); for the factor QCat, this corresponds to the increase in the logit associated with the given category value relative to that of the omitted level, "author". Thus, positive values of the coefficient indicate that the regressor, all others being equal, is associated with a positive increase in correctness. For QCat, for instance, since the choice of omitted level is arbitrary (it is alphabetical), the coefficient sign only has a relative, not absolute interpretation. For instance, if the coefficient on log(SPop) is 4.5, and its interaction with QCat