[论文翻译]生成而非检索:大语言模型是强大的上下文生成器


原文地址:https://arxiv.org/pdf/2209.10063


GENERATE RATHER THAN RETRIEVE: LARGE LANGUAGE MODELS ARE STRONG CONTEXT GENERATORS

生成而非检索:大语言模型是强大的上下文生成器

ABSTRACT

摘要

Knowledge-intensive tasks, such as open-domain question answering (QA), require access to a large amount of world or domain knowledge. A common approach for knowledge-intensive tasks is to employ a retrieve-then-read pipeline that first retrieves a handful of relevant contextual documents from an external corpus such as Wikipedia and then predicts an answer conditioned on the retrieved documents. In this paper, we present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators. We call our method generate-then-read (GENREAD), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer. Furthermore, we propose a novel clustering-based prompting method that selects distinct prompts, in order to generate diverse documents that cover different perspectives, leading to better recall over acceptable answers. We conduct extensive experiments on three different knowledge-intensive tasks, including open-domain QA, fact checking, and dialogue system. Notably, GENREAD achieves 71.6 and 54.4 exact match scores on TriviaQA and WebQ, significantly outperforming the state-of-the-art retrieve-then-read pipeline DPR-FiD by $+4.0$ and $+3.9$, without retrieving any documents from any external knowledge source. Lastly, we demonstrate the model performance can be further improved by combining retrieval and generation. Our code and generated documents can be found at https://github.com/wyu97/GenRead.

知识密集型任务(如开放域问答(QA))需要获取大量世界或领域知识。针对此类任务的常见方法是采用"检索-阅读"流程:先从维基百科等外部语料库检索少量相关上下文文档,再基于检索到的文档预测答案。本文提出一种解决知识密集型任务的新视角——用大语言模型生成器替代文档检索器。我们将该方法称为"生成-阅读"(GENREAD):先提示大语言模型根据给定问题生成上下文文档,再通过阅读生成文档得出最终答案。此外,我们提出一种基于聚类的新型提示方法,通过筛选不同提示来生成涵盖多元视角的文档,从而提升对可接受答案的召回率。我们在开放域QA、事实核查和对话系统这三个知识密集型任务上进行了广泛实验。值得注意的是,GENREAD在TriviaQA和WebQ上分别取得71.6和54.4的精确匹配分数,在完全不依赖外部知识源文档检索的情况下,显著超越当前最先进的"检索-阅读"流程DPR-FiD(分别提升+4.0和+3.9)。最后,我们证明结合检索与生成能进一步提升模型性能。代码及生成文档详见https://github.com/wyu97/GenRead

1 INTRODUCTION

1 引言

Knowledge-intensive tasks, such as open-domain question answering (QA) and fact checking, require access to a large amount of world or domain knowledge (Petroni et al., 2021). These tasks are challenging even for humans without access to an external knowledge source such as Wikipedia. A common thread of existing methods for knowledge-intensive tasks is to employ a retrieve-then-read pipeline that first retrieves a handful of relevant contextual documents from Wikipedia and then conditions the prediction of the answer on these documents along with the question (Karpukhin et al., 2020; Lewis et al., 2020; Izacard & Grave, 2021). Nevertheless, these methods mainly suffer from three drawbacks. First, candidate documents for retrieval are chunked (e.g., 100 words) and fixed, so the retrieved documents might contain noisy information that is irrelevant to the question. Second, the representations of questions and documents are typically obtained independently in modern two-tower dense retrieval models (Karpukhin et al., 2020), leading to only shallow interactions captured between them (Khattab et al., 2021). Third, document retrieval over a large corpus requires the retriever model to first encode all candidate documents and store representations for each document. These two operations limit the parameters of dense retrievers and the size of embedding vectors, and thus cannot enjoy the world knowledge or deduction capabilities of large language models (Levine et al., 2022).

知识密集型任务,例如开放域问答(open-domain QA)和事实核查,需要获取大量世界或领域知识(Petroni et al., 2021)。即便对人类而言,若无法访问维基百科等外部知识源,这些任务也颇具挑战性。现有方法处理知识密集型任务的通用流程采用"检索-阅读"管道(retrieve-then-read pipeline):先从维基百科检索少量相关上下文文档,再结合问题和这些文档预测答案(Karpukhin et al., 2020; Lewis et al., 2020; Izacard & Grave, 2021)。然而这些方法存在三个主要缺陷:首先,检索候选文档被分块(如100词)且固定,检索结果可能包含与问题无关的噪声信息;其次,现代双塔稠密检索模型(two-tower dense retrieval models)中问题和文档的表征通常独立获取(Karpukhin et al., 2020),导致二者仅能捕捉浅层交互(Khattab et al., 2021);第三,大规模文档检索要求检索模型先编码所有候选文档并存储各文档表征,这两个操作限制了稠密检索器的参数量与嵌入向量维度,因而无法利用大语言模型的世界知识或推理能力(Levine et al., 2022)。

In this paper, we propose to leverage large language models, such as InstructGPT (Ouyang et al., 2022), to directly generate contextual documents for a given question, instead of retrieving relevant documents from an external corpus, such as Wikipedia. Our approach has two main advantages. First, we show that generated contextual documents contain the correct answer more often than the top retrieved documents. We believe this is because large language models generate contextual documents by performing deep token-level cross-attention between all the question and document contents, resulting in generated documents that are more specific to the question than retrieved documents. Second, we show that our approach significantly outperforms directly generating answers from large language models despite not incorporating any new external information. This is mainly because the task of generating document-level contexts is close to the objective of causal language modeling pre-training, so the world knowledge stored in the model parameters can be better utilized.

在本文中,我们提出利用大语言模型(如InstructGPT (Ouyang et al., 2022))直接为给定问题生成上下文文档,而非从外部语料库(如维基百科)检索相关文档。我们的方法具有两大优势:首先,实验表明生成的上下文文档比检索到的Top文档更频繁包含正确答案。我们认为这是因为大语言模型通过对问题和文档内容进行深度token级交叉注意力计算来生成文档,使得生成结果比检索文档更具问题针对性。其次,尽管未引入任何新外部信息,该方法显著优于直接从大语言模型生成答案的方案。这主要因为生成文档级上下文的任务与因果语言建模预训练目标高度契合,能更有效调用模型参数中存储的世界知识。

We show, on multiple datasets, that generated documents are more likely to contain correct answers than the top retrieved documents. Notably, in dense retrieval methods, as more documents are retrieved, the recall of documents containing the correct answer increases (Karpukhin et al., 2020). However, the recall performance does not scale as well with generated documents because even with sampling methods, generated documents tend to contain duplicate information. In order to improve the recall performance of generated documents, we propose a novel clustering-based prompt method. We synthesize a prompt with in-context demonstrations of question-document pairs sampled from diverse clusters. These prompts result in generated documents that cover different perspectives of the question and improve the scaling of performance as more documents are generated per question.

我们在多个数据集上证明,生成文档比检索到的顶级文档更可能包含正确答案。值得注意的是,在密集检索方法中,随着检索文档数量的增加,包含正确答案的文档召回率会提升 (Karpukhin et al., 2020)。但生成文档的召回性能扩展性较差,因为即使采用采样方法,生成文档也容易包含重复信息。为提升生成文档的召回性能,我们提出了一种基于聚类的新型提示方法:通过从不同聚类中采样问题-文档对作为上下文示例来合成提示。这种提示能使生成文档覆盖问题的不同视角,从而在为每个问题生成更多文档时实现更好的性能扩展。

In contrast to the retrieve-then-read pipeline, our method is essentially a generate-then-read pipeline. Specifically, it first prompts a large language model to generate contextual documents based on a given question, and then reads the generated document to produce the final answer. The reader can still be a large model (e.g., InstructGPT (Ouyang et al., 2022)) used under a zero-shot setting, or a small one (e.g., FiD (Izacard & Grave, 2021)) fine-tuned with generated documents on the training split of the target dataset. We evaluate our proposed method on three different knowledge-intensive tasks and demonstrate its effectiveness on both zero-shot and supervised settings.

与检索-阅读流程不同,我们的方法本质上是生成-阅读流程。具体而言,该方法首先提示大语言模型基于给定问题生成上下文文档,然后通过阅读生成的文档来产生最终答案。阅读器仍可采用零样本设置下的大模型(例如 InstructGPT (Ouyang et al., 2022)),或使用目标数据集训练集对生成文档进行微调的小型模型(例如 FiD (Izacard & Grave, 2021))。我们在三个不同知识密集型任务上评估了所提方法,并验证了其在零样本和监督设置下的有效性。

Overall, our main contributions can be summarized as follows:

总体而言,我们的主要贡献可概括如下:

  1. We propose a novel generate-then-read pipeline for solving knowledge-intensive tasks, i.e., replacing the process of retrieving documents from Wikipedia or searching for related documents on Google by prompting a large language model to generate relevant contextual documents.
  2. We propose a novel clustering-based prompting approach to generate multiple diverse contextual documents that increase the likelihood of covering the correct answer. We demonstrate this approach can significantly improve performance on end QA and other downstream tasks.
  3. We conduct extensive experiments with three knowledge-intensive NLP tasks under both zero-shot and supervised settings. Notably, our method can match or even outperform retrieve-then-read pipeline methods, without retrieving any documents from any external knowledge source.
  1. 我们提出了一种新颖的生成后读取流程 (generate-then-read pipeline) 来解决知识密集型任务,即通过提示大语言模型生成相关上下文文档,取代从维基百科检索文档或在谷歌搜索相关文档的过程。
  2. 我们提出了一种基于聚类的新型提示方法,用于生成多个多样化的上下文文档,从而提高覆盖正确答案的可能性。实验证明该方法能显著提升端到端问答及其他下游任务的性能。
  3. 我们在零样本和监督设置下,针对三项知识密集型 NLP 任务进行了大量实验。值得注意的是,我们的方法无需从任何外部知识源检索文档,即可达到甚至超越检索后读取流程 (retrieve-then-read pipeline) 方法的性能。

2 RELATED WORK

2 相关工作

2.1 KNOWLEDGE-INTENSIVE NLP VIA RETRIEVE-THEN-READ PIPELINE.

2.1 基于检索-阅读流程的知识密集型自然语言处理

Mainstream methods for solving knowledge-intensive NLP tasks employ a retrieve-then-read model pipeline. Given a question, this model first leverages a retriever over a large evidence corpus (e.g., Wikipedia) to fetch a set of relevant documents that may contain the answer. A reader is then used to peruse the retrieved documents and predict an answer. Recent follow-up work has mainly focused on improving the retriever (Karpukhin et al., 2020; Qu et al., 2021; Sachan et al., 2022) or the reader (Izacard & Grave, 2021; Cheng et al., 2021; Yu et al., 2022), or training the system end-to-end (Lewis et al., 2020; Singh et al., 2021). Early retrieval methods mainly employed sparse retrievers, such as BM25 (Chen et al., 2017). Recently, ORQA (Lee et al., 2019) and DPR (Karpukhin et al., 2020) have revolutionized the field by utilizing dense contextualized vectors for document indexing, leading to superior performance over traditional approaches. We propose an alternative approach which forgoes retrieval, instead extracting the knowledge from the model parameters of a large language model. We show that our approach can be combined with dense retrievers to outperform both methods independently. Our method can also be combined with any reader mechanism, allowing generated context documents to be plugged into any current knowledge-intensive NLP pipelines.

解决知识密集型NLP任务的主流方法采用检索-阅读的模型流程。给定一个问题,该模型首先利用检索器在大型证据语料库(如维基百科)中获取一组可能包含答案的相关文档,随后通过阅读器分析检索到的文档并预测答案。近期研究主要聚焦于改进检索器(Karpukhin等人,2020;Qu等人,2021;Sachan等人,2022)或阅读器(Izacard & Grave,2021;Cheng等人,2021;Yu等人,2022),或训练端到端系统(Lewis等人,2020;Singh等人,2021)。早期检索方法主要采用稀疏检索器,如BM25(Chen等人,2017)。近年来,ORQA(Lee等人,2019)和DPR(Karpukhin等人,2020)通过使用稠密上下文向量进行文档索引,实现了对传统方法的性能超越。我们提出了一种替代方案,放弃检索步骤,直接从大语言模型的参数中提取知识。研究表明,该方法与稠密检索器结合使用时,性能优于单独使用任一方法。本方案还可与任何阅读机制结合,使生成的上下文文档能无缝接入现有知识密集型NLP流程。

2.2 GENERATOR AS RETRIEVER FOR OBTAINING CONTEXTUAL DOCUMENTS.

2.2 作为检索器的生成器用于获取上下文文档

Recent works have investigated using auto-regressive language models to generate identifier strings for documents, as an intermediate target for retrievals, such as entity names (De Cao et al., 2020) or distinctive n-grams that can be mapped to full passages (Bevilacqua et al., 2022). However, one needs to create the identifiers, hence the structure was not thoroughly evaluated on a large-scale benchmark (Bevilacqua et al., 2022). Other works have demonstrated that the knowledge stored in the parameters of pre-trained language models could be “retrieved” to some extent by directly generating text (Petroni et al., 2019; Roberts et al., 2020). However, the previous work only used generation for query expansion (Mao et al., 2021), which did not exploit the potential of directly generating contextual documents for open-domain questions. Different from the above approaches that aimed to train a generator model to produce contextual document identifiers (which is still using the original Wikipedia text) or provide data augmentation to retrievers, our work directly generates contextual documents for given questions.

近期研究探索了使用自回归语言模型为文档生成标识符字符串,作为检索的中间目标,例如实体名称 (De Cao et al., 2020) 或可映射到完整段落的独特n元语法 (Bevilacqua et al., 2022)。然而,由于需要人工创建标识符,该结构尚未在大规模基准测试中得到充分评估 (Bevilacqua et al., 2022)。其他研究表明,预训练语言模型参数中存储的知识可以通过直接生成文本实现一定程度的"检索" (Petroni et al., 2019; Roberts et al., 2020)。但前人工作仅将生成技术用于查询扩展 (Mao et al., 2021),未能充分发挥直接为开放域问题生成上下文文档的潜力。与上述旨在训练生成模型产生上下文文档标识符(仍使用原始维基百科文本)或为检索器提供数据增强的方法不同,我们的工作直接为给定问题生成上下文文档。

2.3 NLP MODELS ENHANCED BY LARGE LANGUAGE MODEL OUTPUTS.

2.3 基于大语言模型输出的增强型NLP模型

A line of recent work has shown that relevant knowledge can be elicited from large language models, especially for those domains that lack appropriate knowledge bases with sufficient coverage (Liu et al., 2022b; Fang et al., 2022). For example, Liu et al. (2022b) proposed leveraging GPT-3 to generate relevant contexts, then providing the contexts as additional input when answering a commonsense question. Another line of work focused on prompting a large language model to generate a series of intermediate reasoning steps, often referred to as chain-of-thought (Wei et al., 2022b; Kojima et al., 2022; Li et al., 2022). The prompt consists of an instruction (e.g., Let’s think step by step!), a few demonstrations that are fixed for each task, and a new-question placeholder. The demonstrations are human-written, and each consists of a question in the style of the task and a series of intermediate reasoning steps that is helpful for answering the question. Our work does not require any human annotation, but adds to this line of work of leveraging model generated text to guide further generations. In our case, we apply this approach to knowledge-intensive tasks, which have not been explored by previous work.

近期一系列研究表明,可以从大语言模型中提取相关知识,特别是在那些缺乏覆盖充分的知识库的领域 (Liu et al., 2022b; Fang et al., 2022)。例如,Liu et al. (2022b) 提出利用 GPT-3 生成相关上下文,然后在回答常识性问题时将这些上下文作为额外输入。另一类研究侧重于通过提示让大语言模型生成一系列中间推理步骤,通常称为思维链 (Wei et al., 2022b; Kojima et al., 2022; Li et al., 2022)。提示包括指令 (例如"让我们一步步思考!")、针对每个任务固定的少量示例,以及一个新问题的占位符。这些示例由人工编写,每个示例包含一个任务风格的问题和一系列有助于回答问题的中间推理步骤。我们的工作不需要任何人工标注,但延续了利用模型生成文本来指导后续生成的研究方向。在我们的案例中,我们将这种方法应用于知识密集型任务,这是以往研究尚未探索的领域。

3 PROPOSED METHOD

3 研究方法

In this section, we present details of our proposed novel generate-then-read (GENREAD) pipeline for solving various knowledge-intensive tasks. Specifically, it first prompts a large language model to generate contextual documents with respect to a given query, then reads the generated documents to predict the final answer. The reader can either be a large model (e.g., InstructGPT) used for the zero-shot setting, or a small one (e.g., FiD) fine-tuned with generated documents on the training split of the target dataset. We introduce the zero-shot setting in $\S3.1$ and the supervised setting in $\S3.2$.

在本节中,我们将详细介绍所提出的新型生成后阅读(GENREAD)流程,用于解决各类知识密集型任务。具体而言,该流程首先提示大语言模型根据给定查询生成相关上下文文档,随后通过阅读生成文档来预测最终答案。阅读器既可以是用于零样本设置的大模型(如InstructGPT),也可以是在目标数据集训练集上使用生成文档微调的小模型(如FiD)。我们将在$\S3.1$介绍零样本设置,在$\S3.2$介绍监督设置。

3.1 ZERO-SHOT SETTING

3.1 零样本 (Zero-shot) 设定

Under the zero-shot setting, there is no training data – neither questions nor contextual documents. When tested on the open-domain QA task, most existing large language models directly encode the given question and predict the answer (Brown et al., 2020; Du et al., 2022; Chowdhery et al., 2022). Specifically, the question $q$, associated with some text prompt, is input to the model, which then generates the answer, denoted as $p(a|q,\theta)$, where $\theta$ represents the pre-trained model parameters. In practice, the maximum a posteriori estimation (MAP) is the final answer, i.e., $\hat{a}=\arg\max_{a}p(a|q,\theta)$. However, this way of directly asking large language models to output answers often leads to poor performance, as it leaves a considerable amount of additional world knowledge unexploited (Levine et al., 2022). On the contrary, the zero-shot retrieve-then-read pipeline first uses an off-the-shelf retriever to fetch relevant documents from an external knowledge source such as Wikipedia, then asks the large language model to read the documents and predict the answer.

在零样本 (zero-shot) 设置下,既没有训练数据,也没有问题或上下文文档。在开放域问答任务测试中,大多数现有大语言模型直接对给定问题进行编码并预测答案 (Brown et al., 2020; Du et al., 2022; Chowdhery et al., 2022)。具体而言,问题 $q$ 与某些文本提示一起输入模型,模型随后生成答案,表示为 $p(a|q,\theta)$,其中 $\theta$ 代表预训练模型参数。实践中,最大后验估计 (MAP) 为最终答案,即 $\hat{a}=\arg\operatorname*{max}_{a}p(a|q,\theta)$。然而,这种直接要求大语言模型输出答案的方式往往表现不佳,因为它未能充分利用大量额外的世界知识 (Levine et al., 2022)。相反,零样本检索-阅读流程首先使用现成的检索器从维基百科等外部知识源获取相关文档,然后要求大语言模型阅读文档并预测答案。

In this work, we improve the performance by introducing an additional auxiliary generated document variable $d$, and then extend the model to have the form $p(a|q)=\sum_{i}p(a|d_{i},q)\,p(d_{i}|q)$. In practice, we cannot sum over all possible documents $d$. Therefore, the most common approach is to compute the MAP estimate $\hat{d}=\arg\max \hat{p}(d)$ using beam search, and then to approximate the sum over $d$ with this single value. We label this two-step approach a generate-then-read pipeline.

在本工作中,我们通过引入额外的辅助生成文档变量$d$来提升性能,并将模型扩展为$p(a|q)=\sum_{i}p(a|d_{i},q)\,p(d_{i}|q)$的形式。实际应用中无法对所有可能的文档$d$求和,因此最常见的方法是使用束搜索计算最大后验估计$\hat{d}=\arg\max \hat{p}(d)$,然后用该单一值近似替代对$d$的求和。我们将这种两步法称为生成-阅读 (generate-then-read) 流程。
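Written out as a single display, the factorization and its MAP approximation above read as follows; the final step only makes explicit that the sum over documents is replaced by the single MAP document $\hat{d}$, as stated in the text.

```latex
p(a \mid q) \;=\; \sum_{i} p(a \mid d_i, q)\, p(d_i \mid q)
\;\approx\; p(a \mid \hat{d}, q),
\qquad \hat{d} = \arg\max \, \hat{p}(d).
```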


Figure 1: An overall framework of the clustering-based prompting method. It leverages distinct question-document pairs sampled from each embedding cluster as in-context demonstrations to prompt a large language model to generate diverse documents, then reads the documents to predict an answer.


图 1: 基于聚类的提示方法整体框架。该方法从每个嵌入聚类中采样不同的问题-文档对作为上下文示例,提示大语言模型生成多样化文档,随后通过阅读这些文档来预测答案。

STEP 1: GENERATE. In this step, we first prompt a large language model (e.g., InstructGPT (Ouyang et al., 2022)) to generate documents based on the given question. For example, the input to the language model could be “Generate a background document to answer the given question. {question placeholder}”. We can use any decoding strategy (e.g., greedy decoding, beam search), but we used greedy decoding throughout the zero-shot experiments for simplicity and reproducibility.

步骤1:生成。在此步骤中,我们首先提示一个大语言模型(例如InstructGPT (Ouyang et al., 2022))根据给定问题生成文档。例如,语言模型的输入可以是“生成一份背景文档来回答给定问题。{问题占位符}”。我们可以使用任何解码策略(例如贪婪解码、束搜索),但为了简单性和可复现性,在零样本实验中我们全程采用贪婪解码。

STEP 2: READ. In the second step, we use generated sentence $\hat{d}$ along with the input question to produce the final answer from the large language model. This is actually the same setting as “zero-shot” reading comprehension, as widely studied in existing works (Brown et al., 2020; Lazaridou et al., 2022). We choose appropriate prompts from P3 (Bach et al., 2022), such as “Refer to the passage below and answer the following question. Passage: {background placeholder} Question: {question placeholder}”. Finally, the language model is fed the prompted text to generate an answer.

步骤2:阅读。在第二步中,我们使用生成的句子$\hat{d}$和输入问题,从大语言模型中生成最终答案。这实际上与现有研究中广泛探讨的"零样本"阅读理解设置相同 (Brown et al., 2020; Lazaridou et al., 2022)。我们从P3 (Bach et al., 2022) 中选择合适的提示,例如"参考以下段落并回答问题。段落:{背景占位符} 问题:{问题占位符}"。最后,将提示文本输入语言模型以生成答案。
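The two steps above fit in a few lines of code. The sketch below is only a minimal illustration, assuming access to the legacy OpenAI completion API; the model name, `max_tokens` values, and the `complete` helper are illustrative assumptions rather than the authors' released code, and `temperature=0` mirrors the greedy decoding used in the zero-shot experiments.

```python
import openai  # assumes the legacy OpenAI completion API (openai<1.0) is available

def complete(prompt: str, max_tokens: int) -> str:
    # Greedy decoding (temperature=0), as in the zero-shot setting described above.
    resp = openai.Completion.create(
        model="text-davinci-002",  # illustrative InstructGPT-style model name
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()

def generate_then_read(question: str) -> str:
    # STEP 1: GENERATE a contextual document conditioned on the question.
    document = complete(
        f"Generate a background document to answer the given question. {question}",
        max_tokens=300,
    )
    # STEP 2: READ the generated document and answer the question (zero-shot reading).
    return complete(
        "Refer to the passage below and answer the following question.\n"
        f"Passage: {document}\nQuestion: {question}",
        max_tokens=32,
    )
```

Calling `generate_then_read("What city was Zeus the patron god of?")` would first produce a short background passage and then extract the answer from it.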

3.2 SUPERVISED SETTING

3.2 监督式设定

Although large language models demonstrate impressive zero-shot learning abilities, their performance still lags behind the supervised setting. Therefore, we also explore how the generated documents from large language models can benefit the supervised setting. As directly fine-tuning large language models on downstream datasets could be prohibitively expensive, we leverage a small reader model such as FiD to peruse the generated documents under the supervised setting.

尽管大语言模型在零样本学习能力上表现出色,但其性能仍落后于监督学习场景。因此,我们同时探索了大语言模型生成的文档如何提升监督学习效果。鉴于直接在下游数据集上微调大语言模型可能成本过高,我们采用FiD等小型阅读器模型在监督环境下处理生成文档。

Under the supervised setting, scaling the number of retrieved documents can lead to better performance (Karpukhin et al., 2020; Izacard & Grave, 2021). This is mainly because retrieving more documents can cover more relevant information and knowledge, i.e., a higher recall score. Nevertheless, asking a large language model to generate multiple high-quality contextual documents is a challenging task. Dense retrieval methods can fetch multiple documents covering different perspectives of the answer. Compared to dense retrievers, simply prompting a large language model to generate multiple contextual documents often leads to low knowledge coverage, since the contents generated by multiple decoding passes from the same input tend to be similar. Sampling decoding methods, such as nucleus sampling (Holtzman et al., 2020), can diversify the generation process to some extent, but the knowledge content of generated texts still tends to be highly repetitive when used to generate documents for a given question. We further propose two novel solutions, diverse human prompts and clustering-based prompts, which will be elaborated on in this section.

在有监督设置下,扩大检索文档规模可以提升性能 (Karpukhin et al., 2020; Izacard & Grave, 2021)。这主要是因为检索更多文档能覆盖更广泛的相关信息和知识,即获得更高的召回率。然而,要求大语言模型生成多份高质量上下文文档具有挑战性。稠密检索方法能获取涵盖答案不同视角的多个文档。与稠密检索器相比,单纯提示大语言模型生成多份上下文文档往往导致知识覆盖率低下,因为同一输入经多次解码生成的内容往往高度相似。虽然核采样 (nucleus sampling) (Holtzman et al., 2020) 等采样解码方法能在一定程度上使生成过程多样化,但针对给定问题生成文档时,生成文本的知识内容仍存在高度重复性。我们进一步提出两种创新解决方案:多样化人工提示和基于聚类的提示,将在本节详细阐述。

3.2.1 DIVERSE HUMAN PROMPTS

3.2.1 多样化人类提示

In order to avoid similar token distributions under a single prompt, we ask human annotators to provide different prompts in order to make the generated documents diverse. This method is simple but can effectively vary the token distribution during generation. In the experiments, we empirically found that this method improves the retrieval performance (Figure 2). However, it suffers from two drawbacks. On one hand, it requires human annotators to write different prompts, which cannot be easily generalized to different knowledge-intensive tasks. On the other hand, different large language models might be sensitive to different prompt words, so a set of prompt words that works well for one large language model might not work on another.

为了避免单一提示下出现相似的token分布,我们要求人工标注者提供不同的提示,以使生成文档多样化。该方法虽然简单,但能有效改变生成过程中的token分布。实验中发现该方法能提升检索性能(图2)。但存在两个缺陷:一方面需要人工编写不同提示,难以推广到不同知识密集型任务;另一方面不同大语言模型可能对提示词敏感度不同,导致一组优质提示词在其他大语言模型上失效。

3.2.2 CLUSTERING-BASED PROMPTS

3.2.2 基于聚类的提示

To increase knowledge coverage in generated documents, we propose a novel clustering-based prompt method. It first clusters the representations of a set of documents into $K$ classes ($K=2$ in Figure 1), where the number of classes equals the number of documents that need to be generated in the end. Next, it randomly selects $n$ question-document pairs ($n=5$ in Figure 1) from each cluster. Lastly, the $n$ question-document pairs sampled from each cluster are presented to a large language model as in-context demonstrations for generating a document for the given question. In this way, the large language model conditions on different distributions of examples, resulting in generated documents that cover different perspectives. We show this in Figure 1 and illustrate the details of each step as follows.

为提升生成文档的知识覆盖范围,我们提出了一种基于聚类的新型提示方法。该方法首先将一组文档的表征聚类为$K$个类别(图1中$K=2$),类别数量等于最终需要生成的文档数量。接着,从每个聚类中随机选取$n$个问题-文档对(图1中$n=5$)。最后,大语言模型将这$n$个问题-文档对作为上下文示例,根据给定问题生成文档。通过这种方式,大语言模型基于不同的示例分布进行生成,从而使生成文档涵盖不同视角。图1展示了这一流程,各步骤细节如下所述。

STEP 1: GET ONE INITIAL DOCUMENT PER QUESTION. Similar to the zero-shot setting, we first ask a large language model to generate one contextual document $d$ for each question $q\in\mathcal{Q}$, where $\mathcal{Q}$ is the set of questions in the training split. Alternatively, we can use an unsupervised retriever (e.g., BM25) to obtain a document from Wikipedia. We now have a question-document pair set $\{q_{i},d_{i}\}_{i=1}^{|\mathcal{Q}|}$.

步骤1:为每个问题获取初始文档。类似于零样本设置,我们首先让一个大语言模型为每个问题$q\in\mathcal{Q}$生成一个上下文文档$d$,其中$\mathcal{Q}$是训练集中的问题集合。或者,我们也可以使用无监督检索器(例如BM25)从维基百科获取文档。现在,我们拥有一个问题-文档对集合$\{q_{i},d_{i}\}_{i=1}^{|\mathcal{Q}|}$。

STEP 2: ENCODE EACH DOCUMENT, DO K-MEANS CLUSTERING. We then use a large language model (i.e., GPT-3) to encode each question-document pair, i.e., $\mathbf{e}_{i}=\text{GPT-3}([q_{i},d_{i}])$, resulting in a 12,288-dimensional vector per document. Then, we use K-means to cluster all embedding vectors $\{\mathbf{e}_{i}\}_{i=1}^{|\mathcal{Q}|}$ into $K$ sets, so each question-document pair is assigned a unique cluster id $c\in\{1,\dots,K\}$. We vary the number of clusters $K$ in the experiments, which will be illustrated in Figure 2.

步骤2:编码每个文档并进行K均值聚类。随后,我们使用一个大语言模型(即GPT-3)对每个问题-文档对进行编码,即$\mathbf{e}_{i}=\text{GPT-3}([q_{i},d_{i}])$,每个文档得到一个12,288维向量。接着,我们采用K均值算法将所有嵌入向量$\{\mathbf{e}_{i}\}_{i=1}^{|\mathcal{Q}|}$聚类为$K$个集合,因此每个问题-文档对被分配一个唯一的聚类ID $c\in\{1,\dots,K\}$。实验中我们调整聚类数$K$,具体结果将在图2中展示。

STEP 3: SAMPLE AND GENERATE $K$ DOCUMENTS. Lastly, we sample $n$ question-document pairs from each cluster $c$, denoted as $\{q_{c1},d_{c1};q_{c2},d_{c2};\dots;q_{cn},d_{cn}\}$, in which $n$ is a hyperparameter. Then, the $n$ sampled question-document pairs from the same cluster serve as in-context demonstrations for the large language model to generate a contextual document. For example, the input to the large language model could be “{$q_{c1}$ placeholder} {$d_{c1}$ placeholder} ... {$q_{cn}$ placeholder} {$d_{cn}$ placeholder} {input question placeholder}”. By enumerating the sampled documents in these $K$ clusters, we finally get $K$ generated documents. By conditioning on different sampled in-context demonstrations collected from different clusters, the large language model is biased toward different perspectives. Although these different perspectives exist in a latent manner, we empirically show that this works well in practice, by comparing it with sampling methods and diverse human prompts (Figure 2 and Table 2), as well as with randomly sampling $n$ pairs from the entire dataset (Table 11).

步骤3:采样并生成$K$份文档。最后,我们从每个聚类$c$中采样$n$个问题-文档对,记为$\{q_{c1},d_{c1};q_{c2},d_{c2};\dots;q_{cn},d_{cn}\}$,其中$n$是一个超参数。然后,来自同一聚类的$n$个采样问题-文档对作为大语言模型的上下文示例,用于生成上下文文档。例如,大语言模型的输入可以是“{$q_{c1}$占位符} {$d_{c1}$占位符} ... {$q_{cn}$占位符} {$d_{cn}$占位符} {输入问题占位符}”。通过枚举这$K$个聚类中的采样文档,我们最终可以得到$K$份生成文档。通过基于从不同聚类收集的不同上下文示例进行条件生成,大语言模型被引导至不同的视角。尽管这些视角以潜在的方式存在,但通过与采样方法、多样化人工提示(图2和表2)以及从整个数据集中随机采样$n$对(表11)的比较,我们实证表明该方法在实践中效果良好。
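The three steps can be condensed into the following sketch. It assumes the question-document embeddings (12,288-dimensional GPT-3 vectors in the paper) have already been computed, and it uses scikit-learn's K-means; the `build_clustered_prompts` helper and the exact prompt template are illustrative assumptions, not the exact prompts used in the experiments.

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def build_clustered_prompts(questions, documents, embeddings, test_question,
                            K=10, n=5, seed=0):
    """Build K prompts for `test_question`, each containing n in-context
    question-document demonstrations drawn from a different embedding cluster."""
    rng = random.Random(seed)
    # STEP 2: cluster the embeddings of all (question, document) pairs into K sets.
    labels = KMeans(n_clusters=K, random_state=seed).fit_predict(np.asarray(embeddings))
    prompts = []
    for c in range(K):
        members = [i for i, lab in enumerate(labels) if lab == c]
        # STEP 3: sample n pairs from cluster c as in-context demonstrations.
        demos = rng.sample(members, min(n, len(members)))
        parts = [f"Question: {questions[i]}\nDocument: {documents[i]}\n" for i in demos]
        parts.append(f"Question: {test_question}\nDocument:")
        prompts.append("\n".join(parts))
    # Feeding each of the K prompts to the LLM yields K documents, each biased
    # toward the perspective of its cluster.
    return prompts
```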

4 EXPERIMENTS

4 实验

In this section, we conduct comprehensive experiments on three knowledge-intensive NLP tasks, including open-domain QA (NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017) and WebQ (Berant et al., 2013)), fact checking (FEVER (Thorne et al., 2018) and FM2 (Eisenschlos et al., 2021)) and open-domain dialogue system (WoW (Dinan et al., 2019)). More detailed dataset information can be found in Appendix A.1. To evaluate the model performance, we use exact match (EM) score for evaluating open-domain QA (Zhu et al., 2021). An answer is considered correct if and only if its normalized form has a match in the acceptable answer list. We also employ Recall@K (R@K) as an intermediate evaluation metric, measured as the percentage of top-K retrieved or generated documents that contain the answer. This metric is commonly used in evaluations of previous works (Karpukhin et al., 2020; Izacard & Grave, 2020; Sachan et al., 2022). For other knowledge-intensive tasks, we follow the KILT benchmark (Petroni et al., 2021) to use accuracy (ACC) for fact checking and F1 / Rouge-L (R-L) score for open-domain dialogue system.

在本节中,我们对三项知识密集型NLP任务进行了全面实验,包括开放域问答(NQ (Kwiatkowski et al., 2019)、TriviaQA (Joshi et al., 2017)和WebQ (Berant et al., 2013))、事实核查(FEVER (Thorne et al., 2018)和FM2 (Eisenschlos et al., 2021))以及开放域对话系统(WoW (Dinan et al., 2019))。更详细的数据集信息见附录A.1。为评估模型性能,我们采用精确匹配(EM)分数评估开放域问答(Zhu et al., 2021),当且仅当答案的标准化形式与可接受答案列表匹配时判定为正确。同时使用Recall@K (R@K)作为中间评估指标,即前K个检索或生成文档中包含答案的百分比,该指标在先前研究中被广泛采用(Karpukhin et al., 2020; Izacard & Grave, 2020; Sachan et al., 2022)。对于其他知识密集型任务,我们遵循KILT基准(Petroni et al., 2021),使用准确率(ACC)评估事实核查任务,采用F1/Rouge-L(R-L)分数评估开放域对话系统。
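For reference, a minimal sketch of the two QA metrics described above is given below; the answer normalization (lowercasing, stripping punctuation and articles) follows the standard open-domain QA convention and is an assumption about details not spelled out in this section.

```python
import re
import string

def normalize(text: str) -> str:
    # Standard open-domain QA normalization: lowercase, drop punctuation and articles.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers) -> bool:
    # EM: the normalized prediction must match one of the acceptable answers.
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def recall_at_k(documents, gold_answers, k: int) -> bool:
    # R@K for a single question: do the top-K documents contain any acceptable answer?
    text = " ".join(normalize(d) for d in documents[:k])
    return any(normalize(g) in text for g in gold_answers)
```

Dataset-level EM and R@K scores are the averages of these per-question indicators.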

Models    Open-domain QA (NQ / TriviaQA / WebQ)    Fact Checking (FEVER / FM2)    Dialogue System (WoW, F1 / R-L)
*with retriever, AND directly trained on these datasets
DPR + InstructGPT    29.1    53.8    20.2    79.8    65.9    15.4 / 13.7
*with retriever, BUT NOT trained on these datasets
BM25 + InstructGPT    19.7    52.2    15.8    78.7    65.2    15.7 / 13.7
Contriever + InstructGPT    18.0    51.3    16.6    80.4    66.6    15.5 / 14.0
Google + InstructGPT    28.8    58.8    20.4    82.9    66.0    14.8 / 13.2
*without retriever, and not using external documents
Previous SoTA methods    24.7¹    56.7²    19.0¹    -    -    -
InstructGPT (no docs.)    20.9    57.5    18.6    77.6    59.4    15.4 / 13.8
GENREAD (InstructGPT)    28.0    59.0    24.6    80.4    65.5    15.8 / 14.2
模型    开放域问答 (NQ / TriviaQA / WebQ)    事实核查 (FEVER / FM2)    对话系统 (WoW, F1/R-L)
带检索器,且直接在这些数据集上训练
DPR + InstructGPT 29.1 53.8 20.2 79.8 65.9 15.4 13.7
带检索器,但未在这些数据集上训练
BM25+InstructGPT 19.7 52.2 15.8 78.7 65.2 15.7 13.7
Contriever+InstructGPT 18.0 51.3 16.6 80.4 66.6 15.5 14.0
Google+InstructGPT 28.8 58.8 20.4 82.9 66.0 14.8 13.2
不带检索器,且不使用外部文档
先前SoTA方法 24.7¹ 56.7² 19.0¹ - - -
InstructGPT (无文档) 20.9 57.5 18.6 77.6 59.4 15.4 13.8
GENREAD(InstructGPT) 28.0 59.0 24.6 80.4 65.5 15.8 14.2

Table 1: Zero-shot open-domain QA performance. Our proposed GENREAD with the InstructGPT reader (named GENREAD (InstructGPT)) can significantly outperform the original InstructGPT, achieving new state-of-the-art performance on three open-domain QA benchmarks (previous SoTA: ¹GLaM (Du et al., 2022), ²FLAN (Wei et al., 2021)) under this setting without using any external document. Our GENREAD can achieve comparable or even better performance than zero-shot retrieve-then-read models that use a retriever or search engine to first obtain contextual documents. To ensure reproducibility, we use greedy search in decoding. All prompts used are shown in $\S\mathrm{B}.1$. Note: numbers are fixed in v2 by adding the average performance of different prompts; see details in Table 20.

表 1: 零样本开放领域问答性能。我们提出的GENREAD与InstructGPT阅读器(命名为GENREAD (InstructGPT))显著优于原始InstructGPT,在不使用任何外部文档的情况下,在此设置下实现了三个开放领域问答基准的最新性能(先前SoTA: ¹GLaM (Du et al., 2022), ²FLAN (Wei et al., 2021))。我们的GENREAD可以达到与使用检索器或搜索引擎先获取上下文文档的零样本检索-阅读模型相当甚至更好的性能。为确保可复现性,我们在解码时使用贪心搜索。所有使用的提示如$\S\mathrm{B}.1$所示。注: 通过添加不同提示的平均性能来修正v2中的数字,详见表20。

4.1 ZERO-SHOT SETTING EXPERIMENTS

4.1 零样本设置实验

We first compare our proposed GENREAD approach with various large language models proposed in recent years, including GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), FLAN (Wei et al., 2021), GLaM (Du et al., 2022), Chinchilla (Hoffmann et al., 2022), PaLM (Chowdhery et al., 2022) and InstructGPT (Ouyang et al., 2022). Due to the space limitation, we only put the best performance on each dataset in Table 1, in the line called previous SoTA methods. In addition, their corresponding model parameters and performance are listed in Table 9 in the Appendix. All of these baseline methods use the same input format, i.e., [prompt words; question].

我们首先将提出的GENREAD方法与近年来提出的各种大语言模型进行比较,包括GPT-3 (Brown等人,2020)、Gopher (Rae等人,2021)、FLAN (Wei等人,2021)、GLaM (Du等人,2022)、Chinchilla (Hoffmann等人,2022)、PaLM (Chowdhery等人,2022)和InstructGPT (Ouyang等人,2022)。由于篇幅限制,我们仅在表1中列出每个数据集上的最佳性能,该行称为先前SoTA方法。此外,它们对应的模型参数和性能列在附录的表9中。所有这些基线方法都使用相同的输入格式,即[提示词;问题]。

GENREAD is based on InstructGPT with 175B parameters. In order to fully evaluate the effectiveness of our proposed method, we also compare with InstructGPT augmented with retrieved documents from Wikipedia or Google search. The baseline methods (1) BM25 / Contriever $+$ InstructGPT; (2) Google $+$ InstructGPT; (3) DPR $+$ InstructGPT have the same input format as our GENREAD, i.e., [prompt words; contextual document; question]. BM25 is a traditional sparse retrieval method. Contriever (Izacard et al., 2022a) is a state-of-the-art unsupervised dense retrieval model. DPR (Karpukhin et al., 2020) is a supervised dense retrieval model directly trained on the NQ, TriviaQA and WebQ datasets. We note that comparing with the above three methods is challenging because our method only relies on the large language model itself, without using any external corpus.

GENREAD基于拥有1750亿参数的InstructGPT。为了全面评估我们提出方法的有效性,我们还对比了通过维基百科或谷歌搜索获取检索文档增强的InstructGPT。基线方法包括:(1) BM25/Contriever $+$ InstructGPT;(2) 谷歌搜索 $+$ InstructGPT;(3) DPR $+$ InstructGPT,这些方法与我们的GENREAD采用相同的输入格式,即[提示词;上下文文档;问题]。BM25是传统的稀疏检索方法。Contriever (Izacard等,2022a) 是最先进的无监督密集检索模型。DPR (Karpukhin等,2020) 是直接在NQ、TriviaQA和WebQ数据集上训练的监督式密集检索模型。需要指出的是,与上述三种方法对比具有挑战性,因为我们的方法仅依赖大语言模型自身,未使用任何外部语料库。

4.1.1 EXPERIMENTAL RESULTS

4.1.1 实验结果

In the experiments, we use InstructGPT as our backbone model. As shown in Table 1, compared with state-of-the-art large language models, our proposed GENREAD with the InstructGPT reader improves its performance by generating contextual documents and conditioning on the generated documents, even though no new data is introduced, and the generator and reader have the exact same parameters. Specifically, GENREAD can improve the EM score by $+6.9$ on three open-domain QA benchmarks, compared to the original InstructGPT. We also make a similar observation on fact checking and open-domain dialogue system. Our proposed GENREAD can consistently outperform the baseline InstructGPT model without retrieving any contextual documents.

在实验中,我们使用InstructGPT作为主干模型。如表1所示,与最先进的大语言模型相比,我们提出的GENREAD方法通过生成上下文文档并以生成文档为条件,即使没有引入新数据且生成器与阅读器参数完全相同,仍能提升InstructGPT阅读器的性能。具体而言,在三个开放域QA基准测试中,GENREAD的EM分数比原始InstructGPT提高了$+6.9$。在事实核查和开放域对话系统任务中,我们也观察到类似现象。提出的GENREAD方法在不检索任何上下文文档的情况下,始终优于基线InstructGPT模型。

To further validate the effectiveness of GENREAD, we compare against zero-shot retrieve-then-read pipeline models, which first use a retrieval model or the Google search engine to get a relevant contextual document, then use InstructGPT to read the texts and produce the final answer. As shown in Table 1, GENREAD can achieve on-par performance with zero-shot retrieve-then-read pipeline models on the NQ and FM2 datasets, and outperform them on all other benchmarks. The knowledge learned by the large language models can be retrieved via autoregressive text generation. Without seeing any examples from these datasets, GENREAD can outperform using the supervised retrieval model (i.e., DPR) to recover relevant contextual documents.

为进一步验证GENREAD的有效性,我们将其与零样本检索-阅读流程模型进行对比。该流程首先使用检索模型或Google搜索引擎获取相关上下文文档,再通过InstructGPT阅读文本生成最终答案。如表1所示,GENREAD在NQ和FM2数据集上与零样本检索-阅读流程模型表现相当,在其他所有基准测试中均优于后者。大语言模型通过自回归文本生成即可提取其习得的知识。在未接触这些数据集任何样本的情况下,GENREAD的表现优于使用监督式检索模型(如DPR)获取相关上下文文档的方案。


Figure 2: Recall@K on test sets, measured as the percentage of top-K documents that contain the answer. Our proposed clustering-based prompting method can outperform DPR and Google search, as well as two variants of using LLMs to generate documents. Exact numbers are reported in Table 6.

图 2: 测试集上的召回率@K,衡量包含答案的前K篇文档的百分比。我们提出的基于聚类的提示方法优于DPR和Google搜索,也优于使用大语言模型生成文档的两种变体。具体数值见表6。

4.2 SUPERVISED SETTING EXPERIMENTS

4.2 监督式设置实验

We compare our proposed GENREAD with retrieve-then-read models, including DPR (Karpukhin et al., 2020), RAG (Lewis et al., 2020), and FiD (Izacard & Grave, 2021). In addition, we compared with obtaining relevant documents from the internet using the Google search engine.

我们将提出的GENREAD与检索-阅读(retrieve-then-read)模型进行比较,包括DPR (Karpukhin等人,2020)、RAG (Lewis等人,2020)和FiD (Izacard & Grave,2021)。此外,我们还与使用Google搜索引擎从互联网获取相关文档的方法进行了对比。

4.2.1 EXPERIMENTAL SETUP

4.2.1 实验设置

For our proposed method, we replace the retriever with a large language model to directly generate contextual documents. In the experiments, we use InstructGPT (Ouyang et al., 2022). After contextual documents are retrieved or generated, we employ a FiD reader with 770M parameters (i.e., FiD-l) or 3B parameters (i.e., FiD-xl) that is fine-tuned on the training split of the target dataset. We note that we only use 10 documents during reading, for the following reasons.

在我们提出的方法中,我们用一个大型语言模型替代了检索器,直接生成上下文文档。实验中我们使用了 InstructGPT (Ouyang et al., 2022)。当上下文文档被检索或生成后,我们采用了参数规模为 7.7 亿的 FiD 阅读器模型 (即 FiD-l) 和 30 亿参数的模型 (即 FiD-xl),这些模型在目标数据集的训练集上进行了微调。需要说明的是,在阅读阶段我们仅使用 10 篇文档,原因如下。

Why do we choose to use only 10 documents instead of 100 when reading?

为什么我们在阅读时选择只用10份文档而非100份?

As noted in Section 6.2 in DPR (Karpukhin et al., 2020) and Figure 3 in FiD (Izacard & Grave, 2021), increasing the number of documents can lead to better model performance and achieve state-of-the-art when using 100 documents. However, there are two major drawbacks to using 100 documents during the reading step. First, the operation is very expensive, leading to a significant increase in memory consumption and training time. As reported by Izacard & Grave (2021), the training process requires 64 Tesla V100 32GB running for around one day. Second, generating documents by using a large language model is slow and expensive, so only using 10 documents can be a significant cost saving in our method. Therefore, in our experiments, we choose to use 10 documents during the reading process. When using FiD-770M (i.e., FiD-large), the training process can be easily performed even on a single Tesla V100 32GB GPU. Meanwhile, when only using 10 documents, we can also increase the size of FiD model from 770M to 3B, which takes about the same amount of GPU memory as using 100 documents on a 770M model, but at the same time significantly shortens the training time. We note that training T5-3B model needs a bigger cluster such as 8 Tesla V100 or A100 GPUs.

如DPR (Karpukhin等,2020) 第6.2节和FiD (Izacard & Grave,2021) 图3所示,增加文档数量能提升模型性能,使用100份文档时可达到最优水平。但阅读阶段使用100份文档存在两大缺陷:首先,该操作成本极高,会显著增加内存消耗和训练时长。据Izacard & Grave (2021) 报告,训练过程需要64块Tesla V100 32GB显卡运行约一整天;其次,通过大语言模型生成文档速度慢且成本高,因此本方法仅使用10份文档可大幅节省成本。故实验中我们选择在阅读阶段使用10份文档。使用FiD-770M (即FiD-large) 时,训练过程甚至可在单块Tesla V100 32GB显卡上轻松完成。同时,仅使用10份文档还能将FiD模型规模从770M扩大到3B,其显存占用与770M模型处理100份文档时相当,但能显著缩短训练时间。需注意的是,训练T5-3B模型需要更大规模的集群,例如8块Tesla V100或A100显卡。

4.2.2 EXPERIMENTAL RESULTS ON OPEN-DOMAIN QA

4.2.2 开放域问答实验结果

We first use Recall@K to compare the retrieval accuracy of different models. As shown in Figure 2, GENREAD can significantly outperform DPR and Google search for under 10 retrieved or generated documents. Compared across different GENREAD variants, including nucleus sampling, human written prompts, and clustering-based prompts, clustering-based prompts achieve the best performance. At the same time, we notice that the language model inevitably has the problem that the slope of the curve decreases as the number of generated documents increases. On one hand, this is due to the similarity of token distributions when large language models generate multiple documents. On the other hand, due to the shallow interaction characteristics of the dense retrieval model itself, the retrieved documents might not be completely relevant to the given question, so the increase in recall might come from false positive documents, as also mentioned by Sachan et al. (2022).

我们首先使用召回率@K来比较不同模型的检索准确率。如图 2 所示,在检索或生成文档数量少于 10 篇时,GENREAD 明显优于 DPR 和谷歌搜索。与不同 GENREAD 变体(包括核采样、人工编写提示和基于聚类的提示)相比,基于聚类的提示表现最佳。同时,我们注意到语言模型不可避免地存在曲线斜率随生成文档数量增加而下降的问题。一方面,这是由于大语言模型生成多篇文档时 token 分布的相似性所致;另一方面,由于稠密检索模型本身的浅层交互特性,检索到的文档可能与给定问题不完全相关,因此召回率的提升可能来自误判文档,正如 Sachan 等人 (2022) 所提到的。

Table 2: Supervised open-domain QA performance. By only using generated documents from InstructGPT, our GENREAD with the FiD reader (named GENREAD (FiD)) can achieve better performance than baseline methods on TriviaQA and WebQ. Through our detailed analysis of NQ, we found the performance gap is mainly due to the temporality issue, which will be elaborated in §A.7.

Models    # reader parameters    # documents    TriviaQA (open test)    WebQ (open test)    NQ (open test)    Avg.
*baselines with retrieving from Wikipedia; all numbers reported by existing papers
DPR (Karpukhin et al., 2020)    110M    100    56.8    41.1    41.5    46.5
RAG (Lewis et al., 2020)    400M    10    56.1    45.2    44.5    48.6
FiD (Izacard & Grave, 2021)    770M    100    67.6    50.5    51.4    56.5
*baselines with retrieving from Wikipedia or Google; all numbers from our experiments
FiD-l (DPR, Wikipedia)    770M    10    61.9    48.1    46.7    52.2
FiD-xl (DPR, Wikipedia)    3B    10    66.3    50.8    50.1    55.7
FiD-xl (Google search)    3B    10    70.1    53.6    45.0    56.2
*our proposed method by leveraging a large language model to generate documents
GENREAD (FiD-l) (sampling)    770M    10    67.8    51.5    40.3    53.2
GENREAD (FiD-l) (clustering)    770M    10    70.2    53.3    43.5    55.6
GENREAD (FiD-xl) (sampling)    3B    10    69.6    52.6    42.6    54.9
GENREAD (FiD-xl) (clustering)    3B    10    71.6    54.4    45.6    57.1
+ merge retrieved documents with generated documents    -    -    74.3    56.2    54.0    61.5

表 2: 监督式开放域问答性能。仅使用 InstructGPT 生成的文档时,我们搭载 FiD 阅读器的 GENREAD (命名为 GENREAD (FiD)) 在 TriviaQA 和 WebQ 上的表现优于基线方法。通过对 NQ 的详细分析,我们发现性能差距主要源于时效性问题,具体将在 §A.7 中阐述。

模型    阅读器参数量    文档数    TriviaQA(开放测试)    WebQ(开放测试)    NQ(开放测试)    平均
*基线方法使用维基百科检索;所有数据来自现有论文
DPR (Karpukhin et al., 2020) 110M 100 56.8 41.1 41.5 46.5
RAG (Lewis et al., 2020) 400M 10 56.1 45.2 44.5 48.6
FiD (Izacard & Grave, 2021) 770M 100 67.6 50.5 51.4 56.5
*基线方法使用维基百科或谷歌检索;所有数据来自我们的实验
FiD-l (DPR, Wikipedia) 770M 10 61.9 48.1 46.7 52.2
FiD-xl (DPR, Wikipedia) 3B 10 66.3 50.8 50.1 55.7
FiD-xl (Google search) 3B 10 70.1 53.6 45.0 56.2
*我们提出的方法:利用大语言模型生成文档
GENREAD (FiD-l) (采样) 770M 10 67.8 51.5 40.3 53.2
GENREAD (FiD-l) (聚类) 770M 10 70.2 53.3 43.5 55.6
GENREAD (FiD-xl) (采样) 3B 10 69.6 52.6 42.6 54.9
GENREAD (FiD-xl) (聚类) 3B 10 71.6 54.4 45.6 57.1
融合检索文档与生成文档 - - 74.3 56.2 54.0 61.5

As shown in Table 2, we first observe that the FiD model performs the best among all baseline models. Using FiD-xl with only 10 documents achieves comparable performance to using FiD-l with 100 documents; the average gap is less than $1\%$ on three benchmarks. Compared with both closed-book models and Wikipedia-based retrieve-then-read pipelines, our proposed GENREAD can achieve state-of-the-art performance. Furthermore, compared with using sampling methods to generate documents, the clustering-based prompt method can improve the EM score by $+2.2$ on average. This indicates that the clustering-based prompt method effectively increases the knowledge coverage of generated documents, also leading to better downstream QA performance. We also show that GENREAD can outperform Google search on all benchmarks. We observe that both our method and Google search perform worse than DPR on NQ, mainly due to the significant portion of time-dependent questions in the dataset, which is described in the following analysis.

如表 2 所示,我们首先可以观察到 FiD 模型在所有基线模型中表现最佳。仅使用 10 篇文档的 FiD-xl 就能达到与使用 100 篇文档的 FiD-l 相当的性能,三个基准测试上的平均差距小于 $1\%$。与闭卷模型和基于维基百科的检索-阅读流程相比,我们提出的 GENREAD 能够实现最先进的性能。此外,与使用采样方法生成文档相比,基于聚类的提示方法平均可将 EM 分数提高 $+2.2$。这表明基于聚类的提示方法有效增加了生成文档的知识覆盖范围,同时也带来了更好的下游问答性能。我们还展示了 GENREAD 在所有基准测试上都能超越 Google 搜索。我们观察到在 NQ 上,我们的方法和 Google 搜索的表现都不如 DPR,主要是由于数据集中存在大量时间敏感性问题,具体分析如下。

4.2.3 EXPERIMENTAL RESULTS ON OTHER TASKS

4.2.3 其他任务的实验结果

We demonstrate the experimental results in Table 3. Under the supervised setting, GENREAD can achieve on-par performance on the fact checking task and superior performance on the dialogue system task, indicating that a large language model can be seen as a strong knowledge generator.

我们在表3中展示了实验结果。在有监督设置下,GENREAD在事实核查任务上达到相当性能,在对话系统任务上表现更优,这表明大语言模型可视为强大的知识生成器。

Models    FEVER (Acc.)    FM2 (Acc.)    WoW (F1 / R-L)
RAG (Lewis et al., 2020)    86.3    71.1    13.1 / 11.6
FiD (Izacard & Grave, 2021)    90.2    77.6    17.5 / 16.1
GENREAD (FiD-xl) (sampling)    89.0    76.3    18.9 / 16.7
GENREAD (FiD-xl) (clustering)    89.6    77.8    19.1 / 16.8
+ merge two source docs.    91.8    78.9    20.1 / 17.9

Table 3: Supervised performance on fact checking (FEVER and FM2) and open-domain dialogue system (WoW).

模型    FEVER(准确率)    FM2(准确率)    WoW (F1/R-L)
RAG (Lewis et al., 2020) 86.3 71.1 13.1/11.6
FiD (Izacard & Grave, 2021) 90.2 77.6 17.5/16.1
GENREAD (FiD-xl) (采样) 89.0 76.3 18.9/16.7
GENREAD (FiD-xl) (聚类) 89.6 77.8 19.1/16.8
+ 合并两个源文档 91.8 78.9 20.1/17.9

表 3: 事实核查 (FEVER 和 FM2) 和开放域对话系统 (WoW) 的监督性能。

The main reason that GENREAD performs worse than the dense retriever on fact checking is that the task provides sufficient semantic information to reach strong performance on this binary decision task. So, there is a smaller semantic gap between the given factual statement and contextual documents than between question and document pairs in open-domain QA, which is an easier retrieval setting for modern dense retrieval methods that are mainly based on vector similarity.

GENREAD 在事实核查任务上表现不如稠密检索器的主要原因在于,该任务本身提供了足够的语义信息,使模型在这一二元决策任务上即可取得较强性能。因此,给定的事实陈述与上下文文档之间的语义差距,小于开放域问答中问题与文档之间的差距,这对主要基于向量相似度的现代稠密检索方法而言是更容易的检索场景。

4.3 OBSERVATIONS AND EXPERIMENTAL ANALYSIS

4.3 观察与实验分析

4.3.1 COMPLEMENTARITY OF GENERATED AND RETRIEVED DOCUMENTS

4.3.1 生成文档与检索文档的互补性

Generated documents can be combined with retrieved documents to outperform both. Even with a very large number of retrieved documents, including a few generated documents leads to large improvements. As shown in Table 2, merging retrieved documents with generated documents achieves state-of-the-art performance compared to all baseline methods listed in the table.

生成的文档可以与检索到的文档结合,表现优于单独使用两者。即使检索到的文档数量非常多,加入少量生成文档也能带来显著提升。如表2所示,将检索文档与生成文档合并,可以取得优于表中所有基线方法的最先进性能。


Figure 3: Combining DPR retrieved documents and large language model (LLM) generated documents can achieve significantly better performance than using DPR retrieved documents only. For a fair comparison, instead of adding LLM generated documents to the model, we replace 10 documents retrieved by DPR with 10 documents generated by LLM so the total number of documents is the same. In this experiment, we use FiD-l (i.e., FiD-large) as the reader model because when the documents scale to more than 20, FiD-xl (i.e., FiD-3B) causes out-of-memory issues on A100 GPUs.

图 3: 结合 DPR 检索文档与大语言模型 (LLM) 生成文档,能获得比仅使用 DPR 检索文档显著更优的性能。为确保公平对比,我们并非向模型追加 LLM 生成文档,而是用 10 篇 LLM 生成文档替换 DPR 检索的 10 篇文档,以保持文档总数不变。本实验采用 FiD-l (即 FiD-large) 作为阅读器模型,因为当文档量超过 20 篇时,FiD-xl (即 FiD-3B) 会在 A100 GPU 上引发内存溢出问题。

Specifically, it improves by $+5.7$ on average across the three open-domain QA benchmarks compared to DPR alone, and by $+4.4$ on average compared to the large language model alone.

具体而言,在三个开放域问答基准上,该方案相比单独使用DPR平均提升$+5.7$,相比单独使用大语言模型平均提升$+4.4$。

4.3.2 COVERAGE ANALYSIS OVER ALL POSSIBLE ANSWERS

4.3.2 全可能答案覆盖分析

The improvement in open-domain QA performance is due to the fact that correct answers are included more frequently in the generated text than in the retrieved documents. Recall@K, the most commonly used metric in existing works to measure retrieval performance, computes the percentage of top-K retrieved or generated documents that contain any possible answer at least once. However, as many questions contain multiple correct answers, Recall@K cannot fully reflect the diversity of generated or retrieved documents. Each question in WebQ has 2.39 correct answers on average, each question in NQ has 1.79, and each question in TriviaQA has 14.02 (including all entity aliases). NQ and WebQ do not include alias names in the labels.

开放域问答性能的提升源于正确答案在生成文本中出现的频率高于检索文档。召回率@K是现有工作中最常用的检索性能指标,它计算前K个检索或生成文档中至少包含一个可能答案的百分比。然而,由于许多问题存在多个正确答案,召回率@K无法完全反映生成或检索文档的多样性。WebQ数据集中每个问题平均包含2.39个正确答案,NQ为1.79个,TriviaQA则达到14.02个(含所有实体别名)。NQ和WebQ的标注中未包含别名。

In this section, we also report the answer coverage of different models in Table 4. Answer coverage measures the percentage of all possible answers that are contained in the documents. The coverage analysis shows that generated text tends to have lower coverage than retrieved documents, because generated documents tend to be less diverse than retrieved documents.

在本节中,我们还在表4中展示了不同模型的答案覆盖率。答案覆盖率衡量文档中包含的答案数量占所有可能答案的百分比。覆盖率分析表明,生成文本的覆盖率往往低于检索文档,因为生成文档的多样性通常不及检索文档。
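As an illustration of how the numbers in Table 4 can be computed, a minimal sketch of the per-question answer-coverage metric defined above is given below; the simple lowercase substring matching is an assumption about implementation details.

```python
def answer_coverage(documents, all_answers) -> float:
    """Fraction of all acceptable answers (including aliases) that appear at
    least once in the given retrieved or generated documents."""
    text = " ".join(doc.lower() for doc in documents)
    covered = sum(1 for ans in all_answers if ans.lower() in text)
    return covered / len(all_answers) if all_answers else 0.0
```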

Documents obtained by ↓    NQ    TriviaQA (w. alias)    TriviaQA (w/o alias)    WebQ
BM25 (Robertson et al., 2009)    48.4    17.1    63.8    41.2
Google search engine    57.9    18.9    72.0    54.2
DPR (Karpukhin et al., 2020)    67.9    17.9    67.3    58.8
GENREAD (nucleus sampling)    56.6    19.6    74.5    59.8
GENREAD (10 human prompts)    57.4    20.1    74.8    61.1
GENREAD (clustering prompts)    61.7    20.4    76.5    62.1

Table 4: Answer coverage (%) over 10 retrieved or generated documents. Case studies are provided in Tables 16-19 in Appendix.

获取文档方式 ↓    NQ    TriviaQA(含别名)    TriviaQA(不含别名)    WebQ
BM25 (Robertson et al., 2009)    48.4    17.1    63.8    41.2
Google 搜索引擎    57.9    18.9    72.0    54.2
DPR (Karpukhin et al., 2020)    67.9    17.9    67.3    58.8
GENREAD (nucleus sampling)    56.6    19.6    74.5    59.8
GENREAD (10 human prompts)    57.4    20.1    74.8    61.1
GENREAD (clustering prompts)    61.7    20.4    76.5    62.1

表 4: 基于10篇检索或生成文档的答案覆盖率 (%)。案例研究见附录中的表 16-19。

To improve coverage, we propose GENREAD with clustering-based prompts, where we include in the prompt examples drawn from different clusters of the training data to elicit more diverse generations.

为了提高覆盖率,我们提出了基于聚类提示的GENREAD,即在提示中加入来自训练数据不同聚类的示例,以激发更多样化的生成内容。

4.4 READABILITY ANALYSIS OF RETRIEVED AND GENERATED DOCUMENTS

4.4 检索与生成文档的可读性分析

After manually comparing some documents retrieved by DPR with documents generated by InstructGPT, we observe that documents differ in readability even when they contain the correct answer string. In other words, documents containing answers might also contain noisy information that is irrelevant to the question, which could affect both model and human reading.

在我们手动对比了DPR检索的部分文档和InstructGPT生成的文档后,发现即使文档包含正确答案字符串,其可读性也存在差异。也就是说,包含答案的文档可能同时含有与问题无关的噪声信息,这会影响模型和人类的阅读。

Documents obtained by ↓    NQ    TriviaQA    WebQ
DPR (Karpukhin et al., 2020)    63.1    80.2    63.3
GENREAD (nucleus sampling)    58.7    83.7    63.8
GENREAD (clustering prompts)    64.0    86.8    66.7

Table 5: Readability study on retrieved documents and generated documents. See detailed analysis in $\S4.4$.

获取文档方式 ↓    NQ    TriviaQA    WebQ
DPR (Karpukhin et al., 2020)    63.1    80.2    63.3
GENREAD (nucleus sampling)    58.7    83.7    63.8
GENREAD (clustering prompts)    64.0    86.8    66.7

表 5: 检索文档与生成文档的可读性研究。详细分析见 $\S4.4$。

In order to further validate the readability of retrieved documents and generated documents, we extracted a subset of data examples from NQ, TriviaQA and WebQ datasets, in which both retrieved and generated documents contain the correct answer. As shown in Table 5, when both retrieved and generated documents contain the correct answer, the FiD reader can produce more correct answers when reading the generated documents from large language models (e.g., Instruct GP T).

为了进一步验证检索文档和生成文档的可读性,我们从NQ、TriviaQA和WebQ数据集中提取了部分数据样本,这些样本中的检索文档和生成文档都包含正确答案。如表5所示,当检索文档和生成文档都包含正确答案时,FiD阅读器在读取大语言模型(如InstructGPT)生成的文档时能产生更多正确答案。

We also provide some case studies in Tables 16-19. For example, in Table 18, the question is “What city was Zeus the patron god of?”. The first document retrieved by DPR is “Like the other Panhellenic Games, the ancient Olympic Games were a religious festival, held at the sanctuary of Zeus at Olympia.”. Although it contains the correct answer, it is hard to infer the answer “Olympia” from it. On the contrary, InstructGPT generates the document “Zeus was the patron god of the city of Olympia, which was located in the northwestern Peloponnese region of Greece. Olympia was the site of the Olympic Games, held every four years in honor of Zeus.”, which is much easier to read.

我们还在表16-19中提供了一些案例研究。例如在表18中,问题为"What city was Zeus the patron god of?"。DPR检索到的首份文档是"Like the other Panhellenic Games, the ancient Olympic Games were a religious festival, held at the sanctuary of Zeus at Olympia.",虽然包含正确答案,但很难从中推断出"Olympia"这个答案。相反,InstructGPT生成的文档"Zeus was the patron god of the city of Olympia, which was located in the northwestern Peloponnese region of Greece. Olympia was the site of the Olympic Games, held every four years in honor of Zeus."更易于理解。

5 EPILOGUE

5 尾声

CONCLUSION. In this paper, we present a novel perspective for solving knowledge-intensive tasks by replacing dense retrieval models with large language model generators. We call our method generate-then-read, which first prompts a large language model to generate contextual documents, then reads the generated documents to infer the final answer. Notably, without retrieving any documents, it reaches 71.6 and 54.4 exact match scores on TriviaQA and WebQ, significantly outperforming the current retrieve-then-read model DPR-FiD, and it also performs strongly on the other two knowledge-intensive tasks.

结论。本文提出了一种解决知识密集型任务的新视角,即用大语言模型生成器替代密集检索模型。我们称之为"生成后阅读"(generate-then-read)方法,该方法首先提示大语言模型生成上下文文档,然后阅读生成的文档来推断最终答案。值得注意的是,在未检索任何文档的情况下,该方法在TriviaQA和WebQ上分别达到71.6和54.4的精确匹配分数,显著优于当前检索-阅读模型DPR-FiD,在其他两项知识密集型任务中同样表现优异。

LIMITATION AND FUTURE WORK. Despite the strong performance on the presented datasets, our approach is limited in its ability to update its knowledge state and adapt to new domains. A major feature of retrieve-then-read is the ability to swap in new documents when new information is learned, such as temporally more recent documents, or to add documents from a new domain to quickly adapt to a new downstream task. Our approach relies on a large language model to contain all this knowledge, and adding new knowledge would likely require some retraining. Future work will explore how to efficiently incorporate new knowledge into our generate-then-read method. Besides, generated documents might suffer from hallucination errors, resulting in incorrect predictions. We demonstrate a case study in Table 15. Combining our method with recent approaches to boost generative faithfulness (Creswell & Shanahan, 2022) is also a direction worthy of future research.

局限性与未来工作。尽管在现有数据集上表现优异,但我们的方法在更新知识状态和适应新领域方面存在局限。检索-阅读(retrieve-then-read)方法的核心优势在于能够动态替换新文档(例如时效性更强的文献或跨领域文档)以快速适应下游任务,而我们的生成-阅读(generate-then-read)方案依赖大语言模型内化所有知识,新增知识可能需要重新训练。未来工作将探索如何高效整合新知识。此外,生成文档可能存在幻觉错误(如表15案例所示),结合近期提升生成可信度的研究(Creswell & Shanahan, 2022)也是值得探索的方向。

ETHICS STATEMENT

伦理声明

Large language models have a wide range of beneficial applications for society, but they also have potentially harmful applications. Previous work has shown various forms of bias, such as racial and gender bias, in large language models like GPT-3, even after explicit efforts to reduce toxic language (Chan, 2022). The importance of addressing these societal harms is acknowledged by OpenAI themselves in their 2020 paper introducing GPT-3 (Brown et al., 2020), which stated “we focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 ... and issues of bias, fairness, and representation within models like GPT-3.” on page 34.

大语言模型对社会具有广泛的有益应用,但也存在潜在的有害应用。先前研究表明,即便经过明确减少有害语言的努力,像GPT-3这样的大语言模型仍存在多种形式的偏见,例如种族和性别偏见 (Chan, 2022)。OpenAI在2020年介绍GPT-3的论文中承认了解决这些社会危害的重要性 (Brown et al., 2020),该论文在第34页指出:"我们重点关注两个主要问题:像GPT-3这样的语言模型可能被蓄意滥用...以及像GPT-3这样的模型内部的偏见、公平性和代表性等问题。"

The goal of this paper is to utilize knowledge stored in the parameters of large language models to answer open-domain questions and solve knowledge-intensive tasks. Unlike retrieve-then-read where an external corpus can be curated to be trustworthy, the use of a model to generate contextual documents may further permeate existing biases in common models. First, our work shows that generated documents suffer from challenges of stale information from outdated documents used for training. Second, we show that generated documents tend to be less diverse, potentially biasing answers towards more common entities and terms from the training data. Finally, we conducted experiments on only three large language models. It is possible that some of our conclusions or observations may not necessarily hold for other models trained with different data or objectives.

本文旨在利用大语言模型参数中存储的知识来回答开放域问题并解决知识密集型任务。与基于检索-阅读模式 (retrieve-then-read) 的方法不同(该方法可通过筛选外部语料库确保可信度),使用模型生成上下文文档可能会放大现有常见模型中的偏见。首先,我们的研究表明生成文档存在训练数据过时导致信息陈旧的问题。其次,生成文档往往多样性不足,可能使答案偏向训练数据中更常见的实体和术语。最后,我们仅针对三种大语言模型进行了实验,部分结论或观察可能不适用于采用不同训练数据或目标的其他模型。

Regarding ethical solutions, future work includes (i) further exploring potential bias and intentional or unintentional harm that may result from using generated contextual documents; (ii) better aligning language models with user intent to generate less biased contents and fewer fabricated facts.

关于伦理解决方案,未来工作包括:(i) 进一步探究使用生成式上下文文档可能导致的潜在偏见及有意/无意的伤害;(ii) 更好地使语言模型与用户意图对齐,以生成偏见更少的内容和更少的虚构事实。

ACKNOWLEDGEMENTS

致谢

This work was supported in part by NSF IIS-2119531, IIS-2137396, IIS-2142827, CCF-1901059, and ONR N00014-22-1-2507. Wenhao is supported in part by Bloomberg Data Science Ph.D Fellowship.

本研究部分受到美国国家科学基金会(NSF)项目IIS-2119531、IIS-2137396、IIS-2142827、CCF-1901059以及海军研究办公室(ONR)项目N00014-22-1-2507的资助。Wenhao还获得了彭博数据科学博士奖学金的部分支持。

A APPENDIX

A 附录

Datasets    Splits    Train    Valid    Test    Test labels
TriviaQA (Joshi et al., 2017)    open domain / wikipedia split    78,785    8,837    11,313 / 7,993    public / public
WebQ (Berant et al., 2013)    open domain    3,478    300    2,032    public
NQ (Kwiatkowski et al., 2019)    open domain    79,168    8,757    3,610    public
FEVER (Thorne et al., 2018)    kilt challenge    104,966    10,444    10,100    hidden
FM2 (Eisenschlos et al., 2021)    official split    10,149    1,169    1,380    public
WoW (Dinan et al., 2019)    kilt challenge    63,734    3,054    2,944    hidden

Table 6: Dataset splits and statistics. For FEVER and WoW, the test labels are hidden, so model performance must be evaluated at https://ai.facebook.com/tools/kilt/.

| 数据集 | 划分方式 | 训练集 | 验证集 | 测试集 | 测试标签 |
| --- | --- | --- | --- | --- | --- |
| TriviaQA (Joshi et al., 2017) | 开放域 | 78,785 | 8,837 | 11,313 | 公开 |
| TriviaQA (Joshi et al., 2017) | 维基百科划分 | – | – | 7,993 | 公开 |
| WebQ (Berant et al., 2013) | 开放域 | 3,478 | 300 | 2,032 | 公开 |
| NQ (Kwiatkowski et al., 2019) | 开放域 | 79,168 | 8,757 | 3,610 | 公开 |
| FEVER (Thorne et al., 2018) | KILT挑战赛 | 104,966 | 10,444 | 10,100 | 隐藏 |
| FM2 (Eisenschlos et al., 2021) | 官方划分 | 10,149 | 1,169 | 1,380 | 公开 |
| WoW (Dinan et al., 2019) | KILT挑战赛 | 63,734 | 3,054 | 2,944 | 隐藏 |

表 6: 数据集划分与统计信息。对于FEVER和WoW数据集,测试集标签为隐藏状态,模型性能评估请访问 https://ai.facebook.com/tools/kilt/

A.1 DATASETS AND SPLITS

A.1 数据集与划分

– TRIVIAQA (TQA) (Joshi et al., 2017) contains a set of trivia questions with answers that were originally scraped from trivia and quiz-league websites.

– TRIVIAQA (TQA) (Joshi et al., 2017) 包含一组从问答网站和智力竞赛联盟网站抓取的 trivia 问题及其答案。

– WEB QUESTIONS (WebQ) (Berant et al., 2013) consists of questions selected using Google Suggest API, where the answers are entities in Freebase.

– WEB QUESTIONS (WebQ) (Berant et al., 2013) 包含通过Google Suggest API筛选的问题,其答案为Freebase中的实体。

– NATURAL QUESTIONS (NQ) (Kwiatkowski et al., 2019) were mined from real Google search queries and the answers are spans in Wikipedia articles identified by human annotators.

– NATURAL QUESTIONS (NQ) (Kwiatkowski等人,2019) 是从真实的Google搜索查询中挖掘的,答案是由人类标注者在维基百科文章中标注的文本片段。

We explore the same train / dev / test splits for the open-domain QA setting as used by Izacard & Grave (2021); Karpukhin et al. (2020). For TriviaQA, GPT-3 / GLaM / PaLM (Brown et al., 2020; Du et al., 2022; Chowdhery et al., 2022) evaluate on the Wikipedia dev set of 7,993 examples, so we ran an additional evaluation on that dev set in order to compare with their performance.

我们采用了与Izacard & Grave (2021)和Karpukhin等人(2020)相同的开放域问答任务训练/开发/测试集划分方案。对于TriviaQA数据集,GPT-3/GLaM/PaLM(Brown等人,2020;Du等人,2022;Chowdhery等人,2022)在包含7,993个样本的维基百科开发集上进行评估,因此我们额外在该开发集上进行了评测以对比模型性能。

– FEVER (Thorne et al., 2018) is one of the largest fact checking datasets; it requires retrieving evidence from an external corpus to determine whether a statement is supported or refuted.

– FEVER (Thorne et al., 2018) 是事实核查领域最大的数据集之一,需要通过从外部语料库检索证据来判断某个陈述是被支持还是被反驳。

– FOOL ME TWICE (FM2) (Eisenschlos et al., 2021) is a challenging fact checking dataset collected through gamification. Players write challenging claims that are either entailed or refuted by evidence from Wikipedia, and are then tasked with spotting the refuted claim among a group.

– FOOL ME TWICE (FM2) (Eisenschlos et al., 2021) 是一个通过游戏化方式收集的具有挑战性的事实核查数据集。玩家根据维基百科的证据撰写可被证实或反驳的复杂声明,随后需在一组声明中识别出被反驳的声明。

– WIZARD OF WIKIPEDIA (WoW) (Dinan et al., 2019) is an open-domain dialogue task for training agents that can converse knowledgeably about open-domain topics. One speaker in the conversation must ground their utterances in a specific knowledge sentence from a Wikipedia page.

– WIZARD OF WIKIPEDIA (WoW) (Dinan et al., 2019) 是一个开放领域对话任务,旨在训练能够就开放领域话题进行知识性对话的智能体。对话中的一方必须将其发言基于维基百科页面中的特定知识句子。

We use the same train / dev / test splits as the KILT challenge (Petroni et al., 2021) for the FEVER and WoW datasets. Their test labels are hidden, so performance can only be evaluated through https://ai.facebook.com/tools/kilt. For FM2, we use its official dataset splits.

我们在FEVER和WoW数据集上采用了与KILT挑战赛 (Petroni et al., 2021) 相同的训练/开发/测试集划分。这些数据集的测试标签未公开,性能评估需通过https://ai.facebook.com/tools/kilt完成。对于FM2数据集,我们使用其官方划分方案。

A.2 IMPLEMENTATION DETAILS

A.2 实现细节

We use T5-770M (Raffel et al., 2020) and T5-3B as our backbone models to implement FiD (Izacard & Grave, 2021). We use AdamW as the optimizer with 2,000 warm-up steps, and set the dropout probability to 0.1 and the weight decay to 0.01. We run T5-770M on one A100 GPU with a batch size of 16, and T5-3B on 8 A100 GPUs with a per-GPU batch size of 2, for a total batch size of 16. We searched over different learning rates, ranging from 5e-6 to 4e-5, and found that 3e-5 to 6e-5 performed best under the T5-3B setting and 5e-5 to 1e-4 performed best under the T5-770M setting. We provide further per-dataset implementation details in Table 7.

我们采用T5-770M (Raffel等人,2020) 和T5-3B作为骨干模型来实现FiD (Izacard & Grave,2021)。优化器选用AdamW,预热步数为2000。dropout概率设为0.1,权重衰减设为0.01。运行T5-770M时使用单张A100显卡,批次大小设为16;运行T5-3B时使用8张A100显卡,每GPU批次设为2,总批次大小保持为16。我们测试了5e-6到4e-5范围内的不同学习率,发现T5-3B环境下3e-5至6e-5效果最佳,T5-770M环境下5e-5至1e-4表现最优。更多具体实现细节参见表7。
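The optimizer and scheduler setup described above can be sketched as follows. This is a minimal illustration with a single dummy training example, assuming a Hugging Face `t5-large` checkpoint stands in for the 770M reader; it is not the authors' actual FiD implementation.

```python
# Minimal sketch of the fine-tuning setup described above: AdamW with 2,000
# warm-up steps, dropout 0.1 and weight decay 0.01, using a T5 reader backbone.
# The single dummy example only illustrates one training step.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, get_linear_schedule_with_warmup

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large", dropout_rate=0.1)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
total_steps = 15_000  # e.g., the NQ setting reported in Table 7
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=total_steps
)

# One illustrative step: the reader conditions on "question + generated document".
inputs = tokenizer(
    "question: who wrote hamlet? context: Hamlet is a tragedy written by William Shakespeare ...",
    return_tensors="pt", truncation=True,
)
labels = tokenizer("William Shakespeare", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```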

We implement the other baseline methods using the following repositories:

我们通过以下代码库实现了其他基线方法:

– BM25: https://github.com/castorini/pyserini
– DPR: https://github.com/facebookresearch/DPR
– Contriever: https://github.com/facebookresearch/contriever

GENREAD (FiD-l):

| Settings / Datasets | NQ | TriviaQA | WebQ | FEVER | FM2 | WoW |
| --- | --- | --- | --- | --- | --- | --- |
| Peak learning rate | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 5e-5 |
| Total batch size | 64 | 64 | 64 | 64 | 64 | 16 |
| Total training steps | 15,000 | 10,000 | 10,000 | 10,000 | 10,000 | 20,000 |
| Best validation steps | 6,000 | 500 | 8,500 | 5,000 | 6,000 | 20,000 |
| Validation performance | 43.27 | 69.47 | 60.33 | 88.97 | 73.57 | 18.60 |

GENREAD (FiD-xl):

| Settings / Datasets | NQ | TriviaQA | WebQ | FEVER | FM2 | WoW |
| --- | --- | --- | --- | --- | --- | --- |
| Best validation = test | 43.50 | 70.22 | 53.33 | 87.25 | 74.21 | 18.49 |
| Peak learning rate | 5e-5 | 6e-5 | 3e-5 | 5e-5 | 5e-5 | 3e-5 |
| Total batch size | 16 | 16 | 16 | 16 | 16 | 8 |
| Total training steps | 20,000 | 15,000 | 15,000 | 15,000 | 15,000 | 20,000 |
| Best validation steps | 14,000 | 8,500 | 11,500 | 10,000 | 6,000 | 16,500 |

Table 7: Hyperparameter settings and validation performance for open-domain QA (numbers reported in Table 2), fact checking, and dialogue (numbers reported in Table 3). The upper part numbers are from GENREAD (FiD-l) and the lower part numbers are from GENREAD (FiD-xl).

GENREAD (FiD-l):

| 设置/数据集 | NQ | TriviaQA | WebQ | FEVER | FM2 | WoW |
| --- | --- | --- | --- | --- | --- | --- |
| 峰值学习率 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 5e-5 |
| 总批次大小 | 64 | 64 | 64 | 64 | 64 | 16 |
| 总训练步数 | 15,000 | 10,000 | 10,000 | 10,000 | 10,000 | 20,000 |
| 最佳验证步数 | 6,000 | 500 | 8,500 | 5,000 | 6,000 | 20,000 |
| 验证性能 | 43.27 | 69.47 | 60.33 | 88.97 | 73.57 | 18.60 |

GENREAD (FiD-xl):

| 设置/数据集 | NQ | TriviaQA | WebQ | FEVER | FM2 | WoW |
| --- | --- | --- | --- | --- | --- | --- |
| 最佳验证=测试 | 43.50 | 70.22 | 53.33 | 87.25 | 74.21 | 18.49 |
| 峰值学习率 | 5e-5 | 6e-5 | 3e-5 | 5e-5 | 5e-5 | 3e-5 |
| 总批次大小 | 16 | 16 | 16 | 16 | 16 | 8 |
| 总训练步数 | 20,000 | 15,000 | 15,000 | 15,000 | 15,000 | 20,000 |
| 最佳验证步数 | 14,000 | 8,500 | 11,500 | 10,000 | 6,000 | 16,500 |

表 7: 开放域问答 (表 2 中报告的数字)、事实核查和对话系统 (表 3 中报告的数字) 的超参数设置和验证性能。上半部分数字来自 GENREAD (FiD-l),下半部分数字来自 GENREAD (FiD-xl)。

We note that reproducing experiments through the OpenAI API, though publicly available, costs money. For this reason, we further add an evaluation on two open-source large language models, OPT (Zhang et al., 2022) and Codex (OpenAI, 2022). As shown in Table 8, OPT performed worse than InstructGPT but still achieved performance comparable to DPR; OpenAI Codex achieved the best performance on both TriviaQA and WebQ.

我们注意到,虽然OpenAI API是公开可用的,但在其上复现实验会产生费用。为此,我们进一步评估了两个开源大语言模型OPT (Zhang et al., 2022) 和Codex (OpenAI, 2022)。如表8所示,OPT的表现逊于InstructGPT,但仍与DPR相当;OpenAI Codex在TriviaQA和WebQ上均取得了最佳性能。

A.3 REPRODUCIBILITY VIA OPEN SOURCE LARGE LANGUAGE MODELS

| Documents obtained by | TriviaQA | WebQ |
| --- | --- | --- |
| DPR (Karpukhin et al., 2020) | 66.3 | 50.8 |
| OPT (Zhang et al., 2022) | 62.1 | 51.8 |
| InstructGPT (Ouyang et al., 2022) | 71.3 | 54.5 |
| Codex (OpenAI, 2022) | 72.6 | 55.4 |

Table 8: Exact match (EM) scores when using DPR and different large language models, such as OPT and Codex, to obtain contextual documents.

A.3 通过开源大语言模型实现可复现性

| 由以下方法获取的文档 | TriviaQA | WebQ |
| --- | --- | --- |
| DPR (Karpukhin et al., 2020) | 66.3 | 50.8 |
| OPT (Zhang et al., 2022) | 62.1 | 51.8 |
| InstructGPT (Ouyang et al., 2022) | 71.3 | 54.5 |
| Codex (OpenAI, 2022) | 72.6 | 55.4 |

表 8: 使用DPR和不同开源大语言模型(如OPT和Codex)生成上下文文档时的精确匹配(EM)分数。
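To illustrate the reproducibility point above, the following is a hedged sketch of the "generate a contextual document" step with an open-source causal language model from the Hugging Face hub. The checkpoint `facebook/opt-1.3b` is used only as a small stand-in for the OPT model evaluated in Table 8, and the prompt wording is illustrative rather than the paper's exact prompt.

```python
# Sketch: ask an open-source LM to generate a background document for a question.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # small stand-in; larger OPT checkpoints follow the same API
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "who wrote the opera carmen?"
prompt = (
    "Generate a background document to answer the given question.\n"
    f"Question: {question}\nDocument:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,   # roughly one short contextual passage
    do_sample=True,       # sampling encourages more diverse documents
    top_p=0.9,
    temperature=0.8,
)
# Keep only the newly generated tokens (the document), not the prompt.
document = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(document)  # this generated passage is then fed to the FiD reader
```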

A.4 SCALING WITH NUMBER OF LARGE LANGUAGE MODEL PARAMETERS

A.4 大语言模型参数规模的扩展

Figure 4 shows how performance scales with the size of the InstructGPT generator, including Ada-150M, Babbage-1.3B, Curie-6.7B and Davinci-175B. We note that for both FiD and our GENREAD, we use FiD-xl with 10 input documents, either retrieved from Wikipedia or generated by InstructGPT. Performance on both TriviaQA and WebQ continues to improve as the generator model grows, and so does the slope. Only with the largest InstructGPT can GENREAD outperform DPR-FiD. This indicates that using a large language model to generate contextual documents is an “emergent ability” of scale, which is not present in smaller models but only in sufficiently large language models (Wei et al., 2022a).

图 4 展示了 InstructGPT 生成器参数对性能的影响,包括 Ada-150M、Babbage-1.3B、Curie-6.7B 和 Davinci-175B。我们注意到,对于 FiD 和我们的 GENREAD,我们使用了 FiD-xl 并输入 10 篇文档,这些文档要么是从维基百科检索的,要么是由 InstructGPT 生成的。随着生成器模型参数的增加,TriviaQA 和 WebQ 的性能持续提升,斜率也随之增大。只有在使用最大规模的 InstructGPT 时,GENREAD 才能超越 DPR-FiD。这表明利用大语言模型生成上下文文档是一种随着规模增长而"涌现的能力" (Wei et al., 2022a),这种能力在较小模型中不存在,仅存在于较大的语言模型中。


Figure 4: Model performance with different sizes of InstructGPT as the context generator.

图 4: 不同规模InstructGPT作为上下文生成器的模型性能。

A.5 ADDITIONAL NUMBERS FOR TABLES IN THE MAIN PAPER

A.5 主论文中表格的补充数据

– Table 9 contains additional evaluation results for Table 1, showing zero-shot open-domain QA performance compared to recent large language models.

表 9 包含表 1 的额外评估结果,展示了与近期大语言模型相比的零样本开放域问答性能。

– Figure 5 contains additional retrieval performance evaluation for Figure 3, from experiments on combining DPR-retrieved documents and large language model generated documents.

图 5: 包含针对图 3 实验的额外检索性能评估,该实验结合了 DPR 检索文档与大语言模型生成文档。

– Table 10 contains additional retrieval performance, evaluated by Recall@K, for the baseline methods and different GENREAD variants. Some numbers in the table overlap with those in Figure 2. (A minimal sketch of the Recall@K computation follows this list.)

– 表 10: 包含基线方法和不同GENREAD变体通过Recall@K评估的额外检索性能。表中部分数据与图 2 存在重叠。
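As referenced above, the sketch below shows answer-string-match Recall@K as it is commonly computed for open-domain QA retrieval: a question counts as a hit if any of its top-K documents contains one of the gold answers after light normalization. We assume the Recall@K numbers in Table 10 follow this common convention; the exact normalization used is an assumption here, not taken from the paper.

```python
# Minimal sketch of answer-string-match Recall@K (common open-domain QA convention).
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)                    # drop articles
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())                                  # collapse whitespace

def recall_at_k(questions, k=10):
    """questions: list of dicts with 'answers' (list of str) and 'documents' (list of str,
    ranked by the retriever or generator)."""
    hits = 0
    for q in questions:
        top_k_text = normalize(" ".join(q["documents"][:k]))
        if any(normalize(ans) in top_k_text for ans in q["answers"]):
            hits += 1
    return 100.0 * hits / len(questions)

# Toy usage with a single hypothetical example.
data = [{"answers": ["Georges Bizet"],
         "documents": ["Carmen is an opera composed by Georges Bizet.", "Another passage."]}]
print(recall_at_k(data, k=10))  # -> 100.0
```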

Table 9: Additional numbers for Table 1. Zero-shot open-domain QA performance, compared to recent large language models. All models in the table do not leverage any external corpus for document retrieval. Compared to InstructGPT, our proposed GENREAD improves the EM score by +6.9 on average. GENREAD achieves state-of-the-art performance on the open test sets.

| Models | # total parameters | NQ open test | TriviaQA open test | TriviaQA wiki split | WebQ open test |
| --- | --- | --- | --- | --- | --- |
| GPT-3 (Brown et al., 2020) | 175B | 14.6 | 49.2 | 64.3 | 14.4 |
| Gopher (Rae et al., 2021) | 280B | 10.1 | 43.5 | 52.8 | – |
| FLAN (Wei et al., 2021) | 137B | 20.7 | 56.7 | 68.1 | – |
| GLaM (Du et al., 2022) | 64B | 21.5 | – | 68.0 | 19.0 |
| Chinchilla (Hoffmann et al., 2022) | 70B | 16.6 | 55.4 | 67.0 | – |
| PaLM (Chowdhery et al., 2022) | 540B | 21.2 | – | 76.9 | 10.9 |
| InstructGPT (Ouyang et al., 2022) | 175B | 19.5 | 57.4 | 68.5 | 19.9 |
| GENREAD (InstructGPT) | 175B | 28.2 | 59.3 | 70.3 | 24.8 |

表 9: 表 1 的补充数据。零样本开放域问答性能对比,与近期大语言模型的比较。表中所有模型均未使用外部语料库进行文档检索。相比 InstructGPT,我们提出的 GENREAD 平均可将 EM 分数提升 +6.9。GENREAD 在开放测试集上达到最先进性能。

| Models | # total parameters | NQ open test | TriviaQA open test | TriviaQA wiki split | WebQ open test |
| --- | --- | --- | --- | --- | --- |
| GPT-3 (Brown et al., 2020) | 175B | 14.6 | 49.2 | 64.3 | 14.4 |
| Gopher (Rae et al., 2021) | 280B | 10.1 | 43.5 | 52.8 | – |
| FLAN (Wei et al., 2021) | 137B | 20.7 | 56.7 | 68.1 | – |
| GLaM (Du et al., 2022) | 64B | 21.5 | – | 68.0 | 19.0 |
| Chinchilla (Hoffmann et al., 2022) | 70B | 16.6 | 55.4 | 67.0 | – |
| PaLM (Chowdhery et al., 2022) | 540B | 21.2 | – | 76.9 | 10.9 |
| InstructGPT (Ouyang et al., 2022) | 175B | 19.5 | 57.4 | 68.5 | 19.9 |
| GENREAD (InstructGPT) | 175B | 28.2 | 59.3 | 70.3 | 24.8 |
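Tables 8 and 9 report exact match (EM). The sketch below shows the standard SQuAD-style EM computation (lowercasing, stripping punctuation and articles, and matching against any gold answer); we assume the paper follows this common open-domain QA convention.

```python
# Minimal sketch of the exact match (EM) metric, assuming standard SQuAD-style normalization.
import re
import string

def normalize_answer(s):
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold_answers):
    # A prediction is correct if it matches ANY of the acceptable gold answers.
    return any(normalize_answer(prediction) == normalize_answer(g) for g in gold_answers)

def em_score(predictions, references):
    correct = sum(exact_match(p, golds) for p, golds in zip(predictions, references))
    return 100.0 * correct / len(predictions)

print(em_score(["the Eiffel Tower"], [["Eiffel Tower"]]))  # -> 100.0
```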


Figure 5: Additional retrieval performance evaluation for Figure 3, from experiments on combining DPR-retrieved documents and large language model generated documents. Merging documents from the two sources achieves significantly better performance than using DPR-retrieved documents alone.

图 5: 对图 3 实验的补充检索性能评估,结合 DPR 检索文档与大语言模型生成文档的效果。合并两种来源的文档比仅使用 DPR 检索文档获得了显著更好的性能。

| Models | TriviaQA R@1 | TriviaQA R@10 | TriviaQA R@20 | WebQ R@1 | WebQ R@10 | WebQ R@20 | NQ R@1 | NQ R@10 | NQ R@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 (Robertson et al., 2009) | 46.2 | 71.7 | 76.4 | 19.1 | 51.8 | 62.6 | 22.8 | 55.6 | 63.9 |
| Contriever (Izacard et al., 2022a) | 34.0 | 67.9 | 74.3 | 18.2 | 55.7 | 65.7 | 18.8 | 54.8 | 65.1 |