[论文翻译]语言模型是无监督多任务学习者


原文地址:https://github.com/dalinvip/Awesome-ChatGPT/blob/main/PDF/GPT%E7%B3%BB%E5%88%97/gpt2-language_models_are_unsupervised_multitask_learners.pdf


Language Models are Unsupervised Multitask Learners

语言模型是无监督多任务学习者

Alec Radford * 1 Jeffrey Wu * 1 Rewon Child 1 David Luan 1 Dario Amodei ** 1 Ilya Sutskever ** 1

Alec Radford * 1 Jeffrey Wu * 1 Rewon Child 1 David Luan 1 Dario Amodei ** 1 Ilya Sutskever ** 1

Abstract

摘要

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the $127{,}000+$ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

自然语言处理任务,如问答、机器翻译、阅读理解和摘要生成,通常通过在特定任务的数据集上进行监督学习来解决。我们证明,当在一个名为WebText的数百万网页的新数据集上训练时,语言模型开始在没有明确监督的情况下学习这些任务。当给定一个文档和问题时,语言模型生成的答案在CoQA数据集上达到了55 F1分数——在不使用超过127,000个训练样本的情况下,匹配或超过了4个基线系统中的3个。语言模型的容量对于零样本任务迁移的成功至关重要,增加容量可以在任务中以对数线性方式提高性能。我们最大的模型GPT-2是一个拥有15亿参数的Transformer,在零样本设置下,在8个测试的语言建模数据集中的7个上达到了最先进的结果,但仍然对WebText欠拟合。模型的样本反映了这些改进,并包含连贯的文本段落。这些发现为构建从自然发生的演示中学习任务的语言处理系统提供了一条有希望的路径。

1. Introduction

1. 引言

Machine learning systems now excel (in expectation) at tasks they are trained for by using a combination of large datasets, high-capacity models, and supervised learning (Krizhevsky et al., 2012) (Sutskever et al., 2014) (Amodei et al., 2016). Yet these systems are brittle and sensitive to slight changes in the data distribution (Recht et al., 2018) and task specification (Kirkpatrick et al., 2017). Current systems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one.

机器学习系统现在通过结合使用大规模数据集、高容量模型和监督学习,在它们被训练的任务上表现出色(Krizhevsky 等人,2012)(Sutskever 等人,2014)(Amodei 等人,2016)。然而,这些系统在面对数据分布的微小变化(Recht 等人,2018)和任务规范的调整(Kirkpatrick 等人,2017)时显得脆弱且敏感。当前的系统更适合被描述为狭窄的专家,而非胜任的通才。我们希望朝着更通用的系统发展,这些系统能够执行许多任务——最终无需为每个任务手动创建和标注训练数据集。

The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But the often erratic behavior of captioning models (Lake et al., 2017), reading comprehension systems (Jia & Liang, 2017), and image classifiers (Alcorn et al., 2018) on the diversity and variety of possible inputs highlights some of the shortcomings of this approach.

创建机器学习系统的主流方法是收集一个训练示例的数据集,展示所需任务的正确行为,训练系统模仿这些行为,然后在独立同分布(IID)的保留示例上测试其性能。这种方法在推动窄领域专家系统方面取得了良好进展。但字幕生成模型(Lake 等,2017)、阅读理解系统(Jia & Liang,2017)和图像分类器(Alcorn 等,2018)在多样性和各种可能输入上的表现常常不稳定,突显了这种方法的某些不足。

Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks. Recently, several benchmarks have been proposed such as GLUE (Wang et al., 2018) and decaNLP (McCann et al., 2018) to begin studying this.

我们怀疑,当前系统泛化能力不足的主要原因是在单一领域数据集上进行单一任务训练的普遍性。要在现有架构上实现鲁棒系统的进展,可能需要在广泛的领域和任务上进行训练和性能评估。最近,已经提出了几个基准测试,如 GLUE (Wang et al., 2018) 和 decaNLP (McCann et al., 2018),以开始研究这一问题。

Multitask learning (Caruana, 1997) is a promising framework for improving general performance. However, multitask training in NLP is still nascent. Recent work reports modest performance improvements (Yogatama et al., 2019) and the two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively (McCann et al., 2018) (Bowman et al., 2018). From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives. Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning.

多任务学习 (Caruana, 1997) 是一个有潜力提升整体性能的框架。然而,在自然语言处理 (NLP) 领域,多任务训练仍处于起步阶段。最近的研究报告了适度的性能提升 (Yogatama et al., 2019),而迄今为止最雄心勃勃的两项工作分别训练了总共 10 对和 17 对 (数据集, 目标) 组合 (McCann et al., 2018) (Bowman et al., 2018)。从元学习的角度来看,每个 (数据集, 目标) 组合都是从数据集和目标分布中采样的单个训练样本。当前的机器学习系统需要数百到数千个样本来诱导出泛化良好的函数。这表明,多任务训练可能需要同样多的有效训练对,才能在当前方法下实现其潜力。继续扩展数据集的创建和目标的设计,以当前技术强行达到所需程度,将非常困难。这促使我们探索更多用于执行多任务学习的设置。

The current best performing systems on language tasks utilize a combination of pre-training and supervised finetuning. This approach has a long history with a trend towards more flexible forms of transfer. First, word vectors were learned and used as inputs to task-specific architectures (Mikolov et al., 2013) (Collobert et al., 2011), then the contextual representations of recurrent networks were transferred (Dai & Le, 2015) (Peters et al., 2018), and recent work suggests that task-specific architectures are no longer necessary and transferring many self-attention blocks is sufficient (Radford et al., 2018) (Devlin et al., 2018).

当前在语言任务上表现最佳的系统结合了预训练和监督微调。这种方法历史悠久,趋势是向更灵活的迁移形式发展。首先,学习词向量并将其用作任务特定架构的输入 (Mikolov et al., 2013) (Collobert et al., 2011),然后迁移循环网络的上下文表示 (Dai & Le, 2015) (Peters et al., 2018),最近的研究表明,任务特定架构不再必要,迁移多个自注意力块就足够了 (Radford et al., 2018) (Devlin et al., 2018)。


Figure 1. Zero-shot task performance of WebText LMs as a function of model size on many NLP tasks. Reading Comprehension results are on CoQA (Reddy et al., 2018), translation on WMT-14 Fr-En (Artetxe et al., 2017), summarization on CNN and Daily Mail (See et al., 2017), and Question Answering on Natural Questions (Kwiatkowski et al., 2019). Section 3 contains detailed descriptions of each result.

图 1: WebText 大语言模型在不同 NLP 任务上的零样本任务性能随模型大小的变化。阅读理解结果基于 CoQA (Reddy et al., 2018),翻译结果基于 WMT-14 Fr-En (Artetxe et al., 2017),摘要结果基于 CNN 和 Daily Mail (See et al., 2017),问答结果基于 Natural Questions (Kwiatkowski et al., 2019)。第 3 节包含每个结果的详细描述。

These methods still require supervised training in order to perform a task. When only minimal or no supervised data is available, another line of work has demonstrated the promise of language models to perform specific tasks, such as commonsense reasoning (Schwartz et al., 2017) and sentiment analysis (Radford et al., 2017).

这些方法仍然需要监督训练来执行任务。当只有少量或没有监督数据可用时,另一项工作展示了语言模型在执行特定任务(如常识推理 (Schwartz et al., 2017) 和情感分析 (Radford et al., 2017))方面的潜力。

In this paper, we connect these two lines of work and continue the trend of more general methods of transfer. We demonstrate language models can perform down-stream tasks in a zero-shot setting – without any parameter or architecture modification. We demonstrate this approach shows potential by highlighting the ability of language models to perform a wide range of tasks in a zero-shot setting. We achieve promising, competitive, and state of the art results depending on the task.

在本文中,我们结合了这两条研究路线,并延续了更通用的迁移方法趋势。我们展示了语言模型能够在零样本设置下执行下游任务——无需任何参数或架构修改。我们通过强调语言模型在零样本设置下执行广泛任务的能力,展示了这种方法的潜力。根据任务的不同,我们取得了有前景、具有竞争力以及最先进的结果。

2. Approach

2. 方法

At the core of our approach is language modeling. Language modeling is usually framed as unsupervised distribution estimation from a set of examples $(x_{1},x_{2},...,x_{n})$ each composed of variable length sequences of symbols $\left(s_{1},s_{2},...,s_{n}\right)$ . Since language has a natural sequential ordering, it is common to factorize the joint probabilities over symbols as the product of conditional probabilities (Jelinek & Mercer, 1980) (Bengio et al., 2003):

我们方法的核心是语言建模。语言建模通常被定义为从一组示例 $(x_{1},x_{2},...,x_{n})$ 中进行无监督分布估计,每个示例由可变长度的符号序列 $\left(s_{1},s_{2},...,s_{n}\right)$ 组成。由于语言具有自然的顺序性,通常将符号的联合概率分解为条件概率的乘积 (Jelinek & Mercer, 1980) (Bengio et al., 2003):

$$
p(x)=\prod_{i=1}^{n}p(s_{i}|s_{1},...,s_{i-1})
$$

This approach allows for tractable sampling from and estimation of $p(x)$ as well as any conditionals of the form $p(s_{n-k},...,s_{n}|s_{1},...,s_{n-k-1})$ . In recent years, there have been significant improvements in the expressiveness of models that can compute these conditional probabilities, such as self-attention architectures like the Transformer (Vaswani et al., 2017).

这种方法允许从 $p(x)$ 以及任何形式为 $p(s_{n-k},...,s_{n}|s_{1},...,s_{n-k-1})$ 的条件概率中进行可处理的采样和估计。近年来,能够计算这些条件概率的模型的表达能力有了显著提升,例如 Transformer 等自注意力架构 (Vaswani et al., 2017)。
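上面的链式法则分解可以用一个极简的 Python 草图直观演示:把联合概率写成条件概率的乘积(实际实现中为避免下溢,通常在对数空间累加)。这里的 bigram 查表模型和其中的概率数值均为本示例的假设,真实的语言模型会以全部前文为条件。

```python
import math

# 假设的玩具 bigram 条件概率表,仅用于演示
# p(x) = prod_i p(s_i | s_1, ..., s_{i-1}) 的计算方式
BIGRAM = {
    ("<s>", "the"): 0.5, ("<s>", "a"): 0.5,
    ("the", "cat"): 0.4, ("the", "dog"): 0.6,
    ("a", "cat"): 0.7, ("a", "dog"): 0.3,
}

def cond_prob(prev, sym):
    # 真实 LM 以全部前文为条件;bigram 只是最简化的近似
    return BIGRAM.get((prev, sym), 1e-8)

def sequence_log_prob(symbols):
    # 在对数空间累加各个条件概率,数值上更稳定
    prev = "<s>"
    total = 0.0
    for s in symbols:
        total += math.log(cond_prob(prev, s))
        prev = s
    return total

p = math.exp(sequence_log_prob(["the", "cat"]))  # 0.5 * 0.4 = 0.2
```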

Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution $p(output|input)$. Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model $p(output|input, task)$. This has been variously formalized in multitask and meta-learning settings. Task conditioning is often implemented at an architectural level, such as the task specific encoders and decoders in (Kaiser et al., 2017) or at an algorithmic level such as the inner and outer loop optimization framework of MAML (Finn et al., 2017). But as exemplified in McCann et al. (2018), language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer). McCann et al. (2018) demonstrated it was possible to train a single model, the MQAN, to infer and perform many different tasks on examples with this type of format.

学习执行单个任务可以在概率框架中表示为估计条件分布 $p(output|input)$。由于一个通用系统应该能够执行许多不同的任务,即使对于相同的输入,它不仅应该基于输入进行条件化,还应该基于要执行的任务进行条件化。也就是说,它应该建模 $p(output|input, task)$。这在多任务和元学习设置中已经以各种形式进行了形式化。任务条件化通常在架构级别实现,例如 (Kaiser et al., 2017) 中的任务特定编码器和解码器,或者在算法级别实现,例如 MAML (Finn et al., 2017) 的内外循环优化框架。但正如 McCann et al. (2018) 所展示的那样,语言提供了一种灵活的方式来将任务、输入和输出都指定为符号序列。例如,一个翻译训练示例可以写为序列 (translate to french, english text, french text)。同样,一个阅读理解训练示例可以写为 (answer the question, document, question, answer)。McCann et al. (2018) 证明了可以训练一个单一模型 MQAN,以推断并执行具有这种格式示例的许多不同任务。
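上述"把任务、输入、输出统一写成符号序列"的思路可以用几行 Python 草图说明。注意:具体的拼接与分隔方式是本示例的假设,论文并未规定固定格式。

```python
# 把 (任务, 输入, ..., 输出) 统一序列化为一条符号序列的示意,
# 对应 McCann et al. (2018) 风格的任务表示;分隔方式为本示例假设。
def format_example(task, *fields):
    return " ".join((task,) + fields)

# 翻译示例:(translate to french, english text, french text)
translation = format_example("translate to french", "english text", "french text")

# 阅读理解示例:(answer the question, document, question, answer)
qa = format_example("answer the question", "document", "question", "answer")
```

在这种表示下,"执行任务"与"预测序列的后续符号"是同一件事,这正是语言模型无需架构改动即可进行任务条件化的原因。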

Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be predicted. Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective. In this slightly toy setting, the concerns with density estimation as a principled training objective discussed in (Sutskever et al., 2015) are sidestepped. The problem instead becomes whether we are able to, in practice, optimize the unsupervised objective to convergence. Preliminary experiments confirmed that sufficiently large language models are able to perform multitask learning in this toy-ish setup but learning is much slower than in explicitly supervised approaches.

语言建模原则上也能够学习 McCann 等人 (2018) 的任务,而无需明确监督哪些符号是需要预测的输出。由于监督目标与非监督目标相同,但仅在序列的子集上进行评估,因此非监督目标的全局最小值也是监督目标的全局最小值。在这种稍微简化的设置中,(Sutskever 等人, 2015) 中讨论的关于密度估计作为原则性训练目标的担忧被回避了。问题反而变成了我们是否能够在实践中优化非监督目标以达到收敛。初步实验证实,足够大的语言模型能够在这种简化的设置中进行多任务学习,但学习速度比显式监督方法慢得多。

While it is a large step from the well-posed setup described above to the messiness of “language in the wild”, Weston (2016) argues, in the context of dialog, for the need to develop systems capable of learning from natural language directly and demonstrated a proof of concept – learning a QA task without a reward signal by using forward prediction of a teacher’s outputs. While dialog is an attractive approach, we worry it is overly restrictive. The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.

虽然从上述定义明确的情境到“现实世界中的语言”的复杂性是一个巨大的跨越,但 Weston (2016) 在对话的背景下提出,需要开发能够直接从自然语言中学习的系统,并展示了一个概念验证——通过预测教师的输出来学习问答任务,而无需奖励信号。尽管对话是一种有吸引力的方法,但我们担心它过于局限。互联网包含了大量无需交互通信即可被动获取的信息。我们推测,具有足够能力的大语言模型将开始学习推断并执行自然语言序列中展示的任务,以便更好地预测它们,无论这些任务是如何获取的。如果大语言模型能够做到这一点,它实际上就是在进行无监督的多任务学习。我们通过分析大语言模型在零样本设置下在各种任务上的表现来测试这一点。

2.1. Training Dataset

2.1. 训练数据集

Most prior work trained language models on a single domain of text, such as news articles (Jozefowicz et al., 2016), Wikipedia (Merity et al., 2016), or fiction books (Kiros et al., 2015). Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied of domains and contexts as possible.

大多数先前的工作在单一领域的文本上训练语言模型,例如新闻文章 (Jozefowicz et al., 2016)、维基百科 (Merity et al., 2016) 或小说书籍 (Kiros et al., 2015)。我们的方法鼓励构建尽可能大且多样化的数据集,以便在尽可能多的领域和上下文中收集任务的自然语言演示。

A promising source of diverse and nearly unlimited text is web scrapes such as Common Crawl. While these archives are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues. Trinh & Le (2018) used Common Crawl in their work on commonsense reasoning but noted a large amount of documents “whose content are mostly unintelligible”. We observed similar data issues in our initial experiments with Common Crawl. Trinh & Le (2018)’s best results were achieved using a small subsample of Common Crawl which included only documents most similar to their target dataset, the Winograd Schema Challenge. While this is a pragmatic approach to improve performance on a specific task, we want to avoid making assumptions about the tasks to be performed ahead of time.

一个多样且几乎无限的文本来源是诸如 Common Crawl 的网络抓取数据。虽然这些存档比当前的语言建模数据集大许多数量级,但它们存在显著的数据质量问题。Trinh & Le (2018) 在他们的常识推理工作中使用了 Common Crawl,但指出大量文档“内容大多难以理解”。我们在对 Common Crawl 的初步实验中也观察到了类似的数据问题。Trinh & Le (2018) 的最佳结果是通过使用 Common Crawl 的一个小子样本实现的,该子样本仅包含与他们的目标数据集 Winograd Schema Challenge 最相似的文档。虽然这是一种提高特定任务性能的实用方法,但我们希望避免提前对要执行的任务做出假设。

“I’m not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I’m not a fool].”

“我不是世界上最聪明的人,但就像法语里说的:Je ne suis pas un imbecile [我不是傻瓜]。”

“Brevet Sans Garantie Du Gouvernement”, translated to English: “Patented without government warranty”.

“Brevet Sans Garantie Du Gouvernement”,翻译为英文:“Patented without government warranty”。

Table 1. Examples of naturally occurring demonstrations of English to French and French to English translation found throughout the WebText training set.

表 1. 在 WebText 训练集中发现的英语到法语和法语到英语翻译的自然示例(即上方两例)。

Instead, we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

相反,我们创建了一个新的网页抓取,强调文档质量。为此,我们只抓取了经过人工筛选/过滤的网页。手动过滤整个网页抓取将非常昂贵,因此作为起点,我们从社交媒体平台 Reddit 抓取了所有获得至少 3 个 karma 的外链。这可以被视为其他用户是否认为该链接有趣、有教育意义或只是有趣的一种启发式指标。
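这一"karma ≥ 3 的外链"启发式筛选可以用一个极简的 Python 草图表示。其中 posts 的数据及字段名均为本示例的假设,并非 WebText 实际抓取管线的实现。

```python
# WebText 链接筛选启发式的示意:只保留获得至少 3 karma 的 Reddit 外链。
# posts 的内容与字段名为本示例假设。
KARMA_THRESHOLD = 3

posts = [
    {"url": "https://example.com/a", "karma": 5},
    {"url": "https://example.com/b", "karma": 1},
    {"url": "https://example.com/c", "karma": 3},
]

# karma 在此充当"其他用户是否觉得该链接有价值"的代理信号
kept = [p["url"] for p in posts if p["karma"] >= KARMA_THRESHOLD]
```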

The resulting dataset, WebText, contains the text subset of these 45 million links. To extract the text from HTML responses we use a combination of the Dragnet (Peters & Lecocq, 2013) and Newspaper content extractors. All results presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of $40\ \mathrm{GB}$ of text. We removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks.

生成的 WebText 数据集包含了这 4500 万个链接的文本子集。为了从 HTML 响应中提取文本,我们结合使用了 Dragnet (Peters & Lecocq, 2013) 和 Newspaper 内容提取器。本文中展示的所有结果均使用了 WebText 的初步版本,该版本不包含 2017 年 12 月之后创建的链接,并且在去重和基于启发式的清理后,包含了略多于 800 万份文档,总计 $40\ \mathrm{GB}$ 的文本。我们从 WebText 中移除了所有维基百科文档,因为它是其他数据集的常见数据源,并且由于训练数据与测试评估任务的重叠,可能会使分析复杂化。

2.2. Input Representation

2.2. 输入表示

A general language model (LM) should be able to compute the probability of (and also generate) any string. Current large scale LMs include pre-processing steps such as lowercasing, tokenization, and out-of-vocabulary tokens which restrict the space of model-able strings. While processing Unicode strings as a sequence of UTF-8 bytes elegantly fulfills this requirement as exemplified in work such as Gillick et al. (2015), current byte-level LMs are not competitive with word-level LMs on large scale datasets such as the One Billion Word Benchmark (Al-Rfou et al., 2018). We observed a similar performance gap in our own attempts to train standard byte-level LMs on WebText.

通用语言模型 (LM) 应该能够计算(并生成)任何字符串的概率。当前的大规模语言模型包括预处理步骤,如小写化、Token化和处理词汇表外的Token,这些步骤限制了模型可处理的字符串范围。虽然将Unicode字符串作为UTF-8字节序列处理可以优雅地满足这一要求,如Gillick等人 (2015) 的工作所示,但当前的字节级语言模型在大规模数据集(如One Billion Word Benchmark (Al-Rfou等人, 2018))上无法与词级语言模型竞争。我们在尝试在WebText上训练标准字节级语言模型时也观察到了类似的性能差距。

Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences. Despite its name, reference BPE implementations often operate on Unicode code points and not byte sequences. These implementations would require including the full space of Unicode symbols in order to model all Unicode strings. This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256. However, directly applying BPE to the byte sequence results in suboptimal merges due to BPE using a greedy frequency based heuristic for building the token vocabulary. We observed BPE including many versions of common words like dog since they occur in many variations such as dog. dog! dog? This results in a sub-optimal allocation of limited vocabulary slots and model capacity. To avoid this, we prevent BPE from merging across character categories for any byte sequence. We add an exception for spaces which significantly improves the compression efficiency while adding only minimal fragmentation of words across multiple vocab tokens.

字节对编码 (Byte Pair Encoding, BPE) (Sennrich et al., 2015) 是一种介于字符级和词级语言建模之间的实用方法,它有效地在频繁符号序列的词级输入和不频繁符号序列的字符级输入之间进行插值。尽管其名为字节对编码,但参考的 BPE 实现通常操作的是 Unicode 码点而非字节序列。这些实现需要包含完整的 Unicode 符号空间才能对所有 Unicode 字符串进行建模。这会导致在添加任何多符号 Token 之前,基础词汇表的大小超过 130,000。与通常使用 BPE 的 32,000 到 64,000 Token 词汇表相比,这显然过大。相比之下,字节级的 BPE 只需要一个大小为 256 的基础词汇表。然而,由于 BPE 使用基于贪婪频率的启发式方法来构建 Token 词汇表,直接将其应用于字节序列会导致次优的合并。我们观察到 BPE 包含了许多常见单词的多个版本,例如 dog,因为它们以多种变体出现,如 dog.、dog!、dog?。这导致有限词汇表槽位和模型容量的次优分配。为了避免这种情况,我们阻止 BPE 在任何字节序列中跨字符类别进行合并。我们为空格添加了一个例外,这显著提高了压缩效率,同时只带来极少的单词被拆分到多个词表 Token 中的碎片化。
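"禁止跨字符类别合并"这一限制可以在 BPE 的合并统计步骤中直观演示:统计相邻符号对频次时,跳过类别不同(如字母与标点)的组合,于是 "dog!" 这类合并不会被学到。下面是一个极简草图;类别判断规则和数据均为本示例的假设,并非 GPT-2 的完整实现。

```python
from collections import Counter

def category(sym):
    # 本示例的简化类别划分:字母 vs 其他(标点、数字等)
    return "letter" if sym.isalpha() else "other"

def count_mergeable_pairs(words):
    # words: {"以空格分隔的符号序列": 频次},BPE 每轮合并频次最高的符号对;
    # 这里在统计时就排除跨类别的符号对
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            if category(a) == category(b):  # 只允许同类别合并
                pairs[(a, b)] += freq
    return pairs

# "dog" 的多种变体:若不加限制,("g", "!") 等也会成为候选合并
words = {"d o g": 10, "d o g !": 3, "d o g ?": 2}
pairs = count_mergeable_pairs(words)
```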

This input representation allows us to combine the empirical benefits of word-level LMs with the generality of byte-level approaches. Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.

这种输入表示方式使我们能够将词级大语言模型的经验优势与字节级方法的通用性结合起来。由于我们的方法可以为任何 Unicode 字符串分配概率,因此我们可以在任何数据集上评估大语言模型,而无需考虑预处理、Token 化或词汇表大小。

2.3. Model

2.3. 模型

We use a Transformer (Vaswani et al., 2017) based architecture for our LMs. The model largely follows the details of the OpenAI GPT model (Radford et al., 2018) with a few modifications. Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016), and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of $1/\sqrt{N}$ where $N$ is the number of residual layers. The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batch size of 512 is used.

我们使用基于 Transformer (Vaswani et al., 2017) 的架构来构建我们的语言模型。该模型主要遵循 OpenAI GPT 模型 (Radford et al., 2018) 的细节,并进行了少量修改。层归一化 (Ba et al., 2016) 被移动到每个子块的输入处,类似于预激活残差网络 (He et al., 2016),并且在最终的自注意力块之后添加了额外的层归一化。我们使用了一种修改后的初始化方法,该方法考虑了模型深度对残差路径的累积影响。我们在初始化时将残差层的权重缩放为 $1/\sqrt{N}$,其中 $N$ 是残差层的数量。词汇表扩展到 50,257。我们还将上下文大小从 512 个 token 增加到 1024 个 token,并使用更大的批量大小 512。

Table 2. Architecture hyper parameters for the 4 model sizes.

表 2: 4 种模型规模的架构超参数。
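上述 $1/\sqrt{N}$ 初始化缩放可以用一个极简的 Python 草图表示。这里用嵌套列表代替真实张量,std=0.02 等具体数值以及函数接口均为本示例的假设,仅演示缩放因子本身的作用方式。

```python
import math
import random

def init_residual_weight(fan_in, fan_out, n_residual_layers, std=0.02):
    # 残差层权重按 1/sqrt(N) 缩放(N 为残差层数),
    # 以抵消残差路径上随深度累积的方差
    scale = 1.0 / math.sqrt(n_residual_layers)
    return [[random.gauss(0.0, std) * scale for _ in range(fan_out)]
            for _ in range(fan_in)]

# 假设每个 Transformer 块含注意力与 MLP 两个残差层,
# 则 48 层的模型共有 2 * 48 = 96 个残差层
w = init_residual_weight(4, 4, n_residual_layers=2 * 48)
```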

3. Experiments

3. 实验

We trained and benchmarked four LMs with approximately log-uniformly spaced sizes. The architectures are summarized in Table 2. The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al., 2018). Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT. The learning rate of each model was manually tuned for the best perplexity on a $5\%$ held-out sample of WebText. All models still underfit WebText and held-out perplexity has as of yet improved given more training time.

我们训练并基准测试了四个大小近似对数均匀分布的大语言模型。架构总结如表 2 所示。最小的模型相当于原始的 GPT,第二小的模型相当于 BERT (Devlin et al., 2018) 中最大的模型。我们最大的模型称为 GPT-2,其参数量比 GPT 多一个数量级。每个模型的学习率都经过手动调整,以在 WebText 的 $5\%$ 保留样本上获得最佳困惑度。所有模型仍然对 WebText 欠拟合,且随着训练时间的增加,保留样本的困惑度仍在改善。

3.1. Language Modeling

3.1. 语言建模

As an initial step towards zero-shot task transfer, we are interested in understanding how WebText LMs perform at zero-shot domain transfer on the primary task they are trained for – language modeling. Since our model operates on a byte level and does not require lossy pre-processing or tokenization, we can evaluate it on any language model benchmark. Results on language modeling datasets are commonly reported in a quantity which is a scaled or exponentiated version of the average negative log probability per canonical prediction unit - usually a character, a byte, or a word. We evaluate the same quantity by computing the log-probability of a dataset according to a WebText LM and dividing by the number of canonical units. For many of these datasets, WebText LMs would be tested significantly out-of-distribution, having to predict aggressively standardized text, tokenization artifacts such as disconnected punctuation and contractions, shuffled sentences, and even the string <UNK> which is extremely rare in WebText - occurring only 26 times in 40 billion bytes. We report our main results in Table 3 using invertible de-tokenizers which remove as many of these tokenization / pre-processing artifacts as possible. Since these de-tokenizers are invertible, we can still calculate the log probability of a dataset and they can be thought of as a simple form of domain adaptation. We observe gains of 2.5 to 5 perplexity for GPT-2 with these de-tokenizers.

作为零样本任务迁移的初步尝试,我们感兴趣的是了解 WebText 大语言模型在零样本领域迁移上的表现,尤其是在其主要训练任务——语言建模上的表现。由于我们的模型在字节级别上运行,不需要有损预处理或 Token 化,因此我们可以在任何语言模型基准上对其进行评估。语言建模数据集的结果通常以每个标准预测单元(通常是一个字符、一个字节或一个词)的平均负对数概率的缩放或指数版本的形式报告。我们通过根据 WebText 大语言模型计算数据集的对数概率,并将其除以标准单元的数量,来评估相同的量。对于许多这些数据集,WebText 大语言模型将在显著偏离分布的情况下进行测试,必须预测高度标准化的文本、Token 化伪影(如断开的标点符号和缩写)、打乱的句子,甚至是字符串 <UNK>,这在 WebText 中极为罕见——在 400 亿字节中仅出现 26 次。我们在表 3 中使用可逆的去 Token 化器报告了主要结果,这些去 Token 化器尽可能多地移除了这些 Token 化/预处理伪影。由于这些去 Token 化器是可逆的,我们仍然可以计算数据集的对数概率,并且它们可以被视为一种简单的领域适应形式。使用这些去 Token 化器后,我们观察到 GPT-2 的困惑度改善(降低)了 2.5 到 5。
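上文所说的"平均负对数概率的缩放或指数版本"可以直接写成代码:困惑度 (PPL) 是平均负对数概率的指数,每字节比特数 (BPB) 则换算到以 2 为底。下面是一个极简草图,其中的数值输入为假设的示例值。

```python
import math

def perplexity(total_log_prob, num_units):
    # total_log_prob:模型给整个数据集的对数概率(自然对数,为负值)
    # num_units:标准预测单元数(字符、字节或词)
    return math.exp(-total_log_prob / num_units)

def bits_per_byte(total_log_prob, num_bytes):
    # 同一量换算为每字节比特数:除以 ln(2) 把 nat 换成 bit
    return -total_log_prob / (num_bytes * math.log(2))

# 假设值:数据集共 500 个单元,总对数概率为 -2000 nat
ppl = perplexity(total_log_prob=-2000.0, num_units=500)   # e^4
bpb = bits_per_byte(total_log_prob=-2000.0, num_bytes=500)
```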

Table 3. Zero-shot results on many datasets. No training or fine-tuning was performed for any of these results. PTB and WikiText-2 results are from (Gong et al., 2018). CBT results are from (Bajgar et al., 2016). LAMBADA accuracy result is from (Hoang et al., 2018) and LAMBADA perplexity result is from (Grave et al., 2016). Other results are from (Dai et al., 2019).

|  | LAMBADA (PPL) | LAMBADA (ACC) | CBT-CN (ACC) | CBT-NE (ACC) | WikiText2 (PPL) | PTB (PPL) | enwik8 (BPB) | text8 (BPC) | WikiText103 (PPL) | 1BW (PPL) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SOTA | 99.8 | 59.23 | 85.7 | 82.3 | 39.14 | 46.54 | 0.99 | 1.08 | 18.3 | 21.8 |
| 117M | 35.13 | 45.99 | 87.65 | 83.4 | 29.41 | 65.85 | 1.16 | 1.17 | 37.50 | 75.20 |
| 345M | 15.60 | 55.48 | 92.35 | 87.1 | 22.76 | 47.33 | 1.01 | 1.06 | 26.37 | 55.72 |
| 762M | 10.87 | 60.12 | 93.45 | 88.0 | 19.93 | 40.31 | 0.97 | 1.02 | 22.05 | 44.575 |
| 1542M | 8.63 | 63.24 | 93.30 | 89.05 | 18.34 | 35.76 | 0.93 | 0.98 | 17.48 | 42.16 |

表 3: 多个数据集上的零样本结果。这些结果均未进行任何训练或微调。PTB 和 WikiText-2 的结果来自 (Gong et al., 2018)。CBT 的结果来自 (Bajgar et al., 2016)。LAMBADA 准确率结果来自 (Hoang et al., 2018),LAMBADA 困惑度结果来自 (Grave et al., 2016)。其他结果来自 (Dai et al., 2019)。

WebText LMs transfer well across domains and datasets, improving the state of the art on 7 out of the 8 datasets in a zero-shot setting. Large improvements are noticed on small datasets such as Penn Treebank and WikiText-2 which have only 1 to 2 million training tokens. Large improvements are also noticed on datasets created to measure long-term dependencies like LAMBADA (Paperno et al., 2016) and the Children’s Book Test (Hill et al., 2015). Our model is still significantly worse than prior work on the One Billion Word Benchmark (Chelba et al., 2013). This is likely due to a combination of it being both the largest dataset and having some of the most destructive pre-processing - 1BW’s sentence level shuffling removes all long-range structure.

WebText 语言模型在不同领域和数据集上表现出色,在零样本设置下,8 个数据集中有 7 个达到了当前最佳水平。在小型数据集如 Penn Treebank 和 WikiText-2 上(这些数据集仅有 100 万到 200 万的训练 token),模型表现显著提升。在用于衡量长期依赖性的数据集上,如 LAMBADA (Paperno et al., 2016) 和儿童图书测试 (Hill et al., 2015),模型也有显著提升。然而,我们的模型在 One Billion Word Benchmark (Chelba et al., 2013) 上的表现仍显著落后于之前的工作。这可能是由于该数据集规模最大,并且进行了最具破坏性的预处理——1BW 的句子级别打乱破坏了所有长程结构。


3.2. Children’s Book Test

Figure 2. Performance on the Children’s Book Test as a function of model capacity. Human performance is from Bajgar et al. (2016), instead of the much lower estimates from the original paper.


3.2. 儿童图书测试
图 2: 儿童图书测试性能随模型容量的变化。人类表现数据来自 Bajgar 等人 (2016),而非原论文中较低估计值。

The Children’s Book Test (CBT) (Hill et al., 2015) was created to examine the performance of LMs on different categories of words: named entities, nouns, verbs, and prepositions. Rather than reporting perplexity as an evaluation metric, CBT reports accuracy on an automatically constructed cloze test where the task is to predict which of 10 possible choices for an omitted word is correct. Following the LM approach introduced in the original paper, we compute the probability of each choice and the rest of the sentence conditioned on this choice according to the LM, and predict the one with the highest probability. As seen in Figure 2, performance steadily improves as model size is increased and closes the majority of the gap to human performance on this test. Data overlap analysis showed one of the CBT test set books, The Jungle Book by Rudyard Kipling, is in WebText, so we report results on the validation set which has no significant overlap. GPT-2 achieves new state of the art results of $93.3\%$ on common nouns and $89.1\%$ on named entities. A de-tokenizer was applied to remove PTB style tokenization artifacts from CBT.

儿童图书测试 (CBT) (Hill et al., 2015) 旨在检验大语言模型在不同类别词汇上的表现:命名实体、名词、动词和介词。CBT 不报告困惑度作为评估指标,而是报告在自动构建的完形填空测试中的准确率,任务是从 10 个可能的选项中选择被省略的正确单词。按照原始论文中引入的大语言模型方法,我们根据模型计算每个选项及以该选项为条件的句子其余部分的概率,并预测概率最高的选项。如图 2 所示,随着模型规模的增加,性能稳步提升,并缩小了与人类在该测试中表现的大部分差距。数据重叠分析显示,CBT 测试集中的一本书,Rudyard Kipling 的《丛林之书》,存在于 WebText 中,因此我们报告了在验证集上的结果,该验证集没有显著重叠。GPT-2 在普通名词上取得了 $93.3\%$ 的最新成果,在命名实体上取得了 $89.1\%$ 的成果。应用了去 Token 化器以去除 CBT 中的 PTB 风格 Token 化伪影。
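上述 CBT 打分流程(把每个候选词代入空缺,用语言模型给整句打分,取最高者)可以写成一个极简草图。其中 score_fn 是假设的语言模型打分接口,toy_score 只是演示用的占位函数。

```python
def predict_cloze(context, choices, score_fn):
    # 对每个候选词,把它代入空缺后用 LM 计算整句的(对数)概率,
    # 预测得分最高的候选词
    scored = [(score_fn(context.replace("_", c)), c) for c in choices]
    return max(scored)[1]

def toy_score(sentence):
    # 占位打分函数:仅用于演示流程,真实实现应调用语言模型
    return 0.0 if "cat" in sentence else -1.0

best = predict_cloze("the _ sat on the mat", ["cat", "democracy"], toy_score)
```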

3.3. LAMBADA

3.3. LAMBADA

The LAMBADA dataset (Paperno et al., 2016) tests the ability of systems to model long-range dependencies in text. The task is to predict the final word of sentences which require at least 50 tokens of context for a human to successfully predict. GPT-2 improves the state of the art from 99.8 (Grave et al., 2016) to 8.6 perplexity and increases the accuracy of LMs on this test from $19\%$ (Dehghani et al., 2018) to $52.66\%$. Investigating GPT-2’s errors showed most predictions are valid continuations of the sentence, but are not valid final words. This suggests that the LM is not using the additional useful constraint that the word must be the final of the sentence. Adding a stop-word filter as an approximation to this further increases accuracy to $63.24\%$, improving the overall state of the art on this task by $4\%$. The previous state of the art (Hoang et al., 2018) used a different restricted prediction setting where the outputs of the model were constrained to only words that appeared in the context. For GPT-2, this restriction is harmful rather than helpful since $19\%$ of answers are not in context. We use a version of the dataset without preprocessing.

LAMBADA 数据集 (Paperno et al., 2016) 测试系统在文本中建模长距离依赖关系的能力。任务要求预测句子的最后一个词,这些句子需要至少 50 个 Token 的上下文才能让人成功预测。GPT-2 将最先进水平从 99.8 (Grave et al., 2016) 提高到 8.6 的困惑度,并将大语言模型在此测试中的准确率从 $19\%$ (Dehghani et al., 2018) 提高到 $52.66\%$。对 GPT-2 的错误进行分析表明,大多数预测是句子的有效延续,但不是有效的最后一个词。这表明大语言模型没有利用到该词必须是句子最后一个词这一额外有用的约束。添加一个停用词过滤器作为近似,进一步将准确率提高到 $63.24\%$,使该任务的整体最先进水平提高了 $4\%$。之前的最先进水平 (Hoang et al., 2018) 使用了不同的受限预测设置,其中模型的输出被限制为仅出现在上下文中的词。对于 GPT-2 来说,这种限制是有害的,因为 $19\%$ 的答案不在上下文中。我们使用未经预处理的版本的数据集。
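文中的停用词过滤器思路可以用一个极简草图表示:从模型的候选预测中排除停用词,作为"该词必须是句子末词"这一约束的近似。候选词、分数以及停用词表均为本示例的假设数据。

```python
# 假设的停用词表(末词极少是这类功能词)
STOP_WORDS = {"the", "of", "and", "a", "to", "in"}

def pick_final_word(candidates):
    # candidates: [(词, 模型分数)] 列表;
    # 过滤掉停用词后,取分数最高的候选作为句子末词
    filtered = [(score, word) for word, score in candidates
                if word not in STOP_WORDS]
    return max(filtered)[1]

# 即使 "the" 和 "of" 得分更高,过滤后仍选出 "window"
word = pick_final_word([("the", 0.9), ("window", 0.6), ("of", 0.8)])
```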


3.4. Winograd Schema Challenge

Figure 3. Performance on the Winograd Schema Challenge as a function of model capacity.

3.4. Winograd Schema Challenge

图 3: Winograd Schema Challenge 上模型性能随模型容量的变化。

The Winograd Schema challenge (Levesque et al., 2012) was constructed to measure the capability of a system to perform commonsense reasoning by measuring its ability to resolve ambiguities in text. Recently Trinh & Le (2018) demonstrated significant progress on this challenge using LMs, by predicting the resolution of the ambiguity with higher probability. We follow their problem formulation and visualize the performance of our models with both full and partial scoring techniques in Figure 3. GPT-2 improves state of the art accuracy by $7\%$, achieving $70.70\%$. The dataset is quite small with only 273 examples so we recommend reading Trichelair et al. (2018) to help contextualize this result.

Winograd Schema挑战(Levesque等人,2012)旨在通过测量系统解决文本中歧义的能力来评估其进行常识推理的能力。最近,Trinh & Le(2018)通过使用大语言模型(LMs)以更高的概率预测歧义的解决,展示了在这一挑战上的显著进展。我们遵循他们的问题表述,并在图3中展示了我们模型在完整和部分评分技术下的性能。GPT-2将最先进的准确率提高了7%,达到了70.70%。该数据集非常小,只有273个示例,因此我们建议阅读Trichelair等人(2018)的研究,以帮助理解这一结果。
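The LM-scoring idea of Trinh & Le (2018) referenced above can be sketched as follows. The unigram log-probability table is a made-up stand-in for a real language model, included only to make the mechanism concrete; "partial scoring" means scoring only the tokens after the substitution point.

```python
# Toy sketch: substitute each candidate referent for the ambiguous pronoun
# and keep the substitution the "LM" assigns higher probability.
# TOY_LOGPROB is invented for illustration; a real system would query an LM.
TOY_LOGPROB = {"the": -1.0, "trophy": -4.0, "suitcase": -4.5,
               "was": -1.5, "too": -2.0, "big": -3.0}

def sequence_logprob(tokens, start=0):
    # start > 0 gives "partial" scoring: only tokens from `start` onward count
    return sum(TOY_LOGPROB.get(t, -8.0) for t in tokens[start:])

def resolve(template, candidates, start=0):
    """template: token list containing the placeholder 'PRONOUN'."""
    def filled(candidate):
        return [candidate if t == "PRONOUN" else t for t in template]
    return max(candidates, key=lambda c: sequence_logprob(filled(c), start))

print(resolve(["the", "PRONOUN", "was", "too", "big"],
              ["trophy", "suitcase"]))  # -> trophy under the toy scores
```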

3.5. Reading Comprehension

3.5. 阅读理解

The Conversation Question Answering dataset (CoQA) Reddy et al. (2018) consists of documents from 7 different domains paired with natural language dialogues between a question asker and a question answerer about the document. CoQA tests reading comprehension capabilities and also the ability of models to answer questions that depend on conversation history (such as “Why?”).

对话问答数据集 (CoQA) Reddy 等人 (2018) 由来自 7 个不同领域的文档组成,这些文档与提问者和回答者之间关于文档的自然语言对话配对。CoQA 测试阅读理解能力以及模型回答依赖于对话历史的问题(例如“为什么?”)的能力。

Greedy decoding from GPT-2 when conditioned on a document, the history of the associated conversation, and a final token A: achieves 55 F1 on the development set. This matches or exceeds the performance of 3 out of 4 baseline systems without using the $127{,}000+$ manually collected question answer pairs those baselines were trained on. The supervised SOTA, a BERT based system (Devlin et al.,

当基于文档、相关对话历史记录和最后一个 Token A: 进行 GPT-2 的贪婪解码时,在开发集上达到了 55 F1 的分数。这一表现与 4 个基线系统中的 3 个相当或更好,且未使用这些基线系统训练时所依赖的超过 127,000 条手动收集的问答对。有监督的 SOTA 是一个基于 BERT 的系统 (Devlin et al.,

Table 4. Summarization performance as measured by ROUGE F1 metrics on the CNN and Daily Mail dataset. Bottom-Up Sum is the SOTA model from (Gehrmann et al., 2018).

| | R-1 | R-2 | R-L | R-AVG |
|---|---|---|---|---|
| Bottom-Up Sum | 41.22 | 18.68 | 38.34 | 32.75 |
| Lede-3 | 40.38 | 17.66 | 36.62 | 31.55 |
| Seq2Seq + Attn | 31.33 | 11.81 | 28.83 | 23.99 |
| GPT-2 TL;DR: | 29.34 | 8.27 | 26.58 | 21.40 |
| Random-3 | 28.78 | 8.63 | 25.52 | 20.98 |
| GPT-2 no hint | 21.58 | 4.03 | 19.47 | 15.03 |

表 4. 在 CNN 和 Daily Mail 数据集上通过 ROUGE F1 指标衡量的摘要性能。Bottom-Up Sum 是来自 (Gehrmann et al., 2018) 的 SOTA 模型。

2018), is nearing the 89 F1 performance of humans. While GPT-2’s performance is exciting for a system without any supervised training, some inspection of its answers and errors suggests GPT-2 often uses simple retrieval based heuristics such as answer with a name from the document in response to a who question.

2018),正接近人类 89 F1 的水平。尽管 GPT-2 在没有任何监督训练的情况下表现令人兴奋,但对其答案和错误的一些检查表明,GPT-2 经常使用简单的基于检索的启发式方法,例如在回答“who”问题时用文档中的名字作答。
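The conditioning described above, a document followed by the conversation so far and a final A: cue from which the model decodes greedily, might be rendered roughly as below. The exact Q:/A: formatting is a hypothetical choice for illustration; the paper only specifies that the context ends with a final token A:.

```python
def build_coqa_prompt(document, history, question):
    """Assemble the conditioning text: the document, the dialogue so far as
    Q:/A: turns, and a final 'A:' cue for greedy decoding.
    The Q:/A: rendering is an assumption, not taken verbatim from the paper."""
    lines = [document]
    for past_q, past_a in history:
        lines += [f"Q: {past_q}", f"A: {past_a}"]
    lines += [f"Q: {question}", "A:"]
    return "\n".join(lines)

prompt = build_coqa_prompt(
    "The town library opened in 1901.",
    [("When did it open?", "In 1901.")],
    "Why?")
print(prompt)
```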

3.6. Summarization

3.6. 摘要生成

We test GPT-2’s ability to perform summarization on the CNN and Daily Mail dataset (Nallapati et al., 2016). To induce summarization behavior we add the text TL;DR: after the article and generate 100 tokens with Top-$k$ random sampling (Fan et al., 2018) with $k=2$ which reduces repetition and encourages more abstractive summaries than greedy decoding. We use the first 3 generated sentences in these 100 tokens as the summary. While qualitatively the generations resemble summaries, as shown in Table 14, they often focus on recent content from the article or confuse specific details such as how many cars were involved in a crash or whether a logo was on a hat or shirt. On the commonly reported ROUGE 1,2,L metrics the generated summaries only begin to approach the performance of classic neural baselines and just barely outperform selecting 3 random sentences from the article. GPT-2’s performance drops by 6.4 points on the aggregate metric when the task hint is removed which demonstrates the ability to invoke task specific behavior in a language model with natural language.

我们测试了 GPT-2 在 CNN 和 Daily Mail 数据集 (Nallapati et al., 2016) 上进行摘要生成的能力。为了引导摘要行为,我们在文章后添加文本 TL;DR:,并使用 Top $k$ 随机采样 (Fan et al., 2018) 生成 100 个 Token,其中 $k=2$,这减少了重复并鼓励生成比贪婪解码更抽象的摘要。我们使用这 100 个 Token 中生成的前 3 个句子作为摘要。虽然生成的摘要从质量上类似于摘要,如表 14 所示,但它们通常关注文章中的近期内容,或者混淆了具体细节,例如车祸中涉及多少辆车,或者标志是在帽子还是衬衫上。在常用的 ROUGE 1,2,L 指标上,生成的摘要仅开始接近经典神经基线的性能,并且仅略微优于从文章中随机选择 3 个句子的表现。当移除任务提示时,GPT-2 在综合指标上的表现下降了 6.4 分,这表明了通过自然语言在语言模型中调用特定任务行为的能力。
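Top-$k$ truncated sampling, as used above with $k=2$, can be sketched as a single decoding step. The tiny vocabulary and logit values are invented for illustration; a real decoder would take the LM's logits at each step and append the sampled token to the context.

```python
import math
import random

def top_k_sample(logits, k=2, rng=random):
    """logits: dict mapping token -> unnormalized score. Keep only the k
    highest-scoring tokens, renormalize their softmax mass, sample among them."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    weights = [math.exp(score) for _, score in top]
    r, acc = rng.random() * sum(weights), 0.0
    for (token, _), w in zip(top, weights):
        acc += w
        if r <= acc:
            return token
    return top[-1][0]  # guard against floating-point rounding

rng = random.Random(0)
logits = {"cat": 2.0, "sat": 1.5, "the": 0.1}
samples = {top_k_sample(logits, k=2, rng=rng) for _ in range(50)}
print(samples)  # only ever contains "cat" and "sat"; "the" is truncated away
```

With $k=2$ the tail of the distribution is cut off entirely, which is what curbs the repetition greedy decoding tends to produce.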

3.7. Translation

3.7. 翻译

We test whether GPT-2 has begun to learn how to translate from one language to another. In order to help it infer that this is the desired task, we condition the language model on a context of example pairs of the format english sentence $=$ french sentence and then after a final prompt of english sentence $=$ we sample from the model with greedy decoding and use the first generated sentence as the translation. On the WMT-14 English-French test set, GPT-2 gets 5 BLEU, which is slightly worse than a word-by-word substitution with a bilingual lexicon inferred in previous work on unsupervised word translation (Conneau et al., 2017b). On the WMT-14 French-English test set, GPT-2 is able to leverage its very strong English language model to perform significantly better, achieving 11.5 BLEU. This outperforms several unsupervised machine translation baselines from (Artetxe et al., 2017) and (Lample et al., 2017) but is still much worse than the 33.5 BLEU of the current best unsupervised machine translation approach (Artetxe et al., 2019). Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step. In order to confirm this, we ran a byte-level language detector on WebText which detected only 10MB of data in the French language which is approximately $500\mathrm{x}$ smaller than the monolingual French corpus common in prior unsupervised machine translation research.

我们测试了 GPT-2 是否已经开始学习如何从一种语言翻译到另一种语言。为了帮助模型推断出这是期望的任务,我们以格式为“英语句子 $=$ 法语句子”的示例对作为上下文条件,然后在最后的提示“英语句子 $=$ ”之后,使用贪婪解码从模型中采样,并将生成的第一个句子作为翻译。在 WMT-14 英法测试集上,GPT-2 获得了 5 BLEU 分,略低于在无监督词翻译研究中推断出的双语词典逐词替换的结果 (Conneau et al., 2017b)。在 WMT-14 法英测试集上,GPT-2 能够利用其非常强大的英语语言模型表现显著更好,获得了 11.5 BLEU 分。这优于 (Artetxe et al., 2017) 和 (Lample et al., 2017) 中的几个无监督机器翻译基线,但仍远低于当前最佳无监督机器翻译方法 (Artetxe et al., 2019) 的 33.5 BLEU 分。这一任务的表现令我们感到惊讶,因为我们特意从 WebText 中移除了非英语网页作为过滤步骤。为了确认这一点,我们在 WebText 上运行了一个字节级语言检测器,检测到仅有 10MB 的法语数据,这比之前无监督机器翻译研究中常见的单语法语语料库小约 $500\mathrm{x}$。
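The few-shot conditioning format described above can be sketched as a small prompt builder. The example pairs are placeholders and the exact spacing around the `=` separator is an assumption; the paper specifies only the "english sentence = french sentence" pattern ending with a final "english sentence =" prompt.

```python
def build_translation_prompt(example_pairs, source_sentence):
    """Render the few-shot context in the 'english sentence = french sentence'
    format, ending with 'source =' so the model completes the French side."""
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)

print(build_translation_prompt(
    [("hello", "bonjour"), ("thank you", "merci")],
    "good night"))
```

The first sentence the model generates after the trailing `=` is then taken as the translation.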

Table 5. The 30 most confident answers generated by GPT-2 on the development set of Natural Questions sorted by their probability according to GPT-2. None of these questions appear in WebText according to the procedure described in Section 4.

表 5. GPT-2 在 Natural Questions 开发集上生成的 30 个最自信的答案,按 GPT-2 的概率排序。根据第 4 节描述的程序,这些问题均未出现在 WebText 中。

| 问题 | 生成的答案 | 正确性 | 概率 |
|---|---|---|---|
| Who wrote the book the origin of species? | Charles Darwin | | 83.4% |
| Who is the founder of the ubuntu project? | Mark Shuttleworth | | 82.0% |
| Who is the quarterback for the green bay packers? | Aaron Rodgers | ✓ | 81.1% |
| Panda is a national animal of which country? | China | | 76.8% |
| Who came up with the theory of relativity? | Albert Einstein | | 76.4% |
| When was the first star wars film released? | 1977 | ✓ | 71.4% |
| What is the most common blood type in sweden? | A | | 70.6% |
| Who is regarded as the founder of psychoanalysis? | Sigmund Freud | | 69.3% |
| Who took the first steps on the moon in 1969? | Neil Armstrong | | 66.8% |
| Who is the largest supermarket chain in the uk? | Tesco | | 65.3% |
| What is the meaning of shalom in english? | peace | | 64.0% |
| Who was the author of the art of war? | Sun Tzu | | 59.6% |
| Largest state in the us by land mass? | California | | 59.2% |
| Green algae is an example of which type of reproduction? | parthenogenesis | | 56.5% |
| Vikramsamvat calender is official in which country? | India | | 55.6% |
| Who is mostly responsible for writing the declaration of independence? | Thomas Jefferson | ✓ | 53.3% |
| What us state forms the western boundary of montana? | Montana | | 52.3% |
| Who plays ser davos in game of thrones? | Peter Dinklage | | 52.1% |
| Who appoints the chair of the federal reserve system? | Janet Yellen | | 51.5% |
| State the process that divides one nucleus into two genetically identical nuclei? | mitosis | | 50.7% |
| Who won the most mvp awards in the nba? | Michael Jordan | | 50.2% |
| What river is associated with the city of rome? | the Tiber | | 48.6% |
| Who is the first president to be impeached? | Andrew Johnson | | 48.3% |
| Who is the head of the department of homeland security 2017? | John Kelly | | 47.0% |
| What is the name given to the common currency to the european union? | Euro | | 46.8% |
| What was the emperor name in star wars? | Palpatine | | 46.5% |
| Do you have to have a gun permit to shoot at a range? | No | | 46.4% |
| Who proposed evolution in 1859 as the basis of biological development? | Charles Darwin | | 45.7% |
| Nuclear power plant that blew up in russia? | Chernobyl | | 45.7% |
| Who played john connor in the original terminator? | Arnold Schwarzenegger | | 45.2% |

3.8. Question Answering

3.8. 问答

A potential way to test what information is contained within a language model is to evaluate how often it generates the correct answer to factoid-style questions. Previous showcasing of this behavior in neural systems where all information is stored in parameters such as A Neural Conversational Model (Vinyals & Le, 2015) reported qualitative results due to the lack of high-quality evaluation datasets. The recently introduced Natural Questions dataset (Kwiatkowski et al.,

测试语言模型中包含哪些信息的一种潜在方法是评估其对事实类问题生成正确答案的频率。之前在神经网络系统中展示这种行为时,由于缺乏高质量评估数据集,例如《A Neural Conversational Model》(Vinyals & Le, 2015) 报告了定性结果。最近引入的 Natural Questions 数据集 (Kwiatkowski et al.,

2019) is a promising resource to test this more quantitatively. Similar to translation, the context of the language model is seeded with example question answer pairs which helps the model infer the short answer style of the dataset. GPT-2 answers $4.1\%$ of questions correctly when evaluated by the exact match metric commonly used on reading comprehension datasets like SQuAD. As a comparison point, the smallest model does not exceed the $1.0\%$ accuracy of an incredibly simple baseline which returns the most common answer for each question type (who, what, where, etc...). GPT-2 answers 5.3 times more questions correctly, suggesting that model capacity has been a major factor in the poor performance of neural systems on this kind of task as of yet. The probability GPT-2 assigns to its generated answers is well calibrated and GPT-2 has an accuracy of $63.1\%$ on the $1\%$ of questions it is most confident in. The 30 most confident answers generated by GPT-2 on development set questions are shown in Table 5. The performance of GPT-2 is still much, much worse than the 30 to $50\%$ range of open domain question answering systems which hybridize information retrieval with extractive document question answering (Alberti et al., 2019).

2019) 是一个有前景的资源,可以更定量地测试这一点。与翻译类似,大语言模型的上下文被植入了示例问答对,这有助于模型推断数据集的简短答案风格。在阅读理解数据集(如 SQuAD)常用的精确匹配指标下,GPT-2 正确回答了 $4.1\%$ 的问题。作为对比,最小的模型没有超过一个极其简单的基线的 $1.0\%$ 准确率,该基线返回每种问题类型(谁、什么、哪里等)的最常见答案。GPT-2 正确回答的问题数量是其 5.3 倍,这表明模型容量是迄今为止神经网络系统在此类任务上表现不佳的主要因素。GPT-2 为其生成的答案分配的概率校准良好,并且在其最有信心的 $1\%$ 问题上,GPT-2 的准确率为 $63.1\%$。表 5 展示了 GPT-2 在开发集问题上生成的 30 个最有信心的答案。GPT-2 的表现仍然远远低于结合信息检索与抽取式文档问答的开放域问答系统的 30 到 $50\%$ 范围 (Alberti et al., 2019)。
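The exact-match metric referenced above can be sketched as below, assuming it refers to the standard SQuAD-style answer normalization (lower-casing and stripping punctuation, articles, and extra whitespace before comparison); that interpretation is an assumption on our part.

```python
import re
import string

PUNCT = set(string.punctuation)

def normalize_answer(text):
    """SQuAD-style normalization: lower-case, drop punctuation and the
    articles a/an/the, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return normalize_answer(prediction) == normalize_answer(gold)

print(exact_match("The Tiber", "tiber"))             # True
print(exact_match("Janet Yellen", "Jerome Powell"))  # False
```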

Table 6. Percentage of test set 8-grams overlapping with training sets.

表 6. 测试集 8-gram 与训练集重叠的百分比。

| | PTB | WikiText-2 | enwik8 | text8 | WikiText-103 | 1BW |
|---|---|---|---|---|---|---|
| Dataset train | 2.67% | 0.66% | 7.50% | 2.34% | 9.09% | 13.19% |
| WebText train | 0.88% | 1.63% | 6.31% | 3.94% | 2.42% | 3.75% |

4. Generalization vs Memorization

4. 泛化与记忆

Recent work in computer vision has shown that common image datasets contain a non-trivial amount of near-duplicate images. For instance CIFAR-10 has $3.3\%$ overlap between train and test images (Barz & Denzler, 2019). This results in an over-reporting of the generalization performance of machine learning systems. As the size of datasets increases this issue becomes increasingly likely which suggests a similar phenomenon could be happening with WebText. Therefore it is important to analyze how much test data also shows up in the training data.

最近的计算机视觉研究表明,常见的图像数据集中包含相当数量的近似重复图像。例如,CIFAR-10 的训练集和测试集之间有 3.3% 的重叠 (Barz & Denzler, 2019)。这导致机器学习系统的泛化性能被高估。随着数据集规模的增加,这一问题变得愈发可能,这表明 WebText 可能也存在类似现象。因此,分析测试数据中有多少也出现在训练数据中是非常重要的。

To study this we created Bloom filters containing 8-grams of WebText training set tokens. To improve recall, strings were normalized to contain only lower-cased alphanumeric words with a single space as a delimiter. The Bloom filters were constructed such that the false positive rate is upper bounded by $\textstyle{\frac{1}{10^{8}}}$ . We further verified the low false positive rate by generating 1M strings, of which zero were found by the filter.

为了研究这一点,我们创建了包含 WebText 训练集 Token 的 8-gram 的布隆过滤器 (Bloom filter)。为了提高召回率,字符串被归一化为仅包含小写字母数字单词,并以单个空格作为分隔符。布隆过滤器的构建使得误报率的上限为 $\textstyle{\frac{1}{10^{8}}}$。我们通过生成 100 万个字符串进一步验证了低误报率,其中过滤器未发现任何误报。
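The procedure described above can be sketched end to end with a toy Bloom filter: normalize text to lower-cased alphanumeric words separated by single spaces, insert every 8-gram of the training text, then measure the fraction of a test text's 8-grams the filter contains. The filter size and hash count below are small illustrative choices, not the paper's, whose filters were sized so the false positive rate is bounded by $\frac{1}{10^{8}}$.

```python
import hashlib

M_BITS, N_HASHES = 1 << 20, 7  # illustrative parameters, not the paper's

def normalize(text):
    # lower-cased alphanumeric words, single-space delimited
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()

def ngrams(words, n=8):
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def bit_positions(gram):
    # derive N_HASHES positions from salted SHA-256 digests
    for i in range(N_HASHES):
        digest = hashlib.sha256(f"{i}:{gram}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M_BITS

class BloomFilter:
    def __init__(self):
        self.bits = bytearray(M_BITS // 8)

    def add(self, gram):
        for p in bit_positions(gram):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, gram):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in bit_positions(gram))

def overlap_rate(train_text, test_text):
    """Fraction of test-set 8-grams that (probably) occur in the training set."""
    bloom = BloomFilter()
    for gram in ngrams(normalize(train_text)):
        bloom.add(gram)
    test_grams = ngrams(normalize(test_text))
    return sum(g in bloom for g in test_grams) / max(len(test_grams), 1)
```

A Bloom filter can return false positives but never false negatives, so this estimate of overlap errs slightly upward, which is why bounding the false positive rate matters.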

These Bloom filters let us calculate, given a dataset, the percentage of 8-grams from that dataset that are also found in the WebText training set. Table 6 shows this overlap analysis for the test sets of common LM benchmarks. Common LM datasets’ test sets have between $1\!-\!6\%$ overlap with WebText train, with an average overlap of $3.2\%$. Somewhat surprisingly, many datasets have larger overlaps with their own training splits, with an average of $5.9\%$ overlap.

这些布隆过滤器使我们能够计算,给定一个数据集,其中在 WebText 训练集中也出现的 8-gram 的百分比。表 6 展示了常见大语言模型基准测试集的这种重叠分析。常见大语言模型数据集的测试集与 WebText 训练集的重叠率在 $1\!-\!6\%$ 之间,平均重叠率为 $3.2\%$。有些令人惊讶的是,许多数据集与其自身的训练集有更大的重叠,平均重叠率为 $5.9\%$。

Our approach optimizes for recall, and while manual inspection of the overlaps shows many common phrases, there are many longer matches that are due to duplicated data. This is not unique to WebText. For instance, we discovered that the test set of WikiText-103 has an article which is also in the training dataset. Since there are only 60 articles in the test set there is at least an overlap of $1.6\%$. Potentially more worryingly, 1BW has an overlap of nearly $13.2\%$ with its own training set according to our procedure.

我们的方法优化了召回率,虽然手动检查重叠部分显示了许多常见短语,但也有许多较长的匹配是由于数据重复造成的。这种情况并非 WebText 独有。例如,我们发现 WikiText-103 的测试集中有一篇文章也出现在训练数据集中。由于测试集中只有 60 篇文章,因此至少有 $1.6\%$ 的重叠。更令人担忧的是,根据我们的方法,1BW 与其训练集的重叠率接近 $13.2\%$。

For the Winograd Schema Challenge, we found only 10 schemata which had any 8-gram overlaps with the WebText training set. Of these, 2 were spurious matches. Of the remaining 8, only 1 schema appeared in any contexts that gave away the answer.

对于 Winograd Schema Challenge,我们发现只有 10 个模式与 WebText 训练集存在 8-gram 重叠。其中 2 个是虚假匹配。在剩下的 8 个中,只有 1 个模式出现在泄露答案的上下文中。

For CoQA, about $15\%$ of documents in the news domain are already in WebText and the model performs about 3 F1 better on these. CoQA’s development set metric reports the average performance over 5 different domains and we measure a gain of about 0.5-1.0 F1 due to overlap across the various domains. However, no actual training questions or answers are in WebText since CoQA was released after the cutoff date for links in WebText.

对于 CoQA,新闻领域约 15% 的文档已经存在于 WebText 中,模型在这些文档上的表现大约提高了 3 F1。CoQA 的开发集指标报告了 5 个不同领域的平均表现,我们测量到由于各领域之间的重叠,F1 值提高了约 0.5-1.0。然而,WebText 中并未包含实际的训练问题或答案,因为 CoQA 是在 WebText 链接截止日期之后发布的。

On LAMBADA, the average overlap is $1.2\%$. GPT-2 performs about 2 perplexity better on examples with greater than $15\%$ overlap. Recalculating metrics when excluding all examples with any overlap shifts results from 8.6 to 8.7 perplexity and reduces accuracy from $63.2\%$ to $62.9\%$. This very small change in overall results is likely due to only 1 in 200 examples having significant overlap.

在LAMBADA数据集上,平均重叠率为 $1.2\%$。GPT-2 在重叠率大于 $15\%$ 的样本上表现更好,困惑度降低了约 2。当排除所有有重叠的样本后重新计算指标,困惑度从 8.6 变为 8.7,准确率从 $63.2\%$ 降至 $62.9\%$。整体结果的微小变化可能是由于仅有 1/200 的样本存在显著重叠。

Overall, our analysis suggests that data overlap between WebText training data and specific evaluation datasets provides a small but consistent benefit to reported results. However, for most datasets we do not notice significantly larger overlaps than those already existing between standard training and test sets, as Table 6 highlights.

总体而言,我们的分析表明,WebText训练数据与特定评估数据集之间的数据重叠对报告结果提供了微小但一致的益处。然而,对于大多数数据集,我们并未注意到比标准训练集和测试集之间已经存在的重叠显著更大的重叠,如表6所示。

Understanding and quantifying how highly similar text impacts performance is an important research question. Better de-duplication techniques such as scalable fuzzy matching could also help better answer these questions. For now, we recommend the use of n-gram overlap based de-duplication as an important verification step and sanity check during the creation of training and test splits for new NLP datasets.

理解和量化高度相似文本如何影响性能是一个重要的研究问题。更好的去重技术,如可扩展的模糊匹配,也可以帮助更好地回答这些问题。目前,我们建议在创建新 NLP 数据集的训练和测试分割时,使用基于 n-gram 重叠的去重作为重要的验证步骤和合理性检查。

Another potential way of determining whether the performance of WebText LMs is attributable to memorization is inspecting their performance on their own held-out set. As shown in Figure 4, performance on both the training and test sets of WebText are similar and improve together as model size is increased. This suggests even GPT-2 is still underfitting WebText in many ways.

另一种确定 WebText 大语言模型性能是否归因于记忆的潜在方法是检查它们在自己保留集上的表现。如图 4 所示,WebText 的训练集和测试集上的表现相似,并且随着模型规模的增加而一起提升。这表明即使 GPT-2 在许多方面仍然对 WebText 欠拟合。

GPT-2 is also able to write news articles about the discovery of talking unicorns. An example is provided in Table 13.

GPT-2 也能够撰写关于发现会说话的独角兽的新闻文章。表 13 中提供了一个示例。

5. Related Work

5. 相关工作

A significant portion of this work measured the performance of larger language models trained on larger datasets. This is similar to the work of Jozefowicz et al. (2016) which scaled RNN based language models on the 1 Billion Word Benchmark. Bajgar et al. (2016) also previously improved results on the Children’s Book Test by creating a much larger training dataset out of Project Gutenberg to supplement the standard training dataset. Hestness et al. (2017) conducted a thorough analysis of how the performance of various deep learning models changes as a function of both model capacity and dataset size. Our experiments, while much noisier across tasks, suggest similar trends hold for sub-tasks of an objective and continue into the 1B+ parameter regime.

本工作的一个重要部分是测量在更大数据集上训练的更大语言模型的性能。这与 Jozefowicz 等人 (2016) 的工作类似,他们在 10 亿词基准上扩展了基于 RNN 的语言模型。Bajgar 等人 (2016) 之前也通过从古腾堡计划中创建更大的训练数据集来补充标准训练数据集,从而改进了儿童图书测试的结果。Hestness 等人 (2017) 对各种深度学习模型的性能如何随模型容量和数据集大小的变化进行了深入分析。我们的实验虽然在任务之间噪声更大,但表明类似的趋势在目标的子任务中仍然存在,并延续到 1B+ 参数范围内。


Figure 4. The performance of LMs trained on WebText as a function of model size.

图 4: 在 WebText 上训练的语言模型性能随模型大小的变化。

Interesting learned functionality in generative models has been documented before, such as the cells in an RNN language model performing line-width tracking and quote/comment detection (Karpathy et al., 2015). More inspirational to our work was the observation of Liu et al. (2018) that a model trained to generate Wikipedia articles also learned to translate names between languages.

在生成式模型中有趣的学习功能之前已被记录,例如RNN语言模型中的单元执行行宽跟踪和引号/评论检测 [Karpathy et al., 2015]。对我们的工作更具启发性的是 [Liu et al., 2018] 的观察,即一个训练用于生成维基百科文章的模型也学会了在不同语言之间翻译名称。

Previous work has explored alternative approaches to filtering and constructing a large text corpus of web pages, such as the iWeb Corpus (Davies, 2018).

先前的工作已经探索了过滤和构建网页大型文本语料库的替代方法,例如 iWeb 语料库 (Davies, 2018)。

There has been extensive work on pre-training methods for language tasks. In addition to those mentioned in the introduction, GloVe (Pennington et al., 2014) scaled word vector representation learning to all of Common Crawl. An influential early work on deep representation learning for text was Skip-thought Vectors (Kiros et al., 2015). McCann et al. (2017) explored the use of representations derived from machine translation models and Howard & Ruder (2018)

在语言任务的预训练方法方面已有大量研究。除了引言中提到的那些,GloVe (Pennington et al., 2014) 将词向量表示学习扩展到整个 Common Crawl。文本深度表示学习的一个有影响力的早期工作是 Skip-thought Vectors (Kiros et al., 2015)。McCann et al. (2017) 探索了从机器翻译模型中提取表示的使用,而 Howard & Ruder (2018)

improved the RNN based fine-tuning approaches of (Dai & Le, 2015). (Conneau et al., 2017a) studied the transfer performance of representations learned by natural language inference models and (Subramanian et al., 2018) explored large-scale multitask training.

改进了基于RNN的微调方法 (Dai & Le, 2015)。(Conneau et al., 2017a) 研究了自然语言推理模型学习到的表示的迁移性能,(Subramanian et al., 2018) 则探索了大规模多任务训练。

(Ramachandran et al., 2016) demonstrated that seq2seq models benefit from being initialized with pre-trained language models as encoders and decoders. More recent work has shown that LM pre-training is helpful when fine-tuned for difficult generation tasks like chit-chat dialog and dialog based question answering systems as well (Wolf et al., 2019) (Dinan et al., 2018).

(Ramachandran et al., 2016) 证明了使用预训练语言模型作为编码器和解码器初始化的序列到序列模型能够从中受益。最近的研究表明,在微调用于困难生成任务(如闲聊对话和基于对话的问答系统)时,语言模型预训练同样有帮助 (Wolf et al., 2019) (Dinan et al., 2018)。

6. Discussion

6. 讨论

Much research has been dedicated to learning (Hill et al., 2016), understanding (Levy & Goldberg, 2014), and critically evaluating (Wieting & Kiela, 2019) the representations of both supervised and unsupervised pre-training methods. Our results suggest that unsupervised task learning is an additional promising area of research to explore. These findings potentially help explain the widespread success of pre-training techniques for down-stream NLP tasks as we show that, in the limit, one of these pre-training techniques begins to learn to perform tasks directly without the need for supervised adaption or modification.

大量研究致力于学习 (Hill et al., 2016)、理解 (Levy & Goldberg, 2014) 以及批判性评估 (Wieting & Kiela, 2019) 有监督和无监督预训练方法的表示。我们的结果表明,无监督任务学习是另一个值得探索的有前景的研究领域。这些发现可能有助于解释预训练技术在下游 NLP 任务中广泛成功的原因,因为我们表明,在极限情况下,这些预训练技术之一开始直接学习执行任务,而无需有监督的适应或修改。

On reading comprehension the performance of GPT-2 is competitive with supervised baselines in a zero-shot setting. However, on other tasks such as summarization, while it is qualitatively performing the task, its performance is still only rudimentary according to quantitative metrics. While suggestive as a research result, in terms of practical applications, the zero-shot performance of GPT-2 is still far from usable.

在阅读理解任务中,GPT-2 在零样本设置下的表现与有监督的基线模型相当。然而,在其他任务(如摘要生成)中,尽管 GPT-2 在定性上能够完成任务,但根据定量指标,其表现仍然较为基础。虽然作为研究结果具有一定的启发性,但在实际应用中,GPT-2 的零样本性能仍然远未达到可用水平。

We have studied the zero-shot performance of WebText LMs on many canonical NLP tasks, but there are many additional tasks that could be evaluated. There are undoubtedly many practical tasks where the performance of GPT-2 is still no better than random. Even on common tasks that we evaluated on, such as question answering and translation, language models only begin to outperform trivial baselines when they have sufficient capacity.

我们研究了 WebText 大语言模型在许多经典 NLP 任务上的零样本表现,但还有许多其他任务可以评估。毫无疑问,在许多实际任务中,GPT-2 的表现仍然不比随机好。即使在我们评估的常见任务上,如问答和翻译,语言模型只有在具备足够能力时才开始优于简单基线。

While zero-shot performance establishes a baseline of the potential performance of GPT-2 on many tasks, it is not clear where the ceiling is with finetuning. On some tasks, GPT-2’s fully abstractive output is a significant departure from the extractive pointer network (Vinyals et al., 2015) based outputs which are currently state of the art on many question answering and reading comprehension datasets. Given the prior success of fine-tuning GPT, we plan to investigate fine-tuning on benchmarks such as decaNLP and GLUE, especially since it is unclear whether the additional training data and capacity of GPT-2 is sufficient to overcome the inefficiencies of uni-directional representations demonstrated by BERT (Devlin et al., 2018).

虽然零样本性能为 GPT-2 在许多任务上的潜在性能建立了基线,但尚不清楚通过微调能达到的上限在哪里。在某些任务上,GPT-2 的完全抽象输出与目前在许多问答和阅读理解数据集上最先进的基于提取指针网络 (Vinyals et al., 2015) 的输出有显著差异。鉴于之前微调 GPT 的成功,我们计划在 decaNLP 和 GLUE 等基准上进行微调研究,特别是因为尚不清楚 GPT-2 的额外训练数据和容量是否足以克服 BERT (Devlin et al., 2018) 所展示的单向表示的低效性。

7. Conclusion

7. 结论

When a large language model is trained on a sufficiently large and diverse dataset it is able to perform well across many domains and datasets. GPT-2 zero-shots to state of the art performance on 7 out of 8 tested language modeling datasets. The diversity of tasks the model is able to perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.

当大语言模型在足够大且多样化的数据集上进行训练时,它能够在许多领域和数据集中表现出色。GPT-2 在 8 个测试的语言建模数据集中,有 7 个达到了零样本的最先进性能。该模型在零样本设置下能够执行的任务多样性表明,经过训练以最大化多样化文本语料库可能性的高容量模型,开始学习如何在没有明确监督的情况下执行大量任务。

Acknowledgements

致谢

Thanks to everyone who wrote the text, shared the links, and upvoted the content in WebText. Many millions of people were involved in creating the data that GPT-2 was trained on. Also thanks to all the Googlers who helped us with training infrastructure, including Zak Stone, JS Riehl, Jonathan Hseu, Russell Power, Youlong Cheng, Noam Shazeer, Solomon Boulos, Michael Banfield, Aman Gupta, Daniel Sohn, and many more. Finally thanks to the people who gave feedback on drafts of the paper: Jacob Steinhardt, Sam Bowman, Geoffrey Irving, and Madison May.

感谢所有为WebText撰写文本、分享链接并为内容点赞的人。数以百万计的人参与了GPT-2训练数据的创建。同时感谢所有帮助我们搭建训练基础设施的Google员工,包括Zak Stone、JS Riehl、Jonathan Hseu、Russell Power、Youlong Cheng、Noam Shazeer、Solomon Boulos、Michael Banfield、Aman Gupta、Daniel Sohn等。最后,感谢对论文草稿提供反馈的Jacob Steinhardt、Sam Bowman、Geoffrey Irving和Madison May。

References

参考文献

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Ba, J. L., Kiros, J. R., and Hinton, G. E. 层归一化 (Layer Normalization). arXiv preprint arXiv:1607.06450, 2016.

8. Appendix A: Samples

8. 附录 A: 示例

8.1. Model capacity

8.1. 模型容量

To complement the reported perplexity gains of bigger LMs on WebText shown in Figure 4, Tables 7 through 11 show side-by-side completions of the smallest WebText LM and GPT-2 on random unseen WebText test set articles.

为了补充图4中展示的更大规模大语言模型在WebText上报告的困惑度提升,表7至表11展示了最小的WebText大语言模型与GPT-2在随机未见过的WebText测试集文章上的并列补全结果。

8.2. Text Memorization

8.2. 文本记忆

We observe some memorizing behavior in GPT-2 on longer strings that are repeated many times in the dataset such as famous quotes or speeches. For example, when conditioned on the first sentence and a half of the Gettysburg Address (which occurs approximately 40 times throughout WebText), an argmax decode from GPT-2 recovers the speech. Even when sampling without truncation, we find that the model copies the speech for a while before drifting, albeit in a similar style. It typically drifts within 100-200 tokens, and displays widening diversity once it drifts.

我们观察到 GPT-2 在数据集中重复多次的较长字符串(如名人名言或演讲)上表现出一定的记忆行为。例如,当以葛底斯堡演说(在 WebText 中大约出现 40 次)的前一句半作为条件时,GPT-2 的 argmax 解码能够恢复出该演讲。即使在不进行截断的情况下进行采样,我们发现模型在开始偏离之前会复制一段时间的演讲内容,尽管风格相似。通常,模型会在 100-200 个 Token 内开始偏离,并在偏离后显示出逐渐增加的多样性。

To quantify how often exact memorization shows up in samples, we generated samples from GPT-2 conditioned on WebText test set articles and compared the overlap rates of GPT-2’s generations to the overlap rates of the ground-truth completions. The results of this analysis are shown below and suggest that GPT-2 repeats text from the training set less often than the baseline rate of held-out articles.

为了量化精确记忆在样本中出现的频率,我们从 GPT-2 中生成了以 WebText 测试集文章为条件的样本,并将 GPT-2 生成的重叠率与真实补全的重叠率进行了比较。分析结果如下所示,表明 GPT-2 重复训练集文本的频率低于保留文章的基线频率。

Figure 5. CDF of percentage 8-gram overlap with WebText training set, for both WebText test set and samples (conditioned on WebText test set, with top-$k$ truncated random sampling with $k=40$). Most samples have less than $1\%$ overlap, including over $30\%$ of samples with no overlap, whereas the median for the test set is $2.6\%$ overlap.

图 5: WebText 测试集和样本(以 WebText 测试集为条件,使用 top-$k$ 截断随机采样,$k=40$)与 WebText 训练集的 8-gram 重叠百分比的累积分布函数 (CDF)。大多数样本的重叠率低于 $1\%$,其中超过 $30\%$ 的样本没有重叠,而测试集的中位数重叠率为 $2.6\%$。

8.3. Diversity

8.3. 多样性

Table 12 shows multiple completions of the same random WebText test set context, showing the diversity of completions with standard sampling settings.

表 12 展示了同一随机 WebText 测试集上下文的多个补全结果,展示了标准采样设置下的补全多样性。

8.4. Robustness

8.4. 鲁棒性

Table 13 shows the previously mentioned talking unicorns news article. We find the model to be capable of handling out of distribution contexts, but the quality of these samples is generally lower.

表 13 展示了之前提到的会说话的独角兽新闻文章。我们发现模型能够处理分布外上下文,但这些样本的质量通常较低。

Context (WebText test)

上下文 (WebText 测试)

Outfit 1: Typical

装扮 1: 典型

This pairing was the first outfit I thought of when I bought the shoes. It’s like a summer version of this Jake Grantham outfit; in fact, my shoes are close to the colors of his Nike Racers! Instead of a heavy Harris Tweed jacket and denim shirt, I’m wearing a cotton DB jacket and and a linen shirt. Both fabrics (in these colors) are an absolute must for summer, as they go with both dark and and light pants! As you can see, they pair wonderfully with the dark jeans and shoes. It’s a pseudo menswear/prep outfit.

这双鞋是我买鞋时想到的第一套搭配。它就像是Jake Grantham这套服装的夏季版本;实际上,我的鞋子颜色接近他的Nike Racers!我没有选择厚重的Harris Tweed夹克和牛仔衬衫,而是穿了一件棉质双排扣夹克和一件亚麻衬衫。这两种面料(在这些颜色中)绝对是夏季必备,因为它们既适合深色裤子也适合浅色裤子!正如你所见,它们与深色牛仔裤和鞋子搭配得非常出色。这是一套伪男装/预科风格的服装。

Overall, this is a very casual outfit which is why I paired my sneakers with it. I’m not about wearing a full wool suit with sneakers (as GQ shows a lot) but I’m definitely open to keeping things casual, like this cotton DB. Casual fabrics are key to pulling off your sneakers in a dressed down menswear outfit. I’d even suggest to wear these sneakers with a khaki chino suit or a white linen suit. Just be sure to ditch the tie or wear a tee or polo; wearing a tie with sneakers is a bit too much

总的来说,这是一套非常休闲的装扮,所以我搭配了运动鞋。我不太喜欢穿全套羊毛西装配运动鞋(就像GQ经常展示的那样),但我绝对愿意保持休闲风格,比如这件棉质的双排扣西装。休闲面料是让运动鞋在休闲男装中脱颖而出的关键。我甚至建议可以搭配卡其色斜纹棉布西装或白色亚麻西装。只要记得不要打领带,或者穿T恤或Polo衫;穿运动鞋打领带就有点过头了。

Small model completion

小模型补全

GPT-2 completion

GPT-2 补全

for me!

为我!

Outfit 2: High End

装扮 2: 高端

This outfit was just in the eye of the storm. It’s a hybrid of what ASICS does with streetwear (like J. Crew) and what J. Crew does with casual pieces that are a bit more traditional. Again, as I mentioned before, it’s a low budget item and they can be very expensive. These will definitely be an issue if you’re trying to work as a casual sneaker maker, and that’s OK but don’t hesitate if you want to try the lower price versions of their outfits!

这套服装正处于风暴的中心。它是ASICS与街头服饰(如J. Crew)的混合体,也是J. Crew与更为传统的休闲单品的结合。正如我之前提到的,这是一款低预算商品,但它们可能非常昂贵。如果你试图成为一名休闲运动鞋制造商,这绝对会成为一个问题,但如果你想尝试他们服装的低价版本,请不要犹豫!

This version features a very low cut-cut neckline (which is not a bad thing), slightly longer legs, and a pair of long legs to keep things nice. My preferred color of color these items are blue-gray (which is my first colors, as they are a light white), or black (which is my favorite of my designs). As always, my pick of colors from this store have nothing to do with that other than the design. For a high end, I would recommend this to everyone!

此版本采用极低的V领设计(这并非坏事),裤腿略长,搭配一双长腿,整体效果非常出色。我偏爱的颜色是蓝灰色(这是我首选的颜色,因为它们接近浅白色)或黑色(这是我最喜欢的设计颜色)。一如既往,我在这家店选择的颜色仅与设计有关,与其他无关。对于高端产品,我会向所有人推荐这款!

Outfit 3: High Light

Outfit 3: 高光

This was a ”high light” look while I had a chance to get some color. I think the color is based on

这是一个“高光”造型,趁我有机会染上一些颜色。我认为颜色是基于

Outfit 2: Casual Jacket

Outfit 2: 休闲夹克

I know that some people might be opposed to wearing sneakers in a turtleneck, but I wanted to be true to myself, so I went with a slim fitting turtleneck in a color more similar to my favorite color of the day. Overall, it’s a classic turtleneck, with nothing too flashy. I did swap my shoes for these ASICS Tiger Killshots and I wish I hadn’t did that. The shoe itself is very comfortable, however, I found that wearing them in these sneakers made my feet look a bit larger. I’d wear them with denim, chinos, whatever, but would never recommend wearing them alone. There are just too many variables involved.

我知道有些人可能反对在高领毛衣下穿运动鞋,但我希望忠于自我,因此选择了一件修身的高领毛衣,颜色更接近我当天最喜欢的颜色。整体来说,这是一件经典的高领毛衣,没有过于花哨的元素。我确实换上了这双ASICS Tiger Killshots运动鞋,但事后我希望自己没有这么做。鞋子本身非常舒适,但我发现穿上这双运动鞋后,我的脚看起来有点大。我会搭配牛仔裤、卡其裤等穿着,但绝不建议单独穿这双鞋。涉及的因素实在太多了。

Outfit 3: Dress

服装 3: 连衣裙

This is the final outfit I wore the ASICS Tiger Killshots. This version of the ASICS is a much more modern and streamlined version. The shoe has the most modern and minimal design that I can think of! With a lot of black, these sneakers look very cool. They’ll definitely be very comfortable for summer.

这是我最终穿上的 ASICS Tiger Killshots。这个版本的 ASICS 更加现代化和流线型。这双鞋的设计是我能想到的最现代和极简的!搭配大量黑色,这些运动鞋看起来非常酷。它们在夏天穿起来一定会非常舒适。

To me, wearing them with a suit is the easiest way to have a good pair of sneaker shoes. That’s pretty much it! If

对我来说,搭配西装穿一双好运动鞋是最简单的方式。就是这样!

Context (WebText test)

上下文 (WebText 测试)

Some of the most glorious historical attractions in Spain date from the period of Muslim rule, including The Mezquita, built as the Great Mosque of Córdoba and the Medina Azahara, also in Córdoba and now in ruins but still visitable as such and built as the Madinat al-Zahra, the Palace of al-Andalus; and the Alhambra in Granada, a splendid, intact palace. There are also two synagogues still standing that were built during the era of Muslim Spain: Santa María la Blanca in Toledo and the Synagogue of Córdoba, in the Old City.

西班牙一些最辉煌的历史景点可以追溯到穆斯林统治时期,包括作为科尔多瓦大清真寺建造的梅斯吉塔 (The Mezquita) 和同样位于科尔多瓦的麦地那阿萨哈拉 (Medina Azahara),后者如今已成为废墟,但仍可作为安达卢斯宫殿 (Madinat al-Zahra) 参观;以及格拉纳达的阿尔罕布拉宫 (Alhambra),一座宏伟且保存完好的宫殿。此外,还有两座在穆斯林西班牙时期建造的犹太教堂仍然屹立不倒:托莱多的圣玛丽亚拉布兰卡 (Santa María la Blanca) 和位于老城区的科尔多瓦犹太教堂 (Synagogue of Córdoba)。

Reconquista and Imperial era [ edit ]

收复失地运动与帝国时代 [编辑]

Small model completion

小模型补全

GPT-2 completion

GPT-2 补全

The era of Muslim Spain is very different from that of other parts of the world, with a rather extensive medieval history. While the Spanish Empire existed during the second half of the 9th century BC it was quite slowly transformed to a small, secularized empire by 1150 A.D., in a much more prosperous and diverse state. In 1050 CE the emperor of Naples, at least partially in characterised by a series of wars and invasions from the north-western coast to the western coast of the region (e.g., on the Danube River which extended all the way from Gibraltar into the south of Spain, but was quickly destroyed on the same day by the Roman army), assumed the title of Emperor of Naples. He made a large contribution over the last few decades to restore and maintain the empire, the first of which to the west. His successors were Ferdinand, who took power from the Papal Throne in 1148, while the first imperial successor was King Alfred II of Sweden. The empire was again made one of the few independent states in the European Continent and although no more than 200 million people lived on its territory during this period the people had much to say to the Emperor. From 1720 on the Muslims made up half the population of Spain.

穆斯林西班牙时代与世界其他地区截然不同,拥有相当广泛的中世纪历史。虽然西班牙帝国存在于公元前9世纪下半叶,但到了公元1150年,它已逐渐转变为一个规模较小、世俗化的帝国,处于一个更加繁荣和多元的状态。公元1050年,那不勒斯皇帝(至少在某种程度上以一系列战争和从西北海岸到该地区西部海岸的入侵为特征,例如从直布罗陀一直延伸到西班牙南部的多瑙河,但当天就被罗马军队迅速摧毁)获得了那不勒斯皇帝的称号。他在过去几十年中为恢复和维护帝国做出了巨大贡献,其中第一个贡献是向西扩张。他的继任者是费迪南德,他于1148年从教皇宝座夺取了权力,而第一位帝国继任者是瑞典的阿尔弗雷德二世。帝国再次成为欧洲大陆上为数不多的独立国家之一,尽管在此期间其领土上居住的人口不超过2亿,但人民对皇帝有很多话要说。从1720年起,穆斯林占西班牙人口的一半。

The Reconquista (literally, ”The return to the country”) of 1092 and 1095 saw the establishment in the Iberian peninsula of a united kingdom of Castile and Aragon in which a Muslim dynasty would exist. This was the last time in Europe where Muslim rule would be considered ”native” and the last occasion where Christian-inflamed and organized conflicts would make themselves known in Europe.

1092年和1095年的收复失地运动(字面意思是“重返故土”)见证了伊比利亚半岛上卡斯蒂利亚和阿拉贡联合王国的建立,其中将存在一个穆斯林王朝。这是欧洲最后一次穆斯林统治被视为“本土”统治,也是最后一次基督教煽动和组织的冲突在欧洲显现的场合。

The reign of Muslim Moors in Spain is often referred to, especially among non-Muslims, as the ”Conquest of Granada” and Spain is sometimes referred to as Spain in ”Granada”. The last years of Muslim rule were followed by the Spanish Reconquista where the Muslims were driven or compelled out of the Iberian Peninsula with the victory of Ferdinand IX. in the Battle of Poitiers in March 1189 and the conquest of Jerusalem in 1187 by Saladin.

穆斯林摩尔人在西班牙的统治常被非穆斯林称为“格拉纳达征服”,西班牙有时也被称为“格拉纳达的西班牙”。穆斯林统治的最后几年之后是西班牙的收复失地运动,随着费迪南德九世在1189年3月的普瓦捷战役中的胜利,以及萨拉丁在1187年征服耶路撒冷,穆斯林被驱逐或被迫离开伊比利亚半岛。

The Reconquista period in Spain came to an end when the Moors retreated north. However the Moorish rule would not be fully replaced by the Christian rule until the Reconquista era resumed when the Reconquista lasted from 1229 to 1492.

西班牙的收复失地运动时期在摩尔人向北撤退时结束。然而,摩尔人的统治直到1229年至1492年收复失地运动重新开始时才完全被基督教统治所取代。

The last Muslim state in Spain, Granada, was conquered by the

西班牙最后一个穆斯林国家格拉纳达被征服

Context (WebText test)

上下文 (WebText 测试)

Pinky

Pinky

The pink ghost’s AI is designed to ”feel” opposite of the red ghost’s behavior. Pinky actually attempts to get out in front of Pac-Man. This is accomplished by setting the target 4 tiles ahead of Pac-Man’s current location in the direction that Pac-Man is travelling. One exception to this is when Pac-Man is traveling up. Due to an overflow bug in the code, the calculation includes a left offset equal to the expected up offset.

粉红幽灵的AI设计为与红幽灵的行为“相反”。Pinky实际上试图跑到Pac-Man的前面。这是通过将目标设置在Pac-Man当前行进方向前方4个格子的位置来实现的。唯一例外的是当Pac-Man向上移动时。由于代码中的一个溢出错误,计算中会包含一个与预期向上偏移相等的向左偏移。
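上文描述的 Pinky 寻的逻辑（含向上方向的溢出错误）可以用下面的 Python 草图来说明。这只是编者添加的示意性实现，并非论文或原始街机代码；格子坐标系与方向的表示方式均为本示例的假设。

```python
# 示意性实现(编者假设):按上文描述计算 Pinky 的目标格子。
# 坐标约定:x 向右增大,y 向下增大(屏幕坐标,为本示例的假设)。
DIRS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def pinky_target(pac_tile, pac_dir):
    """返回 Pinky 瞄准的格子:Pac-Man 行进方向前方 4 格。

    由于原版代码的溢出错误,当 Pac-Man 朝上时,
    目标还会额外向左偏移 4 格(与向上偏移量相等)。
    """
    x, y = pac_tile
    dx, dy = DIRS[pac_dir]
    tx, ty = x + 4 * dx, y + 4 * dy
    if pac_dir == "up":
        tx -= 4  # 溢出错误:额外的向左偏移
    return (tx, ty)
```

例如，Pac-Man 在 (10, 10) 朝右时目标为 (14, 10)；朝上时目标为 (6, 6)。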

The use of Pac-Man’s expected position allows the player to exploit the navigation code for Pinky. This was most often seen in situations where Pac-Man and Pinky are in close proximity to each other. For example, if they were headed toward one another, once they are less than 4 tiles apart, the target tile is now behind Pinky. If he encounters an intersection while in this configuration, he will attempt to double-back (by turning) in order to reach the target tile.

利用 Pac-Man 的预期位置,玩家可以针对 Pinky 的导航代码进行利用。这种情况最常见于 Pac-Man 和 Pinky 彼此接近时。例如,如果他们朝彼此移动,一旦他们之间的距离小于 4 个格子,目标格子就会在 Pinky 后方。如果在这种配置下遇到交叉路口,Pinky 会尝试掉头(通过转向)以到达目标格子。

Player’s would often perform what was referred to as a ”head fake” by changing directions rapidly just as Pinky was approaching an intersection. By displacing the target tile just as Pinky

玩家经常会执行所谓的“头部假动作”,即在Pinky接近交叉口时迅速改变方向。通过在Pinky接近时移动目标格子

Small model completion

小模型补全

GPT-2 completion

GPT-2 补全

was coming to the left, they would get the ”correct” direction.

即将到达左侧时,他们会得到“正确”的方向。

If the location is actually farther from what PacMan was at, the player’s will not respond to the ghost’s movement, and the ghost will instead return to the player’s current location. This is because Pac-Man does not move by turning.

如果位置实际上离 PacMan 所在的位置更远,玩家将不会对幽灵的移动做出反应,幽灵将返回玩家的当前位置。这是因为 Pac-Man 不是通过转向来移动的。

The ghost will have to go to another location of some kind to return where he began, and this is achieved in many ways. It will often simply leave the location, as for example if Pac-Man did jump off a cliff, he would return to the spot where he was before vanishing into space. There is a good number of possibilities these ghost paths may be possible, but the most obvious ones are if the player wants to avoid getting caught up in the game, with a goal as easy as keeping Pac-Man in the game for 1/3 as long as he was.

幽灵将不得不前往某种类型的地点以返回其起始位置,这可以通过多种方式实现。它通常会简单地离开该地点,例如,如果 Pac-Man 跳下悬崖,他会回到消失在空间之前所在的位置。这些幽灵路径有很多可能性,但最明显的情况是玩家希望避免被游戏困住,目标简单到让 Pac-Man 在游戏中停留的时间仅为原来的 1/3。

was about to cross the intersection, Pac-Man would be able to dodge the ghosts’ projectiles and return to the safe location of the safe house.

即将穿过十字路口时,Pac-Man 能够躲避幽灵的投射物并返回安全屋的安全位置。

Buster

Buster

The yellow ghost’s AI is specifically designed to try to avoid Pac-Man from leaving the safe house. The player must press the SPACE bar to reveal the safe house’s doors, then go into them to find the yellow ghost hiding in the next room.

黄色幽灵的AI专门设计用于阻止Pac-Man离开安全屋。玩家必须按下空格键以打开安全屋的门,然后进入其中,在下一个房间中找到隐藏的黄色幽灵。

The player must also note that the ghost will not attempt to exit the safe house if it sees Pac-Man already there. In fact, the ghost will not attempt an exit in the first place unless they’re already in the safe house.

玩家还必须注意,如果幽灵看到Pac-Man已经在安全屋中,它将不会尝试离开安全屋。事实上,除非幽灵已经在安全屋中,否则它们一开始就不会尝试离开。

These two behaviors are based on the fact that the player does not always attempt to reach the safe house before Pac-Man (or Blinky).

这两种行为基于一个事实:玩家并不总是试图在Pac-Man(或Blinky)之前到达安全屋。

Chasing

追逐

In Pac-Man’s case, this would be similar to how Pac-Man can ”catch up” in a fight.

在《吃豆人》的情况下,这类似于吃豆人在战斗中如何“追赶”。

A ghost could chase through enemies. This is useful for many reasons, both for Pac-Man and his AI. Unlike most Ghost games, this ghost simply travels in the direction from

鬼魂可以穿过敌人追逐。这对Pac-Man和他的AI都有很多用处。与大多数鬼魂游戏不同,这个鬼魂只是沿着方向移动。

The player must also beware of this glitch if a ghost (or another player) is already in the safe house and attempting to get out as soon as Pac-Man or Blinky is gone. At this point, the ghost will attempt to exit the safe house in a similar manner as it is in ”Chase” mode.

如果幽灵(或另一位玩家)已经在安全屋中,并试图在Pac-Man或Blinky离开后立即出去,玩家也必须注意这个故障。此时,幽灵会以类似于“追逐”模式的方式尝试离开安全屋。

Pac-Man

Pac-Man

There are some ghosts that follow Pac-Man. These ghosts are

有一些幽灵会跟随吃豆人。这些幽灵是

Table 9. Random unseen contexts (top), and non-cherry-picked completions from both the smallest (left) and largest (right) models. Contexts are 768 tokens, with approximately 256 tokens worth of paragraphs shown. Completions are 256 tokens and fully shown. Top-$k$ random sampling with $k=40$ was used for generation.

表 9. 随机未见过的上下文（顶部），以及来自最小（左）和最大（右）模型的非精选补全。上下文为 768 个 Token，显示约 256 个 Token 的段落。补全为 256 个 Token 并完全显示。生成时使用了 Top-$k$ 随机采样，其中 $k=40$。

Context (WebText test)

上下文 (WebText 测试)

Corporal Michael P. Goeldin was an unskilled laborer from Ireland when he enlisted in Company A in November 1860. Goldein survived the war. Corporal Patrick O’Neal, also from Ireland, first enlisted in 1854 and served with Company L, 3d U.S. Artillery, in Oregon. He returned to the East Coast and enlisted in the company in 1860. O’Neal served until 1874, when he was named superintendent of the National Cemetery at Willets Point, New York. Corporal Benjamin Browne was a shoemaker from Orange County, New York. In August 1862, he enlisted in the newly formed 124th New York Volunteers, and was one of sixty-one men who transferred into Company A that October. Browne reenlisted in the company in February 1864 while it was camped at Brandy Station. He returned to civilian life after completing his enlistment in 1867.

Michael P. Goeldin 下士在 1860 年 11 月加入 A 连时,是一名来自爱尔兰的非技术工人。Goldein 在战争中幸存下来。Patrick O’Neal 下士同样来自爱尔兰,他于 1854 年首次入伍,并在俄勒冈州的美国第 3 炮兵连 L 连服役。他回到东海岸后,于 1860 年加入了该连。O’Neal 一直服役到 1874 年,之后被任命为纽约 Willets Point 国家公墓的负责人。Benjamin Browne 下士是来自纽约奥兰治县的鞋匠。1862 年 8 月,他加入了新组建的纽约第 124 志愿军,并在同年 10 月成为转入 A 连的 61 名士兵之一。Browne 在 1864 年 2 月该连驻扎在 Brandy Station 时再次入伍。他在 1867 年完成服役后回归平民生活。

On 10 June, Artificer William Collins was promoted to corporal, probably to fill a combat leadership void for the crossing of the James River. Collins’s service record does not reflect the qualities he demonstrated to earn this promotion, but he had obviously overcome some serious problems. Born in Sacketts Harbor, New York, Collins enlisted in the company in December 1853 at the age of twenty-two, and reenlisted in December 1858. Just a month before the war began in April 1861, Collins went ”over the hill” and was not caught until three years later. Returned to the company on 22 March 1864, he was tried

6月10日,工匠 William Collins 被提升为下士,可能是为了填补詹姆斯河渡河作战中的领导空缺。Collins 的服役记录并未反映出他获得此次晋升所展现的品质,但他显然克服了一些严重的问题。Collins 出生于纽约的 Sacketts Harbor,于1853年12月加入连队,时年22岁,并于1858年12月再次入伍。就在1861年4月战争开始前一个月,Collins 擅自离队,直到三年后才被抓获。1864年3月22日,他被送回连队并接受审判。

Small model completion

小模型补全

GPT-2 completion

GPT-2 补全

on military charges and was discharged from the U.S. Military. He returned to the company, and was assigned to Division of Internal Affairs during the Civil War.

因军事指控被美国军队开除。他回到连队，并在内战期间被分配到内务部门。

As an engineer, Collins also had training in engineering from his native New York. After graduate studies in the Union College of Engineering, he was employed in the United States Army for two years at the same time. There he worked under Captain Daniel Pfeiffer and was promoted to colonel in May 1870, three years after Captain Pfeiffer’s enlistment. Collins returned home to England after completing his first year in the First Division to serve the next two years in the Army Corps of Engineers, and by his second year, he began to experience his service duty.<|endoftext|>MOSCOW, July 5 (Reuters) - U.K.-based Russian state media and government sources and experts say Moscow has sought to downplay the ”military buildup” of the Syrian opposition while maintaining support for President Donald Trump’s plan to build a wall along the U.S.-Mexico border, calling it a ”military coup”.

作为工程师，Collins 还在其家乡纽约接受了工程学培训。在联合工程学院完成研究生学业后，他同时在美国陆军服役了两年。在那里，他在 Daniel Pfeiffer 上尉的领导下工作，并于 1870 年 5 月晋升为上校，即 Pfeiffer 上尉入伍三年后。Collins 在第一师完成第一年服役后返回英国，在接下来的两年里在陆军工程兵团服役，到第二年时，他开始体验到自己的服役职责。<|endoftext|>

莫斯科,7 月 5 日(路透社)——英国和俄罗斯的国家媒体、政府消息来源以及专家表示,莫斯科试图淡化叙利亚反对派的“军事集结”,同时继续支持美国总统 Donald Trump 沿美墨边境修建隔离墙的计划,称其为“军事政变”。

U.S.-led Syria war veteran and Syria envoy Michael Flynn (R), right, gestures with U.S.-based Syrian political analyst Ahmet Koryat, with

美国领导的叙利亚战争老兵兼叙利亚特使 Michael Flynn (右) 与美国叙利亚政治分析师 Ahmet Koryat 一起做手势

in June of that year. Sentenced to be shot for desertion, he remained on active duty.

同年6月,他因逃兵罪被判处枪决,但仍继续服役。

Captain James J. Dall was born in Virginia in 1829 during the Revolution and joined his father in the Army in 1836. He served with the 2d Virginia Infantry until 1845, when he went to the Western Reserve Regiment; and, on his discharge, he reenlisted with Company I, 25th New York Volunteers (which in 1845 numbered about 4,500 men). One of his brothers, James, served with his company in the same capacity in the same brigade. While the regiment remained in Connecticut through 1846, Captain Dall served with Company I at various times before his discharge. A veteran of the Mexican War, he had served two years on the staff of Brigadier General John J. Sullivan at San Antonio. During the Peninsula Campaign of 1863, he commanded a company of twenty-three and was in charge of the battery of fifty-four heavy mortars and the gunboat Pinta of the Battery, a gunboat and several other boats. Captain Dall was on active duty with Company I in the fall of 1865. Two years later, he was promoted to second lieutenant.

James J. Dall 上尉于 1829 年革命期间出生于弗吉尼亚州,并于 1836 年随父亲加入军队。他在弗吉尼亚第二步兵团服役至 1845 年,随后加入了西部预备团;退伍后,他重新加入了第 25 纽约志愿军 I 连(该连在 1845 年约有 4,500 人)。他的兄弟 James 在同一旅的同一连队中担任相同职务。当该团在 1846 年期间驻扎在康涅狄格州时,Dall 上尉在退伍前多次与 I 连一起服役。作为墨西哥战争的老兵,他曾在圣安东尼奥的 John J. Sullivan 准将的参谋部服役两年。在 1863 年的半岛战役期间,他指挥了一支由 23 人组成的连队,并负责管理 54 门重型迫击炮的炮台以及炮艇 Pinta 和其他几艘船只。Dall 上尉于 1865 年秋季与 I 连一起执行现役任务。两年后,他被晋升为少尉。

Captain John A. O’

John A. O’ 上尉

Context (WebText test)

上下文 (WebText 测试)

Xmonad, Xresources, and Fonts

Xmonad、Xresources 和字体

Xresources can be an absolute pain (they were for me). So, this a bit of help:

Xresources 可能是一个绝对的痛点(对我来说确实如此)。因此,这里提供一些帮助:

First of all, the way that fonts are generally handled in Xmonad is through XFT . Xft is the X FreeType Interface library and is a library that handles the actual rendering of fonts in Xmonad/urxvt (Xresources specifies a bunch of resources to things that launch under the X window manager. Settings for individual programs are specified in Xresources like so: Urxvt.background:background settins here ).

首先,Xmonad 中处理字体的方式通常是通过 XFT。Xft 是 X FreeType 接口库,负责在 Xmonad/urxvt 中处理字体的实际渲染(Xresources 为在 X 窗口管理器下启动的程序指定了一系列资源。单个程序的设置在 Xresources 中如下所示:Urxvt.background:background settings here)。

Anyway, fonts are set using this syntax in Xresources :

无论如何,字体在 Xresources 中使用以下语法设置:

urxvt*font:xft:your font name goes here:size=the size of the font goes here,xft:fall-back font 1 goes here

urxvt*font:xft:你的字体名称:size=字体大小,xft:备用字体 1

So, thats the most basic part of fonts in urxvt specified via Xresources.

所以,这是通过Xresources在urxvt中指定字体的最基本部分。
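作为编者添加的示意（并非原文内容），假设系统中安装了 DejaVu Sans Mono 与 Noto Sans CJK SC 两种字体，上述语法的一个具体 Xresources 写法可能是：

```
urxvt*font: xft:DejaVu Sans Mono:size=11,xft:Noto Sans CJK SC:size=11
```

逗号后的第二个 xft: 条目即回退字体；修改 Xresources 后通常需要执行 xrdb -merge 重新加载才能生效。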

Fixing /.ssh/ Permissions in Cygwin

修复 Cygwin 中的 /.ssh/ 权限

In Cygwin on Windows I found that I could not set the permissions of my /.ssh/ folder to be 0600, as is required for ssh to allow you to use keys. The symptom I had was that I no matter what I did, it always modfied the owner and the group

在 Windows 的 Cygwin 中,我发现无法将我的 /.ssh/ 文件夹权限设置为 0600,而这是 ssh 允许你使用密钥所必需的。我遇到的症状是,无论我做什么,它总是修改所有者和组。

Small model completion

小模型补全

GPT-2 completion

GPT-2 补全

was set to /.ssh/, which was the way things were. This could not be fixed completely. I found I could make that much more workable with just typing the username with your mouse then trying to set permissions to be ”0600” for the group, which is what my system is now configured to do.

设置为 /.ssh/，这是当时的情况。这无法完全修复。我发现只需用鼠标输入用户名，然后尝试将组权限设置为 ”0600”，这样可以使系统更加可用，这也是我现在系统的配置方式。

, making these files read-only. I tried just starting ssh and using a program like passwd which modifies the owner/group to 0644 permissions, but it didn’t work. It was annoying but not nearly as bad as it could be, just annoying to have to fiddle with permissions.

将这些文件设置为只读。我尝试启动ssh并使用像passwd这样的程序来修改所有者/组为0644权限,但没有成功。这很烦人,但还不至于太糟糕,只是需要调整权限有些麻烦。

So I decided to do this as the reverse:

所以我决定反过来做:

However, as of Cygwin 7 on Linux I have found that doing so is now ”unnecessary” for my system! I also installed a package called chroot. It does that by allowing you to install/upgrade to your system directory.

然而,在 Linux 上的 Cygwin 7 中,我发现这样做现在对我的系统来说是“不必要的”!我还安装了一个名为 chroot 的包。它通过允许你安装/升级到系统目录来实现这一点。

So, as of Windows 8 and up, installing that can require chroot, so I tried and tried and found that it wasn’t so bad when installing the package on Linux and then uninstalling it. As of my last write-up on Cygwin, the script just does its job (the sudo commands do the rest), if you try it you will have to log the change.

因此,从 Windows 8 开始,安装该软件可能需要使用 chroot,所以我尝试了多次,发现如果在 Linux 上安装该软件包然后再卸载它,情况并不那么糟糕。根据我上次关于 Cygwin 的写作,脚本会完成它的工作(sudo 命令会完成剩下的部分),如果你尝试它,你将需要记录更改。

If that wasn’t confusing enough, this is what I tried and just went back to my /.ssh/ directory and the chroot script. I found I was finally running out of permissions (because i forgot to create a symlink to ssh from the file system and

如果这还不够让人困惑,我尝试了以下方法,然后回到了我的 /.ssh/ 目录和 chroot 脚本。我发现我终于用完了权限(因为我忘记从文件系统创建一个指向 ssh 的符号链接)。

For each .ssh folder in your Cygwin directory, set the permission of each file and directory to 2644. Then start ssh with: ssh-add /.ssh/

对于 Cygwin 目录中的每个 .ssh 文件夹,将每个文件和目录的权限设置为 2644。然后使用以下命令启动 ssh:ssh-add /.ssh/

The first time you connect, your permissions of the /.ssh folder that you just had in Cygwin will be set to 0600. Then every time you connect again, your permissions will change to 0644.

首次连接时,你在 Cygwin 中刚刚拥有的 /.ssh 文件夹权限将被设置为 0600。之后每次重新连接时,权限将更改为 0644。

Makefile for Xresources as used in /.xsessionrc

Makefile 用于 Xresources,如 /.xsessionrc 中所用

These are made in my .xsessionrc :

这些是在我的 .xsessionrc 文件中创建的:

.xprofile: .xsession-errors: color=””

.xprofile: .xsession-errors: color=""

For example:

例如:

.xprofile: color := blue

.xprofile: color := blue

As the colors are stored in a separate file, the color in your /.xsession-errors will look like cyan instead of blue.

由于颜色存储在单独的文件中,你的 /.xsession-errors 中的颜色将显示为青色而不是蓝色。

This is mostly because, as of now, only the color is assigned to your

这主要是因为,到目前为止,只有颜色被分配给了你的

Context (WebText test)

上下文 (WebText 测试)

Aaron loves mint chocolate cake, but he requires that it be paired with mini chocolate chips, so I threw some of those in between the layers. I also had a few Peppermint Jo Jos on hand so I crushed them up and threw some of those in along with some crushed meringue cookies because, why not? It’s a total smorgasbord of minty chocolate chippy cookie crunchy goodness. I didn’t measure how much of each topping I used, but after I tasted the finished product, I wish I had added more. You can add anything you want- crushed candy canes, peppermint bark, etc. And don’t be afraid to use a heavy hand. Texture = good.

Aaron 喜欢薄荷巧克力蛋糕，但他要求搭配迷你巧克力片，所以我在蛋糕层之间撒了一些。我手边还有一些 Peppermint Jo Jos，所以我把它们碾碎，和碎蛋白酥饼干一起撒了进去，为什么不呢？这完全是一场薄荷巧克力片饼干脆脆的美味盛宴。我没有测量每种配料用了多少，但尝过成品后，我真希望自己加得更多。你可以添加任何你想要的东西——碎糖果棒、薄荷树皮等等。别怕多放。口感 = 好。

If you don’t have 7-inch cake pans, you can get 3 shorter 8-inch layers out of this

如果你没有7英寸的蛋糕模具,可以用这个食谱做出3个较短的8英寸蛋糕层

GPT-2 补全 1：
食谱。我做了较长的版本，然后用模具做了一层黄油奶油霜，并将其中1/3放入冰箱作为顶部装饰。只要确保你在做这个蛋糕之前准备好所有的奶油霜，你就不必等待它，因为蛋白会沉在搅拌碗的底部。你想要蛋白打到硬性发泡吗？然后把你的蛋糕盘放在一锅热水上加热，直到它刚刚变暖。这不需要超过3-5分钟；只要足够让它开始

GPT-2 补全 2：
食谱。如果你这样做，并且你知道如何制作那个蛋糕，那么你的生活将永远不会一样。我有两个非常简单的规则，当涉及到饼干屑或巧克力片时。1)当我买生糖时，我不会使用它，直到它冷却并且上面至少有半茶匙糖。这是我的规则。2)如果我有一个食谱需要某种东西，并且我想在食谱中使用它，我不会在网上用其他东西代替它。就像这样。我发现这张非常可爱的卡片，上面有可爱的小猫在粉末上

GPT-2 补全 3：
食谱，可能会产生大约相同数量的蛋糕。巧克力酪乳蛋糕 打印大块巧克力蛋糕，涂上酪乳，撒上脆脆的切碎巧克力，淋上巧克力酱。简单但比巧克力蛋糕更有新意！作者：妈妈制作配料巧克力酪乳蛋糕 1 1/2 杯无盐黄油，室温 1 1/2 杯砂糖 3 个大鸡蛋加1个蛋黄 巧克力釉料 1/3 杯可可

GPT-2 补全 4：
食谱。蛋白酥饼干 1 3/4 杯黄油，软化 1 杯糖 1 个蛋黄 3 汤匙重奶油 3 1/2-4 杯面粉 少许盐 薄荷-Jojo 配料 1 1/4 杯糖粉 1 杯切碎的山核桃 1/2 杯切碎的薄荷叶 1/2 杯切碎的新鲜薄荷(约1/2张) 1 茶匙薄荷提取物 1/4 茶匙海盐 少许海

GPT-2 补全 5：
食谱。我在圣诞节早晨的两次失败并不是因为霜饰技巧不好。那是另一个成功的故事。蛋糕非常湿润，而且，只是一个美丽的蛋糕。它里面几乎没有空气或水分，但所有的巧克力和薄荷完美地混合在一起。让我有点不安的是，所有这些成分——蛋黄、糖、融化的黄油、烤约30-40分钟、巧克力片、融化的巧克力和薄荷——必须直接混合到蛋糕面糊中，这非常耗时。一个朋友建议，如果我把面糊放在一个单独的碗里，1 1 3/4 杯糖粉 6 盎司半

GPT-2 补全 6：
食谱，这将为你提供大约6-9份美味的巧克力蛋糕。为了防止你的蛋糕层粘在一起，我用了8英寸的圆形挞盘，铺上烘焙纸，然后从中切出4个7英寸的圆形蛋糕，并将它们夹在中间(这对单层巧克力蛋糕也很有效)。当你准备好烤这个蛋糕时，取下烘焙纸，预热烤箱至350度，直到顶部金黄，测试器取出干净。巧克力酪乳层蛋糕与薄荷Jo

Table 13. Conditional generation on an out-of-distribution context by GPT-2. Cherry pick of 10 samples generated with $k=40$ .

表 13: GPT-2 在分布外上下文条件下的生成。精选了 10 个使用 $k=40$ 生成的样本。

GPT-2: The original site in Vallon-Pont-D’arc in Southern France is a Unesco World Heritage site and is the oldest known and the best preserved cave decorated by man. The replica cave was built a few miles from the original site in Vallon-Pont-D’Arc in Southern France. The cave contains images of 14 different species of animals including woolly rhinoceros, mammoths, and big cats.

GPT-2:位于法国南部瓦隆蓬达尔克的原址是联合国教科文组织世界遗产地,也是已知最古老且保存最完好的人类装饰洞穴。复制洞穴建在距离原址几英里的法国南部瓦隆蓬达尔克。洞穴内描绘了14种不同动物的图像,包括毛犀牛、猛犸象和大型猫科动物。

Reference: Cave mimics famous Caverne du Pont-d’Arc in France, the oldest cave decorated by man and the best preserved. The replica contains all 1,000 paintings which include 425 such as a woolly rhinoceros and mammoths. Minute details were copied using 3D modelling and anamorphic techniques, often used to shoot widescreen images. The modern cave also includes replica paw prints of bears, bones and details preserved in the original cave.

参考:洞穴模仿了法国著名的 Caverne du Pont-d’Arc,这是人类装饰的最古老且保存最完好的洞穴。复制品包含了所有 1000 幅画作,其中包括 425 幅如毛犀牛和猛犸象的画作。使用 3D 建模和常用于拍摄宽屏图像的变形技术复制了微小的细节。现代洞穴还包括熊的爪印复制品、骨头以及原始洞穴中保存的细节。

Article: ‘If you can find my submarine, it’s yours,’ Russian oil billionaire Roman Abramovich once said.

文章:俄罗斯石油亿万富翁 Roman Abramovich 曾说过:“如果你能找到我的潜艇,它就是你的了。”

And it seems the oligarch and Chelsea FC owner, whose submarine is just one of the extras that came with his £300million superyacht Eclipse (perfect for getting to shore undetected), is not the only wealthy businessman splashing out on underwater exploration.

而且,这位寡头兼切尔西足球俱乐部老板的潜艇只是他价值3亿英镑的超级游艇Eclipse的附加品之一(非常适合在不被发现的情况下上岸),他并不是唯一一位在水下探索上挥霍的富商。

GPT-2: The U-Boat Worx submarine can dive to 984 feet (300 metres) and is capable of diving to the bottom of the ocean. It can be used for diving, snorkelling and even snorkel diving. The submarine is also equipped with a snorkelling platform.

GPT-2: U-Boat Worx 潜水艇可以下潜至 984 英尺 (300 米),并且能够潜至海底。它可用于潜水、浮潜甚至浮潜潜水。该潜水艇还配备了浮潜平台。

Reference: Feel like a James Bond villain with the most advanced submerged ocean vehicles. Some convert from a yacht into a submarine, and others detached off for a speedy exploration. Oliver’s Travels offers a Mile Low package on its submarine, complete with chef and butler for £175,000 a night.

参考：感受像詹姆斯·邦德反派一样，拥有最先进的水下海洋交通工具。有些可以从游艇转变为潜艇，有些则可以分离出来进行快速探索。Oliver’s Travels 提供其潜艇的 Mile Low 套餐，配备厨师和管家，每晚价格为 £175,000。

Article: Amina Ali Qassim is sitting with her youngest grandchild on her lap, wiping away tears with her headscarf. Only a few months old, this is the baby girl whose ears she desperately tried to cover the night the aerial bombardment started. She lay awake, she says, in a village mosque on the Yemeni island of Birim, counting explosions as the baby cried.

文章:Amina Ali Qassim 正抱着她最小的孙子坐在那里,用头巾擦拭着泪水。这个只有几个月大的女婴,正是在空袭开始的那晚,她拼命想要捂住耳朵的孩子。她说,她躺在也门 Birim 岛上一个村庄的清真寺里,数着爆炸声,而婴儿在哭泣。

It could have been worse though. They could have still been in their house when the first missile landed. ”Our neighbor shouted to my husband ’you have to leave, they’re coming.’ And we just ran. As soon as we left the house, the first missile fell right by it and then a second on it. It burned everything to the ground,” Qassim tells us

情况本可能更糟。如果第一枚导弹落下时他们还在家里,那就更糟了。Qassim告诉我们:“我们的邻居对我丈夫喊道,‘你们必须离开,他们来了。’我们刚跑出去,第一枚导弹就落在了房子旁边,紧接着第二枚导弹击中了房子。一切都烧成了灰烬。”

GPT-2: Yemen is in the middle of a civil war. Saudi Arabia is leading the coalition bombing campaign. It’s been bombing Yemen for more than two months now.

GPT-2:也门正处于内战之中。沙特阿拉伯领导着联合轰炸行动。目前,沙特阿拉伯已经对也门进行了超过两个月的轰炸。

Reference: Amina Ali Qassim’s family sought shelter in a mosque before fleeing Yemen. Thousands like them are boarding boats to sail to Djibouti. Saudi Arabia has been pounding Yemen in a bid to defeat Houthi rebels.

参考:Amina Ali Qassim 的家人在逃离也门之前在一座清真寺寻求庇护。成千上万像他们一样的人正在登船前往吉布提。沙特阿拉伯一直在轰炸也门,试图击败胡塞叛军。

英文参考：这次重新发布，标题为《The Next Day Extra》，以三张光盘的形式呈现：原始专辑、未发表的录音室录音和混音，以及一张包含已经发布的四段视频的 DVD。
GPT-2 法文翻译：新的重新发布，首先是光盘，包含专辑、录音室录音和混音，以及一张包含已经发布的一些视频的 DVD。

法文参考：一位男士解释说，他所接受的免费疝气手术将使他能够重新工作。
GPT-2 英文翻译：一位男士解释说，他在手术前所认识的疝气功能，以及他所成功的疝气功能，我已经成功了。
法文参考：一位男士解释说，他所接受的免费疝气手术将使他能够重新工作。
GPT-2 英文翻译：一位男士告诉我，他所承诺的手术免费将不允许他旅行。
英文参考：在伦敦的开放政府会议上通过视频链接发言时，Kerry 先生说：“毫无疑问，总统、我以及其他政府成员实际上已经了解到一些事情，这些事情一直在自动模式下进行，因为这种能力一直存在，可以追溯到第二次世界大战和冷战的艰难岁月，当然还有 9/11。”
GPT-2 法文翻译：在国际版在线会议的视频中，国际版部长说：“只有一个问题，总统、我以及所有其他政府成员已经明显意识到某些事情，这些事情在战争期间以及战争的一部分中已经很明显了。”

法文参考：在伦敦通过视频链接参加政府间会议时，Kerry 先生说：“毫无疑问，总统、我以及其他政府成员已经了解到某些事情，这些事情一直在自动模式下进行，因为我们有这种能力，从第二次世界大战到冷战的艰难岁月，当然还有 9/11。”
GPT-2 英文翻译：在美国和伦敦之间的会议上，总统、我以及其他政府成员已经意识到某些选择，这些选择在过去是为了能够以更自动化的方式做某些事情。”

Context (passage and previous question/answer pairs)

上下文(段落及之前的问答对)

The 2008 Summer Olympics torch relay was run from March 24 until August 8, 2008, prior to the 2008 Summer Olympics, with the theme of “one world, one dream”. Plans for the relay were announced on April 26, 2007, in Beijing, China. The relay, also called by the organizers as the “Journey of Harmony”, lasted 129 days and carried the torch 137,000 km (85,000 mi) – the longest distance of any Olympic torch relay since the tradition was started ahead of the 1936 Summer Olympics.

2008年夏季奥运会火炬接力于2008年3月24日至8月8日举行,主题为“同一个世界,同一个梦想”。接力计划于2007年4月26日在中国北京公布。组织者将此次接力称为“和谐之旅”,历时129天,火炬传递距离达137,000公里(85,000英里),是自1936年夏季奥运会以来奥运火炬接力历史上最长的一次。

After being lit at the birthplace of the Olympic Games in Olympia, Greece on March 24, the torch traveled to the Panathinaiko Stadium in Athens, and then to Beijing, arriving on March 31. From Beijing, the torch was following a route passing through six continents. The torch has visited cities along the Silk Road, symbolizing ancient links between China and the rest of the world. The relay also included an ascent with the flame to the top of Mount Everest on the border of Nepal and Tibet, China from the Chinese side, which was closed specially for the event.

3月24日在希腊奥林匹亚的奥运会发源地点燃后,火炬传递至雅典的帕纳辛奈科体育场,随后于3月31日抵达北京。从北京出发,火炬沿着一条穿越六大洲的路线传递。火炬途经丝绸之路沿线的城市,象征着中国与世界其他地区的古老联系。传递过程中还包括从中国一侧攀登至尼泊尔与中国西藏边境的珠穆朗玛峰顶,该路线为此活动特别封闭。

Model answer: Everest
Turker answers: unknown, yes, Yes, yes

模型答案：Everest
Turker 回答：未知，是，是，是

Context (passage and previous question/answer pairs)

上下文(段落及先前的问题/答案对)

Tom goes everywhere with Catherine Green, a 54-year-old secretary. He moves around her office at work and goes shopping with her. ”Most people don’t seem to mind Tom,” says Catherine, who thinks he is wonderful. ”He’s my fourth child,” she says. She may think of him and treat him that way as her son. He moves around buying his food, paying his health bills and his taxes, but in fact Tom is a dog.

Tom 和 54 岁的秘书 Catherine Green 形影不离。他在她的办公室里四处走动,陪她购物。Catherine 说:“大多数人似乎都不介意 Tom。”她认为 Tom 很棒。“他是我的第四个孩子,”她说。她可能把他当作儿子一样对待。他四处走动,购买食物,支付医疗账单和税款,但实际上 Tom 是一只狗。

Catherine and Tom live in Sweden, a country where everyone is expected to lead an orderly life according to rules laid down by the government, which also provides a high level of care for its people. This level of care costs money.

Catherine 和 Tom 住在瑞典,这个国家期望每个人都按照政府制定的规则过上有序的生活,政府也为人民提供了高水平的关怀。这种关怀水平需要资金支持。

People in Sweden pay taxes on everything, so aren’t surprised to find that owning a dog means more taxes. Some people are paying as much as 500 Swedish kronor in taxes a year for the right to keep their dog, which is spent by the government on dog hospitals and sometimes medical treatment for a dog that falls ill. However, most such treatment is expensive, so owners often decide to offer health and even life for their dog.

瑞典人对一切都要交税,因此养狗意味着更多的税收并不会让他们感到惊讶。一些人每年要支付高达500瑞典克朗的税款来获得养狗的权利,这些钱由政府用于狗医院,有时也用于治疗生病的狗。然而,大多数此类治疗费用昂贵,因此狗主人常常决定为他们的狗提供健康甚至生命保障。

In Sweden dog owners must pay for any damage their dog does. A Swedish Kennel Club official explains what this means: if your dog runs out on the road and gets hit by a passing car, you, as the owner, have to pay for any damage done to the car, even if your dog has been killed in the accident.

在瑞典,狗主人必须为他们的狗造成的任何损害支付赔偿。瑞典养犬俱乐部的一位官员解释了这意味着什么:如果你的狗跑到路上并被过往的汽车撞到,作为主人,你必须为汽车造成的任何损害支付赔偿,即使你的狗在事故中丧生。

Model answer: Stockholm
Turker answers: Sweden, Sweden, in Sweden, Sweden

模型答案：斯德哥尔摩
Turker 答案：瑞典，瑞典，在瑞典，瑞典
