PYTEXT: A SEAMLESS PATH FROM NLP RESEARCH TO PRODUCTION

Ahmed Aly¹, Kushal Lakhotia¹, Shicong Zhao¹, Mrinal Mohit, Barlas Oguz², Abhinav Arora¹, Sonal Gupta¹, Christopher Dewan², Stef Nelson-Lindall², Rushin Shah¹

¹Facebook Conversational AI ²Facebook AI

ABSTRACT

We introduce PyText, a deep-learning-based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapid experimentation and of serving models at scale. It achieves this by providing simple and extensible interfaces for model components, and by using PyTorch's capabilities of exporting models for inference via the optimized Caffe2 execution engine. We report our own experience of migrating experimentation and production workflows to PyText, which enabled us to iterate faster on novel modeling ideas and then seamlessly ship them at industrial scale.

1 INTRODUCTION

When building a machine learning system, especially one based on neural networks, there is usually a trade-off between ease of experimentation and deployment readiness, often with conflicting requirements. For instance, to rapidly try out flexible and non-conventional modeling ideas, researchers tend to use modern imperative deep-learning frameworks like PyTorch or TensorFlow Eager. These frameworks provide an easy, eager-execution interface that facilitates writing advanced and dynamic models quickly, but they also suffer from latency overhead at inference and impose deployment challenges. In contrast, production-oriented systems are typically written in declarative frameworks that express the model as a static graph, such as Caffe2 and TensorFlow. While highly optimized for production scenarios, these frameworks are often harder to use and make the experimentation life-cycle much longer. This conflict is even more prevalent in natural language processing (NLP) systems, since most NLP models are inherently very dynamic and not easily expressible in a static graph. This adds to the challenge of serving these models at an industrial scale.

PyText, built on PyTorch 1.0, is designed to achieve the following:

Table 1. Comparison of NLP Modeling Frameworks

NLP Framework	Deep Learning Support	Easy to Prototype	Industry Performance
CoreNLP	X	√	√
AllenNLP	√	√	X
FLAIR	√	√	X
Spacy 2.0	√	X	√
PyText	√	√	√

Existing popular frameworks for building state-of-the-art NLP models include Stanford CoreNLP (Manning et al., 2014), AllenNLP (Gardner et al., 2017), FLAIR (Akbik et al., 2018) and Spacy 2.0. CoreNLP has been a popular library for both research and production, but does not support neural network models very well. AllenNLP and FLAIR are easy to use for prototypes, but it is hard to productionize their models since they are written in Python, which doesn't support large-scale real-time requests due to its lack of good multi-threading support. Spacy 2.0 has some state-of-the-art NLP models built for production use cases, but is not easily extensible for quick prototyping and building new models.

2 FRAMEWORK DESIGN

PyText is a modeling framework that helps researchers and engineers build end-to-end pipelines for training or inference. Apart from workflows for experimenting with model architectures, it provides ways to customize the handling of raw data, the reporting of metrics, the training methodology and the exporting of trained models. PyText users are free to implement one or more of these components and can expect the entire pipeline to work out of the box. A number of default pipelines are implemented for popular tasks and can be used as-is. We now dive deeper into the building blocks of the framework and its design.

2.1 Component

Everything in PyText is a component. A component is clearly defined by the parameters required to configure it. All components are maintained in a global registry, which makes PyText aware of them. They currently include:

Task: combines the various components required for a training or inference task into a pipeline. Figure 1 shows a sample config for a document classification task. It can be configured as a JSON file that defines the parameters of all the child components.

Data Handler: processes raw input data and prepares batches of tensors to feed to the model.

Model: defines the neural network architecture.

Optimizer: encapsulates model parameter optimization using the loss from the forward pass of the model.

Metric Reporter: implements the relevant metric computation and reporting for the models.

Trainer: uses the data handler, model, loss and optimizer to train a model and perform model selection by validating against a holdout set.

Predictor: uses the data handler and model for inference given a test dataset.

Exporter: exports a trained PyTorch model to a Caffe2 graph using ONNX.

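The registry and the JSON-driven task config described above can be sketched in a few lines. This is only an illustration under stated assumptions: `COMPONENT_REGISTRY`, `register`, `build_task`, the toy component classes and every config key below are hypothetical names chosen for the example, not PyText's actual API or config schema.

```python
import json

# Hypothetical global registry: maps a component name to its class.
COMPONENT_REGISTRY = {}


def register(cls):
    """Class decorator that records a component under its class name."""
    COMPONENT_REGISTRY[cls.__name__] = cls
    return cls


@register
class DataHandler:
    # A component is defined by the parameters needed to configure it.
    def __init__(self, train_path, batch_size=32):
        self.train_path = train_path
        self.batch_size = batch_size


@register
class Model:
    def __init__(self, embed_dim=100, hidden_dim=64):
        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim


@register
class Task:
    def __init__(self, children):
        self.children = children


def build_task(config):
    """Instantiate every child component named in the config, then the task."""
    children = {
        name: COMPONENT_REGISTRY[name](**params)
        for name, params in config["task"].items()
    }
    return Task(children)


# Illustrative config; the keys are assumptions, not the schema of Figure 1.
CONFIG_JSON = """
{
  "task": {
    "DataHandler": {"train_path": "train.tsv", "batch_size": 16},
    "Model": {"embed_dim": 50}
  }
}
"""

task = build_task(json.loads(CONFIG_JSON))
print(task.children["DataHandler"].batch_size)  # 16, taken from the config
print(task.children["Model"].hidden_dim)        # 64, the constructor default
```

Because each component is looked up by name, swapping one implementation for another in this sketch only requires registering a new class and changing the config.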

Figure 1. Document Classification Task Config

2.2 Design Overview

The task bootstraps a PyText job and creates all the required components. There are two modes in which a job can be run:

• Train: Trains a model either from scratch or from a saved checkpoint. The task uses the Data Handler to create batch iterators over the training, evaluation and test datasets and passes these iterators, along with the model, optimizer and metric reporter, to the trainer. Subsequently, the trained model is serialized in PyTorch format and also converted to a static Caffe2 graph.

Figure 2. PyText Framework Design

• Predict: Loads a pre-trained model and computes its predictions for a given test set. The task, again, uses the Data Handler to create a batch iterator over the test dataset and passes it with the model to the predictor for inference.

Figure 2 illustrates the overall design of the framework.

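The two job modes can be sketched as a task that hands batch iterators to either a trainer or a predictor. All class and method names here (`ToyDataHandler`, `ToyTrainer`, `Task.run`, and so on) are hypothetical stand-ins, the "model" does no real learning, and serialization and Caffe2 export are left out; the sketch only shows how the components are wired together.

```python
class ToyDataHandler:
    """Stand-in data handler: yields fixed-size batches of raw rows."""

    def __init__(self, rows, batch_size):
        self.rows = rows              # dict: split name -> list of examples
        self.batch_size = batch_size

    def batches(self, split):
        rows = self.rows[split]
        for i in range(0, len(rows), self.batch_size):
            yield rows[i:i + self.batch_size]


class ToyModel:
    """Stand-in model: counts training steps, predicts text-length parity."""

    def __init__(self):
        self.steps = 0

    def predict(self, batch):
        return [len(text) % 2 for text in batch]


class ToyTrainer:
    def train(self, train_iter, eval_iter, model):
        for _ in train_iter:
            model.steps += 1          # a real trainer would run forward/backward here
        return model


class ToyPredictor:
    def predict(self, test_iter, model):
        return [p for batch in test_iter for p in model.predict(batch)]


class Task:
    """Creates the iterators and dispatches to the trainer or the predictor."""

    def __init__(self, data_handler, model, trainer, predictor):
        self.data_handler = data_handler
        self.model = model
        self.trainer = trainer
        self.predictor = predictor

    def run(self, mode):
        dh = self.data_handler
        if mode == "train":
            return self.trainer.train(dh.batches("train"), dh.batches("eval"), self.model)
        if mode == "predict":
            return self.predictor.predict(dh.batches("test"), self.model)
        raise ValueError(f"unknown mode: {mode}")


rows = {"train": ["a", "bb", "ccc", "dddd", "eeeee"], "eval": ["x"], "test": ["hi", "hey"]}
task = Task(ToyDataHandler(rows, batch_size=2), ToyModel(), ToyTrainer(), ToyPredictor())
trained = task.run("train")
predictions = task.run("predict")
print(trained.steps)   # 3 batches of at most 2 training rows
print(predictions)     # [0, 1]
```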

3 MODELING SUPPORT

We now discuss the native support for building and extending models in PyText.

3.1 Terminology

Module: is a reusable component that is implemented without any knowledge of which model it will be used in. It defines a clear input and output interface such that it can be plugged into another module or model.

Model: has a one-to-one mapping with a task. Each model can be made up of a combination of modules for running a training or prediction job.

3.2 Model Abstraction

PyText provides a simple, easily extensible model abstraction. We break up a single-task model into Token Embedding, Representation, Decoder and Output layers, each of which is configurable. Further, each module can be saved and loaded individually to be reused in other models.

Token Embedding: converts a batch of numericalized tokens into a batch of vector embeddings for each token. It can be configured to use embeddings of a number of styles: pretrained word-based, trainable word-based, character-based with CNN and highway networks (Kim et al., 2016), pretrained deep contextual character-based (e.g., ELMo (Peters et al., 2018)), token-level gazetteer features or morphology-based (e.g., capitalization).

Representation: processes a batch of embedded tokens into a representation of the input. What it emits as output depends on the task, e.g., the representation of the document for a text classification task will differ from that for a word tagging task. Logically, this part of the model should implement the sub-network such that its output can be interpreted as features over the input. Examples of the representations present in PyText are bidirectional LSTM and CNN representations.

Decoder: is responsible for generating logits from the input representation. Logically this part of the model should implement the sub-network that generates model output over the features learned by the representation.

Output Layer: generates the prediction and, when a label or ground truth is provided, the loss.

These modules compose the base model implementation; they can be easily extended for more complicated architectures.

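The four-stage decomposition can be illustrated with a toy classifier written in plain Python. This is a sketch under stated assumptions, not PyText code: the class names are hypothetical, mean pooling stands in for the bidirectional LSTM or CNN representation, and the weights are fixed rather than trained. In practice each stage would presumably be a PyTorch module.

```python
import math


class TokenEmbedding:
    """Maps each token id to a fixed vector via a lookup table."""

    def __init__(self, table):
        self.table = table

    def __call__(self, batch):
        return [[self.table[tok] for tok in sent] for sent in batch]


class MeanPoolRepresentation:
    """Averages token vectors into one feature vector per sentence."""

    def __call__(self, embedded):
        return [[sum(col) / len(sent) for col in zip(*sent)] for sent in embedded]


class LinearDecoder:
    """Produces one logit per class from the features."""

    def __init__(self, weights):
        self.weights = weights        # one weight row per class

    def __call__(self, features):
        return [[sum(w * x for w, x in zip(row, f)) for row in self.weights]
                for f in features]


class ClassificationOutputLayer:
    """Turns logits into predictions and, if labels are given, a mean cross-entropy loss."""

    def __call__(self, logits, labels=None):
        preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
        if labels is None:
            return preds, None
        losses = [math.log(sum(math.exp(v) for v in row)) - row[y]
                  for row, y in zip(logits, labels)]
        return preds, sum(losses) / len(losses)


class Model:
    """Chains the four configurable stages of a single-task model."""

    def __init__(self, embedding, representation, decoder, output_layer):
        self.embedding = embedding
        self.representation = representation
        self.decoder = decoder
        self.output_layer = output_layer

    def __call__(self, batch, labels=None):
        logits = self.decoder(self.representation(self.embedding(batch)))
        return self.output_layer(logits, labels)


model = Model(
    TokenEmbedding({0: [1.0, 0.0], 1: [0.0, 1.0]}),
    MeanPoolRepresentation(),
    LinearDecoder([[1.0, 0.0], [0.0, 1.0]]),
    ClassificationOutputLayer(),
)
batch = [[0, 0, 1], [1, 1]]           # two sentences of token ids
preds, loss = model(batch, labels=[0, 1])
print(preds)                          # [0, 1]
```

Since each stage only depends on the output shape of the previous one, replacing `MeanPoolRepresentation` with another representation leaves the rest of the sketch unchanged.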

3.3 Multi-task Model Training

PyText supports multi-task training (Collobert & Weston, 2008), which optimizes multiple tasks jointly, as a first-class citizen. We build a multi-task model by allowing parameter sharing between modules of multiple single-task models. We use the single-task model abstraction discussed in Section 3.2 to define the tasks and let the user declare which modules of those single tasks should be shared. This enables training a model with one or more input representations jointly against multiple tasks.

Multi-task models make the following assumptions:

• If there are n tasks in the multi-task model setup then there must be n data sources containing data for one task each.

• The single task scenario must be implemented for it to be reused for the multi-task setup.

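Parameter sharing under these two assumptions can be sketched as several single-task models holding a reference to the same module object, each paired with its own data source. The names below are hypothetical, the counters stand in for real parameter updates, and the round-robin batch schedule is an assumption made for the example; the text does not specify how batches from different tasks are interleaved.

```python
class SharedRepresentation:
    """Stand-in for a representation module shared across tasks."""

    def __init__(self):
        self.updates = 0


class SingleTaskModel:
    """Stand-in single-task model: a shared representation plus its own decoder."""

    def __init__(self, name, representation):
        self.name = name
        self.representation = representation
        self.decoder_updates = 0      # task-specific parameters

    def train_step(self, batch):
        # Every task's step updates the shared module and its own decoder.
        self.representation.updates += 1
        self.decoder_updates += 1


shared = SharedRepresentation()
models = {
    "doc_classification": SingleTaskModel("doc_classification", shared),
    "word_tagging": SingleTaskModel("word_tagging", shared),
}

# Assumption 1: n tasks require n data sources, one per task.
data_sources = {
    "doc_classification": [["doc-batch-1"], ["doc-batch-2"]],
    "word_tagging": [["tag-batch-1"], ["tag-batch-2"]],
}
assert models.keys() == data_sources.keys()

# Alternate between tasks one batch at a time (an assumed schedule).
for batches in zip(*(data_sources[name] for name in models)):
    for name, batch in zip(models, batches):
        models[name].train_step(batch)

print(shared.updates)                            # 4: updated by both tasks
print(models["word_tagging"].decoder_updates)    # 2: updated by its own task only
```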

Figure 3. Joint document classification and word tagging model

3.3.1 Multi-task Model Examples

PyText provides the flexibility of building any multi-task model architecture with the appropriate model configuration, if the two assumptions listed above are satisfied. The examples below give a flavor of two sample model architectures built with PyText for joint learning against more than one task.

Figure 3 illustrates a model that learns a shared document representation for document classification and word tagging tasks. This model is useful for natural language understanding where given a sentence, we want to predict the intent behind it and tag the slots in the sentence. Jointly optimizing for two tasks helps the model learn a robust sentence representation for the two tasks. Further, we can use this pre-trained sentence representation for other tasks where training data is scarce.

Figure 4 illustrates a model that learns document and query representations using query-document relevance and individual query and document classification tasks. This is often used in information retrieval where, given a query and a document, we want to predict their relevance; but we also add query and document classification tasks to increase robustness of learned representations.

3.4 Model Zoo

PyText models are focused on NLP tasks that can be configured with a variety of modules. We enumerate here the classes of models that are currently supported.

Figure 4. Joint query-document relevance and document classification model

• Text Classification: classifies a sentence or a document into an appropriate category. PyText includes reference implementations of Bidirectional LSTM (Schuster & Paliwal, 1997) with Self-Attention (Lin et al., 2017) and Convolutional Neural Network (Kim, 2014) models for text classification.

• Word Tagging: labels word sequences, i.e., classifies each word in a sequence into an appropriate category. Common examples of such tasks include Part-of-Speech (POS) tagging, Named Entity Recognition (NER) and Slot Filling in spoken language understanding. PyText contains reference implementations of Bidirectional LSTM with Slot-Attention and Bidirectional Sequential Convolutional Neural Network (Vu, 2016) for word tagging.

• Semantic Parsing: maps a natural language sentence into a formal representation of its meaning. PyText provides a reference implementation of Recurrent Neural Network Grammars (Dyer et al., 2016; Gupta et al., 2018) for semantic parsing.

• Language Modeling: assigns a probability to a sequence of words (sentence) in a language. It also assigns a probability for the likelihood of a given word to follow a sequence of words. PyText provides a reference implementation for a stacked LSTM Language Model (Mikolov et al., 2010).

• Joint Models: We utilize the multi-task training support illustrated earlier to fuse and train models for two or more of the tasks mentioned here and optimize their parameters jointly.

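The two probabilities mentioned under Language Modeling are linked by the chain rule. Writing a sentence as $w_1, \dots, w_T$ (notation introduced here for illustration, not taken from the original), the probability of the whole sequence factors into next-word probabilities:

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$

A model that predicts the next word given the preceding words therefore also scores complete sentences.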

Original paper: https://arxiv.org/pdf/1812.08729v1