[Paper Translation] Automatic Differentiation in PyTorch


Original paper: https://openreview.net/pdf?id=BJJsrmfCZ


Automatic differentiation in PyTorch

Luca Antiga OROBIX Srl

Adam Lerer Facebook AI Research

Abstract

In this article, we describe an automatic differentiation module of PyTorch — a library designed to enable rapid research on machine learning models. It builds upon a few projects, most notably Lua Torch, Chainer, and HIPS Autograd [4], and provides a high performance environment with easy access to automatic differentiation of models executed on different devices (CPU and GPU). To make prototyping easier, PyTorch does not follow the symbolic approach used in many other deep learning frameworks, but focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead. Note that this preprint is a draft of certain sections from an upcoming paper covering all PyTorch features.

1 Background

PyTorch, like most other deep learning libraries, supports reverse-mode [6] automatic differentiation [2] of scalar functions (or vector-Jacobian products of functions with multiple outputs), the most important form of automatic differentiation for deep learning applications, which usually differentiate a single scalar loss.
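
For concreteness, the per-operation step of reverse mode can be written as a vector-Jacobian product; the notation below is ours, added for illustration, and does not appear in the paper.

```latex
% Illustrative notation (ours, not the paper's): the chain rule step that each
% backward operation performs, i.e. one vector-Jacobian product.
\[
  \underbrace{\frac{\partial \ell}{\partial x}}_{\text{gradient returned to inputs}}
  \;=\;
  \underbrace{\frac{\partial \ell}{\partial y}}_{\text{incoming gradient of outputs}}
  \cdot
  \underbrace{\frac{\partial y}{\partial x}}_{\text{Jacobian of the operation}}
\]
% Because the loss \ell is a single scalar, one backward sweep of such products
% yields the gradient with respect to every input.
```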

Within this domain, PyTorch’s support for automatic differentiation follows in the steps of Chainer, HIPS autograd [4] and twitter-autograd (twitter-autograd was, itself, a port of HIPS autograd to Lua). These libraries have two distinguishing features:

• Dynamic, define-by-run execution. A dynamic framework defines the function to be differentiated simply by running the desired computation, as opposed to specifying a static graph structure which is differentiated symbolically ahead of time and then run many times. This permits users to use any host-language features they want (e.g., arbitrary control flow constructs), at the cost of requiring differentiation to be carried out every iteration. Dynamic execution distinguishes PyTorch from static frameworks like TensorFlow [1], Caffe, etc. (A minimal code sketch illustrating this follows the list.)

• Immediate, eager execution. An eager framework runs tensor computations as it encounters them; it avoids ever materializing a “forward graph”, recording only what is necessary to differentiate the computation. This stands in contrast to DyNet [5], which employs lazy evaluation, literally rebuilding the forwards and backwards graph every training iteration. Immediate execution allows CPU and GPU computation to be pipelined, but gives up the opportunity for whole-network optimization and batching.
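
As a concrete illustration of define-by-run, eager execution, the sketch below (ours, not taken from the paper) uses ordinary Python control flow whose trip count depends on the data; the graph used for differentiation is simply whatever actually ran. It is written against the current tensor API rather than the Variable wrapper described later in the text.

```python
import torch

# Define-by-run: the differentiated computation is recorded while this code
# executes, so data-dependent Python control flow needs no special graph ops.
x = torch.randn(3, requires_grad=True)

y = x
while y.norm() < 10:   # ordinary host-language control flow, evaluated eagerly
    y = y * 2          # each doubling is recorded as it happens

y.sum().backward()     # differentiate exactly the computation that ran
print(x.grad)          # every entry equals 2**k, where k doublings were taken
```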

Although PyTorch’s support for automatic differentiation was heavily inspired by its predecessors (especially twitter-autograd and Chainer), it introduces some novel design and implementation choices, which make it one of the fastest implementations among automatic differentiation libraries supporting this kind of dynamic eager execution:

Figure 1: An example use of PyTorch’s automatic differentiation module (torch.autograd).

2 Interface

Figure 1 gives a simple example of automatic differentiation in PyTorch. You write code as if you were executing tensor operations directly; however, instead of operating on Tensors (PyTorch’s equivalent of Numpy’s nd-arrays), the user manipulates Variables, which store extra metadata necessary for AD. Variables support a backward() method, which computes the gradient of all input Variables involved in computation of this quantity.

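Since only the caption of Figure 1 is reproduced above, the following is a plausible minimal example in the spirit of that figure (our sketch, using the pre-0.4 Variable API the text describes):

```python
import torch
from torch.autograd import Variable  # Variable wraps a Tensor with AD metadata

x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2
z = (y * y).sum()   # a scalar quantity to differentiate

z.backward()        # compute dz/dx for every input Variable involved
print(x.grad)       # 2 * (x + 2): a 2x2 tensor filled with 6
```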

By default, these gradients are accumulated in the grad field of input variables, a design inherited from Chainer. However, PyTorch also provides a HIPS autograd-style functional interface for computing gradients: the function torch.autograd.grad(f(x, y, z), (x, y)) computes the derivative of f w.r.t. x and y only (no gradient is computed for z). Unlike the Chainer-style API, this call doesn’t mutate .grad attributes; instead, it returns a tuple containing the gradient w.r.t. each of the inputs requested in the call.
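
A sketch of the functional interface (ours; the argument spelling follows the current torch.autograd.grad signature):

```python
import torch
from torch.autograd import Variable

x = Variable(torch.randn(3), requires_grad=True)
y = Variable(torch.randn(3), requires_grad=True)
z = Variable(torch.randn(3), requires_grad=True)

out = (x * y * z).sum()

# Returns a tuple of gradients for exactly the requested inputs (x and y);
# no gradient is computed for z, and no .grad attribute is mutated.
dx, dy = torch.autograd.grad(out, (x, y))
print(dx, dy)
print(x.grad)   # still None: the functional interface leaves .grad untouched
```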

Variable flags It is important to not compute derivatives when they are not needed. PyTorch, following Chainer, provides two facilities for excluding subgraphs from derivative computation: the “requires_grad” and “volatile” flags. These flags obey the following rules: if any input Variable is volatile, the output is volatile. Otherwise, if any input Variable requires grad, the output requires grad. Additionally, having volatile set implies that “requires_grad” is unset. If an operation wouldn’t require grad, the derivative closure is not even instantiated. The volatile flag makes it easy to disable differentiation (e.g., when running a model for inference, one does not have to set “requires_grad” to false on each parameter).
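
A sketch of the flag semantics (ours), written against the pre-0.4 Variable API the paper describes; in later releases volatile was replaced by the torch.no_grad() context manager:

```python
import torch
from torch.autograd import Variable

# requires_grad propagates forward: an output requires grad
# if any of its inputs does.
data = Variable(torch.randn(5, 5))                  # requires_grad=False by default
w = Variable(torch.randn(5, 5), requires_grad=True)
h = data.mm(w)
print(h.requires_grad)                    # True

# volatile dominates and implies requires_grad=False for the whole subgraph,
# which is convenient for inference: no derivative closures are built.
inp = Variable(torch.randn(5, 5), volatile=True)
out = inp.mm(w)
print(out.requires_grad, out.volatile)    # False True
```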

Hooks One downside of automatic differentiation is that the differentiation is relatively opaque to users: unlike the forward pass, which is invoked by user-written Python code, the differentiation is carried out from library code, which users have little visibility into. To allow users to inspect gradients, we provide a hooks mechanism to allow users to observe when the backwards of a function is invoked: x.register_hook(lambda grad: print(grad)). This code registers a hook on x, which prints the gradient of x whenever it is computed. A hook callback can also return a new gradient which is used in place of the original gradient; this capability has proven to be useful for meta-learning and reinforcement learning.
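
A sketch of the hook mechanism (ours; register_hook is as described in the text):

```python
import torch
from torch.autograd import Variable

x = Variable(torch.randn(3), requires_grad=True)

# Observe the gradient of x whenever the backward pass computes it.
handle = x.register_hook(lambda g: print('grad of x:', g))
(x * 3).sum().backward()      # the hook prints a tensor of threes
handle.remove()               # hooks can be removed via the returned handle

# A hook may instead return a replacement gradient, e.g. simple clipping:
x.register_hook(lambda g: g.clamp(-0.5, 0.5))
(x * 3).sum().backward()      # the clipped gradient is what gets accumulated
```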

Extensions PyTorch users can create custom differentiable operations by specifying a pair of forward and backward functions in Python. The forward function computes the operation, while the backward method extends the vector-Jacobian product. This can be used to make arbitrary Python libraries (e.g., Scipy [3]) differentiable (critically taking advantage of PyTorch’s zero-copy NumPy conversion).

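A sketch of a custom differentiable op that routes its forward computation through NumPy (ours; it uses the current Function API with static forward/backward methods and a ctx object, which differs slightly in spelling from the interface the paper-era code used):

```python
import numpy as np
import torch
from torch.autograd import Function

class NumpyExp(Function):
    """exp(x) computed in NumPy, made differentiable by supplying backward()."""

    @staticmethod
    def forward(ctx, input):
        # Hand off to NumPy and back; both conversions share memory (zero-copy).
        result = torch.from_numpy(np.exp(input.detach().numpy()))
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        # Extend the vector-Jacobian product: d/dx exp(x) = exp(x).
        (result,) = ctx.saved_tensors
        return grad_output * result

x = torch.randn(4, requires_grad=True)
y = NumpyExp.apply(x)
y.sum().backward()
print(torch.allclose(x.grad, y.detach()))   # True: grad of sum(exp(x)) is exp(x)
```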

3 Implementation

Internally, a Variable is simply a wrapper around a Tensor that also holds a reference to a graph of Function objects. This graph is an immutable, purely functional representation of the derivative of the computed function; Variables are simply mutable pointers to this graph (they are mutated when an in-place operation occurs; see Section 3.1).

A Function can be thought of as a closure that has all the context necessary to compute vector-Jacobian products: it accepts the gradients of the outputs and returns the gradients of the inputs (formally, the left product including the term for its respective operation). A graph of Functions is a single-argument closure that takes in a left product and multiplies it by the derivatives of all the operations it contains. The left products passed around are themselves Variables, making the evaluation of the graph differentiable.
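
Because the quantities flowing through the backward graph are themselves Variables, the backward pass can itself be differentiated. A sketch (ours, using the current API; create_graph asks autograd to record the backward computation):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# First derivative: keep the backward graph so it can be differentiated again.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)   # 3 * x**2 = 12
# Second derivative: differentiate the result of the first backward pass.
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)                 # 6 * x    = 12
print(dy_dx.item(), d2y_dx2.item())                        # 12.0 12.0
```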

Memory management The main use case for PyTorch is training machine learning models