Automatic differentiation in PyTorch
Luca Antiga, OROBIX Srl
Adam Lerer, Facebook AI Research
Abstract
In this article, we describe an automatic differentiation module of PyTorch — a library designed to enable rapid research on machine learning models. It builds upon a few projects, most notably Lua Torch, Chainer, and HIPS Autograd [4], and provides a high performance environment with easy access to automatic differentiation of models executed on different devices (CPU and GPU). To make prototyping easier, PyTorch does not follow the symbolic approach used in many other deep learning frameworks, but focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead. Note that this preprint is a draft of certain sections from an upcoming paper covering all PyTorch features.
1 Background
PyTorch, like most other deep learning libraries, supports reverse-mode [6] automatic differentiation [2] of scalar functions (or vector-Jacobian products of functions with multiple outputs), the most important form of automatic differentiation for deep learning applications, which usually differentiate a single scalar loss.
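The vector-Jacobian product mentioned above can be seen directly in the API: calling backward() on a non-scalar output requires the vector to contract the Jacobian with. A minimal sketch, using today's torch API (the Variable wrapper described later in this paper has since been merged into Tensor):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * x                      # non-scalar output; Jacobian J = diag(2x)
v = torch.tensor([1.0, 1.0])
y.backward(v)                  # reverse mode computes the vector-Jacobian product v^T J
print(x.grad)                  # tensor([2., 4.])
```

Differentiating a scalar loss is the special case where v is the scalar 1.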
Within this domain, PyTorch’s support for automatic differentiation follows in the footsteps of Chainer, HIPS autograd [4], and twitter-autograd (twitter-autograd was itself a port of HIPS autograd to Lua). These libraries have two distinguishing features:
• Dynamic, define-by-run execution. A dynamic framework defines the function to be differentiated simply by running the desired computation, as opposed to specifying a static graph structure which is differentiated symbolically ahead of time and then run many times. This permits users to use any host-language features they want (e.g., arbitrary control flow constructs), at the cost of requiring differentiation to be carried out every iteration. Dynamic execution distinguishes PyTorch from static frameworks like TensorFlow [1], Caffe, etc.
• Immediate, eager execution. An eager framework runs tensor computations as it encounters them; it avoids ever materializing a “forward graph”, recording only what is necessary to differentiate the computation. This stands in contrast to DyNet [5], which employs lazy evaluation, literally rebuilding the forwards and backwards graph every training iteration. Immediate execution allows CPU and GPU computation to be pipelined, but gives up the opportunity for whole-network optimization and batching.
Although PyTorch’s support for automatic differentiation was heavily inspired by its predecessors (especially twitter-autograd and Chainer), it introduces some novel design and implementation choices, which make it one of the fastest implementations among automatic differentiation libraries supporting this kind of dynamic eager execution:
Figure 1: An example use of PyTorch’s automatic differentiation module (torch.autograd).
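The figure itself is not reproduced in this extract. A minimal example in the same spirit, with illustrative names and shapes (one step of a toy recurrent cell; Variable is a deprecated alias for Tensor in modern PyTorch):

```python
import torch
from torch.autograd import Variable  # deprecated alias for Tensor in modern PyTorch

# One step of a toy RNN cell, written as ordinary imperative tensor code.
x = Variable(torch.randn(1, 10))
prev_h = Variable(torch.randn(1, 20))
W_h = Variable(torch.randn(20, 20), requires_grad=True)
W_x = Variable(torch.randn(20, 10), requires_grad=True)

i2h = torch.mm(W_x, x.t())
h2h = torch.mm(W_h, prev_h.t())
next_h = (i2h + h2h).tanh()
next_h.sum().backward()   # gradients accumulate into W_h.grad and W_x.grad
```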
2 Interface
Figure 1 gives a simple example of automatic differentiation in PyTorch. You write code as if you were executing tensor operations directly; however, instead of operating on Tensors (PyTorch’s equivalent of NumPy’s ndarrays), the user manipulates Variables, which store the extra metadata necessary for AD. Variables support a backward() method, which computes the gradient of all input Variables involved in the computation of this quantity.
By default, these gradients are accumulated in the grad field of input variables, a design inherited from Chainer. However, PyTorch also provides a HIPS autograd-style functional interface for computing gradients: the function torch.autograd.grad(f(x, y, z), (x, y)) computes the derivative of f w.r.t. x and y only (no gradient is computed for z). Unlike the Chainer-style API, this call doesn’t mutate .grad attributes; instead, it returns a tuple containing the gradient w.r.t. each of the inputs requested in the call.
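The contrast between the two interfaces can be sketched as follows (tensor values are illustrative):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.tensor([3.0, 4.0], requires_grad=True)
z = torch.tensor([5.0, 6.0], requires_grad=True)

f = (x * y * z).sum()
gx, gy = torch.autograd.grad(f, (x, y))   # derivative w.r.t. x and y only

print(gx)       # y * z = tensor([15., 24.])
print(x.grad)   # None: the functional interface does not mutate .grad
```

Had we called f.backward() instead, the gradients for x, y, and z would all have been accumulated into their .grad fields.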
Variable flags It is important not to compute derivatives when they are not needed. PyTorch, following Chainer, provides two facilities for excluding subgraphs from derivative computation: the “requires grad” and “volatile” flags. These flags obey the following rules: if any input Variable is volatile, the output is volatile. Otherwise, if any input Variable requires grad, the output requires grad. Additionally, having volatile set implies that “requires grad” is unset. If an operation wouldn’t require grad, the derivative closure is not even instantiated. The volatile flag makes it easy to disable differentiation (e.g., when running a model for inference, one does not have to set “requires grad” to false on each parameter).
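The propagation rule for requires_grad can be demonstrated directly. Note that the volatile flag was removed in later PyTorch versions; the torch.no_grad() context manager now plays the same role for inference, as the sketch below assumes:

```python
import torch

a = torch.randn(3, requires_grad=True)
b = torch.randn(3)               # requires_grad defaults to False

c = a + b
assert c.requires_grad           # any input requiring grad makes the output require grad

# volatile has since been removed; torch.no_grad() plays the same role for inference
with torch.no_grad():
    d = a + b
assert not d.requires_grad       # no derivative closure is recorded inside the block
```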
Hooks One downside of automatic differentiation is that the differentiation is relatively opaque to users: unlike the forward pass, which is invoked by user-written Python code, the differentiation is carried out from library code, which users have little visibility into. To allow users to inspect gradients, we provide a hooks mechanism to allow users to observe when the backwards of a function is invoked: x.register_hook(lambda grad: print(grad)). This code registers a hook on x, which prints the gradient of x whenever it is computed. A hook callback can also return a new gradient which is used in place of the original gradient; this capability has proven to be useful for meta-learning and reinforcement learning.
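Both uses of hooks (observing a gradient, and replacing it) can be sketched in a few lines; the values below are illustrative:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
seen = []
x.register_hook(lambda grad: seen.append(grad.clone()))  # observe the gradient

w = torch.tensor([1.0], requires_grad=True)
w.register_hook(lambda g: g * 0)                         # return a replacement gradient

((x * 3).sum() + (w * 5).sum()).backward()
print(seen[0])   # tensor([3., 3.]): the gradient of the sum w.r.t. x
print(w.grad)    # tensor([0.]): the hook's return value replaced the original 5.0
```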
Extensions PyTorch users can create custom differentiable operations by specifying a pair of forward and backward functions in Python. The forward function computes the operation, while the backward method extends the vector-Jacobian product. This can be used to make arbitrary Python libraries (e.g., Scipy [3]) differentiable (critically taking advantage of PyTorch’s zero-copy NumPy conversion).
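A sketch of such an extension follows. The paper-era API defined forward and backward as instance methods; the code below assumes the current staticmethod form of torch.autograd.Function, with a toy squaring op standing in for a real external library:

```python
import torch

class Square(torch.autograd.Function):
    """A toy differentiable op defined by an explicit forward/backward pair."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x   # extend the vector-Jacobian product: d(x^2)/dx = 2x

x = torch.tensor([3.0], requires_grad=True)
Square.apply(x).sum().backward()
print(x.grad)   # tensor([6.])
```

A real extension would call into the external library (e.g., SciPy) inside forward, converting to and from NumPy via the zero-copy bridge mentioned above.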
3 Implementation
Internally, a Variable is simply a wrapper around a Tensor that also holds a reference to a graph of Function objects. This graph is an immutable, purely functional representation of the derivative of the computed function; Variables are simply mutable pointers to this graph (they are mutated when an in-place operation occurs; see Section 3.1).
A Function can be thought of as a closure that has all the context necessary to compute vector-Jacobian products. It accepts the gradients of the outputs, and returns the gradients of the inputs (formally, the left product, including the term for its respective operation). A graph of Functions is a single-argument closure that takes in a left product and multiplies it by the derivatives of all operations it contains. The left products passed around are themselves Variables, making the evaluation of the graph differentiable.
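This graph of Function objects is visible from Python through the grad_fn attribute and its next_functions links (internal attributes, shown here only for illustration):

```python
import torch

x = torch.randn(2, requires_grad=True)
z = x.exp().sum()

# Each node is a Function holding the context for its vector-Jacobian product;
# next_functions links a node to the Functions that produced its inputs.
print(type(z.grad_fn).__name__)                        # the sum's backward node
print(type(z.grad_fn.next_functions[0][0]).__name__)   # the exp's backward node
```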
Memory management The main use case for PyTorch is training machine learning models on GPU. As one of the biggest limitations of GPUs is low memory capacity, PyTorch takes great care to make sure that all intermediate values are freed as soon as they become unneeded. Indeed, Python is well-suited for this purpose, because it is reference counted by default (using a garbage collector only to break cycles).
PyTorch’s Variable and Function must be designed to work well in a reference-counted regime. For example, a Function records pointers to the Function which consumes its result, so that a Function subgraph is freed when its retaining output Variable becomes dead. This is the opposite of the conventional ownership for closures, where a closure retains the closures it invokes (i.e., a pointer to the Function which produced its result).
Another challenge is avoiding reference cycles. A naïve implementation of automatic differentiation can easily introduce such cycles (e.g. when a differentiable function would like to save a reference to its output). PyTorch breaks them by recording not a full-fledged variable, but instead a “saved variable”, which omits a pointer to the Function in such cases.
C++ operators We have found that even though it is possible to express all operations in Python using the extension API, they suffer from high interpreter overhead. Moving operators to C++ reduces the overhead and lowers the latency of a single differentiable operation dispatched from Python to as little as 3.4 µs, compared to 1.7 µs for a tensor operation. An additional benefit is that one can have multiple threads executing them in parallel (in contrast to Python, which limits parallelism due to the GIL). This is especially important in the context of multiple GPUs, which cannot be saturated by a single CPU thread.
3.1 Supporting in-place operations
Frequently, users of PyTorch wish to perform operations in-place on a tensor, so as to avoid allocating a new tensor when it’s known to be unnecessary. Intuitively, an in-place operation is equivalent to its corresponding non-inplace operation, except that the Variable which is modified in-place has its computation history “rebased” to point to the derivative of the in-place operator, instead of its previous Function (the computation history always remains purely functional). However, these in-place operations interact with autograd in a subtle way.
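The “rebasing” of a Variable's history can be observed through the grad_fn attribute (an internal detail, shown only to illustrate the mechanism):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.clone()
print(type(y.grad_fn).__name__)   # the clone op's backward node
y.add_(1)                         # in-place op rebases y's history...
print(type(y.grad_fn).__name__)   # ...so grad_fn now points at the add's backward node
```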
Invalidation An in-place operation can invalidate data that is needed to compute derivatives. Consider the following example:
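The code listing is missing from this extract; from the surrounding description (an in-place update of y = tanh(x)), it was presumably along these lines:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.tanh()    # tanh saves its output y for the backward pass
y.add_(3)       # in-place update bumps the version counter of y's storage

raised = False
try:
    y.sum().backward()
except RuntimeError:
    raised = True   # the saved version no longer matches the current one
print(raised)       # True
```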

This program instructs PyTorch to perform an in-place operation on $y$; however, this is unsound if $y$, whose value is $\tanh(x)$, was saved so that it could be used in the backwards computation (recall that $\tanh'(x) = 1 - \tanh^2(x)$).
Since making a copy of $y$ when saving it would be inefficient, PyTorch instead fails at runtime when differentiating this program. Every underlying storage of a variable is associated with a version counter, which tracks how many in-place operations have been applied to the storage. When a variable is saved, we record its version counter at that time. When an attempt to use the saved variable is made, an error is raised if the saved version doesn’t match the current one.
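The version counter is visible from Python via the tensor's _version attribute (an internal implementation detail, used here only to illustrate the mechanism):

```python
import torch

t = torch.ones(3)
v = t._version          # internal attribute exposing the storage's version counter
t.add_(1)
print(t._version - v)   # 1: one in-place operation was applied to the storage
```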
Aliasing PyTorch supports nontrivial aliasing between variables; operations like transpose and narrow produce new tensors with new sizes and strides which share storage with the original tensors. The problem with aliasing is that it can require nontrivial transformations on computation histories of many variables. Consider the following example:
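The code listing is missing from this extract; based on the text, it presumably created a view of $x$ and then mutated $x$ in place, along these lines. (Current PyTorch versions have since implemented the rebasing described under future work below and accept such programs, so the snippet only demonstrates the aliasing itself.)

```python
import torch

base = torch.ones(4, requires_grad=True)
x = base * 1       # non-leaf tensor, so in-place ops on it are allowed
y = x[:2]          # y is a view sharing storage with x
x.add_(3)          # also changes the elements visible through y
print(y.detach())  # tensor([4., 4.])
```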

Ordinarily, an in-place operation on $x$ would only affect the computational history of $x$. However, in this case, the in-place addition to $x$ also causes some elements of $y$ to be updated; thus, $y$’s computational history has changed as well. Supporting this case is fairly nontrivial, so PyTorch rejects this program, using an additional field in the version counter (see the Invalidation paragraph) to determine that the data is shared.
In future work, we are looking to relax this restriction. The challenge is that there may be arbitrarily many aliases of a variable, so it is not feasible to go through each one-by-one and update their computation histories. However, it may be possible to lazily initialize a computation history, materializing it only when a computation produces a result that is not aliased to the original variable.
