PyTorch: An Imperative Style, High-Performance Deep Learning Library
Abstract
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs.
In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
1 Introduction
With the increased interest in deep learning in recent years, there has been an explosion of machine learning tools. Many popular frameworks such as Caffe [1], CNTK [2], TensorFlow [3], and Theano [4] construct a static dataflow graph that represents the computation and which can then be applied repeatedly to batches of data. This approach provides visibility into the whole computation ahead of time, and can theoretically be leveraged to improve performance and scalability. However, it comes at the cost of ease of use, ease of debugging, and flexibility of the types of computation that can be represented.
Prior work has recognized the value of dynamic eager execution for deep learning, and some recent frameworks implement this define-by-run approach, but do so either at the cost of performance (Chainer [5]) or using a less expressive, faster language (Torch [6], DyNet [7]), which limits their applicability.
However, with careful implementation and design choices, dynamic eager execution can be achieved largely without sacrificing performance. This paper introduces PyTorch, a Python library that performs immediate execution of dynamic tensor computations with automatic differentiation and GPU acceleration, and does so while maintaining performance comparable to the fastest current libraries for deep learning. This combination has turned out to be very popular in the research community with, for instance, 296 ICLR 2019 submissions mentioning PyTorch.
2 Background
Four major trends in scientific computing have become increasingly important for deep learning.
First, starting in the 1960s, the development of domain-specific languages such as APL [8], MATLAB [9], R [10] and Julia [11] turned multidimensional arrays (often referred to as tensors) into first-class objects supported by a comprehensive set of mathematical primitives (or operators) to manipulate them. Separately, libraries such as NumPy [12], Torch [6], Eigen [13] and Lush [14] made array-based programming productive in general-purpose languages such as Python, Lisp, C++ and Lua.
Second, the development of automatic differentiation [15] made it possible to fully automate the daunting labor of computing derivatives. This made it significantly easier to experiment with different machine learning approaches while still allowing for efficient gradient-based optimization. The autograd [16] package popularized the use of this technique for NumPy arrays, and similar approaches are used in frameworks such as Chainer [5], DyNet [7], Lush [14], Torch [6], Jax [17] and Flux.jl [18].
Third, with the advent of the free software movement, the scientific community moved away from closed proprietary software such as MATLAB [9], and towards the open-source Python ecosystem with packages like NumPy [12], SciPy [19], and Pandas [20]. This fulfilled most of the numerical analysis needs of researchers while allowing them to take advantage of a vast repository of libraries to handle dataset preprocessing, statistical analysis, plotting, and more. Moreover, the openness, interoperability, and flexibility of free software fostered the development of vibrant communities that could quickly address new or changing needs by extending the existing functionality of a library or, if needed, by developing and releasing brand new ones. While there is a rich offering of open-source software for neural networks in languages other than Python, starting with Lush [14] in Lisp, Torch [6] in C++, Objective-C and Lua, EBLearn [21] in C++, and Caffe [1] in C++, the network effects of a large ecosystem such as Python made it an essential skill to jumpstart one's research. Hence, since 2014, most deep learning frameworks converged on a Python interface as an essential feature.
Finally, the availability and commoditization of general-purpose massively parallel hardware such as GPUs provided the computing power required by deep learning methods. Specialized libraries such as cuDNN [22], along with a body of academic work (such as [23] and [24]), produced a set of high-performance reusable deep learning kernels that enabled frameworks such as Caffe [1], Torch7 [25], or TensorFlow [3] to take advantage of these hardware accelerators.
PyTorch builds on these trends by providing an array-based programming model accelerated by GPUs and differentiable via automatic differentiation integrated in the Python ecosystem.
3 Design principles
PyTorch's success stems from weaving previous ideas into a design that balances speed and ease of use. There are four main principles behind our choices:
Be Pythonic. Data scientists are familiar with the Python language, its programming model, and its tools. PyTorch should be a first-class member of that ecosystem. It follows the commonly established design goals of keeping interfaces simple and consistent, ideally with one idiomatic way of doing things. It also integrates naturally with standard plotting, debugging, and data processing tools.
Put researchers first. PyTorch strives to make writing models, data loaders, and optimizers as easy and productive as possible. The complexity inherent to machine learning should be handled internally by the PyTorch library and hidden behind intuitive APIs free of side-effects and unexpected performance cliffs.
Provide pragmatic performance. To be useful, PyTorch needs to deliver compelling performance, although not at the expense of simplicity and ease of use. Trading 10% of speed for a significantly simpler to use model is acceptable; 100% is not. Therefore, its implementation accepts added complexity in order to deliver that performance. Additionally, providing tools that allow researchers to manually control the execution of their code will empower them to find their own performance improvements independent of those that the library provides automatically.
Worse is better [26]. Given a fixed amount of engineering resources, and all else being equal, the time saved by keeping the internal implementation of PyTorch simple can be used to implement additional features, adapt to new situations, and keep up with the fast pace of progress in the field of AI. Therefore it is better to have a simple but slightly incomplete solution than a comprehensive but complex and hard to maintain design.
4 Usability centric design
4.1 Deep learning models are just Python programs
In a surprisingly short amount of time, machine learning grew from recognizing individual digits [27] into autonomously playing StarCraft [28]. Consequently, the neural networks themselves evolved rapidly from simple sequences of feed-forward layers into incredibly varied numerical programs often composed of many loops and recursive functions. To support this growing complexity, PyTorch foregoes the potential benefits of a graph-metaprogramming based approach to preserve the imperative programming model of Python. This design was pioneered for model authoring by Chainer [5] and DyNet [7]. PyTorch extends this to all aspects of deep learning workflows. Defining layers, composing models, loading data, running optimizers, and parallelizing the training process are all expressed using the familiar concepts developed for general-purpose programming.
This solution ensures that any new potential neural network architecture can be easily implemented with PyTorch. For instance, layers (which in modern machine learning should really be understood as stateful functions with implicit parameters) are typically expressed as Python classes whose constructors create and initialize their parameters, and whose forward methods process an input activation. Similarly, models are usually represented as classes that compose individual layers, but let us state again that nothing forces the user to structure their code in that way. Listing 1 demonstrates how an entire model can be created by composing functionality provided by PyTorch such as 2d convolution, matrix multiplication, dropout, and softmax to classify gray-scale images. Note that linear layers are of course part of the library, but we show an example implementation to highlight how simple it is.
import torch
from torch import nn

class LinearLayer(nn.Module):
    def __init__(self, in_sz, out_sz):
        super().__init__()
        t1 = torch.randn(in_sz, out_sz)
        self.w = nn.Parameter(t1)  # weight matrix
        t2 = torch.randn(out_sz)
        self.b = nn.Parameter(t2)  # bias vector

    def forward(self, activations):
        t = torch.mm(activations, self.w)
        return t + self.b

class FullBasicModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 128, 3)
        self.fc = LinearLayer(128, 10)

    def forward(self, x):
        t1 = self.conv(x)
        t2 = nn.functional.relu(t1)
        # Averaging over the spatial dimensions is an added step so that the
        # (N, 128) result matches the 2-D input torch.mm expects; the listing
        # as extracted passed t1 to self.fc directly.
        t3 = self.fc(t2.mean(dim=(2, 3)))
        return nn.functional.softmax(t3, dim=1)
Listing 1: A custom layer used as a building block for a simple but complete neural network.
This "everything is just a program" philosophy is not limited to just the models, and applies to optimizers and data loaders as well. This facilitates the experimentation of new training techniques. For example, to implement the very popular generative adversarial networks, one needs to specify two separate models (the generator and the discriminator), and two loss functions that depend on both models at the same time. Rigid APIs would struggle with this setup, but the simple design employed in PyTorch easily adapts to this setting as shown in Listing 2.
from torch import optim

# create_discriminator, create_generator, loss, get_noise, real_label and
# fake_label are left undefined here, as in the original listing.
discriminator = create_discriminator()
generator = create_generator()
optimD = optim.Adam(discriminator.parameters())
optimG = optim.Adam(generator.parameters())

def step(real_sample):
    # (1) Update Discriminator
    optimD.zero_grad()  # added: clear gradients left over from earlier calls
    errD_real = loss(discriminator(real_sample), real_label)
    errD_real.backward()
    fake = generator(get_noise())
    # detach() keeps this loss from propagating into the generator
    errD_fake = loss(discriminator(fake.detach()), fake_label)
    errD_fake.backward()
    optimD.step()
    # (2) Update Generator
    optimG.zero_grad()  # added: clear gradients left over from earlier calls
    errG = loss(discriminator(fake), real_label)
    errG.backward()
    optimG.step()
Since PyTorch programs execute eagerly, all the features of Python are available throughout the whole design process. Print statements, standard debuggers, and common visualization tools like matplotlib all work as expected. Users do not have to wait for lengthy compilation before they can start running their programs, and more importantly intermediate computations can be observed to understand how a model works and whether its results are correct.
4.2 Interoperability and extensibility
Easy and efficient interoperability is one of the top priorities for PyTorch because it opens the possibility to leverage the rich ecosystem of Python libraries as part of user programs. Hence, PyTorch allows for bidirectional exchange of data with external libraries. For example, it provides a mechanism to convert between NumPy arrays and PyTorch tensors using the torch.from_numpy() function and the .numpy() tensor method. Similar functionality is also available to exchange data stored using the DLPack [29] format. Note that this exchange happens in both cases without any data copying: objects on both sides only describe how to interpret a memory region which is shared among them. Hence, those operations are actually extremely cheap, and take constant time no matter how large the converted arrays are.
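The idea of two objects describing one shared memory region can be illustrated without PyTorch at all. The following is a minimal sketch using only the Python standard library, where an `array` plays the role of the external library's buffer and a `memoryview` plays the role of the second object; it is an analogy for the mechanism, not PyTorch's implementation.

```python
from array import array

# A contiguous buffer of float64 values, standing in for data owned by one library.
source = array("d", [1.0, 2.0, 3.0, 4.0])

# A memoryview only describes how to interpret the same memory;
# creating it copies nothing, regardless of the buffer size.
view = memoryview(source)

# A write through either object is visible through the other,
# because both refer to the same underlying memory region.
view[0] = 10.0
source[3] = 40.0

first_via_array = source[0]  # value written through the view
last_via_view = view[3]      # value written through the array
```

With PyTorch, the calls named in the text, torch.from_numpy() and .numpy(), behave analogously: a write through the tensor is visible in the NumPy array and vice versa.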
Moreover, many of the critical systems are designed specifically to be extensible. For instance, the automatic differentiation system allows users to add support for custom differentiable functions. To do that users can define a new subclass of torch.autograd.Function that implements forward() and backward() methods, which specify the function and its derivative (or more formally the vector-Jacobian product). Similarly new datasets can be added by subclassing torch.utils.data.Dataset and implementing two methods: __getitem__ (the indexing operator) and __len__ (the length operator), making datasets behave like (possibly lazy) lists. How these work is completely up to the implementer, and many users leverage other Python packages for data loading. The DataLoader class consumes objects conforming to this interface and provides an iterator over the data which takes care of shuffling, batching, parallelization, and management of pinned CUDA memory to improve throughput.
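The two-method dataset interface can be sketched in plain Python. In the example below, `SquaresDataset` and `iterate_batches` are hypothetical names invented for illustration; `iterate_batches` is a toy stand-in that only shuffles and batches, and does not model the parallelization or pinned-memory handling of the real DataLoader.

```python
import random

class SquaresDataset:
    """Hypothetical dataset: item i is the pair (i, i * i), computed lazily."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return index, index * index

def iterate_batches(dataset, batch_size, shuffle=False, seed=0):
    """Toy loader: relies only on len(dataset) and dataset[i]."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

batches = list(iterate_batches(SquaresDataset(5), batch_size=2))
shuffled = list(iterate_batches(SquaresDataset(5), batch_size=2, shuffle=True))
```

Because the loader touches the dataset only through the length and indexing operators, any object that provides them can be used, whatever it does internally to produce an item.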
Most importantly, users are free to replace any component of PyTorch that does not meet the needs or performance requirements of their project. They are all designed to be completely interchangeable, and PyTorch takes great care not to impose any particular solution.
4.3 Automatic differentiation
Since gradient-based optimization is vital to deep learning, PyTorch must be able to automatically compute gradients of models specified by our users, and those can be arbitrary Python programs. However, Python is a dynamic programming language that allows changing most behaviors at runtime, making ahead-of-time source-to-source differentiation cumbersome. Instead, PyTorch uses the operator overloading approach, which builds up a representation of the computed function every time it is executed. In its current implementation [30], PyTorch performs reverse-mode automatic differentiation, which computes the gradient of a scalar output with respect to a multivariate input. Differentiating functions with more outputs than inputs is more efficiently executed using forward-mode automatic differentiation, but this use case is less common for machine learning applications. PyTorch can be easily extended to perform forward-mode differentiation using array-level dual numbers [31, 32].
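To make the operator overloading approach concrete, here is a minimal sketch of reverse-mode differentiation on scalars in plain Python. The `Var` class is a hypothetical toy written for this illustration; it supports only addition and multiplication and shares nothing with PyTorch's actual autograd engine beyond the general idea.

```python
class Var:
    """Scalar value that records how it was computed (toy reverse-mode AD)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (input Var, local derivative of self w.r.t. that input).
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Order the recorded graph topologically, then apply the chain rule
        # from the output back towards the inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad

x = Var(3.0)
y = Var(4.0)
z = x * y + x  # the graph is recorded while this expression executes
z.backward()   # now x.grad holds dz/dx = y + 1 and y.grad holds dz/dy = x
```

The representation of the function is rebuilt on every execution, so ordinary Python control flow around these operations needs no special treatment.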
Another interesting and uncommon feature of our system is that it can differentiate through code employing mutation on tensors, which is one of the basic building blocks of imperative programs. To ensure safety, we have implemented a versioning system for tensors, which lets us track their modifications and ensure that we always use the data we expect. One interesting tradeoff is that while we could utilize techniques like copy-on-write to support arbitrary programs, we chose to not go down this path, as performance-wise it is usually beneficial for the users to rewrite their code to ensure that no copies have to be performed. Hence, while most mutations are benign and can be handled automatically, the really complicated cases result in a user error, which lets them know that they likely want to restructure the program. This allows us to avoid introducing subtle and hard-to-find performance cliffs.
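The versioning idea can be sketched as a counter that is bumped on every in-place update and compared when a saved value is needed again. `TrackedBuffer` and `SavedForBackward` below are hypothetical names for this illustration, and the logic is a simplified assumption about how such a check could work, not a description of PyTorch's internals.

```python
class TrackedBuffer:
    """Toy tensor: a list of floats plus a version counter bumped on mutation."""

    def __init__(self, values):
        self.values = list(values)
        self.version = 0

    def add_(self, amount):
        # In-place update: the data changes, so the version changes too.
        self.values = [v + amount for v in self.values]
        self.version += 1
        return self

class SavedForBackward:
    """Remembers a buffer together with the version seen when it was saved."""

    def __init__(self, buffer):
        self.buffer = buffer
        self.saved_version = buffer.version

    def unpack(self):
        if self.buffer.version != self.saved_version:
            raise RuntimeError("buffer needed for backward was modified in place")
        return self.buffer.values

x = TrackedBuffer([1.0, 2.0])
saved = SavedForBackward(x)  # e.g. an operation that needs x for its gradient
unchanged = saved.unpack()   # fine: x has not been mutated since it was saved

x.add_(1.0)                  # mutation after saving
try:
    saved.unpack()
    detected = False
except RuntimeError:
    detected = True          # the stale use is reported instead of silently copied
```

Raising an error here, rather than copying the data behind the user's back, mirrors the tradeoff described above: the user is told to restructure the program instead of paying a hidden cost.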
5 Performance focused implementation
Running deep learning algorithms efficiently from a Python interpreter is notoriously challenging: for instance, the global interpreter lock [33] effectively ensures that only one of any number of concurrent threads is running at any given time. Deep learning frameworks based on the construction of a static data-flow graph sidestep this problem by deferring the evaluation of the computation to a custom interpreter.
PyTorch solved the problem differently, by carefully optimizing every aspect of its execution while simultaneously empowering its users to easily leverage additional optimization strategies.
5.1 An efficient C++ core
Despite being closely integrated in the Python ecosystem, most of PyTorch is written in C++ to achieve high performance. This core libtorch library implements the tensor data structure, the GPU and CPU operators, and basic parallel primitives. It also provides the automatic differentiation system, including the gradient formulas for most built-in functions. This ensures that the computation of the derivatives of functions composed of core PyTorch operators is executed entirely in a multithreaded evaluator which does not require holding the Python global interpreter lock [33]. Python bindings are generated using YAML meta-data files. An interesting side-effect of this approach is that it allowed our community to quickly create bindings to multiple other languages resulting in projects like NimTorch [34], hasktorch [35] and others.
This design also allowed us to create first-class C++ bindings and modeling libraries that can be used in places where Python is inconvenient, such as the game engine for StarCraft [36] or on mobile platforms. It is even possible to take the Python code describing a PyTorch model and run it without Python using the TorchScript engine [37].
5.2 Separate control and data flow
PyTorch maintains a strict separation between its control (i.e. program branches, loops) and data flow (i.e. tensors and the operations performed on them). The resolution of the control flow is handled by Python and optimized C++ code executed on the host CPU, and results in a linear sequence of operator invocations on the device. Operators can be run either on CPU or on GPU.
PyTorch is designed to execute operators asynchronously on GPU by leveraging the CUDA stream mechanism [38] to queue CUDA kernel invocations to the GPU's hardware FIFO. This allows the system to overlap the execution of Python code on CPU with tensor operators on GPU. Because the tensor operations usually take a significant amount of time, this lets us saturate the GPU and reach peak performance even in an interpreted language with fairly high overhead like Python. Note that this mechanism is nearly invisible to the user. Unless they implement their own multi-stream primitives, all of the CPU-GPU synchronization is handled by the library.
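The queueing pattern can be imitated on the CPU with a worker thread and a FIFO. In this minimal sketch, `DeviceQueue` is a hypothetical class in which the worker thread stands in for the GPU and `launch` stands in for a kernel invocation; it illustrates the host issuing work without waiting, and does not use CUDA.

```python
import queue
import threading

class DeviceQueue:
    """Toy FIFO 'stream': the host enqueues work, a worker runs it in order."""

    def __init__(self):
        self._work = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            task = self._work.get()
            task()
            self._work.task_done()

    def launch(self, fn, *args):
        # Returns immediately; the call runs later on the worker thread.
        self._work.put(lambda: fn(*args))

    def synchronize(self):
        # Block the host until everything queued so far has finished.
        self._work.join()

results = []
stream = DeviceQueue()
for i in range(5):
    stream.launch(results.append, i * i)  # host keeps issuing work without waiting
stream.synchronize()                       # wait only when the results are needed
```

Since a single worker drains the queue, operations complete in the order they were issued, which is what lets the host run ahead safely.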
PyTorch could leverage a similar mechanism to also execute operators asynchronously on the CPU. However, the costs of cross-thread communication and synchronization would negate the performance benefit of such an optimization.
5.3 Custom caching tensor allocator
Almost every operator must dynamically allocate an output tensor to hold the result of its execution. It is therefore critical to optimize the speed of the dynamic memory allocators. PyTorch can rely on optimized libraries [39-41] to handle this task on CPU. However, on GPU the cudaFree routine may block its caller until all previously queued work on all GPUs completes. To avoid this bottleneck, PyTorch implements a custom allocator which incrementally builds up a cache of CUDA memory and reassigns it to later allocations without further use of CUDA APIs. The incremental allocation is also crucial for better interoperability, because taking up all GPU memory ahead of time would prevent the user from utilizing other GPU-enabled Python packages.
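A caching allocator of this kind can be sketched in a few lines. In the toy below, `CachingAllocator` is a hypothetical class, `_driver_alloc` is a stand-in for an expensive device allocation call, and blocks are reused only on an exact size match; all of these are simplifying assumptions, not PyTorch's actual allocator.

```python
class CachingAllocator:
    """Toy allocator: freed blocks are kept and reused instead of being released."""

    def __init__(self):
        self.free_blocks = {}   # size in bytes -> list of cached blocks
        self.driver_allocs = 0  # how often the simulated slow path was taken

    def _driver_alloc(self, size):
        # Stand-in for an expensive device allocation call.
        self.driver_allocs += 1
        return bytearray(size)

    def allocate(self, size):
        cached = self.free_blocks.get(size)
        if cached:
            return cached.pop()          # fast path: reuse, no driver call
        return self._driver_alloc(size)  # slow path: grow the cache incrementally

    def free(self, block):
        # Keep the block for later instead of handing it back to the driver.
        self.free_blocks.setdefault(len(block), []).append(block)

allocator = CachingAllocator()
for _ in range(3):  # e.g. three training iterations with identical shapes
    a = allocator.allocate(1024)
    b = allocator.allocate(4096)
    allocator.free(a)
    allocator.free(b)
```

After the first iteration every request is served from the cache, so the slow path is taken only twice in total; memory is acquired only as it is first needed rather than reserved up front.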
To further improve its effectiveness, this allocator was tuned for the specific memory usage patterns of deep learning. For example, it rounds up allocations to multiples of 512 bytes to avoid fragmentation issues. Mor