PyTorch: An Imperative Style, High-Performance Deep Learning Library
Abstract
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs.
In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
1 Introduction
With the increased interest in deep learning in recent years, there has been an explosion of machine learning tools. Many popular frameworks such as Caffe [1], CNTK [2], TensorFlow [3], and Theano [4], construct a static dataflow graph that represents the computation and which can then be applied repeatedly to batches of data. This approach provides visibility into the whole computation ahead of time, and can theoretically be leveraged to improve performance and scalability. However, it comes at the cost of ease of use, ease of debugging, and flexibility of the types of computation that can be represented.
Prior work has recognized the value of dynamic eager execution for deep learning, and some recent frameworks implement this define-by-run approach, but do so either at the cost of performance (Chainer [5]) or using a less expressive, faster language (Torch [6], DyNet [7]), which limits their applicability.
However, with careful implementation and design choices, dynamic eager execution can be achieved largely without sacrificing performance. This paper introduces PyTorch, a Python library that performs immediate execution of dynamic tensor computations with automatic differentiation and GPU acceleration, and does so while maintaining performance comparable to the fastest current libraries for deep learning. This combination has turned out to be very popular in the research community with, for instance, 296 ICLR 2019 submissions mentioning PyTorch.
2 Background
Four major trends in scientific computing have become increasingly important for deep learning.
First, starting in the 1960s, the development of domain-specific languages such as APL [8], MATLAB [9], R [10] and Julia [11] turned multidimensional arrays (often referred to as tensors) into first-class objects supported by a comprehensive set of mathematical primitives (or operators) to manipulate them. Separately, libraries such as NumPy [12], Torch [6], Eigen [13] and Lush [14] made array-based programming productive in general-purpose languages such as Python, Lisp, C++ and Lua.
Second, the development of automatic differentiation [15] made it possible to fully automate the daunting labor of computing derivatives. This made it significantly easier to experiment with different machine learning approaches while still allowing for efficient gradient-based optimization. The autograd [16] package popularized the use of this technique for NumPy arrays, and similar approaches are used in frameworks such as Chainer [5], DyNet [7], Lush [14], Torch [6], Jax [17] and Flux.jl [18].
Third, with the advent of the free software movement, the scientific community moved away from closed proprietary software such as MATLAB [9], and towards the open-source Python ecosystem with packages like NumPy [12], SciPy [19], and Pandas [20]. This fulfilled most of the numerical analysis needs of researchers while allowing them to take advantage of a vast repository of libraries to handle dataset preprocessing, statistical analysis, plotting, and more. Moreover, the openness, interoperability, and flexibility of free software fostered the development of vibrant communities that could quickly address new or changing needs by extending the existing functionality of a library or, if needed, by developing and releasing brand new ones. While there is a rich offering of open-source software for neural networks in languages other than Python, starting with Lush [14] in Lisp, Torch [6] in C++, Objective-C and Lua, EBLearn [21] in C++, and Caffe [1] in C++, the network effects of a large ecosystem such as Python made it an essential skill to jumpstart one's research. Hence, since 2014, most deep learning frameworks converged on a Python interface as an essential feature.
Finally, the availability and commoditization of general-purpose massively parallel hardware such as GPUs provided the computing power required by deep learning methods. Specialized libraries such as cuDNN [22], along with a body of academic work (such as [23] and [24]), produced a set of high-performance reusable deep learning kernels that enabled frameworks such as Caffe [1], Torch7 [25], or TensorFlow [3] to take advantage of these hardware accelerators.
PyTorch builds on these trends by providing an array-based programming model accelerated by GPUs and differentiable via automatic differentiation integrated in the Python ecosystem.
3 Design principles
PyTorch's success stems from weaving previous ideas into a design that balances speed and ease of use. There are four main principles behind our choices:
Be Pythonic. Data scientists are familiar with the Python language, its programming model, and its tools. PyTorch should be a first-class member of that ecosystem. It follows the commonly established design goals of keeping interfaces simple and consistent, ideally with one idiomatic way of doing things. It also integrates naturally with standard plotting, debugging, and data processing tools.
Put researchers first. PyTorch strives to make writing models, data loaders, and optimizers as easy and productive as possible. The complexity inherent to machine learning should be handled internally by the PyTorch library and hidden behind intuitive APIs free of side-effects and unexpected performance cliffs.
Provide pragmatic performance. To be useful, PyTorch needs to deliver compelling performance, although not at the expense of simplicity and ease of use. Trading 10% of speed for a significantly simpler-to-use model is acceptable; 100% is not. Therefore, its implementation accepts added complexity in order to deliver that performance. Additionally, providing tools that allow researchers to manually control the execution of their code will empower them to find their own performance improvements independent of those that the library provides automatically.
Worse is better [26]. Given a fixed amount of engineering resources, and all else being equal, the time saved by keeping the internal implementation of PyTorch simple can be used to implement additional features, adapt to new situations, and keep up with the fast pace of progress in the field of AI. Therefore it is better to have a simple but slightly incomplete solution than a comprehensive but complex and hard-to-maintain design.
4 Usability centric design
4.1 Deep learning models are just Python programs
In a surprisingly short amount of time, machine learning grew from recognizing individual digits [27] into autonomously playing StarCraft [28]. Consequently, the neural networks themselves evolved rapidly from simple sequences of feedforward layers into incredibly varied numerical programs often composed of many loops and recursive functions. To support this growing complexity, PyTorch forgoes the potential benefits of a graph-metaprogramming based approach to preserve the imperative programming model of Python. This design was pioneered for model authoring by Chainer [5] and DyNet [7]. PyTorch extends this to all aspects of deep learning workflows. Defining layers, composing models, loading data, running optimizers, and parallelizing the training process are all expressed using the familiar concepts developed for general-purpose programming.
This solution ensures that any new potential neural network architecture can be easily implemented with PyTorch. For instance, layers (which in modern machine learning should really be understood as stateful functions with implicit parameters) are typically expressed as Python classes whose constructors create and initialize their parameters, and whose forward methods process an input activation. Similarly, models are usually represented as classes that compose individual layers, but let us state again that nothing forces the user to structure their code in that way. Listing 1 demonstrates how an entire model can be created by composing functionality provided by PyTorch such as 2d convolution, matrix multiplication, dropout, and softmax to classify gray-scale images. Note that linear layers are of course part of the library, but we show an example implementation to highlight how simple it is.
    import torch
    from torch import nn

    class LinearLayer(nn.Module):
        def __init__(self, in_sz, out_sz):
            super().__init__()
            # Weight and bias are registered as learnable parameters.
            t1 = torch.randn(in_sz, out_sz)
            self.w = nn.Parameter(t1)
            t2 = torch.randn(out_sz)
            self.b = nn.Parameter(t2)

        def forward(self, activations):
            t = torch.mm(activations, self.w)
            return t + self.b

    class FullBasicModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(1, 128, 3)
            self.fc = LinearLayer(128, 10)

        def forward(self, x):
            t1 = self.conv(x)
            t2 = nn.functional.relu(t1)
            # Average over the spatial dimensions so that fc receives an
            # (N, 128) matrix; this step is added here for shape compatibility.
            t3 = self.fc(t2.mean(dim=(2, 3)))
            return nn.functional.softmax(t3, dim=1)
Listing 1: A custom layer used as a building block for a simple but complete neural network.
This "everything is just a program" philosophy is not limited to the models; it applies to optimizers and data loaders as well. This facilitates experimentation with new training techniques. For example, to implement the very popular generative adversarial networks, one needs to specify two separate models (the generator and the discriminator), and two loss functions that depend on both models at the same time. Rigid APIs would struggle with this setup, but the simple design employed in PyTorch easily adapts to this setting as shown in Listing 2.
    discriminator = create_discriminator()
    generator = create_generator()
    optimD = optim.Adam(discriminator.parameters())
    optimG = optim.Adam(generator.parameters())

    def step(real_sample):
        # (1) Update Discriminator
        errD_real = loss(discriminator(real_sample), real_label)
        errD_real.backward()
        fake = generator(get_noise())
        errD_fake = loss(discriminator(fake.detach()), fake_label)
        errD_fake.backward()
        optimD.step()
        # (2) Update Generator
        errG = loss(discriminator(fake), real_label)
        errG.backward()
        optimG.step()
Since PyTorch programs execute eagerly, all the features of Python are available throughout the whole design process. Print statements, standard debuggers, and common visualization tools like matplotlib all work as expected. Users do not have to wait for lengthy compilation before they can start running their programs, and more importantly intermediate computations can be observed to understand how a model works and whether its results are correct.
4.2 Interoperability and extensibility
Easy and efficient interoperability is one of the top priorities for PyTorch because it opens the possibility to leverage the rich ecosystem of Python libraries as part of user programs. Hence, PyTorch allows for bidirectional exchange of data with external libraries. For example, it provides a mechanism to convert between NumPy arrays and PyTorch tensors using the torch.from_numpy() function and the .numpy() tensor method. Similar functionality is also available to exchange data stored using the DLPack [29] format. Note that this exchange happens in both cases without any data copying: objects on both sides only describe how to interpret a memory region which is shared among them. Hence, those operations are actually extremely cheap, and take constant time no matter how large the converted arrays are.
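The zero-copy exchange described above can be observed directly. The following is a minimal sketch, assuming NumPy and a CPU build of PyTorch are installed; the variable names are illustrative only.

```python
import numpy as np
import torch
from torch.utils import dlpack

# Wrap a NumPy array as a tensor without copying its data.
array = np.arange(6, dtype=np.float32)
tensor = torch.from_numpy(array)

# Both objects describe the same memory region, so a write through the
# tensor is visible through the array.
tensor[0] = 42.0
shares_forward = bool(array[0] == 42.0)

# The reverse direction: .numpy() exposes a CPU tensor's memory as an array.
view = tensor.numpy()
view[1] = -1.0
shares_backward = tensor[1].item() == -1.0

# np.shares_memory confirms that no copy was made.
same_buffer = np.shares_memory(array, view)

# A DLPack round trip also reuses the same storage.
roundtrip = dlpack.from_dlpack(dlpack.to_dlpack(tensor))
same_pointer = roundtrip.data_ptr() == tensor.data_ptr()
```

Because only metadata is created, the cost of each conversion does not depend on the size of the array.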
Moreover, many of the critical systems are designed specifically to be extensible. For instance, the automatic differentiation system allows users to add support for custom differentiable functions. To do that, users can define a new subclass of torch.autograd.Function that implements forward() and backward() methods, which specify the function and its derivative (or more formally the vector-Jacobian product). Similarly, new datasets can be added by subclassing torch.utils.data.Dataset and implementing two methods: __getitem__ (the indexing operator) and __len__ (the length operator), making datasets behave like (possibly lazy) lists. How these work is completely up to the implementer, and many users leverage other Python packages for data loading. The DataLoader class consumes objects conforming to this interface and provides an iterator over the data which takes care of shuffling, batching, parallelization, and management of pinned CUDA memory to improve throughput.
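As an illustration of these two extension points, here is a small sketch. The class names Square and SquaresDataset are hypothetical and chosen only for this example; the base classes and DataLoader are the library's own.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class Square(torch.autograd.Function):
    """y = x ** 2 with a hand-written vector-Jacobian product."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # keep the input for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x  # dy/dx = 2x, scaled by the incoming gradient


class SquaresDataset(Dataset):
    """Lazy list of (n, n ** 2) pairs; items are built only when indexed."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        x = torch.tensor([float(index)])
        return x, x * x


# Use the custom function like any other differentiable operation.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
Square.apply(x).sum().backward()

# DataLoader handles shuffling and batching on top of the two methods.
loader = DataLoader(SquaresDataset(8), batch_size=4, shuffle=True)
batch_shapes = [tuple(inputs.shape) for inputs, _ in loader]
```

Options such as num_workers and pin_memory on DataLoader control the parallel loading and pinned-memory behavior mentioned above; they are left at their defaults here.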
Most importantly, users are free to replace any component of PyTorch that does not meet the needs or performance requirements of their project. They are all designed to be completely interchangeable, and PyTorch takes great care not to impose any particular solution.
4.3 Automatic differentiation
Since gradient-based optimization is vital to deep learning, PyTorch must be able to automatically compute gradients of models specified by our users, and those can be arbitrary Python programs. However, Python is a dynamic programming language that allows changing most behaviors at runtime, making ahead-of-time source-to-source differentiation cumbersome. Instead, PyTorch uses the operator overloading approach, which builds up a representation of the computed function every time it is executed. In its current implementation [30], PyTorch performs reverse-mode automatic differentiation, which computes the gradient of a scalar output with respect to a multivariate input. Differentiating functions with more outputs than inputs is more efficiently executed using forward-mode automatic differentiation, but this use case is less common for machine learning applications. PyTorch can be easily extended to perform forward-mode differentiation using array-level dual numbers [31, 32].
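A short sketch of what the operator overloading approach means in practice: the function below uses an ordinary Python loop whose trip count depends on the data, and the representation used for differentiation is rebuilt on each call. The function name f is an arbitrary choice for this example.

```python
import torch


def f(x):
    # Plain Python control flow; the number of iterations depends on the
    # values in x and is only known while the program runs.
    y = x
    while y.norm() < 10:
        y = y * 2
    return y.sum()


x = torch.tensor([1.0, 1.0], requires_grad=True)
out = f(x)      # the loop doubles y three times for this input, so y = 8 * x
out.backward()  # reverse mode: gradient of the scalar output w.r.t. x
```

For this input the recorded computation is three multiplications followed by a sum, so the gradient of each component is 8. A different input could take a different number of iterations and would record a different graph.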
Another interesting and uncommon feature of our system is that it can differentiate through code employing mutation on tensors, which is one of the basic building blocks of imperative programs. To ensure safety, we have implemented a versioning system for tensors, which lets us track their modifications and ensure that we always use the data we expect. One interesting tradeoff is that while we could utilize techniques like copy-on-write to support arbitrary programs, we chose to not go down this path, as performance-wise it is usually beneficial for the users to rewrite their code to ensure that no copies have to be performed. Hence, while most mutations are benign and can be handled automatically, the really complicated cases result in a user error, which lets them know that they likely want to restructure the program. This allows us to avoid introducing subtle and hard-to-find performance cliffs.
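The difference between a benign mutation and one that is reported as an error can be sketched as follows. The example reads the private _version attribute purely for illustration; it is an implementation detail and should not be relied on in user code.

```python
import torch

a = torch.ones(3, requires_grad=True)

# Benign mutation: the backward pass of `a * 2` does not need the value
# of its output, so modifying that output in place is handled automatically.
b = a * 2
b.add_(1)
b.sum().backward()

# Problematic mutation: exp() keeps its output to compute its derivative,
# so overwriting that output invalidates the saved value.
c = a.exp()
version_before = c._version
c.add_(1)
version_after = c._version  # the in-place update bumps the version counter

try:
    c.sum().backward()
    raised = False
except RuntimeError:
    # Autograd detects the version mismatch and refuses to use stale data.
    raised = True
```

The error signals that the program should be restructured, for example by using the out-of-place c + 1 instead of c.add_(1).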
5 Performance focused implementation
Running deep learning algorithms efficiently from a Python interpreter is notoriously challenging: for instance, the global interpreter lock [33] effectively ensures that only one of any number of concurrent threads is running at any given time. Deep learning frameworks based on the construction of a static dataflow graph sidestep this problem by deferring the evaluation of the computation to a custom interpreter.
PyTorch solved the problem differently, by carefully optimizing every aspect of its execution while simultaneously empowering its users to easily leverage additional optimization strategies.
5.1 An efficient C++ core
Despite being closely integrated in the Python ecosystem, most of PyTorch is written in C++ to achieve high performance. This core libtorch library implements the tensor data structure, the GPU and CPU operators, and basic parallel primitives. It also provides the automatic differentiation system, including the gradient formulas for most built-in functions. This ensures that the computation of the derivatives of functions composed of core PyTorch operators is executed entirely in a multithreaded evaluator which does not require holding the Python global interpreter lock [33]. Python bindings are generated using YAML meta-data files. An interesting side-effect of this approach is that it allowed our community to quickly create bindings to multiple other languages resulting in projects like NimTorch [34], hasktorch [35] and others.
This design also allowed us to create first-class C++ bindings and modeling libraries that can be used in places where Python is inconvenient, such as the game engine for StarCraft [36] or on mobile platforms. It is even possible to take the Python code describing a PyTorch model and run it without Python using the TorchScript engine [37].
5.2 Separate control and data flow
PyTorch maintains a strict separation between its control flow (i.e. program branches, loops) and data flow (i.e. tensors and the operations performed on them). The resolution of the control flow is handled by Python and optimized C++ code executed on the host CPU, and results in a linear sequence of operator invocations on the device. Operators can be run either on CPU or on GPU.
PyTorch is designed to execute operators asynchronously on GPU by leveraging the CUDA stream mechanism [38] to queue CUDA kernel invocations to the GPU's hardware FIFO. This allows the system to overlap the execution of Python code on CPU with tensor operators on GPU. Because the tensor operations usually take a significant amount of time, this lets us saturate the GPU and reach peak performance even in an interpreted language with fairly high overhead like Python. Note that this mechanism is nearly invisible to the user. Unless they implement their own multi-stream primitives, all of the CPU-GPU synchronization is handled by the library.
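One way to observe the asynchronous behavior is to compare the time needed to enqueue a series of operators with the time until their results are actually available. The sketch below assumes nothing beyond a PyTorch installation and falls back to the CPU, where execution is synchronous, when no GPU is present; timed_matmul is a hypothetical helper written for this example.

```python
import time

import torch


def timed_matmul(device, size=512, repeats=10):
    """Return (seconds to enqueue the work, seconds until it has finished)."""
    a = torch.randn(size, size, device=device)
    start = time.perf_counter()
    for _ in range(repeats):
        out = a @ a  # on GPU this call only queues a kernel and returns
    enqueued = time.perf_counter() - start
    if device.type == "cuda":
        # Block the host until all queued kernels have completed.
        torch.cuda.synchronize()
    finished = time.perf_counter() - start
    return enqueued, finished


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
enqueued, finished = timed_matmul(device)
```

On a GPU the first number is typically much smaller than the second, which is the window in which Python code on the host overlaps with kernels on the device. Benchmarks that omit the synchronization step measure only the enqueue time.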
PyTorch could leverage a similar mechanism to also execute operators asynchronously on the CPU. However, the costs of cross-thread communication and synchronization would negate the performance benefit of such an optimization.
5.3 Custom caching tensor allocator
Almost every operator must dynamically allocate an output tensor to hold the result of its execution. It is therefore critical to optimize the speed of the dynamic memory allocators. PyTorch can rely on optimized libraries [39-41] to handle this task on CPU. However, on GPU the cudaFree routine may block its caller until all previously queued work on all GPUs completes. To avoid this bottleneck, PyTorch implements a custom allocator which incrementally builds up a cache of CUDA memory and reassigns it to later allocations without further use of CUDA APIs. The incremental allocation is also crucial for better interoperability, because taking up all GPU memory ahead of time would prevent the user from utilizing other GPU-enabled Python packages.
To further improve its effectiveness, this allocator was tuned for the specific memory usage patterns of deep learning. For example, it rounds up allocations to multiples of 512 bytes to avoid fragmentation issues. Moreover, it maintains a distinct pool of memory for every CUDA stream (work queue).
The one-pool-per-stream design assumption simplifies the implementation and improves the performance of the allocator: because the CPU runs ahead of the GPU, memory is freed on the CPU before its last use on the GPU finishes. Since streams serialize execution, if the free precedes the reallocation on the CPU, the same order will occur on the GPU. So the allocator can reallocate memory freed on the CPU immediately as long as the new allocation is used on the same stream as the freed region. However, if an allocation was last used on one stream and then allocated on another, additional synchronization is needed.
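The ideas of rounding requests to multiples of 512 bytes and keeping one pool per stream can be illustrated with a toy model in plain Python. This is only a sketch of the concept, not PyTorch's allocator: ToyCachingAllocator, its integer "addresses", and the driver_calls counter are all hypothetical, and cross-stream synchronization is not modeled.

```python
from collections import defaultdict

BLOCK = 512  # rounding granularity in bytes, as described in the text


def round_up(nbytes, multiple=BLOCK):
    """Round a request up to the next multiple of `multiple` bytes."""
    return ((nbytes + multiple - 1) // multiple) * multiple


class ToyCachingAllocator:
    """Toy model: one pool of free blocks per stream, keyed by rounded size."""

    def __init__(self):
        self._pools = defaultdict(lambda: defaultdict(list))
        self._next_address = 0
        self.driver_calls = 0  # stands in for expensive driver allocations

    def malloc(self, nbytes, stream):
        size = round_up(nbytes)
        free_blocks = self._pools[stream][size]
        if free_blocks:
            # Reuse a cached block from the same stream: no driver call.
            return free_blocks.pop(), size
        # Cache miss: request fresh memory from the (simulated) driver.
        self.driver_calls += 1
        address = self._next_address
        self._next_address += size
        return address, size

    def free(self, address, size, stream):
        # Keep the block in its stream's pool instead of releasing it.
        self._pools[stream][size].append(address)


alloc = ToyCachingAllocator()
first = alloc.malloc(1000, stream=0)   # rounded up to 1024 bytes
alloc.free(*first, stream=0)
second = alloc.malloc(900, stream=0)   # same rounded size, same stream: reused
third = alloc.malloc(900, stream=1)    # other stream: needs a new block
```

Rounding makes the 1000-byte and 900-byte requests interchangeable, and the second allocation on stream 0 is served from the cache, while the request on stream 1 cannot reuse stream 0's pool.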
The one-pool-per-stream design seems limiting since the allocations end up fragmented per stream, but in practice PyTorch almost never uses multiple streams. It is notoriously hard to write CUDA kernels in a way that would let them cooperatively share the GPU because exact scheduling is hardware controlled. In practice, kernel writers usually resort to monolithic kernels that combine multiple tasks. Data loading and distributed computing utilities are exceptions to the one stream design, and they carefully insert additional synchronization to avoid bad interactions with the allocator.
While this design is susceptible to certain corner cases, it almost never exhibits unwanted behaviors in practical code. Most of our users are not aware of its existence.
5.4 Multiprocessing
Due to the global interpreter lock (GIL), Python's default implementation does not allow concurrent threads to execute in parallel. To alleviate this problem, the Python community has established a standard multiprocessing module, containing a number of utilities that allow users to easily spawn child processes and implement basic inter-process communication primitives.
However, the implementation of the primitives uses the same form of serialization used for on-disk persistence, which is inefficient when dealing with large arrays. Hence, PyTorch extends the Python multiprocessing module into torch.multiprocessing, which is a drop-in replacement for the built-in package and automatically moves the data of tensors sent to other processes to shared memory instead of sending it over the communication channel.
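A minimal sketch of the shared-memory behavior, assuming a platform where shared memory is available to the process. The helper names add_one and run_demo are hypothetical; Process, share_memory_() and is_shared() are part of the library.

```python
import torch
import torch.multiprocessing as mp


def add_one(shared):
    # Runs in the child process; the write lands in the shared segment.
    shared.add_(1)


def run_demo():
    tensor = torch.zeros(4)
    tensor.share_memory_()  # place the tensor's storage in shared memory
    child = mp.Process(target=add_one, args=(tensor,))
    child.start()
    child.join()
    # The parent observes the child's in-place update without any copy
    # being sent back over a pipe.
    return tensor


if __name__ == "__main__":
    print(run_demo())
```

Because the child works on the same storage as the parent, the programming model is close to that of threads, with the usual caveat that concurrent writes need to be coordinated by the user.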
This design greatly improves performance and makes the process isolation weaker, resulting in a programming model which more closely resembles regular threaded programs. Users can easily implement heavily parallel programs that operate on independent GPUs but later synchronize gradients using all-reduce style primitives.
Another unique feature of this system is that it transparently handles sharing of CUDA tensors, making it easy to implement techniques like Hogwild [42].
5.5 Reference counting
Users often design their models to utilize all memory available during training, and increasing batch sizes is a common technique of speeding up the process. Therefore, to deliver great performance, PyTorch has to treat memory as a scarce resource that it needs to manage carefully.
Libraries with eager semantics have to manage tensor memory without knowing how it will be used in the future. Garbage collection is the typical way to handle this automatically because it has good amortized performance. In this approach, the runtime periodically investigates the state of the system, enumerates used objects and frees everything else. However, by deferring the deallocation, it causes the program to use more memory overall [43]. Given the scarcity of GPU memory, these overheads are unacceptable. In fact, Torch7 utilized the garbage collector built into Lua, and a common antipattern among the users was to sprinkle the program with explicit triggers to the garbage collector, hoping that the memory errors go away.
具有急切语义的库必须管理张量内存,而不知道它将来会如何使用。垃圾回收是自动处理此问题的典型方法,因为它具有良好的摊销性能。在这种方法中,运行时会定期检查系统的状态,枚举使用的对象并释放其他所有内容。然而,通过延迟释放位置,它会导致程序总体上使用更多的内存 [43]。鉴于 GPU 内存的稀缺性,这些开销是不可接受的。事实上,Torch7 使用了 Lua 内置的垃圾回收器,用户中常见的一种反模式是在程序中显式触发垃圾回收器,希望内存错误消失。
PyTorch takes a different approach: it relies on a reference counting scheme to track the number of uses of each tensor, and frees the underlying memory immediately once this count reaches zero. Note that PyTorch tracks both references internal to the libtorch library and external references made by users in their Python code by integrating with Python's own reference counting mechanism. This ensures that memory is released exactly when tensors become unneeded.
PyTorch 采用了不同的方法:它依赖于引用计数方案来跟踪每个张量的使用次数,一旦计数达到零,立即释放底层内存。需要注意的是,PyTorch 通过与 Python 自身的引用计数机制集成,跟踪 libtorch 库内部的引用以及用户在 Python 代码中创建的外部引用。这确保了在张量不再需要时,内存能够被准确释放。
One notable caveat is that we can only guarantee the desired performance characteristics in implementations of languages that either already utilize reference counting (CPython, Swift, but not PyPy or many scripting languages such as Lua), or allow for user-defined behavior for assignment, copies, and moves (e.g. C++, Rust). Bindings to implementations that do not satisfy those criteria will have to implement their own specialized memory management on top of PyTorch.
一个值得注意的注意事项是,我们只能保证在使用引用计数(如 CPython、Swift,但不包括 PyPy 或许多脚本语言如 Lua)的语言实现中,以及那些允许用户自定义赋值、复制和移动行为的语言(例如 C++、Rust)中,才能实现所需的性能特性。对于不满足这些标准的实现,绑定层必须在 PyTorch 之上实现自己的专用内存管理。
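The difference between immediate, reference-counted release and deferred garbage collection can be observed directly in CPython. The following is a minimal sketch, assuming CPython's reference counting semantics; `Buffer` and `TrackingPool` are hypothetical stand-ins for a tensor's storage and its allocator, not PyTorch classes. The cyclic garbage collector is disabled during the demonstration to show that it plays no part in the release.

```python
import gc
import weakref


class Buffer:
    """Hypothetical stand-in for the memory backing a tensor."""

    def __init__(self, nbytes):
        self.nbytes = nbytes


class TrackingPool:
    """Counts live bytes; a finalizer runs as soon as a buffer becomes unreachable."""

    def __init__(self):
        self.live_bytes = 0

    def allocate(self, nbytes):
        buf = Buffer(nbytes)
        self.live_bytes += nbytes
        # The callback holds no reference to buf, so it does not keep buf alive.
        weakref.finalize(buf, self._release, nbytes)
        return buf

    def _release(self, nbytes):
        self.live_bytes -= nbytes


def demo():
    """Return live byte counts after allocation, one del, and the final del."""
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the cyclic collector as the cause of the release
    try:
        pool = TrackingPool()
        a = pool.allocate(1024)
        b = a  # second reference to the same buffer
        after_alloc = pool.live_bytes
        del a  # one reference remains, so nothing is released
        after_first_del = pool.live_bytes
        del b  # count reaches zero; the finalizer runs immediately
        after_last_del = pool.live_bytes
        return after_alloc, after_first_del, after_last_del
    finally:
        if was_enabled:
            gc.enable()
```

On an implementation without reference counting, such as PyPy, the final value would generally not drop until a collection runs, which is the caveat described above.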
6 Evaluation
6 评估
In this section we compare the performance of PyTorch with several other commonly-used deep learning libraries, and find that it achieves competitive performance across a range of tasks. All experiments were performed on a workstation with two Intel Xeon E5-2698 v4 CPUs and one NVIDIA Quadro GP100 GPU.
在本节中,我们将 PyTorch 与几种其他常用的深度学习库进行性能比较,发现其在一系列任务中均表现出色。所有实验均在一台配备两个 Intel Xeon E5-2698 v4 CPU 和一个 NVIDIA Quadro GP100 GPU 的工作站上进行。
6.1 Asynchronous dataflow
6.1 异步数据流
We start by quantifying the ability of PyTorch to asynchronously execute dataflow on GPU. We use the built-in profiler [44] to instrument various benchmarks and record a timeline of the execution of a single training step.
我们首先量化了 PyTorch 在 GPU 上异步执行数据流的能力。我们使用内置的分析器 [44] 来检测各种基准测试,并记录单个训练步骤执行的时间线。
Figure 1 shows a representative timeline of execution for the first few operations of a ResNet-50 model. The host CPU, which queues the work, quickly outpaces the execution of the operators on the GPU. This allows PyTorch to achieve almost perfect device utilization. In this example, GPU execution takes around three times longer than CPU scheduling. The exact ratio depends on the relative performance of the host CPU and the GPU, as well as the number of elements in each tensor and the average arithmetic complexity of the floating point computations to be performed on the GPU.
图 1 展示了 ResNet-50 模型前几个操作的典型执行时间线。主机 CPU 在排队工作时迅速超过了 GPU 上操作符的执行速度。这使得 PyTorch 能够实现几乎完美的设备利用率。在这个例子中,GPU 执行时间大约是 CPU 调度的三倍。具体比例取决于主机 CPU 和 GPU 的相对性能,以及每个张量中的元素数量和 GPU 上要执行的浮点计算的平均算术复杂度。

Figure 1: A trace of the first few operators of ResNet-50. The top row depicts the execution of the control flow running on the host CPU. The gray areas are Python code executed by its interpreter. The colored areas correspond to the work done on the host CPU to queue various operators (convolution, batch normalization, and so on). The bottom row shows the corresponding execution of those operators on the GPU. The arrows pair the two events in time.
图 1: ResNet-50 前几个操作符的执行轨迹。顶部行展示了在主机 CPU 上运行的控制流执行情况。灰色区域是由 Python 解释器执行的 Python 代码。彩色区域对应于在主机 CPU 上为各种操作符(卷积、批量归一化等)排队所做的工作。底部行展示了这些操作符在 GPU 上的相应执行情况。箭头将这两个事件在时间上配对。
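The pattern in which the host queues operators faster than the device executes them can be mimicked with a worker thread and a queue. This is only an analogy built on the standard library, not how PyTorch or CUDA streams are implemented: the function `run_pipeline` is hypothetical, and `time.sleep` stands in for kernel execution time.

```python
import queue
import threading
import time


def run_pipeline(ops):
    """Queue (name, duration) ops from the host and run them on a worker thread."""
    work = queue.Queue()
    completed = []

    def device_worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work
                break
            name, duration = item
            time.sleep(duration)  # stands in for kernel execution time
            completed.append(name)

    worker = threading.Thread(target=device_worker)
    worker.start()

    start = time.perf_counter()
    for item in ops:
        work.put(item)  # returns immediately; the host does not wait
    work.put(None)
    host_time = time.perf_counter() - start

    worker.join()  # analogous to an explicit synchronization point
    total_time = time.perf_counter() - start
    return completed, host_time, total_time
```

With a few operators of some milliseconds each, `host_time` is typically a small fraction of `total_time`: the host finishes queueing long before the worker finishes executing, and the operators still complete in the order they were queued.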
6.2 Memory management
6.2 内存管理
We used the NVIDIA profiler to trace the execution of the CUDA runtime as well as the execution of the CUDA kernels launched during one training iteration of the ResNet-50 model. As shown in Figure 2, the behavior of the first iteration differs significantly from that of subsequent ones. At first, calls to the CUDA memory management functions (cudaMalloc and cudaFree) slow down the execution quite dramatically by blocking the CPU thread for long periods of time, hence lowering the utilization of the GPU. This effect disappears in subsequent iterations as the PyTorch caching memory allocator starts reusing previously allocated regions.
我们使用 NVIDIA 的性能分析工具来跟踪 CUDA 运行时的执行情况,以及在 ResNet-50 模型的一次训练迭代中启动的 CUDA 内核的执行情况。如图 2 所示,第一次迭代的行为与后续迭代有显著差异。最初,对 CUDA 内存管理函数(cudaMalloc 和 cudaFree)的调用会显著减慢执行速度,因为 CPU 线程会被长时间阻塞,从而降低了 GPU 的利用率。随着 PyTorch 缓存内存分配器开始重用之前分配的内存区域,这种影响在后续迭代中消失。

Figure 2: Annotated traces of the execution of ResNet-50 on GPU.
图 2: GPU 上 ResNet-50 执行的注释轨迹。
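Why only the first iteration pays for backend allocations can be seen in a toy caching allocator. The sketch below is a simplified illustration, not PyTorch's allocator: the class `CachingAllocator`, the 512-byte rounding granularity, and the `backend_calls` counter (standing in for slow cudaMalloc calls) are assumptions made for the example.

```python
from collections import defaultdict


class CachingAllocator:
    """Toy allocator that keeps freed blocks in per-size free lists for reuse."""

    def __init__(self, block_size=512):
        self.block_size = block_size
        self.free_blocks = defaultdict(list)  # rounded size -> cached block ids
        self.backend_calls = 0  # stands in for slow cudaMalloc calls
        self._next_id = 0

    def _round_up(self, nbytes):
        blocks = (nbytes + self.block_size - 1) // self.block_size
        return max(blocks, 1) * self.block_size

    def malloc(self, nbytes):
        size = self._round_up(nbytes)
        if self.free_blocks[size]:
            return size, self.free_blocks[size].pop()  # reuse a cached block
        self.backend_calls += 1  # cache miss: ask the backend for memory
        self._next_id += 1
        return size, self._next_id

    def free(self, block):
        size, block_id = block
        self.free_blocks[size].append(block_id)  # keep the block for later reuse


def run_iteration(allocator, request_sizes):
    """Allocate every request, then free them all, like one training step."""
    blocks = [allocator.malloc(n) for n in request_sizes]
    for block in blocks:
        allocator.free(block)
```

Running `run_iteration` twice with the same request sizes increases `backend_calls` only during the first call; the second call is served entirely from the cache, mirroring the difference between the first and subsequent iterations in Figure 2.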
6.3 Benchmarks
6.3 基准测试
Finally, we can get an overall sense of single-machine eager mode performance of PyTorch by comparing it to three popular graph-based deep learning frameworks (CNTK, MXNet and TensorFlow), a define-by-run framework (Chainer), and a production-oriented platform (PaddlePaddle). The Appendix details all the steps needed to reproduce our setup.
最后,我们可以通过将 PyTorch 与三个流行的基于图的深度学习框架(CNTK、MXNet 和 TensorFlow)、一个定义即运行框架(Chainer)以及一个面向生产的平台(Paddle Paddle)进行比较,来全面了解 PyTorch 的单机即时模式性能。附录详细介绍了重现我们设置所需的所有步骤。
Our results are summarized in Table 1. On all the benchmarks, the performance of PyTorch is within 17% of that of the fastest framework. We attribute this result to the fact that these tools offload most of the computation to the same version of the cuDNN and cuBLAS libraries.
我们的结果总结在表 1 中。在所有基准测试中,PyTorch 的性能与最快框架的性能差距在 $17%$ 以内。我们将这一结果归因于这些工具将大部分计算卸载到相同版本的 cuDNN 和 cuBLAS 库中。
Table 1: Training speed for 6 models using 32-bit floats. Throughput is measured in images per second for the AlexNet, VGG-19, ResNet-50, and MobileNet models, in tokens per second for the GNMTv2 model, and in samples per second for the NCF model. The fastest speed for each model is shown in bold.
| Framework | AlexNet | VGG-19 | ResNet-50 | MobileNet | GNMTv2 | NCF |
|---|---|---|---|---|---|---|
| Chainer | 778±15 | N/A | **219±1** | N/A | N/A | N/A |
| CNTK | 845±8 | 84±3 | 210±1 | N/A | N/A | N/A |
| MXNet | **1554±22** | 113±1 | 218±2 | 444±2 | N/A | N/A |
| PaddlePaddle | 933±123 | 112±2 | 192±4 | **557±24** | N/A | N/A |
| TensorFlow | 1422±27 | 66±2 | 200±1 | 216±15 | 9631±1.3% | 4.8e6±2.9% |
| PyTorch | 1547±316 | **119±1** | 212±2 | 463±17 | **15512±4.8%** | **5.4e6±3.4%** |
表 1: 使用 32 位浮点数训练的 6 个模型的速度。AlexNet、VGG-19、ResNet-50 和 MobileNet 模型的吞吐量以每秒图像数衡量,GNMTv2 模型的吞吐量以每秒 token 数衡量,NCF 模型的吞吐量以每秒样本数衡量。每个模型的最快速度以粗体显示。
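The 17% bound can be recomputed from the mean throughputs in Table 1. The short script below is offered as a sanity check rather than as part of the original evaluation; it ignores the reported ± spreads, and the function name `gap_to_fastest` is hypothetical.

```python
# Mean throughputs copied from Table 1 (the ± spreads are ignored here).
TABLE_1 = {
    "AlexNet": {"Chainer": 778, "CNTK": 845, "MXNet": 1554,
                "PaddlePaddle": 933, "TensorFlow": 1422, "PyTorch": 1547},
    "VGG-19": {"CNTK": 84, "MXNet": 113, "PaddlePaddle": 112,
               "TensorFlow": 66, "PyTorch": 119},
    "ResNet-50": {"Chainer": 219, "CNTK": 210, "MXNet": 218,
                  "PaddlePaddle": 192, "TensorFlow": 200, "PyTorch": 212},
    "MobileNet": {"MXNet": 444, "PaddlePaddle": 557,
                  "TensorFlow": 216, "PyTorch": 463},
    "GNMTv2": {"TensorFlow": 9631, "PyTorch": 15512},
    "NCF": {"TensorFlow": 4.8e6, "PyTorch": 5.4e6},
}


def gap_to_fastest(results, framework="PyTorch"):
    """Percent by which `framework` trails the fastest entry for one model."""
    fastest = max(results.values())
    return 100.0 * (fastest - results[framework]) / fastest


gaps = {model: gap_to_fastest(results) for model, results in TABLE_1.items()}
for model, gap in gaps.items():
    print(f"{model}: {gap:.1f}% behind the fastest framework")
```

On these numbers the largest gap is on MobileNet, at roughly 16.9% behind PaddlePaddle, which is consistent with the stated bound; PyTorch is the fastest entry for VGG-19, GNMTv2, and NCF.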
6.4 Adoption
6.4 采用
The validity of design decisions and their impact on ease-of-use is hard to measure. As a proxy, we tried to quantify how well the machine learning community received PyTorch by counting how often various machine learning tools (including Caffe, Chainer, CNTK, Keras, MXNet, PyTorch, TensorFlow, and Theano) are mentioned on arXiv e-Prints since the initial release of PyTorch in January 2017. In Figure 3 we report the monthly number of mentions of the word "PyTorch" as a percentage of all mentions among these deep learning frameworks. We counted tools mentioned multiple times in a given paper only once, and made the search case insensitive to account for various spellings.
设计决策的有效性及其对易用性的影响难以衡量。作为替代方案,我们尝试通过统计自2017年1月PyTorch首次发布以来,arXiv e-Prints上各种机器学习工具(包括Caffe、Chainer、CNTK、Keras、MXNet、PyTorch、TensorFlow和Theano)被提及的频率,来量化机器学习社区对PyTorch的接受程度。在图3中,我们报告了“PyTorch”一词每月被提及的次数占这些深度学习框架总提及次数的百分比。对于每篇论文中多次提及的工具,我们只计一次,并且使搜索不区分大小写以应对不同的拼写方式。

Figure 3: Among arXiv papers each month that mention common deep learning frameworks, percentage of them that mention PyTorch.
图 3: 每月在 arXiv 论文中提及常见深度学习框架的论文中,提及 PyTorch 的百分比。
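The counting rule described above (case-insensitive matching, each tool counted at most once per paper, PyTorch reported as a share of all framework mentions) can be sketched as follows. This is one plausible reading of the methodology, not the authors' actual script: the whole-word regular expression match and the function names are assumptions.

```python
import re

TOOLS = ["Caffe", "Chainer", "CNTK", "Keras", "MXNet", "PyTorch", "TensorFlow", "Theano"]


def tools_mentioned(text):
    """Return the set of tools named in one paper, ignoring case and repeats."""
    lowered = text.lower()
    return {
        tool
        for tool in TOOLS
        if re.search(r"\b" + re.escape(tool.lower()) + r"\b", lowered)
    }


def pytorch_share(papers):
    """Percentage of all framework mentions in `papers` that are PyTorch mentions."""
    per_paper = [tools_mentioned(text) for text in papers]
    total_mentions = sum(len(tools) for tools in per_paper)
    if total_mentions == 0:
        return 0.0
    pytorch_mentions = sum(1 for tools in per_paper if "PyTorch" in tools)
    return 100.0 * pytorch_mentions / total_mentions
```

For example, given four papers in which one mentions only PyTorch (twice), one mentions TensorFlow and Keras, one mentions PyTorch and Theano, and one mentions no framework, there are five mentions in total, two of them PyTorch, so the share is 40%.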
7 Conclusion and future work
7 结论与未来工作
PyTorch has become a popular tool in the deep learning research community by combining a focus on usability with careful performance considerations. In addition to continuing to support the latest trends and advances in deep learning, in the future we plan to continue to improve the speed and scalability of PyTorch. Most notably, we are working on the PyTorch JIT: a suite of tools that allow PyTorch programs to be executed outside of the Python interpreter where they can be further optimized. We also intend to improve support for distributed computation by providing efficient primitives for data parallelism as well as a Pythonic library for model parallelism based around remote procedure calls.
PyTorch 通过将易用性与性能考虑相结合,已成为深度学习研究社区中的流行工具。除了继续支持深度学习的最新趋势和进展外,我们计划在未来继续提升 PyTorch 的速度和可扩展性。最值得注意的是,我们正在开发 PyTorch JIT:一套允许 PyTorch 程序在 Python 解释器之外执行的工具,从而可以进一步优化。我们还计划通过提供高效的数据并行原语以及基于远程过程调用的 Pythonic 库来改进对分布式计算的支持。
8 Acknowledgements
8 致谢
We are grateful to the PyTorch community for their feedback and contributions that greatly influenced the design and implementation of PyTorch. We thank all the PyTorch core team members, contributors and package maintainers including Ailing Zhang, Alex Suhan, Alfredo Mendoza, Alican Bozkurt, Andrew Tulloch, Ansha Yu, Anthony Shoumikhin, Bram Wasti, Brian Vaughan, Christian Puhrsch, David Reiss, David Riazati, Davide Libenzi, Dmytro Dzhulgakov, Dwaraj Rajagopal, Edward Yang, Elias Ellison, Fritz Obermeyer, George Zhang, Hao Lu, Hong Xu, Hung Duong, Igor Fedan, Ilia Cherniavski, Iuri Zdebskyi, Ivan Kobzarev, James Reed, Jeff Smith, Jerry Chen, Jerry Zhang, Jiakai Liu, Johannes M. Dieterich, Karl Ostmo, Lin Qiao, Martin Yuan, Michael Suo, Mike Ruberry, Mikhail Zolotukhin, Mingzhe Li, Neeraj Pradhan, Nick Korovaiko, Owen Anderson, Pavel Belevich, Peter Johnson, Pritam Damania, Raghuraman Krishnamoorthi, Richard Zou, Roy Li, Rui Zhu, Sebastian Messmer, Shen Li, Simon Wang, Supriya Rao, Tao Xu, Thomas Viehmann, Vincent Quenneville-Belair, Vishwak Srinivasan, Vitaly Fedyunin, Wanchao Liang, Wei Yang, Will Feng, Xiaomeng Yang, Xiaoqiang Zheng, Xintao Chen, Yangqing Jia, Yanli Zhao, Yinghai Lu and Zafar Takhirov.
我们感谢 PyTorch 社区提供的反馈和贡献,这些反馈和贡献极大地影响了 PyTorch 的设计和实现。我们感谢所有 PyTorch 核心团队成员、贡献者和包维护者,包括 Ailing Zhang、Alex Suhan、Alfredo Mendoza、Alican Bozkurt、Andrew Tulloch、Ansha Yu、Anthony Shoumikhin、Bram Wasti、Brian Vaughan、Christian Puhrsch、David Reiss、David Riazati、Davide Libenzi、Dmytro Dzhulgakov、Dwaraj Rajagopal、Edward Yang、Elias Ellison、Fritz Obermeyer、George Zhang、Hao Lu、Hong Xu、Hung Duong、Igor Fedan、Ilia Cherniavski、Iuri Zdebskyi、Ivan Kobzarev、James Reed、Jeff Smith、Jerry Chen、Jerry Zhang、Jiakai Liu、Johannes M. Dieterich、Karl Ostmo、Lin Qiao、Martin Yuan、Michael Suo、Mike Ruberry、Mikhail Zolotukhin、Mingzhe Li、Neeraj Pradhan、Nick Korovaiko、Owen Anderson、Pavel Belevich、Peter Johnson、Pritam Damania、Raghuraman Krishnamoorthy、Richard Zou、Roy Li、Rui Zhu、Sebastian Messmer、Shen Li、Simon Wang、Supriya Rao、Tao Xu、Thomas Viehmann、Vincent Quenneville-Belair、Vishwak Srinivasan、Vitaly Fedyunin、Wanchao Liang、Wei Yang、Will Feng、Xiaomeng Yang、Xiaoqiang Zheng、Xintao Chen、Yangqing Jia、Yanli Zhao、Yinghai Lu 和 Zafar Takhirov。
References
参考文献
[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: 异构系统上的大规模机器学习, 2015. 软件可从 tensorflow.org 获取。
[22] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan D. Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
[22] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan D. Cohen, John Tran, Bryan Catanzaro, 和 Evan Shelhamer. cudnn: 深度学习的高效原语. CoRR, abs/1410.0759, 2014.
