Deep Learning Recommendation Model for Personalization and Recommendation Systems
Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang,
Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta†, Carole-Jean Wu,
Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu,
Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira,
Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong and Misha Smelyanskiy
Facebook, 1 Hacker Way, Menlo Park, CA 94065
Northwestern University, †Harvard University, work done while at Facebook.
ABSTRACT
With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood. In this paper, we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both PyTorch and Caffe2 frameworks. In addition, we design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers. We compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation and system co-design.
1 INTRODUCTION
Personalization and recommendation systems are currently deployed for a variety of tasks at large internet companies, including ad click-through rate (CTR) prediction and rankings. Although these methods have long histories, they have only recently embraced neural networks. Two primary perspectives contributed towards the architectural design of deep learning models for personalization and recommendation.
The first comes from the view of recommendation systems. These systems initially employed content filtering where a set of experts classified products into categories, while users selected their preferred categories and were matched based on their preferences mgp . The field subsequently evolved to use collaborative filtering, where recommendations are based on past user behaviors, such as prior ratings given to products. Neighborhood methods ning2015 that provide recommendations by grouping users and products together and latent factor methods that characterize users and products by certain implicit factors via matrix factorization techniques frolov2017tensor ; koren2009matrix were later deployed with success.
The second view comes from predictive analytics, which relies on statistical models to classify or predict the probability of events based on the given data Devroye1996 . Predictive models shifted from using simple models such as linear and logistic regression walker1967 to models that incorporate deep networks. In order to process categorical data, these models adopted the use of embeddings, which transform the one- and multi-hot vectors into dense representations in an abstract space naumov2019 . This abstract space may be interpreted as the space of the latent factors found by recommendation systems.
In this paper, we introduce a personalization model that was conceived by the union of the two perspectives described above. The model uses embeddings to process sparse features that represent categorical data and a multilayer perceptron (MLP) to process dense features, then interacts these features explicitly using the statistical techniques proposed in rendle2010 . Finally, it finds the event probability by post-processing the interactions with another MLP. We refer to this model as a deep learning recommendation model (DLRM); see Fig. 1. A PyTorch and Caffe2 implementation of this model will be released for testing and experimentation with the publication of this manuscript.
2 MODEL DESIGN AND ARCHITECTURE
In this section, we will describe the design of DLRM. We will begin with the high level components of the network and explain how and why they have been assembled together in a particular way, with implications for future model design, then characterize the low level operators and primitives that make up the model, with implications for future hardware and system design.
2.1 COMPONENTS OF DLRM
Figure 1: A deep learning recommendation model
The high-level components of the DLRM can be more easily understood by reviewing early models. We will avoid the full scientific literature review and focus instead on the four techniques used in early models that can be interpreted as salient high-level components of the DLRM.
Embeddings
In order to handle categorical data, embeddings map each category to a dense representation in an abstract space. In particular, each embedding lookup may be interpreted as using a one-hot vector $e_i$ (with the $i$-th position being 1 while others are 0, where index $i$ corresponds to the $i$-th category) to obtain the corresponding row vector of the embedding table $W\in\mathbb{R}^{m\times d}$ as follows

$$w_i^T = e_i^T W. \qquad (1)$$
In more complex scenarios, an embedding can also represent a weighted combination of multiple items, with a multi-hot vector of weights $a^T = [0, ..., a_{i_1}, ..., a_{i_k}, ..., 0]$, with elements $a_i \neq 0$ for $i = i_1, ..., i_k$ and 0 everywhere else, where $i_1, ..., i_k$ index the corresponding items. Note that a mini-batch of $t$ embedding lookups can hence be written as

$$S = A^T W \qquad (2)$$
where the sparse matrix $A = [a_1, ..., a_t]$ naumov2019 .
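To make the lookup concrete, the following minimal PyTorch sketch (table size, indices, and weights are hypothetical, and this is an illustration rather than the released implementation) checks that a weighted multi-hot lookup with nn.EmbeddingBag, the operator listed for DLRM in Table 1, matches the dense formulation $S = A^T W$:

```python
import torch

m, d, t = 10, 4, 2                            # hypothetical: m categories, dimension d, t lookups
W = torch.randn(m, d)                         # embedding table W in R^{m x d}

# Two weighted multi-hot lookups: sample 0 uses rows {1, 3}, sample 1 uses rows {4, 7}.
indices = torch.tensor([1, 3, 4, 7])
offsets = torch.tensor([0, 2])                # where each sample's indices start
weights = torch.tensor([0.5, 0.5, 1.0, 2.0])  # the nonzero entries a_i

bag = torch.nn.EmbeddingBag(m, d, mode="sum", _weight=W.clone())
S_lookup = bag(indices, offsets, per_sample_weights=weights)

# Equivalent dense formulation S = A^T W with sparse A = [a_1, ..., a_t].
A = torch.zeros(m, t)
A[[1, 3], 0] = torch.tensor([0.5, 0.5])
A[[4, 7], 1] = torch.tensor([1.0, 2.0])
S_dense = A.t() @ W

assert torch.allclose(S_lookup, S_dense, atol=1e-6)
```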
DLRMs will utilize embedding tables for mapping categorical features to dense representations. However, even after these embeddings are meaningfully devised, how are they to be exploited to produce accurate predictions? To answer this, we return to latent factor methods.
Matrix Factorization
Recall that in the typical formulation of the recommendation problem, we are given a set $S$ of users that have rated some products. We would like to represent the $i$-th product by a vector $w_i \in \mathbb{R}^d$ for $i = 1, ..., n$ and the $j$-th user by a vector $v_j \in \mathbb{R}^d$ for $j = 1, ..., m$ to find all the ratings, where $n$ and $m$ denote the total number of products and users, respectively. More rigorously, the set $S$ consists of tuples $(i, j)$ indexing when the $i$-th product has been rated by the $j$-th user.
The matrix factorization approach solves this problem by minimizing

$$\min \sum_{(i,j) \in S} r_{ij} - w_i^T v_j \qquad (3)$$

where $r_{ij} \in \mathbb{R}$ is the rating of the $i$-th product by the $j$-th user for $i = 1, ..., n$ and $j = 1, ..., m$. Then, letting $W^T = [w_1, ..., w_n]$ and $V^T = [v_1, ..., v_m]$, we may approximate the full matrix of ratings $R = [r_{ij}]$ as the matrix product $R \approx W V^T$. Note that $W$ and $V$ may be interpreted as two embedding tables, where each row represents a user/product in a latent factor space koren2009matrix . (This problem is different from low-rank approximation, which can be solved by SVD golub1996 , because not all entries of matrix $R$ are known.) The dot product of these embedding vectors yields a meaningful prediction of the subsequent rating, a key observation to the design of factorization machines and DLRM.
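As a small illustration of this objective (the observed set $S$ and all sizes below are hypothetical, and a squared-error form is used so that a gradient step is well defined), a predicted rating is simply the dot product of the corresponding product and user embedding rows:

```python
import torch

n, m, d = 6, 4, 3                              # hypothetical: n products, m users, d latent factors
W = torch.randn(n, d, requires_grad=True)      # product embeddings w_i
V = torch.randn(m, d, requires_grad=True)      # user embeddings v_j

# Observed (product i, user j, rating r_ij) triples, i.e. the set S.
S = [(0, 1, 4.0), (2, 3, 1.0), (5, 0, 3.5)]

# Only observed entries of R contribute; unobserved ratings are simply absent.
loss = sum((r_ij - W[i].dot(V[j])) ** 2 for i, j, r_ij in S)
loss.backward()                                # gradients flow into both embedding tables
```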
Factorization Machine
In classification problems, we want to define a prediction function $\phi : \mathbb{R}^n \to T$ from an input datapoint $x \in \mathbb{R}^n$ to a target label $y \in T$. As an example, we can predict the click-through rate by defining $T = \{+1, -1\}$ with $+1$ denoting the presence of a click and $-1$ as the absence of a click.
Factorization machines (FM) incorporate second-order interactions into a linear model with categorical data by defining a model of the form

$$\hat{y} = b + w^T x + x^T \mathrm{upper}(V V^T) x \qquad (4)$$

where $V \in \mathbb{R}^{n \times d}$, $w \in \mathbb{R}^n$, and $b \in \mathbb{R}$ are the parameters with $d \ll n$, and upper selects the strictly upper triangular part of the matrix rendle2010 .
FMs are notably distinct from support vector machines (SVMs) with polynomial kernels Corinna1995 because they factorize the second-order interaction matrix into its latent factors (or embedding vectors) as in matrix factorization, which more effectively handles sparse data. This significantly reduces the complexity of the second-order interactions by only capturing interactions between pairs of distinct embedding vectors, yielding linear computational complexity.
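A brief sketch of the FM prediction under these definitions (all dimensions and values are hypothetical) verifies that the factorized second-order term can be evaluated in $O(nd)$, without materializing $\mathrm{upper}(VV^T)$, via the standard sum-of-squares identity:

```python
import torch

n, d = 8, 4                                    # hypothetical: n input features, d latent factors
x = torch.randn(n)                             # input datapoint
w = torch.randn(n)                             # linear weights
b = torch.randn(())                            # bias
V = torch.randn(n, d)                          # factorized second-order parameters, d << n

# Naive form: x^T upper(V V^T) x, i.e. the sum over pairs i < j of <v_i, v_j> x_i x_j.
pairwise = torch.triu(V @ V.t(), diagonal=1)
y_naive = b + w.dot(x) + x @ pairwise @ x

# Equivalent O(n d) form: 0.5 * (||V^T x||^2 - sum_i x_i^2 ||v_i||^2).
xv = x @ V
y_fast = b + w.dot(x) + 0.5 * (xv.pow(2).sum() - (x.pow(2) @ V.pow(2)).sum())

assert torch.allclose(y_naive, y_fast, atol=1e-5)
```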
Multilayer Perceptrons
Simultaneously, much recent success in machine learning has been due to the rise of deep learning. The most fundamental model of these is the multilayer perceptron (MLP), a prediction function composed of an interleaving sequence of fully connected (FC) layers and an activation function $\sigma : \mathbb{R} \to \mathbb{R}$ applied componentwise as shown below

$$\hat{y} = W_k \sigma(W_{k-1} \sigma(\cdots \sigma(W_1 x + b_1) \cdots) + b_{k-1}) + b_k \qquad (5)$$
where the weight matrix $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$ and bias $b_l \in \mathbb{R}^{n_l}$ for layer $l = 1, ..., k$.
These methods have been used to capture more complex interactions. It has been shown, for example, that given enough parameters, MLPs with sufficient depth and width can fit data to arbitrary precision bishop1995 . Variations of these methods have been widely used in various applications including computer vision and natural language processing. One specific case, Neural Collaborative Filtering (NCF) he2017neural ; sedhain2015autorec used as part of the MLPerf benchmark mlperf , uses an MLP rather than dot product to compute interactions between embeddings in matrix factorization.
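For concreteness, a minimal PyTorch version of such an MLP is sketched below; the layer widths are hypothetical and ReLU stands in for the generic activation $\sigma$:

```python
import torch

def mlp(layer_sizes):
    # Builds y = W_k sigma( ... sigma(W_1 x + b_1) ... ) + b_k with sigma applied componentwise.
    layers = []
    for l in range(len(layer_sizes) - 1):
        layers.append(torch.nn.Linear(layer_sizes[l], layer_sizes[l + 1]))
        if l < len(layer_sizes) - 2:
            layers.append(torch.nn.ReLU())     # the componentwise activation sigma
    return torch.nn.Sequential(*layers)

bottom_mlp = mlp([13, 512, 256, 64])           # hypothetical widths n_0, ..., n_k
y = bottom_mlp(torch.randn(32, 13))            # mini-batch of 32 dense-feature vectors
```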
2.2 DLRM ARCHITECTURE
So far, we have described different models used in recommendation systems and predictive analytics. Let us now combine their intuitions to build a state-of-the-art personalization model.
Let the users and products be described by many continuous and categorical features. To process the categorical features, each categorical feature will be represented by an embedding vector of the same dimension, generalizing the concept of latent factors used in matrix factorization (3). To handle the continuous features, they will be transformed by an MLP (which we call the bottom or dense MLP), which will yield a dense representation of the same length as the embedding vectors (5).
We will compute second-order interaction of different features explicitly, following the intuition for handling sparse data provided in FMs (4), optionally passing them through MLPs. This is done by taking the dot product between all pairs of embedding vectors and processed dense features. These dot products are concatenated with the original processed dense features and post-processed with another MLP (the top or output MLP) (5), and fed into a sigmoid function to give a probability.
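A sketch of this interaction step for one mini-batch is given below; the batch size, dimension, and number of sparse features are hypothetical, and it assumes the bottom-MLP output and all embedding vectors already share the same dimension $d$:

```python
import torch

batch, d = 32, 16                              # hypothetical mini-batch size and feature dimension
dense = torch.randn(batch, d)                  # output of the bottom (dense) MLP
embs = [torch.randn(batch, d) for _ in range(3)]   # looked-up embedding vectors (3 sparse features)

# Stack the processed dense feature with the embeddings and take all pairwise dot products.
T = torch.stack([dense] + embs, dim=1)         # (batch, 1 + num_sparse, d)
Z = torch.bmm(T, T.transpose(1, 2))            # (batch, 1 + num_sparse, 1 + num_sparse)

# Keep only the dot products between distinct pairs (Z is symmetric), then concatenate
# them with the processed dense features to form the input of the top MLP.
li, lj = torch.tril_indices(T.shape[1], T.shape[1], offset=-1)
interactions = Z[:, li, lj]                    # (batch, num_pairs)
top_mlp_input = torch.cat([dense, interactions], dim=1)
```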
We refer to the resulting model as DLRM, shown in Fig. 1. We show some of the operators used in DLRM in PyTorch paszke2017pytorch and Caffe2 caffe2 frameworks in Table 1.
| | Embedding | MLP | Interactions | Loss |
|---|---|---|---|---|
| PyTorch | nn.EmbeddingBag | nn.Linear/addmm | matmul/bmm | nn.CrossEntropyLoss |
| Caffe2 | SparseLengthSum | FC | BatchMatMul | CrossEntropy |
Table 1: DLRM operators by framework
2.3 COMPARISON WITH PRIOR MODELS
Many deep learning-based recommendation models cheng2016wide ; guo2017deepfm ; wang2017deep ; lian2018xdeepfm ; zhou2018deepi ; zhou2018deep use similar underlying ideas to generate higher-order terms to handle sparse features. Wide and Deep, Deep and Cross, DeepFM, and xDeepFM networks, for example, design specialized networks to systematically construct higher-order interactions. These networks then sum the results from both their specialized model and an MLP, passing this through a linear layer and sigmoid activation to yield a final probability. DLRM specifically interacts embeddings in a structured way that mimics factorization machines to significantly reduce the dimensionality of the model by only considering cross-terms produced by the dot-product between pairs of embeddings in the final MLP. We argue that higher-order interactions beyond second-order found in other networks may not necessarily be worth the additional computational/memory cost.
A key difference between DLRM and other networks is in how these networks treat embedded feature vectors and their cross-terms. In particular, DLRM (and xDeepFM lian2018xdeepfm ) interpret each feature vector as a single unit representing a single category, whereas networks like Deep and Cross treat each element in the feature vector as a new unit that should yield different cross-terms. Hence, Deep and Cross networks will produce cross-terms not only between elements from different feature vectors as in DLRM via the dot product, but also produce cross-terms between elements within the same feature vector, resulting in higher dimensionality.
3 PARALLELISM
Modern personalization and recommendation systems require large and complex models to capitalize on vast amounts of data. DLRMs particularly contain a very large number of parameters, up to multiple orders of magnitude more than other common deep learning models like convolutional neural networks (CNN), transformer and recurrent networks (RNN), and generative networks (GAN). This results in training times up to several weeks or more. Hence, it is important to parallelize these models efficiently in order to solve these problems at practical scales.
As described in the previous section, DLRMs process both categorical features (with embeddings) and continuous features (with the bottom MLP) in a coupled manner. Embeddings contribute the majority of the parameters, with several tables each requiring in excess of multiple GBs of memory, making DLRM memory-capacity and bandwidth intensive. The size of the embeddings makes it prohibitive to use data parallelism since it requires replicating large embeddings on every device. In many cases, this memory constraint necessitates the distribution of the model across multiple devices to be able to satisfy memory capacity requirements.
On the other hand, the MLP parameters are smaller in memory but translate into sizeable amounts of compute. Hence, data-parallelism is preferred for MLPs since this enables concurrent processing of the samples on different devices and only requires communication when accumulating updates. Our parallelized DLRM will use a combination of model parallelism for the embeddings and data parallelism for the MLPs to mitigate the memory bottleneck produced by the embeddings while parallelizing the forward and backward propagations over the MLPs. Combined model and data parallelism is a unique requirement of DLRM as a result of its architecture and large model sizes. Such combined parallelism is not supported in either Caffe2 or PyTorch (as well as other popular deep learning frameworks), therefore we design a custom implementation. We plan to provide its detailed performance study in forthcoming work.
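The custom combined-parallelism implementation is not reproduced here; a rough sketch of the placement idea, with a hypothetical number of tables, table sizes, and a simple round-robin device assignment, is:

```python
import copy
import torch

num_tables, dim = 8, 64                        # hypothetical model shape
devices = ([f"cuda:{k}" for k in range(torch.cuda.device_count())]
           or ["cpu"])                         # fall back to CPU for the sketch

# Model parallelism: each (potentially multi-GB) embedding table lives on a single device.
emb_tables = [
    torch.nn.EmbeddingBag(1_000_000, dim, mode="sum").to(devices[k % len(devices)])
    for k in range(num_tables)
]

# Data parallelism: the comparatively small MLPs are replicated on every device, and their
# gradients are accumulated with an allreduce after the backward pass.
bottom_mlp = torch.nn.Sequential(torch.nn.Linear(13, dim), torch.nn.ReLU())
mlp_replicas = [copy.deepcopy(bottom_mlp).to(dev) for dev in devices]
```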
Figure 2: Butterfly shuffle for the all-to-all (personalized) communication

In our setup, the top MLP and the interaction operator require access to part of the mini-batch from the bottom MLP and all of the embeddings. Since model parallelism has been used to distribute the embeddings across devices, this requires a personalized all-to-all communication commsref . At the end of the embedding lookup, each device has a vector for the embedding tables resident on those devices for all the samples in the mini-batch, which needs to be split along the mini-batch dimension and communicated to the appropriate devices, as shown in Fig. 2. Neither PyTorch nor Caffe2 provide native support for model parallelism; therefore, we have implemented it by explicitly mapping the embedding operators (nn.EmbeddingBag for PyTorch, SparseLengthSum for Caffe2) to different devices. Then personalized all-to-all communication is implemented using the butterfly shuffle operator, which appropriately slices the resulting embedding vectors and transfers them to the target devices. In the current version, these transfers are explicit copies, but we intend to further optimize this using the available communication primitives (such as all-gather and send-recv).
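While the butterfly shuffle in the current version is realized with explicit copies, the data movement it performs corresponds to a personalized all-to-all exchange; a conceptual sketch using the torch.distributed.all_to_all collective (which assumes an already initialized NCCL or MPI process group and hypothetical sizes, and is not the paper's implementation) follows:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) was called with a backend that supports all_to_all
# (e.g. NCCL or MPI), typically under a launcher such as torchrun.
world_size = dist.get_world_size()
local_batch, d = 128, 64                       # hypothetical per-device mini-batch slice and dimension

# Each rank holds the lookup results of its resident embedding tables for the whole mini-batch.
local_tables_out = torch.randn(world_size * local_batch, d)

# Split along the mini-batch dimension: chunk p is destined for rank p.
send = list(local_tables_out.chunk(world_size, dim=0))
recv = [torch.empty(local_batch, d) for _ in range(world_size)]

# Personalized all-to-all: afterwards recv[q] holds this rank's mini-batch slice of the
# embeddings that were looked up on rank q, ready for the local interaction operator.
dist.all_to_all(recv, send)
```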
We note that for the data parallel MLPs, the parameter updates in the backward pass are accumulated with an allreduce (optimized implementations of the allreduce op include Nvidia's NCCL nccl and Facebook's gloo gloo ) and applied to the replicated parameters on each device commsref in a synchronous fashion, ensuring the updated parameters on each device are consistent before every iteration. In PyTorch, data parallelism is enabled through the nn.DistributedDataParallel and nn.DataParallel modules that replicate the model on each device and insert allreduce with the necessary dependencies. In Caffe2, we manually insert allreduce before the gradient update.
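For reference, a manual version of this gradient allreduce in PyTorch, in the spirit of the Caffe2 path described above (the helper name is hypothetical and a process group is assumed to be initialized), might look like:

```python
import torch
import torch.distributed as dist

def allreduce_gradients(mlp: torch.nn.Module) -> None:
    # Average the replicated MLP gradients across ranks before the optimizer step,
    # so that every replica applies the same synchronous update.
    world_size = dist.get_world_size()
    for p in mlp.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)
```

In the PyTorch path, wrapping the MLPs in nn.DistributedDataParallel performs the equivalent synchronization automatically during the backward pass.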