Deep Learning Recommendation Model for Personalization and Recommendation Systems


Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang,
Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta†, Carole-Jean Wu,
Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu,
Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira,
Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong and Misha Smelyanskiy
Facebook, 1 Hacker Way, Menlo Park, CA 94065

Northwestern University, †Harvard University, work done while at Facebook.


With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood. In this paper, we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both PyTorch and Caffe2 frameworks. In addition, we design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers. We compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation and system co-design.


随着深度学习的到来,基于神经网络的推荐模型已经成为解决个性化和推荐任务的重​​要工具。这些网络由于需要处理类别特征而与其他深度学习网络明显不同,因此并未得到充分的研究或理解。在本文中,我们开发了最新的深度学习推荐模型(DLRM),并提供了在PyTorch和Caffe2框架中的实现。此外,我们设计了一种特殊的并行化方案,该方案利用嵌入表上的模型并行性来减轻内存限制,同时利用数据并行性来从完全连接的层进行横向扩展计算。我们将DLRM与现有推荐模型进行了比较,并描述了其在Big Basin AI平台上的性能,


Personalization and recommendation systems are currently deployed for a variety of tasks at large internet companies, including ad click-through rate (CTR) prediction and rankings. Although these methods have had long histories, these approaches have only recently embraced neural networks. Two primary perspectives contributed towards the architectural design of deep learning models for personalization and recommendation.

The first comes from the view of recommendation systems. These systems initially employed content filtering where a set of experts classified products into categories, while users selected their preferred categories and were matched based on their preferences mgp . The field subsequently evolved to use collaborative filtering, where recommendations are based on past user behaviors, such as prior ratings given to products. Neighborhood methods ning2015 that provide recommendations by grouping users and products together and latent factor methods that characterize users and products by certain implicit factors via matrix factorization techniques frolov2017tensor ; koren2009matrix were later deployed with success.

首先是推荐系统。这些系统最初采用内容过滤,其中一组专家将产品划分为类别,而用户选择了他们喜欢的类别并根据他们的偏好mgp 进行了匹配。该领域随后发展为使用协作过滤,其中推荐基于过去的用户行为,例如对产品给予的先前评级。ning2015 邻域方法(通过将用户和产品分组来提供建议)和潜在因子方法(通过矩阵分解技术frolov2017tensor 通过某些隐性因子对用户和产品进行表征) ; 后来成功部署了koren2009matrix

The second view comes from predictive analytics, which relies on statistical models to classify or predict the probability of events based on the given data Devroye1996 . Predictive models shifted from using simple models such as linear and logistic regression walker1967 to models that incorporate deep networks. In order to process categorical data, these models adopted the use of embeddings, which transform the one- and multi-hot vectors into dense representations in an abstract space naumov2019 . This abstract space may be interpreted as the space of the latent factors found by recommendation systems.
第二种观点是预测分析,它依靠统计模型基于给定的数据Devroye1996 来分类或预测事件的概率。预测模型从使用简单模型(例如线性和逻辑回归Walker1967)转变 为包含深度网络的模型。为了处理分类数据,这些模型采用了嵌入技术,将单热和多热矢量转换为抽象空间naumov2019 中的密集表示。该抽象空间可以被解释为推荐系统发现的潜在因素的空间。

In this paper, we introduce a personalization model that was conceived by the union of the two perspectives described above. The model uses embeddings to process sparse features that represent categorical data and a multilayer perceptron (MLP) to process dense features, then interacts these features explicitly using the statistical techniques proposed in rendle2010 . Finally, it finds the event probability by post-processing the interactions with another MLP. We refer to this model as a deep learning recommendation model (DLRM); see Fig. 1. A PyTorch and Caffe2 implementation of this model will be released for testing and experimentation with the publication of this manuscript.

在本文中,我们介绍了一种个性化模型,该模型是上述两种观点的结合所构想的。该模型使用嵌入来处理代表分类数据的稀疏特征,并使用多层感知器(MLP)来处理密集特征,然后使用rendle2010中 提出的统计技术明确地交互这些特征。最后,它通过对与另一个MLP的交互进行后处理来找到事件概率。我们将此模型称为深度学习推荐模型(DLRM);参见图1。该模型的PyTorch和Caffe2实现将随该手稿的发布而发布,以进行测试和试验。


In this section, we will describe the design of DLRM. We will begin with the high level components of the network and explain how and why they have been assembled together in a particular way, with implications for future model design, then characterize the low level operators and primitives that make up the model, with implications for future hardware and system design.


Figure 1: A deep learning recommendation model

The high-level components of the DLRM can be more easily understood by reviewing early models. We will avoid the full scientific literature review and focus instead on the four techniques used in early models that can be interpreted as salient high-level components of the DLRM.


In order to handle categorical data, embeddings map each category to a dense representation in an abstract space. In particular, each embedding lookup may be interpreted as using a one-hot vector ei (with the i-th position being 1 while others are 0, where index i corresponds to i-th category) to obtain the corresponding row vector of the embedding table $W\in\mathbb{R}^{m\times d}$ as follows
为了处理分类数据,嵌入将每个类别映射到抽象空间中的密集表示。每个嵌入查找都可以解释为使用one-hot 向量(ei 就相当于是 embedding lookup ( 对位 i 是1,其他为0 ),embedding table image.png


In more complex scenarios, an embedding can also represent a weighted combination of multiple items, with a multi-hot vector of weightsimage.png
, with elements ai≠0 for i=i1,...,ik and 0 everywhere else, where i1,...,ik index the corresponding items. Note that a mini-batch of t embedding lookups can hence be written as

embedding 也可以被理解为是多个 items 的带权组合,mulit-hot vector,index ik 对应 items。一个大小为 t 的 mini-batch embedding lookups 可以写为:

where sparse matrix A=[a1,...,at] naumov2019 .

DLRMs will utilize embedding tables for mapping categorical features to dense representations. However, even after these embeddings are meaningfully devised, how are they to be exploited to produce accurate predictions? To answer this, we return to latent factor methods.
DLRMs 利用 embedding tables 将类别特征映射到稠密表示。但即使 embeddings 是有意义的,但我们要如何利用它们进行准确的预测呢?latent factor methods。

Matrix Factorization 矩阵分解

Recall that in the typical formulation of the recommendation problem, we are given a set S of users that have rated some products. We would like to represent the i-th product by a vector wi∈Rd for i=1,...,n and j-th user by a vector vj∈Rd for j=1,...,m to find all the ratings, where n and m denote the total number of products and users, respectively. More rigorously, the set S consists of tuples (i,j) indexing when the i-th product has been rated by the j-th user.
回顾推荐问题的典型场景,有一组用户S已经给一些产品进行了评价。我们将产品表示为 vector wi 将用户表示为 vector vj ,共 n 个产品,m 个用户。( *i,j ) S*

The matrix factorization approach solves this problem by minimizing


where rij∈R is the rating of the i-th product by the j-th user for i=1,...,m and j=1,...,n. Then, letting WT=[w1,...,wm] and VT=[v1,...,vn], we may approximate the full matrix of ratings R=[rij] as the matrix product R≈WVT. Note that W and V may be interpreted as two embedding tables, where each row represents a user/product in a latent factor space11This problem is different from low-rank approximation, which can be solved by SVD golub1996 , because not all entries of matrix R are known. koren2009matrix . The dot product of these embedding vectors yields a meaningful prediction of the subsequent rating, a key observation to the design of factorization machines and DLRM.

R 其实就是 Reward matrix,W,V 是两个 embedding tables,每一行分别表示着 laten factor space 里的 user/product。

vectors 的点积代表对 rating 的预测。

Factorization Machine

In classification problems, we want to define a prediction function ϕ:Rn→T from an input datapoint x∈Rn to a target label y∈T. As an example, we can predict the click-through rate by defining T={+1,−1} with +1 denoting the presence of a click and −1 as the absence of a click.
在分类问题中,我们会定一个预测函数:输入 x 预测 label y。我们定义 T={+1, -1},+1代表有点击,-1代表每点击,去预测 click-through rate。FM 在线性模型上引入了二阶交叉。

Factorization machines (FM) incorporate second-order interactions into a linear model with categorical data by defining a model of the form


where V∈Rn×d, w∈Rn, and b∈R are the parameters with d≪n, and upper selects the strictly upper triangular part of the matrix rendle2010 .

FMs are notably distinct from support vector machines (SVMs) with polynomial kernels Corinna1995 because they factorize the second-order interaction matrix into its latent factors (or embedding vectors) as in matrix factorization, which more effectively handles sparse data. This significantly reduces the complexity of the second-order interactions by only capturing interactions between pairs of distinct embedding vectors, yielding linear computational complexity.
FM 明显不同于多项式核函数 SVM 的是,它将分解二阶交叉矩阵到 latent factors ( or embedding vectors ) 就像矩阵分解似的,更高效的处理了稀疏数据。通过仅用不同 embedding vectors 交叉显著减少了二阶交叉的复杂度。转为了线性计算复杂度。

Multilayer Perceptrons

Simultaneously, much recent success in machine learning has been due to the rise of deep learning. The most fundamental model of these is the multilayer perceptron (MLP), a prediction function composed of an interleaving sequence of fully connected (FC) layers and an activation function σ:R→R applied componentwise as shown below
当然,当前许多在机器学习上的成功是源于深度学习的兴起。这里边最基础的模型就是 MLP 了,一个由 FC layers 和激活函数组成的预测函数。


where weight matrix Wl∈Rnl×nl−1, bias bl∈Rnl for layer l=1,...,k.

These methods have been used to capture more complex interactions. It has been shown, for example, that given enough parameters, MLPs with sufficient depth and width can fit data to arbitrary precision bishop1995 . Variations of these methods have been widely used in various applications including computer vision and natural language processing. One specific case, Neural Collaborative Filtering (NCF) he2017neural ; sedhain2015autorec used as part of the MLPerf benchmark mlperf , uses an MLP rather than dot product to compute interactions between embeddings in matrix factorization.
这些方法被用来捕获更复杂的 interactions。比如在有足够的参数下,MLPs 有足够的深度和广度来适应数据到任意精度。各种方法的变种被广泛应用到了各种场景下,包括 cv 和 nlp。有一个特别的 case,NCF 被用来做 MLPerf benchmark 的一部分,它使用 MLP 来计算矩阵分解中 embeddings 直接的 interaction,而非 dot product。


So far, we have described different models used in recommendation systems and predictive analytics. Let us now combine their intuitions to build a state-of-the-art personalization model.
上边描述了推荐系统和预测分析使用的不同模型,现在我们将其组合起来,构建一个 state-of-the-art 的个性化模型:

Let the users and products be described by many continuous and categorical features. To process the categorical features, each categorical feature will be represented by an embedding vector of the same dimension, generalizing the concept of latent factors used in matrix factorization (3). To handle the continuous features, the continuous features will be transformed by an MLP (which we call the bottom or dense MLP) which will yield a dense representation of the same length as the embedding vectors (5).

让users 和 products 通过许多连续和分类的功能来描述。为了处理分类特征,每个分类特征将由相同维数的嵌入向量表示,从而归纳了矩阵分解中使用的潜在因子的概念(3)。为了处理连续特征,将通过MLP(我们称为底部或密集MLP)对连续特征进行变换,该MLP将产生与embedding vector(5)相同长度的稠密向量

We will compute second-order interaction of different features explicitly, following the intuition for handling sparse data provided in FMs (4), optionally passing them through MLPs. This is done by taking the dot product between all pairs of embedding vectors and processed dense features. These dot products are concatenated with the original processed dense features and post-processed with another MLP (the top or output MLP) (5), and fed into a sigmoid function to give a probability.

我们按照 FM 的方式处理稀疏特征,显示地计算不同特征的二阶交叉,可选的将其传给 MLPs。将这些 dot products 与原始的 dense features 拼接起来,传给另一个 MLP ( the top or output MLP ),最后馈入Sigmoid以给出概率。

We refer to the resulting model as DLRM, shown in Fig. 1. We show some of the operators used in DLRM in PyTorch paszke2017pytorch and Caffe2 caffe2 frameworks in Table 1.
我们将结果模型称为DLRM,如图1所示。我们在表1中显示了PyTorch paszke2017pytorch 和Caffe2 caffe2 框架中的DLRM中使用的一些运算符。

Embedding MLP Interactions Loss
PyTorch nn.EmbeddingBag nn.Linear/addmm matmul/bmm nn.CrossEntropyLoss
Caffe2 SparseLengthSum FC BatchMatMul CrossEntropy

Table 1: DLRM operators by framework


Many deep learning-based recommendation models cheng2016wide ; guo2017deepfm ; wang2017deep ; lian2018xdeepfm ; zhou2018deepi ; zhou2018deep use similar underlying ideas to generate higher-order terms to handle sparse features. Wide and Deep, Deep and Cross, DeepFM, and xDeepFM networks, for example, design specialized networks to systematically construct higher-order interactions. These networks then sum the results from both their specialized model and an MLP, passing this through a linear layer and sigmoid activation to yield a final probability. DLRM specifically interacts embeddings in a structured way that mimics factorization machines to significantly reduce the dimensionality of the model by only considering cross-terms produced by the dot-product between pairs of embeddings in the final MLP. We argue that higher-order interactions beyond second-order found in other networks may not necessarily be worth the additional computational/memory cost.
许多基于深度学习的推荐模型cheng2016wide ; guo2017deepfm ; wang2017deep ; lian2018xdeepfm ; zhou2018deepi ; 周2018深 使用类似的基本思想来生成处理稀疏特征的高阶项。例如,宽和深,深和跨,DeepFM和xDeepFM网络设计了专门的网络来系统地构建高阶交互。然后,这些网络将其专用模型和MLP的结果相加,并通过线性层和S形激活,以得出最终概率。DLRM通过模仿结构化机器的结构化方式专门地与嵌入交互,从而仅考虑最终MLP中嵌入对之间的点积所产生的交叉项,从而显着降低了模型的维数。我们认为,在其他网络中发现的超出第二阶的更高阶交互可能不一定值得额外的计算/内存成本。

A key difference between DLRM and other networks is in how these networks treat embedded feature vectors and their cross-terms. In particular, DLRM (and xDeepFM lian2018xdeepfm ) interpret each feature vector as a single unit representing a single category, whereas networks like Deep and Cross treat each element in the feature vector as a new unit that should yield different cross-terms. Hence, Deep and Cross networks will produce cross-terms not only between elements from different feature vectors as in DLRM via the dot product, but also produce cross-terms between elements within the same feature vector, resulting in higher dimensionality.
DLRM与其他网络之间的主要区别在于这些网络如何处理嵌入式特征向量及其交叉项。特别是,DLRM(和xDeepFM lian2018xdeepfm )将每个特征向量解释为代表一个类别的单个单元,而像Deep和Cross这样的网络则将特征向量中的每个元素视为应产生不同交叉项的新单元。因此,深层和交叉网络不仅会像通过点积在DLRM中那样在不同特征向量的元素之间产生交叉项,而且还会在同一特征向量内的元素之间产生交叉项,从而导致更高的维数。


Modern personalization and recommendation systems require large and complex models to capitalize on vast amounts of data. DLRMs particularly contain a very large number of parameters, up to multiple orders of magnitude more than other common deep learning models like convolutional neural networks (CNN), transformer and recurrent networks (RNN), and generative networks (GAN). This results in training times up to several weeks or more. Hence, it is important to parallelize these models efficiently in order to solve these problems at practical scales.

As described in the previous section, DLRMs process both categorical features (with embeddings) and continuous features (with the bottom MLP) in a coupled manner. Embeddings contribute the majority of the parameters, with several tables each requiring in excess of multiple GBs of memory, making DLRM memory-capacity and bandwidth intensive. The size of the embeddings makes it prohibitive to use data parallelism since it requires replicating large embeddings on every device. In many cases, this memory constraint necessitates the distribution of the model across multiple devices to be able satisfy memory capacity requirements.

On the other hand, the MLP parameters are smaller in memory but translate into sizeable amounts of compute. Hence, data-parallelism is preferred for MLPs since this enables concurrent processing of the samples on different devices and only requires communication when accumulating updates. Our parallelized DLRM will use a combination of model parallelism for the embeddings and data parallelism for the MLPs to mitigate the memory bottleneck produced by the embeddings while parallelizing the forward and backward propagations over the MLPs. Combined model and data parallelism is a unique requirement of DLRM as a result of its architecture and large model sizes. Such combined parallelism is not supported in either Caffe2 or PyTorch (as well as other popular deep learning frameworks), therefore we design a custom implementation. We plan to provide its detailed performance study in forthcoming work.


Figure 2: Butterfly shuffle for the all-to-all (personalized) communicationIn our setup, the top MLP and the interaction operator require access to part of the mini-batch from the bottom MLP and all of the embeddings. Since model parallelism has been used to distribute the embeddings across devices, this requires a personalized all-to-all communication commsref . At the end of the embedding lookup, each device has a vector for the embedding tables resident on those devices for all the samples in the mini-batch, which needs to be split along the mini-batch dimension and communicated to the appropriate devices, as shown in Fig. 2. Neither PyTorch nor Caffe2 provide native support for model parallelism; therefore, we have implemented it by explicitly mapping the embedding operators (nn.EmbeddingBag for PyTorch, SparseLengthSum for Caffe2) to different devices. Then personalized all-to-all communication is implemented using the butterfly shuffle operator, which appropriately slices the resulting embedding vectors and transfers them to the target devices. In the current version, these transfers are explicit copies, but we intend to further optimize this using the available communication primitives (such as all-gather and send-recv).
在我们的设置中,顶部MLP和交互操作符要求从底部MLP和所有嵌入访问部分迷你批处理。由于已使用模型并行性在设备之间分布嵌入,因此这需要个性化的全方位通信commsref 。在嵌入查找的最后,每个设备都有一个用于微型批处理中所有样本的驻留在这些设备上的嵌入表的向量,该向量需要沿着微型批处理维进行拆分,并传递给适当的设备,如下所示:如图2所示。PyTorch和Caffe2都不为模型并行性提供本地支持。因此,我们已经明确地映射嵌入运营商(中实现它nn.EmbeddingBag为PyTorch,Caffe2的SparseLengthSum)到不同的设备。然后,使用Butterfly shuffle运算符实现个性化的全部通信,该运算符将生成的嵌入矢量适当地切片,并将其传输到目标设备。在当前版本中,这些传输是显式副本,但是我们打算使用可用的通信原语(例如all-gather和send-recv)进一步优化此传输。

We note that for the data parallel MLPs, the parameter updates in the backward pass are accumulated with an allreduce22Optimized implementations for the allreduce op. include Nvidia’s NCCL nccl and Facebook’s gloo gloo . and applied to the replicated parameters on each device commsref in a synchronous fashion, ensuring the updated parameters on each device are consistent before every iteration. In PyTorch, data parallelism is enabled through the nn.DistributedDataParallel and nn.DataParallel modules that replicate the model on each device and insert allreduce with the necessary dependencies. In Caffe2, we manually insert allreduce before the gradient update.
我们注意到,对于数据并行MLP,向后传递中的参数更新以allreduce 2的形式累积。(2个优化了allreduce op的实现。包括Nvidia的NCCL nccl 和Facebook的gloo gloo 。)并以同步方式应用于每个设备commsref 上的复制参数,以确保每个设备上的更新参数在每次迭代之前都是一致的。在PyTorch中,通过nn.DistributedDataParallel和nn.DataParallel模块启用数据并行性,这些模块在每个设备上复制模型并插入具有必要依赖项的allreduce。在Caffe2中,我们在渐变更新之前手动插入allreduce。

4DATA 数据

In order to measure the accuracy of the model, test its overall performance, and characterize the individual operators, we need to create or obtain a data set for our implementation. Our current implementation of the model supplies three types of data sets: random, synthetic and public data sets.

The former two data sets are useful in experimenting with the model from the systems perspective. In particular, it permits us to exercise different hardware properties and bottlenecks by generating data on the fly while removing dependencies on data storage systems. The latter allows us to perform experiments on real data and measure the accuracy of the model.


Recall that DLRM accepts continuous and categorical features as inputs. The former can be modeled by generating a vector of random numbers using either a uniform or normal (Gaussian) distributions with the numpy.random package rand or randn calls with default parameters. Then a mini-batch of inputs can be obtained by generating a matrix where each row corresponds to an element in the mini-batch.

To generate categorical features, we need to determine how many non-zero elements we would like have in a given multi-hot vector. The benchmark allows this number to be either fixed or random within a range33see options --num-indices-per-lookup=k and --num-indices-per-lookup-fixed [1,k]. Then, we generate the corresponding number of integer indices, within a range [1,m], where m is the number of rows in the embedding W in (2). Finally, in order to create a mini-batch of lookups, we concatenate the above indices and delineate each individual lookup with lengths (SparseLengthsSum) or offsets (nn.EmbeddingBag)44For instance, in order to represent three embedding lookups, with indices {0,2}, {0,1,5} and {3} we use


Note that this format resembles Compressed-Sparse Row (CSR) often used for sparse matrices in linear algebra..


There are many reasons to support custom generation of indices corresponding to categorical features. For instance, if our application uses a particular data set, but we would not like to share it for privacy purposes, then we may choose to express the categorical features through distributions. This could potentially serve as an alternative to the privacy preserving techniques used in applications such as federated learning bonawitz2019federated ; gentry2009 . Also, if we would like to exercise system components, such as studying memory behavior, we may want to capture fundamental locality of accesses of original trace within synthetic trace.
有很多理由支持自定义生成与分类特征相对应的索引。例如,如果我们的应用程序使用特定的数据集,但出于隐私目的我们不希望共享它,那么我们可以选择通过分布表示分类特征。这可能可以替代诸如联邦学习bonawitz2019federated 等应用中使用的隐私保护技术; gentry2009。另外,如果我们想练习系统组件(例如研究内存行为),我们可能希望捕获合成跟踪中对原始跟踪的访问的基本局部性。

Let us now illustrate how we can use a synthetic data set. Assume that we have a trace of indices that correspond to embedding lookups for a single categorical feature (and repeat the process for all features). We can record the unique accesses and frequency of distances between repeated accesses in this trace (Alg. 1) and then generate a synthetic trace (Alg. 2) as proposed in hassan2007synthetic .
现在让我们说明如何使用综合数据集。假设我们有一个索引的索引,这些索引与单个分类特征的嵌入查找相对应(并对所有特征重复该过程)。我们可以在此迹线(图1)中记录唯一访问和重复访问之间的距离频率,然后生成一个综合迹线(图2),如hassan2007synthetic中 所建议。

1:  Let tr be input sequence, s stack of distances, u list of unique accesses and p probability distribution

2:  Let s.position_from_the_top return d=0 if the index is not found, and d>0 otherwise.

3:  for i=0; i<length(tr); i++ do

4:     a = tr[i]

5:     d = s.position_from_the_top(a)

6:     if  d == 0 then

7:        u.append(a)

8:     else

9:        s.remove_from_the_top_at_position(d)

10:     end if

11:     p[d] += 1.0/length(tr)

12:     s.push_to_the_top(a)

13:  end for

Algorithm 1 Profile (Original) Trace1: Let u be input list of unique accesses and p probability distribution of distances, while tr output trace.

2:  for s=0, i=0; i<length; i++ do

3:     d = p.sample_from_distribution_with_support(0,s)

4:     if d == 0 then

5:        a = u.remove_from_front()

6:        s++

7:     else

8:        a = u.remove_from_the_back_at_position(d)

9:     end if

10:     u.append(a)

11:     tr[i] = a

12:  end for

Algorithm 2 Generate (Synthetic) Trace

Note that we can only generate a stack distance up to s number of unique accesses we have seen so far, therefore s is used to control the support of the distribution p in Alg. 2. Given a fixed number of unique accesses, the longer input trace will result in lower probability being assigned to them in Alg. 1, which will lead to longer time to achieve full distribution support in Alg. 2. In order to address this problem, we increase the probability for the unique accesses up to a minimum threshold and adjust support to remove unique accesses from it once all have been seen. A visual comparison of probability distribution p based on original and synthetic traces is shown in Fig. 3. In our experiments original and adjusted synthetic traces produce similar cache hit/miss rates.

(a) original (b) synthetic trace (c) adjusted synthetic trace

Figure 3: Probability distribution p based on a sample trace tr = random.uniform(1,100,100K)

Alg. 1 and 2 were designed for more accurate cache simulations, but they illustrate a general idea of how probability distributions can be used to generate synthetic traces with desired properties.


Few public data sets are available for recommendation and personalization systems. The Criteo AI Labs Ad Kaggle55https://www.kaggle.com/c/criteo-display-ad-challenge and Terabyte66https://labs.criteo.com/2013/12/download-terabyte-click-logs/ data sets are open-sourced data sets consisting of click logs for ad CTR prediction. Each data set contains 13 continuous and 26 categorical features. Typically the continuous features are pre-processed with a simple log transform log(1+x). The categorical feature are mapped to its corresponding embedding index, with unlabeled categorical features or labels mapped to 0 or NULL.

The Criteo Ad Kaggle data set contains approximately 45 million samples over 7 days. In experiments, typically the 7th day is split into a validation and test set while the first 6 days are used as the training set. The Criteo Ad Terabyte data set is sampled over 24 days, where the 24th day is split into a validation and test set and the first 23 days is used as a training set. Note that there are an approximately equal number of samples from each day.
很少有公共数据集可用于推荐和个性化系统。Criteo AI Labs Ad Kaggle 55https://www.kaggle.com/c/criteo-display-ad-challenge和TB 66https://labs.criteo.com/2013/12/download-terabyte-click-logs/数据集是开源数据集,由用于广告点击率预测的点击日志组成。每个数据集包含13个连续和26个分类特征。通常,使用简单的对数转换对连续特征进行预处理 log(1+x)。类别特征映射到其相应的嵌入索引,未标记的类别特征或标签映射到0或NULL。

Criteo Ad Kaggle数据集在7天内包含大约4,500万个样本。在实验中,通常将第7天分为验证和测试集,而前6天用作训练集。Criteo Ad Terabyte数据集在24天中进行了采样,其中第24天被分为验证和测试集,而前23天被用作训练集。请注意,每天有大约相等数量的样本。


Figure 4: Big Basin AI platformLet us now illustrate the performance and accuracy of DLRM. The model is implemented in PyTorch and Caffe2 frameworks and is available on GitHub77https://github.com/facebookresearch/dlrm. It uses fp32 floating point and int32(Caffe2)/int64(PyTorch) types for model parameters and indices, respectively. The experiments are performed on the Big Basin platform with Dual Socket Intel Xeon 6138 CPU @ 2.00GHz and eight Nvidia Tesla V100 16GB GPUs, publicly available through the Open Compute Project88https://www.opencompute.org, shown in Fig. 4.
现在让我们说明DLRM的性能和准确性。该模型在PyTorch和Caffe2框架中实现,并在GitHub 7上可用7https://github.com/facebookresearch/dlrm。它分别将fp32浮点和int32(Caffe2)/ int64(PyTorch)类型用于模型参数和索引。实验是在Big Basin平台上进行的,该平台配备了双插槽Intel Xeon 6138 CPU @ 2.00GHz和八个Nvidia Tesla V100 16GB GPU,可通过Open Compute Project 8公开获得。8https://www.opencompute.org,如图4所示


We evaluate the accuracy of the model on Criteo Ad Kaggle data set and compare the performance of DLRM against a Deep and Cross network (DCN) as-is without extensive tuning wang2017deep . We compare with DCN because it is one of the few models that has comprehensive results on the same data set. Notice that in this case the models are sized to accommodate the number of features present in the data set. In particular, DLRM consists of both a bottom MLP for processing dense features consisting of three hidden layers with 512, 256 and 64 nodes, respectively, and a top MLP consisting of two hidden layers with 512 and 256 nodes. On the other hand DCN consists of six cross layers and a deep network with 512 and 256 nodes. An embedding dimension of 16 is used. Note that this yields a DLRM and DCN both with approximately 540M parameters.

我们评估了Criteo Ad Kaggle数据集上模型的准确性,并在没有进行大量调整wang2017deep的 情况下按现状比较了DLRM与深度和跨网络(DCN)的性能。我们将DCN与之比较,因为它是在同一数据集上具有综合结果的少数模型之一。请注意,在这种情况下,模型的大小可容纳数据集中存在的要素数量。尤其是,DLRM包括两个底部MLP,用于处理由三个隐藏层组成的密集特征,512, 256 和 64 节点,以及由两个隐藏层组成的顶部MLP, 512 和 256节点。另一方面,DCN由六个交叉层和一个深层网络组成,其中512 和 256节点。的嵌入维度16用来。请注意,这会产生DLRM和DCN,两者的近似值540M 参数。

我们画出了训练集和验证集在单 epoch 的准确率,每个模型都使用了 SGD 和 Adagrad 优化器,没有使用正则化。实验中 DLRM 在训练集和验证集的准确率都略高一些,如图5。值得注意的是,这里边没有进行额外的调参。

SGD\thesubsubfigure SGD Adagrad\thesubsubfigure Adagrad

Figure 5: Comparison of training (solid) and validation (dashed) accuracies of DLRM and DCNWe plot both the training (solid) and validation (dashed) accuracies over a full single epoch of training for both models with SGD and Adagrad optimizers duchi2011 . No regularization is used. In this experiment, DLRM obtains slightly higher training and validation accuracy, as shown in Fig. 5. We emphasize that this is without extensive tuning of model hyperparameters.


To profile the performance of our model on a single socket device, we consider a sample model with 8 categorical features and 512 continuous features. Each categorical feature is processed through an embedding table with 1M vectors, with vector dimension 64, while the continuous features are assembled into a vector of dimension 512. Let the bottom MLP have two layers, while the top MLP has four layers. We profile this model on a data set with 2048K randomly generated samples organized into 1K mini-batches99For instance, this configuration can be achieved with the following command line arguments
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --arch-sparse-feature-size=64 --arch-mlp-bot=512-512-64 --arch-mlp-top=1024-1024-1024-1 --data-generation=random --mini-batch-size=2048 --num-batches=1000 --num-indices-per-lookup=100 [--use-gpu] [--enable-profiling].
这里使用8个离散特征+512个连续特征的模型去测我们模型在单 socket device 上的性能。每个离散特征被处理为有 1M vectors 的 embedding table,vector dimension 是64,这些连续特征等价一个有512维的 vector。让 bottom MLP 有两层,top MLP 有四层。我们在一个 2048K 的随机生成样本数据集即 1K 个 mini-batches 上评估模型。

Caffe2\thesubsubfigure Caffe2 PyTorch\thesubsubfigure PyTorch

Figure 6: Profiling of a sample DLRM on a single socket/device

This model implementation in Caffe2 runs in around 256 seconds on the CPU and 62 seconds on the GPU, with profiling of individual operators shown in Fig. 6. As expected, the majority of time is spent performing embedding lookups and fully connected layers. On the CPU, fully connected layers take a significant portion of the computation, while on the GPU they are almost negligible.


In this paper, we have proposed and open-sourced a novel deep learning-based recommendation model that exploits categorical data. Although recommendation and personalization systems still drive much practical success of deep learning within industry today, these networks continue to receive little attention in the academic community. By providing a detailed description of a state-of-the-art recommendation system and its open-source implementation, we hope to draw attention to the unique challenges that this class of networks present in an accessible way for the purpose of further algorithmic experimentation, modeling, system co-design, and benchmarking.


The authors would like to acknowledge AI Systems Co-Design, Caffe2, PyTorch and AML team members for their help in reviewing this document.