[Paper Translation] Facebook: Deep Learning Recommendation Model for Personalization and Recommendation Systems


Original paper: https://arxiv.org/pdf/1906.00091.pdf

Code: https://github.com/facebookresearch/dlrm

Deep Learning Recommendation Model for Personalization and Recommendation Systems

Authors

Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang,
Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta†, Carole-Jean Wu,
Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu,
Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira,
Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong and Misha Smelyanskiy
Facebook, 1 Hacker Way, Menlo Park, CA 94065

Northwestern University, †Harvard University, work done while at Facebook.

ABSTRACT

With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood. In this paper, we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both PyTorch and Caffe2 frameworks. In addition, we design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers. We compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation and system co-design.


1 INTRODUCTION

Personalization and recommendation systems are currently deployed for a variety of tasks at large internet companies, including ad click-through rate (CTR) prediction and rankings. Although these methods have had long histories, these approaches have only recently embraced neural networks. Two primary perspectives contributed towards the architectural design of deep learning models for personalization and recommendation.

The first comes from the view of recommendation systems. These systems initially employed content filtering where a set of experts classified products into categories, while users selected their preferred categories and were matched based on their preferences mgp . The field subsequently evolved to use collaborative filtering, where recommendations are based on past user behaviors, such as prior ratings given to products. Neighborhood methods ning2015 that provide recommendations by grouping users and products together and latent factor methods that characterize users and products by certain implicit factors via matrix factorization techniques frolov2017tensor ; koren2009matrix were later deployed with success.


The second view comes from predictive analytics, which relies on statistical models to classify or predict the probability of events based on the given data Devroye1996 . Predictive models shifted from using simple models such as linear and logistic regression walker1967 to models that incorporate deep networks. In order to process categorical data, these models adopted the use of embeddings, which transform the one- and multi-hot vectors into dense representations in an abstract space naumov2019 . This abstract space may be interpreted as the space of the latent factors found by recommendation systems.

In this paper, we introduce a personalization model that was conceived by the union of the two perspectives described above. The model uses embeddings to process sparse features that represent categorical data and a multilayer perceptron (MLP) to process dense features, then interacts these features explicitly using the statistical techniques proposed in rendle2010 . Finally, it finds the event probability by post-processing the interactions with another MLP. We refer to this model as a deep learning recommendation model (DLRM); see Fig. 1. A PyTorch and Caffe2 implementation of this model will be released for testing and experimentation with the publication of this manuscript.


2 MODEL DESIGN AND ARCHITECTURE

In this section, we will describe the design of DLRM. We will begin with the high level components of the network and explain how and why they have been assembled together in a particular way, with implications for future model design, then characterize the low level operators and primitives that make up the model, with implications for future hardware and system design.

2.1 COMPONENTS OF DLRM


Figure 1: A deep learning recommendation model

The high-level components of the DLRM can be more easily understood by reviewing early models. We will avoid the full scientific literature review and focus instead on the four techniques used in early models that can be interpreted as salient high-level components of the DLRM.

Embeddings

In order to handle categorical data, embeddings map each category to a dense representation in an abstract space. In particular, each embedding lookup may be interpreted as using a one-hot vector ei (with the i-th position being 1 while others are 0, where index i corresponds to i-th category) to obtain the corresponding row vector of the embedding table $W\in\mathbb{R}^{m\times d}$ as follows
$$w_i^T = e_i^T W \qquad (1)$$

In more complex scenarios, an embedding can also represent a weighted combination of multiple items, with a multi-hot vector of weights $a^T = [0, \dots, a_{i_1}, \dots, a_{i_k}, \dots, 0]$, with elements $a_i \neq 0$ for $i = i_1, \dots, i_k$ and 0 everywhere else, where $i_1, \dots, i_k$ index the corresponding items. Note that a mini-batch of $t$ embedding lookups can hence be written as

$$S = A^T W \qquad (2)$$

where the sparse matrix $A = [a_1, \dots, a_t]$ naumov2019 .
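As a concrete illustration of (1)–(2), the following sketch (not the paper's code; table sizes and values are made up, and unit weights are used for the multi-hot vectors) checks that a mini-batch of lookups computed with PyTorch's nn.EmbeddingBag, the operator DLRM uses for embeddings (see Table 1), matches the dense product $A^T W$:

```python
import torch
import torch.nn as nn

# Illustrative sketch of Eq. (2): a mini-batch of t=3 embedding lookups over a
# table W with m=6 rows and d=4 columns, expressed both as the sparse matrix
# product S = A^T W and with nn.EmbeddingBag.
m, d = 6, 4
W = torch.randn(m, d)

# Three lookups: {0, 2}, {0, 1, 5}, {3}, each with unit (multi-hot) weights.
indices = torch.tensor([0, 2, 0, 1, 5, 3])
offsets = torch.tensor([0, 2, 5])

# Dense "A^T" for illustration: each row is one multi-hot vector a_i^T.
A_t = torch.zeros(3, m)
A_t[0, [0, 2]] = 1.0
A_t[1, [0, 1, 5]] = 1.0
A_t[2, [3]] = 1.0
S_dense = A_t @ W  # S = A^T W

# Same result with the EmbeddingBag operator (mode="sum").
bag = nn.EmbeddingBag(m, d, mode="sum")
with torch.no_grad():
    bag.weight.copy_(W)
S_bag = bag(indices, offsets)

assert torch.allclose(S_dense, S_bag)
```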

DLRMs will utilize embedding tables for mapping categorical features to dense representations. However, even after these embeddings are meaningfully devised, how are they to be exploited to produce accurate predictions? To answer this, we return to latent factor methods.

Matrix Factorization

Recall that in the typical formulation of the recommendation problem, we are given a set S of users that have rated some products. We would like to represent the i-th product by a vector $w_i \in \mathbb{R}^d$ for $i = 1, \dots, n$ and the j-th user by a vector $v_j \in \mathbb{R}^d$ for $j = 1, \dots, m$ to find all the ratings, where $n$ and $m$ denote the total number of products and users, respectively. More rigorously, the set S consists of tuples $(i, j)$ indexing when the i-th product has been rated by the j-th user.

The matrix factorization approach solves this problem by minimizing

$$\min \sum_{(i,j) \in S} r_{ij} - w_i^T v_j \qquad (3)$$

where $r_{ij} \in \mathbb{R}$ is the rating of the i-th product by the j-th user for $i = 1, \dots, n$ and $j = 1, \dots, m$. Then, letting $W^T = [w_1, \dots, w_n]$ and $V^T = [v_1, \dots, v_m]$, we may approximate the full matrix of ratings $R = [r_{ij}]$ as the matrix product $R \approx W V^T$. Note that $W$ and $V$ may be interpreted as two embedding tables, where each row represents a user/product in a latent factor space koren2009matrix (this problem is different from low-rank approximation, which can be solved by SVD golub1996, because not all entries of matrix $R$ are known). The dot product of these embedding vectors yields a meaningful prediction of the subsequent rating, a key observation to the design of factorization machines and DLRM.

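A toy sketch of this latent-factor view (dimensions are arbitrary and not taken from the paper): the full rating matrix is approximated by $W V^T$ and an individual rating by the dot product $w_i^T v_j$:

```python
import torch

# Toy sketch: n=4 products, m=3 users, d=2 latent factors.
n, m, d = 4, 3, 2
W = torch.randn(n, d)  # product embeddings w_i
V = torch.randn(m, d)  # user embeddings v_j

R_hat = W @ V.T          # full matrix of predicted ratings, R ≈ W V^T
r_ij = W[1].dot(V[0])    # predicted rating of product i=1 by user j=0
assert torch.isclose(r_ij, R_hat[1, 0])
```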

Factorization Machine

In classification problems, we want to define a prediction function $\phi : \mathbb{R}^n \to T$ from an input datapoint $x \in \mathbb{R}^n$ to a target label $y \in T$. As an example, we can predict the click-through rate by defining $T = \{+1, -1\}$ with $+1$ denoting the presence of a click and $-1$ as the absence of a click.

Factorization machines (FM) incorporate second-order interactions into a linear model with categorical data by defining a model of the form

$$\hat{y} = b + w^T x + x^T \, \mathrm{upper}(V V^T) \, x \qquad (4)$$

where $V \in \mathbb{R}^{n \times d}$, $w \in \mathbb{R}^n$, and $b \in \mathbb{R}$ are the parameters with $d \ll n$, and $\mathrm{upper}$ selects the strictly upper triangular part of the matrix rendle2010 .
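The following sketch (illustrative sizes, not from the paper) evaluates (4) both with the explicit $\mathrm{upper}(V V^T)$ term and with the standard factorized reformulation that gives FMs their linear complexity:

```python
import torch

# Hedged sketch of Eq. (4): y = b + w^T x + x^T upper(V V^T) x, comparing the
# naive quadratic form with the O(n d) factorized form.
n, d = 8, 3
x = torch.randn(n)
w = torch.randn(n)
b = torch.randn(())
V = torch.randn(n, d)

# Naive: strictly upper-triangular part of V V^T.
upper = torch.triu(V @ V.T, diagonal=1)
y_naive = b + w.dot(x) + x @ upper @ x

# Factorized: sum_{i<j} <v_i, v_j> x_i x_j computed in O(n d).
xv = V * x.unsqueeze(1)  # rows x_i * v_i
second_order = 0.5 * ((xv.sum(0) ** 2).sum() - (xv ** 2).sum())
y_fast = b + w.dot(x) + second_order

assert torch.isclose(y_naive, y_fast, atol=1e-5)
```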

FMs are notably distinct from support vector machines (SVMs) with polynomial kernels Corinna1995 because they factorize the second-order interaction matrix into its latent factors (or embedding vectors) as in matrix factorization, which more effectively handles sparse data. This significantly reduces the complexity of the second-order interactions by only capturing interactions between pairs of distinct embedding vectors, yielding linear computational complexity.

Multilayer Perceptrons

Simultaneously, much recent success in machine learning has been due to the rise of deep learning. The most fundamental model of these is the multilayer perceptron (MLP), a prediction function composed of an interleaving sequence of fully connected (FC) layers and an activation function σ:R→R applied componentwise as shown below

$$\hat{y} = W_k \, \sigma\big(W_{k-1} \, \sigma\big(\cdots \sigma(W_1 x + b_1) \cdots\big) + b_{k-1}\big) + b_k \qquad (5)$$

where the weight matrix $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$ and bias $b_l \in \mathbb{R}^{n_l}$ for layer $l = 1, \dots, k$.
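As a minimal sketch of (5), one might assemble such an MLP from the nn.Linear operator listed for PyTorch in Table 1; the layer sizes and activations below are illustrative choices, not prescribed by the paper:

```python
import torch
import torch.nn as nn

# Minimal sketch of Eq. (5): interleaved FC layers and activations.
def mlp(layer_sizes, sigmoid_last=False):
    layers = []
    for l, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        layers.append(nn.Linear(n_in, n_out))  # W_l x + b_l
        last = l == len(layer_sizes) - 2
        layers.append(nn.Sigmoid() if (last and sigmoid_last) else nn.ReLU())
    return nn.Sequential(*layers)

bottom_mlp = mlp([13, 512, 256, 64])                   # dense-feature (bottom) MLP
top_mlp = mlp([128, 512, 256, 1], sigmoid_last=True)   # output (top) MLP
print(bottom_mlp(torch.randn(2, 13)).shape)            # torch.Size([2, 64])
```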

These methods have been used to capture more complex interactions. It has been shown, for example, that given enough parameters, MLPs with sufficient depth and width can fit data to arbitrary precision bishop1995 . Variations of these methods have been widely used in various applications including computer vision and natural language processing. One specific case, Neural Collaborative Filtering (NCF) he2017neural ; sedhain2015autorec used as part of the MLPerf benchmark mlperf , uses an MLP rather than dot product to compute interactions between embeddings in matrix factorization.

2.2 DLRM ARCHITECTURE

So far, we have described different models used in recommendation systems and predictive analytics. Let us now combine their intuitions to build a state-of-the-art personalization model.

Let the users and products be described by many continuous and categorical features. To process the categorical features, each categorical feature will be represented by an embedding vector of the same dimension, generalizing the concept of latent factors used in matrix factorization (3). To handle the continuous features, the continuous features will be transformed by an MLP (which we call the bottom or dense MLP) which will yield a dense representation of the same length as the embedding vectors (5).


We will compute second-order interaction of different features explicitly, following the intuition for handling sparse data provided in FMs (4), optionally passing them through MLPs. This is done by taking the dot product between all pairs of embedding vectors and processed dense features. These dot products are concatenated with the original processed dense features and post-processed with another MLP (the top or output MLP) (5), and fed into a sigmoid function to give a probability.

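A hedged sketch of this interaction step (shapes and the triangular indexing are illustrative choices; the released code may organize the operator differently): pairwise dot products are formed with a batched matmul, the distinct pairs are kept, and the result is concatenated with the processed dense features before the top MLP:

```python
import torch

# Illustrative shapes: 4 samples, embedding dimension 16, 3 sparse features.
batch, d, num_sparse = 4, 16, 3

dense = torch.randn(batch, d)                               # bottom-MLP output
embs = [torch.randn(batch, d) for _ in range(num_sparse)]   # embedding lookups

# Stack dense + embeddings into (batch, num_features, d) and form pairwise
# dot products with a batched matmul (bmm in Table 1).
T = torch.stack([dense] + embs, dim=1)        # (batch, 4, d)
Z = torch.bmm(T, T.transpose(1, 2))           # (batch, 4, 4) dot products

# Keep only the strictly lower-triangular entries (distinct feature pairs).
i, j = torch.tril_indices(T.shape[1], T.shape[1], offset=-1)
interactions = Z[:, i, j]                     # (batch, 6)

top_input = torch.cat([dense, interactions], dim=1)  # fed to the top MLP
print(top_input.shape)                        # torch.Size([4, 22])
```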

We refer to the resulting model as DLRM, shown in Fig. 1. We show some of the operators used in DLRM in PyTorch paszke2017pytorch and Caffe2 caffe2 frameworks in Table 1.

          Embedding         MLP              Interactions  Loss
PyTorch   nn.EmbeddingBag   nn.Linear/addmm  matmul/bmm    nn.CrossEntropyLoss
Caffe2    SparseLengthsSum  FC               BatchMatMul   CrossEntropy

Table 1: DLRM operators by framework

2.3 COMPARISON WITH PRIOR MODELS

Many deep learning-based recommendation models cheng2016wide ; guo2017deepfm ; wang2017deep ; lian2018xdeepfm ; zhou2018deepi ; zhou2018deep use similar underlying ideas to generate higher-order terms to handle sparse features. Wide and Deep, Deep and Cross, DeepFM, and xDeepFM networks, for example, design specialized networks to systematically construct higher-order interactions. These networks then sum the results from both their specialized model and an MLP, passing this through a linear layer and sigmoid activation to yield a final probability. DLRM specifically interacts embeddings in a structured way that mimics factorization machines to significantly reduce the dimensionality of the model by only considering cross-terms produced by the dot-product between pairs of embeddings in the final MLP. We argue that higher-order interactions beyond second-order found in other networks may not necessarily be worth the additional computational/memory cost.

A key difference between DLRM and other networks is in how these networks treat embedded feature vectors and their cross-terms. In particular, DLRM (and xDeepFM lian2018xdeepfm ) interpret each feature vector as a single unit representing a single category, whereas networks like Deep and Cross treat each element in the feature vector as a new unit that should yield different cross-terms. Hence, Deep and Cross networks will produce cross-terms not only between elements from different feature vectors as in DLRM via the dot product, but also produce cross-terms between elements within the same feature vector, resulting in higher dimensionality.

3 PARALLELISM

Modern personalization and recommendation systems require large and complex models to capitalize on vast amounts of data. DLRMs particularly contain a very large number of parameters, up to multiple orders of magnitude more than other common deep learning models like convolutional neural networks (CNN), transformer and recurrent networks (RNN), and generative networks (GAN). This results in training times up to several weeks or more. Hence, it is important to parallelize these models efficiently in order to solve these problems at practical scales.

As described in the previous section, DLRMs process both categorical features (with embeddings) and continuous features (with the bottom MLP) in a coupled manner. Embeddings contribute the majority of the parameters, with several tables each requiring in excess of multiple GBs of memory, making DLRM memory-capacity and bandwidth intensive. The size of the embeddings makes it prohibitive to use data parallelism since it requires replicating large embeddings on every device. In many cases, this memory constraint necessitates the distribution of the model across multiple devices to be able satisfy memory capacity requirements.

On the other hand, the MLP parameters are smaller in memory but translate into sizeable amounts of compute. Hence, data-parallelism is preferred for MLPs since this enables concurrent processing of the samples on different devices and only requires communication when accumulating updates. Our parallelized DLRM will use a combination of model parallelism for the embeddings and data parallelism for the MLPs to mitigate the memory bottleneck produced by the embeddings while parallelizing the forward and backward propagations over the MLPs. Combined model and data parallelism is a unique requirement of DLRM as a result of its architecture and large model sizes. Such combined parallelism is not supported in either Caffe2 or PyTorch (as well as other popular deep learning frameworks), therefore we design a custom implementation. We plan to provide its detailed performance study in forthcoming work.


Figure 2: Butterfly shuffle for the all-to-all (personalized) communication

In our setup, the top MLP and the interaction operator require access to part of the mini-batch from the bottom MLP and all of the embeddings. Since model parallelism has been used to distribute the embeddings across devices, this requires a personalized all-to-all communication commsref . At the end of the embedding lookup, each device has a vector for the embedding tables resident on those devices for all the samples in the mini-batch, which needs to be split along the mini-batch dimension and communicated to the appropriate devices, as shown in Fig. 2. Neither PyTorch nor Caffe2 provide native support for model parallelism; therefore, we have implemented it by explicitly mapping the embedding operators (nn.EmbeddingBag for PyTorch, SparseLengthsSum for Caffe2) to different devices. Then personalized all-to-all communication is implemented using the butterfly shuffle operator, which appropriately slices the resulting embedding vectors and transfers them to the target devices. In the current version, these transfers are explicit copies, but we intend to further optimize this using the available communication primitives (such as all-gather and send-recv).
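The following single-process sketch (a simulation with Python lists standing in for devices, not the actual multi-GPU code) illustrates what the butterfly shuffle accomplishes: each device starts with the full mini-batch for its local tables and ends with only its mini-batch slice for all tables:

```python
import torch

# Simulated setup: 4 "devices", 2 embedding tables per device, mini-batch of 8.
num_devices, batch, d = 4, 8, 16
tables_per_device = 2
per_device_batch = batch // num_devices

# Before the shuffle: on device p, one (batch, d) tensor per local table.
before = [[torch.randn(batch, d) for _ in range(tables_per_device)]
          for _ in range(num_devices)]

# After the shuffle: device p holds, for *all* tables, only its batch slice.
after = [[] for _ in range(num_devices)]
for src in range(num_devices):
    for table in before[src]:
        slices = table.split(per_device_batch, dim=0)  # split along batch dim
        for dst in range(num_devices):
            after[dst].append(slices[dst])             # "send" slice to dst

# Each device now has num_devices * tables_per_device = 8 slices of its samples.
print(len(after[0]), after[0][0].shape)  # 8 torch.Size([2, 16])
```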

We note that for the data parallel MLPs, the parameter updates in the backward pass are accumulated with an allreduce (optimized implementations of the allreduce op include Nvidia's NCCL nccl and Facebook's gloo gloo) and applied to the replicated parameters on each device commsref in a synchronous fashion, ensuring the updated parameters on each device are consistent before every iteration. In PyTorch, data parallelism is enabled through the nn.DistributedDataParallel and nn.DataParallel modules that replicate the model on each device and insert allreduce with the necessary dependencies. In Caffe2, we manually insert allreduce before the gradient update.
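A single-process sketch of this synchronous data-parallel update (the allreduce is emulated by averaging gradients across two in-memory replicas; real training would use nn.DistributedDataParallel or NCCL/gloo collectives):

```python
import torch
import torch.nn as nn

# Two replicas each compute gradients on their half of the mini-batch, the
# gradients are averaged (the allreduce step), and the same averaged update is
# applied to both replicas so their parameters stay consistent.
torch.manual_seed(0)
replicas = [nn.Linear(8, 1), nn.Linear(8, 1)]
replicas[1].load_state_dict(replicas[0].state_dict())  # identical start

x, y = torch.randn(4, 8), torch.randn(4, 1)
shards = [(x[:2], y[:2]), (x[2:], y[2:])]

grads = []
for model, (xb, yb) in zip(replicas, shards):
    loss = nn.functional.mse_loss(model(xb), yb)
    loss.backward()
    grads.append([p.grad.clone() for p in model.parameters()])

# "Allreduce": average gradients across replicas, then take the same SGD step.
avg = [torch.stack(gs).mean(0) for gs in zip(*grads)]
for model in replicas:
    with torch.no_grad():
        for p, g in zip(model.parameters(), avg):
            p -= 0.1 * g

assert all(torch.allclose(a, b) for a, b in
           zip(replicas[0].parameters(), replicas[1].parameters()))
```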

4 DATA

In order to measure the accuracy of the model, test its overall performance, and characterize the individual operators, we need to create or obtain a data set for our implementation. Our current implementation of the model supplies three types of data sets: random, synthetic and public data sets.

The former two data sets are useful in experimenting with the model from the systems perspective. In particular, it permits us to exercise different hardware properties and bottlenecks by generating data on the fly while removing dependencies on data storage systems. The latter allows us to perform experiments on real data and measure the accuracy of the model.

4.1 RANDOM

Recall that DLRM accepts continuous and categorical features as inputs. The former can be modeled by generating a vector of random numbers using either a uniform or normal (Gaussian) distribution with the numpy.random package's rand or randn calls with default parameters. Then a mini-batch of inputs can be obtained by generating a matrix where each row corresponds to an element in the mini-batch.
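For example, a random mini-batch of dense features might be generated as follows (sizes are illustrative):

```python
import numpy as np

# Sketch of the random dense-feature generator: a mini-batch is a matrix with
# one row per sample, drawn with numpy.random's rand/randn (default parameters).
mini_batch_size, num_dense_features = 2048, 13  # illustrative sizes

dense_uniform = np.random.rand(mini_batch_size, num_dense_features)
dense_gaussian = np.random.randn(mini_batch_size, num_dense_features)
print(dense_uniform.shape, dense_gaussian.shape)
```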

To generate categorical features, we need to determine how many non-zero elements we would like to have in a given multi-hot vector. The benchmark allows this number to be either fixed or random within a range [1, k] (see the options --num-indices-per-lookup=k and --num-indices-per-lookup-fixed). Then, we generate the corresponding number of integer indices, within a range [1, m], where m is the number of rows in the embedding W in (2). Finally, in order to create a mini-batch of lookups, we concatenate the above indices and delineate each individual lookup with lengths (SparseLengthsSum) or offsets (nn.EmbeddingBag). For instance, in order to represent three embedding lookups, with indices {0, 2}, {0, 1, 5} and {3}, we use

lengths/offsets = {2, 3, 1} / {0, 2, 5}
indices = {0, 2, 0, 1, 5, 3}

Note that this format resembles the Compressed-Sparse Row (CSR) format often used for sparse matrices in linear algebra.
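A small sketch of this generator (sizes are illustrative; the benchmark's own implementation may differ in details such as the index base):

```python
import numpy as np

# For each of the t samples in a mini-batch, draw between 1 and k indices into
# an embedding table with m rows, then concatenate them and record the
# lengths/offsets (the CSR-like format noted above).
m, k, t = 1_000_000, 100, 4  # table rows, max indices per lookup, mini-batch

lengths = np.random.randint(1, k + 1, size=t)
# Indices drawn in [0, m-1]; the text's range [1, m] assumes 1-based rows.
indices = np.concatenate([np.random.randint(0, m, size=n) for n in lengths])
offsets = np.concatenate([[0], np.cumsum(lengths)[:-1]])

print(lengths, offsets, indices[:10])
```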

4.2 SYNTHETIC

There are many reasons to support custom generation of indices corresponding to categorical features. For instance, if our application uses a particular data set, but we would not like to share it for privacy purposes, then we may choose to express the categorical features through distributions. This could potentially serve as an alternative to the privacy-preserving techniques used in applications such as federated learning bonawitz2019federated ; gentry2009 . Also, if we would like to exercise system components, such as studying memory behavior, we may want to capture the fundamental locality of accesses of the original trace within the synthetic trace.

Let us now illustrate how we can use a synthetic data set. Assume that we have a trace of indices that correspond to embedding lookups for a single categorical feature (and repeat the process for all features). We can record the unique accesses and frequency of distances between repeated accesses in this trace (Alg. 1) and then generate a synthetic trace (Alg. 2) as proposed in hassan2007synthetic .

Algorithm 1 Profile (Original) Trace

1:  Let tr be the input sequence, s a stack of distances, u a list of unique accesses and p a probability distribution
2:  Let s.position_from_the_top(a) return d=0 if the index a is not found, and d>0 otherwise
3:  for i=0; i<length(tr); i++ do
4:     a = tr[i]
5:     d = s.position_from_the_top(a)
6:     if d == 0 then
7:        u.append(a)
8:     else
9:        s.remove_from_the_top_at_position(d)
10:    end if
11:    p[d] += 1.0/length(tr)
12:    s.push_to_the_top(a)
13: end for

Algorithm 2 Generate (Synthetic) Trace

1:  Let u be the input list of unique accesses and p the probability distribution of distances, while tr is the output trace
2:  for s=0, i=0; i<length; i++ do
3:     d = p.sample_from_distribution_with_support(0,s)
4:     if d == 0 then
5:        a = u.remove_from_front()
6:        s++
7:     else
8:        a = u.remove_from_the_back_at_position(d)
9:     end if
10:    u.append(a)
11:    tr[i] = a
12: end for
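A compact Python sketch of Alg. 1 and Alg. 2 is given below (plain lists stand in for the stack/list primitives named in the pseudocode, and a small guard handles the case where all unique accesses have been consumed, a crude stand-in for the support adjustment discussed next):

```python
import random
from collections import defaultdict

def profile_trace(tr):
    """Alg. 1: return unique accesses u and stack-distance distribution p."""
    s, u, p = [], [], defaultdict(float)
    for a in tr:
        d = s.index(a) + 1 if a in s else 0  # position_from_the_top (1-based)
        if d == 0:
            u.append(a)                      # first time this index is seen
        else:
            s.pop(d - 1)                     # remove_from_the_top_at_position
        p[d] += 1.0 / len(tr)
        s.insert(0, a)                       # push_to_the_top
    return u, p

def generate_trace(u, p, length):
    """Alg. 2: sample a synthetic trace with the recorded distance profile."""
    u, tr, s = list(u), [], 0
    for _ in range(length):
        support = [d for d in p if d <= s]   # distances seen so far only
        weights = [p[d] for d in support]
        d = random.choices(support, weights=weights)[0]
        if d == 0 and s < len(u):
            a = u.pop(0)                     # next unseen unique access
            s += 1
        else:
            d = max(d, 1)                    # guard once all uniques are seen
            a = u.pop(len(u) - d)            # remove_from_the_back_at_position
        u.append(a)
        tr.append(a)
    return tr

orig = [random.randint(1, 100) for _ in range(100_000)]
uniq, dist = profile_trace(orig)
synth = generate_trace(uniq, dist, len(orig))
```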

Note that we can only generate a stack distance up to s number of unique accesses we have seen so far, therefore s is used to control the support of the distribution p in Alg. 2. Given a fixed number of unique accesses, the longer input trace will result in lower probability being assigned to them in Alg. 1, which will lead to longer time to achieve full distribution support in Alg. 2. In order to address this problem, we increase the probability for the unique accesses up to a minimum threshold and adjust support to remove unique accesses from it once all have been seen. A visual comparison of probability distribution p based on original and synthetic traces is shown in Fig. 3. In our experiments original and adjusted synthetic traces produce similar cache hit/miss rates.

Figure 3: Probability distribution p based on a sample trace tr = random.uniform(1,100,100K); panels show (a) the original trace, (b) the synthetic trace, and (c) the adjusted synthetic trace

Alg. 1 and 2 were designed for more accurate cache simulations, but they illustrate a general idea of how probability distributions can be used to generate synthetic traces with desired properties.

4.3 PUBLIC

Few public data sets are available for recommendation and personalization systems. The Criteo AI Labs Ad Kaggle (https://www.kaggle.com/c/criteo-display-ad-challenge) and Terabyte (https://labs.criteo.com/2013/12/download-terabyte-click-logs/) data sets are open-sourced data sets consisting of click logs for ad CTR prediction. Each data set contains 13 continuous and 26 categorical features. Typically the continuous features are pre-processed with a simple log transform log(1+x). The categorical features are mapped to their corresponding embedding indices, with unlabeled categorical features or labels mapped to 0 or NULL.
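A minimal sketch of this preprocessing (the clipping of negative raw values and the specific index assignment are illustrative assumptions, not taken from the Criteo tooling):

```python
import numpy as np

# Continuous features get a log(1 + x) transform; categorical tokens are
# mapped to embedding indices, with missing values sent to 0.
def transform_dense(x):
    return np.log1p(np.maximum(x, 0))  # assumption: clip negative raw values

vocab = {}  # category token -> embedding index (0 reserved for missing)
def transform_categorical(token):
    if token == "":                    # unlabeled / missing category
        return 0
    return vocab.setdefault(token, len(vocab) + 1)
```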

The Criteo Ad Kaggle data set contains approximately 45 million samples over 7 days. In experiments, typically the 7th day is split into a validation and test set while the first 6 days are used as the training set. The Criteo Ad Terabyte data set is sampled over 24 days, where the 24th day is split into a validation and test set and the first 23 days is used as a training set. Note that there are an approximately equal number of samples from each day.

5 EXPERIMENTS

Figure 4: Big Basin AI platform

Let us now illustrate the performance and accuracy of DLRM. The model is implemented in PyTorch and Caffe2 frameworks and is available on GitHub (https://github.com/facebookresearch/dlrm). It uses fp32 floating point and int32 (Caffe2) / int64 (PyTorch) types for model parameters and indices, respectively. The experiments are performed on the Big Basin platform with Dual Socket Intel Xeon 6138 CPU @ 2.00GHz and eight Nvidia Tesla V100 16GB GPUs, publicly available through the Open Compute Project (https://www.opencompute.org), shown in Fig. 4.

5.1 MODEL ACCURACY ON PUBLIC DATA SETS

We evaluate the accuracy of the model on Criteo Ad Kaggle data set and compare the performance of DLRM against a Deep and Cross network (DCN) as-is without extensive tuning wang2017deep . We compare with DCN because it is one of the few models that has comprehensive results on the same data set. Notice that in this case the models are sized to accommodate the number of features present in the data set. In particular, DLRM consists of both a bottom MLP for processing dense features consisting of three hidden layers with 512, 256 and 64 nodes, respectively, and a top MLP consisting of two hidden layers with 512 and 256 nodes. On the other hand DCN consists of six cross layers and a deep network with 512 and 256 nodes. An embedding dimension of 16 is used. Note that this yields a DLRM and DCN both with approximately 540M parameters.


Figure 5: Comparison of training (solid) and validation (dashed) accuracies of DLRM and DCN with (a) SGD and (b) Adagrad optimizers

We plot both the training (solid) and validation (dashed) accuracies over a full single epoch of training for both models with SGD and Adagrad optimizers duchi2011 . No regularization is used. In this experiment, DLRM obtains slightly higher training and validation accuracy, as shown in Fig. 5. We emphasize that this is without extensive tuning of model hyperparameters.

5.2 MODEL PERFORMANCE ON A SINGLE SOCKET/DEVICE

To profile the performance of our model on a single socket device, we consider a sample model with 8 categorical features and 512 continuous features. Each categorical feature is processed through an embedding table with 1M vectors, with vector dimension 64, while the continuous features are assembled into a vector of dimension 512. Let the bottom MLP have two layers, while the top MLP has four layers. We profile this model on a data set with 2048K randomly generated samples organized into 1K mini-batches. For instance, this configuration can be achieved with the following command line arguments:

--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --arch-sparse-feature-size=64 --arch-mlp-bot=512-512-64 --arch-mlp-top=1024-1024-1024-1 --data-generation=random --mini-batch-size=2048 --num-batches=1000 --num-indices-per-lookup=100 [--use-gpu] [--enable-profiling]

Figure 6: Profiling of a sample DLRM on a single socket/device with (a) Caffe2 and (b) PyTorch

This model implementation in Caffe2 runs in around 256 seconds on the CPU and 62 seconds on the GPU, with profiling of individual operators shown in Fig. 6. As expected, the majority of time is spent performing embedding lookups and fully connected layers. On the CPU, fully connected layers take a significant portion of the computation, while on the GPU they are almost negligible.

6 CONCLUSION

In this paper, we have proposed and open-sourced a novel deep learning-based recommendation model that exploits categorical data. Although recommendation and personalization systems still drive much practical success of deep learning within industry today, these networks continue to receive little attention in the academic community. By providing a detailed description of a state-of-the-art recommendation system and its open-source implementation, we hope to draw attention to the unique challenges that this class of networks present in an accessible way for the purpose of further algorithmic experimentation, modeling, system co-design, and benchmarking.

Acknowledgments

The authors would like to acknowledge AI Systems Co-Design, Caffe2, PyTorch and AML team members for their help in reviewing this document.