Multi-Task Feature Learning for Knowledge Graph Enhanced Recommendation



Collaborative filtering often suffers from sparsity and cold start problems in real recommendation scenarios. Researchers and engineers therefore usually use side information to address these issues and improve the performance of recommender systems. In this paper, we consider knowledge graphs as the source of side information.

We propose MKR, a Multi-task feature learning approach for Knowledge graph enhanced Recommendation.

MKR is a deep end-to-end framework that utilizes a knowledge graph embedding task to assist the recommendation task. The two tasks are associated by cross & compress units, which automatically share latent features and learn high-order interactions between items in recommender systems and entities in the knowledge graph. We prove that cross & compress units have sufficient capability of polynomial approximation, and show that MKR is a generalized framework over several representative methods of recommender systems and multi-task learning.

Through extensive experiments on real-world datasets, we demonstrate that MKR achieves substantial gains in movie, book, music, and news recommendation, over state-of-the-art baselines.

MKR is also shown to be able to maintain a decent performance even if user-item interactions are sparse.




Recommender systems (RS) aim to address the information explosion and meet users' personalized interests. One of the most popular recommendation techniques is collaborative filtering (CF) (Koren et al., 2009), which utilizes users' historical interactions and makes recommendations based on their common preferences. However, CF-based methods usually suffer from the sparsity of user-item interactions and the cold start problem. Therefore, researchers propose using side information in recommender systems, including social networks (Jamali and Ester, 2010), attributes (Wang et al., 2018b), and multimedia (e.g., texts (Wang et al., 2015), images (Zhang et al., 2016)).

Knowledge graphs (KGs) are one type of side information for RS, which usually contain fruitful facts and connections about items. Recently, researchers have proposed several academic and commercial KGs, such as NELL (http://rtw.ml.cmu.edu/rtw/), DBpedia (http://wiki.dbpedia.org/), Google Knowledge Graph (https://developers.google.com/knowledge-graph/), and Microsoft Satori (https://searchengineland.com/library/bing/bing-satori). Due to its high dimensionality and heterogeneity, a KG is usually pre-processed by knowledge graph embedding (KGE) methods (Wang et al., 2018a), which embed entities and relations into low-dimensional vector spaces while preserving the KG's inherent structure.

Existing KG-aware methods

Inspired by the success of applying KG in a wide variety of tasks, researchers have recently tried to utilize KG to improve the performance of recommender systems (Yu et al., 2014; Zhao et al., 2017; Wang et al., 2018d; Wang et al., 2018c; Zhang et al., 2016). Personalized Entity Recommendation (PER) (Yu et al., 2014) and Factorization Machine with Group lasso (FMG) (Zhao et al., 2017) treat KG as a heterogeneous information network, and extract meta-path/meta-graph based latent features to represent the connectivity between users and items along different types of relation paths/graphs. It should be noted that PER and FMG rely heavily on manually designed meta-paths/meta-graphs, which limits their application in generic recommendation scenarios. Deep Knowledge-aware Network (DKN) (Wang et al., 2018d) designs a CNN framework to combine entity embeddings with word embeddings for news recommendation. However, the entity embeddings must be available before DKN is used, so DKN cannot be trained end-to-end. Another concern about DKN is that it can hardly incorporate side information other than texts. RippleNet (Wang et al., 2018c) is a memory-network-like model that propagates users' potential preferences in the KG and explores their hierarchical interests. But the importance of relations is weakly characterized in RippleNet, because the embedding matrix of a relation $ R $ can hardly be trained to capture the sense of importance in the quadratic form $ v^\top R h $ ( $ v $ and $ h $ are embedding vectors of two entities).

Collaborative Knowledge base Embedding (CKE) (Zhang et al., 2016) combines CF with structural knowledge, textual knowledge, and visual knowledge in a unified framework. However, the KGE module in CKE (i.e., TransR (Lin et al., 2015)) is more suitable for in-graph applications (such as KG completion and link prediction) rather than recommendation. In addition, the CF module and the KGE module are loosely coupled in CKE under a Bayesian framework, making the supervision from KG less obvious for recommender systems.


The proposed approach

To address the limitations of previous work, we propose MKR, a multi-task learning (MTL) approach for knowledge graph enhanced recommendation. MKR is a generic, end-to-end deep recommendation framework, which aims to utilize the KGE task to assist the recommendation task (empirically, the KGE task can also benefit from the recommendation task, as shown in the experiments section). Note that the two tasks are not mutually independent, but are highly correlated, since an item in RS may be associated with one or more entities in the KG. Therefore, an item and its corresponding entity are likely to have a similar proximity structure in RS and KG, and share similar features in low-level and non-task-specific latent feature spaces (Long et al., 2017). We will further validate the similarity in the experiments section. To model the shared features between items and entities, we design a cross & compress unit in MKR. The cross & compress unit explicitly models high-order interactions between item and entity features, and automatically controls the cross knowledge transfer for both tasks. Through cross & compress units, representations of items and entities can complement each other, helping both tasks avoid fitting noise and improve generalization. The whole framework can be trained by alternately optimizing the two tasks with different frequencies, which endows MKR with high flexibility and adaptability in real recommendation scenarios.



We probe the expressive capability of MKR and show, through theoretical analysis, that the cross&compress unit is capable of approximating sufficiently high order feature interactions between items and entities. We also show that MKR is a generalized framework over several representative methods of recommender systems and multi-task learning, including factorization machines (Rendle, 2010, 2012), deep&cross network (Wang et al., 2017a), and cross-stitch network (Misra et al., 2016). Empirically, we evaluate our method in four recommendation scenarios, i.e., movie, book, music, and news recommendations. The results demonstrate that MKR achieves substantial gains over state-of-the-art baselines in both click-through rate (CTR) prediction (e.g., 11.6% AUC improvements on average for movies) and top-K recommendation (e.g., 66.4% Recall@10 improvements on average for books). MKR can also maintain a decent performance in sparse scenarios.



It is worth noting that the problem studied in this paper can also be modelled as cross-domain recommendation (Tang et al., 2012) or transfer learning (Pan et al., 2010), since we care more about the performance of the recommendation task. However, the key observation is that although cross-domain recommendation and transfer learning have a single objective for the target domain, their loss functions still contain constraint terms for measuring the data distribution in the source domain or the similarity between the two domains. In our proposed MKR, the KGE task explicitly serves as the constraint term that provides regularization for recommender systems. We would like to emphasize that the major contribution of this paper is precisely to model the problem as multi-task learning: we go a step further than cross-domain recommendation and transfer learning by finding that the inter-task similarity is helpful not only to recommender systems but also to knowledge graph embedding, as shown in the theoretical analysis and experiment results.



Our Approach

In this section, we first formulate the knowledge graph enhanced recommendation problem, then introduce the framework of MKR and present the design of the cross & compress unit, recommendation module and KGE module in detail. We lastly discuss the learning algorithm for MKR.

Figure 1. (a) The framework of MKR. The left and right part illustrate the recommendation module and the KGE module, respectively, which are bridged by the cross&compress units. (b) Illustration of a cross&compress unit. The cross&compress unit generates a cross feature matrix from item and entity vectors by cross operation, and outputs their vectors for the next layer by compress operation.




2.1 Problem Formulation

We formulate the knowledge graph enhanced recommendation problem in this paper as follows. In a typical recommendation scenario, we have a set of $ M $ users $ \mathcal U = \{u_1, u_2, ..., u_M\} $ and a set of $ N $ items $ \mathcal V = \{v_1, v_2, ..., v_N\} $ . The user-item interaction matrix $ Y \in\mathbb R^{M \times N} $ is defined according to users' implicit feedback, where $ y_{uv} = 1 $ indicates that user $ u $ engaged with item $ v $ (for example by clicking, watching, browsing, or purchasing), and $ y_{uv} = 0 $ otherwise. Additionally, we have access to a knowledge graph $ \mathcal G $ , which is composed of entity-relation-entity triples $ (h, r, t) $ . Here $ h $ , $ r $ , and $ t $ denote the head, relation, and tail of a knowledge triple, respectively. For example, a triple whose head is Quentin Tarantino, whose tail is Pulp Fiction, and whose relation links a director to a film states the fact that Quentin Tarantino directs the film Pulp Fiction. In many recommendation scenarios, an item $ v \in\mathcal V $ may be associated with one or more entities in $ \mathcal G $ . For example, in movie recommendation, the item "Pulp Fiction" is linked with its namesake in a KG, while in news recommendation, news with the title "Trump pledges aid to Silicon Valley during tech meeting" is linked with entities "Donald Trump" and "Silicon Valley" in a KG.

Given the user-item interaction matrix $ Y $ as well as the knowledge graph $ \mathcal G $ , we aim to predict whether user $ u $ has potential interest in an item $ v $ with which the user has had no interaction before. Our goal is to learn a prediction function $ \hat y_{uv} = \mathcal F(u, v | \Theta, Y, \mathcal G) $ , where $ \hat y_{uv} $ denotes the probability that user $ u $ will engage with item $ v $ , and $ \Theta $ denotes the model parameters of function $ \mathcal F $ .
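The inputs described above can be sketched as plain Python data structures. This is only an illustration: the users, clicks, the relation name `directs`, and the name-based item-to-entity matching are all hypothetical and do not come from the paper's datasets.

```python
# Toy inputs for the problem formulation (all values are illustrative assumptions).
users = ["u1", "u2", "u3"]
items = ["Pulp Fiction", "Kill Bill", "Titanic"]
clicks = [("u1", "Pulp Fiction"), ("u1", "Kill Bill"), ("u3", "Titanic")]

user_index = {u: i for i, u in enumerate(users)}
item_index = {v: j for j, v in enumerate(items)}

# Y[u][v] = 1 if user u engaged with item v (implicit feedback), otherwise 0.
Y = [[0] * len(items) for _ in users]
for u, v in clicks:
    Y[user_index[u]][item_index[v]] = 1

# Knowledge graph G as a list of (head, relation, tail) triples.
G = [("Quentin Tarantino", "directs", "Pulp Fiction"),
     ("Quentin Tarantino", "directs", "Kill Bill")]

# S(v): entities associated with each item (matched by name in this toy example).
entities = {h for h, _, _ in G} | {t for _, _, t in G}
S = {v: [v] if v in entities else [] for v in items}

# The pairs to score with F(u, v | Theta, Y, G) are those with y_uv = 0.
candidates = [(u, v) for u in users for v in items
              if Y[user_index[u]][item_index[v]] == 0]
```

Here `Titanic` has no linked entity, which mirrors the fact that an item may be associated with zero, one, or several entities.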


2.2 Framework

The framework of MKR is illustrated in Figure 1(a).

MKR consists of three main components: recommendation module, KGE module, and cross & compress units.

(1) The recommendation module on the left takes a user and an item as input, and uses a multi-layer perceptron (MLP) and cross & compress units to extract short and dense features for the user and the item, respectively.

The extracted features are then fed into another MLP together to output the predicted probability.

(2) Similar to the left part, the KGE module in the right part also uses multiple layers to extract features from the head and relation of a knowledge triple, and outputs the representation of the predicted tail under the supervision of a score function $ f $ and the real tail.

(3) The recommendation module and the KGE module are bridged by specially designed cross & compress units.

The proposed unit can automatically learn high-order feature interactions of items in recommender systems and entities in the KG.


2.3 Cross & Compress Unit

To model feature interactions between items and entities, we design a cross & compress unit in the MKR framework. As shown in Figure 1(b), for item $ v $ and one of its associated entities $ e $ , we first construct $ d \times d $ pairwise interactions of their latent features $ v_l \in\mathbb R^d $ and $ e_l \in\mathbb R^d $ from layer $ l $ :

$$ C_l = v_l e_l^\top = \begin{bmatrix} v_l^{(1)} e_l^{(1)} & \cdots & v_l^{(1)} e_l^{(d)} \\ \vdots & & \vdots \\ v_l^{(d)} e_l^{(1)} & \cdots & v_l^{(d)} e_l^{(d)} \end{bmatrix}, $$

where $ C_l \in\mathbb R^{d \times d} $ is the cross feature matrix of layer $ l $ , and $ d $ is the dimension of hidden layers. This is called the cross operation, since each possible feature interaction $ v_l^{(i)} e_l^{(j)}, \forall (i, j) \in \{1, ..., d\}^2 $ between item $ v $ and its associated entity $ e $ is modeled explicitly in the cross feature matrix. We then output the feature vectors of items and entities for the next layer by projecting the cross feature matrix into their latent representation spaces:

$$ v_{l+1} = C_l w_l^{VV} + C_l^\top w_l^{EV} + b_l^V = v_l e_l^\top w_l^{VV} + e_l v_l^\top w_l^{EV} + b_l^V, $$

$$ e_{l+1} = C_l w_l^{VE} + C_l^\top w_l^{EE} + b_l^E = v_l e_l^\top w_l^{VE} + e_l v_l^\top w_l^{EE} + b_l^E, $$

where $ w_l^{\cdot \cdot}\in\mathbb R^d $ and $ b_l^\cdot\in\mathbb R^d $ are trainable weight and bias vectors. This is called the compress operation, since the weight vectors project the cross feature matrix from $ \mathbb R^{d \times d} $ space back to the feature spaces $ \mathbb R^d $ . Note that in the compress step, the cross feature matrix is compressed along both horizontal and vertical directions (by operating on $ C_l $ and $ C_l^\top $ ) for the sake of symmetry; we provide more insights into this design in Section 3.2. For simplicity, the cross & compress unit is denoted as:

$$ [v_{l+1}, e_{l+1}] = \mathcal C (v_l, e_l), $$

and we use a suffix $ [v] $ or $ [e] $ to distinguish its two outputs in the rest of this paper.
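A minimal sketch of one cross & compress unit in plain Python follows. It assumes the compress step has the form $ v_{l+1} = C_l w^{VV} + C_l^\top w^{EV} + b^V $ and $ e_{l+1} = C_l w^{VE} + C_l^\top w^{EE} + b^E $ ; the function name, argument names, and toy values are illustrative and are not taken from the released source code.

```python
def cross_compress(v, e, w_vv, w_ev, w_ve, w_ee, b_v, b_e):
    """One cross & compress unit on d-dimensional lists v and e (a sketch)."""
    d = len(v)
    # Cross: C[i][j] = v[i] * e[j], the d x d matrix of pairwise interactions.
    C = [[v[i] * e[j] for j in range(d)] for i in range(d)]
    # Compress: project C (horizontal) and C^T (vertical) back to d dimensions.
    v_next = [sum(C[i][j] * w_vv[j] + C[j][i] * w_ev[j] for j in range(d)) + b_v[i]
              for i in range(d)]
    e_next = [sum(C[i][j] * w_ve[j] + C[j][i] * w_ee[j] for j in range(d)) + b_e[i]
              for i in range(d)]
    return v_next, e_next


# Toy usage with d = 2, all-ones weights and zero biases (assumed values).
ones, zeros = [1.0, 1.0], [0.0, 0.0]
v_next, e_next = cross_compress([1.0, 2.0], [3.0, 4.0],
                                ones, ones, ones, ones, zeros, zeros)
```

With these toy values the cross matrix is `[[3, 4], [6, 8]]`, so both outputs combine row sums and column sums of the same matrix. In a trained model the four weight vectors differ, which is what lets the unit control how much information flows between the two tasks.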


Through cross & compress units, MKR can adaptively adjust the weights of knowledge transfer and learn the relevance between the two tasks. It should be noted that cross & compress units should only exist in low-level layers of MKR, as shown in Figure 1(a). This is because:

(1) In deep architectures, features usually transform from general to specific along the network, and feature transferability drops significantly in higher layers with increasing task dissimilarity. Therefore, sharing high-level layers risks negative transfer, especially for the heterogeneous tasks in MKR.

(2) In high-level layers of MKR, item features are mixed with user features, and entity features are mixed with relation features.

The mixed features are not suitable for sharing since they have no explicit association.




2.4 Recommendation Module

The input of the recommendation module in MKR consists of two raw feature vectors $ u $ and $ v $ that describe user $ u $ and item $ v $ , respectively. $ u $ and $ v $ can be customized as one-hot ID, attributes, bag-of-words, or their combinations, based on the application scenario. Given user $ u $ 's raw feature vector $ u $ , we use an $ L $ -layer MLP to extract the user's latent condensed feature:

$$ u_L = \mathcal M (\mathcal M (\cdots\mathcal M (u))) = \mathcal M^L (u), $$

where $ \mathcal M (x) = \sigma(W x + b) $ is a fully-connected neural network layer with weight $ W $ , bias $ b $ , and nonlinear activation function $ \sigma(\cdot) $ . We use the exponent notation $ L $ here and in the following equations for simplicity, but note that the parameters of the $ L $ layers are actually different. Exploring a more elaborate design of layers in the recommendation module is an important direction of future work.

For item $ v $ , we use $ L $ cross & compress units to extract its feature:

$$ v_L = \mathbb E_{e \sim \mathcal S(v)}\left[\mathcal C^L ({ v}, { e})[{ v}]\right], $$

where $ \mathcal S(v) $ is the set of associated entities of item $ v $ .

After obtaining user $ u $ 's latent feature $ u_L $ and item $ v $ 's latent feature $ v_L $ , we combine the two pathways by a predicting function $ f_{RS} $ , for example, the inner product or an $ H $ -layer MLP.

The final predicted probability of user $ u $ engaging with item $ v $ is:

$$ \hat y_{uv} = \sigma\big(f_{RS}({ u}_L, { v}_L) \big). $$
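The user pathway and the final prediction can be sketched as below, assuming $ f_{RS} $ is the inner product and taking the item feature $ v_L $ as already computed by the cross & compress pathway. The `tanh` activation and all names and values are assumptions for illustration; the text does not fix the activation function.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(x, W, b, act=math.tanh):
    """Fully connected layer M(x) = act(W x + b); tanh is an assumed activation."""
    return [act(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]


def predict(u_raw, v_L, layers):
    """y_hat = sigmoid(<u_L, v_L>), where u_L comes from an L-layer MLP."""
    u_L = u_raw
    for W, b in layers:  # each of the L layers has its own parameters
        u_L = dense(u_L, W, b)
    return sigmoid(sum(a * c for a, c in zip(u_L, v_L)))


# Toy usage: one identity layer and an all-zero user vector give probability 0.5.
identity_layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
y_hat = predict([0.0, 0.0], [1.0, -1.0], [identity_layer])
```

Replacing the inner product with an $ H $ -layer MLP on the concatenation of `u_L` and `v_L` would only change the last line of `predict`.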


2.5 Knowledge Graph Embedding Module

Knowledge graph embedding embeds entities and relations into continuous vector spaces while preserving their structure. Recently, researchers have proposed a great many KGE methods, including translational distance models and semantic matching models. In MKR, we propose a deep semantic matching architecture for the KGE module. Similar to the recommendation module, for a given knowledge triple $ (h, r, t) $ , we first utilize multiple cross & compress units and nonlinear layers to process the raw feature vectors of head $ h $ and relation $ r $ (including ID, types, textual description, etc.), respectively. Their latent features are then concatenated together, followed by a $ K $ -layer MLP for predicting tail $ t $ :

$$ h_L = \mathbb E_{v \sim \mathcal S(h)}\left[\mathcal C^L (v, h)[e]\right], $$

$$ r_L = \mathcal M^L (r), $$

$$ \hat t = \mathcal M^K \left( [h_L; r_L] \right), $$

where $ \mathcal S(h) $ is the set of associated items of entity $ h $ , $ [h_L; r_L] $ denotes concatenation, and $ \hat t $ is the predicted vector of tail $ t $ . Finally, the score of the triple $ (h, r, t) $ is calculated using a score (similarity) function $ f_{KG} $ :

$$ score(h, r, t) = f_{KG}({ t}, { \hat t}), $$
where $ t $ is the real feature vector of tail $ t $ . In this paper, we use the normalized inner product $ f_{KG}(t, \hat t) = \sigma(t^\top \hat t) $ as the choice of score function, but other forms of (dis)similarity metrics can also be applied here, such as Kullback–Leibler divergence.
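As a small illustration, the normalized inner product score can be sketched as follows, assuming "normalized" means passing the inner product through a sigmoid. The function name and toy vectors are hypothetical.

```python
import math


def kg_score(t, t_hat):
    """Score of a triple: sigmoid of the inner product between real and predicted tail."""
    inner = sum(a * b for a, b in zip(t, t_hat))
    return 1.0 / (1.0 + math.exp(-inner))


orthogonal = kg_score([1.0, 0.0], [0.0, 1.0])  # inner product 0
aligned = kg_score([1.0, 1.0], [2.0, 2.0])     # inner product 4
```

A predicted tail that is orthogonal to the real tail yields a neutral score, while a well-aligned prediction pushes the score towards 1.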


2.6 Learning Algorithm

The complete loss function of MKR is as follows:

$$ \mathcal L = \mathcal L_{RS} + \mathcal L_{KG} + \mathcal L_{REG} = \sum_{u \in \mathcal U, v \in \mathcal V} \mathcal J(\hat y_{uv}, y_{uv}) - \lambda_1 \Big( \sum_{(h, r, t) \in \mathcal G} score(h, r, t) - \sum_{(h', r, t') \notin \mathcal G} score(h', r, t') \Big) + \lambda_2 \| W \|_2^2. $$

In this loss, the first term measures the loss in the recommendation module, where $ u $ and $ v $ traverse the set of users and the set of items, respectively, and $ \mathcal J $ is the cross-entropy function. The second term calculates the loss in the KGE module, in which we aim to increase the score for all true triples while reducing the score for all false triples. The last term is the regularization term for preventing over-fitting, and $ \lambda_1 $ and $ \lambda_2 $ are the balancing parameters. $ \lambda_1 $ can be seen as the ratio of the learning rates of the two tasks. Note that the loss function traverses all possible user-item pairs and knowledge triples. To make computation more efficient, we use a negative sampling strategy during training.
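The three terms described above can be sketched on a sampled minibatch as follows. This assumes the loss has the form "cross-entropy on interactions, minus $ \lambda_1 $ times (true-triple scores minus false-triple scores), plus $ \lambda_2 $ times the squared L2 norm of the weights"; the function names and numbers are illustrative only.

```python
import math


def cross_entropy(y_hat, y):
    """Binary cross-entropy J(y_hat, y) for one user-item pair."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def mkr_loss(rs_pairs, true_scores, false_scores, weights, lam1, lam2):
    # Recommendation loss over sampled (prediction, label) pairs.
    loss_rs = sum(cross_entropy(y_hat, y) for y_hat, y in rs_pairs)
    # KGE loss: raise scores of true triples, lower scores of sampled false triples.
    loss_kg = -lam1 * (sum(true_scores) - sum(false_scores))
    # L2 regularization over all (flattened) weights.
    loss_reg = lam2 * sum(w * w for w in weights)
    return loss_rs + loss_kg + loss_reg


loss = mkr_loss(rs_pairs=[(0.5, 1)], true_scores=[0.9], false_scores=[0.1],
                weights=[1.0, 2.0], lam1=0.5, lam2=0.1)
```

In the alternating training procedure the recommendation term and the KGE term are optimized in separate steps rather than summed in one pass; the single function here only shows how the terms relate.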

Algorithm 1

Input: Interaction matrix Y, knowledge graph G

Output: Prediction function F(u,v|Θ,Y,G)

1: Initialize all parameters

2: for number of training iterations do

3:    // recommendation task

4:    for t steps do

5:       Sample minibatch of positive and negative interactions from Y;

6:       Sample e∼S(v) for each item v in the minibatch;

7:       Update parameters of F by gradient descent on Eq. (1)-(6), (9);

8:    end for

9:    // knowledge graph embedding task

10:   Sample minibatch of true and false triples from G;

11:   Sample v∼S(h) for each head h in the minibatch;

12:   Update parameters of F by gradient descent on Eq. (1)-(3), (7)-(9);

13: end for

The learning algorithm of MKR is presented in Algorithm 1, in which each training iteration consists of two stages: the recommendation task (the inner loop of $ t $ steps) and the KGE task (the single update that follows the inner loop). In each iteration, we repeat training on the recommendation task $ t $ times ( $ t $ is a hyper-parameter and normally $ t > 1 $ ) before training on the KGE task once, since we are more focused on improving recommendation performance. We will discuss the choice of $ t $ in the experiments section.
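The alternating schedule can be sketched as a small driver loop. The two step functions are placeholders (assumptions) standing in for "sample a minibatch and apply one gradient update" on the respective task.

```python
def train(n_iterations, t, rs_step, kge_step):
    """Alternate t recommendation updates with one KGE update per iteration."""
    for _ in range(n_iterations):
        for _ in range(t):
            rs_step()   # sample interactions and entities, update on the RS loss
        kge_step()      # sample triples and items, update on the KGE loss


# Toy usage: record the order of updates instead of training a model.
log = []
train(n_iterations=2, t=3,
      rs_step=lambda: log.append("RS"),
      kge_step=lambda: log.append("KGE"))
```

With `t = 3`, the recommendation task receives three updates for every KGE update, which is how the schedule biases training towards recommendation performance.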


Theoretical Analysis

In this section, we prove that cross & compress units have sufficient capability of polynomial approximation. We also show that MKR is a generalized framework over several representative methods of recommender systems and multi-task learning.


3.1 Polynomial Approximation

According to the Weierstrass approximation theorem, any function under certain smoothness assumptions can be approximated by a polynomial to arbitrary accuracy. Therefore, we examine the ability of the cross & compress unit to approximate high-order interactions. We show that cross & compress units can model item-entity feature interactions up to exponential degree:

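Theorem 1. Denote the input of item and entity in the MKR network as $ v = [v_1 \ \cdots \ v_d]^\top $ and $ e = [e_1 \ \cdots \ e_d]^\top $ , respectively. Then the cross term about $ v $ and $ e $ in $ \| v_L \|_1 $ and $ \| e_L \|_1 $ (the L1-norm of $ v_L $ and $ e_L $ ) with maximal degree is

$$ k_{\alpha, \beta} v_1^{\alpha_1} \cdots v_d^{\alpha_d} e_1^{\beta_1} \cdots e_d^{\beta_d}, $$

where $ k_{\alpha, \beta} \in\mathbb R $ , $ \alpha = [\alpha_1 \ \cdots \ \alpha_d]^\top $ , $ \beta = [\beta_1 \ \cdots \ \beta_d]^\top $ , $ \alpha_i, \beta_i \in\mathbb N $ for $ i \in \{1, ..., d\} $ , and $ |\alpha| = |\beta| = 2^{L-1} $ ( $ L \geq 1 $ , $ v_0 = v $ , $ e_0 = e $ ).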

In recommender systems, $ \prod_{i=1}^d v_i^{\alpha_i} e_i^{\beta_i} $ is also called a combinatorial feature, as it measures the interactions of multiple original features. Theorem 1 states that cross & compress units can automatically model the combinatorial features of items and entities up to a sufficiently high order, which demonstrates the superior approximation capacity of MKR as compared with existing work such as Wide & Deep, factorization machines, and DCN. The proof of Theorem 1 is provided in the Appendix. Note that Theorem 1 gives a theoretical view of the polynomial approximation ability of the cross & compress unit rather than providing guarantees on its actual performance. We will empirically evaluate the cross & compress unit in the experiments section.
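The exponential growth of the interaction degree can be checked numerically with a small sketch. It assumes a bias-free unit of the form $ v_{l+1} = v_l (e_l^\top w^{VV}) + e_l (v_l^\top w^{EV}) $ (and analogously for $ e_{l+1} $ ), reuses the same toy weights in every layer, and relies on the fact that a polynomial that is homogeneous of degree $ k $ in $ v $ scales by $ s^k $ when $ v $ is scaled by $ s $ . All names and values are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v, e, w):
    """Bias-free cross & compress unit; w = (w_vv, w_ev, w_ve, w_ee)."""
    d = len(v)
    v_next = [v[i] * dot(e, w[0]) + e[i] * dot(v, w[1]) for i in range(d)]
    e_next = [v[i] * dot(e, w[2]) + e[i] * dot(v, w[3]) for i in range(d)]
    return v_next, e_next


def stack(v, e, w, L):
    """Item output v_L after L stacked units."""
    for _ in range(L):
        v, e = unit(v, e, w)
    return v


w = ([0.5, 0.25], [1.0, 0.5], [0.25, 1.0], [0.5, 0.5])
v, e = [1.0, 2.0], [3.0, 1.0]

# Doubling v should scale v_L by 2 ** (2 ** (L - 1)): degree 1, 2, 4 in v.
ratios = []
for L in (1, 2, 3):
    base = stack(v, e, w, L)
    scaled = stack([2.0 * x for x in v], e, w, L)
    ratios.append(scaled[0] / base[0])
```

The degree in $ v $ doubles with every additional layer, which is the behaviour the theorem describes; with non-zero biases, lower-degree terms appear as well and the maximal degree stays the same.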


3.2 Unified View of Representative Methods

In the following we provide a unified view of several representative models in recommender systems and multi-task learning, by showing that they are restricted versions of or theoretically related to MKR. This justifies the design of cross & compress unit and conceptually explains its strong empirical performance as compared to baselines.


Factorization machines

Factorization machines are a generic method for recommender systems. Given an input feature vector, FMs model all interactions between variables in the input vector using factorized parameters, thus being able to estimate interactions in problems with huge sparsity such as recommender systems. The model equation for a 2-degree factorization machine is defined as

$$ \hat y({ x}) = w_0 + \sum\nolimits_{i=1}^d w_i x_i + \sum\nolimits_{i=1}^d \sum\nolimits_{j=i+1}^d \langle{ v}_i, { v}_j \rangle x_i x_j, $$

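where $ x_i $ is the $ i $ -th unit of input vector $ x $ , $ w_\cdot $ is a weight scalar, $ v_\cdot $ is a weight vector, and $ \langle\cdot, \cdot\rangle $ is the dot product of two vectors. We show that the essence of FM is conceptually similar to a 1-layer cross & compress unit:

Proposition 1. The L1-norm of $ v_1 $ and $ e_1 $ can be written in the following form:

$$ \| v_1 \|_1 \ (\text{or } \| e_1 \|_1) = \Big| b + \sum\nolimits_{i=1}^d \sum\nolimits_{j=1}^d \langle i, j \rangle v_i e_j \Big|, $$

where $ \langle i, j \rangle = w_i + w_j $ is the sum of two scalars.

It is interesting to notice that, instead of factorizing the weight parameter of $ x_i x_j $ into the dot product of two vectors as in FM, the cross & compress unit factorizes the weight of term $ v_i e_j $ into the sum of two scalars, which reduces the number of parameters and increases the robustness of the model.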
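For reference, the 2-degree factorization machine equation above can be evaluated directly as in the sketch below. The parameter values are made up for illustration, and the naive double loop is used for readability rather than the linear-time reformulation.

```python
def fm_predict(x, w0, w, V):
    """2-degree FM: w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j."""
    d = len(x)
    linear = sum(w[i] * x[i] for i in range(d))
    pairwise = sum(
        sum(a * b for a, b in zip(V[i], V[j])) * x[i] * x[j]
        for i in range(d) for j in range(i + 1, d)
    )
    return w0 + linear + pairwise


# Toy usage: three features, two latent factors per feature (assumed values).
y = fm_predict(x=[1.0, 1.0, 0.0], w0=0.5, w=[1.0, 2.0, 3.0],
               V=[[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
```

Only the pair of the first two features contributes to the interaction term here, because the third feature is zero.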


Deep & Cross Network

DCN learns explicit and high-order cross features by introducing the layers:

$$ x_{l+1} = { x}_0 { x}_l^\top{ w}_l + { x}_l + { b}_l, $$

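where $ x_l $ , $ w_l $ , and $ b_l $ are the representation, weight, and bias of the $ l $ -th layer. We demonstrate the link between DCN and MKR by the following proposition:

Proposition 2. In the formula of $ v_{l+1} $ in the compress step of the cross & compress unit, if we restrict $ w_l^{VV} $ in the first term to satisfy $ e_l^\top w_l^{VV} = 1 $ and restrict $ e_l $ in the second term to be $ e_0 $ (and impose similar restrictions on $ e_{l+1} $ ), the cross & compress unit is then conceptually equivalent to a DCN layer in the sense of multi-task learning:

$$ v_{l+1} = e_0 v_l^\top w_l^{EV} + v_l + b_l^V, \quad e_{l+1} = v_0 e_l^\top w_l^{VE} + e_l + b_l^E. $$

It can be proven that the polynomial approximation ability of the above DCN-equivalent version (i.e., the maximal degree of cross terms in $ v_l $ and $ e_l $ ) is $ O(l) $ , which is weaker than that of the original cross & compress units with $ O(2^l) $ approximation ability.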
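A single DCN cross layer, $ x_{l+1} = x_0 x_l^\top w_l + x_l + b_l $ , can be sketched as follows; the function name and toy values are illustrative.

```python
def dcn_cross_layer(x0, xl, w, b):
    """DCN cross layer: x_{l+1} = x0 * (xl . w) + xl + b."""
    s = sum(a * c for a, c in zip(xl, w))  # scalar xl^T w
    return [x0_i * s + xl_i + b_i for x0_i, xl_i, b_i in zip(x0, xl, b)]


# Toy usage: first layer, so xl equals x0 (assumed values).
x1 = dcn_cross_layer(x0=[1.0, 2.0], xl=[1.0, 2.0], w=[1.0, 1.0], b=[0.0, 0.0])
```

Because each layer multiplies by the fixed input `x0` once, the polynomial degree grows by one per layer, in contrast to the doubling obtained when both inputs of the cross operation come from the previous layer.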


Cross-stitch Networks

Cross-stitch networks are a multi-task learning model in convolutional networks, in which the designed cross-stitch unit can learn a combination of shared and task-specific representations between two tasks. Specifically, given two activation maps $ x_A $ and $ x_B $ from layer $ l $ for both tasks, cross-stitch networks learn linear combinations $ \tilde x_A $ and $ \tilde x_B $ of both input activations and feed these combinations as input to the next layers' filters. The formula at location $ (i, j) $ in the activation map is

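$$ \begin{bmatrix} \tilde x_A^{ij} \\ \tilde x_B^{ij} \end{bmatrix} = \begin{bmatrix} \alpha_{AA} & \alpha_{AB} \\ \alpha_{BA} & \alpha_{BB} \end{bmatrix} \begin{bmatrix} x_A^{ij} \\ x_B^{ij} \end{bmatrix}, $$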

where the $ \alpha $ 's are trainable transfer weights of representations between task A and task B. We show that the cross-stitch unit in Eq. (csn) is a simplified version of our cross & compress unit by the following proposition.

The transfer matrix in Eq. (cross-stitch) serves as the cross-stitch unit $ [\alpha_{AA}\ \alpha_{AB};\ \alpha_{BA}\ \alpha_{BB}] $ in Eq. (csn). Like cross-stitch networks, the MKR network can make certain layers task-specific by setting $ { v}_l^\top{ w}_l^{EV} $ ( $ \alpha_{AB} $ ) or $ { e}_l^\top{ w}_l^{VE} $ ( $ \alpha_{BA} $ ) to zero, or choose a more shared representation by assigning a higher value to them. But the transfer matrix is more fine-grained in the cross & compress unit, because the transfer weights are changed from scalars to dot products of two vectors. It is rather interesting to notice that Eq. (cross-stitch) can also be regarded as an attention mechanism, as the computation of transfer weights involves the feature vectors $ { v}_l $ and $ { e}_l $ themselves.
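The contrast between fixed scalar transfer weights and dot-product transfer weights can be illustrated with a small sketch. The off-diagonal entries follow the text ( $ \alpha_{AB} = { v}_l^\top{ w}_l^{EV} $ and $ \alpha_{BA} = { e}_l^\top{ w}_l^{VE} $ ); the diagonal entries, the function names, and the toy weight values are assumptions made for this example, and biases are omitted.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_stitch(x_a, x_b, alpha):
    """Apply a 2x2 transfer matrix [[aa, ab], [ba, bb]] to two feature vectors."""
    (aa, ab), (ba, bb) = alpha
    new_a = [aa * p + ab * q for p, q in zip(x_a, x_b)]
    new_b = [ba * p + bb * q for p, q in zip(x_a, x_b)]
    return new_a, new_b


def dot_product_transfer(v, e, w_vv, w_ev, w_ve, w_ee):
    """Transfer matrix whose entries are dot products of features and weights.

    Off-diagonal entries follow the text; the diagonal entries are an
    assumption made for this sketch.
    """
    return [[dot(e, w_vv), dot(v, w_ev)],
            [dot(e, w_ve), dot(v, w_ee)]]


v = [1.0, 2.0]   # item features
e = [0.5, -1.0]  # entity features

# Cross-stitch style: four fixed scalars, independent of the inputs.
fixed = cross_stitch(v, e, [[0.9, 0.1], [0.2, 0.8]])

# Dot-product style: the weights depend on v and e themselves.
# w_ev is all zeros, so alpha_AB = 0 and the item side stays task-specific.
alpha = dot_product_transfer(v, e,
                             w_vv=[1.0, 0.0], w_ev=[0.0, 0.0],
                             w_ve=[0.0, 1.0], w_ee=[0.5, 0.5])
adaptive = cross_stitch(v, e, alpha)
```

Because the entries of `alpha` are recomputed from `v` and `e` for every input, the amount of knowledge transferred varies per example, whereas the four scalars of the cross-stitch unit are shared by all inputs.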



In this section, we evaluate the performance of MKR in four real-world recommendation scenarios: movie, book, music, and news. The source code is available at https://github.com/hwwang55/MKR.

| Dataset | # users | # items | # interactions | # KG triples | Hyper-parameters |
| --- | --- | --- | --- | --- | --- |
| MovieLens-1M | 6,036 | 2,347 | 753,772 | 20,195 | $ L = 1 $ , $ d = 8 $ , $ t = 3 $ , $ \lambda_1 = 0.5 $ |
| Book-Crossing | 17,860 | 14,910 | 139,746 | 19,793 | $ L = 1 $ , $ d = 8 $ , $ t = 2 $ , $ \lambda_1 = 0.1 $ |
| Last.FM | 1,872 | 3,846 | 42,346 | 15,518 | $ L = 2 $ , $ d = 4 $ , $ t = 2 $ , $ \lambda_1 = 0.1 $ |
| Bing-News | 141,487 | 535,145 | 1,025,192 | 1,545,217 | $ L = 3 $ , $ d = 16 $ , $ t = 5 $ , $ \lambda_1 = 0.2 $ |




We utilize the following four datasets in our experiments:

  • MovieLens-1M (https://grouplens.org/datasets/movielens/1m/) is a widely used benchmark dataset in movie recommendation, which consists of approximately 1 million explicit ratings (ranging from 1 to 5) on the MovieLens website.
  • Book-Crossing (http://www2.informatik.uni-freiburg.de/cziegler/BX/) contains 1,149,780 explicit ratings (ranging from 0 to 10) of books in the Book-Crossing community.
  • Last.FM (https://grouplens.org/datasets/hetrec-2011/) contains musician listening information from a set of 2 thousand users of the Last.fm online music system.
  • Bing-News contains 1,025,192 pieces of implicit feedback collected from the server logs of Bing News (https://www.bing.com/news) from October 16, 2016 to August 11, 2017. Each piece of news has a title and a snippet.

Since MovieLens-1M, Book-Crossing, and Last.FM are explicit feedback data (Last.FM provides the listening count as the weight of each user-item interaction), we transform them into implicit feedback: each entry is marked with 1, indicating that the user has rated the item positively, and for each user we sample a set of unwatched items marked with 0. The threshold for a positive rating is 4 for MovieLens-1M, while no threshold is set for Book-Crossing and Last.FM due to their sparsity.

We use Microsoft Satori to construct the KG for each dataset. We first select a subset of triples from the whole KG with a confidence level greater than 0.9. For MovieLens-1M and Book-Crossing, we additionally select the subset of triples from the sub-KG whose relation name contains "film" or "book", respectively, to further reduce the KG size. Given the sub-KGs, for MovieLens-1M, Book-Crossing, and Last.FM, we collect the IDs of all valid movies, books, or musicians by matching their names with the tail of triples (), , or , respectively. For simplicity, items with no matched entity or with multiple matched entities are excluded. We then match the IDs with the head and tail of all KG triples and select all well-matched triples from the sub-KG. The construction process is similar for Bing-News except that (1) we use entity linking tools to extract entities in news titles, and (2) we do not impose restrictions on the names of relations, since the entities in news titles are not within one particular domain.

The basic statistics of the four datasets are presented in Table statistics. Note that the numbers of users, items, and interactions are smaller than in the original datasets since we filtered out items with no corresponding entity in the KG.
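The conversion from explicit to implicit feedback described above could look roughly like the following sketch. It is not the authors' preprocessing script: the function name `to_implicit`, the choice of "rating >= threshold" as the positive test, and the decision to sample as many negatives as positives per user are all assumptions made for illustration.

```python
import random


def to_implicit(ratings, threshold=None, seed=0):
    """Turn (user, item, rating) triples into (user, item, label) samples.

    threshold=None keeps every interaction as positive (as for the sparse
    datasets); otherwise only ratings >= threshold count as positive.
    """
    rng = random.Random(seed)
    all_items = sorted({item for _, item, _ in ratings})
    positives = {}  # user -> items rated positively
    seen = {}       # user -> every item the user interacted with
    for user, item, rating in ratings:
        seen.setdefault(user, set()).add(item)
        if threshold is None or rating >= threshold:
            positives.setdefault(user, set()).add(item)

    samples = []
    for user, pos in positives.items():
        # Negatives come only from items the user never interacted with.
        unseen = [i for i in all_items if i not in seen[user]]
        # Assumption: one negative per positive, capped by availability.
        negatives = rng.sample(unseen, min(len(pos), len(unseen)))
        samples += [(user, i, 1) for i in sorted(pos)]
        samples += [(user, i, 0) for i in negatives]
    return samples


ratings = [("u1", "a", 5), ("u1", "b", 2), ("u1", "c", 4),
           ("u2", "a", 3), ("u2", "d", 5)]
samples = to_implicit(ratings, threshold=4)
```

With `threshold=4`, the low rating of `u1` on item `b` is neither a positive nor a candidate negative, since the user has already interacted with it.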





We compare our proposed MKR with the following baselines. Unless otherwise specified, the hyper-parameter settings of baselines are the same as reported in their original papers or as default in their codes.

  • PER treats the KG as a heterogeneous information network and extracts meta-path based features to represent the connectivity between users and items. In this paper, we use manually designed user-item-attribute-item paths as features, i.e., "user-movie-director-movie", "user-movie-genre-movie", and "user-movie-star-movie" for MovieLens-1M; "user-book-author-book" and "user-book-genre-book" for Book-Crossing; "user-musician-genre-musician", "user-musician-country-musician", and "user-musician-age-musician" (age is discretized) for Last.FM. Note that PER cannot be applied to news recommendation because it is hard to pre-define meta-paths for entities in news.
  • CKE combines CF with structural, textual, and visual knowledge in a unified framework for recommendation. We implement CKE as CF plus the structural knowledge module in this paper. The dimensions of user and item embeddings for the four datasets are set to 64, 128, 32, and 64, respectively. The dimension of entity embeddings is 32.
  • DKN treats entity embedding and word embedding as multiple channels and combines them in a CNN for CTR prediction. In this paper, we use movie/book names and news titles as textual input for DKN. The dimension of word embedding and entity embedding is 64, and the number of filters is 128 for each window size 1, 2, 3.
  • RippleNet is a memory-network-like approach that propagates users' preferences on the knowledge graph for recommendation. The hyper-parameter settings for Last.FM are $ d=8 $ , $ H=2 $ , $ \lambda_1 = 10^{-6} $ , $ \lambda_2=0.01 $ , $ \eta=0.02 $ .
  • LibFM is a widely used feature-based factorization model. We concatenate the raw features of users and items as well as the corresponding averaged entity embeddings learned from TransR as input for LibFM. The dimension is {1, 1, 8} and the number of training epochs is 50. The dimension of TransR is 32.
  • Wide & Deep is a deep recommendation model combining a (wide) linear channel with a (deep) nonlinear channel. The input for Wide & Deep is the same as in LibFM. The dimension of user, item, and entity is 64, and we use a two-layer deep channel with dimensions of 100 and 50 as well as a wide channel.



| Model | MovieLens-1M AUC | MovieLens-1M ACC | Book-Crossing AUC | Book-Crossing ACC | Last.FM AUC | Last.FM ACC | Bing-News AUC | Bing-News ACC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PER | 0.710 (-22.6%) | 0.664 (-21.2%) | 0.623 (-15.1%) | 0.588 (-16.7%) | 0.633 (-20.6%) | 0.596 (-20.7%) | - | - |
| CKE | 0.801 (-12.6%) | 0.742 (-12.0%) | 0.671 (-8.6%) | 0.633 (-10.3%) | 0.744 (-6.6%) | 0.673 (-10.5%) | 0.553 (-19.7%) | 0.516 (-20.0%) |
| DKN | 0.655 (-28.6%) | 0.589 (-30.1%) | 0.622 (-15.3%) | 0.598 (-15.3%) | 0.602 (-24.5%) | 0.581 (-22.7%) | 0.667 (-3.2%) | 0.610 (-5.4%) |
| RippleNet | 0.920 (+0.3%) | 0.842 (-0.1%) | 0.729 (-0.7%) | 0.662 (-6.2%) | 0.768 (-3.6%) | 0.691 (-8.1%) | 0.678 (-1.6%) | 0.630 (-2.3%) |
| LibFM | 0.892 (-2.7%) | 0.812 (-3.7%) | 0.685 (-6.7%) | 0.640 (-9.3%) | 0.777 (-2.5%) | 0.709 (-5.7%) | 0.640 (-7.1%) | 0.591 (-8.4%) |
| Wide & Deep | 0.898 (-2.1%) | 0.820 (-2.7%) | 0.712 (-3.0%) | 0.624 (-11.6%) | 0.756 (-5.1%) | 0.688 (-8.5%) | 0.651 (-5.5%) | 0.597 (-7.4%) |
| MKR | 0.917 | 0.843 | 0.734 | 0.704 | 0.797 | 0.752 | 0.689 | 0.645 |
| MKR-1L | - | - | - | - | 0.795 (-0.3%) | 0.749 (-0.4%) | 0.680 (-1.3%) | 0.631 (-2.2%) |
| MKR-DCN | 0.883 (-3.7%) | 0.802 (-4.9%) | 0.705 (-4.3%) | 0.676 (-4.2%) | 0.778 (-2.4%) | 0.730 (-2.9%) | 0.671 (-2.6%) | 0.614 (-4.8%) |
| MKR-stitch | 0.905 (-1.3%) | 0.830 (-1.5%) | 0.721 (-2.2%) | 0.682 (-3.4%) | 0.772 (-3.1%) | 0.725 (-3.6%) | 0.674 (-2.2%) | 0.621 (-3.7%) |



Experiments setup

In MKR, we set the number of high-level layers $ K = 1 $ , $ f_{RS} $ as the inner product, and $ \lambda_2 = 10^{-6} $ for all four datasets; the other hyper-parameters are given in Table statistics. The settings of hyper-parameters are determined by optimizing $ AUC $ on a validation set. For each dataset, the ratio of training, validation, and test set is $ 6 : 2 : 2 $ . Each experiment is repeated $ 3 $ times, and the average performance is reported. We evaluate our method in two experiment scenarios: (1) In click-through rate (CTR) prediction, we apply the trained model to each interaction in the test set and output the predicted click probability. We use $ AUC $ and $ Accuracy $ to evaluate the performance of CTR prediction. (2) In top- $ K $ recommendation, we use the trained model to select the $ K $ items with the highest predicted click probability for each user in the test set, and choose $ Precision@K $ and $ Recall@K $ to evaluate the recommended sets.
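The top- $ K $ metrics mentioned above can be computed per user as in the following minimal sketch. The function name `precision_recall_at_k`, the dictionary-based inputs, and the toy scores are assumptions made for the example; how candidate items are chosen (for instance, whether training items are excluded) is not specified here.

```python
def precision_recall_at_k(scores, relevant, k):
    """Precision@K and Recall@K for one user.

    scores:   dict mapping item -> predicted click probability
    relevant: set of items the user actually interacted with in the test set
    """
    # Rank candidate items by predicted probability and keep the top K.
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


scores = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}
relevant = {"a", "c", "d"}
precision, recall = precision_recall_at_k(scores, relevant, k=2)
```

In this toy case the top-2 list is `a`, `b`, of which only `a` is relevant, so precision is 1/2 and recall is 1/3. The reported figures would then be averages of these per-user values.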











Empirical study

We conduct an empirical study to investigate the correlation between items in RS and their corresponding entities in KG. Specifically, we aim to reveal how the number of common neighbors of an item pair in KG changes with their number of common raters in RS. To this end, we first randomly sample 1 million item pairs from MovieLens-1M. We then classify each pair into 5 categories based on the number of their common raters in RS, and compute the average number of common neighbors in KG for each category. The result is presented in Figure case_study_1, which clearly shows that item pairs with more common raters in RS tend to have more common neighbors in KG. Figure case_study_2 shows the same positive correlation from the opposite direction. The above findings empirically demonstrate that items have a similar proximity structure in KG and RS, thus the cross knowledge transfer of items benefits both the recommendation and KGE tasks in MKR.





Comparison with baselines

The results of all methods in CTR prediction and top- $ K $ recommendation are presented in Table ctr and Figures precision and recall, respectively. We have the following observations:

  • PER performs poorly on movie, book, and music recommendation because the user-defined meta-paths can hardly be optimal in reality. Moreover, PER cannot be applied to news recommendation.
  • CKE performs better in movie, book, and music recommendation than in news recommendation. This may be because MovieLens-1M, Book-Crossing, and Last.FM are much denser than Bing-News, which is more favorable for the collaborative filtering part of CKE.
  • DKN performs best in news recommendation compared with other baselines, but performs worst in the other scenarios. This is because movie, book, and musician names are too short and ambiguous to provide useful information.
  • RippleNet performs best among all baselines, and even outperforms MKR on MovieLens-1M. This demonstrates that RippleNet can precisely capture user interests, especially when user-item interactions are dense. However, RippleNet is more sensitive to the density of datasets, as it performs worse than MKR on Book-Crossing, Last.FM, and Bing-News. We further study their performance in the section on results in sparse scenarios.
  • In general, our MKR performs best among all methods on the four datasets. Specifically, MKR achieves average $ Accuracy $ gains of 11.6%, 11.5%, 12.7%, and 8.7% in movie, book, music, and news recommendation, respectively, which demonstrates the efficacy of the multi-task learning framework in MKR. Note that the top- $ K $ metrics are much lower for Bing-News because the number of news items is significantly larger than the number of movies, books, and musicians.
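As a sanity check on the reported average gains, the following sketch recomputes the movie figure from the Accuracy values in Table ctr. It assumes that each bracketed percentage is the relative gap to MKR and that the "average gain" is the mean of those gaps over the baselines; this reading reproduces the 11.6% figure, but it is an interpretation rather than a stated definition.

```python
# Accuracy on MovieLens-1M, copied from Table ctr.
baseline_acc = {
    "PER": 0.664,
    "CKE": 0.742,
    "DKN": 0.589,
    "RippleNet": 0.842,
    "LibFM": 0.812,
    "Wide & Deep": 0.820,
}
mkr_acc = 0.843

# Relative gap of each baseline to MKR, in percent (negative = worse than MKR).
gaps = {name: (acc - mkr_acc) / mkr_acc * 100 for name, acc in baseline_acc.items()}

# Average gain of MKR over the baselines (assumed to be the mean absolute gap).
avg_gain = -sum(gaps.values()) / len(gaps)
```

Under these assumptions, `round(avg_gain, 1)` gives 11.6 and `round(gaps["PER"], 1)` gives -21.2, matching the table entry for PER.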

| Model | $ r $ = 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PER | 0.598 | 0.607 | 0.621 | 0.638 | 0.647 | 0.662 | 0.675 | 0.688 | 0.697 | 0.710 |
| CKE | 0.674 | 0.692 | 0.705 | 0.716 | 0.739 | 0.754 | 0.768 | 0.775 | 0.797 | 0.801 |
| DKN | 0.579 | 0.582 | 0.589 | 0.601 | 0.612 | 0.620 | 0.631 | 0.638 | 0.646 | 0.655 |
| RippleNet | 0.843 | 0.851 | 0.859 | 0.862 | 0.870 | 0.878 | 0.890 | 0.901 | 0.912 | 0.920 |
| LibFM | 0.801 | 0.810 | 0.816 | 0.829 | 0.837 | 0.850 | 0.864 | 0.875 | 0.886 | 0.892 |
| Wide & Deep | 0.788 | 0.802 | 0.809 | 0.815 | 0.821 | 0.840 | 0.858 | 0.876 | 0.884 | 0.898 |
| MKR | 0.868 | 0.874 | 0.881 | 0.882 | 0.889 | 0.897 | 0.903 | 0.908 | 0.913 | 0.917 |



Comparison with MKR variants

We further compare MKR with its three variants to demonstrate the efficacy of the cross & compress unit:

  • MKR-1L is MKR with one layer of cross & compress unit, which corresponds to the FM model according to Proposition 1. Note that MKR-1L is actually MKR itself in the experiments for MovieLens-1M and Book-Crossing, where $ L = 1 $ .
  • MKR-DCN is a variant of MKR based on Eq. (dcn), which corresponds to the DCN model.
  • MKR-stitch is another variant of MKR corresponding to the cross-stitch network, in which the transfer weights in Eq. (cross-stitch) are replaced by four trainable scalars.

From Table ctr we observe that MKR outperforms MKR-1L and MKR-DCN, which shows that modeling high-order interactions between item and entity features is helpful for maintaining decent performance. MKR also achieves better scores than MKR-stitch. This validates the efficacy of fine-grained control over knowledge transfer in MKR compared with the simple cross-stitch units.




| Dataset | KGE | KGE + RS |
| --- | --- | --- |
| MovieLens-1M | 0.319 | 0.302 |
| Book-Crossing | 0.596 | 0.558 |
| Last.FM | 0.480 | 0.471 |
| Bing-News | 0.488 | 0.459 |


Results in sparse scenarios

One major goal of using a knowledge graph in MKR is to alleviate the sparsity and cold start problems of recommender systems. To investigate the efficacy of the KGE module in sparse scenarios, we vary the ratio of the training set of MovieLens-1M from $ 100\% $ to $ 10\% $ (while the validation and test sets are kept fixed), and report the $ AUC $ in CTR prediction for all methods. The results are shown in Table sparse. We observe that the performance of all methods deteriorates as the training set shrinks. When $ r=10\% $ , the $ AUC $ score decreases by $ 15.8\% $ , $ 15.9\% $ , $ 11.6\% $ , $ 8.4\% $ , $ 10.2\% $ , and $ 12.2\% $ for PER, CKE, DKN, RippleNet, LibFM, and Wide & Deep, respectively, compared with the case when the full training set is used ( $ r=100\% $ ). In contrast, the $ AUC $ score of MKR only decreases by $ 5.3\% $ , which demonstrates that MKR can still maintain decent performance even when user-item interactions are sparse. We also notice that MKR performs better than RippleNet in sparse scenarios, which accords with our observation in the comparison with baselines that RippleNet is more sensitive to the density of user-item interactions.
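The percentage drops quoted above follow directly from the first and last columns of Table sparse. The short sketch below recomputes them as relative decreases with respect to the full-data AUC; the helper name `relative_drop` is an assumption made for the example.

```python
# AUC at r = 10% and r = 100% of the MovieLens-1M training set (Table sparse).
auc = {
    "PER": (0.598, 0.710),
    "CKE": (0.674, 0.801),
    "DKN": (0.579, 0.655),
    "RippleNet": (0.843, 0.920),
    "LibFM": (0.801, 0.892),
    "Wide & Deep": (0.788, 0.898),
    "MKR": (0.868, 0.917),
}


def relative_drop(sparse, full):
    """Percentage by which the score falls from the full to the sparse setting."""
    return (full - sparse) / full * 100


drops = {model: round(relative_drop(lo, hi), 1) for model, (lo, hi) in auc.items()}
```

The resulting values are 15.8, 15.9, 11.6, 8.4, 10.2, and 12.2 for the six baselines and 5.3 for MKR, in agreement with the text.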



Results on KGE side

Although the goal of MKR is to utilize KG to assist with recommendation, it is still interesting to investigate whether the RS task benefits the KGE task, since the principle of multi-task learning is to leverage shared information to help improve the performance of all tasks. We present the $ RMSE $ (root mean square error) between predicted and real vectors of tails in the KGE task in Table kge. We find that the existence of the RS module indeed reduces the prediction error by $ 1.9\% \sim 6.4\% $ . The results show that the cross & compress units are able to learn general and shared features that mutually benefit both sides of MKR.
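For concreteness, here is a small sketch of an RMSE between predicted and real tail vectors, together with the relative error reductions implied by Table kge. Averaging the squared error over all vector components is an assumption made for this example; the exact normalization used in the experiments is not stated in the text.

```python
import math


def rmse(predicted, real):
    """Root mean square error over all components of paired vectors."""
    squared = [(p - r) ** 2
               for p_vec, r_vec in zip(predicted, real)
               for p, r in zip(p_vec, r_vec)]
    return math.sqrt(sum(squared) / len(squared))


# Toy check: one component is off by 2, the other three are exact.
toy_error = rmse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 6.0]])

# (KGE only, KGE + RS) RMSE values copied from Table kge.
kge_rmse = {
    "MovieLens-1M": (0.319, 0.302),
    "Book-Crossing": (0.596, 0.558),
    "Last.FM": (0.480, 0.471),
    "Bing-News": (0.488, 0.459),
}
reduction = {name: round((alone - joint) / alone * 100, 1)
             for name, (alone, joint) in kge_rmse.items()}
```

The smallest and largest entries of `reduction` are 1.9 (Last.FM) and 6.4 (Book-Crossing), which matches the range quoted above.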



Parameter Sensitivity


Impact of KG size

We vary the size of the KG to further investigate the efficacy of KG usage. The results of $ AUC $ on Bing-News are plotted in Figure ratio. Specifically, the $ AUC $ and $ Accuracy $ are enhanced by $ 13.6\% $ and $ 11.8\% $ , respectively, as the KG ratio increases from $ 0.1 $ to $ 1.0 $ in three scenarios. This is because the Bing-News dataset is extremely sparse, making the effect of KG usage rather obvious.



Impact of RS training frequency

We investigate the influence of the parameter $ t $ in MKR by varying $ t $ from 1 to 10, while keeping other parameters fixed. The results are presented in Figure t. We observe that MKR achieves the best performance when $ t = 5 $ . This is because too high a training frequency of the KGE module will mislead the objective function of MKR, while too low a training frequency of KGE cannot make full use of the knowledge transferred from the KG.
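One way to picture the role of $ t $ is an alternating schedule in which the recommendation task is trained $ t $ times for every single KGE training pass. The sketch below assumes exactly that reading of "RS training frequency"; `alternating_schedule`, `rs_step`, and `kge_step` are hypothetical names, and the step functions are stand-ins for real optimizer updates.

```python
def alternating_schedule(num_epochs, t, rs_step, kge_step):
    """Run t recommendation passes, then one KGE pass, in every epoch."""
    for _ in range(num_epochs):
        for _ in range(t):
            rs_step()   # update on user-item interactions
        kge_step()      # update on knowledge graph triples


# Record the order of calls instead of training a real model.
calls = []
alternating_schedule(
    num_epochs=2,
    t=3,
    rs_step=lambda: calls.append("RS"),
    kge_step=lambda: calls.append("KGE"),
)
```

A larger `t` means the KGE task is visited relatively less often, so under this reading the trade-off discussed above is between letting the KGE objective dominate (small `t`) and under-using the KG (large `t`).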



Impact of embedding dimension

We also show how the dimension of users, items, and entities affects the performance of MKR in Figure dim. We find that the performance initially improves as the dimension increases, because more bits in the embedding layer can encode more useful information. However, the performance drops when the dimension increases further, as too many dimensions may introduce noise that misleads the subsequent prediction.




Knowledge Graph Embedding

The KGE module in MKR connects to a large body of work on KGE methods. KGE is used to embed entities and relations in a knowledge graph into low-dimensional vector spaces while still preserving the structural information.

KGE methods can be classified into the following two categories:

(1) Translational distance models exploit distance-based scoring functions when learning representations of entities and relations, such as TransE, TransH, and TransR;

(2) Semantic matching models measure the plausibility of knowledge triples by matching latent semantics of entities and relations, such as RESCAL, ANALOGY, and HolE.
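The two families can be contrasted with toy scoring functions. The sketch below uses the TransE idea that a true triple should satisfy head + relation ≈ tail (so a small distance means a plausible triple), and a RESCAL-style bilinear form $ h^\top M_r t $ (so a large score means a plausible triple). The function names and toy vectors are assumptions made for illustration, and training and negative sampling are left out.

```python
import math


def transe_distance(h, r, t):
    """Translational distance ||h + r - t||_2; lower means more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def bilinear_score(h, m_r, t):
    """Semantic matching score h^T M_r t; higher means more plausible."""
    return sum(h[i] * m_r[i][j] * t[j]
               for i in range(len(h))
               for j in range(len(t)))


# Toy embeddings: the tail is exactly head + relation, so the distance is 0.
perfect = transe_distance([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
far = transe_distance([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])

# With an identity relation matrix the bilinear score is a plain dot product.
match = bilinear_score([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [3.0, 4.0])
```

Translational models score a triple by a distance in the embedding space, while semantic matching models score it by a similarity between the head and tail that is parameterized by the relation.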

Recently, researchers have also proposed incorporating auxiliary information, such as entity types, logic rules, and textual descriptions, to assist KGE.

The above KGE methods can also be incorporated into MKR as the implementation of the KGE module, but note that the cross & compress unit in MKR needs to be redesigned accordingly.

Exploring other designs of KGE module as well as the corresponding bridging unit is also an important direction of future work.



Multi-Task Learning

Multi-task learning (MTL) is a learning paradigm in machine learning whose aim is to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks.

All of the learning tasks are assumed to be related to each other, and it is found that learning these tasks jointly can lead to performance improvement compared with learning them individually. In general, MTL algorithms can be classified into several categories, including the feature learning approach, low-rank approach, task clustering approach, task relation learning approach, and decomposition approach.

For example, the cross-stitch network determines the inputs of hidden layers in different tasks by a knowledge transfer matrix; Zhou et al. aim to cluster tasks by identifying representative tasks which are a subset of the given $ m $ tasks, i.e., if task $ T_i $ is selected by task $ T_j $ as a representative task, then it is expected that model parameters for $ T_j $ are similar to those of $ T_i $ .

MTL can also be combined with other learning paradigms to further improve the performance of learning tasks, including semi-supervised learning, active learning, unsupervised learning, and reinforcement learning.

Our work can be seen as an asymmetric multi-task learning framework, in which we aim to utilize the connection between RS and KG to help improve their performance, and the two tasks are trained with different frequencies.



Deep Recommender Systems

Recently, deep learning has been revolutionizing recommender systems and achieves better performance in many recommendation scenarios.

Roughly speaking, deep recommender systems can be classified into two categories:

(1) Using deep neural networks to process the raw features of users or items. For example, Collaborative Deep Learning designs autoencoders to extract short and dense features from textual input and feeds the features into a collaborative filtering module; DeepFM combines factorization machines for recommendation and deep learning for feature learning in a neural network architecture.

(2) Using deep neural networks to model the interaction among users and items. For example, Neural Collaborative Filtering replaces the inner product with a neural architecture to model the user-item interaction.

The major difference between these methods and ours is that MKR deploys a multi-task learning framework that utilizes the knowledge from a KG to assist recommendation.



Conclusions and Future Work

This paper proposes MKR, a multi-task learning approach for knowledge graph enhanced recommendation. MKR is a deep and end-to-end framework that consists of two parts: the recommendation module and the KGE module. Both modules adopt multiple nonlinear layers to extract latent features from inputs and fit the complicated interactions of user-item and head-relation pairs.

Since the two tasks are not independent but connected by items and entities, we design a cross & compress unit in MKR to associate the two tasks, which can automatically learn high-order interactions of item and entity features and transfer knowledge between the two tasks.

We conduct extensive experiments in four recommendation scenarios. The results demonstrate the significant superiority of MKR over strong baselines and the efficacy of the usage of KG.

For future work, we plan to investigate other types of neural networks (such as CNN) in the MKR framework. We will also incorporate other KGE methods as the implementation of the KGE module in MKR by redesigning the cross & compress unit.

We omit the proofs for Proposition 2 and Proposition 3 as they are straightforward.

