Deep Embedded K-Means Clustering
Abstract—Recently, deep clustering methods have gained momentum because of the high representational power of deep neural networks (DNNs) such as autoencoders. The key idea is that representation learning and clustering can reinforce each other: good representations lead to good clustering, while good clustering provides good supervisory signals for representation learning. Critical questions include: 1) How should representation learning and clustering be optimized? 2) Should the reconstruction loss of the autoencoder always be considered? In this paper, we propose DEKM (Deep Embedded K-Means) to answer these two questions. Since the embedding space generated by the autoencoder may have no obvious cluster structures, we propose to further transform the embedding space into a new space that reveals the cluster-structure information. This is achieved by an orthonormal transformation matrix, which contains the eigenvectors of the within-class scatter matrix of K-means. The eigenvalues indicate the importance of the eigenvectors' contributions to the cluster-structure information in the new space. Our goal is to increase the cluster-structure information. To this end, we discard the decoder and propose a greedy method to optimize the representation. Representation learning and clustering are alternately optimized by DEKM. Experimental results on real-world datasets demonstrate that DEKM achieves state-of-the-art performance.
I. INTRODUCTION
Clustering, an essential data exploratory analysis tool, has been widely studied. Numerous clustering methods have been proposed, e.g., K-means [24], Gaussian Mixture Models [3], and spectral clustering [31, 36, 40, 45]. These traditional clustering methods have achieved promising results. However, when the dimension of the data is very high, they become inefficient and ineffective because of the notorious curse of dimensionality.
To deal with this problem, people usually perform dimensionality reduction before clustering. The commonly used dimensionality reduction methods are Principal Component Analysis (PCA) [32], Canonical Correlation Analysis (CCA) [1], and Nonnegative Matrix Factorization (NMF) [21]. But all these methods are linear and shallow models, whose representational abilities are limited. They cannot capture the complex and non-linear relations hidden in the data. Benefiting from the high representational power of deep neural networks (DNNs), the autoencoder has been widely used as a dimensionality reduction method for clustering in recent years. An autoencoder can learn meaningful representations of the input data in an unsupervised manner. It consists of an encoder and a decoder. The encoder transforms the input data into a lower-dimensional space (the embedding space), and the decoder is responsible for reconstructing the input data from that embedding space.
Fig. 1. Visualization of the representation learning and clustering by DEKM on a subset of the MNIST dataset. The color of the data points marks the ground truth, and the color of the background block marks the clustering results. Data points falling into the same block belong to the same cluster. Cluster centroids are marked by black crosses. The black line shows the direction of the last eigenvector of the within-class scatter matrix of K-means. In this direction, the cluster-structure information is the lowest. We would like to optimize the representation to increase the cluster-structure information in this direction. We use a mini-batch updating strategy. Numbers in parentheses denote ACC and NMI, respectively.
A plethora of clustering methods [6, 8, 10, 11, 23, 35, 41, 43, 44] based on autoencoders have been proposed. One of the most critical problems in deep clustering is how to design a proper clustering loss function. Methods such as DEC [41] and IDEC [10] minimize the Kullback-Leibler (KL) divergence between the cluster distribution and an auxiliary target distribution. The basic idea is to refine the clusters by learning from their high-confidence assignments. However, it is hard to choose an auxiliary target distribution that yields better clustering results. Other methods such as DCN [43] and DKM [6] combine the objective of K-means with that of the autoencoder and jointly optimize them. However, the discrimination of clusters in the embedding space is not directly related to the reconstruction loss of the autoencoder. Thus, in this paper, after using an autoencoder to generate an embedding space, we discard the decoder and no longer optimize the reconstruction loss, because no matter what we do to the embedding space, we can separately train the decoder to reconstruct the input data from the embeddings.
Considering the simplicity of K-means, we extend it to a deep version, called DEKM, that uses an autoencoder to generate an embedding space. Since this embedding space may not be discriminative for clustering, we propose to further transform it into a new space by an orthonormal transformation matrix, which consists of the eigenvectors of the within-class scatter matrix of K-means. These eigenvectors are sorted in ascending order with respect to their eigenvalues. In this new space, the cluster-structure information is revealed, and each eigenvalue indicates how much its corresponding eigenvector contributes to the quantity of the cluster-structure information in the new space. The last eigenvector contributes the least. To increase the cluster-structure information in the direction of the last eigenvector, we discard the decoder and propose a greedy method to optimize the representation. Optimizing the representation is equivalent to minimizing the entropy, and this optimization is also consistent with the loss of K-means. Inspired by the idea that "good representations are beneficial to clustering and clustering provides supervisory signals to representation learning" [44], we alternately optimize representation learning and clustering until some criterion is met.
Figure 1 shows these two alternating processes on a subset of the MNIST dataset, containing four randomly selected clusters. We set the number of units of the embedding layer to two for better inspection. Figure 1(a)–(d) show the data representations and clustering results in the embedding space during these two alternating processes. Figure 1(a) shows the initial embedding space generated by an autoencoder. The direction of the black line denotes the second eigenvector of the within-class scatter matrix of K-means; the cluster-structure information in this direction is the lowest. To increase the cluster-structure information in this direction, we move data points closer to their centroids. Note that in this process, we use a mini-batch of data to optimize the representation, which moves only some data points closer to their centroids. Compared with Figure 1(a), we can see from Figure 1(b) that the data points in the red cluster do not move closer to their centroid, while the data points in the other three clusters do. We find in the experiments that this mini-batch updating strategy outperforms the full-batch updating strategy, which moves all the data points closer to their centroids.
Our contributions can be summarized as follows:
• We propose to transform the embedding space generated by an autoencoder into a new space using an orthonormal transformation matrix, which contains the eigenvectors of the within-class scatter matrix of K-means. The eigenvalues indicate the importance of their corresponding eigenvectors' contributions to the cluster-structure information in the new space.
II. RELATED WORK
A. K-means and Its Variants
K-means is one of the most fundamental clustering methods. It is often used as a building block of more advanced clustering methods such as spectral clustering [31, 36, 40, 45]. K-means has inspired many extensions. For example, the basic idea of [14] is to replace the mean with the median. K-means++ [2] improves the selection of the initial centroids, which are chosen based on their distance to the previously selected centroids. SubKmeans [26] assumes that the input space can be split into two independent subspaces, i.e., the clustering subspace and the noise subspace. The former contains only the cluster-structure information, and the latter contains only noise. SubKmeans performs clustering in the clustering subspace. Nr-Kmeans [27, 28] finds non-redundant K-means clusterings in multiple mutually orthogonal subspaces via an orthogonal transformation matrix. Fuzzy c-means [5] assigns each data point proportionally to multiple clusters, relaxing the hard cluster assignments of K-means to soft ones. Mini-batch K-means [34] extends K-means to the scenario of user-facing web applications. It can be used in deep learning frameworks because it supports online Stochastic Gradient Descent (SGD).
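As a concrete illustration of the online variant, the following is a minimal sketch using scikit-learn's MiniBatchKMeans; the data, number of clusters, and batch size are placeholder values, not settings from any of the cited papers.

```python
# Minimal sketch of online mini-batch K-means with scikit-learn;
# the data, number of clusters, and batch size are placeholders.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 32))          # placeholder data: 10k points in 32-D

mbk = MiniBatchKMeans(n_clusters=10, batch_size=256, random_state=0)
for start in range(0, len(X), 256):        # stream the data in mini-batches
    mbk.partial_fit(X[start:start + 256])  # one centroid update per mini-batch

labels = mbk.predict(X)                    # final hard assignments
```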
For real-world datasets, the number of clusters is unknown. To solve this problem, researchers have proposed methods that automatically find a proper number of clusters. X-means [33] uses the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) to evaluate the clustering results under different numbers of clusters $k$. G-means [12] assumes that each cluster follows a Gaussian distribution. It runs K-means hierarchically with increasing $k$ until a statistical test accepts that the clusters follow Gaussian distributions. PG-means [7] first constructs a one-dimensional projection of the dataset and the learned model, and then evaluates the model fitness in the projected space; it is able to discover a proper number of Gaussian clusters. Dip-means [15] assumes that each cluster follows a unimodal distribution. It first computes the pairwise distances between each data point and the other data points, and then applies a univariate statistical hypothesis test for unimodality [13] (Hartigans' dip-test) to the distributions of these distances to find unimodal clusters and the proper cluster number.
B. Deep Clustering
Since shallow clustering models cannot capture the non-linearity of real-world data, they often perform poorly. Deep clustering models use deep neural networks (DNNs) with stronger non-linear representational capabilities to extract features and thus achieve better clustering performance. Earlier deep clustering methods [4, 37] perform representation learning and clustering sequentially. Recent studies [6, 8, 10, 11, 23, 35, 41, 43, 44] have shown that jointly performing representation learning and clustering yields better performance.
JULE [44] proposes a recurrent framework for joint unsupervised learning of deep representations and clustering. During the optimization procedure, clustering is conducted in the forward pass and representation learning is performed in the backward pass. DCC [35] jointly performs non-linear dimensionality reduction and clustering, and its clustering process includes the optimization of the autoencoder. Since the objective function is continuous and contains no discrete cluster assignments, it can be solved by standard gradient-based methods. DEC [41] pre-trains an autoencoder with the reconstruction loss and performs clustering to obtain the soft cluster assignment of each data point. Then, an auxiliary target distribution is derived from the current soft cluster assignments. Finally, it iteratively refines the clustering by minimizing the Kullback-Leibler (KL) divergence between the soft assignments and the auxiliary target distribution. DCN [43] combines the objective of K-means with that of the autoencoder to find a "K-means-friendly" space. The cluster assignments of DCN are not soft (probabilistic) as in DEC, but strict (discrete), which limits the direct use of gradient-based SGD solvers. DCN refines the clustering by alternately optimizing the objectives of the autoencoder and K-means. DEPICT [8] consists of two parts, i.e., a convolutional autoencoder for learning the embedding space and a multinomial logistic regression layer functioning as a discriminative clustering model. To jointly learn the embedding space and the clustering, DEPICT employs an alternating approach to optimize a unified objective function. IDEC [10] combines an under-complete autoencoder with DEC. The under-complete autoencoder not only learns the embedding space but also preserves the local structure of the data. Like IDEC, DCEC [11] combines a convolutional autoencoder with DEC. DKM [6] proposes a new approach for jointly clustering with K-means and learning representations. The K-means objective is treated as the limit of a differentiable function, so that representation learning and clustering can be optimized by simple stochastic gradient descent. REDKC (Robust Embedded Deep K-means Clustering) [46] uses the $\delta$-norm metric to constrain the feature mapping of the autoencoder so that the data embeddings are more conducive to robust K-means clustering.
Our proposed DEKM also jointly performs representation learning and clustering. Like DCEC, DEKM first uses an autoencoder to find the embedding space. Then, it discards the decoder and optimizes the representation for better clustering. The representation optimization of DEKM differs from that of DCEC: instead of minimizing the Kullback-Leibler (KL) divergence between the cluster distribution and an auxiliary target distribution, DEKM optimizes the representation by reducing its entropy.
III. DEEP EMBEDDED K-MEANS
We assume that the cluster structures exist in a lower-dimensional subspace. Instead of clustering directly in the original space, DEKM uses an autoencoder to transform the original space into an embedding space, reducing the dimensionality before clustering. DEKM alternately optimizes representation learning and clustering. DEKM has three steps: (1) generating an embedding space with an autoencoder, (2) detecting clusters in the embedding space with K-means, and (3) optimizing the representation to increase the cluster-structure information. The last two steps are alternated to generate a better embedding space and better clustering results. Table I lists the symbols used in this paper and their interpretations.
TABLE I SYMBOLS AND INTERPRETATIONS

| Symbol | Interpretation |
|---|---|
| $d \in \mathbb{N}$ | dimension of the input space |
| $e \in \mathbb{N}$ | dimension of the embedding space |
| $n \in \mathbb{N}$ | number of data points |
| $k \in \mathbb{N}$ | number of clusters |
| $\mathbf{x} \in \mathbb{R}^{d\times 1}$ | data point |
| $\boldsymbol{\mu} \in \mathbb{R}^{e\times 1}$ | cluster centroid |
| $\mathbf{h} \in \mathbb{R}^{e\times 1}$ | embedding of a data point |
| $\mathbf{X} = [\mathbf{x}_{1}, \dots, \mathbf{x}_{n}]$ | data matrix |
| $\mathbf{I} \in \mathbb{R}^{e\times e}$ | identity matrix |
| $\mathbf{V} \in \mathbb{R}^{e\times e}$ | orthonormal transformation matrix |
| $\mathbf{H} = [\mathbf{h}_{1}, \dots, \mathbf{h}_{n}]$ | data embeddings |
| $\{\boldsymbol{\mu}_{1}, \dots, \boldsymbol{\mu}_{k}\}$ | set of cluster centroids |
| $\mathcal{C} = \{\mathcal{C}_{1}, \dots, \mathcal{C}_{k}\}$ | set of clusters |
| $f(\cdot)$ | encoder |
| $g(\cdot)$ | decoder |
A. Generating an Embedding Space
An autoencoder is a type of deep neural network (DNN) that can learn low-dimensional representations of the input data in an unsupervised way. It consists of an encoder and a decoder. The encoder $f(\cdot)$ transforms the input data into a lower-dimensional space (the embedding space), and the decoder $g(\cdot)$ reconstructs the input data from the embedding space. The autoencoder is trained to minimize a reconstruction loss, such as the least-squares error:
$$
\operatorname*{min}L_{1}=\sum_{i=1}^{n}\left\|\mathbf{x}_{i}-\widehat{\mathbf{x}}_{i}\right\|^{2}
$$

where $\mathbf{x}_{i}$ is the $i$-th data point, $f(\mathbf{x}_{i})$ is the output of the encoder $f(\cdot)$, and $\widehat{\mathbf{x}}_{i}=g(f(\mathbf{x}_{i}))$ is the reconstructed output of the decoder $g(\cdot)$. The dimension of the embedding space is usually set to a number that is much smaller than that of the original space. This not only mitigates the curse of dimensionality, but also helps to avoid the trivial solution of the autoencoder, in which both $f(\cdot)$ and $g(\cdot)$ are identity mappings.
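As an illustration of this pre-training step, the following is a minimal sketch in PyTorch; the layer sizes, optimizer, and embedding dimension are illustrative assumptions rather than the paper's exact architecture.

```python
# Sketch of autoencoder pre-training with the summed least-squares reconstruction
# loss of Equation (1); architecture and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

d, e = 784, 10                                   # example input and embedding dimensions

encoder = nn.Sequential(nn.Linear(d, 500), nn.ReLU(),
                        nn.Linear(500, 500), nn.ReLU(),
                        nn.Linear(500, e))       # f(.): x -> h
decoder = nn.Sequential(nn.Linear(e, 500), nn.ReLU(),
                        nn.Linear(500, 500), nn.ReLU(),
                        nn.Linear(500, d))       # g(.): h -> x_hat

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
criterion = nn.MSELoss(reduction="sum")          # summed squared error, as in Equation (1)

def pretrain_step(x):
    """One gradient step on a mini-batch x of shape (batch_size, d)."""
    x_hat = decoder(encoder(x))                  # reconstruction g(f(x))
    loss = criterion(x_hat, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```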
B. Detecting Clusters
In the above section, we train the autoencoder using the least-squares error loss to generate an embedding space $\mathbf{H}=f(\mathbf{X})$, without considering the characteristics of the embedding space. This embedding space may not contain any cluster structures. DCN [43] combines the objective function of the autoencoder with that of K-means and optimizes them alternately, aiming to find a "K-means-friendly" subspace. However, the relative importance parameter between the two objective functions is hard to set. In addition, it is difficult for this paradigm to generate a "K-means-friendly" subspace because of the reconstruction loss of the autoencoder. In our optimization procedure, the reconstruction loss of the autoencoder is therefore no longer used: whatever modification we make to the encoder, we can still train the decoder separately to minimize Equation (1).
We use K-means [24] to find a partition $\{\mathcal{C}_{i}\}_{i=1}^{k}$ of the data points in the embedding space $\mathbf{H}$. Its objective function is as follows:
$$
\operatorname*{min}L_{2}=\sum_{i=1}^{k}\sum_{\mathbf{h}\in\mathcal{C}_{i}}\left\|\mathbf{h}-\boldsymbol{\mu}_{i}\right\|^{2}
$$
where $\mathbf{h}$ is a data point in the embedding space, $k$ is the number of clusters, $\mathcal{C}_{i}$ denotes the set of data points assigned to the $i$-th cluster, and $\boldsymbol{\mu}_{i}=\frac{1}{|\mathcal{C}_{i}|}\sum_{\mathbf{h}\in\mathcal{C}_{i}}\mathbf{h}$ denotes the centroid of the $i$-th cluster.
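This clustering step can be sketched as follows with scikit-learn's KMeans operating on the embeddings; the helper name detect_clusters and the row-wise $(n, e)$ array layout of $\mathbf{H}$ are assumptions for illustration.

```python
# Sketch of the cluster-detection step: run K-means on the embeddings H
# (assumed to be an (n, e) NumPy array) and evaluate the objective L_2 of Equation (2).
import numpy as np
from sklearn.cluster import KMeans

def detect_clusters(H, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(H)
    labels = km.labels_                       # hard assignments to C_1, ..., C_k
    centroids = km.cluster_centers_           # mu_1, ..., mu_k, each in R^e
    # within-cluster sum of squared distances, i.e., L_2
    L2 = sum(np.sum((H[labels == i] - centroids[i]) ** 2) for i in range(k))
    return labels, centroids, L2
```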
To reveal the cluster structures in the embedding space, we propose to transform the embedding space $\mathbf{H}$ into a new space using an orthonormal transformation matrix $\mathbf{V}$. In the new space $\mathbf{Y}=\mathbf{V}\mathbf{H}$, Equation (2) becomes the following:
$$
\begin{aligned}
\operatorname*{min}L_{3}&=\sum_{i=1}^{k}\sum_{\mathbf{h}\in\mathcal{C}_{i}}\left\|\mathbf{V}\mathbf{h}-\mathbf{V}\boldsymbol{\mu}_{i}\right\|^{2}\\
&=\sum_{i=1}^{k}\sum_{\mathbf{h}\in\mathcal{C}_{i}}\left(\mathbf{V}\mathbf{h}-\mathbf{V}\boldsymbol{\mu}_{i}\right)^{\top}\left(\mathbf{V}\mathbf{h}-\mathbf{V}\boldsymbol{\mu}_{i}\right)\\
&=\sum_{i=1}^{k}\sum_{\mathbf{h}\in\mathcal{C}_{i}}\left(\mathbf{h}-\boldsymbol{\mu}_{i}\right)^{\top}\mathbf{V}^{\top}\mathbf{V}\left(\mathbf{h}-\boldsymbol{\mu}_{i}\right)\\
&=\sum_{i=1}^{k}\sum_{\mathbf{h}\in\mathcal{C}_{i}}\operatorname{Trace}\left(\left(\mathbf{h}-\boldsymbol{\mu}_{i}\right)^{\top}\mathbf{V}^{\top}\mathbf{V}\left(\mathbf{h}-\boldsymbol{\mu}_{i}\right)\right)
\end{aligned}
$$
where we use the trace trick in the last step, because a scalar can also be considered as a matrix of size $1\times1$. Since $\mathbf{V}^{\top}\mathbf{V}=\mathbf{I}$, minimizing Equation (3) is equivalent to minimizing Equation (2).
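A quick numerical sanity check of this invariance, using a random orthonormal matrix built via a QR decomposition (purely illustrative, not part of DEKM):

```python
# Numerical check: for an orthonormal V (V^T V = I), ||V h - V mu||^2 equals ||h - mu||^2.
import numpy as np

rng = np.random.default_rng(0)
e = 10
V, _ = np.linalg.qr(rng.normal(size=(e, e)))   # orthonormal matrix: V.T @ V = identity
h, mu = rng.normal(size=e), rng.normal(size=e)

lhs = np.sum((V @ h - V @ mu) ** 2)            # squared distance after the transform
rhs = np.sum((h - mu) ** 2)                    # squared distance in the embedding space
assert np.isclose(lhs, rhs)
```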
The above equation can be further written as:
$$
\operatorname*{min}L_{3}=\operatorname{Trace}\left(\mathbf{V}\mathbf{S}_{w}\mathbf{V}^{\top}\right)
$$

where $\mathbf{S}_{w}=\sum_{i=1}^{k}\sum_{\mathbf{h}\in\mathcal{C}_{i}}\left(\mathbf{h}-\boldsymbol{\mu}_{i}\right)\left(\mathbf{h}-\boldsymbol{\mu}_{i}\right)^{\top}$ is the within-class scatter matrix of K-means.
Since $\mathbf{V}$ is an orthonormal matrix, minimizing Equation (4) is a standard trace minimization problem. A version of the Rayleigh-Ritz theorem [25] indicates that the solution $\mathbf{V}$ contains the eigenvectors of $\mathbf{S}_{w}$ sorted by ascending eigenvalues. The eigenvalues indicate the importance of the eigenvectors' contributions to the cluster structures in the transformed space $\mathbf{Y}=\mathbf{V}\mathbf{H}$: the smaller the eigenvalue, the more its corresponding eigenvector contributes to the cluster structures in $\mathbf{Y}$. Note that $\mathbf{S}_{w}$ is symmetric and therefore orthogonally diagonalizable, so the orthonormal matrix $\mathbf{V}$ can always be found.
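The construction of $\mathbf{S}_{w}$, $\mathbf{V}$, and the transformed space $\mathbf{Y}$ can be sketched in NumPy as follows; np.linalg.eigh conveniently returns eigenvalues in ascending order, matching the ordering described above. The helper name and the row-wise array layout are assumptions for illustration.

```python
# Sketch: build the within-class scatter matrix S_w from the K-means result and
# obtain the orthonormal transform V from its eigenvectors. np.linalg.eigh returns
# eigenvalues in ascending order, so the last row of V (largest within-class
# scatter) is the direction with the least cluster-structure information.
import numpy as np

def cluster_structure_transform(H, labels, centroids):
    e = H.shape[1]
    S_w = np.zeros((e, e))
    for i, mu in enumerate(centroids):
        D = H[labels == i] - mu               # deviations of cluster i from its centroid
        S_w += D.T @ D                        # adds the sum of (h - mu)(h - mu)^T
    eigvals, eigvecs = np.linalg.eigh(S_w)    # ascending eigenvalues, orthonormal eigenvectors
    V = eigvecs.T                             # rows of V are eigenvectors, smallest eigenvalue first
    Y = H @ V.T                               # y_i = V h_i for every embedding (row-wise)
    return V, Y, eigvals
```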
C. Optimizing Representation
As discussed above, minimizing Equation (3) is equivalent to minimizing Equation (2). We can first perform K-means in the embedding space $\mathbf{H}$ to get $\mathbf{S}_{w}$, and then eigendecompose $\mathbf{S}_{w}$ to get $\mathbf{V}$. Finally, we transform the embedding space into a new space $\mathbf{Y}$ that reveals the cluster-structure information. We also know the importance of each dimension of $\mathbf{Y}$ in terms of the cluster-structure information, i.e., the last dimension has the least cluster-structure information.
We can rewrite Equation (3) as follows:
$$
\operatorname*{min}L_{3}=\sum_{i=1}^{k}\sum_{\mathbf{y}\in\mathcal{C}_{i}}\left\|\mathbf{y}-\mathbf{m}_{i}\right\|^{2}
$$
where $\mathbf{y}=\mathbf{V}\mathbf{h}$ and $\mathbf{m}_{i}=\mathbf{V}\boldsymbol{\mu}_{i}$.
Now the question is how to optimize the representation to increase the cluster-structure information in $\mathbf{Y}$. In this paper, we measure the cluster-structure information by entropy: the lower the entropy of the data, the more cluster-structure information it contains. For example, Figure 2(a) shows 400 data points that follow a uniform distribution in the range [0, 10]. The entropy is $-\sum_{i=1}^{400}\frac{1}{400}\ln\frac{1}{400}=5.992$ bits. Compared with Figure 2(a), there are two Gaussian clusters in Figure 2(b), each of which has 200 data points; both clusters have the same variance of one. The two Gaussian clusters in Figure 2(c) have the same variance of 0.5, and each also has 200 data points. Note that the entropy of a $d$-dimensional Gaussian distribution is $\frac{d}{2}\ln\left(2\pi e\left(\sigma_{1}^{2}\sigma_{2}^{2}\cdots\sigma_{d}^{2}\right)^{1/d}\right)$, as shown in [29]. Thus, the entropy of the data in Figure 2(b) is 1.419 bits, and that of the data in Figure 2(c) is 0.033 bits. From Figure 2, we make two observations: 1) Compared with data that has no cluster-structure information, data that contains cluster-structure information has lower entropy. 2) Reducing the variance of a Gaussian cluster reduces its entropy. Also note that the clusters found by K-means follow isotropic Gaussian distributions. Based on these observations, we propose a new objective that optimizes the representation to increase the cluster-structure information. Specifically, we would like to move the data points in the clusters found by K-means closer to their centroids. This strategy is also consistent with Equation (2), which is thereby further minimized.
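A small sketch of the Gaussian-entropy formula quoted above, illustrating observation 2 (shrinking the variances lowers the entropy); the variance values below are arbitrary examples, not the settings of Figure 2.

```python
# Differential entropy of a d-dimensional Gaussian with diagonal covariance,
# h = (d/2) * ln(2*pi*e * (sigma_1^2 ... sigma_d^2)^(1/d)); example variances only.
import numpy as np

def gaussian_entropy(variances):
    variances = np.asarray(variances, dtype=float)
    d = variances.size
    return 0.5 * d * np.log(2.0 * np.pi * np.e * np.prod(variances) ** (1.0 / d))

print(gaussian_entropy([1.0, 1.0]))   # two unit-variance dimensions
print(gaussian_entropy([0.5, 0.5]))   # smaller variances -> lower entropy
```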
Instead of moving data points closer to their respective centroids in every dimension of $\mathbf{Y}$, we propose a greedy method that moves data points closer to their respective centroids only in the last dimension of $\mathbf{Y}$. In this way, the representation is easily optimized to increase the cluster-structure information. We find in our experiments that this greedy method performs the best. The details of the greedy method are as follows: we first replicate $\mathbf{y}$ as $\mathbf{y}^{\prime}$ and then replace the last dimension of $\mathbf{y}^{\prime}$ with the last dimension of $\mathbf{m}_{i}$. Finally, the objective function is defined as follows:
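Constructing the greedy targets $\mathbf{y}^{\prime}$ can be sketched as follows; the helper name greedy_targets and the row-wise array shapes are assumptions for illustration.

```python
# Sketch: build the greedy targets y' by copying each transformed embedding and
# overwriting only its last coordinate with that of its cluster centroid m_i.
import numpy as np

def greedy_targets(Y, labels, M):
    """Y: (n, e) transformed embeddings; labels: (n,) cluster ids; M: (k, e) transformed centroids."""
    Y_prime = Y.copy()                 # y' starts as a copy of y
    Y_prime[:, -1] = M[labels, -1]     # replace the last dimension with that of m_i
    return Y_prime
```

Since $\mathbf{y}^{\prime}$ differs from $\mathbf{y}$ only in the last coordinate and is treated as a constant target, minimizing the objective below pulls only the last coordinate of each $\mathbf{y}$ toward that of its centroid $\mathbf{m}_{i}$.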
Fig. 2. Demonstration of the relationships between entropy and cluster-structure information. The lower the entropy of data, the higher the cluster-structure information it contains. Entropy decreases from (a) to (c), while cluster-structure information increases from (a) to (c).
$$
\operatorname*{min}L_{4}=\sum_{i=1}^{k}\sum_{\mathbf{y}\in\mathcal{C}_{i}}\left\|\mathbf{y}-\mathbf{y}^{\prime}\right\|^{2}
$$
We do not use a full-batch updating strategy that moves all the data points closer to their centroids in the last dimension. Instead, we use a mini-batch updating strategy that moves only some mini-batches of data points closer to their centroids. We find that the mini-batch updating strategy is superior to the full-batch updating strategy. After optimizing the representation, we have a new embedding space $\mathbf{H}$. Then, we perform K-means again to find clusters and their respective centroids. We alternately repeat the second and third steps until some criterion is met, such as reaching a predefined number of iterations or observing that fewer than $0.1\%$ of the samples change their cluster assignments between two consecutive iterations.

The pseudo-code of DEKM is shown in Algorithm 1. Line 1 uses Equation (1) to train the autoencoder. Line 3 uses the encoder to generate an embedding space $\mathbf{H}=f(\mathbf{X})$. Then, in the embedding space, line 4 performs K-means to find clusters. Lines 5–6 compute the within-class scatter matrix $\mathbf{S}_{w}$ and ei