[论文翻译]深度嵌入式K均值聚类


原文地址:https://arxiv.org/pdf/2109.15149v1


Deep Embedded K-Means Clustering

Abstract—Recently, deep clustering methods have gained momentum because of the high representational power of deep neural networks (DNNs) such as auto encoder. The key idea is that representation learning and clustering can reinforce each other: Good representations lead to good clustering while good clustering provides good supervisory signals to representation learning. Critical questions include: 1) How to optimize representation learning and clustering? 2) Should the reconstruction loss of auto encoder be considered always? In this paper, we propose DEKM (for Deep Embedded K-Means) to answer these two questions. Since the embedding space generated by auto encoder may have no obvious cluster structures, we propose to further transform the embedding space to a new space that reveals the cluster-structure information. This is achieved by an ortho normal transformation matrix, which contains the ei gen vectors of the within-class scatter matrix of K-means. The eigenvalues indicate the importance of the ei gen vectors’ contributions to the clusterstructure information in the new space. Our goal is to increase the cluster-structure information. To this end, we discard the decoder and propose a greedy method to optimize the representation. Representation learning and clustering are alternately optimized by DEKM. Experimental results on the real-world datasets demonstrate that DEKM achieves state-of-the-art performance.

摘要—近年来,深度聚类方法因自编码器等深度神经网络(DNNs)的高表征能力而备受关注。其核心思想是表征学习与聚类能够相互促进:优质表征能产生良好聚类结果,而优质聚类又能为表征学习提供有效监督信号。关键问题在于:1)如何协同优化表征学习与聚类?2)自编码器的重构损失是否必须保留?本文提出的DEKM(深度嵌入K均值)旨在回答这两个问题。针对自编码器生成的嵌入空间可能缺乏明显聚类结构的问题,我们提出通过正交变换矩阵将嵌入空间转换至能显式呈现聚类结构信息的新空间,该矩阵由K均值类内散布矩阵的特征向量构成,特征值则反映各特征向量对新空间聚类结构信息的贡献度。为增强聚类结构信息,我们舍弃解码器并采用贪心算法优化表征。DEKM通过交替优化实现表征学习与聚类的协同提升。真实数据集上的实验表明,DEKM达到了当前最优性能。

I. INTRODUCTION

I. 引言

Clustering, an essential data exploratory analysis tool, has been widely studied. Numerous clustering methods have been proposed, e.g., K-means [24], Gaussian Mixture Models [3], and spectral clustering [31, 36, 40, 45]. These traditional clustering methods have achieved promising results. However, when the dimension of the data is very high, these traditional methods become inefficient and ineffective because of the notorious problem of the curse of dimensionality.

聚类作为一种重要的数据探索分析工具,已被广泛研究。目前已提出众多聚类方法,例如K-means [24]、高斯混合模型 [3] 和谱聚类 [31, 36, 40, 45]。这些传统聚类方法取得了显著成果。然而当数据维度极高时,由于众所周知的维度灾难问题,这些传统方法会变得低效且效果不佳。

To deal with this problem, people usually perform dimens ional it y reduction before clustering. The commonly used dimensionality reduction methods are Principle Component Analysis (PCA) [32], Canonical Correlation Analysis (CCA) [1], and Non negative Matrix Factorization (NMF) [21]. But all these methods are linear and shallow models, whose representational abilities are limited. They cannot capture the complex and non-linear relations hidden in the data. Benefiting from the high representational power of deep neural networks (DNNs), auto encoder has been widely used as a dimensionality reduction method for clustering in recent years. Auto encoder can learn meaningful representations of the input data through an unsupervised manner. It consists of an encoder and a decoder. The encoder transforms the input data into a lowerdimensional space (the embedding space), and the decoder is responsible for reconstructing the input data from that embedding space.

为了解决这一问题,人们通常在聚类前进行降维处理。常用的降维方法包括主成分分析 (PCA) [32]、典型相关分析 (CCA) [1] 和非负矩阵分解 (NMF) [21]。但这些方法都是线性的浅层模型,其表示能力有限,无法捕捉数据中隐藏的复杂非线性关系。得益于深度神经网络 (DNNs) 强大的表征能力,自编码器近年来被广泛用作聚类的降维方法。自编码器能以无监督方式学习输入数据的有意义表示,由编码器和解码器组成:编码器将输入数据转换到低维空间(嵌入空间),解码器负责从该嵌入空间重构输入数据。


Fig. 1. Visualization of the representation learning and clustering by DEKM on a subset of MNIST dataset. The color of data points marks the ground truth and the color of the background block marks the clustering results. Data points falling into the same block belong to the same cluster. Cluster centroids are marked by black crosses. The black line shows the direction of the last ei gen vector of the within-class scatter matrix of K-means. In this direction, the cluster-structure information is the lowest. We would like to optimize the representation to increase the cluster-structure information in this direction. We use a mini-batch updating strategy. Numbers in parentheses denote ACC and NMI, respectively.

图 1: DEKM在MNIST数据集子集上的表示学习与聚类可视化。数据点颜色表示真实标签,背景区块颜色表示聚类结果。落入同一区块的数据点属于同一聚类。聚类中心用黑色十字标记。黑线表示K-means类内散布矩阵最后一个特征向量的方向,该方向的聚类结构信息量最低。我们通过优化表示来提升该方向的聚类结构信息量。采用小批量更新策略,括号内数值分别表示准确率(ACC)和标准化互信息(NMI)。

A plethora of clustering methods [6, 8, 10, 11, 23, 35, 41, 43, 44] based on auto encoder have been proposed. One of the most critical problems with deep clustering is how to design a proper clustering loss function. Methods such as DEC [41] and IDEC [10] minimize the Kullback-Leibler (KL) divergence between the cluster distribution and an auxiliary target distribution. The basic idea is to refine the clusters by learning from their high confidence assignments. However, the auxiliary target distribution is hard to choose for better clustering results. Other methods such as DCN [43] and DKM [6] combine the objective of K-means with that of auto encoder and jointly optimize them. However, the discrimination of clusters in the embedding space is not directly related to the reconstruction loss of auoencoder. Thus, in this paper, after using an auto encoder to generate an embedding space, we discard the decoder and do not optimize the reconstruction loss anymore. Because no matter what we do to the embddding space, we can separately train the decoder to reconstruct the input data from their embeddings.

大量基于自动编码器 (auto encoder) 的聚类方法 [6, 8, 10, 11, 23, 35, 41, 43, 44] 被提出。深度聚类最关键的问题之一是如何设计合适的聚类损失函数。DEC [41] 和 IDEC [10] 等方法通过最小化聚类分布与辅助目标分布之间的 KL (Kullback-Leibler) 散度来实现聚类,其核心思想是通过高置信度分配结果来优化聚类。但辅助目标分布的选择往往难以获得更好的聚类效果。DCN [43] 和 DKM [6] 等其他方法将 K-means 目标与自动编码器目标结合并进行联合优化,但嵌入空间中的聚类区分度与自动编码器的重构损失并无直接关联。因此,本文在使用自动编码器生成嵌入空间后,舍弃解码器并不再优化重构损失。因为无论对嵌入空间进行何种操作,我们都可以单独训练解码器来根据嵌入重构输入数据。

Considering the simplicity of K-means, we extend it to a deep version, which is called DEKM that uses an auto encoder to generate an embedding space. Since this embedding space may not be disc rim i native for clustering, we propose to further transform this embedding space to a new space by an ortho normal transformation matrix, which consists of the ei gen vectors of the within-class scatter matrix of K-means. These ei gen vectors are sorted in ascending order with respect to their eigenvalues. In this new space, the cluster-structure information is revealed. And each eigenvalue indicates how much its corresponding ei gen vector contributes to the quantity of the cluster-structure information in the new space. The last ei gen vector contributes the least. To increase the clusterstructure information in the direction of the last eigenvector, we discard the decoder and propose a greedy method to optimize the representation. Optimizing representation is equivalent to minimizing the entropy. And this optimization is also in consistent with the loss of K-means. Inspired by the idea that “good representations are beneficial to clustering and clustering provides supervisory signals to representation learning” [44], we alternately optimize representation learning and clustering until some criterion is met.

考虑到K-means的简洁性,我们将其扩展为深度版本,称为DEKM (Deep Embedded K-means) ,该方法使用自动编码器生成嵌入空间。由于该嵌入空间可能对聚类缺乏判别性,我们提出通过正交变换矩阵进一步将其转换到新空间,该矩阵由K-means类内散布矩阵的特征向量构成。这些特征向量按其对应特征值升序排列。在新空间中,聚类结构信息得以显现,每个特征值表征其对应特征向量对新空间聚类结构信息量的贡献程度,最后一个特征向量贡献最小。为增强末位特征向量方向的聚类结构信息,我们舍弃解码器并提出贪心算法优化表征。表征优化等价于熵最小化,该优化过程也与K-means的损失函数一致。受"优质表征促进聚类,而聚类为表征学习提供监督信号"[44]的启发,我们交替优化表征学习与聚类直至满足终止条件。

Figure 1 shows these two alternating processes on a subset of MNIST dataset, containing randomly selected four clusters. We set the units of the embedding layer to two for better inspection. Figure 1(a)–(d) show the data representations and clustering results in the embedding space during these two alternating processes. Figure 1(a) shows the initial embedding space generated by an auto encoder. The direction where the black line lies denotes the second ei gen vector of the scatter matrix of K-means. The cluster-structure information in this direction is the lowest. To increase the cluster-structure information in this direction, we move data points closer to their centroids. Note that in this process, we use a mini-batch of data to optimize the representation, which makes only some data points closer to their centroids. Compared with Figure 1(a), we can see from Figure 1(b) that the data points in the red cluster do not move closer to their centroids, while the data points in the other three clusters do. We find in the experiments that this mini-batch updating strategy outperforms the full-batch updating strategy, which moves all the data points closer to their centroids.

图1展示了在MNIST数据集子集上这两个交替过程,该子集包含随机选择的四个聚类。我们将嵌入层的单元数设为2以便更好地观察。图1(a)-(d)展示了这两个交替过程中嵌入空间的数据表示和聚类结果。图1(a)显示由自编码器生成的初始嵌入空间。黑线所在方向表示K-means散点矩阵的第二特征向量,该方向的聚类结构信息最弱。为增强该方向的聚类结构信息,我们将数据点向其质心移动。注意在此过程中,我们使用小批量数据优化表示,这使得只有部分数据点更接近其质心。与图1(a)相比,从图1(b)可见红色聚类中的数据点未向质心移动,而其他三个聚类的数据点则发生了移动。实验发现这种小批量更新策略优于全批量更新策略(后者会使所有数据点都向质心移动)。

Our contributions can be summarized as follows:

我们的贡献可概括如下:

• We propose to transform the embedding space generated by an auto encoder to a new space using an ortho normal transformation matrix, which contains the ei gen vectors of the within-class scatter matrix of K-means. The eigenvalues indicate the importance of their corresponding eigenvectors’ contributions to the cluster-structure information in the new space.

• 我们提出使用一个正交归一化变换矩阵将自编码器生成的嵌入空间转换到新空间,该矩阵包含K均值类内散布矩阵的特征向量。特征值表明对应特征向量对新空间中聚类结构信息贡献的重要性。

II. RELATED WORK

II. 相关工作

A. K-means and Its Variants

A. K-means及其变体

K-means is one of the most fundamental clustering methods. It is usually used as one of the building blocks of many advanced clustering methods such as spectral clustering [31, 36, 40, 45]. K-means has inspired many extentions. For example, the basic idea of [14] is to replace the mean with the median. K-means $^{++}$ [2] improves the selection of the initial centroids, which are based on their proportional distance to the previous selected centroids. SubKmeans [26] assumes that the input space can be split into two independent subspaces, i.e., the clustering subspace and the noise subspace. The former subspace contains only the cluster-structure information and the latter subspace contains only the noise information. SubKmeans performs clustering in the clustering subspace. Nr-Kmeans [27, 28] finds non-redundant K-means clustering s in multiple mutually orthogonal subspaces via an orthogonal transformation matrix. Fuzzy-c-means [5] assigns each data point proportionally to multiple clusters. It relaxes the hard cluster assignments of K-means to soft cluster assignments. Mini-batch K-means [34] extends K-means to the scenario of user-facing web applications. Mini-batch K-means can be used in the deep learning frameworks because it supports the online Stochastic Gradient Descent (SGD).

K-means是最基础的聚类方法之一,通常作为谱聚类[31, 36, 40, 45]等高级聚类方法的构建模块。该方法衍生出诸多变体:例如[14]的核心思想是用中位数替代均值;K-means$^{++}$[2]通过根据与已选中心点距离的比例来优化初始质心选择;SubKmeans[26]假设输入空间可分解为聚类子空间和噪声子空间两个独立子空间,仅在前者执行聚类操作;Nr-Kmeans[27, 28]通过正交变换矩阵在多个互斥子空间中发现非冗余K-means聚类;Fuzzy-c-means[5]采用软分配机制使数据点按比例归属多个簇;Mini-batch K-means[34]则面向网页应用场景,因其支持在线随机梯度下降(SGD)而适用于深度学习框架。

For the real-world datasets, the number of clusters is unknown. To solve this problem, researchers propose to automatically find the proper number of clusters. X-means [33] uses the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) as measures to evaluate the clustering results under different cluster number $k$ . G-means [12] assumes that each cluster follows a Gaussian distribution. It runs K-means with increasing $k$ hierarchically until the statistical test accepts that the clusters follow Gaussian distributions. PG-means [7] first constructs an one-dimensional projection of the dataset and the learned model. Then the model fitness is evaluated in the projected space. It is able to discover a proper number of Gaussian clusters. Dip-means [15] assumes that each cluster follows a unimodal distribution. It first computes the pairwise distance between a data point and the other data points. Then, it applies a univariate statistical hypothesis test [13] (called Hartigans’ dip-test) for uni modality on the distributions of distances to find unimodal clusters and the proper cluster number.

对于真实世界的数据集,聚类数量通常是未知的。为解决这一问题,研究者提出了自动确定最佳聚类数的方法。X-means [33] 采用贝叶斯信息准则 (BIC) 或赤池信息准则 (AIC) 作为评估指标,在不同聚类数 $k$ 下衡量聚类结果。G-means [12] 假设每个簇服从高斯分布,通过层次化增加 $k$ 值运行K-means,直至统计检验接受簇符合高斯分布。PG-means [7] 首先生成数据集与学习模型的一维投影,在投影空间评估模型拟合度,从而发现合适的高斯簇数量。Dip-means [15] 假设每个簇服从单峰分布,先计算数据点与其他点的成对距离,再通过单变量统计假设检验 [13] (Hartigans' dip-test) 分析距离分布的单峰性,以此识别单峰簇并确定最佳聚类数。

B. Deep Clustering

B. 深度聚类

Since shallow clustering models suffer from non-linearity of real-world data, they cannot perform well. Deep clustering models use deep neural networks (DNNs) with stronger nonlinear representational capabilities to extract features, which achieve better clustering performances. Earlier deep clustering methods [4, 37] perform representation learning and clustering sequentially. Recent studies [6, 8, 10, 11, 23, 35, 41, 43, 44] have shown that jointly performing representation learning and clustering yields better performance.

由于浅层聚类模型受限于现实数据的非线性特性,其表现往往欠佳。深度聚类模型通过采用具有更强非线性表征能力的深度神经网络(DNNs)来提取特征,从而获得更优的聚类效果。早期深度聚类方法[4,37]采用表征学习与聚类分步执行的策略。最新研究[6,8,10,11,23,35,41,43,44]表明,联合进行表征学习与聚类能取得更好的性能表现。

JULE [44] proposes a recurrent framework for jointly unsupervised learning of deep representations and clustering. During the optimization procedure, clustering is conducted in the forward pass and representation learning is performed in the backward pass. DCC [35] jointly performs non-linear dimensionality reduction and clustering. The clustering process includes the optimization of the auto encoder. Since the objective function is continuous without the discrete cluster assignment, it can be solved by standard gradient-based methods. DEC [41] pre-trains an auto encoder with reconstruction loss and performs clustering to obtain the soft clustering assignment of each data point. Then, an auxiliary target distribution is derived from the current soft cluster assignments. Finally, it iterative ly refines clustering by minimizing the KullbackLeibler (KL) divergence between the soft assignments and the auxiliary target distribution. DCN [43] combines the objective of K-means with that of auto encoder to find a “K-meansfriendly” space. The cluster assignments of DCN are not soft (probability) like that in DEC, but strict (discrete), which limit the direct use of gradient-based SGD solvers. DCN refines clustering by alternately optimizing the objectives of auto encoder and K-means. DEPICT [8] consists of two parts, i.e., a convolutional auto encoder for learning the embedding space and a multi no mi al logistic regression layer functioning as a disc rim i native clustering model. For jointly learning the embedding space and clustering, DEPICT employs an alternating approach to optimize a unified objective function. IDEC [10] combines an under-complete auto encoder with DEC. The under-complete auto encoder not only learns the embedding space, but also preserves the local structure of data. Like IDEC, DCEC [11] combines a convolutional auto encoder with DEC. DKM [6] proposes a new approach for jointly clustering with K-means and learning representations. The Kmeans objective is considered as a limit of a differentiable function, so that the representation learning and clustering can be optimized by the simple stochastic gradient descent. REDKC (for Robust Embedded Deep K-means Clustering) [46] uses the $\delta$ -norm metric to constrain the feature mapping of the auto encoder so that data embeddings are more conductive to the robust K-means clustering.

JULE [44] 提出了一种循环框架,用于联合无监督学习深度表示和聚类。在优化过程中,前向传播执行聚类,反向传播执行表示学习。DCC [35] 联合执行非线性降维和聚类,其聚类过程包含对自编码器的优化。由于目标函数连续且不含离散聚类分配,可通过标准梯度方法求解。DEC [41] 先用重构损失预训练自编码器,再通过聚类获得各数据点的软聚类分配,随后从当前软分配推导辅助目标分布,最终通过最小化软分配与辅助目标分布间的KL散度迭代优化聚类。DCN [43] 将K均值目标与自编码器目标结合以寻找"K均值友好"空间,其聚类分配不像DEC采用概率形式而是严格离散值,这限制了基于梯度的SGD求解器的直接使用。DCN通过交替优化自编码器和K均值目标来改进聚类。DEPICT [8] 包含卷积自编码器(用于学习嵌入空间)和多分类逻辑回归层(作为判别式聚类模型),采用交替优化策略联合学习嵌入空间与聚类。IDEC [10] 将欠完备自编码器与DEC结合,该自编码器不仅能学习嵌入空间,还能保留数据局部结构。DCEC [11] 类似地结合了卷积自编码器与DEC。DKM [6] 提出联合K均值聚类与表示学习的新方法,将K均值目标视为可微函数的极限,从而可通过简单随机梯度下降优化表示学习和聚类。REDKC(鲁棒嵌入式深度K均值聚类)[46] 使用$\delta$-范数度量约束自编码器的特征映射,使数据嵌入更有利于鲁棒K均值聚类。

Our proposed DEKM is also a method that jointly performs representation learning and clustering. Like DCEC, DEKM first uses an auto encoder to find the embedding space. Then, it discards the decoder and optimizes the representation for better clustering. The representation optimization of DEKM is different from that of DCEC, which does not optimize the Kullback-Leibler (KL) divergence between the cluster distribution and an auxiliary target distribution. Instead, DEKM optimizes the representation by reducing its entropy.

我们提出的DEKM也是一种联合执行表示学习和聚类的方法。与DCEC类似,DEKM首先使用自动编码器寻找嵌入空间,然后丢弃解码器并优化表示以获得更好的聚类效果。DEKM的表示优化方式与DCEC不同:DCEC通过优化聚类分布与辅助目标分布之间的KL散度 (Kullback-Leibler divergence) 来实现优化,而DEKM则是通过降低表示熵来优化表征。

III. DEEP EMBEDDED K-MEANS

III. 深度嵌入式K均值

We assume that the cluster structures exist in a lowerdimensional subspace. Instead of clustering directly in the original space, DEKM uses auto encoder to transform the original space into an embedding space to reduce the dimens ional it y before clustering. DEKM alternately optimizes representation learning and clustering. DEKM has three steps: (1) generating an embedding space by an auto encoder, (2) detecting clusters in the embedding space by K-means, and (3) optimizing the representation to increase the cluster-structure information. The last two steps are alternately optimized to generate better embedding space and clustering results. Table I shows the used symbols and their corresponding interpretations in this paper.

我们假设聚类结构存在于一个低维子空间中。DEKM并非直接在原始空间进行聚类,而是通过自编码器将原始空间转换为嵌入空间以降低维度后再进行聚类。DEKM交替优化表示学习和聚类过程,包含三个步骤:(1) 通过自编码器生成嵌入空间,(2) 在嵌入空间中使用K-means检测聚类,(3) 优化表示以增强聚类结构信息。后两个步骤交替优化以生成更好的嵌入空间和聚类结果。表1列出了本文使用的符号及其对应解释。

TABLE I SYMBOLS AND INTERPRETATIONS

表 I 符号与释义

符号 释义
dE N eEN nEN kEN 输入空间维度 嵌入空间维度 数据点数量 聚类数量
x E Rdx1 μERex1 h E Rex1 数据点 聚类中心点 数据点嵌入
X=[x1;...;Xn] IE Rexe VE Rexe H=[h1;...;hn] 数据矩阵 单位矩阵 正交变换矩阵 数据嵌入
C ={C1,...,Ck} 聚类中心点集合 聚类集合
f() g() 编码器 解码器

A. Generating an Embedding Space

A. 生成嵌入空间

Auto encoder is a type of deep neural networks (DNNs) that can learn low-dimensional representations of the input data in an unsupervised way. It consists of an encoder and a decoder. The encoder $f(\cdot)$ transforms the input data into a lower-dimensional space (the embedding space), and the decoder $g(\cdot)$ reconstructs the input data from the embedding space. Auto encoder is trained to minimize the reconstruction loss, such as the least-squares error:

自动编码器 (Auto encoder) 是一种深度神经网络 (DNNs),能够以无监督方式学习输入数据的低维表示。它由编码器和解码器组成。编码器 $f(\cdot)$ 将输入数据转换到低维空间 (嵌入空间),解码器 $g(\cdot)$ 则从嵌入空间重构输入数据。自动编码器通过最小化重构损失 (如最小二乘误差) 进行训练:

图片.png

where $mathbf{x}{i}$ is the $i$ -th data point, $f(mathbf{x}{i})$ is the output of the encoder $f(cdot)$ , and $widehat{mathbf{x}}{i}$ is the reconstructed output of the decoder $g(\cdot)$ . The dimensi obn of the embedding space usually is set to a number that is much smaller than that of the original space. This not only mitigates the curse of dimensionality, but also helps to avoid the trivial solution of auto encoder, of which both $f(\cdot)$ and $g(\cdot)$ equal to the identity matrix.

其中 $mathbf{x}{i}$ 是第 $i$ 个数据点,$f(mathbf{x}{i})$ 是编码器 $f(cdot)$ 的输出,$widehat{mathbf{x}}{i}$ 是解码器 $g(cdot)$ 的重构输出。嵌入空间的维度通常设置为远小于原始空间的数值,这不仅缓解了维度灾难 (curse of dimensionality) ,还能避免自编码器 (auto encoder) 的平凡解,即 $f(cdot)$ 和 $g(cdot)$ 都等于单位矩阵的情况。

B. Detecting Clusters

B. 检测集群

In the above section, we train the auto encoder using the least-squares error loss to generate an embedding space $\mathbf{H}={\mathbf{\nabla}}f(\mathbf{X})$ , without considering the characteristics of the embedding space. This embedding space may not contain any cluster structures. DCN [43] combines the objective function of the auto encoder with that of K-means and optimizes them alternately. DCN wants to find a “K-means-friendly” subspace. However, the relative importance parameter between the two objective functions is hard to set. In addition, this paradigm is difficult to generate a “K-means-friendly” subspace because of the reconstruction loss of the auto encoder. In the optimization procedure, the reconstruction loss of the auto encoder should not be used anymore. The reason is that whatever we modify on the encoder, we can still train the decoder to make Equation (1) minimized.

在上述章节中,我们使用最小平方误差损失训练自动编码器以生成嵌入空间 $\mathbf{H}={\mathbf{\nabla}}f(\mathbf{X})$ ,但未考虑嵌入空间的特性。该嵌入空间可能不包含任何聚类结构。DCN [43] 将自动编码器的目标函数与K-means的目标函数相结合,并交替优化它们。DCN试图找到一个"K-means友好"的子空间。然而,两个目标函数之间的相对重要性参数难以设置。此外,由于自动编码器的重构损失,这种范式难以生成"K-means友好"的子空间。在优化过程中,不应再使用自动编码器的重构损失。原因在于无论我们对编码器进行何种修改,仍可以训练解码器使方程(1)最小化。

We use K-means [24] to find a partition ${{mathcal C}{i}}{i=1}^{k}$ of the data points in the embedding space $mathbf{H}$ . Its objective function is as follows:

我们使用K-means [24]算法在嵌入空间$mathbf{H}$中对数据点进行划分${{mathcal C}{i}}{i=1}^{k}$,其目标函数如下:

$$
\operatorname*{min}L_{2}=\sum_{i=1}^{k}\sum_{\mathbf{h}\in{\mathcal{C}}_{i}}\left|\mathbf{h}-{\boldsymbol{\mu}}_{i}\right|^{2}
$$

where $mathbf{h}$ is a data point in the embedding space, $k$ is the number of clusters, $mathcal{C}{i}$ denotes the set of data points assigned to the $i$ -th cluster, $begin{array}{r}{pmb{mu}{i}=frac{1}{|mathcal{C}{i}|} sum{mathbf{h} in mathcal{C}{i}} mathbf{h}} end{array}$ denotes the centroid of the $i$ -th cluster.

其中 $mathbf{h}$ 是嵌入空间中的一个数据点,$k$ 是聚类数量,$mathcal{C}{i}$ 表示分配给第 $i$ 个聚类的数据点集合,$begin{array}{r}{pmb{mu}{i}=frac{1}{|mathcal{C}{i}|}sum{mathbf{h}in mathcal{C}{i}}mathbf{h}}end{array}$ 表示第 $i$ 个聚类的质心

To reveal the cluster structures in the embedding space, we propose to transform the embedding space $\mathbf{H}$ to a new space by an ortho normal transformation matrix $\mathbf{V}$ . In the new space $\mathbf{Y}=\mathbf{VH}$ , Equation (2) becomes the following:

为揭示嵌入空间中的聚类结构,我们提出通过正交归一变换矩阵$\mathbf{V}$将嵌入空间$\mathbf{H}$转换至新空间$\mathbf{Y}=\mathbf{VH}$。在新空间中,公式(2)可转化为以下形式:

$$
{\begin{array}{r l}{\operatorname*{min}L_{3}=\displaystyle\sum_{i=1}^{k}\sum_{\mathbf{h}\in{\mathcal{C}}_{i}}|\mathbf{V}\mathbf{h}-\mathbf{V}\mu_{i}|^{2}}\ &{\qquad=\displaystyle\sum_{i=1}^{k}\sum_{\mathbf{h}\in{\mathcal{C}}_{i}}\left(\mathbf{V}\mathbf{h}-\mathbf{V}\mu_{i}\right)^{\mathsf{T}}\left(\mathbf{V}\mathbf{h}-\mathbf{V}\mu_{i}\right)}\ &{\qquad=\displaystyle\sum_{i=1}^{k}\sum_{\mathbf{h}\in{\mathcal{C}}_{i}}\left(\mathbf{h}-\mu_{i}\right)^{\mathsf{T}}\mathbf{V}^{\mathsf{T}}\mathbf{V}\left(\mathbf{h}-\mu_{i}\right)}\ &{\qquad=\displaystyle\sum_{i=1}^{k}\sum_{\mathbf{h}\in{\mathcal{C}}_{i}}\mathrm{Trace}\left(\left(\mathbf{h}-\mu_{i}\right)^{\mathsf{T}}\mathbf{V}^{\mathsf{T}}\mathbf{V}\left(\mathbf{h}-\mu_{i}\right)\right)}\end{array}}
$$

where we use the trace-trick in the last step, because a scalar can also be considered as a matrix of size $1\times1$ . Since $\mathbf{V}^{\top}\mathbf{V}^{\top}=\mathbf{I}$ , minimizing Equation (3) is equivalent to minimizing Equation (2).

其中我们在最后一步使用了迹技巧 (trace-trick),因为标量也可以被视为大小为 $1\times1$ 的矩阵。由于 $\mathbf{V}^{\top}\mathbf{V}^{\top}=\mathbf{I}$,最小化方程 (3) 等价于最小化方程 (2)。

The above equation can be further written as:

上述等式可进一步表示为:
图片.png

where $begin{array}{r}{mathbf{S}{w}=sum{i=1}^{k}sum{mathbf{h}in mathcal{C}{i}}left(mathbf{h}-pmb{mu}{i}right)(mathbf{h}-pmb{mu}{i})^{top}}end{array}$ is the withinclass scatter matrix of K-means.

其中 $begin{array}{r}{mathbf{S}{w}=sum{i=1}^{k}sum{mathbf{h}in mathcal{C}{i}}left(mathbf{h}-pmb{mu}{i}right)(mathbf{h}-pmb{mu}{i})^{top}}end{array}$ 是K均值算法的类内散布矩阵。

Since $mathbf{V}$ is an ortho normal matrix, minimizing Equation (4) is a standard trace minimization problem. A version of the Rayleigh-Ritz theorem [25] indicates that the solution $mathbf{V}$ contains the ei gen vectors of $mathbf{S}{w}$ with ascending eigenvalues. The eigenvalues indicate the importance of the ei gen vectors’ contributions to the cluster structures in the transformed space $mathbf{Y}=mathbf{VH}$ . The smaller the eigenvalue, the more important its corresponding ei gen vector contributes to the cluster structures in the transformed space $mathbf{Y}$ . We should note that $mathbf{S}{w}$ is symmetric and therefore it is orthogonal ly diagonal iz able. So it is feasible to find the ortho normal matrix $mathbf{V}$ .

由于 $mathbf{V}$ 是一个正交归一化矩阵,最小化方程(4)是一个标准的迹最小化问题。Rayleigh-Ritz定理[25]的一个版本表明,解 $mathbf{V}$ 包含 $mathbf{S}{w}$ 的特征向量,其特征值按升序排列。这些特征值表示特征向量对变换空间 $mathbf{Y}=mathbf{VH}$ 中聚类结构贡献的重要性。特征值越小,其对应的特征向量对变换空间 $mathbf{Y}$ 中聚类结构的贡献就越重要。需要注意的是,$mathbf{S}{w}$ 是对称矩阵,因此可以进行正交对角化。因此,找到正交归一化矩阵 $mathbf{V}$ 是可行的。

C. Optimizing Representation

C. 优化表示

As discussed above, minimizing Equation (3) equals to minimizing Equation (2). We can first perform K-means in the embedding space $mathbf{H}$ to get $mathbf{S}{w}$ , and then ei gen decompose $mathbf{S}{w}$ to get $mathbf{V}$ . Finally, we transform the embedding space to a new space $mathbf{Y}$ that reveals the cluster-structure information. We also know the importance of each dimension of $mathbf{Y}$ in terms of the cluster-structure information, i.e., the last dimension has the least cluster-structure information.

如上所述,最小化方程(3)等价于最小化方程(2)。我们首先在嵌入空间$mathbf{H}$中执行K-means聚类得到$mathbf{S}{w}$,然后对$mathbf{S}{w}$进行特征分解获得$mathbf{V}$。最终将嵌入空间转换至能揭示聚类结构信息的新空间$mathbf{Y}$。同时可知$mathbf{Y}$各维度对聚类结构信息的重要性排序,即最后一维所含聚类结构信息最少。

We can rewrite Equation (3) as follows:

我们可以将方程 (3) 改写如下:

$$
\operatorname*{min}L_{3}=\sum_{i=1}^{k}\sum_{\mathbf{y}\in{\mathcal{C}}_{i}}\left|\mathbf{y}-\mathbf{m}_{i}\right|^{2}
$$

$$
\operatorname*{min}L_{3}=\sum_{i=1}^{k}\sum_{\mathbf{y}\in{\mathcal{C}}_{i}}\left|\mathbf{y}-\mathbf{m}_{i}\right|^{2}
$$

where $\mathbf{y}=\mathbf{V}\mathbf{h}$ and $\mathbf{m}_{i}=\mathbf{V}\pmb{\mu}_{i}$

其中 $\mathbf{y}=\mathbf{V}\mathbf{h}$ 且 $\mathbf{m}_{i}=\mathbf{V}\pmb{\mu}_{i}$

Now the question is how to optimize the representation to improve the cluster-structure information in $\mathbf{Y}$ . In this paper, we measure the cluster-structure information by entropy. The lower the entropy of data, the higher the cluster-structure information it contains. For example, Figure 2(a) shows 400 data points that follow a uniform distribution in the range [0, 10]. The entropy is − Pi4=001 4100 ln � 4100  = 5.992 bits. Compared with Figure 2(a), there are two Gaussian clusters in Figure 2(b), each of which has 200 data points. The two Gaussian clusters have the same variance of one. The two Gaussian clusters in Figure 2(c) have the same variance of 0.5, each of which also has 200 data points. Note that the entropy of a $d$ - dimensional Gaussian distribution is $\begin{array}{r l}{\small}&{{}\frac{d}{2}\ln\left(2\pi e\sigma_{1}^{2}\sigma_{2}^{2}\cdot\cdot\cdot\sigma_{n}^{2}\right)^{1/d}}\end{array}$ as shown in [29]. Thus, the entropy of data in Figure 2(b) is 1.419 bits, and that of data in Figure $2(\mathrm{c})$ is 0.033 bits. From Figure 2, we have two observations: 1) Compared with data that has no cluster-structure information, data that contains the cluster-structure information has lower entropy. 2) Reducing the variance of a Gaussian cluster will reduce its entropy. Also note that the clusters found by $\mathbf{K}$ -means follow isotropic Gaussian distributions. Based on these, we propose a new objective to optimize the representation to increase the clusterstructure information. Specifically, we would like to move the data points in clusters found by K-means closer to their centroids. This strategy is also in consistent with Equation (2), which is further minimized.

现在的问题是如何优化表示以增强 $\mathbf{Y}$ 中的聚类结构信息。本文采用熵作为聚类结构信息的度量指标,数据熵值越低则蕴含的聚类结构信息越强。例如图 2(a) 展示了400个在[0,10]范围内均匀分布的数据点,其熵值为− Pi4=001 4100 ln � 4100  = 5.992比特。与之相比,图2(b)中存在两个方差均为1的高斯簇(各含200个数据点),图2(c)则展示方差均为0.5的两个高斯簇(各含200个数据点)。根据文献[29],$d$维高斯分布的熵值为$\begin{array}{r l}{\small}&{{}\frac{d}{2}\ln\left(2\pi e\sigma_{1}^{2}\sigma_{2}^{2}\cdot\cdot\cdot\sigma_{n}^{2}\right)^{1/d}}\end{array}$。因此图2(b)数据熵为1.419比特,图$2(\mathrm{c})$数据熵为0.033比特。通过图2可得到两个观察结论:1) 相较于无聚类结构的数据,具有聚类结构的数据表现出更低的熵值;2) 减小高斯簇的方差会降低其熵值。值得注意的是,$\mathbf{K}$-均值算法发现的簇遵循各向同性高斯分布。基于此,我们提出通过优化表示来增强聚类结构信息的新目标:推动K-means发现的簇内数据点向其质心靠拢,该策略也与式(2)的进一步最小化目标相一致。

Instead of moving data points closer to their respective centroids in each dimension of $\mathbf{Y}$ , we propose a greedy method that only moves data points closer to their respective centroids in the last dimension of Y. In this way, the representation will be easily optimized to increase the cluster-structure information. We found in our experiments that this greedy method performs the best. The details of the greedy method is as follows: We first replicate y as $\mathbf{y}^{\prime}$ and then replace the last dimension of $\mathbf{y}^{\prime}$ with the last dimension of $\mathbf{m}_{i}$ . Finally, The objective function is defined as follows:

我们提出一种贪婪方法,不再将数据点向$\mathbf{Y}$各维度上的各自质心移动,而是仅使其在Y的最后一个维度上靠近所属质心。这种方法能更高效地优化表征以增强聚类结构信息。实验表明该贪婪方法性能最优,具体实现如下:先复制y得到$\mathbf{y}^{\prime}$,再将$\mathbf{y}^{\prime}$最后一维替换为$\mathbf{m}_{i}$的最后一维。最终目标函数定义为:


Fig. 2. Demonstration of the relationships between entropy and cluster-structure information. The lower the entropy of data, the higher the cluster-structure information it contains. Entropy decreases from (a) to (c), while cluster-structure information increases from (a) to (c).

图 2: 熵与聚类结构信息关系的示意图。数据熵值越低,其包含的聚类结构信息越高。从(a)到(c)熵值逐渐降低,而聚类结构信息则从(a)到(c)逐步增加。

$$
\operatorname*{min}{L_{4}}=\sum_{i=1}^{k}\sum_{\mathbf{y}\in\mathcal{C}_{i}}\left|\mathbf{y}-\mathbf{y}^{\prime}\right|^{2}
$$

$$
\operatorname*{min}{L_{4}}=\sum_{i=1}^{k}\sum_{\mathbf{y}\in\mathcal{C}_{i}}\left|\mathbf{y}-\mathbf{y}^{\prime}\right|^{2}
$$

We do not use a full-batch updating strategy that moves all the data points closer to their centroids in the last dimension. Instead, we use a mini-batch updating strategy that moves only some mini-batches of data points closer to their centroids. We find that the mini-batch updating strategy is superior to the full-batch updating strategy. After optimizing the representation, we have a new embedding space $\mathbf{H}$ . Then, we perform K-means again to find clusters and their respective centroids. We alternately repeat the second and third steps until some criterion is met, such as the predefined number of iterations or there are less than $0.1%$ of samples that change their clustering assignments between two consecutive iterations. The pseudo-code of DEKM is shown in Algorithm 1. Line 1 uses Equation (1) to train the auto encoder. Line 3 uses the encoder to generate an embedding space $\mathbf{H}={\mathbf{\nabla}}f(\mathbf{X})$ . Then in the embedding space, line 4 performs K-means to find clusters. Lines 5–6 compute the within-class scatter matrix $\mathbf{S}_{w}$ and ei gen decompose it to get the ortho normal transformation matrix V. Line 7 uses Equation (6) to optimize the representation. We repeat the process for a number Iter of iterations and return the final cluster set $\mathcal{C}$ .

我们并未采用全批次更新策略(即在最后一维将所有数据点向其聚类中心移动),而是采用小批次更新策略(仅移动部分小批次数据点)。实验表明小批次策略优于全批次策略。优化表征后,我们获得新的嵌入空间 $\mathbf{H}$,随后再次执行K-means算法寻找聚类及其中心点。第二、三步交替迭代直至满足终止条件(如达到预设迭代次数,或连续两次迭代间聚类分配变化的样本数少于 $0.1%$)。DEKM算法的伪代码如算法1所示:第1行使用式(1)训练自编码器;第3行通过编码器生成嵌入空间 $\mathbf{H}={\mathbf{\nabla}}f(\mathbf{X})$;第4行在嵌入空间执行K-means聚类;5-6行计算类内散布矩阵 $\mathbf{S}_{w}$ 并通过特征分解获得正交变换矩阵V;第7行利用式(6)优化表征。该过程重复Iter次后返回最终聚类集合 $\mathcal{C}$。

IV. EXPERIMENTAL EVALUATION

IV. 实验评估

A. Datasets

A. 数据集

To evaluate the performance and generality of DEKM, we conduct experiments on benchmark datasets and compare with state-of-the-art methods. In order to show that DEKM

为了评估 DEKM 的性能和泛化能力,我们在基准数据集上进行了实验,并与最先进方法进行了对比。为了证明 DEKM

Algorithm 1: DEKM

算法 1: DEKM

Input: Data matrix $mathbf{X}$ , the number of clusters $k$ , and the maximum number of iterations Iter Output: Cluster set $mathcal{C}={mathcal{C}{1},ldots,mathcal{C}{k}}$ 1 Using Equation (1) to train the auto encoder; 2 for $igets1$ to Iter do 3 Generate an embedding space by the encoder $mathbf{H}=f(mathbf{X})$ ; 4 $(boldsymbol{mathscr{C}},boldsymbol{mathscr{U}})gets mathbf{Kmeans}left(mathbf{H},kright);$ 5 Compute the within-class scatter matrix $begin{array}{r l}{quad{bf S}{w}getssum{i=1}^{k}sum{{bf h}in{mathcal C}{i}}left({bf h}-pmb{mu}{i}right)({bf h}-pmb{mu}{i})^{top};} {quad{bf V}gets mathrm{bfEig}left({bf S}{w}right);}end{array}$ ; 6 ins the ei gen vectors sorted in ascending order w.r.t. their eigenvalues $star$ Using Equation (6) to optimize the representation; 8 end 9 return

输入:数据矩阵 $mathbf{X}$,聚类数量 $k$,最大迭代次数 Iter
输出:聚类集合 $mathcal{C}={mathcal{C}{1},ldots,mathcal{C}{k}}$
1 使用方程(1)训练自编码器;
2 for $igets1$ to Iter do
3 通过编码器生成嵌入空间 $mathbf{H}=f(\mathbf{X})$;
4 $(boldsymbol{mathscr{C}},boldsymbol{mathscr{U}})gets mathbf{Kmeans}left(mathbf{H},k right);$
5 计算类内散布矩阵 $begin{array}{r l}{quad{bf S}{w}gets um{i=1}^{k}sum{{bf h}in{mathcal C}{i}}left({bf h}-pmb{mu}{i}right)({bf h}-pmb{mu}{i})^{top};} {quad{bf V}get smathrm{bfEig}left({bf S}{w}right);}end{array}$;
6 按特征值升序排列特征向量 $star$
7 使用方程(6)优化表示;
8 end
9 return

works well on various datasets, we choose four image datasets that cover domains such as handwritten digits, objects, and human faces and three text datasets. Table II provides a brief description for each dataset. And some examples in the image datasets are shown in Figure 3.

在多种数据集上表现良好,我们选择了涵盖手写数字、物体和人脸等领域的四个图像数据集以及三个文本数据集。表II简要描述了每个数据集。图像数据集中的部分示例如图3所示。

MNIST [20] is a dataset that consists of 70,000 hand-written grayscale digit images. Each image is of size $28\times28$ pixels. USPS is a dataset of handwritten grayscale digit images from the USPS postal service, containing 9,298 images of size $16\times16$ pixels. COIL-20 [30] is a dataset containing 1,440 color images of 20 objects (72 images per object). The objects have a wide variety of complex geometric and reflectance characteristics. The size of each image is resized to $28\times28$ pixels. FRGC is a human face dataset. Following [44], we randomly select 20 subjects from the original dataset and collect their 2,462 face images. Similarly, we crop the face regions and resize them into $28\times28$ pixels. For all the image datasets, each image is normalized by scaling between 0 and 1. REUTERS-10K [41] contains a random subset of 10,000 samples from REUTERS dataset which has about 810,000 English news stories. REUTERS-10K contains four categories: corporate/industrial, government/social, markets, and economics. The 20 Newsgroups dataset (20NEWS) [19] contains 18,846 documents labeled into 20 different classes, each corresponding to a different topic. Reuters Corpus Volume I (RCV1) [22] contains 804,414 manually categorized newswire stories. Following [6], we sample from the full RCV1 collection a random subset of 10,000 documents from the largest four categories, denoted by RCV1-10K. For the three text datasets, we represent each document as a tf-idf feature vector on the 2,000 most frequently occurring word stems. And each sample $mathbf{x}{i}$ is normalized so that $begin{array}{r}{frac{1}{d}|mathbf{x}{i}{2}^{2}}end{array}$ is approximately 1, where $d$ is the dimension of the input space.

MNIST [20] 是一个包含70,000张手写灰度数字图像的数据集。每张图像尺寸为 $28 times28$ 像素。USPS数据集包含来自美国邮政服务的手写灰度数字图像,共9,298张 $16 times16$ 像素的图像。COIL-20 [30] 数据集包含20个物体的1,440张彩色图像(每个物体72张图像),这些物体具有多样化的复杂几何和反射特性。所有图像尺寸统一调整为 $28 times28$ 像素。FRGC是人脸数据集,参照[44]的方法,我们从原始数据集中随机选取20个主体的2,462张面部图像,裁剪面部区域并调整为 $28 times28$ 像素。所有图像数据集均通过0到1的缩放进行归一化处理。

REUTERS-10K [41] 是从约81万篇英文新闻组成的REUTERS数据集中随机抽取的10,000个样本子集,包含企业/工业、政府/社会、市场和经济四大类别。20 Newsgroups数据集(20NEWS) [19] 包含18,846篇文档,标记为20个不同主题类别。Reuters Corpus Volume I (RCV1) [22] 包含804,414篇人工分类的新闻稿件,参照[6]的方法,我们从RCV1全集中抽取最大四个类别的10,000篇文档随机子集,记为RCV1-10K。对于这三个文本数据集,我们将每篇文档表示为基于2000个最高频词干的tf-idf特征向量,并对每个样本 $mathbf{x}{i}$ 进行归一化处理,使得 $begin{array}{r}{frac{1}{d}|mathbf{x}{i}|{2}^{2}}end{array}$ 近似等于1,其中 $d$ 表示输入空间的维度。

TABLE II DESCRIPTION OF DATASETS

表 II 数据集描述

数据集 样本数 聚类数 维度
MNIST 70,000 10 28×28×1
USPS 9,298 10 16×16×1
COIL-20 1,440 20 28×28×1
FRGC 2,462 20 32×32×3
REUTERS-10K 10,000 4 2,000
20NEWS 18,846 20 2,000
RCV1-10K 10,000 4 2,000


Fig. 3. Some image samples, from top row to bottom row, come from MNIST, USPS, COIL-20 and FRGC datasets, respectively.

图 3: 部分图像样本示例,从上到下各行分别来自 MNIST、USPS、COIL-20 和 FRGC 数据集。

B. Baseline Methods

B. 基线方法

We compare the proposed DEKM with the following methods:

我们提出的DEKM与以下方法进行比较:

C. Experimental Settings

C. 实验设置

Since convolutional neural network (CNN) is good at capturing the semantic visual features of input images, we exploit convolutional auto encoder to find the embedding space for the image datasets. Specifically, we use three convolutional layers followed by a dense layer (embedding layer) in the encoder-to-decoder pathway. The channel numbers of the three convolutional layers are 32, 64, and 128 respectively. The kernel sizes are set to $5\times5$ , $5\times5$ , and $3\times3$ respectively. The stride of all the convolutional layers is set to two. The number of neurons in the embedding layer is set to the number of clusters of datasets. The decoder is a mirror of the encoder and the output of each layer of the decoder is appropriately zero-padded to match the input size of the corresponding encoder layer. All the intermediate layers of the convolutional auto encoder are activated by ReLU [17]. For the text datasets, we use a fully connected multilayer perceptron (MLP) for the backbone of auto encoder. Following the settings in DEC [41], the encoder has dimensions of $d\cdot$ - 500-500-2000-10, where $d$ is the dimension of the input data. The decoder is a mirror of the encoder. All the intermediate layers are activated by ReLU. The weights of all the layers are initialized by Xavier approach [9]. The Adam [16] optimizer is adopted with the initial learning rate $l=0.001$ , $\beta_{1}=0.9$ , $\beta_{2}=0.999$ . We stop the clustering process when there are less than $0.1%$ of samples that change their clustering assignments between two consecutive iterations. Our code is available at: https://github.com/spdj2271/DEKM. All the experiments are conducted on a machine with one Intel(R) Xeon(R) CPU (2.00GHz) and one Nvidia Tesla P100 GPU.

由于卷积神经网络 (CNN) 擅长捕捉输入图像的语义视觉特征,我们采用卷积自编码器为图像数据集寻找嵌入空间。具体而言,在编码器-解码器路径中使用了三个卷积层和一个全连接层 (嵌入层),三个卷积层的通道数分别为32、64和128,卷积核大小分别设置为 $5\times5$、$5\times5$ 和 $3\times3$,所有卷积层的步长设为2。嵌入层的神经元数量与数据集聚类数相同。解码器是编码器的镜像结构,其每层输出通过适当零填充以匹配对应编码器层的输入尺寸。卷积自编码器所有中间层均采用ReLU激活函数 [17]。

对于文本数据集,我们采用全连接多层感知机 (MLP) 作为自编码器主干网络。参照DEC [41] 的设置,编码器维度为 $d\cdot$ -500-500-2000-10,其中 $d$ 表示输入数据维度。解码器同样采用镜像结构,所有中间层使用ReLU激活,各层权重通过Xavier方法初始化 [9]。优化器采用Adam [16],初始学习率 $l=0.001$,参数设为 $\beta_{1}=0.9$、$\beta_{2}=0.999$。当连续两次迭代间聚类分配变化的样本数低于 $0.1%$ 时终止聚类过程。代码已开源:https://github.com/spdj2271/DEKM。所有实验均在配备Intel(R) Xeon(R) CPU (2.00GHz) 和Nvidia Tesla P100 GPU的机器上完成。

D. Evaluation Metric

D. 评估指标

To evaluate the clustering methods, we adopt two standard evaluation metrics: normalized mutual information (NMI) [39] and unsupervised clustering accuracy (ACC) [42]. Both the NMI and ACC values are in the range $[0,1]$ . The higher the values, the better the clustering results. NMI is an informationtheoretic measure, which calculates the normalized measure of similarity between the ground-truth labels and the obtained cluster assignments. NMI is defined as follows:

为评估聚类方法,我们采用两种标准评价指标:归一化互信息(NMI) [39]和无监督聚类准确率(ACC) [42]。NMI和ACC的取值范围均为$[0,1]$,数值越高表明聚类效果越好。NMI是一种信息论度量方法,通过计算真实标签与聚类结果之间的归一化相似度来衡量聚类性能,其定义如下:

$$
{\mathrm{NMI}}={\frac{2\times I({\boldsymbol{\mathcal{G}}};{\boldsymbol{\mathcal{C}}})}{H({\boldsymbol{\mathcal{G}}})+H({\boldsymbol{\mathcal{C}}})}}
$$

where $\mathcal{G}$ is the ground-truth, $\mathcal{C}$ is the cluster assignments, $I$ denotes mutual information, and $H$ denotes entropy.

其中 $\mathcal{G}$ 是真实值,$\mathcal{C}$ 是聚类分配,$I$ 表示互信息,$H$ 表示熵。

ACC measures the proportion of samples whose cluster assignments can be correctly mapped to the ground-truth labels. ACC is defined as follows:

ACC衡量的是能够正确映射到真实标签的聚类样本比例。其定义如下:
图片.png

where $mathbf{g}{i}$ is the ground-truth label of the $i$ -th data point, $\mathbf{c}{i}$ is the cluster assignment of the $i$ -th data point, $m$ ranges over all possible one-to-one mappings between ground-truth labels and cluster assignments. The mapping is based on the Hungarian algorithm [18].

其中 $\mathbf{g}{i}$ 是第 $i$ 个数据点的真实标签,$\mathbf{c}{i}$ 是第 $i$ 个数据点的聚类分配,$m$ 遍历真实标签与聚类分配之间所有可能的一一映射。该映射基于匈牙利算法 [18]。

the following strategies: 1. reducing the entropy of the last dimension of Y, 2. reducing the entropy of a random dimension of $\mathbf{Y}$ , and 3. reducing the entropy of all the dimensions of $\mathbf{Y}$ . We also compare with another two strategies: 4. reducing the entropy of a random dimension of $\mathbf{H}$ , and 5. reducing the entropy of all the dimensions of H. Note that all these strategies use the mini-batch updating strategy. Figure 4 shows the comparison results. We can see that the first strategy of entropy reduction in the last dimension of $\mathbf{Y}$ is the best. It outperforms the other four strategies with a large margin. Strategy 2 performs better than strategy 4. Strategy 3 performs similarly to strategy 5.

以下策略:1. 降低Y最后一维的熵,2. 降低$\mathbf{Y}$随机一维的熵,3. 降低$\mathbf{Y}$所有维度的熵。我们还对比了另外两种策略:4. 降低$\mathbf{H}$随机一维的熵,5. 降低H所有维度的熵。注意这些策略均采用小批量更新方式。图4展示了对比结果,可见降低$\mathbf{Y}$最后一维熵的第一策略效果最佳,显著优于其他四种策略。策略2表现优于策略4,策略3与策略5效果相近。

E. Clustering Results

E. 聚类结果

Due to random iz at ions of the weights of auto encoder and the centroids of K-means, we run DEKM on each dataset for three times and report the average results. Table III reports the clustering performances of different methods on the benchmark datasets in terms of normalized mutual information (NMI) and unsupervised clustering accuracy (ACC). For the comparison methods, we present the reported results from their original papers if they are available (marked by ∗ in Table III). For unreported results on specific datasets, we run the released code mentioned in the original papers. When the released code is not available or running it is not practical, we put dash marks $(-)$ in Table III.

由于自编码器权重和K-means质心的随机初始化,我们在每个数据集上运行DEKM三次并报告平均结果。表III展示了不同方法在基准数据集上基于归一化互信息(NMI)和无监督聚类准确率(ACC)的聚类性能。对于对比方法,若原论文提供了结果则直接引用其报告数据(表III中用∗标记)。针对特定数据集未报告的结果,我们运行原论文中提到的公开代码。若无法获取代码或运行不切实际,则在表III中用短横线$(-)$标注。

We can see from Table III that DEKM outperforms the comparison methods on most of the datasets. DEKM achieves competitive results to DCEC on COIL-20 in terms of NMI. DEKM outperforms all the comparison methods with a large margin on MNIST, 20NEWS and RCV1-10K. DEKM F uses the full-batch updating strategy to optimize the representation. Compared with DEKM, DEKM F performs slightly worse on four datasets MNIST, USPS, FRGC, and RCV1-10K. As can be seen, all the deep clustering methods perform much better than the traditional shallow clustering methods (i.e., K-means and K-means+PCA). This suggests that the embedding space generated by auto encoder is more advantageous for clustering. The performance gap between DEKM and $\mathrm{AE+K}$ -means is large, which means our representation optimization strategy is promising. Both DEKM and DCEC use convolutional autoencoders to find the embedding space for the image datasets. The performance gap between DEKM and DCEC reflects the effect of different representation optimization strategy. The representation optimization strategy of DEKM is superior to that of DCEC. Note that DCEC replaces the MLP auto encoder in DEC with a convolutional auto encoder. For the text datasets, DCEC using the MLP auto encoder is equivalent to DEC. Compared with DCEC on the text datasets, we can see that DEKM also performs better. Thus, the representation optimization strategy of DEKM works in different scenarios, making DEKM a universal clustering framework.

从表 III 可以看出,DEKM 在大多数数据集上优于对比方法。在 COIL-20 数据集上,DEKM 在 NMI 指标上与 DCEC 表现相当。在 MNIST、20NEWS 和 RCV1-10K 数据集上,DEKM 显著优于所有对比方法。DEKM F 采用全批量更新策略优化表示,与 DEKM 相比,在 MNIST、USPS、FRGC 和 RCV1-10K 四个数据集上表现稍逊。显然,所有深度聚类方法的表现都远优于传统浅层聚类方法(即 K-means 和 K-means+PCA),这表明自编码器生成的嵌入空间更有利于聚类。DEKM 与 $\mathrm{AE+K}$ -means 的性能差距较大,说明我们的表示优化策略具有优势。DEKM 和 DCEC 都使用卷积自编码器为图像数据集寻找嵌入空间,两者性能差距反映了不同表示优化策略的效果——DEKM 的策略优于 DCEC。值得注意的是,DCEC 将 DEC 中的 MLP 自编码器替换为卷积自编码器。对于文本数据集,使用 MLP 自编码器的 DCEC 等同于 DEC。与文本数据集上的 DCEC 相比,DEKM 表现更优,这表明 DEKM 的表示优化策略适用于不同场景,使其成为通用聚类框架。

F. Representation Optimization Strategy

F. 表征优化策略

We examine the effects of several representation optimization strategies on the MNIST dataset. Specifically, we compare

我们研究了多种表征优化策略在MNIST数据集上的效果。具体而言,我们比较了


Fig. 4. The clustering performances of the different representation optimization strategies on MNIST dataset.

图 4: 不同表征优化策略在MNIST数据集上的聚类性能。

G. Embedding Space Comparison

G. 嵌入空间比较

Figure 5 illustrates the t-SNE [38] visualization of the embedding spaces of different algorithms on MNIST dataset. Figure 5(a) demonstrates the embedding space of PCA. Figure 5(b) illustrates the embedding space of a convolutional auto encoder, which is the initial embedding space for DEKM. Figure 5(c) shows the embedding space of DEC. And Figure 5(d) shows the embedding space of DEKM. Note that all these embedding spaces are used to get the clustering results in Table III. Compared with the clusters in the initial embedding space (as shown in Figure 5(b)) of the convolutional auto encoder, the clusters in the embedding space (as shown in Figure 5(d)) of DEKM are more focused and isotropic, which are good for K-means. The two clusters in the embedding space of DEC are mixed, which leads to a lower performance compared with DEKM. The clusters in the embedding space of PCA are not isotropic Gaussian clusters, which is the reason why K-means does not perform well.

图5展示了不同算法在MNIST数据集上嵌入空间的t-SNE [38]可视化结果。图5(a)展示了PCA的嵌入空间。图5(b)展示了卷积自编码器的嵌入空间,这是DEKM的初始嵌入空间。图5(c)展示了DEC的嵌入空间,而图5(d)展示了DEKM的嵌入空间。请注意,所有这些嵌入空间都用于获取表III中的聚类结果。与卷积自编码器初始嵌入空间(如图5(b)所示)中的聚类相比,DEKM嵌入空间(如图5(d)所示)中的聚类更加集中且各向同性,这对K-means算法非常有利。DEC嵌入空间中的两个聚类相互混合,导致其性能低于DEKM。PCA嵌入空间中的聚类不是各向同性的高斯聚类,这就是K-means表现不佳的原因。

V. CONCLUSION

V. 结论

In this paper, we have proposed a new deep clustering method called DEKM that alternately learns the deep embedding space and finds clusters inside. First, we train an autoencoder to generate an embedding space. Then in this embedding space we find clusters by K-means. We ei gen decompose the within-class scatter matrix of K-means to get an ortho normal transformation matrix, which is further used to transform the embedding space to a new space that reveals the clusterstructure information. Each row vector of the ortho normal transformation matrix is the ei gen vector of the within-class scatter matrix. The eigenvalue indicates the importance of the corresponding ei gen vector’s contribution to the clusterstructure information in the new space. Finally, we propose a greedy method to optimize the representation to increase the cluster-structure information in the new space. We optimize DEKM in an alternating way. Experimental results show that DEKM achieves superior performances compared to the baselines. In the future, we would like to extend DEKM to find non-redundant clusters in multiple independent subspaces.

本文提出了一种名为DEKM的新型深度聚类方法,该方法交替学习深度嵌入空间并在其中发现聚类。首先,我们训练自编码器生成嵌入空间,然后在该嵌入空间中使用K-means算法寻找聚类。通过对K-means类内散布矩阵进行特征分解,我们获得正交变换矩阵,进而将嵌入空间转换至能揭示聚类结构信息的新空间。该正交变换矩阵的每个行向量都是类内散布矩阵的特征向量,其特征值表征了对应特征向量在新空间中对聚类结构信息贡献的重要性。最后,我们提出一种贪心算法来优化表征,以增强新空间中的聚类结构信息。DEKM采用交替方式进行优化,实验结果表明其性能优于基线方法。未来我们将扩展DEKM以在多个独立子空间中寻找非冗余聚类。

TABLE III CLUSTERING RESULTS OF DIFFERENT ALGORITHMS IN TERMS OF THE UNSUPERVISED CLUSTERING ACCURACY ${\bf A C C}%$ ) AND NORMALIZED MUTUAL INFORMATION $(\mathbf{NMI}%$ ). THE RESULTS MARKED BY $^*$ COME FROM THE ORIGINAL PAPERS. − DENOTES THAT THE RESULT IS NOT AVAILABLE. DEKM F MEANS THAT DEKM USES THE FULL-BATCH UPDATING STRATEGY.

表 III 不同算法在无监督聚类准确率 ${\bf ACC}%$ ) 和归一化互信息 $(\mathbf{NMI}%$ ) 上的聚类结果。标记为 $^*$ 的结果来自原论文。− 表示结果不可用。DEKM_F 表示 DEKM 使用全批量更新策略。

方法 MNIST ACC MNIST NMI USPS ACC USPS NMI COIL-20 ACC COIL-20 NMI FRGC ACC FRGC NMI REUTERS-10K ACC REUTERS-10K NMI 20NEWS ACC 20NEWS NMI RCV1-10K ACC RCV1-10K NMI
K-means 53.40 50.00 66.81 62.62 25.28 67.17 12.63 18.27 54.05 41.91 21.59 19.70 51.96 31.35
PCA+K-means 53.13 49.91 54.20 60.96 26.53 65.34 12.55 15.19 53.97 41.35 21.75 20.46 51.91 31.39
AE+K-means 85.47 80.53 74.84 74.16 68.82 79.56 33.31 42.81 70.52 49.02 40.24 37.60 64.11 41.12
DEC 84.30* 83.72* 73.68 75.29* 65.42* 80.49* 37.10* 44.60* 72.17* 54.87 35.65 43.79 66.81 45.24
DCEC 87.16 85.92 78.35 81.34 71.88 82.24 33.40 41.50 72.17 54.87 35.65 43.79 66.81 45.24
IDEC 88.06* 86.72* 76.05* 78.46* 75.64* 49.81* 40.50 38.20 59.50 34.70
DCN 84.00 80.00 67.00 67.00 74.00 81.00 44.00* 48.00* 56.70 31.60
DKM 84.00* 79.60 75.70 77.60* 51.20* 46.70* 58.30* 33.10*
DEKM 95.75 91.06 79.75 82.23 69.03 80.06 38.59 50.78 76.28 59.06 41.08 40.27 67.15 46.18
DEKM_F 94.23 90.54 78.87 80.51 72.62 81.97 37.29 49.32 76.42 59.99 41.20 43.76 62.12 35.98


Fig. 5. The t-SNE visualization of the embedding spaces of different algorithms on MNIST dataset.

图 5: MNIST数据集上不同算法的嵌入空间t-SNE可视化。

ACKNOWLEDGMENT

致谢

The authors would like to thank anonymous reviewers for their constructive and helpful comments. This work was supported partially by the National Key Research and Development Program of China (Project No. 2020 Y FD 1100603), the National Natural Science Foundation of China (NSFC) (Project No. 62176184) and the Fundamental Research Funds for the Central Universities.

作者感谢匿名评审提出的建设性意见。本研究部分由国家重点研发计划(项目编号2020YFD1100603)、国家自然科学基金(项目编号62176184)和中央高校基本科研业务费专项资金资助。

REFERENCES

参考文献

阅读全文(20积分)