[论文翻译]基于概率比率割优化的深度聚类


原文地址:https://arxiv.org/pdf/2502.03405


Deep Clustering via Probabilistic Ratio-Cut Optimization

基于概率比率割优化的深度聚类

Abstract

摘要

were even shown to be fundamentally equivalent in specific settings (Dhillon et al., 2004).

在特定场景下甚至被证明具有根本等价性 (Dhillon et al., 2004)。

We propose a novel approach for optimizing the graph ratio-cut by modeling the binary assignments as random variables. We provide an upper bound on the expected ratio-cut, as well as an unbiased estimate of its gradient, to learn the parameters of the assignment variables in an online setting. The clustering resulting from our probabilistic approach (PRCut) outperforms the Rayleigh quotient relaxation of the combinatorial problem, its online learning extensions, and several widely used methods. We demonstrate that the PRCut clustering closely aligns with the similarity measure and can perform as well as a supervised classifier when label-based similarities are provided. This novel approach can leverage out-of-the-box self-supervised represent at ions to achieve competitive performance and serve as an evaluation method for the quality of these representations.

我们提出了一种通过将二元分配建模为随机变量来优化图比率割的新方法。我们给出了期望比率割的上界及其梯度的无偏估计,以便在线学习分配变量的参数。我们的概率方法(PRCut)产生的聚类效果优于组合问题的瑞利商松弛、其在线学习扩展以及几种广泛使用的方法。我们证明,PRCut聚类与相似性度量高度一致,并且在提供基于标签的相似性时,其表现可与监督分类器相媲美。这一新方法能够利用现成的自监督表示来实现有竞争力的性能,并可作为评估这些表示质量的方法。

The recent advances in generative models and selfsupervised representation learning have produced powerful embeddings that capture the similarity between the original data samples. This makes leveraging these similarity measures to achieve effective clustering more relevant than ever, as it can serve as pseudo-labels for pre-training class if i ers or eliminate the need for a decoder (Ji et al., 2021) in the training of these generative models. Therefore, it is highly advantageous to develop a method that can transform the similarity information into clustering of equal quality. An efficient extension of spectral clustering to stochastic gradient descent would facilitate the conversion of learned embeddings to cluster assignments without relying on the full dataset or large batches to accurately approximate the graph structure of the embedding space.

生成式模型(Generative AI)和自监督表征学习的最新进展催生了能够捕捉原始数据样本间相似性的强大嵌入表示。这使利用相似性度量实现高效聚类变得前所未有的重要——它既可作为预训练分类器的伪标签,又能在生成模型训练中省去解码器(Ji et al., 2021)。因此,开发能将相似度信息转化为同等质量聚类的方法具有显著优势。若能将谱聚类(spectral clustering)高效扩展至随机梯度下降框架,即可在不依赖完整数据集或大批量样本的情况下,将学习到的嵌入表示转化为聚类分配,从而准确逼近嵌入空间的图结构。

1 INTRODUCTION

1 引言

Unsupervised learning is based on the premise that labels are not necessary for the training process, particularly for clustering tasks where samples that are highly similar are grouped together. Various clustering algorithms (Ezugwu et al., 2022) have been proposed in the context of machine learning and data mining. The K-Means algorithm (Lloyd, 1957) and ratio-cut partitioning (Hagen & Kahng, 1991) were among the first approaches to address the clustering problem. These methods were further refined and extended beyond binary partitioning and Euclidean distances, and they

无监督学习基于训练过程不需要标签的前提,尤其适用于将高度相似的样本聚类的任务。在机器学习和数据挖掘领域,已提出多种聚类算法 [Ezugwu et al., 2022]。K-Means算法 [Lloyd, 1957] 和比率切割分区 [Hagen & Kahng, 1991] 是最早解决聚类问题的方法之一。这些方法后来被进一步改进和扩展,超越了二值划分和欧氏距离的限制。

Several clustering methods have been applied to data streams to provide weak signals for the pre-training of neural networks. Meanwhile, other approaches have been proposed to utilize deep learning specifically for clustering, particularly with auto encoders and generative neural networks (Jiang et al., 2017). Generally, these methods can be categorized into contrastive learning (Li et al., 2020), where samples are grouped based on pairwise distances in a large batch, or generative models such as variation al auto encoders that use a prior with an auxiliary variable as the cluster assignment. The former methods do not fully capture the global structure of the clusters, while the latter may suffer from over fitting the prior due to the complexity of the neural network, with limited options to prevent this without relying on labels. Spectral clustering avoids these issues by clustering the data based on global similarities between clusters rather than just pairs of samples.

多种聚类方法已被应用于数据流中,为神经网络的预训练提供弱信号。与此同时,其他研究提出了专门利用深度学习进行聚类的方法,特别是通过自编码器和生成式神经网络(Jiang et al., 2017)。这些方法通常可分为对比学习(Li et al., 2020)——基于大批量样本间的成对距离进行分组,以及生成模型(如变分自编码器)——使用带有辅助变量的先验分布作为聚类分配依据。前者未能充分捕捉簇的全局结构,后者则可能因神经网络复杂性导致先验过拟合,且在不依赖标签的情况下缺乏有效的预防手段。谱聚类通过基于簇间全局相似性(而非仅样本对)进行数据聚类,规避了上述问题。

Spectral clustering has long resisted attempts to extend it to parametric learning, primarily due to the challenge of handling the spectral decomposition of large matrices. Since it is based on the spectral decomposition of the relaxed ratio-cut rather than the combinatorial version, it requires the projection of the data into a principal eigenspace. A clustering algorithm is then applied to these projections, further making the final performance dependent on the chosen clustering algorithm.

谱聚类长期以来难以扩展到参数化学习,主要因为处理大矩阵谱分解的挑战。该方法基于松弛比率割的谱分解而非组合版本,需要将数据投影到主特征空间。随后对这些投影应用聚类算法,使得最终性能依赖于所选聚类算法。

In this paper, we develop a novel method to optimize the ratio-cut without relying on the spectral decomposition of the Laplacian matrix by employing a probabilistic approach that treats the cluster assignment as random variables. Furthermore, we utilize neural networks to parameter ize the assignment probabilities and address the clustering problem in an online manner using stochastic gradient descent. The result is an online learning algorithm that achieves a better ratio-cut objective than the memory-intensive spectral method applied to the full graph Laplacian. We also demonstrate that our approach achieves comparable performance to a supervised classifier when utilizing supervised similarity (i.e., two samples are considered similar if they share the same label). This showcases a high fidelity between the resulting clustering and the given similarity measure. The proposed algorithm can also serve as a drop-in replacement for any other clustering method used in the pre-training of text or speech transformers, which opens up several opportunities for enhancing the effectiveness of pre-training for downstream tasks.

本文提出了一种新颖的方法,通过采用概率化处理聚类分配为随机变量的策略,在不依赖拉普拉斯矩阵谱分解的情况下优化比率切割 (ratio-cut) 。此外,我们利用神经网络对分配概率进行参数化,并通过随机梯度下降以在线方式解决聚类问题。由此产生的在线学习算法相比应用于完整图拉普拉斯矩阵的内存密集型谱方法,能获得更优的比率切割目标值。实验还表明,当采用监督相似度(即共享相同标签的样本被视为相似)时,本方法能达到与监督分类器相当的性能,这印证了所得聚类与给定相似度度量之间的高度一致性。该算法还可直接替代文本或语音Transformer预训练中使用的其他聚类方法,为提升下游任务预训练效果提供了多种可能性。

2 BACKGROUND

2 背景

In this section, we succinctly present the relevant elements related to the notion of graph ratio-cut. The curious reader may refer to von Luxburg (2007) for a more detailed account of spectral clustering and other types of graph cuts.

在本节中,我们简要介绍与图比率割 (graph ratio-cut) 概念相关的要素。感兴趣的读者可以参考 von Luxburg (2007) 以获取关于谱聚类和其他类型图割的更详细说明。

Let $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{K})$ be an undirected weighted graph where $\mathcal{V}\overset{\mathrm{def}}{=}{v_{i}|1\leq i\leq n}\subset\mathbb{R}^{p}$ is the set of vertices, $\mathcal{E}\subset\mathcal{V}\times\mathcal{V}$ is the set of edges $e_{i j}$ linking vertices $v_{i}$ and $v_{j}$ , and $K:\mathcal{V}\times\mathcal{V}\rightarrow\mathbb{R}^{+}$ is a symmetric nonnegative kernel. Let $W$ be the symmetric $n\times n$ adjacency matrix where $W_{i j}=\mathscr{K}(v_{i},v_{j})$ . The degree of the vertex $v_{i}$ is $\begin{array}{r}{d_{i}=\sum_{j}W_{i j}}\end{array}$ , and the degree matrix D d=ef diag (d1, . . . , dn).

设 $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{K})$ 为一个无向加权图,其中 $\mathcal{V}\overset{\mathrm{def}}{=}{v_{i}|1\leq i\leq n}\subset\mathbb{R}^{p}$ 是顶点集,$\mathcal{E}\subset\mathcal{V}\times\mathcal{V}$ 是连接顶点 $v_{i}$ 和 $v_{j}$ 的边集 $e_{i j}$,$K:\mathcal{V}\times\mathcal{V}\rightarrow\mathbb{R}^{+}$ 是对称非负核函数。设 $W$ 为对称的 $n\times n$ 邻接矩阵,其中 $W_{i j}=\mathscr{K}(v_{i},v_{j})$。顶点 $v_{i}$ 的度为 $\begin{array}{r}{d_{i}=\sum_{j}W_{i j}}\end{array}$,度矩阵定义为 $D \overset{\mathrm{def}}{=} \text{diag}(d_1, ..., d_n)$。

Let $k\geq2$ and ${\mathcal{C}}{k}={\mathbb{C}{\ell}|1\leq\ell\leq k}$ a partitioning of the graph $\mathcal{G}$ into $k$ disjoint clusters. We shall represent the subset $\mathbb{C}{\ell}\subset\mathcal{V}$ using the binary assignment vector where $\mathbf{1}{\mathbb{C}{\ell}}(i)=1$ if and only if $v_{i}\in\mathbb{C}{\ell}$ . The size of $\mathbb{C}{\ell}$ is measured using its cardinality $\left\lvert\mathbb{C}{\ell}\right\rvert=$ in=1 1Cℓ(i), and we denote by f (ℓ) d=ef $\begin{array}{r}{\pmb{f}^{(\ell)}\stackrel{\mathrm{def}}{=}\frac{1}{\sqrt{|\mathbb{C}{\ell}|}}\mathbf{1}{\mathbb{C}{\ell}}}\end{array}$ ratio assignment when $|\mathbb{C}_{\ell}|>0$ .

设 $k\geq2$ 且 ${\mathcal{C}}{k}={\mathbb{C}{\ell}|1\leq\ell\leq k}$ 为图 $\mathcal{G}$ 划分为 $k$ 个不相交簇的集合。我们用二元分配向量 表示子集 $\mathbb{C}{\ell}\subset\mathcal{V}$,其中 $\mathbf{1}{\mathbb{C}{\ell}}(i)=1$ 当且仅当 $v_{i}\in\mathbb{C}{\ell}$。$\mathbb{C}{\ell}$ 的大小由其基数 $\left\lvert\mathbb{C}{\ell}\right\rvert=\sum_{i=1}^{n}\mathbf{1}{\mathbb{C}{\ell}}(i)$ 衡量,并定义归一化分配向量 $\begin{array}{r}{\pmb{f}^{(\ell)}\stackrel{\mathrm{def}}{=}\frac{1}{\sqrt{|\mathbb{C}{\ell}|}}\mathbf{1}{\mathbb{C}{\ell}}}\end{array}$,其中 $|\mathbb{C}_{\ell}|>0$。

The ratio-cut for $\mathit{\check{C}}_{k}$ is defined as:

$\mathit{\check{C}}_{k}$ 的比率割 (ratio-cut) 定义为:

$$
\begin{array}{l}{{\displaystyle{\mathrm{RatioCut}}({\mathcal{C}}{k})\stackrel{\mathrm{def}}{=}\frac{1}{2}\sum_{\ell=1}^{k}\frac{1}{|\mathbb{C}{\ell}|}\sum_{\substack{i,j\in\mathbb{C}{\ell}\times\overline{{\mathbb{C}{\ell}}}}}W_{i j}}}\ {~=\displaystyle{\frac{1}{2}\sum_{\ell=1}^{k}\frac{1}{|\mathbb{C}{\ell}|}{\mathbf{1}}{\mathbb{C}{\ell}}^{\top}}W\left({\mathbf{1}}{n}-{\mathbf{1}}{\mathbb{C}_{\ell}}\right),}\end{array}
$$

$$
\begin{array}{l}{{\displaystyle{\mathrm{RatioCut}}({\mathcal{C}}{k})\stackrel{\mathrm{def}}{=}\frac{1}{2}\sum_{\ell=1}^{k}\frac{1}{|\mathbb{C}{\ell}|}\sum_{\substack{i,j\in\mathbb{C}{\ell}\times\overline{{\mathbb{C}{\ell}}}}}W_{i j}}}\ {~=\displaystyle{\frac{1}{2}\sum_{\ell=1}^{k}\frac{1}{|\mathbb{C}{\ell}|}{\mathbf{1}}{\mathbb{C}{\ell}}^{\top}}W\left({\mathbf{1}}{n}-{\mathbf{1}}{\mathbb{C}_{\ell}}\right),}\end{array}
$$

where $\mathbb{\ K}$ denotes $\nu\backslash\mathbb{A}$ , the complement of $\mathbb{A}$ in $\nu$ .

其中 $\mathbb{\ K}$ 表示 $\nu\backslash\mathbb{A}$,即 $\mathbb{A}$ 在 $\nu$ 中的补集。

We define the un normalized Laplacian matrix as ${\cal L}{u n}{\stackrel{\mathrm{def}}{=}}{\cal D}-{\cal W}$ , which can be used to express the ratio-cut of $\mathit{\mathcal{C}}_{k}$ as:

我们定义未归一化的拉普拉斯矩阵为 ${\cal L}{u n}{\stackrel{\mathrm{def}}{=}}{\cal D}-{\cal W}$,它可用于将 $\mathit{\mathcal{C}}_{k}$ 的比率割表示为:

$$
\mathrm{RatioCut}(\mathcal{C}{k})=\frac{1}{2}\mathrm{Tr}\left[F_{\mathcal{C}{k}}^{\top}L_{u n}F_{\mathcal{C}_{k}}\right],
$$

$$
\mathrm{RatioCut}(\mathcal{C}{k})=\frac{1}{2}\mathrm{Tr}\left[F_{\mathcal{C}{k}}^{\top}L_{u n}F_{\mathcal{C}_{k}}\right],
$$

where $F_{{\mathcal{C}}_{k}}{\stackrel{\mathrm{def}}{=}}\left[\pmb{f}^{(1)},\dots\pmb{f}^{(k)}\right]\in\mathbb{R}^{n\times k}$ is the ratio assignments matrix.

其中 $F_{{\mathcal{C}}_{k}}{\stackrel{\mathrm{def}}{=}}\left[\pmb{f}^{(1)},\dots\pmb{f}^{(k)}\right]\in\mathbb{R}^{n\times k}$ 为比率分配矩阵。

Since the clusters should be disjoint and different from the naive partitioning ${\nu,\emptyset,\dots,\emptyset}$ , we can express these constraints as $\mathbf{}F_{:,\ell}^{\top}\mathbf{1}_{V}\neq1$ for all $1\leq\ell\leq k$ .

由于簇应当是互不相交且不同于朴素划分 ${\nu,\emptyset,\dots,\emptyset}$ ,我们可以将这些约束表示为 $\mathbf{}F_{:,\ell}^{\top}\mathbf{1}_{V}\neq1$ 对于所有 $1\leq\ell\leq k$ 。

The optimization of the ratio-cut is then equivalent to minimizing a Rayleigh quotient on ${0,1}^{n\times k}$ . Solving eq. (4) on the set ${0,1}^{n}$ is generally an NP- hard problem. The optimization is hence relaxed so that the unknown vectors are in the unit sphere . The objective is then:

比率割的优化问题等价于在 ${0,1}^{n\times k}$ 上最小化瑞利商。在集合 ${0,1}^{n}$ 上求解方程(4)通常是一个NP难问题。因此放宽优化条件,使未知向量位于单位球面 上。此时目标函数为:

$$
\begin{array}{r l}{\underset{\boldsymbol{F}}{\mathrm{minimize}}}&{\mathrm{Tr}(\boldsymbol{F}^{\top}\boldsymbol{L}\boldsymbol{F})}\ {\mathrm{subject~to}}&{\boldsymbol{F}^{\top}\boldsymbol{F}=\boldsymbol{I}{k}\mathrm{and}\boldsymbol{F}{:,j}^{\top}\mathbf{1}_{V}\neq1}\end{array}
$$

$$
\begin{array}{r l}{\underset{\boldsymbol{F}}{\mathrm{minimize}}}&{\mathrm{Tr}(\boldsymbol{F}^{\top}\boldsymbol{L}\boldsymbol{F})}\ {\mathrm{subject~to}}&{\boldsymbol{F}^{\top}\boldsymbol{F}=\boldsymbol{I}{k}\mathrm{and}\boldsymbol{F}{:,j}^{\top}\mathbf{1}_{V}\neq1}\end{array}
$$

The minimization of the Rayleigh quotient under the outlined constraints yields the $k$ smoothest eigenvectors of the Laplacian matrix (excluding the trivial first ei gen vector ${\bf1}_{V}$ ). Vertices within the same cluster are anticipated to have similar projections onto the solutions of the relaxed problem. As we ascend the Lapla- cian spectrum, the projections onto the ei gen vectors encapsulate more specific (higher frequency) features. Subsequently, the binary assignments are determined through $k$ -means clustering (Ng et al., 2001) of the relaxed problem’s solution.

在所述约束条件下最小化瑞利商(Rayleigh quotient)可得到拉普拉斯矩阵的$k$个最平滑特征向量(不包括平凡的第一特征向量${\bf1}_{V}$)。预计同一聚类中的顶点在松弛问题解上的投影具有相似性。随着沿拉普拉斯谱向上移动,特征向量上的投影会捕获更具体(更高频)的特征。随后通过松弛问题解的$k$-均值聚类(Ng et al., 2001)确定二值分配。

In this paper, we introduce a novel approach to optimizing eq. (4) that circumvents the necessity for spectral decomposition of extensive matrices or kernels, thereby sidestepping the relaxation of the problem into Euclidean space. Rather, we propose to relax the problem into an optimization over a simplex through the parameter iz ation of the cluster assignment distribution.

在本文中,我们提出了一种优化方程(4)的新方法,该方法避免了对大型矩阵或核进行谱分解的必要性,从而绕过了将问题松弛到欧几里得空间的过程。相反,我们建议通过对聚类分配分布的参数化,将问题松弛为单纯形上的优化。

3 RELATED WORK

3 相关工作

There have been many attempts to extend Spectral Clustering to streaming settings (Guha et al., 2000), infinitely countable datasets, or continuous input spaces. For instance, Yoo et al. (2016) proposed an extension to non-parametric Spectral Clustering for data streams, particularly when the kernel similarity is bilinear or can be approximated linearly. With the arrival of each new stream batch, a batch-based Singular Value Decomposition (SVD) is used to embed the stream and realign a facility set consisting of anchor points. These anchor points are then utilized to run a batch $K$ -means algorithm (Shindler & Wong, 2011) to cluster the new points.

已有许多尝试将谱聚类(Spectral Clustering)扩展到流式场景(Guha et al., 2000)、无限可数数据集或连续输入空间。例如,Yoo等人(2016)提出了一种针对数据流的非参数谱聚类扩展方法,尤其适用于核相似度为双线性或可线性近似的情况。每当接收到新的流数据批次时,该方法会使用基于批次的奇异值分解(SVD)来嵌入数据流,并重新调整由锚点组成的设施集。随后利用这些锚点运行批处理$K$均值算法(Shindler & Wong, 2011)对新数据点进行聚类。

However, research on extending Spectral Clustering to the domain of parametric learning, where the optimization variable is a mapping of the input space to the eigenspace, is limited. This is primarily due to the sensitivity of the ei gen decomposition (Alam & Bora, 2005) and the proven difficulty of ei gen decomposition in non-parametric settings when the similarity kernel takes specific forms (Mohan & Monteleoni, 2017).

然而,关于将谱聚类 (Spectral Clustering) 扩展到参数化学习领域的研究仍然有限,其中优化变量是输入空间到特征空间的映射。这主要是由于特征分解 (eigen decomposition) 的敏感性 (Alam & Bora, 2005) ,以及当相似性核采用特定形式时,在非参数化设置中特征分解已被证实的困难性 (Mohan & Monteleoni, 2017) 。

One of the early attempts in this direction was made in the context of Deep Reinforcement Learning (Kulkarni et al., 2016), where the Markov Decision Process (MDP) is decomposed into sub-tasks that navigate the representation space (Machado et al., 2017a). These sub-tasks, also called options (Machado et al., 2017b), correspond to ei gen vectors associated with the largest eigenvalues of the graph Laplacian, where the vertices represent elements of the state space and the similarity is the likelihood of the agent moving from one state to another. Extending such an approach to a clustering setting would require constructing an MDP for which the corresponden ce between ei gen vectors and options holds.

该方向的早期尝试之一出现在深度强化学习领域 (Kulkarni et al., 2016) ,其中马尔可夫决策过程 (MDP) 被分解为在表征空间中导航的子任务 (Machado et al., 2017a) 。这些子任务 (也称为选项 (Machado et al., 2017b) ) 对应于与图拉普拉斯矩阵最大特征值相关联的特征向量,其中顶点表示状态空间的元素,相似度则是智能体从一个状态转移到另一个状态的概率。将这种方法扩展到聚类场景需要构建一个MDP,以保持特征向量与选项之间的对应关系。

Spectral Nets (Shaham et al., 2018) were perhaps the first attempt at explicitly training a neural network to approximate ei gen vectors of the Laplacian for largescale clustering. The learned mapping is a neural network $N_{\theta}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{k}$ , parameterized by $\theta$ , that maps the input vertices to their projection on the subspace spanned by the $k$ -smoothest ei gen vectors of the Laplacian. The network is constructed to ensure that the output features are approximately ortho normal:

谱网络 (Spectral Nets) (Shaham et al., 2018) 可能是首次尝试显式训练神经网络来近似拉普拉斯矩阵的特征向量以实现大规模聚类。该学习映射是一个由参数 $\theta$ 定义的神经网络 $N_{\theta}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{k}$ ,它将输入顶点映射到由拉普拉斯矩阵前 $k$ 个最平滑特征向量张成的子空间上的投影。该网络的设计确保输出特征近似正交归一:

$$
{\frac{1}{n}}\sum_{i=1}^{n}N_{\theta}{(v_{i})}^{\top}N_{\theta}{(v_{i})}\approx I_{k}
$$

$$
{\frac{1}{n}}\sum_{i=1}^{n}N_{\theta}{(v_{i})}^{\top}N_{\theta}{(v_{i})}\approx I_{k}
$$

We denote by $\hat{N}_{\theta}$ the neural network without the last orthogonal iz ation layer, which is based on the Cholesky decomposition of the estimated covariance

我们用 $\hat{N}_{\theta}$ 表示不包含最后正交化层的神经网络,该层基于估计协方差的 Cholesky 分解。

matrix Σθ ∈ Rk×k:

矩阵 Σθ ∈ Rk×k:

$$
\Sigma_{\theta}=\frac{1}{n}\sum_{i=1}^{n}\hat{N_{\theta}}{(v_{i})}^{\top}\hat{N_{\theta}}{(v_{i})}.
$$

$$
\Sigma_{\theta}=\frac{1}{n}\sum_{i=1}^{n}\hat{N_{\theta}}{(v_{i})}^{\top}\hat{N_{\theta}}{(v_{i})}.
$$

Spectral Nets approach is then based on the minimization of:

谱网络方法基于以下最小化:

$$
\mathcal{L}{s p e c t r a l}(\theta)\stackrel{\mathrm{def}}{=}\sum_{i,j=1}^{n}W_{i j}\left\lVert N_{\theta}(v_{i})-N_{\theta}(v_{j})\right\rVert^{2}
$$

$$
\mathcal{L}{s p e c t r a l}(\theta)\stackrel{\mathrm{def}}{=}\sum_{i,j=1}^{n}W_{i j}\left\lVert N_{\theta}(v_{i})-N_{\theta}(v_{j})\right\rVert^{2}
$$

The matrix Πθ Rk×k serves as the parametric counterpart of ${\pmb F}^{\top}{\pmb L}{\pmb F}$ derived from eq. (4). Training the neural network equates to an alternating optimization scheme, where one batch is utilized to estimate $\Sigma_{\theta}$ , while the other is employed to minimize the loss $\mathcal{L}_{\mathit{s p e c t r a l}}(\theta)$ .

矩阵 Πθ Rk×k 作为 ${\pmb F}^{\top}{\pmb L}{\pmb F}$ 的参数化对应项,源自等式 (4)。训练神经网络相当于交替优化方案:一个批次用于估计 $\Sigma_{\theta}$,另一个批次用于最小化损失 $\mathcal{L}_{\mathit{s p e c t r a l}}(\theta)$。

During training, should the input exhibit noise, the Spectral Net’s output may converge to a constant mapping, resulting in the spectral loss approaching zero. Consequently, this approach encounters numerical instability in high-dimensional data scenarios. One common strategy to circumvent this involves offsetting the matrix $\Sigma_{\theta}$ by a scaled identity and establishing an appropriate stopping criterion. However, the resulting algorithm remains highly susceptible to input and similarity noise. This sensitivity becomes particularly evident when benchmarked against non-parametric Spectral Clustering. Spectral Nets achieve competitive performance only when the input consists of the code space from another well-performing pre-trained variational encoder (Jiang et al., 2017).

训练过程中,若输入数据存在噪声,Spectral Net 的输出可能收敛至恒定映射,导致谱损失趋近于零。因此,该方法在高维数据场景下会遭遇数值不稳定性。常见的规避策略是对矩阵 $\Sigma_{\theta}$ 施加缩放单位矩阵偏移,并设定合适的停止准则。然而,所得算法仍极易受到输入噪声和相似性噪声的影响。这种敏感性在与非参数化谱聚类 (Spectral Clustering) 的基准对比中尤为明显。仅当输入数据来自另一表现良好的预训练变分编码器 (Jiang et al., 2017) 的代码空间时,Spectral Nets 才能实现具有竞争力的性能。

Spectral Inference Networks (Spin) (Pfau et al., 2020) adopt a distinct strategy for parametric optimization of the Rayleigh quotient. As defined previously, the loss function takes the form:

谱推理网络 (Spectral Inference Networks, Spin) (Pfau等人, 2020) 采用了一种独特的参数化优化策略来处理瑞利商问题。如前所述,其损失函数形式为:

$$
\mathcal{L}{s p i n}\left(\boldsymbol{\theta}\right)=\mathrm{Tr}\left(\boldsymbol{\Sigma_{\boldsymbol{\theta}}}^{-1}\boldsymbol{\Pi_{\boldsymbol{\theta}}}\right)
$$

$$
\mathcal{L}{s p i n}\left(\boldsymbol{\theta}\right)=\mathrm{Tr}\left(\boldsymbol{\Sigma_{\boldsymbol{\theta}}}^{-1}\boldsymbol{\Pi_{\boldsymbol{\theta}}}\right)
$$

While primarily utilized within the domain of Deep Reinforcement Learning, targeting the ei gen functions associated with the largest eigenvalues, this approach can be adapted for loss minimization rather than maximization.

虽然这种方法主要应用于深度强化学习领域,目标是针对与最大特征值相关的特征函数,但它也可以调整为用于损失最小化而非最大化。

Upon differentiation (Petersen & Pedersen, 2012), the full gradient can be expressed as:

微分时 (Petersen & Pedersen, 2012),完整梯度可表示为:

$$
\operatorname{Tr}\left(\boldsymbol{\Sigma_{\theta}}^{-1}\nabla_{\theta}\Pi_{\theta}\right)-\operatorname{Tr}\left(\boldsymbol{\Sigma_{\theta}}^{-1}(\nabla_{\theta}\boldsymbol{\Sigma_{\theta}})\boldsymbol{\Sigma_{\theta}}^{-1}\Pi_{\theta}\right)
$$

$$
\operatorname{Tr}\left(\boldsymbol{\Sigma_{\theta}}^{-1}\nabla_{\theta}\Pi_{\theta}\right)-\operatorname{Tr}\left(\boldsymbol{\Sigma_{\theta}}^{-1}(\nabla_{\theta}\boldsymbol{\Sigma_{\theta}})\boldsymbol{\Sigma_{\theta}}^{-1}\Pi_{\theta}\right)
$$

The Spin algorithm estimates $\Sigma_{\theta}$ and $\nabla_{\boldsymbol{\theta}}\Sigma_{\boldsymbol{\theta}}$ utilizing moving averages. It addresses the issue of overlapping ei gen vectors by modifying the gradient to ensure sequential independence in updates: the gradients of the first $\textit{l}$ ei gen functions are made independent of gradients from other components $l+1,\ldots,k$ . This alteration significantly slows down training, in addition to the computational burden of computing $\nabla_{\boldsymbol{\theta}}\Sigma_{\boldsymbol{\theta}}$ , which necessitates calculating $k^{2}$ backward passes and storing $k^{2}\times\operatorname{size}(\theta)$ , thereby increasing memory and computational costs.

Spin算法通过移动平均来估计$\Sigma_{\theta}$和$\nabla_{\boldsymbol{\theta}}\Sigma_{\boldsymbol{\theta}}$。该算法通过修改梯度来解决特征向量重叠问题,确保更新时的顺序独立性:前$\textit{l}$个特征函数的梯度被设计为独立于其他组件$l+1,\ldots,k$的梯度。除了计算$\nabla_{\boldsymbol{\theta}}\Sigma_{\boldsymbol{\theta}}$带来的计算负担外(需要执行$k^{2}$次反向传播并存储$k^{2}\times\operatorname{size}(\theta)$大小的数据),这一改动还会显著降低训练速度,从而增加内存和计算成本。

We assert that the performance of these attempts is ultimately hindered by the difficulty of the eigendecomposition of large kernels. Therefore, we take a different approach to optimizing the graph ratio-cut that circumvents the spectral decomposition.

我们断言,这些尝试的性能最终受到大型核特征分解难度的限制。因此,我们采用了一种不同的方法来优化图比率切割(graph ratio-cut),从而规避了谱分解。

4 PROPOSED METHOD

4 提出的方法

In our probabilistic approach, the minimization of the ratio-cut is addressed differently. Instead of the deterministic assignments $\mathbf{1}{\mathbb{C}_{\ell}}$ , we use the random assignment vector $\mathbf{a}^{(\ell)}\in{0,1}^{n}$ under the assumption that $(\mathbf{a}{i}^{(\ell)}){i}$ are independent random variables such that:

在我们的概率方法中,比率切割最小化问题采用了不同的处理方式。我们不再使用确定性分配 $\mathbf{1}{\mathbb{C}{\ell}}$,而是在假设 $(\mathbf{a}{i}^{(\ell)})_{i}$ 为独立随机变量的前提下,采用随机分配向量 $\mathbf{a}^{(\ell)}\in{0,1}^{n}$,其满足以下条件:

$$
\operatorname*{Pr}\left(v_{i}\in\mathbb{C}{\ell}\right)=\operatorname*{Pr}\left(\mathbf{a}{i}^{(\ell)}=1\right){\stackrel{\mathrm{def}}{=}}P_{i,\ell},
$$

$$
\operatorname*{Pr}\left(v_{i}\in\mathbb{C}{\ell}\right)=\operatorname*{Pr}\left(\mathbf{a}{i}^{(\ell)}=1\right){\stackrel{\mathrm{def}}{=}}P_{i,\ell},
$$

where the rows of the matrix $P\in[0,1]^{n\times k}$ sum to 1: $\textstyle\sum_{\ell=1}^{k}P_{i,\ell}=1$ for all $i\in{1,\ldots,n}$ .

矩阵 $P\in[0,1]^{n\times k}$ 的行和为1:对所有 $i\in{1,\ldots,n}$ 满足 $\textstyle\sum_{\ell=1}^{k}P_{i,\ell}=1$。

The random assignment for each vertex $i$ follows a categorical distribution of parameter $P_{i,\phantom{\dagger}}$ : and the random clustering $\mathit{{C_{k}}}$ is thus parameterized by $P\in[0,1]^{n\times k}$ . We define the random ratio-assignment vector $\mathbf{f}^{(\ell)}$ for cluster $\mathbb{C}_{\ell}$ as:

每个顶点 $i$ 的随机分配遵循参数 $P_{i,\phantom{\dagger}}$ 的类别分布:随机聚类 $\mathit{{C_{k}}}$ 因此由 $P\in[0,1]^{n\times k}$ 参数化。我们将簇 $\mathbb{C}_{\ell}$ 的随机比率分配向量 $\mathbf{f}^{(\ell)}$ 定义为:

$$
\mathbf{f}^{(\ell)}={\frac{1}{\sqrt{|\mathbb{C}_{\ell}|}}}\mathbf{a}^{(\ell)},
$$

$$
\mathbf{f}^{(\ell)}={\frac{1}{\sqrt{|\mathbb{C}_{\ell}|}}}\mathbf{a}^{(\ell)},
$$

where $\begin{array}{r}{\widehat{|\mathbb{C}{\ell}|}=\sum_{i=1}^{n}\mathbf{a}_{i}^{(\ell)}}\end{array}$ , and all the elements of $\mathbf{f}^{(\ell)}$ are in thde inte rvPal $[0,1]$ with the convention $\textstyle{\frac{0}{0}}=1$ .

其中 $\begin{array}{r}{\widehat{|\mathbb{C}{\ell}|}=\sum_{i=1}^{n}\mathbf{a}_{i}^{(\ell)}}\end{array}$ ,且 $\mathbf{f}^{(\ell)}$ 的所有元素都在区间 $[0,1]$ 内,按照惯例 $\textstyle{\frac{0}{0}}=1$ 。

The stochastic counterpart of the quantity introduced earlier in Equation (1) is:

式 (1) 中引入的随机对应量为:

$$
\widehat{\mathrm{RatioCut}}(\mathcal{C}{k})=\frac{1}{2}\sum_{\ell=1}^{k}\sum_{1\leq i,j\leq n}W_{i j}(\mathbf{f}{i}^{(\ell)}-\mathbf{f}_{j}^{(\ell)})^{2}.
$$

$$
\widehat{\mathrm{RatioCut}}(\mathcal{C}{k})=\frac{1}{2}\sum_{\ell=1}^{k}\sum_{1\leq i,j\leq n}W_{i j}(\mathbf{f}{i}^{(\ell)}-\mathbf{f}_{j}^{(\ell)})^{2}.
$$

Equation (3) can also be extended to the stochastic setting similarly.

式 (3) 同样可以类似地推广到随机设定中。

The next step is to compute the expected ratio-cut for a clustering $\mathit{\check{C}}{k}$ parameterized by ${P}$ . Without loss of generality, we only have to compute $\mathbb{E}\left[(\mathbf{f}{1}^{(\ell)}-\mathbf{f}{2}^{(\ell)})^{2}\right]$ as a function of $_{P}$ due to the linearity of the expectation. We prove the following result in Appendix A.1:

下一步是计算由参数 ${P}$ 决定的聚类 $\mathit{\check{C}}{k}$ 的期望比率割。不失一般性,由于期望的线性性质,我们只需计算 $\mathbb{E}\left[(\mathbf{f}{1}^{(\ell)}-\mathbf{f}{2}^{(\ell)})^{2}\right]$ 作为 $_{P}$ 的函数。我们在附录 A.1 中证明了以下结果:

Lemma 4.1 (Difference expectation). Let $\mathbb{C}$ be a subset of $\nu$ and a its random assignment vector parameterized by $p\in[0,1]^{n}$ . Let f be its random ratioassignment vector. Then we have the following:

引理 4.1 (差分期望). 设 $\mathbb{C}$ 为 $\nu$ 的子集,其随机分配向量由参数 $p\in[0,1]^{n}$ 确定。设 f 为其随机比率分配向量,则有以下结论:

$$
\mathbb{E}\left[(\mathbf{f}{1}-\mathbf{f}{2})^{2}\right]=(p_{1}+p_{2}-2p_{1}p_{2})\mathbb{E}\left[\frac{1}{1+\widehat{\vert\mathbb{C}^{\perp(1,2)}\vert}}\right],
$$

$$
\mathbb{E}\left[(\mathbf{f}{1}-\mathbf{f}{2})^{2}\right]=(p_{1}+p_{2}-2p_{1}p_{2})\mathbb{E}\left[\frac{1}{1+\widehat{\vert\mathbb{C}^{\perp(1,2)}\vert}}\right],
$$

where $\mathbb{C}^{\bot(i,j)}=\mathbb{C}\backslash{v_{i},v_{j}}$

其中 $\mathbb{C}^{\bot(i,j)}=\mathbb{C}\backslash{v_{i},v_{j}}$

To avoid excluding pairs $(i,i){1\leq i\leq n}$ from the summation in Equation (6) each time, we will henceforth assume that $W_{i i}=0$ for all $i\in{1,\ldots,n}$ .

为避免每次从方程(6)的求和中排除 $(i,i){1\leq i\leq n}$ 对,我们此后假设对于所有 $i\in{1,\ldots,n}$ 都有 $W_{i i}=0$。

The random variable $\begin{array}{r}{|\widehat{\mathbb{C}^{\bot(1,2)}}|=\sum_{i=3}^{n}\mathbf{a}_{i}}\end{array}$ is the sum of (n-2) independent (but not ide ntical) Bernoulli random variables, also known as a Poisson binomial. It coincides with the binomial distribution when the Bernoulli random variables share the same parameter. See Appendix A.3 for a compilation of properties of this distribution including the proof of Lemma 4.2.

随机变量 $\begin{array}{r}{|\widehat{\mathbb{C}^{\bot(1,2)}}|=\sum_{i=3}^{n}\mathbf{a}_{i}}\end{array}$ 是 (n-2) 个独立 (但不同分布) 伯努利随机变量的和,也称为泊松二项分布。当伯努利随机变量具有相同参数时,该分布退化为二项分布。关于该分布的属性汇编及引理 4.2 的证明详见附录 A.3。

Lemma 4.2 (Poisson binomial expectation). Let Z be a Poisson binomial random variable of parameter $\pmb{\alpha}=(\alpha_{1},\ldots,\alpha_{m})\in[0,1]^{m}$ . Then we have:

引理 4.2 (泊松二项期望). 设 Z 为参数 $\pmb{\alpha}=(\alpha_{1},\ldots,\alpha_{m})\in[0,1]^{m}$ 的泊松二项随机变量,则有:

$$
\mathbb{E}\left[\frac{1}{1+\mathrm{Z}}\right]=\int_{0}^{1}\prod_{i=1}^{m}(1-\alpha_{i}t)d t
$$

$$
\mathbb{E}\left[\frac{1}{1+\mathrm{Z}}\right]=\int_{0}^{1}\prod_{i=1}^{m}(1-\alpha_{i}t)d t
$$

We can now express the expected ratio-cut by combining the results from Lemma 4.1 and Lemma 4.2.

现在我们可以结合引理4.1和引理4.2的结果来表达期望的比率割。

Theorem 4.1 (Ratio-cut expectation). The expected ratio-cut of the random clustering $\mathcal{C}{k}$ parameterized by $P\in[0,1]^{n\times k}$ can be computed as $\mathbb{E}\left[\widehat{\mathrm{RatioCut}}(\mathcal{C}{k})\right]=$ $\textstyle\sum_{\ell=1}^{k}\operatorname{RC}(P_{:,\ell})$ where:

定理 4.1 (Ratio-cut期望). 由参数 $P\in[0,1]^{n\times k}$ 定义的随机聚类 $\mathcal{C}{k}$ 的期望 ratio-cut 可计算为 $\mathbb{E}\left[\widehat{\mathrm{RatioCut}}(\mathcal{C}{k})\right]=$ $\textstyle\sum_{\ell=1}^{k}\operatorname{RC}(P_{:,\ell})$ 其中:

$$
\begin{array}{l}{\displaystyle\mathrm{RC}(\boldsymbol{p})=\frac{1}{2}\sum_{i,j=1}^{n}W_{i j}\left(p_{i}+p_{j}-2p_{i}p_{j}\right)\mathbb{I}(p,i,j)}\ {\displaystyle\mathbb{I}(p,i,j)\overset{d e f}{=}\int_{0}^{1}\prod_{m\neq i,j}^{n}{(1-p_{m}t)d t}}\end{array}
$$

$$
\begin{array}{l}{\displaystyle\mathrm{RC}(\boldsymbol{p})=\frac{1}{2}\sum_{i,j=1}^{n}W_{i j}\left(p_{i}+p_{j}-2p_{i}p_{j}\right)\mathbb{I}(p,i,j)}\ {\displaystyle\mathbb{I}(p,i,j)\overset{d e f}{=}\int_{0}^{1}\prod_{m\neq i,j}^{n}{(1-p_{m}t)d t}}\end{array}
$$

The expression for $\operatorname{RC}(\pmb{p})$ can be further simplified by consolidating the terms linear in $\pmb{p}$ . However, we retain the current formulation as it will prove useful when sampling random pairs $(i,j)$ in the implementation.

$\operatorname{RC}(\pmb{p})$ 的表达式可以通过合并 $\pmb{p}$ 中的线性项进一步简化。不过,我们保留了当前公式,因为在实现过程中对随机对 $(i,j)$ 进行采样时,这将非常有用。

We will drop the cluster index $\ell$ in the next sections and use $p\in[0,1]^{n}$ as the parameter of the random assignment. The goal now is to optimize $\operatorname{RC}(\pmb{p})$ .

我们将在接下来的章节中省略簇索引 $\ell$,并使用 $p\in[0,1]^{n}$ 作为随机分配的参数。现在的目标是优化 $\operatorname{RC}(\pmb{p})$。

4.1 Computation of the expected ratio-cut

4.1 期望比率割的计算

A straightforward method to estimate the expected ratio-cut from Theorem 4.1 involves disc ret i zing the

一种直接估计定理4.1中期望比率割(ratio-cut)的方法是离散化

interval $[0,1]$ to approximate the integral. By using a uniform disc ret iz ation with a step size of $\textstyle{\frac{1}{T}}$ for $T>0$ , we can apply the formula:

区间 $[0,1]$ 来近似积分。通过采用步长为 $\textstyle{\frac{1}{T}}$ 的均匀离散化(其中 $T>0$),可应用公式:

$$
\mathbb{I}(\pmb{p},i,j)=\frac{1}{T}\sum_{t=1}^{T}\prod_{m\neq i,j}(1-p_{i}\frac{t}{T})
$$

$$
\mathbb{I}(\pmb{p},i,j)=\frac{1}{T}\sum_{t=1}^{T}\prod_{m\neq i,j}(1-p_{i}\frac{t}{T})
$$

The quality of the approximation depends on $T$ , but since the integrated function is polynomial in $t$ , we have the luxury of using a weighted disc ret iz ation scheme that ensures the exact computation of the integral using the following quadrature method proven in Appendix A.4.

近似质量取决于 $T$,但由于被积函数是关于 $t$ 的多项式,我们可以采用加权离散化方案,确保通过附录 A.4 证明的以下求积方法精确计算积分。

Lemma 4.3 (Integral computation via quadrature). Let $m>0$ be an integer and $c_{m}\stackrel{d e f}{=}\left\lfloor\frac{m}{2}\right\rfloor+1$ (the smallest integer such that $2c_{m}\geq m+1.$ ), then there exist $c_{m}$ tuples $(s_{q},t_{q}){1\leq q\leq c{m}}$ such that:

引理 4.3 (基于求积分的积分计算). 设整数 $m>0$ 且 $c_{m}\stackrel{d e f}{=}\left\lfloor\frac{m}{2}\right\rfloor+1$ (这是满足 $2c_{m}\geq m+1$ 的最小整数), 则存在 $c_{m}$ 个元组 $(s_{q},t_{q}){1\leq q\leq c{m}}$ 使得:

$$
\int_{0}^{1}\prod_{i=1}^{m}(1-p_{i}t)d t=\sum_{q=1}^{c_{m}}s_{q}\prod_{i=1}^{m}(1-p_{i}t_{q}),
$$

$$
\int_{0}^{1}\prod_{i=1}^{m}(1-p_{i}t)d t=\sum_{q=1}^{c_{m}}s_{q}\prod_{i=1}^{m}(1-p_{i}t_{q}),
$$

for all $\pmb{p}\in[0,1]^{m}$ .

对于所有 $\pmb{p}\in[0,1]^{m}$。

The computation of $\mathbb{I}(\pmb{p},i,j)$ for a given pair $(i,j)$ costs $O(n^{2})$ . We can still obtain the total $\operatorname{RC}(\pmb{p})$ in $O(c_{n}n^{2})$ instead of $O(n^{4})$ by computing the $c_{n}$ quantities (Pnm=1 log (1 − pmtq))1≤q≤cn in advance. Another challenge encountered when using the quadrature is the instability of the computations due to the potentially large number of elements in the product. Even with a tight approximation using batch-based estimates, it still costs $O(b^{3})$ , where $b$ is the batch size. Furthermore, the higher the number of clusters, the larger the batch size should be to obtain a good estimate of the integral for different clusters. These challenges are to be expected since the original combinatorial problem is generally NP-hard. See Appendix A.5 for a detailed analysis of the batch-based estimate. Instead, we prove in Appendix A.6 that the expected ratio cut can be upper-bounded by a much more accessible quantity, as shown in Lemma 4.4.

对于给定的一对 $(i,j)$,计算 $\mathbb{I}(\pmb{p},i,j)$ 的复杂度为 $O(n^{2})$。通过预先计算 $c_{n}$ 个量 $(Pnm=1 \log (1 - pmtq)){1\leq q\leq c_n}$,我们仍能在 $O(c_{n}n^{2})$ 而非 $O(n^{4})$ 的时间内获得总 $\operatorname{RC}(\pmb{p})$。

使用数值积分时遇到的另一个挑战是,由于乘积中可能包含大量元素,计算过程可能不稳定。即使采用基于批次的估计进行紧密逼近,其复杂度仍为 $O(b^{3})$,其中 $b$ 是批次大小。此外,聚类数量越多,为了对不同聚类获得良好的积分估计,所需的批次大小也应越大。这些挑战是可以预见的,因为原始组合问题通常是 NP 难的。关于基于批次的估计的详细分析,请参阅附录 A.5。

相反,我们在附录 A.6 中证明了期望比率割可以被一个更容易计算的量上界约束,如引理 4.4 所示。

Lemma 4.4 (Integral upper-bound). We adopt the same notation of Lemma 4.2 where $\mathbf{\delta\alpha}{\alpha}::::::::$ (α1, . . . , αm) ∈ [0, 1]m, and we assume that α d=ef $\textstyle{\frac{1}{m}}\sum_{i=1}^{m}\alpha_{i}>0$ . Then we have:

引理 4.4 (积分上界). 我们采用与引理 4.2 相同的符号表示,其中 $\mathbf{\delta\alpha}{\alpha}::::::::$ (α1, ..., αm) ∈ [0, 1]m,并假设 α d=ef $\textstyle{\frac{1}{m}}\sum_{i=1}^{m}\alpha_{i}>0$。那么我们有:

$$
\int_{0}^{1}\prod_{i=1}^{m}(1-\alpha_{i}t)d t\leq\frac{1}{(m+1)\overline{{\alpha}}}.
$$

$$
\int_{0}^{1}\prod_{i=1}^{m}(1-\alpha_{i}t)d t\leq\frac{1}{(m+1)\overline{{\alpha}}}.
$$

The assumption $\overline{{\alpha}}>0$ is not necessary if we extend the property to $\mathbb{R}\cup{+\infty}$ . We can now use Lemma 4.4 to upper-bound $\operatorname{RC}(\pmb{p})$ .

假设 $\overline{{\alpha}}>0$ 在将性质扩展到 $\mathbb{R}\cup{+\infty}$ 时并非必要条件。现在我们可以利用引理4.4来上界 $\operatorname{RC}(\pmb{p})$。

Lemma 4.5 (RC upper-bound). Using the notations of Theorem 4.1 and Lemma 4.4, we have:

引理 4.5 (RC 上界). 使用定理 4.1 和引理 4.4 的符号表示, 我们有:

$$
\mathrm{RC}({\pmb p})\leq\frac{e^{2}}{2n}\frac{1}{\overline{{\pmb p}}}\sum_{1\leq i,j\leq n}w_{i j}(p_{i}+p_{j}-2p_{i}p_{j})
$$

$$
\mathrm{RC}({\pmb p})\leq\frac{e^{2}}{2n}\frac{1}{\overline{{\pmb p}}}\sum_{1\leq i,j\leq n}w_{i j}(p_{i}+p_{j}-2p_{i}p_{j})
$$

Interpretation Before we proceed with determining an unbiased gradient estimator for the bound in Lemma 4.5, we seek to understand the significance of the various quantities introduced previously. In contrastive learning, the objective can be written as $-2\textstyle\sum_{i j}W_{i j}p_{i}p_{j}$ , whose minimization aims to assign hig hly similar samples to the same cluster. The objective in ratio-cut can be expressed as $\begin{array}{r}{\sum_{i j}W_{i j}\left[p_{i}(1-p_{j})+p_{j}(1-p_{i})\right]}\end{array}$ , which is minimized when dissimilar samples are separated into different clusters, in alignment with the initial goal of minimizing the cut. It is worth noting that the minimum of the term $W_{i j}\left[p_{i}(1-p_{j})+p_{j}(1-p_{i})\right]$ is $W_{i j}$ , and it is achieved when either $v_{i}$ and $v_{j}$ are in different clusters.

解读
在继续确定引理4.5中边界无偏梯度估计量之前,我们需要理解先前引入的各种量的意义。对比学习的目标可表述为 $-2\textstyle\sum_{i j}W_{i j}p_{i}p_{j}$ ,其最小化旨在将高度相似的样本分配到同一簇。比率切割 (ratio-cut) 的目标可表示为 $\begin{array}{r}{\sum_{i j}W_{i j}\left[p_{i}(1-p_{j})+p_{j}(1-p_{i}\right]}\end{array}$ ,当不相似样本被分离到不同簇时该目标最小化,这与最小化切割的初始目标一致。值得注意的是,项 $W_{i j}\left[p_{i}(1-p_{j})+p_{j}(1-p_{i})\right]$ 的最小值为 $W_{i j}$ ,当 $v_{i}$ 和 $v_{j}$ 位于不同簇时取得该最小值。

Lemma 4.2 generalizes the scaling by the inverse of the size of a cluster $\mathbb{C}$ to the probabilistic setting. Let’s assume that $\pmb{p}\in{0,1}^{n}$ (deterministic assignment), then $(p_{i}+p_{j}-2p_{i}p_{j})=1$ if and only if $v_{i}$ and $v_{j}$ are in different clusters. In such case, we can easily prove the following:

引理 4.2 将聚类 $\mathbb{C}$ 大小的倒数缩放推广到了概率化设定。假设 $\pmb{p}\in{0,1}^{n}$ (确定性分配) ,则当且仅当 $v_{i}$ 和 $v_{j}$ 属于不同聚类时 $(p_{i}+p_{j}-2p_{i}p_{j})=1$ 。在此情况下,我们可以轻松证明以下结论:

$$
\operatorname{\mathbb{E}}\left[{\frac{1}{1+|{\widehat{\mathbb{C}}}|}}\right]=\int_{0}^{1}\prod_{i=1,p_{i}=1}^{n}(1-t)d t={\frac{1}{1+n{\overline{{p}}}}}.
$$

$$
\operatorname{\mathbb{E}}\left[{\frac{1}{1+|{\widehat{\mathbb{C}}}|}}\right]=\int_{0}^{1}\prod_{i=1,p_{i}=1}^{n}(1-t)d t={\frac{1}{1+n{\overline{{p}}}}}.
$$

As $n$ becomes sufficiently large, the quantity $\frac{1}{n}\frac{1}{\overline{{p}}}$ from Lemma 4.5 approximates 1+1np and, therefore, also approximates $\begin{array}{r}{\mathbb{E}\left[\frac{1}{1+\left|\mathbb{C}\right|}\right]}\end{array}$ Consequently, the bounds we’ve seen so far are tight in the deterministic case.

当 $n$ 足够大时,引理4.5中的量 $\frac{1}{n}\frac{1}{\overline{{p}}}$ 近似于 $1+\frac{1}{np}$,因此也近似于 $\begin{array}{r}{\mathbb{E}\left[\frac{1}{1+\left|\mathbb{C}\right|}\right]}\end{array}$。由此可见,在确定性情况下,我们目前所见的边界是紧的。

4.2 Stochastic gradient of the objective

4.2 目标函数的随机梯度

We will drop the scaling factor $\textstyle{\frac{1}{2n}}$ when upper bounding $\operatorname{RC}(\mathcal{C}{k})$ , we can thus succinctly write the following upper bound for the expected ratio-cut with $\overline{{\pmb{P}}}:=:\mathrm{diag}\left(\overline{{\pmb{P}{:,1}}},\ldots,\overline{{\pmb{P}{:,k}}}\right)$ as $\mathrm{RC}(\mathscr{C}{k})\leq\mathscr{L}_{r c}(W,P)$ such that:

在计算 $\operatorname{RC}(\mathcal{C}{k})$ 上界时,我们将忽略缩放因子 $\textstyle{\frac{1}{2n}}$ ,因此可以简洁地写出以下期望比率割的上界表达式:设 $\overline{{\pmb{P}}}:=:\mathrm{diag}\left(\overline{{\pmb{P}{:,1}}},\ldots,\overline{{\pmb{P}{:,k}}}\right)$ ,则有 $\mathrm{RC}(\mathscr{C}{k})\leq\mathscr{L}_{r c}(W,P)$ ,具体形式如下:

$$
\begin{array}{l}{\displaystyle\mathcal{L}{r c}(\boldsymbol{W},\boldsymbol{P})=\sum_{\ell=1}^{k}\sum_{i,j=1}^{n}\frac{1}{\overline{{P}}{:,\ell}}W_{i j}(P_{i,\ell}+P_{j,\ell}-2P_{i,\ell}P_{j,\ell})}\ {\displaystyle\qquad=\operatorname{Tr}(\overline{{\boldsymbol{P}}}^{-1}(\mathbf{1}_{n,k}-\boldsymbol{P})^{\top}\boldsymbol{W}\boldsymbol{P}),\qquad(7)}\end{array}
$$

$$
\begin{array}{l}{\displaystyle\mathcal{L}{r c}(\boldsymbol{W},\boldsymbol{P})=\sum_{\ell=1}^{k}\sum_{i,j=1}^{n}\frac{1}{\overline{{P}}{:,\ell}}W_{i j}(P_{i,\ell}+P_{j,\ell}-2P_{i,\ell}P_{j,\ell})}\ {\displaystyle\qquad=\operatorname{Tr}(\overline{{\boldsymbol{P}}}^{-1}(\mathbf{1}_{n,k}-\boldsymbol{P})^{\top}\boldsymbol{W}\boldsymbol{P}),\qquad(7)}\end{array}
$$

Let us examine the derivative of $\mathcal{L}{r c}$ of a single cluster with respect to pi,pi = $\begin{array}{r}{p_{i},\dot{p_{i}}=\frac{d\mathcal{L}{p r c u t}}{d p_{i}}}\end{array}$

让我们考察单个簇的 $\mathcal{L}{r c}$ 关于 $p_i$ 的导数,$p_i = \begin{array}{r}{p_{i},\dot{p_{i}}=\frac{d\mathcal{L}{p r c u t}}{d p_{i}}}\end{array}$

$$
\dot{p_{i}}\propto\frac{1}{\overline{{p}}^{2}}\sum_{i,j=1}^{n}W_{i j}\Big[(1-2p_{j})\overline{{p}}-\frac{1}{n}(p_{i}+p_{j}-2p_{i}p_{j})\Big],
$$

$$
\dot{p_{i}}\propto\frac{1}{\overline{{p}}^{2}}\sum_{i,j=1}^{n}W_{i j}\Big[(1-2p_{j})\overline{{p}}-\frac{1}{n}(p_{i}+p_{j}-2p_{i}p_{j})\Big],
$$

Algorithm 1 Probabilistic Ratio-Cut (PRCut) Algorithm

算法 1: 概率比率切割 (PRCut) 算法

| 输入: 数据集 (vi), 相似性核函数 K, 批次大小 b, 编码器 Ne 参数化为 0, 簇权重, t = 0.
1: while 未终止 do
2: t←t+1
3: 采样大小为 b 的左批次 St
4: 采样大小为 b 的右批次 Sr
5: 计算 W = K(St,Sr)
6: 计算 P, P ← Ne(Si), Ne(Sr) E Rbxk
7: 更新 Pt← (1-βt)Pt-1 + (P + P

In order to obtain an unbiased estimate for the gradient, we need to acquire an accurate estimate of $\overline{{p}}$ . We do this in the online setting by computing the moving average:

为了获得梯度的无偏估计,我们需要准确估计 $\overline{{p}}$ 。在在线设置中,我们通过计算移动平均值来实现这一点:

$$
\overline{{{\pmb{p}}}}{t}^{m a}=(1-\beta_{t})\overline{{{\pmb{p}}}}{t-1}^{m a}+\beta_{t}\overline{{{\pmb{p}}}}_{t},
$$

$$
\overline{{{\pmb{p}}}}{t}^{m a}=(1-\beta_{t})\overline{{{\pmb{p}}}}{t-1}^{m a}+\beta_{t}\overline{{{\pmb{p}}}}_{t},
$$

such that $\begin{array}{r}{\beta_{t}=\frac{\beta}{t}}\end{array}$ and $\overline{{\pmb{p}}}{t}$ is a batch-based estimate of $\overline{{p}}$ at time $t$ . Then, the unbiased gradient can be obtained by back-propagating $\left[\operatorname{sg}(\dot{p_{i}})p_{i}\right]$ , with sg being the gradient-stopping operator.

使得 $\begin{array}{r}{\beta_{t}=\frac{\beta}{t}}\end{array}$ 且 $\overline{{\pmb{p}}}{t}$ 是 $\overline{{p}}$ 在时间 $t$ 的基于批次的估计。那么,无偏梯度可以通过反向传播 $\left[\operatorname{sg}(\dot{p_{i}})p_{i}\right]$ 获得,其中 sg 是梯度停止算子。

4.3 Regular iz ation

4.3 正则化 (Regularization)

As we aim to optimize $\mathcal{L}{r c}(W,P)$ using gradient de- scent, the main challenge our method faces is that the assignments may collapse to a single cluster $\mathbb{C}{m}$ with a probability close to 1. This translates to $\left(\overline{{P_{:,\ell}}}\right){\ell\neq m}\approx0$ , which renders the gradient updates highly unstable. Benamou et al. (2014) have adopted the constraint that ${P}$ belongs to the polytope of distributions $\mathcal{U}_{k}$ , defined as:

在优化 $\mathcal{L}{r c}(W,P)$ 时,我们采用梯度下降法面临的主要挑战是分配可能坍缩至单一聚类 $\mathbb{C}{m}$ ,其概率接近 1。这意味着 $\left(\overline{{P_{:,\ell}}}\right){\ell\neq m}\approx0$ ,导致梯度更新极不稳定。

Instead of restricting the clusters to be equally likely through the Bregman projection into $\mathcal{U}{k}$ , we only encourage such behavior using Kullback-Leibler divergence regular iz ation. By selecting the appropriate regu lari z ation weight $\gamma$ , the additional term will ensure that the likelihood of each of the $k$ clusters, $\overline{{\mathbf{\nabla}P}}$ , exceeds a certain threshold $\delta(\gamma)>0$ for the optimal ${P}$ . This approach does not necessarily imply that all the clusters will be utilized, as the assignment probability for cluster $\mathbb{C}{\ell}$ could be significantly above $\delta$ without any sample being more likely assigned to cluster $\mathbb{C}_{\ell}$ . The final objective that we aim to optimize is:

我们并不通过将簇限制为等概率的 Bregman 投影到 $\mathcal{U}{k}$ 中,而是仅利用 Kullback-Leibler 散度正则化来鼓励这种行为。通过选择合适的正则化权重 $\gamma$,附加项将确保在最优解 ${P}$ 下,每个 $k$ 簇的似然 $\overline{{\mathbf{\nabla}P}}$ 超过特定阈值 $\delta(\gamma)>0$。这种方法并不必然意味着所有簇都会被利用,因为簇 $\mathbb{C}{\ell}$ 的分配概率可能显著高于 $\delta$,而实际上没有任何样本更可能被分配到簇 $\mathbb{C}_{\ell}$。最终我们优化的目标函数为:

$$
\mathcal{L}{p r c u t}(\pmb{W},\pmb{P})=\mathcal{L}{r c}(\pmb{W},\pmb{P})+\gamma D_{\mathrm{KL}}(\overline{{\pmb{P}}}\parallel\frac{1}{k}\pmb{1}_{k})
$$

$$
\mathcal{L}{p r c u t}(\pmb{W},\pmb{P})=\mathcal{L}{r c}(\pmb{W},\pmb{P})+\gamma D_{\mathrm{KL}}(\overline{{\pmb{P}}}\parallel\frac{1}{k}\pmb{1}_{k})
$$

4.4 Similarity measure

4.4 相似性度量

We have assumed thus far that the kernel $\boldsymbol{\kappa}$ is provided as input. We also assert that the performance of any similarity-based clustering ultimately depends on the quality of the similarity function used. The simplest kernel we can use is the adjacency matrix for a $k$ -nearest neighbors graph, where $\mathcal{K}(v_{i},v_{j})=1$ when either $v_{i}$ or $v_{j}$ is among the $k$ nearest neighbors of each other (to ensure that the kernel is symmetric). We can also use the Simple (SimCLR) or the Symbiotic (All4One) contrastive learning methods (Chen et al., 2020; Estepa et al., 2023) to train a neural network to learn the pairwise similarities. Once we train our similarity network, we use the cosine of the representations of the samples as our similarity function to either compute $\begin{array}{r}{W_{i j}=\exp\left(\frac{c o s i n e(z_{i},z_{j})}{\tau}\right)}\end{array}$ for some temperature $\tau>0$ or to build the $k$ -nearest neighbors graph based on these representations.

我们此前一直假设核函数 $\boldsymbol{\kappa}$ 是作为输入提供的。我们还认为,任何基于相似性的聚类性能最终取决于所用相似度函数的质量。我们可以使用的最简单核是 $k$ 近邻图的邻接矩阵,其中当 $v_{i}$ 或 $v_{j}$ 是彼此的 $k$ 个最近邻之一时(为确保核对称性),$\mathcal{K}(v_{i},v_{j})=1$。我们也可以使用简单(SimCLR)或共生(All4One)对比学习方法(Chen et al., 2020; Estepa et al., 2023)来训练神经网络以学习成对相似性。训练相似性网络后,我们使用样本表征的余弦值作为相似度函数,对于某个温度 $\tau>0$ 计算 $\begin{array}{r}{W_{i j}=\exp\left(\frac{c o s i n e(z_{i},z_{j})}{\tau}\right)}\end{array}$,或基于这些表征构建 $k$ 近邻图。

4.5 Computational and Memory Footprint

4.5 计算与内存占用

If the similarity matrix $W$ is dense, the time complexity of computing the PRCut objective is $O(n^{3})$ , identical to a vanilla spectral clustering approach. In contrast, for a sparse graph where similarity is determined based on the $m$ -nearest neighbors, the time complexity of conventional spectral clustering is $O(n m k)$ , not accounting for the additional cost of $O(n k^{2})$ incurred by running k-means on the computed ei gen vectors.

如果相似度矩阵 $W$ 是稠密的,计算 PRCut 目标函数的时间复杂度为 $O(n^{3})$ ,与传统谱聚类方法相同。相比之下,对于基于 $m$ 近邻确定相似度的稀疏图,传统谱聚类的时间复杂度为 $O(n m k)$ ,这还不包括在计算得到的特征向量上运行 k-means 所产生的 $O(n k^{2})$ 额外开销。

In the case of batch PRCut, the expected computational cost for evaluating the batch loss with a batch size of $b$ is $\begin{array}{r}{O\left(\frac{m}{n}k b^{2}\right)}\end{array}$ , while the memory requirement remains constant at $O(b^{2})$ . Consequently, when executed for $T$ steps, the overall time complexity of stochastic PRCut becomes $\begin{array}{r}{O\left(T\frac{m}{n}k b^{2}\right)}\end{array}$ .

在批量PRCut的情况下,评估批大小为$b$的批损失预期计算成本为$\begin{array}{r}{O\left(\frac{m}{n}k b^{2}\right)}\end{array}$,而内存需求保持恒定在$O(b^{2})$。因此,当执行$T$步时,随机PRCut的总体时间复杂度变为$\begin{array}{r}{O\left(T\frac{m}{n}k b^{2}\right)}\end{array}$。

5 EXPERIMENTS

5 实验

We train a neural network $N_{\theta}:\mathbb{R}^{p}\mapsto\Delta^{k-1}$ that maps the vertex $v_{i}$ to its cluster assignment probabilities $P_{i}^{\theta}$ , where $\theta\in\mathbb{R}^{q}$ is the parameter of the network. The last layer of the neural network is a Softmax layer, ensuring that the constraint $\textstyle\sum_{\ell=1}^{k}P_{i\ell}=1$ is always satisfied. We assume that the number of class labels $k$ is provided. When computing the batch-based gradient using Equation (8), we use the factor $\frac{1}{b}$ instead of $\textstyle{\frac{1}{n}}$ , where $b$ is the batch size1.

我们训练一个神经网络 $N_{\theta}:\mathbb{R}^{p}\mapsto\Delta^{k-1}$,将顶点 $v_{i}$ 映射到其聚类分配概率 $P_{i}^{\theta}$,其中 $\theta\in\mathbb{R}^{q}$ 是网络参数。该神经网络的最后一层是 Softmax 层,确保始终满足约束条件 $\textstyle\sum_{\ell=1}^{k}P_{i\ell}=1$。我们假设类别标签数量 $k$ 是已知的。当使用公式(8)计算基于批次的梯度时,我们采用系数 $\frac{1}{b}$ 而非 $\textstyle{\frac{1}{n}}$,其中 $b$ 表示批次大小1。

Since Lrc(W , P θ) is linear in W , we scale it by ∥W1 ∥ to ensure consistency in the gradient descent method across different datasets and similarity measures.

由于 Lrc(W , P θ) 在 W 上是线性的,我们通过 ∥W1 ∥ 对其进行缩放,以确保在不同数据集和相似性度量下梯度下降方法的一致性。

Metrics To benchmark the performance of our algorithm, we assume that we have access to the true labeling ${\pmb y}=(y_{i})_{i}$ and the algorithm’s clustering $\mathbf{c}$ . We use three different metrics defined as follows:

指标

为了评估我们算法的性能,我们假设可以获得真实标签 ${\pmb y}=(y_{i})_{i}$ 和算法的聚类结果 $\mathbf{c}$。我们使用以下三种不同的指标进行衡量:

• Unsupervised Accuracy (ACC): We use the Kuhn-Munkres algorithm (Munkres, 1957) to find the optimal permutation $\begin{array}{r}{\frac{1}{n}\operatorname*{max}{\sigma\in\sigma_{k}}\sum_{i=1}^{n}1_{y_{i}=\sigma(c_{i})}}\end{array}$ $o_{k}$ is maximized between of ${1,\ldots,k}$ such that the true labeling $(y_{i}){i}$ and the clustering $(c_{i})_{i}$ . • Normalized Mutual Information (NMI) is defined as: NMI(y, c) = $\begin{array}{r}{\mathrm{NMI}(\pmb{y},\pmb{c})=\frac{\mathbb{Z}(\pmb{y},\pmb{c})}{\operatorname*{max}{\mathcal{H}(l),\mathcal{H}(\pmb{c})}}}\end{array}$ Where $\boldsymbol{\mathcal{T}}(\boldsymbol{\textbf{{y}}},\boldsymbol{\textbf{{c}}}$ is the mutual information between ${\bf{y}}$ and $_c$ and $\mathcal{H}$ is the entropy measure. • Adjusted Rand Index (ARI) to evaluates the agreement between the true class labels and the learned clustering. • Ratio Cut (RC) is computed using the formula in Equation (3) using the raw Laplacian matrix.

  • 无监督准确率 (ACC): 使用 Kuhn-Munkres 算法 (Munkres, 1957) 寻找最优排列 $\begin{array}{r}{\frac{1}{n}\operatorname*{max}{\sigma\in\sigma_{k}}\sum_{i=1}^{n}1_{y_{i}=\sigma(c_{i})}}\end{array}$ ,使得真实标签 $(y_{i}){i}$ 与聚类结果 $(c_{i}){i}$ 在 ${1,\ldots,k}$ 上的 $o_{k}$ 最大化。
  • 标准化互信息 (NMI): 定义为 $\begin{array}{r}{\mathrm{NMI}(\pmb{y},\pmb{c})=\frac{\mathbb{Z}(\pmb{y},\pmb{c})}{\operatorname*{max}{\mathcal{H}(l),\mathcal{H}(\pmb{c})}}}\end{array}$ ,其中 $\boldsymbol{\mathcal{T}}(\boldsymbol{\textbf{{y}}},\boldsymbol{\textbf{{c}}}$ 表示 ${\bf{y}}$ 与 $_c$ 的互信息,$\mathcal{H}$ 为熵度量。
  • 调整兰德指数 (ARI): 用于评估真实类别标签与学习所得聚类之间的一致性。
  • 比率割 (RC): 通过原始拉普拉斯矩阵按公式 (3) 计算。

Before we compare our method to spectral clustering and other clustering approaches, we first assess the quality of the clustering when using the perfect similarity measure, where $\mathcal{K}(v_{i},v_{j})=1$ if $v_{i}$ and $v_{j}$ share the same label, and $\mathcal{K}(v_{i},v_{j})=0$ otherwise. For this experiment, we employ a simple Multi-Layer Perceptron (MLP) consisting of 3 layers with 512 hidden units, followed by Gaussian Error Linear Units (GeLU) activations. In the PRCut network, the last layer utilizes the Softmax activation. We compare the two methods across three datasets: MNIST (Lecun et al., 1998), Fashion-MNIST (F-MNIST) (Xiao et al., 2017), and CIFAR10 (Krizhevsky, 2009). The classifier is trained by optimizing the cross-entropy loss (CE).

在将我们的方法与谱聚类及其他聚类方法进行比较之前,我们首先评估使用完美相似性度量时的聚类质量,其中当$v_{i}$和$v_{j}$具有相同标签时$\mathcal{K}(v_{i},v_{j})=1$,否则$\mathcal{K}(v_{i},v_{j})=0$。本实验采用一个简单的多层感知机(MLP),包含3层512个隐藏单元,并采用高斯误差线性单元(GeLU)激活函数。PRCut网络的最后一层使用Softmax激活。我们在三个数据集上比较这两种方法:MNIST (Lecun et al., 1998)、Fashion-MNIST (F-MNIST) (Xiao et al., 2017)和CIFAR10 (Krizhevsky, 2009)。分类器通过优化交叉熵损失(CE)进行训练。

The MLP classifier trained directly on the labeled data and the PRCut clustering using the labeled similarity demonstrate similar performance. This shows that our approach is more versatile and can be used in various learning methods while perfectly reflecting the quality of the similarity in the resulting clustering.

直接在标注数据上训练的MLP分类器与使用标注相似度的PRCut聚类表现出相似的性能。这表明我们的方法更具通用性,可适用于多种学习方法,同时完美反映相似度质量在最终聚类中的体现。

In Table 2 we compare our approach to vanilla Spectral Clustering (SC) using k-neighbor graph adjacency as a similarity measure for $k=150$ using the entire training set. Since the final performance depends on the kmeans initialization, we report the best run for SC.

在表2中,我们将我们的方法与使用k近邻图邻接作为相似性度量的原始谱聚类(SC)进行了比较,其中$k=150$并使用整个训练集。由于最终性能取决于k均值初始化,我们报告了SC的最佳运行结果。

Table 1: Benchmarking PRCut when using label-based similarity

表 1: 基于标签相似度的PRCut基准测试

数据集 方法 ACC NMI
MNIST CE 0.980 0.943
PRCut 0.987 0.938
F-MNIST CE 0.885 0.803
PRCut 0.887 0.789
CIFAR10 CE 0.582 0.369
PRCut 0.571 0.359

Table 2: Comparison between PRCut and Spectral Clustering (SC)

表 2: PRCut 与谱聚类 (SC) 的对比

数据集 方法 ACC NMI RC 运行时间
MNIST SC (最佳) 0.70 0.744 170.1
PRCut 0.821 0.778 150.2
F-MNIST SC (最佳) 0.596 0.593 110.2
PRCut 0.658 0.620 101.5
CIFAR10 SC (最佳) 0.217 0.086 479.3
PRCut 0.243 0.121 440.3

Since the final objective is to minimize the ratio cut, our approach achieves a better ratio cut value compared to vanilla spectral clustering. Whether such improvement translates to an enhancement in the clustering metrics depends on the similarity function. In all the 3 datasets, our approach outperforms the spectral relaxation of the ratio-cut.

由于最终目标是最小化比率切割(ratio cut),我们的方法相比传统谱聚类实现了更优的比率切割值。这种改进能否转化为聚类指标的提升取决于相似度函数。在所有三个数据集中,我们的方法都优于比率切割的谱松弛解法。

For the third set of experiments reported in Table 3, we compare our method to VaDE (Jiang et al., 2017), VMM (Stirn & Knowles, 2024), and Turtle (Gadetsky et al., 2024). We have observed that VaDE does not generalize well across datasets, as its performance drastically degrades when applied to the Fashion-MNIST dataset. The VMM approach is a more consistent variation al auto encoder (VAE) with a mixture model prior, achieving the best reported performance on the Fashion-MNIST dataset in the literature. We also report the results for methods with the suffix ”-D,” which use the pre-trained representation model DINOv2 (Oquab et al., 2024), while the suffix ”-V” denotes methods that utilize the representation from the CLIP (Radford et al., 2021) vision trans- former. In both of these approaches, the k-neighbors graph and the trained neural networks take the samples in the representation spaces as input.

在表3报告的第三组实验中,我们将本方法与VaDE (Jiang et al., 2017)、VMM (Stirn & Knowles, 2024)和Turtle (Gadetsky et al., 2024)进行对比。我们发现VaDE在不同数据集间泛化能力较差,其在Fashion-MNIST数据集上的性能显著下降。VMM方法是一种更稳定的变分自编码器(VAE)与混合先验模型组合,在文献中取得了Fashion-MNIST数据集上的最佳报告性能。我们还报告了带"-D"后缀方法的结果,这些方法使用预训练表征模型DINOv2 (Oquab et al., 2024),而"-V"后缀表示采用CLIP (Radford et al., 2021)视觉Transformer表征的方法。在这两种方法中,k近邻图和训练好的神经网络都以表征空间中的样本作为输入。

The neural network consists of a single linear layer followed by the softmax operator. The learning rate was set to a constant value of $10^{-4}$ and using a $10^{-7}$ weight decay, with a bach size of $b=2048$ . We set the entropy regular iz ation to $\gamma=100.0$ and the moving average parameter to $\beta=0.8$ .

该神经网络由单个线性层和随后的softmax算子组成。学习率设置为恒定值$10^{-4}$,权重衰减为$10^{-7}$,批量大小为$b=2048$。我们将熵正则化系数设为$\gamma=100.0$,移动平均参数设为$\beta=0.8$。

Table 3: Comparison of PRCut to the best-performing clustering methods.

表 3: PRCut与最佳聚类方法的对比

数据集 方法 ACC NMI
MNIST SC 0.701 0.744
VaDE 0.857 0.838
VMM 0.960
Turtle-D 0.573 0.544
PRCut-V 0.771 0.734
F-MNIST SC 0.596 0.593
VaDE 0.352 0.496
VMM 0.712 0.688
Turtle-D 0.764 0.723
PRCut-D 0.791 0.758
CIFAR10 SC 0.217 0.121
Turtle-V 0.972 0.929
PRCut-V 0.975 0.934
Turtle-D 0.806 0.870
CIFAR100 PRCut-D 0.789 0.856

Our method remains competitive compared to Turtle and achieves the best reported performance on Fashion-MNIST, which was designed to be a more challenging dataset than MNIST. Furthermore, our approach does not rely on any assumptions about the structure of the representation spaces, in contrast to Turtle, which is based on the premise that the best clusters are linearly separable—a property that is inherently valid for DINOv2 and CLIP. This may explain why it does not perform as well on grayscale images such as MNIST and Fashion-MNIST, where linear separability is not as prominent. We note that the Turtle results in our benchmark are based on a single representation space for a fair comparison, as the original paper performs best when it combines two representation spaces.

我们的方法在与Turtle相比仍具竞争力,并在Fashion-MNIST上取得了目前最佳报告性能。该数据集被设计为比MNIST更具挑战性。此外,我们的方法不依赖于任何关于表示空间结构的假设,而Turtle则基于最佳聚类是线性可分的前提——这一特性对DINOv2和CLIP具有固有有效性。这可能解释了为何Turtle在MNIST和Fashion-MNIST等灰度图像上表现不佳,因为线性可分性在这些场景中并不显著。需要说明的是,我们基准测试中的Turtle结果基于单一表示空间以确保公平比较,而原论文在结合两个表示空间时才能达到最佳性能。

Table 4: Comparison representations using PRCut

表 4: 使用 PRCut 的比较表示

Dataset Rep ACC NMI
CIFAR10 Raw 0.243 0.121
SimCLR 0.721 0.652
All4One 0.710 0.635
VitL-14 0.975 0.934
DinoV2 0.774 0.797
CIFAR100 Raw 0.054 0.022
SimCLR 0.362 0.483
All4One 0.382 0.511
VitL-14
DinoV2 0.720 0.789 0.755 0.856

In Table 4, we compare the performance of our method using various pre-trained self-supervised representation learning models. In particular, we rely on the solo-learn library (da Costa et al., 2022) to retrieve or fine-tune the SimCLR and All4One models. While these two approaches perform well when evaluated using linear probes or a k-nearest neighbor classifier, the similarities in the embedding space for datasets with a higher number of class labels (CIFAR100) are too sparse to capture useful clusters. In that regard, DinoV2 embedding performs better than CLIP vision transformer.

在表4中,我们比较了使用各种预训练自监督表征学习模型的方法性能。具体而言,我们依托solo-learn库(da Costa et al., 2022)来获取或微调SimCLR和All4One模型。虽然这两种方法在使用线性探针或k近邻分类器评估时表现良好,但对于类别标签数量较多(如CIFAR100)的数据集,其嵌入空间的相似性过于稀疏,难以捕获有效聚类。在这方面,DinoV2嵌入表现优于CLIP视觉Transformer。

6 Conclusion and future work

6 结论与未来工作

We have introduced a novel method that approaches the graph ratio-cut optimization from a probabilistic perspective. Compared to the classical Spectral Clustering based on the raw Laplacian, PRCut achieves better clustering performance and more optimal ratiocut values. However, the performance strongly depends on the sparsity of the global similarity. With the recently developed self-supervised representation models that have proven to be powerful, we have demonstrated that PRCut translates the similarities between samples in the embedding space into high quality clustering and achieve new best clustering s for FashionMNIST while remaining competitive with state-of-theart clustering methods.

我们提出了一种从概率视角解决图比率割优化的新方法。相比基于原始拉普拉斯矩阵的经典谱聚类,PRCut实现了更优的聚类性能和更理想的比率割值。但其性能高度依赖全局相似度的稀疏性。借助近期被证实强大的自监督表征模型,我们证明了PRCut能将嵌入空间中样本间的相似性转化为高质量聚类,在FashionMNIST数据集上创下新的最佳聚类记录,同时保持与最先进聚类方法的竞争力。

While the current work serves as an introduction to this novel approach, the potential extensions of our method appear boundless. For example, PRCut could be leveraged as a dimensionality reduction technique by incorporating a bottleneck layer within the clustering neural network. Furthermore, the methodology could be adapted under the premise of linear separability of the true class labels, similar to the approach adopted by Gadetsky et al. (2024). The issue of dynamically determining the optimal number of clusters remains unresolved, given our current assumption that this information is supplied as an input to our algorithm.

虽然当前工作是对这一新方法的初步探索,但我们的方法具有无限扩展潜力。例如,通过在聚类神经网络中加入瓶颈层,PRCut可被改造为降维技术。此外,该方法可基于真实类别标签的线性可分性前提进行调整,类似Gadetsky等人(2024)采用的方法。由于我们当前假设算法需要输入最优聚类数量,动态确定最佳聚类数的问题仍有待解决。

The algorithm can also be extended to offline learning via Equation (7). Notably, when considering equally likely clusters, the problem simplifies to a quadratic form, with the Hessian being proportional to $-\boldsymbol{W}$ . By leveraging the structure of the similarity matrix, we can derive additional guarantees about the optimal solution. This enables the definition of an iterative approach similar to the Sinkhorn-Knopp algorithm (Benamou et al., 2014), offering promising avenues for further exploration.

该算法还可通过方程(7) 扩展到离线学习。值得注意的是,当考虑等概率聚类时,问题可简化为二次形式,其Hessian矩阵与 $-\boldsymbol{W}$ 成正比。通过利用相似矩阵的结构特性,我们可以推导出关于最优解的额外保证条件。这使我们可以定义一种类似于Sinkhorn-Knopp算法 (Benamou et al., 2014) 的迭代方法,为后续研究提供了有前景的探索方向。

阅读全文(20积分)