Deep Clustering via Probabilistic Ratio-Cut Optimization

基于概率比率割优化的深度聚类

Abstract

摘要

were even shown to be fundamentally equivalent in specific settings (Dhillon et al., 2004).

在特定场景下甚至被证明具有根本等价性 (Dhillon et al., 2004)。

We propose a novel approach for optimizing the graph ratio-cut by modeling the binary assignments as random variables. We provide an upper bound on the expected ratio-cut, as well as an unbiased estimate of its gradient, to learn the parameters of the assignment variables in an online setting. The clustering resulting from our probabilistic approach (PRCut) outperforms the Rayleigh quotient relaxation of the combinatorial problem, its online learning extensions, and several widely used methods. We demonstrate that the PRCut clustering closely aligns with the similarity measure and can perform as well as a supervised classifier when label-based similarities are provided. This novel approach can leverage out-of-the-box self-supervised represent at ions to achieve competitive performance and serve as an evaluation method for the quality of these representations.

我们提出了一种通过将二元分配建模为随机变量来优化图比率割的新方法。我们给出了期望比率割的上界及其梯度的无偏估计，以便在线学习分配变量的参数。我们的概率方法（PRCut）产生的聚类效果优于组合问题的瑞利商松弛、其在线学习扩展以及几种广泛使用的方法。我们证明，PRCut聚类与相似性度量高度一致，并且在提供基于标签的相似性时，其表现可与监督分类器相媲美。这一新方法能够利用现成的自监督表示来实现有竞争力的性能，并可作为评估这些表示质量的方法。

The recent advances in generative models and selfsupervised representation learning have produced powerful embeddings that capture the similarity between the original data samples. This makes leveraging these similarity measures to achieve effective clustering more relevant than ever, as it can serve as pseudo-labels for pre-training class if i ers or eliminate the need for a decoder (Ji et al., 2021) in the training of these generative models. Therefore, it is highly advantageous to develop a method that can transform the similarity information into clustering of equal quality. An efficient extension of spectral clustering to stochastic gradient descent would facilitate the conversion of learned embeddings to cluster assignments without relying on the full dataset or large batches to accurately approximate the graph structure of the embedding space.

生成式模型（Generative AI）和自监督表征学习的最新进展催生了能够捕捉原始数据样本间相似性的强大嵌入表示。这使利用相似性度量实现高效聚类变得前所未有的重要——它既可作为预训练分类器的伪标签，又能在生成模型训练中省去解码器（Ji et al., 2021）。因此，开发能将相似度信息转化为同等质量聚类的方法具有显著优势。若能将谱聚类（spectral clustering）高效扩展至随机梯度下降框架，即可在不依赖完整数据集或大批量样本的情况下，将学习到的嵌入表示转化为聚类分配，从而准确逼近嵌入空间的图结构。

1 INTRODUCTION

1 引言

Unsupervised learning is based on the premise that labels are not necessary for the training process, particularly for clustering tasks where samples that are highly similar are grouped together. Various clustering algorithms (Ezugwu et al., 2022) have been proposed in the context of machine learning and data mining. The K-Means algorithm (Lloyd, 1957) and ratio-cut partitioning (Hagen & Kahng, 1991) were among the first approaches to address the clustering problem. These methods were further refined and extended beyond binary partitioning and Euclidean distances, and they

无监督学习基于训练过程不需要标签的前提，尤其适用于将高度相似的样本聚类的任务。在机器学习和数据挖掘领域，已提出多种聚类算法 [Ezugwu et al., 2022]。K-Means算法 [Lloyd, 1957] 和比率切割分区 [Hagen & Kahng, 1991] 是最早解决聚类问题的方法之一。这些方法后来被进一步改进和扩展，超越了二值划分和欧氏距离的限制。

Several clustering methods have been applied to data streams to provide weak signals for the pre-training of neural networks. Meanwhile, other approaches have been proposed to utilize deep learning specifically for clustering, particularly with auto encoders and generative neural networks (Jiang et al., 2017). Generally, these methods can be categorized into contrastive learning (Li et al., 2020), where samples are grouped based on pairwise distances in a large batch, or generative models such as variation al auto encoders that use a prior with an auxiliary variable as the cluster assignment. The former methods do not fully capture the global structure of the clusters, while the latter may suffer from over fitting the prior due to the complexity of the neural network, with limited options to prevent this without relying on labels. Spectral clustering avoids these issues by clustering the data based on global similarities between clusters rather than just pairs of samples.

多种聚类方法已被应用于数据流中，为神经网络的预训练提供弱信号。与此同时，其他研究提出了专门利用深度学习进行聚类的方法，特别是通过自编码器和生成式神经网络（Jiang et al., 2017）。这些方法通常可分为对比学习（Li et al., 2020）——基于大批量样本间的成对距离进行分组，以及生成模型（如变分自编码器）——使用带有辅助变量的先验分布作为聚类分配依据。前者未能充分捕捉簇的全局结构，后者则可能因神经网络复杂性导致先验过拟合，且在不依赖标签的情况下缺乏有效的预防手段。谱聚类通过基于簇间全局相似性（而非仅样本对）进行数据聚类，规避了上述问题。

Spectral clustering has long resisted attempts to extend it to parametric learning, primarily due to the challenge of handling the spectral decomposition of large matrices. Since it is based on the spectral decomposition of the relaxed ratio-cut rather than the combinatorial version, it requires the projection of the data into a principal eigenspace. A clustering algorithm is then applied to these projections, further making the final performance dependent on the chosen clustering algorithm.

谱聚类长期以来难以扩展到参数化学习，主要因为处理大矩阵谱分解的挑战。该方法基于松弛比率割的谱分解而非组合版本，需要将数据投影到主特征空间。随后对这些投影应用聚类算法，使得最终性能依赖于所选聚类算法。

In this paper, we develop a novel method to optimize the ratio-cut without relying on the spectral decomposition of the Laplacian matrix by employing a probabilistic approach that treats the cluster assignment as random variables. Furthermore, we utilize neural networks to parameter ize the assignment probabilities and address the clustering problem in an online manner using stochastic gradient descent. The result is an online learning algorithm that achieves a better ratio-cut objective than the memory-intensive spectral method applied to the full graph Laplacian. We also demonstrate that our approach achieves comparable performance to a supervised classifier when utilizing supervised similarity (i.e., two samples are considered similar if they share the same label). This showcases a high fidelity between the resulting clustering and the given similarity measure. The proposed algorithm can also serve as a drop-in replacement for any other clustering method used in the pre-training of text or speech transformers, which opens up several opportunities for enhancing the effectiveness of pre-training for downstream tasks.

本文提出了一种新颖的方法，通过采用概率化处理聚类分配为随机变量的策略，在不依赖拉普拉斯矩阵谱分解的情况下优化比率切割 (ratio-cut) 。此外，我们利用神经网络对分配概率进行参数化，并通过随机梯度下降以在线方式解决聚类问题。由此产生的在线学习算法相比应用于完整图拉普拉斯矩阵的内存密集型谱方法，能获得更优的比率切割目标值。实验还表明，当采用监督相似度（即共享相同标签的样本被视为相似）时，本方法能达到与监督分类器相当的性能，这印证了所得聚类与给定相似度度量之间的高度一致性。该算法还可直接替代文本或语音Transformer预训练中使用的其他聚类方法，为提升下游任务预训练效果提供了多种可能性。

2 BACKGROUND

2 背景

In this section, we succinctly present the relevant elements related to the notion of graph ratio-cut. The curious reader may refer to von Luxburg (2007) for a more detailed account of spectral clustering and other types of graph cuts.

在本节中，我们简要介绍与图比率割 (graph ratio-cut) 概念相关的要素。感兴趣的读者可以参考 von Luxburg (2007) 以获取关于谱聚类和其他类型图割的更详细说明。

Let $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{K})$ be an undirected weighted graph where $\mathcal{V}\overset{\mathrm{def}}{=}{v_{i}|1\leq i\leq n}\subset\mathbb{R}^{p}$ is the set of vertices, $\mathcal{E}\subset\mathcal{V}\times\mathcal{V}$ is the set of edges $e_{i j}$ linking vertices $v_{i}$ and $v_{j}$ , and $K:\mathcal{V}\times\mathcal{V}\rightarrow\mathbb{R}^{+}$ is a symmetric nonnegative kernel. Let $W$ be the symmetric $n\times n$ adjacency matrix where $W_{i j}=\mathscr{K}(v_{i},v_{j})$ . The degree of the vertex $v_{i}$ is $\begin{array}{r}{d_{i}=\sum_{j}W_{i j}}\end{array}$ , and the degree matrix D d=ef diag (d1, . . . , dn).

设 $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{K})$ 为一个无向加权图，其中 $\mathcal{V}\overset{\mathrm{def}}{=}{v_{i}|1\leq i\leq n}\subset\mathbb{R}^{p}$ 是顶点集，$\mathcal{E}\subset\mathcal{V}\times\mathcal{V}$ 是连接顶点 $v_{i}$ 和 $v_{j}$ 的边集 $e_{i j}$，$K:\mathcal{V}\times\mathcal{V}\rightarrow\mathbb{R}^{+}$ 是对称非负核函数。设 $W$ 为对称的 $n\times n$ 邻接矩阵，其中 $W_{i j}=\mathscr{K}(v_{i},v_{j})$。顶点 $v_{i}$ 的度为 $\begin{array}{r}{d_{i}=\sum_{j}W_{i j}}\end{array}$，度矩阵定义为 $D \overset{\mathrm{def}}{=} \text{diag}(d_1, ..., d_n)$。

Let $k\geq2$ and ${\mathcal{C}}{k}={\mathbb{C}{\ell}|1\leq\ell\leq k}$ a partitioning of the graph $\mathcal{G}$ into $k$ disjoint clusters. We shall represent the subset $\mathbb{C}{\ell}\subset\mathcal{V}$ using the binary assignment vector where $\mathbf{1}{\mathbb{C}{\ell}}(i)=1$ if and only if $v_{i}\in\mathbb{C}{\ell}$ . The size of $\mathbb{C}{\ell}$ is measured using its cardinality $\left\lvert\mathbb{C}{\ell}\right\rvert=$ in=1 1Cℓ(i), and we denote by f (ℓ) d=ef $\begin{array}{r}{\pmb{f}^{(\ell)}\stackrel{\mathrm{def}}{=}\frac{1}{\sqrt{|\mathbb{C}{\ell}|}}\mathbf{1}{\mathbb{C}{\ell}}}\end{array}$ ratio assignment when $|\mathbb{C}_{\ell}|>0$ .

设 $k\geq2$ 且 ${\mathcal{C}}{k}={\mathbb{C}{\ell}|1\leq\ell\leq k}$ 为图 $\mathcal{G}$ 划分为 $k$ 个不相交簇的集合。我们用二元分配向量表示子集 $\mathbb{C}{\ell}\subset\mathcal{V}$，其中 $\mathbf{1}{\mathbb{C}{\ell}}(i)=1$ 当且仅当 $v_{i}\in\mathbb{C}{\ell}$。$\mathbb{C}{\ell}$ 的大小由其基数 $\left\lvert\mathbb{C}{\ell}\right\rvert=\sum_{i=1}^{n}\mathbf{1}{\mathbb{C}{\ell}}(i)$ 衡量，并定义归一化分配向量 $\begin{array}{r}{\pmb{f}^{(\ell)}\stackrel{\mathrm{def}}{=}\frac{1}{\sqrt{|\mathbb{C}{\ell}|}}\mathbf{1}{\mathbb{C}{\ell}}}\end{array}$，其中 $|\mathbb{C}_{\ell}|>0$。

The ratio-cut for $\mathit{\check{C}}_{k}$ is defined as:

$\mathit{\check{C}}_{k}$ 的比率割 (ratio-cut) 定义为:

$$
\begin{array}{l}{{\displaystyle{\mathrm{RatioCut}}({\mathcal{C}}{k})\stackrel{\mathrm{def}}{=}\frac{1}{2}\sum_{\ell=1}^{k}\frac{1}{|\mathbb{C}{\ell}|}\sum_{\substack{i,j\in\mathbb{C}{\ell}\times\overline{{\mathbb{C}{\ell}}}}}W_{i j}}}\ {~=\displaystyle{\frac{1}{2}\sum_{\ell=1}^{k}\frac{1}{|\mathbb{C}{\ell}|}{\mathbf{1}}{\mathbb{C}{\ell}}^{\top}}W\left({\mathbf{1}}{n}-{\mathbf{1}}{\mathbb{C}_{\ell}}\right),}\end{array}
$$

where $\mathbb{\ K}$ denotes $\nu\backslash\mathbb{A}$ , the complement of $\mathbb{A}$ in $\nu$ .

其中 $\mathbb{\ K}$ 表示 $\nu\backslash\mathbb{A}$，即 $\mathbb{A}$ 在 $\nu$ 中的补集。

We define the un normalized Laplacian matrix as ${\cal L}{u n}{\stackrel{\mathrm{def}}{=}}{\cal D}-{\cal W}$ , which can be used to express the ratio-cut of $\mathit{\mathcal{C}}_{k}$ as:

我们定义未归一化的拉普拉斯矩阵为 ${\cal L}{u n}{\stackrel{\mathrm{def}}{=}}{\cal D}-{\cal W}$，它可用于将 $\mathit{\mathcal{C}}_{k}$ 的比率割表示为：

$$
\mathrm{RatioCut}(\mathcal{C}{k})=\frac{1}{2}\mathrm{Tr}\left[F_{\mathcal{C}{k}}^{\top}L_{u n}F_{\mathcal{C}_{k}}\right],
$$

where $F_{{\mathcal{C}}_{k}}{\stackrel{\mathrm{def}}{=}}\left[\pmb{f}^{(1)},\dots\pmb{f}^{(k)}\right]\in\mathbb{R}^{n\times k}$ is the ratio assignments matrix.

其中 $F_{{\mathcal{C}}_{k}}{\stackrel{\mathrm{def}}{=}}\left[\pmb{f}^{(1)},\dots\pmb{f}^{(k)}\right]\in\mathbb{R}^{n\times k}$ 为比率分配矩阵。

Since the clusters should be disjoint and different from the naive partitioning ${\nu,\emptyset,\dots,\emptyset}$ , we can express these constraints as $\mathbf{}F_{:,\ell}^{\top}\mathbf{1}_{V}\neq1$ for all $1\leq\ell\leq k$ .

由于簇应当是互不相交且不同于朴素划分 ${\nu,\emptyset,\dots,\emptyset}$ ，我们可以将这些约束表示为 $\mathbf{}F_{:,\ell}^{\top}\mathbf{1}_{V}\neq1$ 对于所有 $1\leq\ell\leq k$ 。

The optimization of the ratio-cut is then equivalent to minimizing a Rayleigh quotient on ${0,1}^{n\times k}$ . Solving eq. (4) on the set ${0,1}^{n}$ is generally an NP- hard problem. The optimization is hence relaxed so that the unknown vectors are in the unit sphere . The objective is then:

比率割的优化问题等价于在 ${0,1}^{n\times k}$ 上最小化瑞利商。在集合 ${0,1}^{n}$ 上求解方程(4)通常是一个NP难问题。因此放宽优化条件，使未知向量位于单位球面上。此时目标函数为：

$$
\begin{array}{r l}{\underset{\boldsymbol{F}}{\mathrm{minimize}}}&{\mathrm{Tr}(\boldsymbol{F}^{\top}\boldsymbol{L}\boldsymbol{F})}\ {\mathrm{subject~to}}&{\boldsymbol{F}^{\top}\boldsymbol{F}=\boldsymbol{I}{k}\mathrm{and}\boldsymbol{F}{:,j}^{\top}\mathbf{1}_{V}\neq1}\end{array}
$$

The minimization of the Rayleigh quotient under the outlined constraints yields the $k$ smoothest eigenvectors of the Laplacian matrix (excluding the trivial first ei gen vector ${\bf1}_{V}$ ). Vertices within the same cluster are anticipated to have similar projections onto the solutions of the relaxed problem. As we ascend the Lapla- cian spectrum, the projections onto the ei gen vectors encapsulate more specific (higher frequency) features. Subsequently, the binary assignments are determined through $k$ -means clustering (Ng et al., 2001) of the relaxed problem’s solution.

在所述约束条件下最小化瑞利商(Rayleigh quotient)可得到拉普拉斯矩阵的$k$个最平滑特征向量(不包括平凡的第一特征向量${\bf1}_{V}$)。预计同一聚类中的顶点在松弛问题解上的投影具有相似性。随着沿拉普拉斯谱向上移动，特征向量上的投影会捕获更具体(更高频)的特征。随后通过松弛问题解的$k$-均值聚类(Ng et al., 2001)确定二值分配。

In this paper, we introduce a novel approach to optimizing eq. (4) that circumvents the necessity for spectral decomposition of extensive matrices or kernels, thereby sidestepping the relaxation of the problem into Euclidean space. Rather, we propose to relax the problem into an optimization over a simplex through the parameter iz ation of the cluster assignment distribution.

在本文中，我们提出了一种优化方程(4)的新方法，该方法避免了对大型矩阵或核进行谱分解的必要性，从而绕过了将问题松弛到欧几里得空间的过程。相反，我们建议通过对聚类分配分布的参数化，将问题松弛为单纯形上的优化。

3 相关工作

There have been many attempts to extend Spectral Clustering to streaming settings (Guha et al., 2000), infinitely countable datasets, or continuous input spaces. For instance, Yoo et al. (2016) proposed an extension to non-parametric Spectral Clustering for data streams, particularly when the kernel similarity is bilinear or can be approximated linearly. With the arrival of each new stream batch, a batch-based Singular Value Decomposition (SVD) is used to embed the stream and realign a facility set consisting of anchor points. These anchor points are then utilized to run a batch $K$ -means algorithm (Shindler & Wong, 2011) to cluster the new points.

已有许多尝试将谱聚类(Spectral Clustering)扩展到流式场景(Guha et al., 2000)、无限可数数据集或连续输入空间。例如，Yoo等人(2016)提出了一种针对数据流的非参数谱聚类扩展方法，尤其适用于核相似度为双线性或可线性近似的情况。每当接收到新的流数据批次时，该方法会使用基于批次的奇异值分解(SVD)来嵌入数据流，并重新调整由锚点组成的设施集。随后利用这些锚点运行批处理$K$均值算法(Shindler & Wong, 2011)对新数据点进行聚类。

However, research on extending Spectral Clustering to the domain of parametric learning, where the optimization variable is a mapping of the input space to the eigenspace, is limited. This is primarily due to the sensitivity of the ei gen decomposition (Alam & Bora, 2005) and the proven difficulty of ei gen decomposition in non-parametric settings when the similarity kernel takes specific forms (Mohan & Monteleoni, 2017).

然而，关于将谱聚类 (Spectral Clustering) 扩展到参数化学习领域的研究仍然有限，其中优化变量是输入空间到特征空间的映射。这主要是由于特征分解 (eigen decomposition) 的敏感性 (Alam & Bora, 2005) ，以及当相似性核采用特定形式时，在非参数化设置中特征分解已被证实的困难性 (Mohan & Monteleoni, 2017) 。

One of the early attempts in this direction was made in the context of Deep Reinforcement Learning (Kulkarni et al., 2016), where the Markov Decision Process (MDP) is decomposed into sub-tasks that navigate the representation space (Machado et al., 2017a). These sub-tasks, also called options (Machado et al., 2017b), correspond to ei gen vectors associated with the largest eigenvalues of the graph Laplacian, where the vertices represent elements of the state space and the similarity is the likelihood of the agent moving from one state to another. Extending such an approach to a clustering setting would require constructing an MDP for which the corresponden ce between ei gen vectors and options holds.

该方向的早期尝试之一出现在深度强化学习领域 (Kulkarni et al., 2016) ，其中马尔可夫决策过程 (MDP) 被分解为在表征空间中导航的子任务 (Machado et al., 2017a) 。这些子任务 (也称为选项 (Machado et al., 2017b) ) 对应于与图拉普拉斯矩阵最大特征值相关联的特征向量，其中顶点表示状态空间的元素，相似度则是智能体从一个状态转移到另一个状态的概率。将这种方法扩展到聚类场景需要构建一个MDP，以保持特征向量与选项之间的对应关系。

Spectral Nets (Shaham et al., 2018) were perhaps the first attempt at explicitly training a neural network to approximate ei gen vectors of the Laplacian for largescale clustering. The learned mapping is a neural network $N_{\theta}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{k}$ , parameterized by $\theta$ , that maps the input vertices to their projection on the subspace spanned by the $k$ -smoothest ei gen vectors of the Laplacian. The network is constructed to ensure that the output features are approximately ortho normal:

谱网络 (Spectral Nets) (Shaham et al., 2018) 可能是首次尝试显式训练神经网络来近似拉普拉斯矩阵的特征向量以实现大规模聚类。该学习映射是一个由参数 $\theta$ 定义的神经网络 $N_{\theta}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{k}$ ，它将输入顶点映射到由拉普拉斯矩阵前 $k$ 个最平滑特征向量张成的子空间上的投影。该网络的设计确保输出特征近似正交归一：

$$
{\frac{1}{n}}\sum_{i=1}^{n}N_{\theta}{(v_{i})}^{\top}N_{\theta}{(v_{i})}\approx I_{k}
$$

We denote by $\hat{N}_{\theta}$ the neural network without the last orthogonal iz ation layer, which is based on the Cholesky decomposition of the estimated covariance

我们用 $\hat{N}_{\theta}$ 表示不包含最后正交化层的神经网络，该层基于估计协方差的 Cholesky 分解。

matrix Σθ ∈ Rk×k:

矩阵 Σθ ∈ Rk×k:

$$
\Sigma_{\theta}=\frac{1}{n}\sum_{i=1}^{n}\hat{N_{\theta}}{(v_{i})}^{\top}\hat{N_{\theta}}{(v_{i})}.
$$

Spectral Nets approach is then based on the minimization of:

谱网络方法基于以下最小化：

$$
\mathcal{L}{s p e c t r a l}(\theta)\stackrel{\mathrm{def}}{=}\sum_{i,j=1}^{n}W_{i j}\left\lVert N_{\theta}(v_{i})-N_{\theta}(v_{j})\right\rVert^{2}
$$

The matrix Πθ Rk×k serves as the parametric counterpart of ${\pmb F}^{\top}{\pmb L}{\pmb F}$ derived from eq. (4). Training the neural network equates to an alternating optimization scheme, where one batch is utilized to estimate $\Sigma_{\theta}$ , while the other is employed to minimize the loss $\mathcal{L}_{\mathit{s p e c t r a l}}(\theta)$ .

矩阵 Πθ Rk×k 作为 ${\pmb F}^{\top}{\pmb L}{\pmb F}$ 的参数化对应项，源自等式 (4)。训练神经网络相当于交替优化方案：一个批次用于估计 $\Sigma_{\theta}$，另一个批次用于最小化损失 $\mathcal{L}_{\mathit{s p e c t r a l}}(\theta)$。

During training, should the input exhibit noise, the Spectral Net’s output may converge to a constant mapping, resulting in the spectral loss approaching zero. Consequently, this approach encounters numerical instability in high-dimensional data scenarios. One common strategy to circumvent this involves offsetting the matrix $\Sigma_{\theta}$ by a scaled identity and establishing an appropriate stopping criterion. However, the resulting algorithm remains highly susceptible to input and similarity noise. This sensitivity becomes particularly evident when benchmarked against non-parametric Spectral Clustering. Spectral Nets achieve competitive performance only when the input consists of the code space from another well-performing pre-trained variational encoder (Jiang et al., 2017).

训练过程中，若输入数据存在噪声，Spectral Net 的输出可能收敛至恒定映射，导致谱损失趋近于零。因此，该方法在高维数据场景下会遭遇数值不稳定性。常见的规避策略是对矩阵 $\Sigma_{\theta}$ 施加缩放单位矩阵偏移，并设定合适的停止准则。然而，所得算法仍极易受到输入噪声和相似性噪声的影响。这种敏感性在与非参数化谱聚类 (Spectral Clustering) 的基准对比中尤为明显。仅当输入数据来自另一表现良好的预训练变分编码器 (Jiang et al., 2017) 的代码空间时，Spectral Nets 才能实现具有竞争力的性能。

Spectral Inference Networks (Spin) (Pfau et al., 2020) adopt a distinct strategy for parametric optimization of the Rayleigh quotient. As defined previously, the loss function takes the form:

谱推理网络 (Spectral Inference Networks, Spin) (Pfau等人, 2020) 采用了一种独特的参数化优化策略来处理瑞利商问题。如前所述，其损失函数形式为：

$$
\mathcal{L}{s p i n}\left(\boldsymbol{\theta}\right)=\mathrm{Tr}\left(\boldsymbol{\Sigma_{\boldsymbol{\theta}}}^{-1}\boldsymbol{\Pi_{\boldsymbol{\theta}}}\right)
$$

While primarily utilized within the domain of Deep Reinforcement Learning, targeting the ei gen functions associated with the largest eigenvalues, this approach can be adapted for loss minimization rather than maximization.

虽然这种方法主要应用于深度强化学习领域，目标是针对与最大特征值相关的特征函数，但它也可以调整为用于损失最小化而非最大化。

Upon differentiation (Petersen & Pedersen, 2012), the full gradient can be expressed as:

微分时 (Petersen & Pedersen, 2012)，完整梯度可表示为：

$$
\operatorname{Tr}\left(\boldsymbol{\Sigma_{\theta}}^{-1}\nabla_{\theta}\Pi_{\theta}\right)-\operatorname{Tr}\left(\boldsymbol{\Sigma_{\theta}}^{-1}(\nabla_{\theta}\boldsymbol{\Sigma_{\theta}})\boldsymbol{\Sigma_{\theta}}^{-1}\Pi_{\theta}\right)
$$

The Spin algorithm estimates $\Sigma_{\theta}$ and $\nabla_{\boldsymbol{\theta}}\Sigma_{\boldsymbol{\theta}}$ utilizing moving averages. It addresses the issue of overlapping ei gen vectors by modifying the gradient to ensure sequential independence in updates: the gradients of the first $\textit{l}$ ei gen functions are made independent of gradients from other components $l+1,\ldots,k$ . This alteration significantly slows down training, in addition to the computational burden of computing $\nabla_{\boldsymbol{\theta}}\Sigma_{\boldsymbol{\theta}}$ , which necessitates calculating $k^{2}$ backward passes and storing $k^{2}\times\operatorname{size}(\theta)$ , thereby increasing memory and computational costs.

Spin算法通过移动平均来估计$\Sigma_{\theta}$和$\nabla_{\boldsymbol{\theta}}\Sigma_{\boldsymbol{\theta}}$。该算法通过修改梯度来解决特征向量重叠问题，确保更新时的顺序独立性：前$\textit{l}$个特征函数的梯度被设计为独立于其他组件$l+1,\ldots,k$的梯度。除了计算$\nabla_{\boldsymbol{\theta}}\Sigma_{\boldsymbol{\theta}}$带来的计算负担外（需要执行$k^{2}$次反向传播并存储$k^{2}\times\operatorname{size}(\theta)$大小的数据），这一改动还会显著降低训练速度，从而增加内存和计算成本。

We assert that the performance of these attempts is ultimately hindered by the difficulty of the eigendecomposition of large kernels. Therefore, we take a different approach to optimizing the graph ratio-cut that circumvents the spectral decomposition.

我们断言，这些尝试的性能最终受到大型核特征分解难度的限制。因此，我们采用了一种不同的方法来优化图比率切割（graph ratio-cut），从而规避了谱分解。

4 PROPOSED METHOD

4 提出的方法

In our probabilistic approach, the minimization of the ratio-cut is addressed differently. Instead of the deterministic assignments $\mathbf{1}{\mathbb{C}_{\ell}}$ , we use the random assignment vector $\mathbf{a}^{(\ell)}\in{0,1}^{n}$ under the assumption that $(\mathbf{a}{i}^{(\ell)}){i}$ are independent random variables such that:

在我们的概率方法中，比率切割最小化问题采用了不同的处理方式。我们不再使用确定性分配 $\mathbf{1}{\mathbb{C}{\ell}}$，而是在假设 $(\mathbf{a}{i}^{(\ell)})_{i}$ 为独立随机变量的前提下，采用随机分配向量 $\mathbf{a}^{(\ell)}\in{0,1}^{n}$，其满足以下条件：

$$
\operatorname*{Pr}\left(v_{i}\in\mathbb{C}{\ell}\right)=\operatorname*{Pr}\left(\mathbf{a}{i}^{(\ell)}=1\right){\stackrel{\mathrm{def}}{=}}P_{i,\ell},
$$

where the rows of the matrix $P\in[0,1]^{n\times k}$ sum to 1: $\textstyle\sum_{\ell=1}^{k}P_{i,\ell}=1$ for all $i\in{1,\ldots,n}$ .

矩阵 $P\in[0,1]^{n\times k}$ 的行和为1：对所有 $i\in{1,\ldots,n}$ 满足 $\textstyle\sum_{\ell=1}^{k}P_{i,\ell}=1$。

The random assignment for each vertex $i$ follows a categorical distribution of parameter $P_{i,\phantom{\dagger}}$ : and the random clustering $\mathit{{C_{k}}}$ is thus parameterized by $P\in[0,1]^{n\times k}$ . We define the random ratio-assignment vector $\mathbf{f}^{(\ell)}$ for cluster $\mathbb{C}_{\ell}$ as:

每个顶点 $i$ 的随机分配遵循参数 $P_{i,\phantom{\dagger}}$ 的类别分布：随机聚类 $\mathit{{C_{k}}}$ 因此由 $P\in[0,1]^{n\times k}$ 参数化。我们将簇 $\mathbb{C}_{\ell}$ 的随机比率分配向量 $\mathbf{f}^{(\ell)}$ 定义为：

$$
\mathbf{f}^{(\ell)}={\frac{1}{\sqrt{|\mathbb{C}_{\ell}|}}}\mathbf{a}^{(\ell)},
$$

where $\begin{array}{r}{\widehat{|\mathbb{C}{\ell}|}=\sum_{i=1}^{n}\mathbf{a}_{i}^{(\ell)}}\end{array}$ , and all the elements of $\mathbf{f}^{(\ell)}$ are in thde inte rvPal $[0,1]$ with the convention $\textstyle{\frac{0}{0}}=1$ .

其中 $\begin{array}{r}{\widehat{|\mathbb{C}{\ell}|}=\sum_{i=1}^{n}\mathbf{a}_{i}^{(\ell)}}\end{array}$ ，且 $\mathbf{f}^{(\ell)}$ 的所有元素都在区间 $[0,1]$ 内，按照惯例 $\textstyle{\frac{0}{0}}=1$ 。

The stochastic counterpart of the quantity introduced earlier in Equation (1) is:

式 (1) 中引入的随机对应量为：

$$
\widehat{\mathrm{RatioCut}}(\mathcal{C}{k})=\frac{1}{2}\sum_{\ell=1}^{k}\sum_{1\leq i,j\leq n}W_{i j}(\mathbf{f}{i}^{(\ell)}-\mathbf{f}_{j}^{(\ell)})^{2}.
$$

Equation (3) can also be extended to the stochastic setting similarly.

式 (3) 同样可以类似地推广到随机设定中。

The next step is to compute the expected ratio-cut for a clustering $\mathit{\check{C}}{k}$ parameterized by ${P}$ . Without loss of generality, we only have to compute $\mathbb{E}\left[(\mathbf{f}{1}^{(\ell)}-\mathbf{f}{2}^{(\ell)})^{2}\right]$ as a function of $_{P}$ due to the linearity of the expectation. We prove the following result in Appendix A.1:

下一步是计算由参数 ${P}$ 决定的聚类 $\mathit{\check{C}}{k}$ 的期望比率割。不失一般性，由于期望的线性性质，我们只需计算 $\mathbb{E}\left[(\mathbf{f}{1}^{(\ell)}-\mathbf{f}{2}^{(\ell)})^{2}\right]$ 作为 $_{P}$ 的函数。我们在附录 A.1 中证明了以下结果：

Lemma 4.1 (Difference expectation). Let $\mathbb{C}$ be a subset of $\nu$ and a its random assignment vector parameterized by $p\in[0,1]^{n}$ . Let f be its random ratioassignment vector. Then we have the following:

引理 4.1 (差分期望). 设 $\mathbb{C}$ 为 $\nu$ 的子集，其随机分配向量由参数 $p\in[0,1]^{n}$ 确定。设 f 为其随机比率分配向量，则有以下结论：

$$
\mathbb{E}\left[(\mathbf{f}{1}-\mathbf{f}{2})^{2}\right]=(p_{1}+p_{2}-2p_{1}p_{2})\mathbb{E}\left[\frac{1}{1+\widehat{\vert\mathbb{C}^{\perp(1,2)}\vert}}\right],
$$

where $\mathbb{C}^{\bot(i,j)}=\mathbb{C}\backslash{v_{i},v_{j}}$

其中 $\mathbb{C}^{\bot(i,j)}=\mathbb{C}\backslash{v_{i},v_{j}}$

To avoid excluding pairs $(i,i){1\leq i\leq n}$ from the summation in Equation (6) each time, we will henceforth assume that $W_{i i}=0$ for all $i\in{1,\ldots,n}$ .

为避免每次从方程(6)的求和中排除 $(i,i){1\leq i\leq n}$ 对，我们此后假设对于所有 $i\in{1,\ldots,n}$ 都有 $W_{i i}=0$。

The random variable $\begin{array}{r}{|\widehat{\mathbb{C}^{\bot(1,2)}}|=\sum_{i=3}^{n}\mathbf{a}_{i}}\end{array}$ is the sum of (n-2) independent (but not ide ntical) Bernoulli random variables, also known as a Poisson binomial. It coincides with the binomial distribution when the Bernoulli random variables share the same parameter. See Appendix A.3 for a compilation of properties of this distribution including the proof of Lemma 4.2.

随机变量 $\begin{array}{r}{|\widehat{\mathbb{C}^{\bot(1,2)}}|=\sum_{i=3}^{n}\mathbf{a}_{i}}\end{array}$ 是 (n-2) 个独立 (但不同分布) 伯努利随机变量的和，也称为泊松二项分布。当伯努利随机变量具有相同参数时，该分布退化为二项分布。关于该分布的属性汇编及引理 4.2 的证明详见附录 A.3。

Lemma 4.2 (Poisson binomial expectation). Let Z be a Poisson binomial random variable of parameter $\pmb{\alpha}=(\alpha_{1},\ldots,\alpha_{m})\in[0,1]^{m}$ . Then we have:

引理 4.2 (泊松二项期望). 设 Z 为参数 $\pmb{\alpha}=(\alpha_{1},\ldots,\alpha_{m})\in[0,1]^{m}$ 的泊松二项随机变量，则有：

$$
\mathbb{E}\left[\frac{1}{1+\mathrm{Z}}\right]=\int_{0}^{1}\prod_{i=1}^{m}(1-\alpha_{i}t)d t
$$

We can now express the expected ratio-cut by combining the results from Lemma 4.1 and Lemma 4.2.

现在我们可以结合引理4.1和引理4.2的结果来表达期望的比率割。

Theorem 4.1 (Ratio-cut expectation). The expected ratio-cut of the random clustering $\mathcal{C}{k}$ parameterized by $P\in[0,1]^{n\times k}$ can be computed as $\mathbb{E}\left[\widehat{\mathrm{RatioCut}}(\mathcal{C}{k})\right]=$ $\textstyle\sum_{\ell=1}^{k}\operatorname{RC}(P_{:,\ell})$ where:

定理 4.1 (Ratio-cut期望). 由参数 $P\in[0,1]^{n\times k}$ 定义的随机聚类 $\mathcal{C}{k}$ 的期望 ratio-cut 可计算为 $\mathbb{E}\left[\widehat{\mathrm{RatioCut}}(\mathcal{C}{k})\right]=$ $\textstyle\sum_{\ell=1}^{k}\operatorname{RC}(P_{:,\ell})$ 其中:

$$
\begin{array}{l}{\displaystyle\mathrm{RC}(\boldsymbol{p})=\frac{1}{2}\sum_{i,j=1}^{n}W_{i j}\left(p_{i}+p_{j}-2p_{i}p_{j}\right)\mathbb{I}(p,i,j)}\ {\displaystyle\mathbb{I}(p,i,j)\overset{d e f}{=}\int_{0}^{1}\prod_{m\neq i,j}^{n}{(1-p_{m}t)d t}}\end{array}
$$

The expression for $\operatorname{RC}(\pmb{p})$ can be further simplified by consolidating the terms linear in $\pmb{p}$ . However, we retain the current formulation as it will prove useful when sampling random pairs $(i,j)$ in the implementation.

$\operatorname{RC}(\pmb{p})$ 的表达式可以通过合并 $\pmb{p}$ 中的线性项进一步简化。不过，我们保留了当前公式，因为在实现过程中对随机对 $(i,j)$ 进行采样时，这将非常有用。

We will drop the cluster index $\ell$ in the next sections and use $p\in[0,1]^{n}$ as the parameter of the random assignment. The goal now is to optimize $\operatorname{RC}(\pmb{p})$ .

我们将在接下来的章节中省略簇索引 $\ell$，并使用 $p\in[0,1]^{n}$ 作为随机分配的参数。现在的目标是优化 $\operatorname{RC}(\pmb{p})$。

4.1 Computation of the expected ratio-cut

4.1 期望比率割的计算

A straightforward method to estimate the expected ratio-cut from Theorem 4.1 involves disc ret i zing the

一种直接估计定理4.1中期望比率割(ratio-cut)的方法是离散化

interval $[0,1]$ to approximate the integral. By using a uniform disc ret iz ation with a step size of $\textstyle{\frac{1}{T}}$ for $T>0$ , we can apply the formula:

区间 $[0,1]$ 来近似积分。通过采用步长为 $\textstyle{\frac{1}{T}}$ 的均匀离散化（其中 $T>0$），可应用公式：

$$
\mathbb{I}(\pmb{p},i,j)=\frac{1}{T}\sum_{t=1}^{T}\prod_{m\neq i,j}(1-p_{i}\frac{t}{T})
$$

The quality of the approximation depends on $T$ , but since the integrated function is polynomial in $t$ , we have the luxury of using a weighted disc ret iz ation scheme that ensures the exact computation of the integral using the following quadrature method proven in Appendix A.4.

近似质量取决于 $T$，但由于被积函数是关于 $t$ 的多项式，我们可以采用加权离散化方案，确保通过附录 A.4 证明的以下求积方法精确计算积分。

Lemma 4.3 (Integral computation via quadrature). Let $m>0$ be an integer and $c_{m}\stackrel{d e f}{=}\left\lfloor\frac{m}{2}\right\rfloor+1$ (the smallest integer such that $2c_{m}\geq m+1.$ ), then there exist $c_{m}$ tuples $(s_{q},t_{q}){1\leq q\leq c{m}}$ such that:

引理 4.3 (基于求积分的积分计算). 设整数 $m>0$ 且 $c_{m}\stackrel{d e f}{=}\left\lfloor\frac{m}{2}\right\rfloor+1$ (这是满足 $2c_{m}\geq m+1$ 的最小整数), 则存在 $c_{m}$ 个元组 $(s_{q},t_{q}){1\leq q\leq c{m}}$ 使得:

$$
\int_{0}^{1}\prod_{i=1}^{m}(1-p_{i}t)d t=\sum_{q=1}^{c_{m}}s_{q}\prod_{i=1}^{m}(1-p_{i}t_{q}),
$$

for all $\pmb{p}\in[0,1]^{m}$ .

对于所有 $\pmb{p}\in[0,1]^{m}$。

The computation of $\mathbb{I}(\pmb{p},i,j)$ for a given pair $(i,j)$ costs $O(n^{2})$ . We can still obtain the total $\operatorname{RC}(\pmb{p})$ in $O(c_{n}n^{2})$ instead of $O(n^{4})$ by computing the $c_{n}$ quantities (Pnm=1 log (1 − pmtq))1≤q≤cn in advance. Another challenge encountered when using the quadrature is the instability of the computations due to the potentially large number of elements in the product. Even with a tight approximation using batch-based estimates, it still costs $O(b^{3})$ , where $b$ is the batch size. Furthermore, the higher the number of clusters, the larger the batch size should be to obtain a good estimate of the integral for different clusters. These challenges are to be expected since the original combinatorial problem is generally NP-hard. See Appendix A.5 for a detailed analysis of the batch-based estimate. Instead, we prove in Appendix A.6 that the expected ratio cut can be upper-bounded by a much more accessible quantity, as shown in Lemma 4.4.

对于给定的一对 $(i,j)$，计算 $\mathbb{I}(\pmb{p},i,j)$ 的复杂度为 $O(n^{2})$。通过预先计算 $c_{n}$ 个量 $(Pnm=1 \log (1 - pmtq)){1\leq q\leq c_n}$，我们仍能在 $O(c_{n}n^{2})$ 而非 $O(n^{4})$ 的时间内获得总 $\operatorname{RC}(\pmb{p})$。

使用数值积分时遇到的另一个挑战是，由于乘积中可能包含大量元素，计算过程可能不稳定。即使采用基于批次的估计进行紧密逼近，其复杂度仍为 $O(b^{3})$，其中 $b$ 是批次大小。此外，聚类数量越多，为了对不同聚类获得良好的积分估计，所需的批次大小也应越大。这些挑战是可以预见的，因为原始组合问题通常是 NP 难的。关于基于批次的估计的详细分析，请参阅附录 A.5。

相反，我们在附录 A.6 中证明了期望比率割可以被一个更容易计算的量上界约束，如引理 4.4 所示。

Lemma 4.4 (Integral upper-bound). We adopt the same notation of Lemma 4.2 where $\mathbf{\delta\alpha}{\alpha}::::::::$ (α1, . . . , αm) ∈ [0, 1]m, and we assume that α d=ef $\textstyle{\frac{1}{m}}\sum_{i=1}^{m}\alpha_{i}>0$ . Then we have:

引理 4.4 (积分上界). 我们采用与引理 4.2 相同的符号表示，其中 $\mathbf{\delta\alpha}{\alpha}::::::::$ (α1, ..., αm) ∈ [0, 1]m，并假设 α d=ef $\textstyle{\frac{1}{m}}\sum_{i=1}^{m}\alpha_{i}>0$。那么我们有：

$$
\int_{0}^{1}\prod_{i=1}^{m}(1-\alpha_{i}t)d t\leq\frac{1}{(m+1)\overline{{\alpha}}}.
$$

The assumption $\overline{{\alpha}}>0$ is not necessary if we extend the property to $\mathbb{R}\cup{+\infty}$ . We can now use Lemma 4.4 to upper-bound $\operatorname{RC}(\pmb{p})$ .

假设 $\overline{{\alpha}}>0$ 在将性质扩展到 $\mathbb{R}\cup{+\infty}$ 时并非必要条件。现在我们可以利用引理4.4来上界 $\operatorname{RC}(\pmb{p})$。

Lemma 4.5 (RC upper-bound). Using the notations of Theorem 4.1 and Lemma 4.4, we have:

引理 4.5 (RC 上界). 使用定理 4.1 和引理 4.4 的符号表示, 我们有:

$$
\mathrm{RC}({\pmb p})\leq\frac{e^{2}}{2n}\frac{1}{\overline{{\pmb p}}}\sum_{1\leq i,j\leq n}w_{i j}(p_{i}+p_{j}-2p_{i}p_{j})
$$

Interpretation Before we proceed with determining an unbiased gradient estimator for the bound in Lemma 4.5, we seek to understand the significance of the various quantities introduced previously. In contrastive learning, the objective can be written as $-2\textstyle\sum_{i j}W_{i j}p_{i}p_{j}$ , whose minimization aims to assign hig hly similar samples to the same cluster. The objective in ratio-cut can be expressed as $\begin{array}{r}{\sum_{i j}W_{i j}\left[p_{i}(1-p_{j})+p_{j}(1-p_{i})\right]}\end{array}$ , which is minimized when dissimilar samples are separated into different clusters, in alignment with the initial goal of minimizing the cut. It is worth noting that the minimum of the term $W_{i j}\left[p_{i}(1-p_{j})+p_{j}(1-p_{i})\right]$ is $W_{i j}$ , and it is achieved when either $v_{i}$ and $v_{j}$ are in different clusters.

解读
在继续确定引理4.5中边界无偏梯度估计量之前，我们需要理解先前引入的各种量的意义。对比学习的目标可表述为 $-2\textstyle\sum_{i j}W_{i j}p_{i}p_{j}$ ，其最小化旨在将高度相似的样本分配到同一簇。比率切割 (ratio-cut) 的目标可表示为 $\begin{array}{r}{\sum_{i j}W_{i j}\left[p_{i}(1-p_{j

[论文翻译]基于概率比率割优化的深度聚类

原文地址：https://arxiv.org/pdf/2502.03405

Deep Clustering via Probabilistic Ratio-Cut Optimization

基于概率比率割优化的深度聚类

Abstract

摘要

1 INTRODUCTION

1 引言

2 BACKGROUND

2 背景

3 相关工作

4 PROPOSED METHOD

4 提出的方法

4.1 Computation of the expected ratio-cut

4.1 期望比率割的计算

[论文翻译]基于概率比率割优化的深度聚类

原文地址：https://arxiv.org/pdf/2502.03405

Deep Clustering via Probabilistic Ratio-Cut Optimization

基于概率比率割优化的深度聚类

Abstract

摘要

1 INTRODUCTION

1 引言

2 BACKGROUND

2 背景

3 RELATED WORK

3 相关工作

4 PROPOSED METHOD

4 提出的方法

4.1 Computation of the expected ratio-cut

4.1 期望比率割的计算