[Paper Translation] Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification


Original paper: https://arxiv.org/pdf/2409.00698v2


Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification


1UCLouvain, Belgium 2UMons, Belgium 3ETS Montreal, Canada


Abstract—Vision-Language Models for remote sensing have shown promising uses thanks to their extensive pre-training. However, their conventional usage in zero-shot scene classification methods still involves dividing large images into patches and making independent predictions, i.e., inductive inference, thereby limiting their effectiveness by ignoring valuable contextual information. Our approach tackles this issue by utilizing initial predictions based on text prompting and patch affinity relationships from the image encoder to enhance zero-shot capabilities through transductive inference, all without the need for supervision and at a minor computational cost. Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements over inductive zero-shot classification. Our source code is publicly available on GitHub: https://github.com/elkhouryk/RS-TransCLIP

Index Terms—remote sensing, scene classification, vision-language models, zero-shot, transductive inference

I. INTRODUCTION


Remote Sensing (RS) imagery has become an effective tool for monitoring the surface of the Earth. It has given rise to several applications, ranging from environmental monitoring [1, 2] and precision agriculture [3, 4] to emergency disaster response [5, 6]. All of these tasks require precise and quick scene classification to extract useful insights from highly complex visual data.

Linking images with text descriptions has been an effective approach for learning granular visual representations [7, 8]. While this idea seemed powerful, pioneering works in the field of RS [9, 10] were limited by computational budgets and the quantity of available RS data, both of which have been significant bottlenecks for generalization and robustness capabilities [11]. More recently, Vision-Language Models (VLMs) like CLIP [12] have overcome these limitations by leveraging a new pre-training paradigm that uses large-scale image-text pair datasets for unsupervised contrastive learning. These models have demonstrated high capability for numerous downstream tasks, including efficient zero-shot image classification by prompting arbitrary candidate class descriptions, e.g., "a satellite photo of a [class].", sometimes even surpassing supervised competitors [12]. Inspired by these promising results, the RS community has worked on developing large image-text RS datasets [13–16], leading to rapid progress in zero-shot scene classification benchmarks [17].
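To make this inductive zero-shot protocol concrete, here is a minimal PyTorch sketch of CLIP-style zero-shot scoring, assuming image and prompt embeddings have already been extracted; the tensor shapes and temperature value are illustrative, not taken from the paper.

```python
import torch

def zero_shot_predict(image_feats: torch.Tensor,
                      text_feats: torch.Tensor,
                      tau: float = 0.01) -> torch.Tensor:
    """Inductive zero-shot classification: each image embedding is scored
    independently against every class-prompt embedding."""
    # L2-normalize so that dot products become cosine similarities
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = image_feats @ text_feats.T / tau  # (N, K) similarity logits
    return logits.softmax(dim=-1)              # per-image class probabilities

# Toy usage: 100 patches, 10 candidate classes, ViT-L/14 embedding size 768
probs = zero_shot_predict(torch.randn(100, 768), torch.randn(10, 768))
predictions = probs.argmax(dim=-1)             # independent per-patch decisions
```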


Fig. 1: Top-1 accuracy of RS-TransCLIP, on ViT-L/14 RS VLMs, for zero-shot scene classification across 10 datasets.


In remote sensing scene classification, both the large size of the images and the need for granular information pose challenges. To make high-resolution inference tractable, it is common practice to divide the images into smaller patches and generate predictions for each patch individually; this is known as inductive inference. Another paradigm, known as transductive inference [18, 19], has shown that jointly considering multiple instances at prediction time can improve the prediction accuracy by accounting for the statistical distribution of instances in the embedding space [20, 21]. Despite its large potential, transductive inference has been largely overlooked in RS within the context of VLMs. We aim to address this gap by introducing an efficient transductive method that operates exclusively within the embedding space, i.e., in a black-box setup after feature extraction.

In a zero-shot classification setting, class-specific textual prompts are mapped to a shared embedding space, generating an individual pseudo-label for each image patch. In a traditional inductive inference process, predictions are generated by utilizing initial pseudo-labels to identify the most confident class, with each patch predicted individually. In contrast, our work envisions transductive inference in a zero-shot classification setting. As shown in Fig. 2, this approach leverages the data structure within the feature space to account for instance relations, enabling collective prediction of all points simultaneously. Our proposed objective function can be viewed as a regularized maximum-likelihood estimation, constrained by a Kullback-Leibler divergence penalty that integrates the aforementioned initial pseudo-labels and a Laplacian term that constrains similar patches to have similar predictions.

Contribution: We introduce RS-TransCLIP, a transductive algorithm that enhances RS VLMs without requiring any labels, adding only a negligible computational cost to the overall inference time. Fig. 1 highlights the significant boost that RS-TransCLIP offers on state-of-the-art RS VLMs.

II. RELATED WORK


A. Vision-Language Models for Remote Sensing


Because foundation models are trained on natural images, there is an active research effort to build domain-specific versions of these models. This is prevalent in medical imaging, where VLMs have shown promising results [22, 23] in improving image-text retrieval and few-shot classification. The RS community has followed suit, working on creating extensive image-text datasets by scraping and filtering public satellite and UAV imagery sources [13–16]. This has led to the development of several VLMs fine-tuned on various downstream tasks [24–28], with many of them showing strong performances in zero-shot scene classification [11, 13, 15].

B. Transductive Inference in Vision-Language Models

In the few-shot literature, transduction leverages both the few labeled samples and the unlabeled test data, outperforming inductive methods [29–32]. However, when applied to VLMs, these transductive methods face significant performance drops [20, 21] since they are based solely on the vision features. This motivated very recent transductive methods in computer vision to explicitly leverage the textual modality alongside image embeddings, a capability not present before the emergence of VLMs [20, 21, 33]. Building on these advances and the transductive-inference zero-shot objective described in [21], our work enhances the predictive accuracy of pretrained RS VLMs without the need for any supervision.


Fig. 2: (a) VLMs assign each image to its closest text embedding and (b) RS-TransCLIP exploits the image-text structure to enhance the predictions without any additional labels.

III. METHOD

The transductive approach employed by RS-TransCLIP is based on the hypothesis that the data structure within the feature space can be modeled as a mixture of Gaussian distributions. As a result, the RS-TransCLIP objective function integrates this hypothesis alongside affinity relationships among patches and initial text-based pseudo-labels to minimize prediction deviation. The intuition behind the proposed transductive approach is depicted in Fig. 2.

A. Variable Definition


In an inductive approach, predictions are made individually using only the initial pseudo-label $\hat{\mathbf{y}}$. Conversely, in the proposed transductive approach, predictions are made simultaneously by modeling the feature space using three variables, $\mathbf{z}$, $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, which can be split into two categories:

Assignment variables — where $\mathbf{z}$ is defined as:

$$\mathbf{z} = (\mathbf{z}_i)_{i \in Q}, \qquad \mathbf{z}_i \in \Delta_K,$$

with $K$ the number of classes, $Q$ the sample indices set and $\Delta_K$ the $K$-dimensional probability simplex (prediction space).

Gaussian Mixture Model (GMM) variables — where the mean $\boldsymbol{\mu}$ and the covariance $\boldsymbol{\Sigma}$ are defined as:

$$\boldsymbol{\mu} = (\boldsymbol{\mu}_k)_{1 \leq k \leq K}, \quad \boldsymbol{\mu}_k \in \mathbb{R}^{d}, \qquad \boldsymbol{\Sigma} \in \mathbb{R}^{d \times d},$$

with $d$ the embedding dimension. Note that $\boldsymbol{\Sigma}$ is shared among classes to decrease the number of parameters.
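In code, these variables are just three tensors. A minimal sketch of their shapes and a simple initialization follows; the diagonal parameterization of $\boldsymbol{\Sigma}$ is an assumption made here to keep the example cheap in high dimension (the text only states that the covariance is shared across classes):

```python
import torch

N, K, d = 1000, 10, 768   # |Q| samples, K classes, embedding dim (illustrative values)

# Assignment variables: one point on the K-dimensional probability simplex per sample
z = torch.full((N, K), 1.0 / K)   # each row sums to 1 (uniform initialization)

# GMM variables: one mean per class, and a single covariance shared by all classes
mu = torch.zeros(K, d)            # in practice initialized from the data (see Sec. C)
sigma = torch.ones(d)             # diagonal of the shared covariance (assumption)
```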

B. RS-TransCLIP objective function


The goal is to minimize the objective function $\mathcal{L}$, composed of three terms: an unsupervised GMM clustering term, an affinity-based Laplacian regularization term and a divergence-driven Kullback-Leibler (KL) regularization term. The terms of $\mathcal{L}$, written in Eq. (1), are detailed hereafter:

$$\mathcal{L}(\mathbf{z}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \underbrace{-\frac{1}{|Q|} \sum_{i \in Q} \mathbf{z}_i^{\top} \log \mathbf{p}_i}_{\text{GMM clustering}} \;\underbrace{- \sum_{i \in Q} \sum_{j \in Q} w_{ij}\, \mathbf{z}_i^{\top} \mathbf{z}_j}_{\text{Laplacian regularization}} \;+\; \underbrace{\sum_{i \in Q} \mathrm{KL}(\mathbf{z}_i \,\|\, \hat{\mathbf{y}}_i)}_{\text{KL regularization}} \tag{1}$$
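Given the reconstruction of Eq. (1) above, the objective can be evaluated in a few lines. The quantities `log_p`, `w` and `y_hat` correspond to the three terms detailed in the paragraphs that follow; this is a sketch, not the reference implementation:

```python
import torch

def transductive_objective(z: torch.Tensor, log_p: torch.Tensor,
                           w: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
    """Evaluate the three terms of Eq. (1) for assignments z of shape (N, K)."""
    n = z.shape[0]
    gmm = -(z * log_p).sum() / n                        # GMM clustering term
    laplacian = -((z @ z.T) * w).sum()                  # Laplacian affinity term
    kl = (z * (z.clamp_min(1e-12).log()                 # KL(z_i || y_hat_i), summed over i
               - y_hat.clamp_min(1e-12).log())).sum()
    return gmm + laplacian + kl
```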


GMM clustering — The goal of this term is to maximize the similarity between the assignment variables $\mathbf{z}_i$ and the likelihood $\mathbf{p}_i$. In our case, we model the likelihood of target data as a balanced mixture of $K$ multivariate Gaussian distributions. Each distribution represents a class $k$ with an associated mean vector $\boldsymbol{\mu}_k$ and a covariance matrix $\boldsymbol{\Sigma}$. Defining $\mathbf{f}_i \in \mathbb{R}^{d}$ as the image embedding of sample $i$, we set $p_{i,k}$ as the probability that sample $i$ is generated by the Gaussian distribution of class $k$:

$$p_{i,k} \propto \exp\left(-\tfrac{1}{2}\,(\mathbf{f}_i - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{f}_i - \boldsymbol{\mu}_k)\right).$$
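Since $\boldsymbol{\Sigma}$ is shared across classes, the Gaussian normalization constant is identical for every class and cancels once $\mathbf{p}_i$ is normalized over $k$. A sketch of the log-likelihood computation, reusing the diagonal-covariance assumption from above:

```python
import torch

def gaussian_log_likelihood(f: torch.Tensor, mu: torch.Tensor,
                            sigma_diag: torch.Tensor) -> torch.Tensor:
    """log p_{i,k} up to an additive constant: half the squared Mahalanobis
    distance between each embedding f_i and each class mean mu_k, under a
    covariance shared by all classes (diagonal here)."""
    diff = f.unsqueeze(1) - mu.unsqueeze(0)                # (N, K, d)
    return -0.5 * (diff.pow(2) / sigma_diag).sum(dim=-1)   # (N, K)
```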

Laplacian regularization — The aim of this term is to encourage pairs of samples with high affinity to have similar assignment variables. In our case, we define non-negative affinities $w_{ij}$ using the cosine similarities between the image embeddings of each sample (see line 2 in Algorithm 1). Note that affinities can be tailored for each specific use case, provided the affinity matrix is positive semi-definite. This ensures the concavity of the term, which in turn guarantees the convergence of the decoupled updates (refer to [21] for details).
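One possible instantiation of these affinities is sketched below; clipping negative cosine similarities to zero and symmetrizing are choices made for this example, and a dense matrix is used for simplicity (the actual construction in Algorithm 1 may differ):

```python
import torch

def cosine_affinities(f: torch.Tensor) -> torch.Tensor:
    """Non-negative affinities w_ij from cosine similarities between image embeddings."""
    f = f / f.norm(dim=-1, keepdim=True)   # unit-norm rows
    w = (f @ f.T).clamp_min(0)             # cosine similarity, negatives clipped to zero
    w.fill_diagonal_(0)                    # no self-affinity (a choice for this sketch)
    return (w + w.T) / 2                   # enforce symmetry
```

In practice, the affinity matrix is often sparsified to each sample's nearest neighbors so that the Laplacian term stays cheap when $|Q|$ is large.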

KL regularization — The purpose of this term is to prevent the assignment variables from deviating significantly from the initial pseudo-labels. In our case, we obtain the pseudo-labels $(\hat{\mathbf{y}}_i)_{i \in Q}$ by applying the softmax function to the vector whose components are obtained by computing the dot product between the image embedding $\mathbf{f}_i$ and all text embeddings $\mathbf{t}_k \in \mathbb{R}^{d}$, scaled by the temperature factor $\tau$ used during VLM pre-training (see line 1 in Algorithm 1). This allows us to integrate the text knowledge into the optimization process.
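Line 1 of Algorithm 1 then amounts to a temperature-scaled softmax over image-text similarities. A sketch, with $\tau = 0.01$ shown purely as a typical CLIP pre-training value:

```python
import torch

def text_pseudo_labels(f: torch.Tensor, t: torch.Tensor,
                       tau: float = 0.01) -> torch.Tensor:
    """Initial pseudo-labels: softmax over the dot products between each image
    embedding f_i and all text embeddings t_k, scaled by the temperature tau."""
    f = f / f.norm(dim=-1, keepdim=True)
    t = t / t.norm(dim=-1, keepdim=True)
    return (f @ t.T / tau).softmax(dim=-1)   # (N, K), each row on the simplex
```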

C. Solving procedure


We refer to [21] for the derivation and optimization details of the convergence procedure for the objective function $\mathcal{L}$. The pseudo-code for the RS-TransCLIP procedure is outlined in Algorithm 1. Note that image and text embeddings are only computed once at the start. After the affinities $w_{ij}$ and the pseudo-labels $\hat{\mathbf{y}}_i$ are determined, the assignment variables $\mathbf{z}_i$ and the GMM variables $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$ are initialized and updated according to the update rules listed in Eq. (2), (3) and (4), respectively. The update rules vary depending on the two variable categories:

Iterative decoupled updates — The assignment variable $\mathbf{z}_i$ is updated at each iteration $l$, as it depends on its neighbors $\mathbf{z}_j$. Note that the update rule of $\mathbf{z}_i$ can be parallelized, which makes the convergence procedure computationally efficient.

$$\mathbf{z}_i^{(l+1)} = \operatorname{softmax}\!\left(\log \mathbf{p}_i + \log \hat{\mathbf{y}}_i + \sum_{j \in Q} w_{ij}\, \mathbf{z}_j^{(l)}\right) \tag{2}$$

Closed-form updates — With $\mathbf{z}_i$ fixed, obtained following the iterative decoupled updates, we can calculate the closed-form updates for the GMM variables $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$:

$$\boldsymbol{\mu}_k = \frac{\sum_{i \in Q} z_{i,k}\, \mathbf{f}_i}{\sum_{i \in Q} z_{i,k}} \tag{3}$$

$$\boldsymbol{\Sigma} = \frac{1}{|Q|} \sum_{i \in Q} \sum_{k=1}^{K} z_{i,k}\, (\mathbf{f}_i - \boldsymbol{\mu}_k)(\mathbf{f}_i - \boldsymbol{\mu}_k)^{\top} \tag{4}$$
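Putting the pieces together, the whole solving procedure fits in a short loop over the update rules. The sketch below follows the reconstructions of Eqs. (2) to (4) above, with a diagonal shared covariance and plain cosine affinities; it illustrates the structure of the procedure and is not a verified reimplementation of Algorithm 1:

```python
import torch

def rs_transclip_sketch(f: torch.Tensor, t: torch.Tensor,
                        iters: int = 10, tau: float = 0.01) -> torch.Tensor:
    """Transductive zero-shot inference over cached embeddings:
    f is (N, d) image embeddings, t is (K, d) text embeddings."""
    f = f / f.norm(dim=-1, keepdim=True)
    t = t / t.norm(dim=-1, keepdim=True)
    y_hat = (f @ t.T / tau).softmax(dim=-1)        # Algorithm 1, line 1: pseudo-labels
    w = (f @ f.T).clamp_min(0)                     # Algorithm 1, line 2: affinities
    log_y = y_hat.clamp_min(1e-12).log()
    z = y_hat.clone()                              # initialize assignments
    for _ in range(iters):
        # closed-form GMM updates with z fixed, Eqs. (3) and (4)
        mu = (z.T @ f) / z.sum(dim=0).clamp_min(1e-12).unsqueeze(-1)   # (K, d)
        diff = f.unsqueeze(1) - mu.unsqueeze(0)                        # (N, K, d)
        sigma = (z.unsqueeze(-1) * diff.pow(2)).sum(dim=(0, 1)) / f.shape[0] + 1e-6
        # decoupled z-update, Eq. (2); all rows are computed in parallel
        log_p = -0.5 * (diff.pow(2) / sigma).sum(dim=-1)               # (N, K)
        z = (log_p + log_y + w @ z).softmax(dim=-1)
    return z   # transductive class probabilities for every patch
```

Note that, as in the paper, the embeddings are encoded only once; every iteration operates purely on cached features, which is what keeps the transductive overhead negligible.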