A SOUND DESCRIPTION: EXPLORING PROMPT TEMPLATES AND CLASS DESCRIPTIONS TO ENHANCE ZERO-SHOT AUDIO CLASSIFICATION

ABSTRACT

Audio-text models trained via contrastive learning offer a practical approach to audio classification through natural language prompts, such as “this is a sound of” followed by category names. In this work, we explore alternative prompt templates for zero-shot audio classification and demonstrate the existence of higher-performing options. First, we find that prompt formatting significantly affects performance: simply prompting the models with properly formatted class labels performs competitively with optimized prompt templates and even prompt ensembling. Moreover, we look into complementing class labels with audio-centric descriptions. By leveraging large language models, we generate textual descriptions that prioritize acoustic features of sound events to disambiguate between classes, without extensive prompt engineering. We show that prompting with class descriptions leads to state-of-the-art results in zero-shot audio classification across major ambient sound datasets. Remarkably, this method requires no additional training and remains fully zero-shot.

Index Terms— Zero-shot audio classification, audio-text models, contrastive language-audio pre-training, in-context learning

1. INTRODUCTION

Multimodal contrastive pre-training has been used to train multimodal representation models on large amounts of paired data. This approach leverages contrastive learning to align representations across different modalities, promoting a shared embedding space that improves semantic understanding across modalities. Examples include Contrastive Language-Image Pre-training (CLIP) [1], which aligns visual and textual representations, and the more recent Contrastive Language-Audio Pre-training (CLAP), which extends these principles to align audio and textual representations [2, 3, 4, 5].

Following pre-training, CLAP exhibits a well-structured feature space, yielding robust, general-purpose representations well suited for downstream training. It also demonstrates exceptional transfer ability, as evidenced by its impressive zero-shot performance across classification, captioning, retrieval, and generation tasks [3, 6, 7].

Extensive research on CLIP has revealed that classification scores are significantly influenced by alterations in prompt formulation and language nuances. For instance, varying the description of a concept, using synonyms, or modifying the grammatical structure or wording, substantially affects performance outcomes [8, 9, 10]. Besides, prompts offering more context or specificity tend to yield more accurate results [11, 12, 13].

Similarly, CLAP inherits sensitivity to prompting from its contrastive pre-training approach. Yet the systematic exploration of prompt robustness in CLAP remains limited, although a few works have highlighted the sensitivity of classification to prompt variations [14, 15]. These works, primarily conducted on the ESC50 dataset and limited to at most five prompt templates, shed initial light on these variations. However, robustness to prompt changes is likely to vary across datasets. Addressing this gap, recent efforts have explored alternative approaches, such as prompt tuning strategies and lightweight adapters, to mitigate the reliance on manually engineered prompts [16, 17], with an explicit focus on adapting CLAP to downstream tasks or new domains.

In this work, we propose a tuning-free approach that prompts CLAP models with descriptions of class labels to enhance zero-shot audio classification. While using keywords such as “audio,” “hear,” and “sound” in prompt templates primes the text encoder to focus on audio-related concepts, we hypothesize that enriching prompts with explicit class descriptions can further enhance the model’s ability to clarify the meaning of class labels, particularly in scenarios where labels are ambiguous. Ambiguity stems from both the textual and audio aspects of the data. Textual ambiguity arises from homonyms, where words possess multiple meanings, and from the lack of contextual clues (e.g., “bat” as both an animal and sports equipment). On the audio side, ambiguity arises from acoustically similar sound categories, such as bird vocalizations (e.g., raven vs. crow calls) and musical instruments (e.g., violin vs. viola). Thus, detailed prompts may clarify sounds that rely heavily on context and help disambiguate acoustically similar sounds. Such descriptions can also disambiguate abstract sounds such as “white noise” and compensate for knowledge gaps or limited exposure to certain terms. For instance, clarifying “Geiger counter” as “a detection device that clicks or beeps when detecting radiation” could improve the correlation between audio and text features.

To validate our hypothesis, we leverage Large Language Models (LLMs) for their knowledge of sound semantics. Specifically, we use Mistral to describe the acoustic properties of class labels. Our study demonstrates that using audio-centric descriptions of class labels as prompts helps CLAP better ground acoustic features with semantic descriptions, significantly boosting zero-shot classification scores across major environmental sound datasets. Remarkably, our method even outperforms learnable prompt strategies, without any additional training and while remaining entirely zero-shot.

2. METHODOLOGY

We first describe the zero-shot audio classification task, then our adaptive class description selection strategy, and finally we motivate our LLM-generated class descriptions.

Table 1: Example descriptions of randomly sampled class labels from the datasets considered in this work, generated with Mistral-7B [18].

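Class	Base description	Context description	Ontology description
Mandolin	A stringed instrument played with a plectrum, characterized by its small size, high pitch, and distinctive nasal timbre.	A stringed instrument with a distinctive nasal timbre, often associated with folk or bluegrass music. Typically played by plucking or strumming, producing bright, pleasant melodies.	A stringed instrument with a distinctive nasal timbre, commonly used in folk and popular music.
Rail transport	The sound of a train running on tracks, characterized by the clacking of wheels and the roar of the engine.	The sound of a train moving along tracks, characterized by a continuous rhythmic clacking. Commonly heard in urban and rural areas with railway infrastructure.	The rumbling and metallic clanking produced by a train moving on tracks, with an intensity that varies with speed; belongs to the category of transport-related sounds.
Whistle	A high-pitched, short sound produced by blowing air through a small opening, often used as a signal or warning.	A sharp, short sound produced by blowing air through a small opening, such as a whistle or a musical instrument.	A high-pitched, short sound produced by a whistle or other instrument, often used as a signal or warning.
Stream	The continuous flow of water or another liquid, often characterized by the sound it makes as it passes over rocks or other obstacles.	A continuous sound of flowing water commonly found in natural environments such as rivers, lakes, or waterfalls, characterized by the sound of water passing over rocks or other surfaces.	A continuous stream of sound with a specific rhythmic pattern and timbre, belonging to the category of natural environment sounds.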

2.1. Zero-shot audio classification

Given a set of target categories $C$ and a query audio sample $a$, the zero-shot audio classification protocol in CLAP defines the classification problem as a nearest neighbor retrieval task. The predicted category $\hat{c}$ is determined as follows:

$$
\hat{c}=\operatorname*{arg\,max}_{c\in C}\ \mathrm{sim}\big(\phi_{\mathrm{A}}(a),\phi_{\mathrm{T}}(c)\big), \tag{1}
$$

where $C$ represents the set of class labels, $a$ denotes the input audio, and $\phi_{\mathrm{A}}$ and $\phi_{\mathrm{T}}$ are the audio and text encoders, respectively. The function $\mathrm{sim}(\cdot,\cdot)$ corresponds to the similarity metric, typically the cosine similarity.
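As a minimal sketch of this nearest-neighbor rule, the snippet below picks the class whose text embedding has the highest cosine similarity with the audio embedding. The function names and the three-dimensional toy vectors are assumptions made for illustration; they stand in for the outputs of $\phi_{\mathrm{A}}$ and $\phi_{\mathrm{T}}$ and are not real CLAP embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(audio_embedding, text_embeddings):
    """Return the class label whose text embedding is closest to the audio.

    audio_embedding plays the role of phi_A(a); text_embeddings maps each
    class label c to phi_T(c).
    """
    return max(
        text_embeddings,
        key=lambda c: cosine_similarity(audio_embedding, text_embeddings[c]),
    )


# Toy embeddings, made up for illustration only.
toy_text_embeddings = {
    "dog barking": [0.9, 0.1, 0.0],
    "rain": [0.0, 0.2, 0.9],
    "siren": [0.1, 0.9, 0.1],
}
toy_audio_embedding = [0.8, 0.3, 0.1]

predicted = zero_shot_classify(toy_audio_embedding, toy_text_embeddings)
print(predicted)
```

With a real model, the two inputs would come from the frozen audio and text encoders; the selection rule itself stays the same.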

To enhance zero-shot audio classification, we propose using both class labels and their descriptions to resolve ambiguities. Given a set of target categories $C$ and a set of corresponding descriptions $D$, the predicted category $\tilde{c}$ is determined by:

$$
\tilde{c}=\operatorname*{arg\,max}_{c\in C}\ \mathrm{sim}\big(\phi_{\mathrm{A}}(a),\phi_{\mathrm{T}}(c+d_{c})\big), \tag{2}
$$

where $d_{c}\in D$ is the description corresponding to class $c$, and the $+$ operator denotes the textual combination of the class label $c$ and its description $d_{c}$.
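The textual combination $c+d_{c}$ can be sketched as plain string concatenation. The text does not specify how label and description are joined, so the `": "` separator and the helper names below are assumptions; the example description is the Geiger counter one quoted in the introduction.

```python
def combine(label, description, separator=": "):
    """Textual combination c + d_c. The separator is an assumed choice."""
    return f"{label}{separator}{description}"


def build_prompts(labels, descriptions):
    """Map each class label c to the prompt text c + d_c."""
    return {c: combine(c, descriptions[c]) for c in labels}


descriptions = {
    "Geiger counter": "a detection device that clicks or beeps when detecting radiation",
}
prompts = build_prompts(["Geiger counter"], descriptions)
print(prompts["Geiger counter"])
```

The resulting strings are what would be passed to the text encoder in place of the bare class labels.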

2.2. Adaptive class description selection

We devise an adaptive strategy that incorporates descriptions selectively, only for classes that are potentially ambiguous to the text encoder. Let $\mathrm{P_{class\text{-}only}}(c)$ and $\mathrm{P_{class\text{-}description}}(c)$ denote the classification performance for class $c$ when prompting with class labels only, as in Equation (1), or with class labels and descriptions, as in Equation (2).

The function $\mathbf{M}(c)$ decides whether the prompt for class $c$ should include its description, based on a cross-validated comparison of these two performances.
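The selection rule itself does not appear in this text. One form consistent with the definitions above, stated here as an assumption rather than as the authors' exact formula, keeps the description only when it improves the per-class performance:

$$
\mathbf{M}(c)=
\begin{cases}
c+d_{c}, & \text{if } \mathrm{P_{class\text{-}description}}(c) > \mathrm{P_{class\text{-}only}}(c),\\
c, & \text{otherwise.}
\end{cases}
$$

Under this assumed form, classification would proceed as in Equation (2), with $\phi_{\mathrm{T}}(\mathbf{M}(c))$ in place of $\phi_{\mathrm{T}}(c+d_{c})$.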

2.3. Generation of audio-centric descriptions with LLMs

Given audio event class labels, we propose to use Large Language Models (LLMs) to generate audio-centric descriptions for them automatically, as collecting descriptions manually is labor-intensive. LLMs, trained on vast text data, have a deep understanding of language, which we exploit for their knowledge of sound semantics. Our method, adapted from [19], involves three steps. First, we provide a general description of the task. Second, we combine these instructions with in-context demonstrations, including a few paired label-description examples. Finally, we provide the LLM with the class labels, heuristic constraints, and specific output format details to generate audio-centric descriptions.
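The three steps above can be sketched as a small prompt builder. Everything in this snippet is hypothetical: the function name, the section headings inside the prompt, and the example task text, constraints, and output format are placeholders, not the prompts actually used to generate the descriptions. Only the Geiger counter demonstration is taken from the introduction.

```python
def build_description_prompt(task_description, demonstrations, class_label,
                             constraints, output_format):
    """Assemble a prompt: task description, in-context demos, then the query."""
    # Step 2: a few paired label-description examples.
    demo_lines = [
        f"Label: {label}\nDescription: {description}"
        for label, description in demonstrations
    ]
    # Step 3: heuristic constraints and output format, followed by the label.
    constraint_lines = [f"- {rule}" for rule in constraints]
    parts = [
        task_description,  # Step 1: general description of the task.
        "Examples:",
        "\n\n".join(demo_lines),
        "Constraints:",
        "\n".join(constraint_lines),
        f"Output format: {output_format}",
        f"Label: {class_label}\nDescription:",
    ]
    return "\n\n".join(parts)


example_prompt = build_description_prompt(
    task_description="Describe the acoustic properties of the given sound class.",
    demonstrations=[
        ("Geiger counter",
         "a detection device that clicks or beeps when detecting radiation"),
    ],
    class_label="white noise",
    constraints=["Focus on how the class sounds.", "Use one sentence."],
    output_format="a single plain-text sentence",
)
print(example_prompt)
```

The returned string would be sent to the LLM once per class label; changing the task description and demonstrations is one way to obtain the different description types.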

Using this method, we generated three types of descriptions: base descriptions, context-aware descriptions, and ontology-aware descriptions. All are audio-centric. Base descriptions reflect the acoustic properties and characteristic sounds of the class labels. Context-aware descriptions add details about the typical locations and circumstances of encountering the sounds, including the physical environment, associated objects, and the function of the sound within its context. Ontology-aware descriptions capture the acoustic properties and characteristic sounds of each class label while also considering their relationships with coarse high-level concepts. Table 1 provides a few examples of the generated descriptions. The complete list of class descriptions and the prompts used to generate them are available on our companion website.

3. EXPERIMENTAL SETUP

We detail our experimental approach, including model and dataset selection, evaluation metrics, and experiments to explore different prompt strategies and their impact on classification.

3.1. Models

We adopt two state-of-the-art audio-text models pre-trained via contrastive learning, namely LAION-CLAP (LA) and Microsoft CLAP 2023 (MS). The former utilizes RoBERTa [20] as its text encoder, while the latter leverages GPT-2 [21]. Both models rely on HTS-AT [22] as their audio encoder.

3.2. Datasets and evaluation metrics

Downstream datasets. We select six major environmental sound datasets tailored for either single-class or multi-label classification. These include: ESC50 [23], which contains 50 environmental sound classes with 2k labeled samples of 5 seconds each; US8K [24], comprising 10 urban sound classes and 8k labeled sound excerpts of 4 seconds each; TUT2017 [25], consisting of 15 acoustic scene classes and 52k files of 10 seconds each; FSD50K [26], featuring 51k audio clips of variable length (from 0.3 to 30 seconds each) curated from Freesound and comprising 200 classes; AudioSet [27], a large-scale dataset encompassing 527 classes, with over 2 million human-labeled sound clips of 10 seconds from YouTube videos; and DCASE17-T4 [25], a subset of AudioSet focused on 17 classes related to warning and vehicle sounds, containing 30k audio clips of 10 seconds each.

Evaluation setup and metrics. In our evaluation we consider all available splits (train/val/test) or folds, except for AudioSet, where only the test set is used. Note that some datasets do not allow for a fully zero-shot approach, as some audio files used in the evaluation were part of the pre-training data of the considered frozen CLAP models (e.g., AudioSet and FSD50K). We believe that it is still interesting to analyse the corresponding results, bearing this fact in mind during the discussion. We use accuracy as the metric for single-class classification datasets (ESC50, US8K and TUT2017) and mean Average Precision (mAP) for multi-label classification datasets (FSD50K, AudioSet and DCASE17-T4). For experiments involving class-specific descriptions, a 5-fold cross-validation setting is employed. These folds were constructed on the data considered for evaluation, i.e., all splits/folds for all datasets, except for AudioSet, where the test set is used. In this approach, training folds are used to derive the mapping $\mathbf{M}$ from Equation (3), while test folds are used to assess its generalization. Directly evaluating the mapping without cross-validation would yield overly optimistic results due to overfitting.
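A minimal sketch of how the mapping could be derived on training folds is given below. It assumes single-label accuracy as the per-class performance, a strict "description must be better" rule, and interleaved folds; the record layout and function names are hypothetical, and the actual fold construction and per-class metric may differ.

```python
from collections import defaultdict


def derive_mapping(train_records):
    """Decide, per class, whether the prompt should include the description.

    Each record is (true_class, correct_with_label_only,
    correct_with_description) for one training-fold sample.
    """
    hits_only = defaultdict(int)
    hits_desc = defaultdict(int)
    classes = set()
    for true_class, ok_only, ok_desc in train_records:
        classes.add(true_class)
        hits_only[true_class] += int(ok_only)
        hits_desc[true_class] += int(ok_desc)
    # Both setups see the same samples per class, so comparing hit counts
    # is equivalent to comparing per-class accuracies.
    return {c: hits_desc[c] > hits_only[c] for c in classes}


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k interleaved folds of held-out indices."""
    return [list(range(i, n_samples, k)) for i in range(k)]


# Toy per-sample outcomes, made up for illustration only.
records = [
    ("bat", False, True), ("bat", False, True), ("bat", True, True),
    ("bat", False, False), ("rain", True, True), ("rain", True, False),
    ("rain", True, True), ("siren", True, False), ("siren", True, True),
    ("siren", False, False),
]

for fold, held_out in enumerate(k_fold_indices(len(records), k=5)):
    held_out_set = set(held_out)
    train = [r for i, r in enumerate(records) if i not in held_out_set]
    mapping = derive_mapping(train)
    # The held-out fold would then be re-classified with the prompts
    # selected by `mapping` to measure how well the choice generalizes.
    print(fold, sorted(mapping.items()))
```

Deriving and scoring the mapping on the same samples would select descriptions that happen to help on those samples, which is the overfitting concern raised above.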

3.3. Zero-shot audio classification experiments

Prompting with class labels only. We explore zero-shot audio classification using prompts with sanitized class labels (i.e., replacing underscores in original labels with spaces, e.g., dog_barking becomes dog barking). This is motivated by our early experiments, in which this strategy performed competitively with prompting with “This is a sound of”, which has been preferred in the literature [14, 4]. Here, we systematically study the impact of using only class labels as prompts on classification performance. We examine four different formats to construct the start and end of a prompt: uppercase with a period (e.g., Dog barking.), uppercase without a period (e.g., Dog barking), lowercase with a period (e.g., dog barking.), and lowercase without a period (e.g., dog barking). The format yielding the highest performance for each model, termed CLS, was selected as a reference for subsequent experiments involving class descriptions.
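The four formats can be sketched as follows. The helper names are hypothetical, and treating "uppercase" as capitalizing only the first character (leaving the rest of the label untouched) is an assumption based on the examples given.

```python
def sanitize(label):
    """Replace underscores in a raw dataset label with spaces."""
    return label.replace("_", " ")


def format_label(label, uppercase, period):
    """Build one of the four class-label prompt formats."""
    text = sanitize(label)
    # Only the first character changes case (assumed from the examples).
    first = text[:1].upper() if uppercase else text[:1].lower()
    text = first + text[1:]
    return text + "." if period else text


variants = [
    format_label("dog_barking", uppercase=u, period=p)
    for u in (True, False)
    for p in (True, False)
]
print(variants)
```

Each variant would be evaluated per model, and the best-performing one kept as the CLS reference.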

Prompting with templates. Inspired by CLIP [1], we explore a set of prompt templates as plausible alternatives to “This is a sound of”, all tailored for the zero-shot audio classification task. We curated a set of 33 distinct prompts, drawing some from prior studies [14, 4, 15]. Our objective is to systematically evaluate the performance of these alternative prompts and their ensemble across multiple datasets. Each prompt follows the format Template + class label, e.g., “A sound clip of dog barking.” We thus analyse the performance of three prompt configurations: $\mathrm{PT_{Baseline}}$: the baseline prompt template “This is a sound of”; $\mathrm{PT_{Best}}$: the most effective prompt template identified among the 33 manually crafted alternatives; and $\mathrm{PT_{Ensemble}}$: ensembling text embeddings from all considered prompt templates. Each prompt template begins with an uppercase letter and concludes with a period.
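One common way to ensemble text embeddings, used here as an assumption because the text does not state the exact aggregation, is to average the L2-normalized embeddings of a label across templates and renormalize. In the sketch below, `toy_embed_text` is a made-up stand-in for a CLAP text encoder, and only two of the templates mentioned above are used.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def ensemble_class_embedding(label, templates, embed_text):
    """Average the normalized embeddings of one label across all templates."""
    embeddings = [
        l2_normalize(embed_text(f"{template} {label}."))
        for template in templates
    ]
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalize(mean)


def toy_embed_text(text):
    """Hypothetical text encoder: shifted counts of four letters."""
    lowered = text.lower()
    return [float(lowered.count(ch)) + 1.0 for ch in "aeos"]


templates = ["This is a sound of", "A sound clip of"]
class_embedding = ensemble_class_embedding("dog barking", templates, toy_embed_text)
print(class_embedding)
```

The ensembled vector then takes the place of $\phi_{\mathrm{T}}(c)$ when computing similarities with the audio embedding.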

Prompting with class-specific descriptions. We investigate the impact of combining class labels and their descriptions generated by LLMs. The experimental setups include: CLS: class label only; $\mathrm{CD_{Base}}$: audio-centric definitions generated by Mistral; $\mathrm{CD_{Context}}$: context-aware descriptions; $\mathrm{CD_{Ontology}}$: ontological information related to the class label; and $\mathrm{CD_{Dictionary}}$: definitions (non audio-centric) sourced from the Cambridge Dictionary of English.

4. RESULTS AND DISCUSSION

In this section, we present and discuss the outcomes of our experiments, shedding light on the impact of various prompting strategies and the role of class descriptions in classification

Original paper: https://arxiv.org/pdf/2409.13676v1