[论文翻译]BTS: 桥接文本与声音模态的元数据辅助呼吸音分类


原文地址:https://arxiv.org/pdf/2006.04868v2


BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification

BTS: 桥接文本与声音模态的元数据辅助呼吸音分类

Abstract

摘要

Respiratory sound classification (RSC) is challenging due to varied acoustic signatures, primarily influenced by patient demographics and recording environments. To address this issue, we introduce a text-audio multimodal model that utilizes metadata of respiratory sounds, which provides useful complementary information for RSC. Specifically, we fine-tune a pretrained text-audio multimodal model using free-text descriptions derived from the sound samples' metadata, which includes the gender and age of patients, the type of recording device, and the recording location on the patient's body. Our method achieves state-of-the-art performance on the ICBHI dataset, surpassing the previous best result by a notable margin of 1.17%. This result validates the effectiveness of leveraging metadata and respiratory sound samples in enhancing RSC performance. Additionally, we investigate the model performance in the case where metadata is partially unavailable, which may occur in real-world clinical settings.

呼吸音分类(RSC)由于声学特征差异大而具有挑战性,主要受患者人口统计数据和录音环境影响。为解决这一问题,我们提出一种利用呼吸音元数据的文本-音频多模态模型,为RSC提供有效的补充信息。具体而言,我们使用来自音频样本元数据的自由文本描述(包括患者的性别、年龄、录音设备类型及身体录音部位)对预训练的文本-音频多模态模型进行微调。我们的方法在ICBHI数据集上取得了最先进的性能,以1.17%的显著优势超越之前的最佳结果。这一结果验证了利用元数据和呼吸音样本提升RSC性能的有效性。此外,我们还研究了元数据部分缺失(可能发生在真实临床环境中)时的模型表现。

Index Terms: Respiratory Sound Classification, Pretrained Language-Audio Model, ICBHI, Metadata

关键词: 呼吸音分类, 预训练语言-音频模型, ICBHI, 元数据

1. Introduction

1. 引言

Identifying abnormal respiratory sounds is pivotal for diagnosing and providing timely interventions for respiratory conditions. Automated detection of abnormal respiratory sounds has great potential to improve health and quality of life for those affected by respiratory diseases by identifying risks early and expediting first aid for potentially life-threatening conditions, such as pneumonia or chronic obstructive pulmonary disease. Machine learning approaches have been regarded as a promising way for automated detection of abnormal respiratory sounds. Recently, a number of studies [1, 2, 3, 4, 5, 6, 7, 8, 9] have tackled the respiratory sound classification (RSC) task and notably increased the performance by utilizing models that have been pretrained on large non-medical datasets [10, 11], and then fine-tuned on a respiratory sound dataset [12].

识别异常呼吸音对于诊断呼吸系统疾病和及时干预至关重要。通过早期识别风险并加速对肺炎或慢性阻塞性肺病等潜在危及生命状况的急救,异常呼吸音的自动检测技术有望显著改善呼吸疾病患者的健康水平和生活质量。机器学习方法被视为实现异常呼吸音自动检测的重要途径。近期,多项研究[1, 2, 3, 4, 5, 6, 7, 8, 9]通过利用在大型非医疗数据集[10, 11]上预训练、再在呼吸音数据集[12]上微调的模型,显著提升了呼吸音分类(RSC)任务的性能。

Nevertheless, the inherent heterogeneity of respiratory sound data presents an obstacle to further performance improvement in RSC. The heterogeneity arises from differences in patient demographics, recording devices, and environmental conditions, which can significantly impact the acoustic properties of respiratory sounds [1]. This may lead to poor generalization on unseen data, particularly in cases underrepresented by the training data. ICBHI [12], one of the widely adopted respiratory sound datasets, provides metadata that associates the recorded audio with attributes of patients and recording environments. Such metadata may be useful for addressing difficulties caused by heterogeneity.

然而,呼吸音数据固有的异质性对RSC性能的进一步提升构成了障碍。这种异质性源于患者人口统计学特征、录音设备和环境条件的差异,这些因素会显著影响呼吸音的声学特性 [1]。这可能导致模型在未见数据上泛化能力较差,尤其是在训练数据中代表性不足的病例中。被广泛采用的呼吸音数据集之一ICBHI [12] 提供了将录音与患者属性及录音环境相关联的元数据。此类元数据可能有助于解决由异质性引起的难题。

Some previous work has adopted the metadata associated with respiratory sounds for RSC to mitigate the heterogeneity issue. For instance, incorporating demographic information of patients such as age and gender into the pretraining process provides better representations of respiratory audio samples [5]. Moreover, metadata concerning the recording environment (i.e., stethoscope) also provides useful information. SG-SCL [1] employed domain-transfer techniques to reduce the effect of heterogeneity by regarding different types of recording devices as distinct domains. Despite the potential benefits of leveraging the metadata, these previous works did not fully incorporate it as text data into the model inputs.

一些先前的研究采用了与呼吸音相关的元数据来缓解RSC中的异质性问题。例如,在预训练过程中加入患者的年龄和性别等人口统计信息,能更好地表示呼吸音频样本[5]。此外,关于录音环境(如听诊器)的元数据也提供了有用信息。SG-SCL[1]采用域迁移技术,将不同类型的录音设备视为不同域,以减少异质性的影响。尽管利用元数据具有潜在优势,但这些先前工作并未将其作为文本数据充分整合到模型输入中。

Recent developments in multimodal models, exemplified by Contrastive Language-Image Pre-training (CLIP) [13] for text and image data and Contrastive Language-Audio Pretraining (CLAP) [14, 15] for text and audio data, offer a flexible framework for integrating text data with non-textual data. Several studies [16, 17, 18, 19] have demonstrated the effectiveness of language-EEG multimodal models for sentiment classification and EEG-to-text decoding tasks. Recognizing the success of multimodal models and the demonstrated benefits of multimodal data in healthcare tasks, it is compelling to consider them for RSC, where such methods have not yet been explored.

多模态模型的最新进展,如针对文本和图像数据的对比语言-图像预训练 (CLIP) [13] 以及针对文本和音频数据的对比语言-音频预训练 (CLAP) [14, 15],为整合文本与非文本数据提供了灵活框架。多项研究 [16, 17, 18, 19] 已证明语言-脑电图 (EEG) 多模态模型在情感分类和脑电图到文本解码任务中的有效性。鉴于多模态模型的成功及其在医疗任务中展现的优势,将这类方法应用于尚未探索的 RSC 领域具有显著意义。

In this paper, we take a step in a new direction and fully make use of the respiratory audio metadata by adapting a text-audio multimodal model, aiming not only to leverage the metadata as an additional learning signal, but also to benefit from the further context during the inference stage. Building on the foundation of contrastive language-audio pretrained models, our work incorporates the respiratory audio metadata alongside the sound recordings. To this end, we format the patient's metadata into descriptions derived from key attributes including age, gender, recording device, and recording location on the body, and encode them together with the respiratory sound data into a shared feature representation using the pretrained encoders. With these joint representations, we train a classification head for the RSC task.

本文迈出了新方向的一步,通过适配文本-音频多模态模型,充分利用呼吸音频元数据。我们的目标不仅是将元数据作为额外的学习信号,更要在推理阶段从更丰富的上下文中获益。基于对比式语言-音频预训练模型的基础,我们的工作将呼吸音频元数据与录音数据相结合。为此,我们将患者元数据格式化为基于关键属性(包括年龄、性别、录音设备和身体录音位置)的描述,并通过预训练编码器将其与呼吸音数据共同编码为共享特征表示。利用这些联合表征,我们为呼吸音分类任务训练了一个分类头。

Our approach, which we name BTS (Bridging the Text and Sound modalities), leverages a multimodal text-audio model to fully exploit the potential of respiratory audio metadata, and achieves the state-of-the-art (SOTA) result on the ICBHI dataset, outperforming the previous best [4] by 1.17%. Our results reveal the capability of contrastive language-audio pretraining to improve RSC in both audio-only and multimodal settings. Moreover, we demonstrate that our method retains its performance gains in the absence of metadata during inference. This result suggests that our approach can be adopted in practical clinical settings where additional information other than audio signals may be unavailable.

我们的方法命名为BTS(Bridging the Text and Sound modalities),该方法利用多模态文本-音频模型充分挖掘呼吸音元数据的潜力,在ICBHI数据集上取得了当前最优(SOTA)结果,以1.17%的优势超越了此前的最佳表现[4]。结果表明对比式语言-音频预训练能够提升纯音频和多模态场景下的呼吸音分类(RSC)性能。更重要的是,我们证明了该方法在推理阶段缺失元数据时仍能保持性能优势,这意味着该方案可适用于仅含音频信号的临床实际场景。


Figure 1: An overall illustration of the proposed BTS architecture. The pretrained text and audio encoders extract feature representations of text description derived from metadata and respiratory sound samples, respectively. After the projection, the representations are integrated by a concatenation operation and used for RSC.

图 1: 提出的 BTS 架构整体示意图。预训练文本和音频编码器分别从元数据衍生的文本描述和呼吸音样本中提取特征表示。经过投影后,这些表示通过拼接操作进行整合,并用于呼吸音分类 (RSC)。

Our main contributions are as follows:

我们的主要贡献如下:

2. Method

2. 方法

We introduce Bridging Text and Sound modalities (BTS), an approach that leverages a multimodal text-audio model to fully exploit the potential of respiratory audio metadata. To mitigate the heterogeneity of respiratory sounds, we propose to explicitly utilize the metadata, which we expect to capture the significant sources of acoustic variability. By integrating this metadata, we aim to reduce the heterogeneity issue and improve RSC performance. Toward this goal, we propose the adoption of a multimodal text-audio model for RSC, as depicted in Figure 1.

我们介绍了桥接文本与声音模态(BTS)方法,该方法利用多模态文本-音频模型充分挖掘呼吸音元数据的潜力。为缓解呼吸音异质性问题,我们提出显式利用元数据来捕捉声学变异的主要来源。通过整合这些元数据,我们旨在降低异质性并提升呼吸音分类(RSC)性能。为此,我们提出采用多模态文本-音频模型进行RSC,如图1所示。

2.1. CLAP Model

2.1. CLAP模型

While the metadata of respiratory sounds can be employed for RSC in several different ways, a free-text format is flexible and easily applicable to human-produced data such as medical records. For instance, the metadata could instead be described by a vector of numeric values where each element indicates a different metadata attribute. However, this approach is usually vulnerable to changes in input sources, such as missing data and unseen data types. In contrast, encoders for free-text data are trained to understand the given input, which makes approaches utilizing input data in a free-text format robust to such changes. For this reason, we use CLAP (Contrastive Language-Audio Pretraining) [15] as our starting point. The CLAP model includes both text and audio encoders, which are trained on the large-scale LAION-Audio-630K [15] dataset covering diverse audio data.

虽然呼吸音的元数据可以通过多种不同方式用于RSC,但自由文本格式灵活且易于应用于人工生成的数据(如医疗记录)。例如,元数据可以用数值向量描述,其中每个元素代表不同的元数据属性。然而,这种方法通常对输入源的变化(如数据缺失和未知数据类型)较为敏感。相比之下,自由文本数据编码器经过训练能够理解给定输入,这使得采用自由文本格式输入数据的方法对这些变化具有更强的鲁棒性。因此,我们以CLAP(对比语言-音频预训练)[15]作为起点。CLAP模型包含文本和音频编码器,这些编码器在包含多样化音频数据的大规模LAION-Audio-630K [15]数据集上进行了训练。
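For context, the pretrained CLAP encoders are publicly available. Below is a minimal sketch of obtaining text and audio embeddings with the open-source `laion_clap` package; the checkpoint behavior, embedding size, and file name are assumptions based on that package's public usage examples, not details from the paper.

```python
# Minimal sketch using the open-source `laion_clap` package (assumed API,
# following its public README); the wav file name is hypothetical.
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # downloads a default checkpoint pretrained on LAION-Audio-630K

texts = ["This sound was recorded from the left anterior chest of an adult "
         "male patient with a Meditron stethoscope."]
text_embed = model.get_text_embedding(texts)                                 # approx. shape (1, 512)
audio_embed = model.get_audio_embedding_from_filelist(x=["cycle_0001.wav"])  # approx. shape (1, 512)
```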

Table 1: Examples of generated text descriptions derived from metadata. ‘All’ is the case that includes all attributes: age, sex, recording location, and recording device.

表 1: 从元数据生成的文本描述示例。"All"表示包含所有属性的情况:年龄、性别、记录位置和记录设备。

| 元数据 | 生成的文本描述 |
| --- | --- |
| Age | 这是一位成年患者。 |
| Sex | 这是一位男性患者。 |
| Loc | 这段声音是从左前胸记录的。 |
| Dev | 这段声音是用Meditron听诊器记录的。 |
| Age-Loc-Dev | 这段声音是用Meditron听诊器从左前胸的成年患者身上记录的。 |
| All | 这段声音是用Meditron听诊器从左前胸的成年男性患者身上记录的。 |

Given the text and audio data denoted as $X_{i}^{t}$ and $X_{i}^{a}$ where $i\in[1,N]$ indicates the data index within a batch of size $N$ , CLAP processes the text and audio data independently through dedicated encoders $f_{t}(\cdot)$ and $f_{a}(\cdot)$ for each modality. The embedding vectors produced by the encoders are projected onto a $d$ -dimensional shared embedding space through projection layers $h_{t}(\cdot)$ and $h_{a}(\cdot)$ .

给定文本和音频数据分别表示为$X_{i}^{t}$和$X_{i}^{a}$,其中$i\in[1,N]$表示批次大小为$N$的数据索引。CLAP通过专用编码器$f_{t}(\cdot)$和$f_{a}(\cdot)$独立处理各模态数据,编码器生成的嵌入向量通过投影层$h_{t}(\cdot)$和$h_{a}(\cdot)$映射到$d$维共享嵌入空间。

$$
\begin{aligned}
z_{t} &= h_{t}(f_{t}(X_{i}^{t})), \\
z_{a} &= h_{a}(f_{a}(X_{i}^{a})).
\end{aligned}
\tag{1}
$$


The CLAP model is trained to maximize the similarity between the text and audio embeddings by contrasting them with negative samples (i.e., mismatched text or audio embeddings obtained from $X_{j\in[1,N];j\neq i}^{t}$ or $X_{j\in[1,N];j\neq i}^{a}$).

CLAP模型通过将文本和音频嵌入与负样本(即从$X_{j\in[1,N];j\neq i}^{t}$或$X_{j\in[1,N];j\neq i}^{a}$获取的不匹配文本或音频嵌入)进行对比,训练目标是最大化两者之间的相似度。
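To make this pretraining objective concrete, the following PyTorch sketch implements a symmetric InfoNCE-style contrastive loss over a batch of paired embeddings. The temperature value and normalization details are generic assumptions, not values taken from the CLAP paper.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(z_t: torch.Tensor, z_a: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric text-audio contrastive loss over a batch of N matched pairs.

    z_t, z_a: (N, d) projected embeddings from Eq. (1); tau is a generic
    temperature value (an assumption), not one reported in the paper.
    """
    z_t = F.normalize(z_t, dim=-1)
    z_a = F.normalize(z_a, dim=-1)
    logits = z_t @ z_a.t() / tau                         # (N, N) pairwise similarities
    targets = torch.arange(z_t.size(0), device=z_t.device)
    # Diagonal entries are matched (positive) pairs; all others are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```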

2.2. Text Description Generation for Metadata

2.2. 元数据的文本描述生成

Among the metadata available in the ICBHI [12] dataset, we choose four types of data as follows: age (adult or pediatric) and gender (male or female) of patients, recording location on the chest of the patients (trachea, anterior left, anterior right, posterior left, posterior right, lateral left, or lateral right), and type of recording devices (Meditron, LittC2SE, Litt3200, or AKGC417L). Using the attributes, we construct simple text descriptions. A generated description can include any combination of the attributes, totaling 644 unique texts. Table 1 illustrates a few examples with different combinations of metadata.

在ICBHI [12]数据集的可用元数据中,我们选择以下四类数据:患者年龄(成人或儿童)和性别(男性或女性)、患者胸部录音位置(气管、前左、前右、后左、后右、左外侧或右外侧)以及录音设备类型(Meditron、LittC2SE、Litt3200或AKGC417L)。利用这些属性,我们构建了简单的文本描述。生成的描述可以包含这些属性的任意组合,共计644种独特文本。表1展示了几个不同元数据组合的示例。
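As an illustration of this step, the sketch below composes descriptions from any subset of the four attributes. The wording templates and the location-code mapping are hypothetical, modeled on the examples in Table 1 rather than taken from the authors' released code.

```python
# Hypothetical template-based description generator modeled on Table 1.
LOCATION_NAMES = {
    "Tc": "trachea", "Al": "left anterior chest", "Ar": "right anterior chest",
    "Pl": "left posterior chest", "Pr": "right posterior chest",
    "Ll": "left lateral chest", "Lr": "right lateral chest",
}

def article(phrase: str) -> str:
    return "an" if phrase[:1].lower() in "aeiou" else "a"

def describe(age=None, sex=None, location=None, device=None) -> str:
    """Build a free-text description from any subset of the four attributes."""
    subject = " ".join(w for w in (age, sex) if w)          # e.g. "adult male"
    if not (location or device):
        # Patient-only attributes, cf. the 'Age' and 'Sex' rows of Table 1.
        return f"This is {article(subject)} {subject} patient." if subject else ""
    parts = ["This sound was recorded"]
    if location:
        parts.append(f"from the {LOCATION_NAMES.get(location, location)}")
    if subject:
        parts.append(f"of {article(subject)} {subject} patient")
    if device:
        parts.append(f"with a {device} stethoscope")
    return " ".join(parts) + "."

# 'All' row of Table 1:
print(describe("adult", "male", "Al", "Meditron"))
# -> This sound was recorded from the left anterior chest of an adult male patient
#    with a Meditron stethoscope.
```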

2.3. Bridging Text and Sound Modalities

2.3. 桥接文本与声音模态

As shown in Figure 1, we train the text and audio encoders of CLAP for RSC by using the respiratory sound samples and generated text descriptions. For classification, we concatenate the text and audio representations $z_{t}$ and $z_{a}$ from the text and audio pipelines as described in Figure 1. Consequently, we obtain the combined multimodal representations $z=\mathrm{concat}(z_{t},z_{a})$ where $z\in\mathbb{R}^{N\times2d}$. We then simply add a linear classification layer $g(\cdot)$ with 4 output dimensions followed by a softmax function, and train it with the Cross-Entropy loss $\mathcal{L}_{\mathrm{CE}}$ (division by $N$ is omitted):

如图 1 所示,我们使用呼吸音样本和生成的文本描述来训练 CLAP 的文本和音频编码器以进行 RSC。对于分类任务,我们按照图 1 所述将文本和音频流水线中的表征 $z_{t}$ 和 $z_{a}$ 进行拼接。因此,我们可以获得多模态组合表征 $z=\mathrm{concat}(z_{t},z_{a})$,其中 $z\in\mathbb{R}^{N\times2d}$。随后,我们简单地添加一个 4 维线性层作为分类器 $g(\cdot)$,后接 softmax 函数,并使用交叉熵损失 $\mathcal{L}_{\mathrm{CE}}$ 进行训练(省略了除以 $N$ 的步骤):

$$
\mathcal{L}_{\mathrm{CE}}=-\sum_{i=1}^{n}y_{i}\log\big(\hat{y}_{i}\big),
$$


where $n$ is the number of samples, $y$ is the respiratory sound label $\in \{$normal, crackle, wheeze, both$\}$, and $\hat{y}$ denotes the predicted probabilities obtained by the classifier.

其中 $n$ 为样本数量,$y$ 是呼吸音标签(属于 {normal, crackle, wheeze, both}),$\hat{y}$ 为分类器输出的预测概率。
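A minimal PyTorch sketch of this fusion-and-classification stage is given below. Here `text_branch` and `audio_branch` stand in for the CLAP encoders with their projection heads ($h_{t}(f_{t}(\cdot))$ and $h_{a}(f_{a}(\cdot))$), and the shared dimension d=512 is an assumption.

```python
import torch
import torch.nn as nn

class BTSClassifier(nn.Module):
    """Concatenate projected text/audio embeddings and classify into 4 classes."""

    def __init__(self, text_branch: nn.Module, audio_branch: nn.Module,
                 d: int = 512, num_classes: int = 4):
        super().__init__()
        self.text_branch = text_branch      # stand-in for h_t(f_t(.))
        self.audio_branch = audio_branch    # stand-in for h_a(f_a(.))
        self.classifier = nn.Linear(2 * d, num_classes)   # g(.)

    def forward(self, text_inputs, audio_inputs):
        z_t = self.text_branch(text_inputs)                # (N, d)
        z_a = self.audio_branch(audio_inputs)              # (N, d)
        z = torch.cat([z_t, z_a], dim=-1)                  # (N, 2d)
        return self.classifier(z)                          # logits; softmax is folded into the loss

# Cross-entropy over {normal, crackle, wheeze, both}.
criterion = nn.CrossEntropyLoss()
```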

3. Experimental Setup

3. 实验设置

3.1. Dataset

3.1. 数据集

We utilized the ICBHI Respiratory dataset [12]. The dataset contains a total of approximately 5.5 hours of respiratory sound recordings with pre-defined and balanced splits for training (60%) and test (40%) without patient overlap. There are 4,142 training and 2,756 testing respiratory cycles across four classes. Table 2 illustrates the details of the ICBHI dataset. We binarize the age as adult (over 18 years old) or pediatric (18 years old or under) for simplicity. Other than the age, we follow the metadata information of the official ICBHI records. Body mass index (BMI) data, which is provided only for adult patients, is employed solely for further analysis; for pediatric patients, we calculated it from their weight and height data.

我们使用了ICBHI呼吸音数据集[12]。该数据集共包含约5.5小时的呼吸音录音,并预先定义了平衡的训练集(60%)和测试集(40%)划分,且患者无重叠。数据涵盖四个类别,包含4,142个训练呼吸周期和2,756个测试呼吸周期。表2展示了ICBHI数据集的详细信息。为简化处理,我们将年龄二分为成人(18岁以上)和儿童(18岁及以下)。除年龄外,我们严格遵循官方ICBHI记录中的元数据信息。仅针对成年患者提供的身体质量指数(BMI)数据被单独用于进一步分析,对于非成年患者,我们根据其体重和身高数据计算得出该指标。
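As a small illustration of this preprocessing, the sketch below binarizes age and fills in BMI for pediatric patients from weight and height. The column names are assumptions about the demographic file layout, not the official ICBHI format.

```python
# Illustrative preprocessing of the ICBHI demographic records;
# column names ("age", "bmi", "weight", "height") are assumptions.
import pandas as pd

def preprocess_demographics(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Binarize age: adult (> 18 years) vs. pediatric (<= 18 years).
    out["age_group"] = out["age"].apply(lambda a: "adult" if a > 18 else "pediatric")
    # BMI is provided for adults only; compute it for pediatric patients
    # from weight (kg) and height (cm).
    missing = out["bmi"].isna()
    out.loc[missing, "bmi"] = out.loc[missing, "weight"] / (out.loc[missing, "height"] / 100) ** 2
    return out
```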

3.2. Training Details

3.2. 训练细节

Following the data pre-processing described in [1, 3, 4, 9], we extracted the respiratory cycles from the waveform samples and standardized them to a duration of 8 seconds. We then resampled the audio to 48 kHz to match the pretraining data of CLAP. We employed the CLAP [15] model pretrained on the LAION-Audio-630K [15] dataset for all experiments. The maximum length of the text descriptions is limited to 64 tokens, which was sufficient to avoid truncation of the text. We fine-tuned the models using the Adam optimizer [20] with an initial learning rate of 5e-5. The learning rate was adjusted by cosine scheduling through a total of 50 epochs of training with a batch size of 8. To reduce the impact of random initialization, we conducted the experiments with five different random seeds.

按照[1, 3, 4, 9]中描述的数据预处理方法,我们从波形样本中提取呼吸周期并将其标准化为8秒时长。随后进行48kHz重采样以匹配CLAP的预训练数据。所有实验均采用基于LAION-Audio-630K [15]数据集预训练的CLAP [15]模型。文本描述的最大长度限制为64个token,这足以避免文本截断。我们使用Adam优化器[20]进行微调,初始学习率为5e-5,通过余弦调度在50个训练周期内调整学习率,批量大小为8。为降低随机初始化的影响,实验采用五种不同随机种子进行。
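The following sketch illustrates this pipeline with torchaudio: fixing each respiratory cycle to 8 seconds, resampling to 48 kHz, and setting up the Adam optimizer with a cosine schedule. The padding-by-repetition choice is an assumption borrowed from common practice in prior ICBHI pipelines, not a detail stated in this section.

```python
import torch
import torchaudio

TARGET_SR = 48_000           # CLAP pretraining sample rate
TARGET_LEN = 8 * TARGET_SR   # 8-second respiratory cycles

def load_cycle(path: str) -> torch.Tensor:
    """Load one respiratory cycle, convert to mono, resample to 48 kHz,
    and fix its length to 8 s (padding by repetition, then truncating)."""
    wav, sr = torchaudio.load(path)                         # (channels, samples)
    wav = wav.mean(dim=0)                                   # mono
    wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    if wav.numel() < TARGET_LEN:
        reps = -(-TARGET_LEN // wav.numel())                # ceiling division
        wav = wav.repeat(reps)
    return wav[:TARGET_LEN]

def build_optimizer(model: torch.nn.Module):
    """Adam with lr 5e-5 and cosine decay over the 50 training epochs (Sec. 3.2)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
    return optimizer, scheduler
```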

3.3. Metrics

3.3. 指标

We adopt the Specificity ($S_{p}$), Sensitivity ($S_{e}$), and their average (Score) as performance metrics for RSC, following the definitions in [12]. All reported values of $S_{p}$, $S_{e}$, and Score are the mean and variance from the five runs with different seeds.

我们采用特异性 $(S_{p})$、灵敏度 $(S_{e})$ 及其平均值(Score)作为RSC的性能指标,遵循[12]中的定义。所有报告的 $S_{p}$、$S_{e}$ 和 Score 值均为五次不同种子运行结果的均值与方差。
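Under the ICBHI convention [12], $S_{p}$ is the accuracy on normal cycles and $S_{e}$ is the accuracy on abnormal cycles (crackle, wheeze, and both must each be predicted as their correct class). A small NumPy sketch of this computation is shown below.

```python
import numpy as np

CLASSES = ["normal", "crackle", "wheeze", "both"]   # index 0 = normal

def icbhi_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Sp: accuracy on normal cycles; Se: accuracy on abnormal cycles;
    Score: the average of the two (ICBHI convention)."""
    normal = (y_true == 0)
    sp = float(np.mean(y_pred[normal] == y_true[normal]))
    se = float(np.mean(y_pred[~normal] == y_true[~normal]))
    return {"Sp": sp, "Se": se, "Score": (sp + se) / 2}
```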

3.4. Baselines

3.4. 基线方法

We compare the proposed method with previous studies including the current SOTA method [4], which uses the Audio Spectrogram Transformer (AST) [21] as a backbone model. We also consider the result based solely on the audio embedding of CLAP ($z_{a}$ in Equation (1)) as an additional baseline, which we denote as Audio-CLAP.

我们将所提出的方法与之前的研究进行比较,包括当前的最先进方法(SOTA) [4] (该方法使用音频频谱Transformer (AST) [21] 作为骨干模型)。我们还考虑了仅基于CLAP音频嵌入 $\scriptstyle{z_{a}}$ (公式(1))的结果作为额外基线,记为Audio-CLAP。

Table 2: Details of the ICBHI dataset including the number of audio samples for each class and the types of metadata. L/R stands for left or right.

表 2: ICBHI数据集的详细信息,包括每个类别的音频样本数量和元数据类型。L/R表示左或右。

| Lung Sound Label | Train | Test | Sum |
| --- | --- | --- | --- |
| Normal | 2,063 | 1,579 | 3,642 |
| Crackle | 1,215 | 649 | 1,864 |
| Wheeze | 501 | 385 | 886 |
| Both | 363 | 143 | 506 |

| Metadata Type | Metadata Label |
| --- | --- |
| Age | Adult, Pediatric |
| Sex | Male, Female |
| Location | Trachea, L/R Anterior, L/R Posterior, L/R Lateral |
| Stethoscope | Meditron, LittC2SE, Litt3200, AKGC417L |
| Others | BMI (Adult only), Weight/Height (Pediatric only) |

4. Results

4. 结果

4.1. Main Results

4.1. 主要结果

Table 3 presents comprehensive ICBHI results including our method. Our method achieves a new SOTA by 1.17% improvement fr