[Paper Translation] STEAD: Spatio-Temporal Efficient Anomaly Detection for Time and Compute Sensitive Applications


Source: https://arxiv.org/pdf/2503.07942v1


STEAD: Spatio-Temporal Efficient Anomaly Detection for Time and Compute Sensitive Applications


Abstract— This paper presents a new method for anomaly detection in automated systems with time and compute sensitive requirements, such as autonomous driving, with unparalleled efficiency. As systems like autonomous driving become increasingly popular, ensuring their safety has become more important than ever. Therefore, this paper focuses on how to quickly and effectively detect various anomalies in such systems, with the goal of making them safer and more effective. Many detection systems have been developed with great success under spatial contexts; however, there is still significant room for improvement when it comes to temporal context. While there is substantial work on this task, there is minimal work on the efficiency of models and their applicability to scenarios that require real-time inference, e.g., autonomous driving, where anomalies need to be detected the moment they come into view. To address this gap, we propose STEAD (Spatio-Temporal Efficient Anomaly Detection), whose backbone is built from $(2{+}1)$D convolutions and Performer linear attention, ensuring computational efficiency without sacrificing performance. When tested on the UCF-Crime benchmark, our base model achieves an AUC of $91.34\%$, outperforming the previous state-of-the-art, and our fast version achieves an AUC of $88.87\%$ with $99.70\%$ fewer parameters, outperforming the previous state-of-the-art as well. The code and pretrained models are made publicly available at https://github.com/agao8/STEAD.


I. INTRODUCTION


The rapid proliferation of automated systems powered by artificial intelligence and machine learning has underscored the critical need for robust anomaly detection mechanisms. In these applications, the ability to detect anomalous events in real time is paramount to ensuring safety, security, and operational continuity. As such, anomaly detection (AD) and video anomaly detection (VAD) systems, which identify deviations and abnormalities from normal patterns, have emerged as a pivotal research area in autonomous driving. As image classification models become increasingly powerful, it is no surprise that anomaly detection models, which perform a sub-task of image classification, have seen great success in spatial scenarios without temporal contexts, with numerous models reaching near-perfect results on popular datasets like the MVTec Anomaly Detection dataset [1].


“What is an anomaly in autonomous driving?”


In the context of autonomous driving, the definition of an “anomaly” is relative to the definition of “normal”, and unfortunately, what is considered “normal” is itself dependent on the scenario. Yet, while many previous works operate solely under spatial contexts, it is unreasonable to build an understanding of what is “normal” in a completely unseen environment without a temporal context. For example, in Figure 1, which depicts frames from a video of an active road [2], it is impossible to accurately conclude that an anomaly is present just by evaluating the video at frame $t$. However, once the surrounding frames are brought into consideration through a temporal context, it can be seen that all the vehicles are moving from the top of the frame to the bottom, except for the vehicle highlighted in red in the top left. The vehicle highlighted in red clearly should not be immobile, and it is therefore an anomaly, yet that conclusion could not be reached by evaluating the video at frame $t$ alone. Despite scenarios like this, the challenges and benefits presented by the temporal dimension inherent to video data remain inadequately addressed for ensuring a safe driving environment.



Fig. 1. Normal vs. abnormal: without any temporal context information, the frame at time-step $t$ seems completely normal. However, when considering the frames at other time-steps, it can be seen that there is a vehicle in the top left, highlighted in red, that is not moving despite being on the road, while all the other vehicles are moving as expected. As such, the entire video should be labeled as containing an abnormal event, yet that conclusion cannot be reached from any single frame of the video.


Traditional approaches to identifying abnormal behavior in autonomous driving relied on handcrafted features and statistical models to model normal behavior and flag deviations. However, these methods often falter in complex, dynamic environments due to their limited capacity to capture high-dimensional spatio-temporal patterns. The advent of deep learning revolutionized the field, with Convolutional Neural Networks being the dominant force behind most tasks within computer vision and, more recently, Vision Transformers gaining popularity and success. Despite these advancements, existing models frequently prioritize accuracy at the expense of computational efficiency, rendering them impractical for real-time applications, where their high computational cost and memory footprint hinder deployment in latency-sensitive scenarios.


A critical gap in the field lies in balancing efficiency with performance. Many state-of-the-art methods, such as transformer-based architectures, suffer from quadratic complexity in their attention mechanisms or from excessive parameter counts, limiting scalability. Furthermore, the reliance on inefficient feature extractors like I3D [3] exacerbates these challenges. Recent efforts to address efficiency, such as Linformer [4] and Performers [5], demonstrate the potential of linear-time attention mechanisms but have not been fully exploited for identifying abnormal behavior in autonomous driving scenarios.


This paper introduces STEAD (Spatio-Temporal Efficient Anomaly Detection), a novel architecture designed to bridge the efficiency-performance divide in monitoring and identifying abnormal behavior in autonomous driving. Moreover, we developed two versions: STEAD-Base, which focuses on achieving high performance, and STEAD-Fast, which prioritizes efficiency without sacrificing too much performance. In summary, our contributions are threefold:

1) We propose an efficient backbone for video anomaly detection built from $(2{+}1)$D convolutions and Performer linear attention.
2) We develop two variants, STEAD-Base for maximum performance and STEAD-Fast for compute-constrained, real-time settings.
3) We achieve state-of-the-art AUC on the UCF-Crime benchmark, with the fast variant using $99.70\%$ fewer parameters than prior work.


These advancements position our framework as a practical solution for real-time applications, where computational efficiency and accuracy are equally critical. By addressing the dual challenges of temporal context modeling and resource constraints, this work advances the frontier of abnormal behavior detection in autonomous driving toward deployable, scalable systems.


II. RELATED WORK


A. Video Anomaly Detection


Detecting abnormal behavior in autonomous driving has garnered significant attention in recent years due to its importance in ensuring safety. Such methods can also be applied in surveillance, security, and behavior analysis. Early approaches often relied on handcrafted features and statistical models. For instance, Mahadevan et al. introduced a robust framework that utilizes Gaussian Mixture Models (GMM) for background modeling, enabling the detection of deviations from learned normal patterns in videos [8].


In recent years, deep learning has revolutionized the field, with researchers such as Ahmed et al. proposing a deep learning-based framework that employs Convolutional Neural Networks to automatically extract features from video data, achieving state-of-the-art anomaly detection results at the time [9]. Another notable approach that has shown promise is the use of generative models. Zenati et al. employed Generative Adversarial Networks (GANs) for anomaly detection, demonstrating that the model could effectively learn the distribution of normal events and highlight anomalies based on reconstruction errors [10]. Additionally, Liu et al. proposed a two-stream network architecture that separately processes spatial and temporal information, further enhancing detection accuracy [11]. More modern approaches seek to exploit the success of Vision Transformers for anomaly detection tasks, such as the TransAnomaly model proposed by Yuan et al. [12]. Furthermore, researchers have also sought to build Siamese Networks on top of vision transformers to learn feature representations. For instance, Chen et al. built a Siamese Network that utilizes a magnitude contrastive loss to encourage feature separability [13].


B. Feature Extraction


Extracting spatio-temporal features is critical for identifying abnormal behavior in autonomous driving. Transfer learning via feature extraction has become a cornerstone of modern machine learning due to its efficiency and effectiveness, with previous works on this topic primarily using the Inflated 3D ConvNet (I3D) model as their feature extractor [13]–[16]. The I3D model sought to address the challenge of spatio-temporal feature learning in video recognition by extending successful 2D convolutional architectures into 3D [3]. To accomplish this, the model “inflated” 2D convolutional kernels into 3D by replicating and pooling them along the temporal dimension [3]. The model also incorporates a two-stream design, using both RGB and optical flow, to enhance feature modeling [3].


However, the I3D model suffers from a high computational cost in both memory requirements and processing power, making it impractical for real-time inference applications. Alternatively, mindful of the trade-off between computational efficiency and accuracy, the X3D model progressively expands a minimal 2D backbone along the temporal duration, frame rate, spatial resolution, channel width, network depth, and bottleneck width axes [7]. At each step of training, the methodology expands each axis in isolation and then selects a single axis to keep expanded for future steps, based on which expansion achieved the best computation-to-accuracy trade-off after training and validation. This continues until the model reaches a desired computational cost [7]. As a result, the X3D family of models boasts notably higher top-1 accuracy on popular datasets such as Kinetics-400 [17], while requiring less than a quarter of the parameters used by the I3D model. Consequently, our work utilizes the X3D model to extract video features instead of the I3D model used by previous works.

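To make the extraction step concrete, below is a hedged sketch of obtaining clip features from a pretrained X3D backbone. The pytorchvideo hub checkpoint name (`x3d_m`) and the choice to drop the final classification block are assumptions for illustration; the paper does not specify which X3D variant or feature layer is used.

```python
# Hedged sketch: clip feature extraction with a pretrained X3D trunk.
import torch
import torch.nn as nn

model = torch.hub.load("facebookresearch/pytorchvideo", "x3d_m", pretrained=True)
backbone = nn.Sequential(*model.blocks[:-1])  # drop the classification head (assumption)
backbone.eval()

clip = torch.randn(1, 3, 16, 224, 224)  # dummy clip: (B, C, T, H, W)
with torch.no_grad():
    feats = backbone(clip)  # spatio-temporal features of shape (B, C', T', H', W')
print(feats.shape)
```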

C. Vision Transformers


Convolutional Neural Networks (CNNs) have been extensively studied and applied in various domains, particularly in computer vision tasks. Countless architectures have since been proposed and tested in the pursuit of more powerful and efficient models. One method particularly useful for video processing is the $(2{+}1)$D convolution, which applies a spatial 2D convolution followed by a temporal 1D convolution rather than directly applying a 3D convolution, reducing computational complexity and overfitting [18]. However, while CNNs have been at the forefront for many years, vision transformer models are becoming ever more prevalent, with their performance matching or even exceeding that of CNNs.

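A minimal sketch of this factorization follows, with illustrative channel widths and kernel sizes: the full 3D kernel is replaced by a 1×3×3 spatial convolution followed by a 3×1×1 temporal convolution.

```python
# Minimal (2+1)D convolution sketch: 2D spatial conv, then 1D temporal conv [18].
import torch
import torch.nn as nn

class Conv2Plus1d(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # spatial: the 1x3x3 kernel acts on (H, W) only
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # temporal: the 3x1x1 kernel acts on T only
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.temporal(torch.relu(self.spatial(x)))

x = torch.randn(2, 16, 8, 32, 32)
print(Conv2Plus1d(16, 32)(x).shape)  # torch.Size([2, 32, 8, 32, 32])
```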

Transformers based on self-attention mechanisms were initially designed for and utilized in natural language processing tasks such as machine translation [19]. Since then, they have become the dominant method in the vast majority of natural language processing tasks, consistently outperforming other methods for state-of-the-art performance.


Inspired by their success in natural language processing, researchers sought to expand the application of transformers to 2D computer vision tasks, kick-started by the introduction of the Vision Transformer, which applied the transformer architecture directly to a sequence of image patches for image classification [20]. This effort proved successful, as transformer-based models went on to be utilized in a variety of computer vision tasks such as image classification, action recognition, object detection, image segmentation, and image and scene generation [21]–[25].


However, the original transformer architecture is limited by a quadratic increase in both time and space complexity with input size, making it a bottleneck for large-scale applications. In response, researchers proposed models such as Linformer [4], which lowered computational costs to $\mathcal{O}(nk)$, where $k$ is the dimension of a low-rank projection of the length-$n$ sequence, and the Performer [5], which used Fast Attention Via Orthogonal Random Features to approximate the softmax kernel in linear time. Our work explores these techniques, utilizing their efficiency for higher-dimensional video data.

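To illustrate the idea, the following is a hedged sketch of Performer-style linear attention with positive random features [5]; the orthogonalization of the projection and the numerical stabilizers of FAVOR+ are omitted for brevity, and all sizes are illustrative.

```python
# Hedged sketch of Performer-style linear attention: O(L*m*C) instead of O(L^2).
import torch

def feature_map(x, w):
    # positive random features: exp(xW - ||x||^2 / 2) / sqrt(m); x: (L, C), w: (C, m)
    m = w.shape[1]
    return torch.exp(x @ w - (x ** 2).sum(-1, keepdim=True) / 2) / m ** 0.5

def performer_attention(q, k, v, m=256):
    c = q.shape[-1]
    q, k = q / c ** 0.25, k / c ** 0.25            # fold the 1/sqrt(C) scaling into q, k
    w = torch.randn(c, m)                          # shared random projection
    qp, kp = feature_map(q, w), feature_map(k, w)  # (L, m)
    kv = kp.T @ v                                  # (m, C) summary, linear in L
    z = qp @ kp.sum(dim=0)                         # normalizer, approximating D
    return (qp @ kv) / z.unsqueeze(-1)

q, k, v = (torch.randn(392, 32) for _ in range(3))  # L = T*H*W = 8*7*7, C = 32
print(performer_attention(q, k, v).shape)            # torch.Size([392, 32])
```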

D. Differentiating Abnormal from Normal Behavior


Siamese networks were first introduced for signature verification, introducing the concept of training dual neural networks to assess the similarity between pairs of inputs [26]. However, they were not popularized until much later, with the advent of deep learning and the publications by Koch and Schroff et al., who developed models for one-shot image recognition and face recognition respectively [27], [28]. Koch’s model emphasized the significance of similarity learning through the usage of a contrastive loss function [27]. Schroff et al., on the other hand, improved upon the contrastive loss function by developing triplet loss, which incorporates an additional anchor input in the calculation of similarity, allowing the model to explicitly learn the relative distances between embeddings and ensuring that the distance between the anchor and the positive is minimized relative to that between the anchor and the negative [28]. This concept was then refined by Hoffer and Ailon [29], who proposed the Triplet Network as an extension of the Siamese



Fig. 2. An overview of the triplet network. Three inputs are passed through the network and independently processed to obtain feature embeddings. The inputs consist of an anchor (typically the same class as the negative) $(\mathbf{x}^{A})$ , a negative $(\mathbf{x}^{-})$ , and a positive sample $(\mathbf{x}^{+})$ . Once feature embeddings have been obtained, the results are passed to a classifier to construct a classification score and to a Triplet Loss function to compute loss.


Network that utilizes a trio of neural networks instead of the Siamese Network’s traditional use of two, as shown in Figure 2.

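As a concrete illustration of the triplet objective of [28] shown in Figure 2, the sketch below uses PyTorch's built-in `TripletMarginLoss`; the margin and the toy encoder are illustrative, not the paper's settings.

```python
# Sketch of a triplet objective: pull the anchor toward the positive and push
# it away from the negative by at least the margin.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
triplet_loss = nn.TripletMarginLoss(margin=1.0)

x_a = encoder(torch.randn(8, 128))  # anchor embeddings
x_p = encoder(torch.randn(8, 128))  # positive embeddings
x_n = encoder(torch.randn(8, 128))  # negative embeddings

# loss = max(d(x_a, x_p) - d(x_a, x_n) + margin, 0), averaged over the batch
loss = triplet_loss(x_a, x_p, x_n)
loss.backward()
```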

This family of networks has since proven highly successful in tasks where similarity metrics are a core component, or can be exploited where they are not, such as image similarity ranking, sentence similarity learning, and sentence embedding [30]–[32].


III. ABNORMAL BEHAVIOR DETECTION


In this section, we first introduce our architecture for identifying abnormal behavior in autonomous driving, and then delve into its different components, such as feature extraction, in the following subsections.


A. Overall Architecture


The overall architecture of our model for abnormal behavior detection in autonomous driving is shown in Figure 3. Before being used by our model, videos are first transformed with a center crop and then uniformly sampled, splitting each video into clips, which then have their features extracted by the X3D model. The resulting per-clip features of shape $C\times T\times H\times W$, where $C, T, H$, and $W$ denote the number of channels, temporal dimension, height, and width of the features respectively, are then pooled together to obtain the most prominent features for each video. From here, the video features are passed as input to our model, where they are processed by a decoupled spatio-temporal feature enhancer (a) to refine the features in preparation for the spatio-temporal attention block (b), where long-range connections between features are modeled using linear self-attention. After the features of each video are processed by each block, the $T$, $H$, and $W$ dimensions are pooled to produce a final video feature embedding of size $d$, where $d$ is the size of the channel dimension of the last layer. Instead of evaluating our model only with classification objectives, e.g., cross entropy, we also employ a triplet loss, as shown in equation (4), designed to maximize numerical separability between features of normal and abnormal samples, as illustrated later in this section.

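For clarity, here is a hedged sketch of that input pipeline. The stand-in `extract_features`, the max-pooling used to keep the "most prominent" features, and the clip count and sizes are all assumptions for illustration.

```python
# Hedged sketch: center-cropped video -> uniformly sampled clips -> per-clip
# features -> pooled per-video features.
import torch

def uniform_sample_clips(video, n_clips, clip_len):
    # video: (C, T, H, W) -> (n_clips, C, clip_len, H, W), evenly spaced starts
    t = video.shape[1]
    starts = torch.linspace(0, t - clip_len, n_clips).long()
    return torch.stack([video[:, s:s + clip_len] for s in starts])

def extract_features(clip):
    # placeholder for the pretrained X3D trunk (see the sketch in Sec. II-B)
    return torch.randn(192, 4, 7, 7)  # (C', T', H', W')

video = torch.randn(3, 160, 224, 224)                      # center-cropped (C, T, H, W)
clips = uniform_sample_clips(video, n_clips=10, clip_len=16)
feats = torch.stack([extract_features(c) for c in clips])  # (n_clips, C', T', H', W')
video_feats = feats.max(dim=0).values                      # pooled per-video features
print(video_feats.shape)                                   # torch.Size([192, 4, 7, 7])
```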


Fig. 3. An overview of our proposed network architecture. Input videos are first split into clips via Uniform Temporal Sampling and then each clip has its features extracted using X3D; these features of each clip for a given video are then pooled together to obtain a single feature representation for the whole video. Next, (a) decouples the spatial and temporal dimensions to enhance the individual feature dimensions. Then (b) captures long range relations between the temporal and spatial dimensions. Final video embeddings are obtained after a pooling across both the temporal and spatial dimensions. Finally, a loss is computed by combining the triplet loss of the embeddings and the cross entropy loss of the classifications.



Fig. 4. An illustration of spatio-temporal decoupling for an input of shape $T\times H\times W$, where $T$ denotes the temporal dimension and $H$ and $W$ denote the spatial dimensions. Channels are not shown for simplicity. (a) A traditional 3D convolution with a 3D filter. (b) Spatio-temporal decoupled convolution, where the convolution is split into a 2D spatial convolution followed by a 1D temporal convolution.


B. Abnormal Behavior Detection


Our decoupled spatio-temporal feature enhancer follows the $(2+1)\mathrm{D}$ Convolution architecture [18], which reduces computational complexity over directly performing 3D convolutions. As illustrated in Figure 4, our feature enhancer consists of a $(2+1)\mathrm{D}$ Convolution, which is followed by a Feed Forward Network containing two fully connected layers with GELU (Gaussian Error Linear Unit) non-linearity in between. A LayerNorm is applied before each module and a residual connection is added after each module.

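A hedged sketch of one enhancer block under these design choices follows; channel widths and the feed-forward expansion factor are illustrative.

```python
# Hedged sketch of the feature enhancer: pre-LayerNorm, a (2+1)D convolution,
# then a two-layer GELU feed-forward network, each with a residual connection.
import torch
import torch.nn as nn

class FeatureEnhancer(nn.Module):
    def __init__(self, dim, ffn_mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial = nn.Conv3d(dim, dim, (1, 3, 3), padding=(0, 1, 1))   # 2D spatial
        self.temporal = nn.Conv3d(dim, dim, (3, 1, 1), padding=(1, 0, 0))  # 1D temporal
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * ffn_mult), nn.GELU(), nn.Linear(dim * ffn_mult, dim)
        )

    def forward(self, x):  # x: (B, C, T, H, W)
        h = self.norm1(x.movedim(1, -1)).movedim(-1, 1)  # normalize over channels
        x = x + self.temporal(self.spatial(h))           # (2+1)D conv + residual
        h = self.norm2(x.movedim(1, -1))
        return x + self.ffn(h).movedim(-1, 1)            # FFN + residual

x = torch.randn(2, 32, 8, 14, 14)
print(FeatureEnhancer(32)(x).shape)  # torch.Size([2, 32, 8, 14, 14])
```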

The spatio-temporal attention block utilizes the Performer attention architecture, providing linear efficiency over the original transformer [5]. The original attention proposed by [19] is shown in equation (1), where $Q,K,V\in\mathbb{R}^{L\times C}$ are the query, key, and value, and $L=T\times H\times W$.


$$
\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=D^{-1}AV,\tag{1}
$$

where


$$
A=\exp\left(\frac{QK^{\top}}{\sqrt{C}}\right),\tag{2}
$$

and

$$
D=\mathrm{diag}(A\cdot\mathbb{1}_{L}).\tag{3}
$$
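As a worked instance of equations (1)-(3), the following computes this exact attention with illustrative sizes, materializing the full $L\times L$ matrix $A$.

```python
# Worked instance of equations (1)-(3): exact softmax attention, O(L^2) in A.
import torch

def softmax_attention(q, k, v):
    # q, k, v: (L, C)
    c = q.shape[-1]
    a = torch.exp(q @ k.T / c ** 0.5)  # A = exp(QK^T / sqrt(C)), shape (L, L)
    d = a.sum(dim=-1, keepdim=True)    # row sums, i.e. diag(A * 1_L)
    return (a @ v) / d                 # D^{-1} A V, i.e. softmax(QK^T / sqrt(C)) V

L, C = 8 * 7 * 7, 32                   # L = T*H*W flattened tokens
q, k, v = (torch.randn(L, C) for _ in range(3))
print(softmax_attention(q, k, v).shape)  # torch.Size([392, 32])
```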

The variable $A\in\mathbb{R}^{L\times L}$ is an attention matrix computed from $Q$ and $K$ such that $AV$, normalized by $D$, defines the attention operation, where $D\in\mathbb{R}^{L\times L}$ is a normalization matrix that yields a softmax operation once applied. The variable $\mathbb{1}_{L}$ denotes a vector of length $L$ containing all 1s. However, this attention suffers from a quadratic computation cost with respect to $L$ when calculating $\exp\left(\frac{QK^{\top}}{\sqrt{C}}\right)$. The Performer solves this by kernel approximation as follows:
