Contrastive-Regularized U-Net for Video Anomaly Detection

Video anomaly detection aims to identify anomalous segments in a video. Models are typically trained with weak, video-level labels. This paper focuses on two crucial factors affecting the performance of video anomaly detection models. First, we explore how to capture the local and global temporal dependencies more effectively. Previous architectures are effective at capturing either local or global information, but not both. We propose to employ a U-Net-like structure to model both types of dependencies in a unified structure where the encoder learns global dependencies hierarchically on top of local ones; then the decoder propagates this global information back to the segment level for classification. Second, overfitting is a non-trivial issue for video anomaly detection due to limited training data. We propose weakly supervised contrastive regularization, which adopts a feature-based approach to regularize the network. Contrastive regularization learns more generalizable features by enforcing inter-class separability and intra-class compactness. Extensive experiments on the UCF-Crime dataset show that our approach outperforms several state-of-the-art methods.


I. INTRODUCTION
Recent studies have shown that closed-circuit television (CCTV) cameras, when strategically installed, lead to a significant drop in crime rate [1]. However, large-scale deployment of CCTV may lead to data overload, making it difficult for surveillance operators to pick up suspicious or abnormal activities hidden amidst the enormous streams of live CCTV footage. Therefore, there is a need for intelligent systems to automatically detect suspicious or anomalous activities.
Given a video, video anomaly detection (VAD) aims to localize the abnormal segments within the video. Unfortunately, abnormal events are rare and difficult to collect, leading to a scarcity of positive samples. To overcome this issue, video anomaly detectors are normally trained with unsupervised learning [2], [3], [4] or weakly supervised methods [5], [6], [7], [8]. In the weakly supervised setting, the labels are provided at the video level. The label only specifies whether a video contains an anomalous event, which may occur in any segment of the video. Fig. 1 shows the pipeline of weakly supervised learning for VAD. The backbone network, e.g., C3D [9] or I3D [10], extracts segment-level generic features. Then, the anomaly feature extraction block transforms the generic features into specialized features, which in turn are used by the anomaly classifier to generate segment-level anomaly scores. In the absence of segment-level labels, the multiple-instance learning (MIL) formulation is normally applied on the segment scores for each video.
The associate editor coordinating the review of this manuscript and approving it for publication was Zahid Akhtar.
In this work, we focus on improving two aspects of the VAD network in a weakly supervised setting. First, we explore how to model the local and global temporal dependencies when generating specialized features in the anomaly feature extraction block. Local information captures immediate anomalous traits, e.g., abrupt motion, suspicious human actions and changes in the environment, while global information provides the global context, allowing the network to contrast normal and abnormal scenes in the videos. Previous methods such as stacked RNN [11], temporal consistency [12] and ConvLSTM [13] can only capture short-range dependencies. GCN-based methods [14], [15] can model long-range dependencies but they are slower and more difficult to train. RTFM [7] captures both the short and long temporal dependencies using two parallel structures, one for each type. However, the two dependencies are considered separately, neglecting the close relationship between them. To address this, we propose to use a U-Net-like structure [16] to model both local and global dependencies for specialized feature generation.
Second, we explore contrastive regularization as a new strategy to reduce overfitting. Overfitting is a dominant issue encountered when training VAD models due to the scarcity of positive samples. Traditionally, regularization is achieved by suppressing the complexity of the network [17], [18], injecting noise into the network [19], [20], [21], [22] or data [23], [24], and augmenting the training set [25]. For VAD, previous work has also applied special heuristics such as sparsity constraint and temporal smoothness [5] to regulate the output of the network. Our model adopts a feature-based approach to regularization where the strategy is to learn more generalizable features. Enhanced separability between normal and abnormal features makes the network less vulnerable to overfitting. To achieve this, we reformulate the contrastive regularization in [26] for VAD.
The main contributions of this work are highlighted as follows:
• We propose a U-Net-like structure [16] to perform specialized feature extraction. U-Net has mainly been applied to image segmentation and trained in a supervised setting. In our model, U-Net is used in a novel way to localize abnormal segments in a video and is trained in a weakly supervised setting. The network learns to generate segment-level pseudo labels to facilitate training with only video-level labels. The interaction between the two types of dependencies is embedded naturally in the network structure: the encoder learns global dependencies on top of local dependencies through successive convolution operations, while the decoder propagates this global information back to the local level through transposed convolutions.
• We propose a novel weakly supervised contrastive regularization technique to reduce overfitting. Previously, contrastive regularization [26] was applied in the image domain under a supervised setting. For VAD, contrastive regularization is extended to a weakly supervised setting. Contrastive regularization is reformulated as a multiple-instance learning (MIL) problem where each video is a bag of segments. A negative bag (normal video) only comprises negative instances (segments) while a positive bag (abnormal video) contains both positive and negative instances whose labels are unknown. The loss function learns two sets of centers to represent normal and abnormal events, respectively. By enforcing intra-center compactness and inter-center separability among the samples, contrastive regularization enhances the discriminability and generalizability of the learnt features. The model also has good explainability: since the centers represent different kinds of events, mapping a segment to its nearest center helps justify why the segment is classified as such.
• Our work achieves state-of-the-art performance on the UCF-Crime benchmark. In the experiments, the proposed U-Net-based architecture captures the temporal information more effectively than existing methods, while the contrastive regularization learns more generalizable features, resulting in less overfitting and improved test performance.
The remainder of this paper is organized as follows. Section II discusses related work. Section III explains the proposed U-Net based feature extractor (Section III-A), the anomaly classification block (Section III-B) and weakly supervised contrastive-based regularization (Section III-C). Section IV presents the experimental results on the UCF-Crime benchmark. Finally, Section V summarizes the paper.

II. LITERATURE REVIEW
Video anomaly detection (VAD) is challenging due to the absence of training samples. To alleviate the issue, VAD has traditionally relied on unsupervised [27], [28], [29] or weakly supervised learning [5], [6], [7] for training. Regularization is another critical step to overcome the overfitting issue prevalent in VAD models. In this section, we review related works in these areas.
VOLUME 11, 2023 36659 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

A. UNSUPERVISED ANOMALY DETECTION
Unsupervised anomaly detection focuses on one-class classification where the model is trained exclusively on normal training data. The strategy is to build a model that specializes in reconstructing normal samples with small reconstruction error. These methods assume that unseen abnormal videos are difficult to reconstruct accurately and regard samples with high reconstruction errors as anomalies. Reconstruction can be done through sparse coding [27], [28], [30], [31] or auto-encoders [2], [12], [29], [32]. Sparse coding encodes normal patterns with a dictionary, and the sample is reconstructed by linearly combining the dictionary bases [30], [31] such that the reconstructed feature is as close as possible to the original feature. A sparse representation allows the model to represent high-dimensional samples with less training data. For the auto-encoders, the encoder compresses a sample into an encoded representation, which in turn is used by the decoder to reconstruct it. More complex schemes such as ALOCC [35] and AVID [36] use adversarial training to train the encoder-decoder network to reconstruct an input that fools the discriminative network into thinking that it is the original one. To do this, ALOCC enhances the inliers and distorts the outliers to enhance their separability, while AVID in-paints the input data to remove pixel-wise irregularity from the input frame. To enhance the result, [32] proposes a two-stage cascade classifier based on sparse filtering and an auto-encoder network such that anomalous regions have low sparsity value and high reconstruction cost. To tackle the lack of positive samples, G2D [37] uses a generative adversarial network (GAN) to generate outliers. However, the generated outliers are not based on real anomalous events. In general, unsupervised models are unable to handle complex or unseen environments. They typically suffer from higher false positive rates since it is unrealistic to capture all normal samples.

B. WEAKLY SUPERVISED ANOMALY DETECTION
Current state-of-the-art VAD systems are based on weakly supervised approaches [5], [6], [7], [14], [15] where video-level labels are leveraged to train the network. The multiple-instance learning (MIL) formulation is typically employed to cater for the absence of segment-level labels. For example, the classical multi-instance ranking loss trains the network to rank the top segment in a positive (abnormal) video higher than that in a negative (normal) video [5]. However, the top segment may not be an abnormal segment as desired, and the max operator cannot handle videos with multiple abnormal segments. To resolve this, Zhu et al. [38] extend the ranking loss by incorporating a temporal mechanism to localize anomalies. Several recent works train the network to generate segment-level pseudo labels as supervisory signals [6], [7]. This allows the VAD model to be trained with classical supervised learning. For example, Tian et al. [7] select the top-k segments with the highest feature magnitude, whereas Feng et al. [6] train an MLP-based structure to generate pseudo-labels via multiple instance learning.

C. TEMPORAL DEPENDENCIES
It has been shown that the temporal relationship between segments is critical to the performance of VAD models [7], [11], [12], [14], [15]. Different networks have been used to capture the temporal information, including recurrent models in [11], graph convolutional network (GCN) methods in [14], [15] and transformers in [7]. The recurrent models mainly model short-term relationships effectively, whereas the transformer focuses on long-range relationships. Reference [15] enforces temporal consistency to clean up noisy labels by propagating supervisory signals from high-confidence snippets to their neighbouring low-confidence snippets. High-order Context Encoding [8] enriches the features by using a moving window to capture the local dynamics of a video. RTFM [7] models the local and global temporal dependencies explicitly with two parallel structures. The pyramid of dilated convolutions (PDC) [39] is used to model local temporal dependency. The other branch uses the temporal self-attention module (TSA) [40] to capture the global temporal dependencies. However, RTFM neglects the close relationship between the two dependencies. Although local dependency is good at capturing local temporal dynamics, some anomalies are subtle and may not be discernible unless viewed in relation to the other parts of the video. In this paper, we propose a unique approach to model global and local dependencies in a unified structure based on the U-Net architecture.
Regularization. Overfitting is a long-standing issue for VAD systems, mainly due to the scarcity of positive samples. Therefore, regularization is the key to successful training of VAD models. Structure-based regularization methods regularize the networks by manipulating the network structure. For example, an L2 penalty [17] or elastic net loss [18] is imposed to suppress the complexity of the network. More recent methods drop subnets [41], paths [42], channels [22] or layers [43] to reduce co-adaptation between the various computational units in the network during training. On the other hand, data-based regularization methods increase the diversity and size of the training set by creating transformed versions of the training samples [25]. To increase the robustness of the system to real-world noise, it is useful to inject artificial noise into the data. For example, cutoff [23] and random erasing [24] cut out random portions of the image to avoid co-adaptation of the features and learn stronger features. Recently, feature-based regularization has been shown to improve generalization performance. Contrastive regularization [26], [44], [45] imposes geometric constraints such that the learnt features display good intra-class compactness and inter-class separability. However, contrastive regularization has mainly been studied under supervised learning for image classification. In this work, we extend the contrastive regularization technique to a weakly supervised setting for video anomaly classification.

III. THE PROPOSED METHOD: CONTRASTIVE-REGULARIZED U-NET
The training set consists of pairs $(X_i, y_i)$, where each video $X_i \in \mathbb{R}^{T \times D_g}$ is a sequence of $T$ segment-level features, each with dimensionality $D_g$, and $y_i \in \{0, 1\}$ is the corresponding video-level label indicating the absence or presence of anomalous segments in the video. The input features are generic features extracted from a pre-trained 3D-CNN network such as C3D [9] or I3D [10].
The segment-level input features are first processed by the U-Net feature extractor to capture the local and global temporal dependencies between the segments in the video. The local dependencies are captured in the lower layers and the global dependencies in the higher layers. The output of the U-Net feature extractor is the U-Net features $U_i \in \mathbb{R}^{T \times D_u}$, which have been enriched with temporal information. Note that $U_i$ preserves the temporal resolution of $X_i$.
The U-Net features are then passed to the anomaly classification block, which generates the segment-level anomaly scores $s_i \in \mathbb{R}^{T}$ indicating whether each segment is normal or anomalous. The block also outputs the segment-level features $F_i \in \mathbb{R}^{T \times D_f}$, which are used to regularize the network.
The network is trained with a weakly supervised setup where only video-level annotations are provided. To do this, the U-Net based classifier is used to generate pseudo-labels. For positive videos, the top-k segments with the highest anomaly scores are selected as pseudo-positive samples and the bottom-k segments as pseudo-negative samples. The features and scores of the selected pseudo-positive and pseudo-negative segments are denoted as $(\hat{F}_i^+, \hat{s}_i^+)$ and $(\hat{F}_i^-, \hat{s}_i^-)$, respectively. For negative videos, only the top-k segments are selected; they represent hard negative segments that the network has more difficulty fitting (their anomaly scores are higher although they should ideally be lower). The anomaly scores $\hat{s} = \{\hat{s}_i^+, \hat{s}_i^-\}$ are used to compute the data loss and train the model. Meanwhile, the generated features $\hat{F} = \{\hat{F}_i^+, \hat{F}_i^-\}$ are used to perform contrastive regularization to reduce overfitting.

A. MODELING LOCAL AND GLOBAL TEMPORAL DEPENDENCIES WITH U-NET
Temporal relationships have been shown to be critical to the performance of VAD models [7], [11], [12], [14], [15]. Local temporal dependencies capture short-term and tangible anomalous cues (e.g., abrupt motion or scene change, and suspicious actions) while global temporal dependencies provide the global context to expose more subtle abnormal events. Different structures have been proposed to model temporal dependencies, e.g., pyramid of dilated convolutions (PDC) [39], temporal self-attention module (TSA) [40], High-order Context Encoding (HCE) [8], and Temporal Consistency Graph (TCG) [15]. However, most of them specialize in capturing either local or global information, but not both. For example, PDC and HCE focus on local temporal information, while TSA is more effective at modeling global relationships. TCG captures both relationships but the model is inefficient and difficult to train.
In this section, we employ the U-Net [16] in a novel way to capture both types of dependencies in a unified manner. While U-Net has been widely adopted for image segmentation tasks such as medical imaging [46], our work is the first to apply U-Net to localize anomalous segments in a video. Fig. 3 shows the proposed U-Net adapted for VAD. The input to the network is $X_i \in \mathbb{R}^{T \times D_g}$, which stores the sequence of snippet-level features from the backbone network. U-Net captures the temporal dependencies among the snippets to enrich the output features $U_i \in \mathbb{R}^{T \times D_u}$. The network comprises a downstream path (encoder), which encodes the input features into a temporally compact representation, and an upstream path (decoder), which restores the segment-level features. The encoder network captures the local dependencies at the shallower layers and global dependencies at the deeper layers. This is because the effective receptive fields in a CNN are local at shallower layers and grow gradually as the network grows deeper [47]. At the decoder network, the activation maps from deeper layers are concatenated with those of the current layer to learn a more fine-grained representation. Thus, the decoder combines the global and local information.
The height of the network is fixed to $L = 4$, where the temporal resolution $T^{(l)} = T / 2^{l-1}$, $l \in \{1, \dots, L\}$, is halved from one height level to the next. Different from the conventional U-Net, which increases the channels with increasing height, our network uses a regular channel size of $D_u$ for all blocks. The network is constructed with a common residual block structure. The block contains three 1-D convolutional layers activated by the ReLU function and has a skip connection to facilitate training. The convolutional layers are configured such that the activation maps in the same block have a regular shape of $T^{(l)} \times D_u$.

1) ENCODER
The encoder learns the local and global temporal dependencies in the input sequence. The local dependencies are captured through 1-D convolutional operations where the optimal kernel size is empirically determined to be 5. Stacking multiple convolutional operations allows the network to learn the global context in a hierarchical manner. The effective receptive field in the network grows incrementally from one layer to the next, allowing the network to learn increasingly global information from local ones in previous layers. Following conventional U-Net design, the temporal resolution is halved from one layer to another through max pooling. In the end, the output of the encoder is a temporally compact representation with high semantic value and global temporal coverage. This encoding process has been shown to be effective at removing noise and capturing common patterns that are representative of the training samples.
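The growth of the receptive field can be illustrated with a short calculation. The sketch below is only indicative: it assumes that every encoder level applies three stride-1 convolutions with kernel size 5, that max pooling uses a window of 2 with stride 2, and that the input has T = 32 segments. The function name `encoder_schedule` is hypothetical.

```python
def encoder_schedule(num_segments=32, levels=4, convs_per_level=3, kernel=5):
    """Return (temporal_length, receptive_field) after the convolutions of each level.

    Assumptions (not taken from the paper): stride-1 convolutions, and max pooling
    with window 2 and stride 2 between consecutive levels.
    """
    length = num_segments
    receptive_field = 1  # measured in input segments
    jump = 1             # spacing between adjacent outputs, in input segments
    schedule = []
    for level in range(1, levels + 1):
        for _ in range(convs_per_level):
            receptive_field += (kernel - 1) * jump
        schedule.append((length, receptive_field))
        if level < levels:
            # Max pooling between levels halves the temporal resolution.
            receptive_field += (2 - 1) * jump
            jump *= 2
            length //= 2
    return schedule


for level, (length, rf) in enumerate(encoder_schedule(), start=1):
    print(f"level {level}: T = {length:2d}, receptive field = {rf} segments")
```

Under these assumptions the theoretical receptive field is 13 segments at the first level and already exceeds a 32-segment video at the second level, which is consistent with local cues being captured in shallow layers and global context in deeper ones.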

2) DECODER
The decoder takes the encoded representation and restores it to segment-level features. Transposed convolution is applied to increase the temporal resolution from one layer to the next. The up-sampled features are concatenated with the encoder output at the same height level, and then passed to the decoder block for further feature extraction. Through this process, high-level global information is propagated through the layers back to the local segments. Thus, the segment-level features generated by the decoder are infused with high-level local and global temporal information.
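One decoder step can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the transposed convolution is assumed to use kernel size 2 and stride 2 (so that it exactly doubles the temporal length), the weights are random, the channel count is a toy value, and the decoder block that follows the concatenation is omitted.

```python
import random


def transposed_conv1d(x, weight, stride):
    """1-D transposed convolution over time.

    x: list of T_in time steps, each a list of C_in floats.
    weight: nested list indexed as [kernel][C_in][C_out].
    Returns (T_in - 1) * stride + kernel time steps with C_out channels.
    """
    kernel, c_out = len(weight), len(weight[0][0])
    t_out = (len(x) - 1) * stride + kernel
    y = [[0.0] * c_out for _ in range(t_out)]
    for t, step in enumerate(x):
        for k in range(kernel):
            for ci, value in enumerate(step):
                for co in range(c_out):
                    y[t * stride + k][co] += value * weight[k][ci][co]
    return y


def concat_channels(a, b):
    """Concatenate two sequences of equal temporal length along the channel axis."""
    assert len(a) == len(b)
    return [step_a + step_b for step_a, step_b in zip(a, b)]


random.seed(0)
channels = 3  # toy stand-in for D_u
deep = [[random.random() for _ in range(channels)] for _ in range(4)]  # deeper level, T = 4
skip = [[random.random() for _ in range(channels)] for _ in range(8)]  # encoder output, T = 8
weight = [[[random.uniform(-1, 1) for _ in range(channels)]
           for _ in range(channels)] for _ in range(2)]

up = transposed_conv1d(deep, weight, stride=2)  # temporal length 4 -> 8
merged = concat_channels(up, skip)              # channels 3 -> 6
print(len(up), len(merged), len(merged[0]))
```

After concatenation the channel count is doubled, so the subsequent decoder block would need to map the merged features back to the regular channel size.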

3) DESIGN CONSIDERATION
Both Convolutional Neural Networks (CNNs) and Transformers are able to capture local and global information in their layer representations. However, they have very different characteristics. By design, a CNN-based network such as U-Net attends only locally in the lower layers and gradually builds up global-level attention at higher layers. In contrast, the transformer receptive field spans the whole temporal sequence in a single layer. It has been shown that a properly trained Vision Transformer (ViT) contains local heads and global heads even in the lowest layers [47]. However, without large-scale pre-training, ViT is empirically found to be weak at attending locally in the earlier layers, leading to much lower performance. For VAD, training data is scarce and therefore the U-Net structure is better suited than transformers under such a constraint. For a more detailed discussion of how transformers and CNNs attend to local and global information, please refer to [47].
Another design consideration is the number of channels. The conventional U-Net increases the channels in the downstream path, and then decreases them in the upstream path. In our network, the number of channels in all blocks is fixed to $D_u$. This is because the input to a conventional U-Net is a low-level 3-dimensional feature (raw image input), whereas our network receives high-level, high-dimensional features from the backbone network. While the conventional U-Net is originally designed to convert low-level input to high-level features, our network serves a different purpose of enriching already high-level input features with temporal information. In addition, doubling the channels results in an unreasonably massive network and aggravates overfitting. Another option is to add a reduction layer before U-Net. However, this design results in information loss and leads to lower performance in practice. Among these options, fixing the number of channels to $D_u$ is found to yield the best performance.

B. SEGMENT-LEVEL ANOMALY CLASSIFICATION
The output of the U-Net is then fed into the anomaly classification block to generate the anomaly scores $s_i \in \mathbb{R}^{T}$ for all $T$ segments in the video. The block is a simple 3-layer multi-layer perceptron (MLP) network. ReLU activation is used for the first and second layers, which function as a feature extractor, and sigmoid activation for the last layer, which functions as a binary classifier. In addition, the features from the second layer, $F_i \in \mathbb{R}^{T \times D_f}$, are extracted and subjected to contrastive regularization to reduce overfitting. To train the model, we adopt a weakly supervised framework where only weak video-level annotations are available to train the segment-level classification network. This will be discussed in the next section.
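The classification block described above can be sketched with the standard library alone. The layer widths, the random initialization, and all function names below are illustrative assumptions; dropout and training are omitted. The point is only to show that the same forward pass yields both the scores and the second-layer features.

```python
import math
import random


def linear(x, weight, bias):
    """x: list of inputs; weight: rows indexed as [out][in]; bias: list of length out."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_layer(rng, n_in, n_out):
    """Random weights and zero bias (hypothetical initialization for the demo)."""
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def classify_segments(u, layers):
    """u: T segment features. Returns (anomaly scores, second-layer features)."""
    scores, features = [], []
    for segment in u:
        h1 = relu(linear(segment, *layers[0]))
        h2 = relu(linear(h1, *layers[1]))  # features later used for regularization
        logit = linear(h2, *layers[2])[0]
        scores.append(sigmoid(logit))
        features.append(h2)
    return scores, features


rng = random.Random(0)
d_u, d_hidden, d_f, num_segments = 16, 8, 4, 32  # toy sizes for illustration
layers = [init_layer(rng, d_u, d_hidden),
          init_layer(rng, d_hidden, d_f),
          init_layer(rng, d_f, 1)]
u = [[rng.gauss(0.0, 1.0) for _ in range(d_u)] for _ in range(num_segments)]
scores, features = classify_segments(u, layers)
print(len(scores), len(features), len(features[0]))
```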

C. WEAKLY SUPERVISED CONTRASTIVE REGULARIZATION
Since positive samples are scarce, VAD models are especially vulnerable to overfitting. Therefore, regularization is a crucial step to improve the generalization performance of the trained model. In this work, we propose a feature-based approach for regularization. The network is trained to generate more robust features, where events of the same nature generate similar features that are very different from the features of other types of events. The more robust the generated features are, the more generalizable and noise-tolerant the model becomes.

1) TRAINING UNDER UNCERTAINTY
We extend contrastive regularization [26] to a weakly supervised setting where only the video label $y_i \in \{0, 1\}$ is provided. A video $X_i \in \mathbb{R}^{T \times D_g}$ is labeled as positive if any of its segments is anomalous, and negative if none is. Since segment-level labels are not available, the location of the abnormal event in a positive video is unknown. To deal with this uncertainty, the segment-level anomaly scores $s_i \in \mathbb{R}^{T}$ output by the anomaly classification head are employed to generate pseudo labels for each segment following the multi-instance learning (MIL) framework. Given a positive video, we extract pseudo-positive samples and pseudo-negative samples. The former is the set of top-k segments with the highest anomaly scores, while the latter is the set of bottom-k segments. For a negative video, only the top-k segments are extracted. They represent hard negative samples that the network has difficulty classifying and that need more training. Although every segment of a negative video has a known (normal) label, not all of them are selected, to avoid data imbalance.
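The selection rule above can be sketched as follows. The function name, the value of k, and the example scores are hypothetical, and the sketch assumes that a video has at least 2k segments so that the top-k and bottom-k sets do not overlap.

```python
def select_pseudo_labels(scores, video_label, k=3):
    """Return (segment_index, pseudo_label) pairs for one video.

    scores: per-segment anomaly scores; video_label: 1 for abnormal, 0 for normal.
    """
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    top_k = order[:k]
    if video_label == 1:
        bottom_k = order[-k:]
        # Highest scores become pseudo-positives, lowest scores pseudo-negatives.
        return [(t, 1) for t in top_k] + [(t, 0) for t in bottom_k]
    # Normal video: all segments are normal, keep only the hardest (highest-scoring) ones.
    return [(t, 0) for t in top_k]


scores = [0.10, 0.80, 0.30, 0.95, 0.05, 0.60, 0.20, 0.70]
print(select_pseudo_labels(scores, video_label=1, k=2))
print(select_pseudo_labels(scores, video_label=0, k=2))
```

For the abnormal example, segments 3 and 1 are selected as pseudo-positives and segments 0 and 4 as pseudo-negatives; for the normal example, only segments 3 and 1 are kept as hard negatives.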
With the generated labels, the network can now be trained with traditional supervised learning. The multi-instance learning framework assumes that there exist common patterns (e.g., sudden movements, scene change, abnormal actions) among positive (abnormal) segments. When correctly selected as pseudo-positive samples, they update the network parameters in a coherent manner. In contrast, negative (normal) segments in different videos tend to be dissimilar to one another (e.g., different scenes and activities). When erroneously selected as pseudo-positive samples, they update the network in an incoherent manner. Consequently, over time, the network gradually becomes more adept at identifying the positive segments.

2) CONTRASTIVE REGULARIZATION WITH PSEUDO-LABELS
The contrastive regularization regularizes the network by learning discriminative features so that normal features are well separated from abnormal ones. To do this, we define $C$ centers for each class to explicitly model different types of events. $H^+$ and $H^-$ are the sets of all positive and negative centers, respectively. The proposed contrastive regularization is given by Eq. 1, where $\lambda$ is the compactness strength, $\beta$ is the separability strength, $|\cdot|$ is the cardinality of a set, and $\|\cdot\|_2$ is the L2-norm.
The first two terms enforce intra-center compactness, which minimizes the distance between the sample $f_i$ and its nearest center $h_i$ of the same class. The first term ensures that pseudo-positive samples from positive videos are positioned near positive centers, whereas the second term ensures that negative samples are distributed around negative centers. There are two types of negative samples, namely pseudo-negative samples from positive videos and hard-negative samples from negative videos. The pseudo-positive and pseudo-negative segments, both from positive videos, are pulled towards different types of class centers, allowing the network to distinguish between the abnormal and normal events within the same video. This phenomenon is indeed observed in our experiments (cf. Section IV).
The third term imposes inter-center separability such that all centers are well separated. If the distance between any two centers is smaller than the margin $m$, it incurs a cost. The term ensures that the network learns a more diverse set of features, each representing a different type of anomalous or normal event. In addition, the explainability of the network is enhanced, as the network's classification decision can be linked to the events represented by the nearest class center.
The role of the learnt centers is to generalize the different common patterns within each class and at the same time distinguish the two classes. Therefore, when the network generates features that are near such centers, overfitting is reduced.
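The three terms described in this subsection can be sketched in code. This is one plausible form consistent with the textual description, not necessarily the exact regularizer of the paper: the use of squared distances for compactness, a hinge on the margin for separability, averaging over samples and center pairs, and all function names are assumptions. The default values of `lam`, `beta` and `margin` mirror the settings reported in the implementation details.

```python
import math
from itertools import combinations


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def compactness(features, centers):
    """Mean squared distance from each feature to its nearest same-class center."""
    if not features:
        return 0.0
    return sum(min(l2_distance(f, h) for h in centers) ** 2
               for f in features) / len(features)


def separability(centers, margin):
    """Mean hinge penalty over all pairs of centers that are closer than the margin."""
    pairs = list(combinations(centers, 2))
    if not pairs:
        return 0.0
    return sum(max(0.0, margin - l2_distance(a, b)) for a, b in pairs) / len(pairs)


def contrastive_regularizer(pos_feats, neg_feats, pos_centers, neg_centers,
                            lam=1e-4, beta=1e-4, margin=1.25):
    # Terms 1 and 2: pull samples towards the nearest center of their own class.
    intra = compactness(pos_feats, pos_centers) + compactness(neg_feats, neg_centers)
    # Term 3: push apart any two centers that are closer than the margin.
    inter = separability(pos_centers + neg_centers, margin)
    return lam * intra + beta * inter


# Toy 2-D example with one center per class (hypothetical values).
pos_centers, neg_centers = [[2.0, 0.0]], [[0.0, 0.0]]
pos_feats = [[2.0, 1.0]]
neg_feats = [[0.0, 0.0], [1.0, 0.0]]
print(contrastive_regularizer(pos_feats, neg_feats, pos_centers, neg_centers,
                              lam=1.0, beta=1.0, margin=1.25))
```

In the toy example the two centers are 2.0 apart, so the separability term is zero with a margin of 1.25 and only the compactness terms contribute.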

D. LOSS FUNCTION
During training, the network generates pseudo-labels and assembles the batch $\hat{F} = \{\hat{F}^+, \hat{F}^-\}$ with the corresponding anomaly scores $\hat{S} = \{\hat{S}^+, \hat{S}^-\}$. To reduce overfitting, contrastive regularization is performed with the learnt class centers $H = \{H^+, H^-\}$. The training loss function combines four terms. $L(\hat{S})$ is the binary cross entropy loss, which minimizes the data loss to train the VAD model. $R_{contrast}(\hat{F}, H)$ is the contrastive regularization used to regulate the network and reduce overfitting (Eq. 1). The third term is the temporal smoothness constraint, which requires the anomaly scores of adjacent segments to be similar to each other. This prevents the anomaly scores from fluctuating irregularly. Lastly, the fourth term is the sparsity constraint, which assumes that only a few segments in a positive video are anomalous. The strength of these different constraints can be fine-tuned by adjusting the intra-center compactness strength $\lambda$, inter-center separability strength $\beta$, temporal smoothness strength $\gamma$ and sparsity strength $\eta$.
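How the four terms might be combined is sketched below. The sketch assumes the commonly used forms of the two constraints, i.e., the sum of squared differences between adjacent scores for smoothness and the sum of scores for sparsity, both computed on the scores of one video; the exact forms used in the paper may differ. The contrastive term is passed in as a precomputed value that already includes its strengths, and all names are hypothetical. The defaults of `gamma` and `eta` mirror the reported settings.

```python
import math


def binary_cross_entropy(scores, labels, eps=1e-7):
    """Mean binary cross entropy between scores in (0, 1) and 0/1 pseudo-labels."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp for numerical safety
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total / len(scores)


def temporal_smoothness(scores):
    """Sum of squared differences between the scores of adjacent segments."""
    return sum((a - b) ** 2 for a, b in zip(scores, scores[1:]))


def sparsity(scores):
    """Sum of scores; small when only a few segments are flagged as anomalous."""
    return sum(scores)


def training_loss(pseudo_scores, pseudo_labels, video_scores, r_contrast,
                  gamma=0.008, eta=0.008):
    return (binary_cross_entropy(pseudo_scores, pseudo_labels)
            + r_contrast
            + gamma * temporal_smoothness(video_scores)
            + eta * sparsity(video_scores))


# Hypothetical scores for illustration.
pseudo_scores, pseudo_labels = [0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]
video_scores = [0.1, 0.1, 0.9, 0.8, 0.2]
print(training_loss(pseudo_scores, pseudo_labels, video_scores, r_contrast=0.0))
```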

IV. EXPERIMENTS AND EVALUATION

A. DATASET DESCRIPTION AND EVALUATION MEASURE

1) UCF-CRIME
We evaluate our proposed network on UCF-Crime [5], currently one of the largest public VAD datasets. The dataset consists of 1900 surveillance videos with an equal number of abnormal and normal videos. It is split into 1610 training videos and 290 testing videos. The dataset is weakly labeled: the training set comes only with video-level labels, whereas the testing set comes with frame-level labels for evaluation. The dataset is diverse and challenging. The videos are real-world surveillance data with complex and diverse backgrounds. There is a total of 13 types of anomalies in the dataset, including explosion, arrest, abuse, fighting, shoplifting, stealing, and vandalism. The videos are untrimmed with a wide range of durations from 8 seconds to 9 hours.

2) EVALUATION
Following previous work on VAD [5], [6], [7], we evaluate our system (CR-U-Net) using the frame-level AUC measure, i.e., the area under the ROC curve computed over frames. A larger AUC implies better performance.
Since the anomaly score is at the segment (clip) level, the same score is expanded to all frames in the segment for AUC computation. We also provide some qualitative results to evaluate the localization performance of our system and the effect of contrastive regularization. Comparison is made with unsupervised methods [2], [12], [30], [48], [49], [50] and weakly supervised methods [5], [6], [7], [8], [15], including recent state-of-the-art methods such as MIST [6], RTFM [7] and HCE [8]. In particular, the evaluated weakly supervised methods employ different structures to model temporal information. For example, GCN [15] uses a similarity graph, RTFM uses a combination of dilated convolution and transformers, and HCE uses a windowing approach to encode local variations in time series.
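The expansion of segment scores to frames and the AUC computation can be sketched as follows. The AUC is computed with the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve. The function names, the number of frames per segment in the demo, and the frame labels are hypothetical.

```python
def expand_to_frames(segment_scores, frames_per_segment=16):
    """Repeat each segment score for every frame of that segment."""
    return [s for s in segment_scores for _ in range(frames_per_segment)]


def roc_auc(scores, labels):
    """Probability that a random abnormal frame outranks a random normal one.

    Ties count as one half. Requires at least one frame of each class.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # tied frames share their mean rank
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


segment_scores = [0.1, 0.9, 0.4]
frame_scores = expand_to_frames(segment_scores, frames_per_segment=4)
frame_labels = [0, 0, 0, 0,  0, 1, 1, 1,  0, 0, 0, 1]  # hypothetical ground truth
auc = roc_auc(frame_scores, frame_labels)
print(f"frame-level AUC = {auc:.3f}")
```

Because all frames of a segment share one score, ties between abnormal and normal frames inside the same segment are unavoidable, which is why the tie handling matters here.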

B. IMPLEMENTATION DETAILS
Two types of backbone network, namely C3D [9] and I3D [10], are used to extract the segment-level features from a video. The network extracts features for every 16-frame clip in the video. For C3D, the feature dimension is $D_g = 2048$, and for I3D, $D_g = 1024$. Following [5], the clips in a training video are grouped into 32 non-overlapping segments. The segment-level feature is obtained by averaging the clip features in each segment. For short videos (<32 clips), duplicate clips are inserted into the video. Through this process, all training videos have a uniform temporal resolution of 32 segments. For the testing set, no grouping is performed; each clip is treated as a segment, so the test videos have varying temporal lengths. For data augmentation, 10-crop augmentation is applied to both the training data and testing data. The cropped frame size is 210×280 pixels, which is 87.5% of the original size. For the U-Net feature extractor, the number of channels $D_u$ is set to 1024 for all blocks. For the classification layer, the numbers of neurons in the MLP layers are set to 512, 32 and 1, respectively. A dropout layer (drop_ratio = 0.7) is inserted between the fully connected layers.
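The grouping of clip features into 32 segments can be sketched as below. The exact boundaries of the segments and the way duplicate clips are chosen for short videos are not specified in the text, so the integer-boundary split and the index-based repetition used here are assumptions, and the function name is hypothetical.

```python
def group_into_segments(clip_features, num_segments=32):
    """Average clip features into a fixed number of non-overlapping segments.

    clip_features: list of clips, each a list of floats of equal length.
    """
    n = len(clip_features)
    dim = len(clip_features[0])
    if n < num_segments:
        # Short video: repeat clips so that every segment receives exactly one clip.
        clip_features = [clip_features[i * n // num_segments] for i in range(num_segments)]
        n = num_segments
    segments = []
    for s in range(num_segments):
        start = s * n // num_segments
        end = (s + 1) * n // num_segments
        group = clip_features[start:end]
        segments.append([sum(clip[d] for clip in group) / len(group) for d in range(dim)])
    return segments


long_video = [[float(i)] for i in range(64)]   # 64 clips with 1-D toy features
short_video = [[float(i)] for i in range(10)]  # fewer than 32 clips
long_segments = group_into_segments(long_video)
short_segments = group_into_segments(short_video)
print(len(long_segments), long_segments[0], long_segments[31])
print(len(short_segments), short_segments[0], short_segments[31])
```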
Unless specified otherwise, the learning rate is set to 0.0001 and the model is trained using the Adam optimizer for 500 epochs with a weight decay of 0.01 and a batch size of 64. Each batch contains an equal number of normal and abnormal samples to ensure balanced training. For contrastive regularization, the following settings are used: the number of centers C = 16, the intra-center compactness strength λ = 0.0001, the inter-center separability strength β = 0.0001, and the margin m = 1.25. The weights of the smoothness and sparsity constraints are set to γ = η = 0.008. Table 1 shows the AUC results on UCF-Crime, which is currently one of the largest and most realistic VAD benchmark datasets. For the UCF-Crime dataset, we compare our system against unsupervised methods [2], [30], [48] and weakly supervised methods [5], [6], [7], [8], [15]. We use I3D features to build our model.
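The balanced batching mentioned in the implementation details can be illustrated with the sketch below. It assumes that videos are identified by string ids and that sampling is done with replacement; both are assumptions made for illustration, and `balanced_batch` is a hypothetical helper rather than the authors' data loader.

```python
import random
from typing import List, Optional, Sequence, Tuple


def balanced_batch(normal_ids: Sequence[str], abnormal_ids: Sequence[str],
                   batch_size: int = 64,
                   rng: Optional[random.Random] = None) -> List[Tuple[str, int]]:
    """Return (video_id, label) pairs with half normal (0) and half abnormal (1)."""
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even for a 50/50 split")
    if rng is None:
        rng = random.Random()
    half = batch_size // 2
    # Sample with replacement so a small pool can still fill half a batch.
    normals = [(rng.choice(normal_ids), 0) for _ in range(half)]
    abnormals = [(rng.choice(abnormal_ids), 1) for _ in range(half)]
    batch = normals + abnormals
    rng.shuffle(batch)
    return batch
```

Passing a seeded `random.Random` makes the batches reproducible across runs.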

C. RESULTS ON UCF-CRIME
In general, the weakly supervised methods far outperform the unsupervised methods [2], [30], [48]. This shows that additional supervisory labels, even weak ones, are indispensable for training better models. In addition, I3D features deliver better performance than C3D features due to a more powerful backbone network and large-scale pre-training.
The proposed CR-U-Net achieves an AUC of 85.24%, the second highest among the evaluated methods. It outperforms RTFM [7] by 0.94%. Out of all evaluated methods, HCE [8] delivers the highest AUC of 85.38%. However, HCE's superior performance is attributed to its augmentation strategy, where hand-crafted anomalies (HC) and noise simulation (NS) are injected into the training set. Without HC and NS, the performance of HCE drops to 84.44%. Therefore, the proposed CR-U-Net actually outperforms HCE by 0.80% on the same original training set. This makes CR-U-Net one of the top performing models for the UCF-Crime dataset and indicates the efficacy of U-Net, whose structure is able to capture both local and global temporal information. In comparison, HCE mainly captures local temporal information. Although RTFM captures both local and global dependencies, they are implemented in two parallel independent structures that are unable to model the interaction between the two. In contrast, U-Net models both dependencies in a unified structure, resulting in state-of-the-art performance on the original training set.

D. COMPARATIVE AND ABLATION STUDY
We performed an ablation study on UCF-Crime to evaluate the effectiveness of the proposed U-Net anomaly classifier as well as contrastive regularization. First, we replace the U-Net anomaly classifier with other kinds of structure. The first is a 3-layer MLP network from [5], which does not model any temporal information. The second is a combination of a pyramid of dilated convolutions (PDC) and a transformer structure (TSA) [7]. The former models the local temporal dependencies while the latter models the global temporal dependencies, in two separate branches. To ensure fairness, the training of the compared models is standardized using binary cross entropy loss and top-k pseudo labels without contrastive regularization. Lastly, a fourth model applies contrastive regularization on top of the network with the U-Net-based anomaly classifier, which represents our optimal model. Table 2 shows the AUC of the compared models. As expected, the MLP-based classifier has the lowest AUC of 81.74%. When the MLP is replaced with PDC and TSA, the AUC improves by 0.46% to 82.20%. This shows that learning temporal dependencies is useful for detecting anomalies in VAD. However, PDC+TSA models the local and global temporal dependencies separately, neglecting the close relationship between them. The proposed U-Net-based network resolves this problem by modeling both dependencies in a single structure. Compared to PDC+TSA, the U-Net structure improves the AUC by 1.23% to 83.43%.
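To make the standardized training objective more concrete, the sketch below shows one plausible form of binary cross entropy with top-k pseudo labels. It rests on assumptions that are not confirmed in this section: the k highest-scoring segments of a video inherit the video-level label, the loss is averaged over those k segments, and k = 3 is a purely hypothetical default. The paper's own formulation may differ.

```python
import math
from typing import Sequence


def topk_bce(segment_scores: Sequence[float], video_label: int, k: int = 3,
             eps: float = 1e-7) -> float:
    """Mean BCE over the k highest-scoring segments, labeled with the video label."""
    if not 0 < k <= len(segment_scores):
        raise ValueError("k must be between 1 and the number of segments")
    top = sorted(segment_scores, reverse=True)[:k]
    loss = 0.0
    for s in top:
        s = min(max(s, eps), 1.0 - eps)  # clamp so that log() stays finite
        loss -= video_label * math.log(s) + (1 - video_label) * math.log(1.0 - s)
    return loss / k
```

Under this form, an anomalous video is penalized only when even its highest-scoring segments score low, while a normal video is penalized whenever any of its top segments score high.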
Next, we evaluate the effectiveness of contrastive regularization, which aims to reduce overfitting by learning more generalizable features. When contrastive regularization is applied, the AUC increases by 1.81% to 85.24%. This shows that features that are more class separable are indeed more generalizable. Section IV-G provides more analysis on this property of contrastive regularization.

E. FINE-TUNING THE MODEL
In this section, we evaluate the impact of channel and filter size in the U-Net classification block, and the number of centers in contrastive regularization.

1) NUMBER OF CHANNELS
To study the impact of the channel size D_u in U-Net, we evaluate the following channel sizes: 256, 512, 1024 and 2048. Unlike the traditional U-Net, all convolutional layers in our design have the same channel size. In our implementation, the input to U-Net from the backbone network has a channel size of 1024. So, the first two settings (D_u = 256 or 512) shrink the channels while the last setting (D_u = 2048) expands the channels. Table 3 shows the experimental results.
The best AUC is achieved when U-Net uses the same channel size as the backbone network feature (D_u = 1024). Setting a lower channel size, D_u = 256 or 512, lowers the AUC. This may be due to an information bottleneck, where useful information is lost when the channels are compressed. Setting a larger channel size, D_u = 2048, decreases the performance as well, because a bigger network is more difficult to optimize and overfits more easily without sufficient training samples.

2) FILTER SIZE
Next, we evaluate the impact of filter size in U-Net. The filter sizes of 3, 5 and 7 are evaluated. The filter size affects the range of the temporal coverage. The filter size should neither be too big nor too small.
As shown in Table 4, the optimal filter size is f = 5. A larger filter size, f = 7, makes the network less sensitive to subtle local cues. It also has more parameters, so it takes longer and requires more resources to train, and it overfits more easily when training samples are insufficient. On the other hand, a smaller filter size, f = 3, has a limited temporal range, which hinders it from extracting longer-range temporal dependencies effectively at the current depth level.
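The trade-off between parameter count and temporal range can be made concrete with a small calculation. The parameter count of a 1D convolution (input channels × output channels × kernel size, plus one bias per output channel) is standard. The encoder layout used for the receptive field, however, is purely hypothetical: three levels of one convolution each, separated by 2× temporal pooling, chosen only for illustration because the depth is not specified in this section.

```python
from typing import List, Tuple


def conv1d_params(in_ch: int, out_ch: int, kernel: int, bias: bool = True) -> int:
    """Number of learnable parameters in a single 1D convolution layer."""
    return in_ch * out_ch * kernel + (out_ch if bias else 0)


def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive field of a stack of (kernel, stride) layers, input to output."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride             # striding spreads later layers further apart
    return rf


def hypothetical_encoder(f: int, levels: int = 3) -> List[Tuple[int, int]]:
    """Hypothetical layout: one conv of size f per level, 2x pooling in between."""
    layers = []
    for level in range(levels):
        layers.append((f, 1))
        if level < levels - 1:
            layers.append((2, 2))
    return layers


for f in (3, 5, 7):
    print(f, conv1d_params(1024, 1024, f), receptive_field(hypothetical_encoder(f)))
```

Parameters per layer grow linearly with f, while the receptive field grows with both f and the number of pooling stages, which is why a small filter may need more depth to reach the same temporal range.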

3) NUMBER OF CENTERS
The number of centers C in contrastive regularization signifies the number of representative events captured by the model. We evaluate C = 0, 2, 4, 8, 16 and 24. When C = 0, contrastive regularization is disabled. Table 5 shows the experimental results.
When contrastive regularization is enabled (C ≥ 2), the performance of the model improves. With C = 2 (one center per class), the AUC improves by 0.87%. The optimal number of centers is C = 16, where the AUC improves by 1.81% to 85.24%. Increasing the number of centers further to 24 does not yield any additional improvement. This is likely because 24 centers are more difficult to train and 16 centers are sufficient to explain the anomalies.

F. QUALITATIVE ANALYSIS
In this section, we perform a qualitative analysis of the model. Fig. 5 shows the frame-level anomaly scores produced by our model on the test videos. Fig. 5(a)-(e) show the scores for five anomalous videos. In general, the anomaly scores align well with the ground truth: the network outputs high anomaly scores for anomalous regions and low scores for normal regions. Fig. 5(f) shows that the network correctly outputs low anomaly scores for all frames in a normal video.
Our system is not perfect. Fig. 5(g) shows a missed detection, where the network misses a shop-lifting event in which a man puts a stolen watch into his pocket. The model fails here because the action is subtle and the stolen item is occluded, making the event difficult to detect for a human who is not watching closely. Fig. 5(h) shows a false alarm when a group of people make some body contact. The model briefly mistakes it for a fighting event.
Fig. 6 looks deeper into the chronology of events for two test videos containing burglary and explosion events. The first video, "Burglary037", starts with an empty cashier counter (Frame 78). Then, a man appears and climbs over the counter (Frame 257), passes several stolen bottles of wine to his accomplice (Frame 796), ransacks the cash register and then flees (Frame 1815). The network correctly predicts the initial scene as normal and outputs consistently higher anomaly scores after the burglar appears. The model predicts a lower anomaly score after the burglars leave. The second video, "Explosion033", contains two different explosion events. The anomaly scores are low for the scenes preceding the two explosions at their respective locations (frames 499 and 1505) and high when the explosions occur (frames 1075 and 2095). This shows that the model is able to localize anomalous events at the frame level despite being trained with video-level labels.

G. IMPACT OF CONTRASTIVE REGULARIZATION
To evaluate the effectiveness of contrastive regularization, we compare the distance from a segment feature f to its nearest normal center h^- ∈ H^- with the distance to its nearest anomaly center h^+ ∈ H^+. A relative distance is computed from these two distances. A negative value indicates that f is nearer to a normal center than to an abnormal center. Conversely, a positive value indicates that it is nearer to an abnormal center. Fig. 7 shows the computed relative distance of the segments in eight test videos. Fig. 7(a)-(e) show five test videos with anomaly events, while Fig. 7(f) shows a test video without anomaly events. The embedding feature of the normal event (white region) is closer to the normal center (below the horizontal line), while the embedding feature of the anomaly event (red region) is closer to the anomaly center (above the horizontal line). This shows that the features generated with contrastive regularization are discriminative and align well with the correct event type. Fig. 7(g)-(h) show two failure cases on test videos with and without an anomaly event, respectively. For the former, the features clearly fail to capture the subtle anomalous event. For the latter, the features show that normal actions are occasionally confused as abnormal due to the high level of activity in the scene.
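One relative distance that matches the sign convention described above (negative when f is nearer to a normal center, positive when it is nearer to an anomaly center) is the plain difference of the two nearest-center distances. The sketch below assumes Euclidean distance and no normalization; both choices are assumptions made for illustration, and the paper's own expression may differ, for example by using another metric or a normalizing term.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relative_distance(f: Sequence[float],
                      normal_centers: Sequence[Sequence[float]],
                      anomaly_centers: Sequence[Sequence[float]]) -> float:
    """Distance to the nearest normal center minus distance to the nearest anomaly center."""
    d_normal = min(euclidean(f, h) for h in normal_centers)
    d_anomaly = min(euclidean(f, h) for h in anomaly_centers)
    return d_normal - d_anomaly  # < 0: nearer to normal, > 0: nearer to anomaly
```

Plotting this value per segment against the ground-truth anomaly interval would give a curve of the kind described for Fig. 7, with the zero level acting as the horizontal line.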

V. CONCLUSION
This work applies U-Net in a novel way to capture both local and global temporal information for detecting anomalous segments in a video. U-Net learns global temporal dependencies on top of local dependencies in the encoder, and this global information is then propagated back to the local level in the decoder. This unified way of modeling the two types of dependencies is found to be superior to RTFM, which models them separately. To reduce overfitting, we propose weakly supervised contrastive regularization. The loss function enforces inter-class separability and intra-class compactness. The resultant features are found to be discriminative and generalizable, resulting in improved test performance. For future work, it would be interesting to explore a 3D U-Net structure that additionally considers the spatial dimension, and to study how different distance metrics affect contrastive regularization.