Toward an Online Continual Learning Architecture for Intrusion Detection of Video Surveillance

With recent advances in deep learning technologies, many commercial video surveillance systems have adopted Artificial Intelligence (AI)-powered video analytics technologies as a way to make our lives smarter and safer. Nevertheless, there is no robust architecture with an appropriate network model for commercial services that considers both high accuracy and low computational cost. Existing deep learning technologies are not enough to model and represent the dynamics of real-world scenes, so it is difficult to satisfy all environments using a generic model. Appropriate training data from false-alarm and/or missed cases can address this limitation but is rarely available due to legal issues relating to the privacy of personal data and the unpredictability of newly incoming data. In this paper, we propose a novel end-to-end hybrid video surveillance architecture for reliable object detection, consisting of front-end and back-end intelligence. For the intelligent front-end, we propose a new object detector with a Multi-scale ResBlock scheme to support the scalability and flexibility of the system. For the intelligent back-end architecture, we also develop a new domain adaptation method that replaces the generic model with each camera's individual personal model by understanding real-time space and context information. It is an iterative and continuous process in which newly arriving data and previous models are consistently engaged in continuous improvement. We conducted a series of experiments, including an interesting proof-of-concept test called the Chameleon project, which demonstrated the high accuracy and versatility of the new architecture while producing robust results that can be implemented in practice.


I. INTRODUCTION
Recent successful technological advances in deep learning-based image/video recognition have led to an explosion in consumer demand for Artificial Intelligence (AI)-based products and services in many real-world applications such as autonomous driving [1], medical image processing [2], virtual training simulators [3] and video surveillance [4]. Among these applications, much attention has been paid to AI-based intelligent video surveillance to ensure the rapid and precise exchange of safety information. This advanced technology enables us to solve various problems related to our lives and property. In this context, NVIDIA has for several years been running a workshop entitled AI City Challenge, which ranges from constructing training and test data to designing architectures between the edge and the cloud server to make our cities smarter and safer [5]. (The associate editor coordinating the review of this manuscript and approving it for publication was Hossein Rahmani.)
Much work has focused on improving the performance of intelligent video analytics (IVA) to detect, track and recognize important or dangerous objects [6], [7]. Unlike conventional feature-based approaches, Convolutional Neural Network (CNN)-based object detection methods automatically extract meaningful feature representations from the training data to simultaneously find the location and class information of objects [8]. Broadly speaking, these methods can be divided according to the number of detection stages into two-stage and one-stage detectors. Among two-stage object detectors, R-CNN [9] leverages a CNN to classify objects after generating sparse region proposals that may contain objects. Compared with two-stage object detectors, one-stage detectors such as SSD [10], YOLOv2 [11], and Pelee [12] directly predict coordinates and class information with no region proposal and refinement stages, making them widely used in many real-time video surveillance applications for their computational efficiency. The authors in [13] designed an FPGA-based smart camera for object detection in video surveillance applications to solve the resource scarcity problem and enable edge-based live video analytics. However, a hybrid architecture between an intelligent camera and a cloud server that efficiently captures important objects in real time has not yet been systematically developed.
One of the most significant objects in video surveillance is a person, since people are strongly associated with every event and activity, such as intrusion, loitering, arson and abandonment. Therefore, many IVA functions have been extensively studied in terms of detecting and tracking a person [7], recognizing gender and age from appearance [14] and predicting upcoming events [15]. In [16], the authors designed a new CNN model based on Fast R-CNN, which incorporates large and small sub-networks to detect various sizes of pedestrian instances in an image. Moreover, Sermanet et al. [17] applied an unsupervised approach based on convolutional sparse coding to pre-trained CNNs for pedestrian detection.
However, as shown in Fig. 1(a), people appear in various poses, clothes and sizes, under various lighting conditions, so problems may arise when designing a general intrusion detector. Unfortunately, animals, neon signs, mannequins, and insects, particularly in low-resolution video surveillance images, can be recognized as a person with a high confidence score, as shown in Fig. 1(b). Thus, many studies have tried to explain why this problem occurs and how it can be solved in terms of 'data refinement' [18], 'domain adaptation' [19], [20] and 'new training techniques' [21]. The authors in [22] developed a transfer learning framework to adapt a generic pedestrian detector to a specific domain with a minimum annotation requirement. Also, Kim et al. [21] proposed an unsupervised domain adaptation method for a single-stage object detector, using weak self-training and adversarial background score regularization to reduce false negatives and false positives. However, such methods have not been fully implemented and sufficiently validated for use in real-world video surveillance systems, as nearly every method requires relatively large computational resources. Fig. 2(a) depicts a conceptual representation of the generic CNN-based method, where the detector operates with a universal model in each domain, independent of domain information. Given the limited model capacity with a finite training dataset, it is difficult to satisfy all domains using a single model. Thus, unexpected problems such as false positives may appear in some cases, as shown in Fig. 1(b). To solve this problem, Khodabandeh et al. [18] described a robust method for unsupervised learning of object detection to reduce the inevitable problems of domain shift and privacy regulation in surveillance applications. The authors of [23] developed a framework for weakly supervised cross-domain object detection, assuming access to image-level annotated images in the target domain.
However, these approaches have inherent limitations, as they often focus solely on the training phase without considering the inference phase in the new domain. Hence, in order to deploy them on a real video surveillance platform, it is necessary to adapt the universal model to specific domains in the inference phase under real-time constraints, as depicted in Fig. 2. To address these limitations, we have developed a novel end-to-end hybrid video surveillance architecture for robust person detection that incorporates front-end and back-end intelligence. For the intelligent front-end, we propose a new object detector for detecting pedestrians in video surveillance utilizing a Multi-Scale ResBlock (MSRB). A new domain adaptation method is then used in our intelligent back-end to replace the general model with a camera-specific personal model by carefully learning the relevant space and context information. Therefore, as illustrated in Fig. 3(b), the proposed architecture can be considered a mutual 'two-way' framework that continuously exchanges detection results and meaningful feedback data between the front-end and the back-end. This differs from the conventional video surveillance architecture, in which each camera detects objects and transmits the detection results to the back-end server, as shown in Fig. 3(a).

FIGURE 3. (a) A 'one-way' framework where the front-end camera sends the detection results to the back-end server. (b) A mutual 'two-way' framework where the camera and server continuously exchange detection results and model performance data.
This paper is one of the first to study a lifelong AI video surveillance architecture on a real-world platform, primarily motivated by the need for near-perfect accuracy and scalability in an IVA system. To demonstrate the usefulness and robustness of the proposed approach, we conducted extensive experiments on two public datasets and one private dataset, and we show that the proposed method significantly outperforms several common baselines. This advantage enables the hybrid architecture to be deployed on a real-world commercial platform, where we performed proof-of-concept (POC) tests with a Microsoft Azure cloud server to validate its applicability. The main contributions of the proposed approach are summarized as follows:
• An end-to-end hybrid architecture between an intelligent camera and a cloud server was developed for detecting a person in a video surveillance system. It is an iterative, feedback-based framework that collects meaningful data through manual (user feedback) and automated operations for continuous understanding and learning.
• For the intelligent front-end, we propose a variant of the PeleeNet architecture, called MSRB-PeleeNet, as the CNN-based object detection model; it performs competitively with other strong baselines.
• In order to precisely understand the space and contextual information of each camera, we develop a new domain adaptation method for the inference phase, yielding an individual personal model in our intelligent back-end.
• To compare and verify the proposed method in a more practical way, we conducted POC tests under real conditions as well as laboratory experiments on the test datasets.

II. RELATED WORKS
A. SYSTEMATIC EDGE-TO-CLOUD APPROACHES
There are several approaches to improving CNN performance using an online process to go beyond laboratory test results. One outstanding approach is to collect as many images as possible online to cover all possible cases in the real-world environment. The authors in [24] proposed utilizing world wide web resources to learn robust object category models for detecting objects in real cluttered scenes. In [25], a discriminative framework for active and online learning was developed, which showed better performance than the study in [24]. Neither approach focused on accelerating CNN performance with an edge-to-cloud architecture; both utilized the same number of training samples for supervised learning. The authors in [26] used many hardware resources, including a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), parallel image processing, video-based distributed data analytics, and a machine learning framework. Their systematic optimization results in approximately 5 to 19 times acceleration compared to the off-the-shelf approach. Kang et al. conducted effective and efficient large-scale video analysis to accelerate large-scale image and video processing [27]. They proposed a search mechanism for an appropriate CNN model for a specific application based on a neural network analysis system, leading to highly accurate model reproduction. In addition, Han et al. developed a new CNN optimization system that produces a catalog of variants for each model and a runtime that schedules these variants on devices and in the cloud subject to accuracy and resource bounds [28]. To the best of our knowledge, that study may be one of the first papers to balance CNN-based intelligence between the edge and the cloud. Recently, Emmons et al. presented preliminary work and described further research topics on advanced neural networks in large-scale video surveillance applications [29].
Since a general neural network is a black box, they developed a split-brain architecture that balances the neural network activations between the edge and the cloud. They also introduced many communication and computation trade-off challenges in creating a scalable and flexible architecture that delivers balanced edge-to-cloud computing.

B. DOMAIN ADAPTIVE OBJECT DETECTION FOR PERSON
Object detection is an important technology in real-world video surveillance systems, requiring very light computation and almost 100% accuracy. The authors in [30] proposed a person detector that is highly robust to occlusion and hard negative cases by applying average grid classifiers for post-refinement. Ren et al. presented a very simple but efficient method for pedestrian detection using a region proposal network and Boosted Forests based on Faster R-CNN [8].
Although CNN-based object detection models can perform well in most cases, potentially negative results can arise in some video surveillance locations. To mitigate this problem, some studies have recently suggested domain adaptation methods for object detection. However, the topic has received less attention than classification and semantic segmentation applications. This is because detection tasks require additional consideration of class imbalance and the intra-class variation of negative classes, and rely on higher-dimensional, denser features than classification tasks [31].
For domain-adaptive object detection, the authors of [32] introduced unsupervised multi-instance learning with Real AdaBoost [33] and developed an effective tracking-based unsupervised online sample collection mechanism to reduce false alarms caused by domain shifts. While it shows better performance, it depends on the performance of the classifiers, making it difficult to apply in real-world practical applications. To improve the performance degradation caused by false negatives, the authors of [34] proposed an automatic adaptation method that creates soft labels for unlabeled data in the target domain based on data from existing detectors (e.g., misclassified object data). In addition, Kim et al. proposed weak self-training and adversarial background score regularization methods for domain-adaptive one-stage object detection [21]. These methods can reduce false negatives and false positives by masking the gradients of hard negative examples and help the CNN extract discriminative features from the target background.

III. PROPOSED METHOD
A. PROBLEM SETTING
To describe the problem more precisely, let $(x_s, y_s)$ denote the source data used as the training set for a given source domain $X_s$, and let $(x_t, y_t)$ denote the target data drawn from the target domain $X_t$, where $x$ is an image and $y$ is the label consisting of the coordinates of the person bounding box and the class information $c$. While any good CNN-based object detector has been developed from a massive $X_s$ training dataset, it can confidently produce incorrect results when confronted with a new domain that the detector has never seen before. In general, this is because the source and target domain backgrounds share fewer common features than the foreground objects do ($P(X_s) \neq P(X_t)$, where $P(X)$ is the marginal probability distribution of domain $X$). This problem involves a number of complex factors and can only be solved by understanding the new target domain, incorporating a sophisticated object detection algorithm into a robust architecture.
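The effect of this marginal shift can be illustrated with a toy sketch (hypothetical numbers, not from the paper's experiments): a background-score threshold calibrated on the source domain produces far more false alarms once the background statistics change.

```python
import random

random.seed(0)

# Hypothetical background "person-ness" scores from a fixed detector.
# Source-domain backgrounds score low; shifted target-domain backgrounds
# (e.g. neon signs, mannequins) score systematically higher: P(X_s) != P(X_t).
source_bg = [random.gauss(0.10, 0.05) for _ in range(1000)]
target_bg = [random.gauss(0.35, 0.10) for _ in range(1000)]

# Threshold chosen so that ~1% of *source* backgrounds trigger false alarms.
threshold = sorted(source_bg)[int(0.99 * len(source_bg))]

fp_source = sum(s > threshold for s in source_bg) / len(source_bg)
fp_target = sum(s > threshold for s in target_bg) / len(target_bg)
print(f"false-alarm rate: source={fp_source:.3f}, target={fp_target:.3f}")
```

The same fixed model that is nearly silent on source backgrounds fires constantly on the shifted target backgrounds, which is exactly the failure mode shown in Fig. 1(b).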

B. HYBRID ARCHITECTURE FOR ONLINE CONTINUOUS LEARNING
We propose a hybrid AI video surveillance architecture that combines front-end and back-end intelligence, as shown in Fig. 4. Unlike conventional cloud-based video surveillance platforms, we deploy a CNN-based object detection model in the camera as the intelligent front-end. Of course, it is certainly possible to obtain better performance by running a model with more computational power and deeper layers in the back-end. However, it is hard to handle multiple cameras simultaneously in a video surveillance system that operates 24/7 (24 hours a day, 7 days a week) due to the limits of processing complexity. Hence, if the performance does not differ significantly between the front-end and back-end models, it is more efficient to deploy the model at the front-end, considering the scalability, flexibility and expandability of the system.
In order to develop a robust initial object detection model for every camera, we train our own CNN-based network model on a training dataset from numerous video surveillance sources. After extensive evaluation on various datasets, the output model is deployed to each object detection camera, as shown in Fig. 4. Although this initial model can perform well in most cases, potentially negative results can occur in some places. To mitigate this problem, we develop a new domain adaptation method that understands the individual $X_t$ environment of each camera. In other words, we replace the initial universal model with an individual specific model by carefully learning the relevant information about space and context. This is possible because a surveillance camera is usually installed in a fixed location, unlike other applications such as autonomous vehicles and handheld cameras.
In our hybrid architecture, a short learning time (an adaptation period of about 1 sec) before the inference phase (service period) is needed to understand the environment more deeply. First, we fine-tune the initial model on the new target data to adapt it to its environment during the adaptation period. This enables the updated model to detect more accurately by gaining knowledge of its space and contextual information for the service period. To deal with data heterogeneity for different purposes, we divide the target images into two databases, a latent database and a query database, according to the confidence scores of the detection results, as shown in Fig. 4. If an image has a high confidence score from the previous model, we use it as a source of verification data for future use in developing new IVA functions. On the other hand, if it has a low confidence score, we use it as a candidate for a new training dataset to overcome the uncertainty.
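The confidence-based split into latent and query databases can be sketched as follows (the score thresholds and record format are illustrative assumptions, not values from the paper):

```python
def route_detections(frames, high_conf=0.8, low_conf=0.4):
    """Split per-frame detection records into a latent database
    (high-confidence, kept as future verification data) and a
    query database (low-confidence, candidates for retraining)."""
    latent_db, query_db = [], []
    for frame in frames:
        score = frame["confidence"]
        if score >= high_conf:
            latent_db.append(frame)   # trusted: future verification data
        elif score <= low_conf:
            query_db.append(frame)    # uncertain: new training candidate
        # mid-range scores are simply served and discarded in this sketch
    return latent_db, query_db

frames = [{"id": 1, "confidence": 0.95},
          {"id": 2, "confidence": 0.30},
          {"id": 3, "confidence": 0.60}]
latent, query = route_detections(frames)
print([f["id"] for f in latent], [f["id"] for f in query])  # [1] [2]
```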
To ensure flexibility in the proposed architecture, we allow two approaches to sampling meaningful target data during the adaptation period: one manual (user feedback) and the other automatic. With the user feedback method, the domain adaptation process is driven by user feedback data, such as false-positive and false-negative cases. In other words, the adaptation procedure delivers only a user-selected portion of the recorded video data to guarantee the high detection quality of the CCTV camera, as illustrated in Fig. 4. Almost every conventional video surveillance system employs this approach to address the problem of low accuracy.
However, people generally tend to avoid any kind of feedback-related action unless the problem is very disturbing. Therefore, we also develop a new method for automatic domain adaptation by sampling new target domain data, which is explained in Sec. III-D.
It is an iterative and lifelong process in which newly arriving data and the previous model are engaged in a continuous process of improvement. However, assimilating new information into the previous model requires modifying its weights, which in turn can significantly disrupt existing knowledge, a problem known as catastrophic forgetting. Thus, to mitigate this problem and allow continuous learning, our architecture conducts the model evaluation process iteratively, allowing the current model to be updated in real time while adapting to the individual environment. This makes the hybrid architecture more robust when adapting to new individual target data.

C. RELIABLE OBJECT DETECTOR FOR INTELLIGENT FRONT-END
In an edge-device-based real-world video surveillance system, real-time operation at low computational cost is required, so the AI model must be fast and lightweight, and portability is a very important factor. In particular, to deploy a CNN-based model in the front-end camera, it is imperative to design a structure that does not depend heavily on a specific layer (e.g., the depthwise separable convolution layer [35]). Wang et al. [12] proposed PeleeNet, whose block module consists of simple basic operations so that it can run on various AI chipsets, increasing model scalability and portability. However, since the ResBlock used in PeleeNet applies only one kernel size (3 × 3), it is difficult to obtain visual information at various scales. If an object is occluded or truncated by other objects or image borders, it is necessary to consider the surrounding information as well as the object itself. There are limits to extracting and analyzing the pixel information around objects when using a single kernel size in the ResBlock on the same feature map, so it is not easy to detect occluded or small objects, which are hard cases in real-world video surveillance systems.
Therefore, we aim to develop a robust object detector (especially for detecting people in video surveillance cameras) while minimizing computational resources. To compensate for this lack of information in pedestrian detection for surveillance cameras, previous studies use a multi-scale feature map or multi-scale kernels that cover a wide range of receptive fields for visual information. Inspired by these studies, we propose a variant of the PeleeNet architecture called Multi-Scale ResBlock (MSRB)-PeleeNet, which utilizes multi-scale kernels by redesigning the existing ResBlock residual structure on each additional feature map, as depicted in Fig. 5. The proposed MSRB consists of kernels of various scales (3×3, 5×5, and 7×7) in order to consider surrounding contextual information more specifically. This makes it possible to analyze the features of specific layers with multi-scale kernels, yielding statistically better performance regardless of the image environment. The extracted contextual information is also used to maintain performance even for very small objects. In real-world video surveillance systems, there are often cases that are generally difficult to detect, such as (a) an occluded object partially obscured by an obstacle and (b) a small object standing far away. Fig. 6 shows the qualitative results on two video surveillance scenes. Looking at the second column, it is difficult to detect small or occluded objects because the existing ResBlock uses only a single-size kernel. In contrast, the third column shows that the proposed MSRB structure, which covers various receptive fields, can detect objects in various environments, enabling more accurate detection of occluded or small persons compared to the ResBlock of the original PeleeNet.
Moreover, we exclude the last two feature maps (3 × 3 and 1 × 1) of the original structure in order to reduce the high computational cost of PeleeNet. Finally, we can confirm that MSRB-PeleeNet is suitable as an object detector for front-end video surveillance cameras due to its superior accuracy and low computational complexity. A quantitative analysis will be discussed in Sec. IV-B.
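The multi-scale kernel idea can be sketched numerically as follows. This is a single-channel toy illustration: the real MSRB operates on multi-channel feature maps inside PeleeNet, and the branch-summation/residual wiring here is a simplification of the actual block.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2-D convolution for a single channel."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def msrb(x, rng):
    """Multi-Scale ResBlock sketch: parallel 3x3 / 5x5 / 7x7 branches
    whose responses are summed and added back to the input (residual)."""
    branches = [conv2d_same(x, rng.standard_normal((s, s)) * 0.01)
                for s in (3, 5, 7)]  # growing receptive fields
    return x + sum(branches)         # residual connection

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 16))
out = msrb(feat, rng)
print(out.shape)  # spatial size is preserved: (16, 16)
```

The 5×5 and 7×7 branches see pixels well beyond the object itself, which is what lets the block use surrounding context for occluded or very small objects.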

D. CONTEXT-AWARE DOMAIN ADAPTATION FOR INTELLIGENT BACK-END
We propose a real-time domain adaptation scheme for MSRB-PeleeNet at each camera in order to learn about the corresponding environment $X_t$. As mentioned earlier, during the adaptation period we refine the original model over several iterations using new negative images from the target domain that do not contain any people. This enables the updated model to detect a person more accurately, using knowledge of its space and contextual information, throughout the service period.
Here, we introduce a new method for automatically sampling negative background images during the adaptation period, as shown in Fig. 7. For ease of understanding, let the target data during an adaptation period of frame duration $N$ be denoted by $x^t_A = \{x^t_0, x^t_1, \ldots, x^t_N\}$, where $n$ is the frame number in the range $0 \le n \le N$. Although there is no corresponding labeled data with bounding-box coordinates for training, we can use these frames as a negative set when no person appears in the image. First, we choose the first frame $x^t_0$ as an element of the finally selected set $x^t_S$ ($x^t_S \subset x^t_A$). At this stage, even if there are some foreground objects in the scene, our extensive experiments show that good domain adaptation results can still be obtained. The second step is to obtain the gray-scale frame difference image
$$d_n = |\bar{x}_n - \bar{x}_{n-1}|,$$
where $\bar{x}_n$ is the gray-scale image of the $n$-th frame. If the normalized mean value of the frame difference over the last few frames (five frames in this paper) is greater than a certain threshold, it is highly likely that at least one moving object appears in the $n$-th frame. To avoid the duplication problem,² we choose the $(n-1)$-th frame, $x^t_{n-1}$, as a candidate image for the next step.
In the third step, we define the dissimilarity metric (DSM) of $x^t_{n-1}$ against all frames in the selected sample set $x^t_S$ to avoid duplication within that set, considering the differences in luminance, contrast, structure, and texture. Here, $S(x^t_{n-1})$ is the structural similarity index (SSIM) [36], a well-known image quality metric that takes the human visual system into account, calculated against every image in $x^t_S$:
$$S(x^t_{n-1}, \chi_s) = \frac{(2\mu_{x^t_{n-1}}\mu_{\chi_s} + C_1)(2\sigma_{x^t_{n-1}\chi_s} + C_2)}{(\mu_{x^t_{n-1}}^2 + \mu_{\chi_s}^2 + C_1)(\sigma_{x^t_{n-1}}^2 + \sigma_{\chi_s}^2 + C_2)},$$
where $\chi_s$ denotes the images selected before the $(n-1)$-th frame, $\{\chi_s\} \in x^t_S$, $\mu_x$ and $\sigma_x$ are the mean intensity and standard deviation of frame $x$, respectively, $\sigma_{xy}$ is the covariance between frames $x$ and $y$, and $C_1$ and $C_2$ are small stabilizing constants. $T(x^t_{n-1})$ is the absolute difference in texture entropy of the $(n-1)$-th frame, $E_{x^t_{n-1}}$, which represents the difference in visual entropy in terms of texture information based on information theory [37], where $p_{T,\bar{x}_{n-1}}$ is the probability mass function of the $i$-th subband of the wavelet decomposition of $\bar{x}_{n-1}$, $\gamma$ is the shape parameter of the generalized Gaussian distribution, and $s$ is a positive constant determined by $\sigma$ and $\gamma$.³ Thus, the final negative sample set $x^t_S$ is determined by thresholding the DSM value against $\delta$. The structure of the above method is summarized in Algorithm 1.

² Since consecutive frames of a surveillance video are almost identical, it is better to extract just one frame as a representative sample than to sample all the frames.
³ A more detailed explanation is given in the previous study [37].

For training, background samples are generally sorted by their predicted confidence scores in ascending order, and some of them are used together with positive samples to build a CNN-based object detector [10]. However, we tune the original object detector using only the negative set $x^t_S$, with no positive samples to detect.
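The sampling procedure of Algorithm 1 can be sketched as follows. For brevity, frames are flat pixel lists, motion is the normalized mean absolute frame difference, and a normalized-correlation score stands in for the paper's SSIM-plus-texture-entropy DSM; the two thresholds are illustrative assumptions.

```python
import math

def motion(prev, cur):
    """Normalized mean absolute gray-scale frame difference."""
    return sum(abs(a - b) for a, b in zip(prev, cur)) / (255.0 * len(cur))

def dissimilarity(frame, selected):
    """DSM stand-in: 1 - max correlation with already-selected frames.
    (The paper's metric combines SSIM and texture entropy; this is a
    simplified placeholder.)"""
    def corr(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                        sum((y - mb) ** 2 for y in b)) or 1.0
        return num / den
    return 1.0 - max(corr(frame, s) for s in selected)

def select_negatives(frames, motion_thr=0.05, dsm_thr=0.3):
    """Keep the first frame; then, whenever motion is detected at frame n,
    add the pre-motion frame (n-1) only if it is dissimilar enough to
    everything already kept (duplicate avoidance)."""
    selected = [frames[0]]
    for n in range(1, len(frames)):
        if motion(frames[n - 1], frames[n]) > motion_thr:
            cand = frames[n - 1]                    # frame before motion
            if dissimilarity(cand, selected) >= dsm_thr:
                selected.append(cand)
    return selected

f0 = [10 * i for i in range(16)]           # synthetic 1-D "frames"
f2 = [10 * (15 - i) for i in range(16)]    # anti-correlated frame
sel = select_negatives([f0, f0, f2, f2, f0])
print(len(sel))  # 2: the duplicate of f0 is rejected, f2 is kept
```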
Therefore, if the original detector finds a person bounding box, it is considered a false positive, because no object is present at that location in the image, and we penalize the loss according to each predicted box's confidence score:
$$L_{neg} = -\frac{1}{N_p}\sum_{i=1}^{N_p} \log(1 - c_i),$$
where $N_p$ is the number of predicted bounding boxes $l$ and $c_i$ is the confidence score of the $i$-th box.
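One plausible form of this penalty (the exact equation is elided in the extracted text, so this is a hedged sketch) treats every box predicted on a person-free background image as a false positive and takes the mean negative log-likelihood of the background class:

```python
import math

def negative_loss(confidences):
    """Penalty for boxes predicted on a person-free background image:
    mean -log(1 - c_i) over the N_p predicted boxes (one plausible form,
    not necessarily the paper's exact loss)."""
    n_p = len(confidences)
    if n_p == 0:
        return 0.0  # nothing predicted on background: no penalty
    return -sum(math.log(1.0 - c) for c in confidences) / n_p

loss = negative_loss([0.5])
print(round(loss, 3))  # -log(0.5) = 0.693
```

The loss grows as the detector becomes more confident about a box that should not exist, pushing such background responses toward zero during the adaptation fine-tuning.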

IV. EXPERIMENTS
In order to verify the accuracy and suitability of the proposed approach, we performed two types of experiments: laboratory tests and POC tests. In the first part (Sec. IV-B), we show the results of the proposed object detector with domain adaptation on two public datasets and one private dataset for intelligent video surveillance. In the second part (Sec. IV-C), we validate the effectiveness and unique strengths of the proposed approach in various POC tests. Note that all of the object detector models, including MSRB, are used to detect a person in video surveillance scenes as an example application of the MSRB. Thus, all datasets used in this study are video surveillance sequences captured from a fixed location with a limited field of view.

A. EXPERIMENTAL SET-UP
In our experiments, we used two public datasets, provided by Performance Evaluation of Tracking and Surveillance (PETS) [38] and the Oxford Town Centre [39]. For the PETS 2009 dataset, we used the 'S2-L1' video, showing an outdoor scene with several pedestrians crossing each other, for a total of 795 frames. We also used the Oxford Town Centre dataset, showing pedestrians walking down a busy street, recorded at 25 fps with a resolution of 1920 × 1080, in which pedestrian paths are mostly smooth and at a constant speed. However, due to privacy regulations, only a very small quantity of real video surveillance data is available. Thus, we used a private dataset composed of 130,561 training images (Private$_T$) and 5,787 hard-level evaluation images (Private$_E$) with pedestrian bounding boxes for research purposes. Releasing these two datasets to the public is difficult because of legal issues.
For object detector training and domain adaptation, we set the batch size to 64 and the learning rate to 0.001. When training on the Private$_T$ dataset, the network was trained for 150k iterations with an initial learning rate of $1.0 \times 10^{-3}$, a weight decay of $5.0 \times 10^{-4}$, and a learning rate decay of 0.1 at 90k and 120k iterations. For the benchmark methods, the inputs were resized to a specific size and trained using the default settings of each algorithm's official code; we adopt SSD [10] as the basic framework for MobileNet and PeleeNet. The anchor sizes were extracted using k-means clustering for each of the PETS 2009, Oxford Town Centre, and Private$_T$ training sets.
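The stated schedule (initial learning rate $1.0 \times 10^{-3}$, decayed by a factor of 0.1 at 90k and 120k iterations over 150k total) corresponds to a simple step function, sketched here:

```python
def learning_rate(iteration, base_lr=1.0e-3, decay=0.1, steps=(90_000, 120_000)):
    """Step-decay schedule matching the training setup described above."""
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= decay  # drop by 10x at each milestone
    return lr

for it in (0, 90_000, 120_000):
    print(it, learning_rate(it))
```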
Experiments on the server were conducted on an NVIDIA Quadro P6000 with CUDA 9.0 and cuDNN v7.1, while the Qualcomm Vision Intelligence Platform with the QCS603, designed for efficient machine learning on IoT devices, was used for the AI camera. For accuracy comparison, we selected the mean Average Precision (mAP), which has been widely used in previous object detection studies, with an Intersection over Union (IoU) threshold of 0.5.
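For reference, the IoU between two axis-aligned boxes given as (x1, y1, x2, y2) is computed as follows; under the evaluation protocol above, a detection is matched as a true positive when IoU ≥ 0.5:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)        # intersection / union

# Half-overlapping boxes: IoU = 50 / 150, below the 0.5 TP threshold.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```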

B. EXPERIMENTAL RESULTS AND DISCUSSION
To evaluate the accuracy of MSRB-PeleeNet, we compared its performance with previous object detection models: YOLOv4 [40], EfficientDet [41], MobileNet v3 [35], PeleeNet [12], RefineDet [42], M2det [43], RetinaNet [44], HSD [45] and EFGRNet [46]. For a fair comparison, we measured mAP and processing time on two different devices, an NVIDIA Quadro P6000 and a Qualcomm QCS603, after training with the same Private$_T$ dataset for person detection in video surveillance videos. Table 1 summarizes the performance in terms of mAP and frames per second (fps) on the two public datasets and the private dataset on the GPU. As shown in the table, the proposed model delivered the highest mAP values by a substantial margin, with reasonable processing time on the GPU. When Private$_T$ was used as the training dataset, the fps of PeleeNet outperformed those of the other methods. In addition, MobileNet, a state-of-the-art lightweight deep learning model for mobile devices, exhibits better fps than the proposed method. However, the mAP performance of the proposed method is superior to that of the other methods, including PeleeNet and MobileNet, for all test datasets (i.e., PETS, Oxford, Private$_E$). When MS COCO was used as the training dataset, PeleeNet achieved the best fps performance of all methods but exhibited the worst mAP performance for all test datasets. On the other hand, the proposed method, achieving the second-best fps performance, exhibits the best performance in terms of mAP.
To sum up, compared with the original PeleeNet, the proposed method shows greater accuracy on all test datasets, at a slightly higher computational cost. Table 2 summarizes the performance in terms of mAP and fps for the two public and one private datasets on a Digital Signal Processor (DSP). Almost all CNN-based object detection models achieve real-time performance on the GPU, but, as shown in the table, their accuracy drops significantly on the DSP chip because of their high computational complexity. From the table, it is seen that PeleeNet exhibits the best fps among all the methods, but the worst mAP on all test datasets. Although MobileNet v3 also exhibits a fast processing time, it is highly dependent on the hardware of the mobile device. In other words, it has so far been difficult to deploy on other mobile chipsets because it uses a custom layer, the depth-wise separable convolutional layer. On the other hand, the mAP of the proposed method is superior to that of the other methods, and its fps is greater than that of all other methods except PeleeNet. This result is consistent with the mAP and fps performance shown in Table 1.
To demonstrate the superiority of the proposed domain adaptation method, its performance (i.e., accuracy and computational complexity) was also compared with the results of other studies. As shown in Table 3, the effect of the proposed domain adaptation method is very significant when its results are compared with those of the original object detectors. Since a short adaptation period is required for domain adaptation, the fps value is reduced for all object detectors, but this is irrelevant to real-world inference; the main computational burden is due to tuning with x_S^t to update the original model. Fig. 8 shows the detection results of the two previous object detector models and MSRB with the domain adaptation method on six representative video surveillance images from each test dataset. As shown in the first row of the figure, when two people overlapped, the existing Mask R-CNN and PeleeNet methods recognized them as one person. In addition, both detectors often fail to detect a person who is occluded by other objects or appears only partially, resulting in a low recall rate. There are also some false positives on both foreground objects and backgrounds, as shown in Fig. 8. Compared with both object detectors, the proposed MSRB shows a significant increase in accuracy, but it may also produce some false positives on person-like objects, as shown in the third column of the figure. These problems can be reduced by applying the proposed real-time domain adaptation method, as shown in the fourth column of Fig. 8.

C. CHAMELEON PROJECT ON COMMERCIAL VIDEO SURVEILLANCE PLATFORM
To verify the proposed approach in a real-world video surveillance system, we performed proof-of-concept (POC) tests on the real commercial T view platform, a large-scale cloud-based video surveillance service platform in South Korea with real-time streaming, IVA and dispatch services. Three hundred of the platform's 130k surveillance cameras were employed for one week to validate our method. We call this experiment the Chameleon project; its aim is to verify the robust performance of the proposed method in practice.
There are two main components to implementing the system in this project, as shown in Fig. 9. The first is Communication for Front-end Intelligence, which requires the exchange of data and object detector models between each camera and the cloud server. For the front-end device, we used Microsoft's Vision AI Developer Kit with the Qualcomm QCS603, based on the Azure cloud platform. The second is Understanding in Back-end Intelligence, where we train the current model and iteratively evaluate the updated model to improve and verify its performance. This makes the hybrid architecture more robust in adapting to new, individual target data.
All implementations were based on Microsoft's Azure services, in which we used GPU instances, a web application server, and a message transmission protocol among the cameras, cloud server, database and storage, provided as Infrastructure as a Service, Platform as a Service and Software as a Service, respectively. More detailed explanations of the system implementation of both components follow.

1) COMMUNICATION FOR FRONT-END INTELLIGENCE
In this project, each camera ran a container-based client to communicate with the server; captured images and the corresponding metadata, together with the camera's identification data, were transferred to the storage server, as shown in Fig. 9. The server then performed a sampling process on the metadata to obtain a negative sample set from the new target domain during the adaptation period of each camera.
As this is an iterative and lifelong process in which data is continuously accumulated while a better model is learned, it is necessary to manage the history of model updates. After training with the new target data, we evaluated the accuracy of the new model, comparing it with the previous one, before sending it to each camera. In this project, this cycle takes approximately one minute for three hundred camera channels.
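The evaluate-before-deploy cycle described above can be sketched as follows. This is a hypothetical skeleton, not the actual system code: `registry`, `train` and `evaluate` stand in for the model-version store, the adaptation step, and the mAP evaluation, respectively.

```python
def update_cycle(camera_id, new_data, registry, evaluate, train):
    """One iteration of the continual model-update loop (hypothetical API).

    registry maps camera_id -> list of model versions, newest last, so the
    update history is preserved. The candidate model is deployed only if it
    scores at least as well as the previous one under `evaluate`.
    """
    previous = registry[camera_id][-1]
    candidate = train(previous, new_data)          # adapt to new target data
    if evaluate(candidate) >= evaluate(previous):  # compare accuracy first
        registry[camera_id].append(candidate)      # record the new version
        return candidate                           # model sent to the camera
    return previous                                # otherwise keep the old one
```

Keeping the full version list per camera is what makes the "history of model updates" manageable: a regressed model is simply never appended, so rollback is implicit.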

2) UNDERSTANDING IN BACK-END INTELLIGENCE
We used the Snapdragon Neural Processing Engine-based GPU server and the Caffe framework for the series of steps to learn, test and quantize the models of all front-end cameras in parallel. To handle the huge, heterogeneous stream of data requests from different cameras, we utilized load balancing and a queuing process on the Microsoft Azure GPU instance, as shown in Fig. 9. More specifically, when the back-end server received the images and the corresponding metadata, the training and quantization processes were performed on the container-based server within the Microsoft Azure GPU instance. All these processes operated simultaneously in the proposed hybrid architecture, in parallel and asynchronously.
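The queuing pattern described above can be illustrated with a minimal worker-pool sketch. This is not the actual back-end implementation: `handle` is a placeholder for the per-camera train-and-quantize pipeline, and all names are illustrative.

```python
import queue
import threading

def run_backend(jobs, n_workers=4):
    """Drain a queue of per-camera jobs with a pool of worker threads.

    jobs: iterable of (camera_id, payload) tuples. Each worker pulls the
    next job, runs the (placeholder) pipeline, and records the result, so
    requests from different cameras are processed in parallel.
    """
    q = queue.Queue()
    results = {}
    lock = threading.Lock()

    def handle(camera_id, payload):
        # placeholder for training + quantizing one camera's model
        return f"model:{camera_id}"

    def worker():
        while True:
            item = q.get()
            if item is None:          # sentinel: shut this worker down
                q.task_done()
                return
            cam, payload = item
            out = handle(cam, payload)
            with lock:
                results[cam] = out
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        q.put(job)
    for _ in threads:
        q.put(None)                   # one sentinel per worker
    q.join()
    for t in threads:
        t.join()
    return results
```

In the deployed system this role is played by Azure's load balancer and queue services rather than in-process threads, but the decoupling of request arrival from processing is the same.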

3) POC RESULTS AND DISCUSSION
Sixty test sequences out of the three hundred camera channels were employed to assess the performance of this project on a commercial video surveillance platform. These sequences were annotated manually to ensure high precision of person marking. Each sequence has a frame rate of 15 fps and a length of 2 hours, so 108,000 images were used to evaluate the performance of each camera channel. Fig. 10 shows exemplary result frames for four representative test sequences used for validation. As shown in the first column of these figures, some false positives were detected either on the person or in the background, and false positives were more frequent than true negatives. These errors were reported more often, with high confidence scores, in the infrared mode. The second column of Fig. 10 shows that with domain adaptation using the MSRB, the false positives were significantly reduced while the ability to detect a person was maintained. In addition, a precision-recall (PR) curve, which plots precision (positive predictive value) against recall (true positive rate), was used to evaluate the overall performance of each camera. In the PR curve, the y-axis is precision and the x-axis is recall; performance increased when we applied domain adaptation during the adaptation period, as shown in the last column of Fig. 10.
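The PR points behind such a curve are computed by sweeping the detection confidence threshold; a minimal sketch, assuming each detection has already been matched against ground truth at IoU ≥ 0.5:

```python
import numpy as np

def precision_recall(scores, labels, n_gt):
    """Precision-recall points from detection confidences.

    scores: confidence of each detection; labels: 1 if the detection
    matched a ground-truth person (IoU >= 0.5), else 0; n_gt: number of
    ground-truth objects. Returns (precision, recall) arrays, one point
    per confidence threshold.
    """
    order = np.argsort(-np.asarray(scores))        # highest confidence first
    tp = np.cumsum(np.asarray(labels)[order])      # cumulative true positives
    fp = np.cumsum(1 - np.asarray(labels)[order])  # cumulative false positives
    precision = tp / (tp + fp)
    recall = tp / n_gt
    return precision, recall
```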
To investigate the performance depending on the camera environment, we divided the sixty test sequences into three groups: ''indoor'', ''outdoor during the day'' and ''outdoor at night''. Broadly, as shown in Fig. 11(a), the mAP performance of all groups increased over time, and the improvement was more significant for the indoor than for the outdoor environments. This is because the complex patterns in the indoor scenes led to low precision and accuracy, as shown in Fig. 11(b). However, some false positives were also detected in high-frequency image areas, such as moving tree leaves and reflected-light regions, in the outdoor sequences. There was relatively little improvement in the outdoor night scenes because there were hardly any moving objects, including pedestrians, during the dawn period, as depicted in Fig. 11(c).

V. CONCLUSION
Although many state-of-the-art CNN-based object detectors have shown remarkable performance, many unexpected errors can occur during round-the-clock operation, leading to severe problems in video surveillance applications. Here, we proposed a novel, end-to-end hybrid video surveillance architecture for reliable real-time object detection, consisting of front-end and back-end intelligence. Unlike conventional approaches, the proposed front-end utilizes a new object detector, called MSRB-PeleeNet, to provide the high scalability, flexibility and cost-effectiveness required by commercial video surveillance platforms. In addition, we developed a new domain adaptation method that replaces the generic model with each camera's individual personal model for the intelligent back-end architecture by understanding space and context information in real time. It is an iterative and continuous process in which newly incoming data and the previous model are consistently engaged in continuous improvement. As CNN-based computer vision techniques and the demand for CNN-based intelligent video analysis continue to develop, we believe the proposed method will play an important role in bringing intelligent models to real commercial platforms.