Video Content Analysis for Compliance Audit in Finance and Security Industry

The quality and accessibility of modern financial services have improved quickly and dramatically, thanks to the fast development of information technology. There is also a growing trend of applying artificial intelligence technologies, especially machine learning, to the finance and security industry, ranging from face recognition to fraud detection. In particular, deep neural networks have proven far superior to traditional algorithms in various computer vision applications. In this paper, we propose a deep learning-based video analysis system for automated compliance audit in stock brokerage, which consists of five modules: 1) video tampering and integrity detection; 2) localization and association of objects of interest; 3) analysis of the presence and departure of personnel in a video; 4) face image quality assessment; and 5) signature action positioning. To the best of our knowledge, this is the first work to introduce a remote automated compliance audit system for dual-recorded video in the finance and security industry. The experimental results suggest that our system can identify most potentially non-compliant videos, greatly improving the working efficiency of auditors and reducing human labor costs. The dataset collected in our experiments will be released with this paper.


I. INTRODUCTION
The Measures for the Suitability Management of Securities and Futures Investors (hereinafter referred to as the Measures) of China came into force on July 27, 2017. As the first special regulation on investor protection in China's securities and futures market, this act is an important foundational law of the capital market. The act regulates the behavior of brokers and also provides evidence for the settlement of future disputes.
The Measures require that when a staff member of a securities or insurance company sells financial products and signs cooperation documents, the company is responsible for recording video and audio of the business process, in compliance with certain rules, for the verification of possible disputes. This practice is called dual-record. A dual-record video usually shows two persons in a specific scenario: a staff member of the securities or insurance company and a client. From the visual quality perspective, a video must meet the following conditions before it can be called compliant:
• The video must be complete and continuous and must not be modified.
• The financial product salesperson should wear a work uniform and a work name tag during recording, and the company nameplate/logo should appear in the recording background.
• The faces of both parties should be unmasked and clearly identifiable.
• Neither the server (staff) nor the client (customer) may leave the scene during the recording.
• The video should contain the entire process of document signing.
To the best of our knowledge, the inspection and review of dual-recorded videos is currently performed manually to ensure that each video complies with the corresponding rules. However, manual audit has several problems: 1) the video is always longer than 15 minutes, which makes manual review very tedious; 2) checking video tampering and integrity relies on frame-by-frame analysis, which is almost impossible to do by hand; 3) auditing is a multi-tasking process that requires repeated analysis and review. As a result, most videos cannot be carefully reviewed before they are stored, and when a dispute occurs, the stored videos may fail to play their due role because they do not meet the dual-record requirements. If accurate automated compliance analysis of the video content could be performed in real time while the video is being archived, these problems could largely be avoided.
The associate editor coordinating the review of this manuscript and approving it for publication was Guitao Cao.
Because deep convolutional neural networks (DCNNs) are powerful at learning robust, high-level semantic feature representations of images and video sequences, they have fueled great strides in a variety of computer vision tasks. In image classification, despite some early successes of convolutional neural networks (CNNs) (e.g., [1], [2]), it was the DCNN that brought the computer vision field into a new era (e.g., [3]-[5]). The most significant advance was achieved in the Large Scale Visual Recognition Challenge (ILSVRC) 2012 [6], where Krizhevsky et al. [7] trained a DCNN called AlexNet on about 1.2 million images and obtained a record-breaking result. Since then, DCNN-based methods have dominated most tracks of the ILSVRC. Object detection, one of the most fundamental problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. Girshick et al. took the lead in applying DCNNs to object detection in 2014 by proposing Regions with CNN features (RCNN) [8]. Since then, object detection has evolved at an unprecedented speed. Girshick also proposed Fast RCNN [9], which addresses some of the disadvantages of RCNN while improving detection speed and quality. Although Fast RCNN significantly speeds up the detection process, it still relies on external region proposals, whose computation becomes the new speed bottleneck. To solve this problem, the Faster RCNN framework proposed by Ren et al. [10] offers an efficient and accurate Region Proposal Network (RPN) for generating region proposals.
RCNN-family methods are typically slower because of the large number of region proposals that must be classified and regressed one by one. To improve speed, single-stage detectors such as YOLOv3 [11] and SSD [12], which treat object detection as a simple regression problem, were proposed. Recently, Siamese network-based trackers have drawn much attention in the community. These trackers formulate visual object tracking as learning a general similarity map by cross-correlation between the feature representations learned for the target template and the search region. Siamese-RPN [13] introduces a region proposal network after the Siamese network, thus formulating tracking as a one-shot local detection task. In face detection, MTCNN [14] proposes a deep cascaded multi-task framework that exploits the inherent correlation between detection and alignment to boost their performance. DCNNs have also been applied to various other tasks.
Video content analysis (VCA) deals with the extraction of structural metadata from raw video, to be used as components for further processing in applications such as summarization, classification or event detection. Its goal is to extract structural and semantic content automatically. One key subject in video surveillance is the detection of abnormal events [15]-[17], where DCNNs have also been widely used to determine whether a scene is anomalous. From the perspective of video content analysis, we focus on detecting the following non-compliant behaviors: a. the video is incomplete or has been tampered with, for example by frame insertion or frame deletion; b. the staff uniform does not conform to the Measures, or the recording site does not contain identifiers such as a company logo or brand that can be used to identify the company; c. the participants cannot be identified because of the low quality of their faces in the video sequence; d. the process of signing a document cannot be detected and pinpointed in the video. Based on deep learning technology, this paper integrates high-precision, computationally efficient deep neural networks for detection, recognition and tracking with basic methods from the traditional digital image processing field.
The contributions of our paper are summarized as follows:
• This is the first work to introduce a remote automated compliance audit system for dual-recorded video in the financial and securities industry.
• We propose a series of practical solutions for compliance detection: a joint discrimination mechanism combining the Mean Structural Similarity Index (MSSIM) [18], luminance compensation and deep content analysis for video integrity detection in complex scenarios; an elaborately designed object association scheme that binds objects to the corresponding person; a scoring mechanism that represents facial pose, size, sharpness and feature attributes for face quality assessment; and a 3D CNN [19] to recognize the signature action, with Siamese-RPN used to filter background noise in the video sequence, which greatly improves accuracy.
• We built a dataset for object detection in the financial scenario using image search engines such as Google Images and Baidu image search, and we will make part of it publicly available together with the publication of this paper.
VOLUME 8, 2020
The remainder of this paper is organized as follows: Section II gives a brief introduction to related work. Section III describes the methodology of each module of the automated system. Section IV presents the implementation details and a case study. Finally, Section V concludes the paper with a summary of findings and future work.

II. RELATED WORKS
Video content analysis (VCA) is the capability of automatically analyzing video for specific tasks based on a video sequence instead of a single image. This capability is widely used in the security industry to better sense the underlying situation. The video data generated by IP cameras are processed, categorized and analyzed to understand the objects and activities captured. In our proposed system, human action and motion recognition is an important part; it has a wide range of applications, such as intelligent video surveillance and home environment monitoring [20], video storage and retrieval [21], intelligent human-machine interfaces [22], and identity recognition [23]. Human action and motion recognition covers many research topics, including human detection in video, human pose estimation, human tracking, and the analysis and understanding of time series data.
The key to good human action recognition is robust human action modeling and feature representation. Unlike feature representation in the image space, the feature representation of human action in video not only describes the appearance of the human in the image space, but must also extract the motion in the image sequence. The problem of feature representation is extended from two-dimensional image space to three-dimensional spatio-temporal space.
In recent years, many works have focused on applying CNN models to the video domain in order to learn spatio-temporal patterns, which have received increasing attention in video content analysis (VCA). Using a 2D CNN is a straightforward way to conduct video recognition. For example, Simonyan and Zisserman [24] designed a two-stream CNN with an RGB input (spatial stream) and an optical flow input (temporal stream). Ji et al. [19] introduced the 3D CNN model, which operates on stacked video frames. The 3D CNN uses 3D convolution kernels to learn motion information between adjacent frames in a volume. To handle 3D signals effectively, Sun et al. [25] introduced factorized spatio-temporal convolutional networks, which factorize the learning of the original 3D convolution kernel into a sequential process that learns 2D spatial kernels in the lower layers. Wang et al. [26] proposed a spatial-temporal non-local module to capture long-range dependencies.
Manual audit involves many repetitive tasks. Our goal is therefore to let AI and machine learning take over these tedious tasks so that humans can spend more time on higher-level work. In this paper, we formulate an automated compliance audit problem from a real business scenario and propose an intelligent audit system based on a set of deep learning technologies that simplifies the tedious tasks and helps staff improve their work efficiency. The system integrates object detection, object tracking, face detection and traditional image processing methods to solve the following problems: video tampering and integrity validation, object localization and object association, person-departure detection, face image quality assessment, and signature action positioning. We divide the system into five corresponding business modules (see Figure 1): video tampering and integrity analysis; badge, company logo, nameplate and uniform detection; checking whether the staff member or the customer left the camera view during recording; face image quality assessment in the video sequence; and signature action detection.

III. PROPOSED APPROACH FOR VISUAL ANALYSIS BASED COMPLIANCE AUDIT
In this section, we describe the detailed method of each sub-module in our proposed approach, which includes video tampering and integrity detection, object detection and object association, person-departure detection, face image quality assessment and signature action positioning.

A. VIDEO TAMPERING AND INTEGRITY DETECTION

1) VISUAL FEATURE
To detect whether the video is intact and has not been tampered with, we use the Structural Similarity Index (SSIM) [27], a perceptual metric that quantifies image quality degradation, to detect modifications. SSIM measures the perceptual difference between two similar images, so it is sensitive to image changes over a short time. In our system, the Mean Structural Similarity Index (MSSIM) [18] is used to measure the difference between two images, as it is often more effective at capturing local changes in the image.
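SSIM compares the similarity between two images $x$ and $y$ in three dimensions: luminance $l(x, y)$, contrast $c(x, y)$, and structure $s(x, y)$. The similarity between $x$ and $y$ is a function of these three terms:
$$\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \, [c(x, y)]^{\beta} \, [s(x, y)]^{\gamma}$$
The luminance similarity of the two images is calculated by the following formula:
$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$$
where $\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i$ is the mean intensity of the image and $C_1$ is a constant that avoids division by zero. The contrast similarity of the two images is calculated by the following formula:
$$c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$$
where $\sigma_x = \left(\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)^2\right)^{1/2}$. The formula of structural similarity is as follows:
$$s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$$
where $\sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)$. In general, SSIM cannot be applied directly to evaluate the similarity of a whole image. Instead, the image is divided into blocks, the SSIM of each block is calculated, and the block scores are averaged to generate the image-level score:
$$\mathrm{MSSIM}(X, Y) = \frac{1}{M}\sum_{j=1}^{M} \mathrm{SSIM}(x_j, y_j)$$
where $x_j$ and $y_j$ are the contents of the $j$-th block of the two images and $M$ is the number of blocks.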
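The block-wise averaging described above can be sketched in a few lines. This is a minimal illustration only, not the system's implementation: it assumes grayscale images given as lists of rows, non-overlapping square blocks with uniform weights, exponents of 1 for the three terms, and the commonly used constants C1 = (0.01 · 255)², C2 = (0.03 · 255)², C3 = C2 / 2. The function names `ssim_block` and `mssim` are hypothetical.

```python
from statistics import fmean

# Assumed stabilising constants for 8-bit images (not stated in the text).
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2


def ssim_block(x, y):
    """SSIM of two equally sized flat lists of pixel intensities."""
    n = len(x)
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    # With unit exponents and C3 = C2 / 2, the product l * c * s reduces to:
    return ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    )


def mssim(img_x, img_y, block=8):
    """Mean SSIM over non-overlapping block x block tiles.

    Assumes both images have the same size, at least block x block.
    """
    h, w = len(img_x), len(img_x[0])
    scores = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            bx = [img_x[i][j] for i in range(r, r + block) for j in range(c, c + block)]
            by = [img_y[i][j] for i in range(r, r + block) for j in range(c, c + block)]
            scores.append(ssim_block(bx, by))
    return fmean(scores)
```

Identical frames score 1.0, and the score drops as local structure diverges, which is why a sudden change between neighbouring frames is a useful tampering cue.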

2) DEEP SEMANTIC FEATURE
The MSSIM-based method works well in most common cases, but violent camera shaking or strong light noise in the scene may cause false alarms. Therefore, our approach adopts deep residual neural networks [28], pre-trained on ILSVRC2012 [29], to extract the deep feature of each video frame. Such features are known to be more robust against drift across different domains.
Considering the tradeoff between computation and accuracy, in our experiments we select ResNet-50 as the feature extractor and use the features of its last layer (the avg-pool layer), whose dimension is 2048 and whose receptive field covers the whole image. Such deep features carry rich semantic information and can compensate for the shortcomings of traditional methods: the cosine similarity of the features of two frames measures their difference at the semantic level. The default threshold is set to 0.95, meaning that an image pair is identified as adjacent frames if the similarity score of the two images is at least 0.95.
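The cosine similarity of two feature vectors $a$ and $b$ is computed as:
$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
FIGURE 2. Illustration of the deep semantic feature extractor using ResNet-50. The input is two frames sampled from the video sequence and the output is the similarity score of the two images, which ranges from 0 to 1.0; the higher the score, the more similar the frames.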
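As a small illustration of the semantic check, the sketch below computes the cosine similarity of two feature vectors and applies the 0.95 default threshold. It assumes the 2048-dimensional ResNet-50 features have already been extracted, and it assumes that a pair scoring at or above the threshold is treated as a normal (continuous) pair; the function names are hypothetical.

```python
import math

SIM_THRESHOLD = 0.95  # default threshold stated in the text


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for degenerate all-zero features
    return dot / (norm_a * norm_b)


def semantically_continuous(feat_a, feat_b, threshold=SIM_THRESHOLD):
    """True when two frames look like a normal pair at the semantic level.

    Assumption: a score at or above the threshold means 'no tampering'.
    """
    return cosine_similarity(feat_a, feat_b) >= threshold
```

Because average-pooled features after a ReLU are non-negative, the score stays within 0 to 1.0 in practice, in line with the range given for Figure 2.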

B. OBJECT LOCALIZATION AND ASSOCIATION
In this scenario, the company staff should wear a uniform and a name tag, and the logo of the securities company should appear in the recording scene. The face, person, uniform, nameplate and company logo are therefore the key objects here.

1) OBJECT DETECTION
For visual object detection, YOLOv3 [11] is used in this paper. The YOLO family of convolutional neural network models for object detection consists of one-stage methods that are fast and accurate, which makes them a good choice for real-time detection. YOLOv3 combines an FPN-like (Feature Pyramid Network) structure with a backbone network called Darknet-53, which is good at detecting small objects in the wild. FPN exploits the inherent multi-scale structure of a deep convolutional network to construct a feature pyramid that has rich semantics at all levels and facilitates the detection of objects at different scales.
For face detection, we use the lightweight MTCNN [14], which efficiently detects human faces and facial landmarks in the wild. MTCNN is a three-stage detector. The first stage, P-Net, applies the same detector to multiple scales of the input image (pyramid sampling). As a result, it generalizes well to target objects (faces) of various sizes and can also detect rather small objects. The second stage, the refine network R-Net, filters out most candidate windows in non-face regions and, at the same time, refines the face bounding boxes generated by P-Net. In the third stage, the output network (O-Net) produces the final bounding boxes and facial landmark positions. This face detection method is robust to varying light conditions, face pose variations and visual variations of the face.

2) HUMAN TRACKING
To facilitate continuous observation and analysis of the target of interest, we use Siamese-RPN [13] to continuously predict the position of the object in the video sequence. Siamese-RPN formulates tracking as convolutional feature cross-correlation between the target template and the search region by introducing a region proposal network after a Siamese network, which is trained end-to-end offline with large-scale image pairs. It outputs both the tracking bounding boxes and the corresponding confidence scores of the object, which help us determine whether the target has left the screen.
To find a suitable threshold for deciding whether the target has disappeared, we tested the tracker on 20 video sequences from our target scenarios and calculated the average confidence value of the tracked target during its departure as the threshold. Figure 3(a) shows an example video sequence with the score changes as people walk out of the screen. Figure 3(b) shows the statistics of the confidence score changes when the object leaves the view. We mark the first frame of the departure as the beginning and the last frame as the ending, then average the confidence over these frames to obtain the threshold for that video. The mean threshold over all 20 target videos is about 0.73. Therefore, when the confidence of the target is less than 0.73 in 5 consecutive frames, we consider the target to have left the view. With this strategy, there is no failure case in our test set, which includes 10 positive samples and 10 negative samples.
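The departure rule above (confidence below 0.73 for 5 consecutive frames) can be sketched as a simple run-length check. The function name and the choice to report the frame at which the fifth low-confidence frame is seen are assumptions made for illustration.

```python
def first_departure_frame(confidences, threshold=0.73, patience=5):
    """Return the index of the frame at which departure is declared, or None.

    confidences: per-frame tracker confidence scores in temporal order.
    Departure is declared once `patience` consecutive scores fall below
    `threshold`; the returned index is the last frame of that run.
    """
    run = 0
    for idx, score in enumerate(confidences):
        run = run + 1 if score < threshold else 0
        if run >= patience:
            return idx
    return None
```

A single low score (for instance during a brief occlusion) resets nothing permanently: the counter only triggers when the low-confidence state persists.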

3) VISUAL OBJECT OF INTEREST ASSOCIATING
To implement the association function, we apply an IOU-based matching mechanism that calculates the relative position of two objects. Since all objects, people and faces are detected separately, we must establish the relationships among them before performing the compliance audit (e.g., matching the uniform and name tag to the corresponding staff member). The pipeline of the standard check for dressing and scene arrangement is as follows:
• Identify each participant and assign an ID.
• Detect the target objects (uniform, name tag and company logo), then mark the object positions frame by frame.
• Associate the uniform and name tag with the corresponding staff member.
We need to know the relationships between each body, face, uniform and name tag. We associate them by calculating the IOU of two objects; however, in this use case the traditional IOU is not suitable for measuring the overlap when one object is much larger than the other. As shown in Figure 4(a), $S_1$ is the small object, $S_2$ is the big one, and $S_\cap$ is their intersection area. We therefore modify the formula to $IOU = S_\cap / S_1$. The association is then established for all objects and persons in the frame. In the example, two bodies, two faces, two uniforms and one tag are detected in the video frame. The IOU of body1 and face1 is 1.0, so we bind them as id1. The IOU of the tag and body1 is 1.0 and that of uniform1 and body1 is 0.992, so we determine that the tag and uniform1 belong to id1. The IOU of body2 and face2 is 1.0, so we bind them as id2. The IOU of uniform2 and body2 is 0.877, so we determine that uniform2 belongs to id2.
In the three cases of matching, a certain face and body are considered to attach to the same person only if the IOU is over 0.9; a certain uniform is thought to belong to a certain body only if the IOU is over 0.8; if a person wears a work badge / logo, IOU of the object and the body should be 1.
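The matching rules can be illustrated with the following sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2), takes the modified IOU as the intersection area divided by the area of the smaller box, and assigns each object to the body with the highest ratio; the data layout and function names are hypothetical, and a small tolerance is used for the "IOU should be 1" rule.

```python
MIN_RATIO = {"face": 0.9, "uniform": 0.8}  # thresholds stated in the text


def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def overlap_ratio(box_a, box_b):
    """Intersection area divided by the area of the smaller box."""
    inter = area((max(box_a[0], box_b[0]), max(box_a[1], box_b[1]),
                  min(box_a[2], box_b[2]), min(box_a[3], box_b[3])))
    smaller = min(area(box_a), area(box_b))
    return inter / smaller if smaller > 0 else 0.0


def is_match(kind, ratio):
    if kind == "tag":
        return ratio >= 1.0 - 1e-9  # a tag must lie fully inside the body
    return ratio > MIN_RATIO[kind]


def associate(bodies, objects):
    """bodies: {person_id: box}, non-empty; objects: list of (kind, box).

    Returns {person_id: [kinds bound to that person]}.
    """
    owned = {pid: [] for pid in bodies}
    for kind, box in objects:
        best_id = max(bodies, key=lambda pid: overlap_ratio(box, bodies[pid]))
        if is_match(kind, overlap_ratio(box, bodies[best_id])):
            owned[best_id].append(kind)
    return owned
```

Dividing by the smaller area means a face fully contained in a much larger body box still scores 1.0, which the standard intersection-over-union would not.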

C. PERSON-DEPARTURE DETECTION
The Measures require that neither the staff member nor the customer leave the camera view during video recording. We formulate this problem as a motion detection task that can be solved with detection and tracking. In the previous section, we introduced YOLOv3 and Siamese-RPN to solve the detection and tracking problems, respectively. Following the Measures, we designed the algorithm pipeline for person-departure detection shown in Figure 5. Because the detector and tracker are not guaranteed to be 100% correct and may produce false positives, we collect the prediction results of five consecutive frames and give the final decision only if all of these frames have the same state. With this strategy, the final prediction of this module has almost no errors.

D. FACE IMAGE QUALITY ASSESSMENT
To exactly identify the identity of the participants in the whole process, the Measures requires the face of the staff and the customer should be clear enough in the video. In this section, we propose two metrics to measure the quality of the face image:

1) FACE SEMANTIC FEATURE
The MTCNN network contains two heads, bounding box regression and classification, in which the classification branch gives the confidence of the foreground (face). The larger the confidence value, the more obvious the face pattern. Based on this observation, we directly use this output of the network as one metric to measure the quality of the face in the video sequence.

2) VISUAL FEATURE
Visual features here mainly include face size, posture and face image blurriness. To make a quantitative analysis of the evaluation results, a comprehensive scoring rule was established in this paper (see TABLE 1(a)). In this paper, the quality score was set from 0 to 100. The higher the score is, the better the quality is.
Facial size is one of the important factors that directly affect the quality of a face image. We rescale the face image to different resolutions (bilinear interpolation), from 60 to 120 pixels, then use a face recognition model to calculate the similarity between the template face and the rescaled face. The performance degrades considerably when the size is smaller than 60 × 60. Blur detection applies the Laplace operator in a sliding-window fashion over the detected face area and calculates the variance of the response, which gives a reasonable estimate of how blurry a face is. The Laplace operator is as follows:
$$\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$$
Based on our experiments, we summarize the scoring scheme for facial size, blurriness, face pose and face semantic feature. For the size, blurriness, pose and MTCNN confidence score, we set up an almost linear mapping from the original metric to the final score (TABLE 1(b)-1(e) lists the details of how to map each feature to the face quality score).
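The blur measure can be sketched as the variance of a discrete Laplacian response over the face crop. This is an illustration under stated assumptions: the crop is a grayscale image given as a list of rows (at least 3 × 3), and the common 4-neighbour kernel [[0, 1, 0], [1, -4, 1], [0, 1, 0]] is used, which the text does not specify; `laplacian_variance` is a hypothetical name.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    Low values indicate few sharp edges, i.e. a blurrier image.
    """
    h, w = len(gray), len(gray[0])
    responses = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            responses.append(
                gray[r - 1][c] + gray[r + 1][c]
                + gray[r][c - 1] + gray[r][c + 1]
                - 4 * gray[r][c]
            )
    return pvariance(responses)
```

A perfectly flat or linearly shaded patch yields zero variance, while a patch with strong edges yields a large value, so the raw variance can then be mapped to a score.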
Concretely, since MTCNN detects five facial landmarks, as illustrated in Figure 6, we directly use the relative positions of these landmarks for facial pose estimation. Pitch, yaw and roll, the three dimensions of movement of an object moving through a medium, are used to measure the facial pose here. The formula is expressed as follows: where $h_1 = y_3 - \frac{y_1 + y_2}{2}$ and $h_2 = y_3 - \frac{y_4 + y_5}{2}$.

E. SIGNATURE ACTION POSITIONING
The action recognition task involves identifying different actions in video clips (sequences of 2D frames), where the action may or may not be performed throughout the entire duration of the video. In this section, we use a 3D CNN [19] model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. In practice, we made the following improvements for handwriting action prediction:
• In this application scenario, most videos contain two persons in the scene at the same time, so recognizing the action from whole frames, which also include a complex background, is not rigorous. We first crop a region of interest (ROI) using the bounding box provided by the body detector and tracker, then feed the cropped ROI sequence to the network.
• We sample every 8 frames and take 16 consecutive frames each time as the network input. Once the writing action is detected three consecutive times, we mark it as the starting point of the signing process.
• Then, if the signature action is not recognized three consecutive times, we regard it as the ending point.
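The start and end rules in the list above can be sketched as follows. The sketch assumes that "sample every 8 frames" means a clip stride of 8 frames with 16-frame clips, that per-clip writing decisions are already available as booleans, and that the start (end) is placed at the first clip of the three-clip run; these readings and the function names are assumptions, not details confirmed by the text.

```python
def clip_starts(num_frames, clip_len=16, stride=8):
    """Start indices of the 16-frame clips, one every 8 frames."""
    return list(range(0, num_frames - clip_len + 1, stride))


def locate_signing(clip_is_writing, needed=3):
    """Return (start_clip, end_clip) indices; either may be None.

    start_clip: first clip of the first run of `needed` writing clips.
    end_clip:   first clip of the following run of `needed` non-writing clips.
    """
    start = end = None
    pos_run = neg_run = 0
    for i, writing in enumerate(clip_is_writing):
        if start is None:
            pos_run = pos_run + 1 if writing else 0
            if pos_run == needed:
                start = i - needed + 1
        else:
            neg_run = 0 if writing else neg_run + 1
            if neg_run == needed:
                end = i - needed + 1
                break
    return start, end
```

A clip index can be converted back to a frame index by multiplying by the stride, for example `clip_starts(n)[start]`.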

IV. EXPERIMENTS AND CASE STUDY
In this section, we perform a collection of experiments on the video tampering/integrity detection, object detection, object tracking and action recognition tasks. Because this paper involves a commercial project and the client requires that the overall performance not be disclosed, we report only component-wise performance numbers to respect these commercial constraints. Since training a model from scratch requires a large dataset and takes a long time to converge, all network backbones used in this paper were pre-trained on the ImageNet dataset and then applied to the specific tasks. The pre-trained AlexNet,^1 ResNet-50,^2 Darknet-53^3 and 3D ResNeXt-101^4 [30] models are all publicly available.

A. VIDEO TAMPERING AND INTEGRITY DETECTION
In the indoor scenario, lighting is an important factor that strongly affects the performance of the MSSIM method but has less effect on the deep semantic feature. We designed three experiments to verify the effectiveness of our strategy.

1) MSSIM UNDER CONSTANT ILLUMINATION
Under constant illumination, we built three datasets of frame pairs (adjacent-frame, interval-frame and irrelevant-frame), which consist respectively of two adjacent frames, two non-adjacent frames from the same video, and two irrelevant frames from different videos; each set has 500 pairs. As shown in TABLE 2, a higher value means the frame pair is more similar. For the MSSIM parameters, we use a sliding window to divide the image into blocks, with the window size set to 11. Considering the influence of the window shape on the block, Gaussian weights are used to calculate the mean, variance and covariance of each window, and the standard deviation of the Gaussian kernel is set to 1.5. Meanwhile, we attack a normal video by frame insertion and frame deletion to test whether obvious patterns appear when these attacks occur. The comparison results are shown in Figure 9. We observe that the change of the MSSIM value in the normal video is very smooth over time, always lower than 0.02. In other words, the difference between adjacent frames is usually far below 0.1, so we set the MSSIM threshold to 0.1. When an attack video segment is inserted into the raw video, two obvious peaks appear in Figure 9(b), so the attack frames can easily be located by analyzing the MSSIM values of the peaks (much larger than 0.1). Figure 9(c) shows the score changes when a short segment is deleted; there are obvious changes in the values, and the peak value is around 0.2, which is also much higher than the threshold.
1 https://github.com/songdejia/Siamese-RPN-pytorch
2 https://github.com/KaimingHe/deep-residual-networks
3 https://pjreddie.com/media/files/yolov3.weights
4 https://github.com/kenshohara/3D-ResNets-PyTorch

2) MSSIM UNDER NON-CONSTANT ILLUMINATION
When the light changes, luminance compensation is applied to suppress the interference caused by the change. In the previous experiments, we tested the method only under constant light conditions, but in realistic scenarios there are many reasons for the light condition to change, sometimes drastically. Using MSSIM alone will then cause false positive detections, because a light change makes the MSSIM value change considerably. Therefore, we use a cascade detector with luminance compensation and semantic feature matching to discriminate false positives accurately.
The structure of the cascade detector is shown in Figure 10. When the MSSIM value suddenly changes considerably, the frame is considered a potential tampering frame. We then compensate the luminance of the frame neighboring the candidate frame (adding or subtracting the difference between the average values of the two frames) and calculate the MSSIM value again to make a further decision.
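One reading of the cascade described in this section is sketched below. It assumes three stages applied in order (raw MSSIM change, MSSIM change after luminance compensation, deep semantic similarity) with the thresholds 0.1 and 0.95 given in the text; the exact structure of Figure 10 is not reproduced here, and the function name, the labels and the use of callables for the later stages are illustrative assumptions.

```python
def classify_transition(mssim_change, compensated_change, semantic_similarity,
                        mssim_threshold=0.1, semantic_threshold=0.95):
    """Label one frame transition.

    mssim_change:        MSSIM change between the two frames (a number).
    compensated_change:  zero-argument callable returning the MSSIM change
                         after luminance compensation (evaluated lazily).
    semantic_similarity: zero-argument callable returning the cosine
                         similarity of the deep features (evaluated lazily).
    """
    if mssim_change <= mssim_threshold:
        return "normal"
    if compensated_change() <= mssim_threshold:
        return "illumination_change"
    if semantic_similarity() >= semantic_threshold:
        return "false_alarm"
    return "tampering_candidate"
```

Passing the later stages as callables keeps the expensive feature extraction off the common path: it only runs for the few transitions that the cheap MSSIM check cannot clear.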
During brightness compensation, in order to restore the intuitive visual information and avoid unnecessary interference, we perform a gamma correction on the darker frame. We dynamically select as the adjustment parameter the value (picked from 0.1, 0.2, . . . , 0.9) that makes the average brightness of the two frames closest to each other.
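The gamma selection step can be sketched as a small search over the nine candidate values. The sketch assumes 8-bit intensities, the usual power-law form out = 255 · (in / 255)^γ, and frames flattened to lists of pixel values; the function names are hypothetical.

```python
from statistics import fmean

GAMMAS = [g / 10 for g in range(1, 10)]  # 0.1, 0.2, ..., 0.9


def gamma_correct(pixels, gamma):
    """Apply power-law correction to 8-bit intensities (gamma < 1 brightens)."""
    return [255.0 * (p / 255.0) ** gamma for p in pixels]


def best_gamma(dark_pixels, bright_pixels):
    """Pick the candidate gamma whose corrected mean is closest to the target."""
    target = fmean(bright_pixels)
    return min(GAMMAS,
               key=lambda g: abs(fmean(gamma_correct(dark_pixels, g)) - target))
```

After correcting the darker frame with the selected value, the MSSIM of the pair can be recomputed on frames of comparable brightness.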

3) DEEP SEMANTIC FEATURE
As mentioned before, the MSSIM-based method may cause false alarms in some situations. The deep semantic feature extracted by the ResNet-50 network, illustrated in Figure 2, focuses more on high-level semantic information and is insensitive to image noise. Figure 11 shows that it complements the traditional MSSIM-based method. There are four peaks in total in the video sequence: p1 and p2 are caused by luminance changes (false alarms), while p3 and p4 are caused by frame insertion. Figure 11(a) shows that p1 and p2 are reduced by luminance compensation, but their values remain above the default MSSIM threshold (0.1), so the MSSIM criterion alone cannot handle this situation. The bottom plot of Figure 11(a) shows the corresponding scores from the semantic feature similarity. p1 and p2 can easily be identified as false alarms, since their scores are larger than the default threshold of the deep semantic feature, while p3 and p4 are identified as attack frames. In Figure 11(b), all the peaks are caused by intense light changes. Although the luminance compensation module almost fails here, the deep semantic feature successfully identifies them as normal frames.

B. OBJECT DETECTION
In this part, we introduce the implementation details of the deep learning model training, including the datasets, the selection of training hyperparameters and the corresponding evaluation results. The networks are trained using stochastic gradient descent (SGD) on an RTX 2080 Ti GPU.

1) CONSTRUCTING COMPLIANCE AUDIT OBJECT DATASET
As mentioned in the previous sections, we built the compliance audit object dataset by crawling images with image search engines: we manually defined the relevant keywords for this scenario and then ran the crawling process. Since the raw collection contains many noisy images, we spent considerable effort cleaning the data; the primary objective was to develop a robust dataset for optimizing the deep learning-based detection model. Our dataset features four categories representing the generic types of compliance audit target objects (person, uniform, name tag and company logo), for which we labelled a total of about 10.5K images and about 20.1K object bounding boxes. The training set includes 8432 images, while the validation set includes 2108 images (an 80%/20% split of the original dataset). The distribution of the dataset and the corresponding performance for each category are listed in TABLE 3.

2) TRAINING THE MODEL
A convolutional neural network (CNN) can robustly classify or detect objects even when they are placed in different orientations if it has the property called invariance. More specifically, a CNN can be invariant to translation, color jittering, viewpoint, size, or lighting noise (or a combination of these), and data augmentation helps the network learn such invariance. In our paper, three types of data augmentation were used: affine transformation, gamma correction, and noise injection, which make the model more robust to illumination changes, camera jitter, and small motion in the scene. We also perform mean subtraction on the input of all models.
All input images are resized to 608 × 608. We use the Darknet-53 backbone pre-trained on ImageNet and fine-tune it on our dataset. To determine the priors, YOLOv3 applies k-means clustering to generate the anchors.
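The three augmentation types and the mean subtraction step can be illustrated with a small pure-Python sketch on a grayscale image stored as a list of rows. The parameter ranges (shift of up to 2 pixels, gamma in [0.7, 1.5], noise standard deviation of 5) are hypothetical, and the affine example is restricted to a translation; the text does not report the settings actually used.

```python
import random


def affine_transform(image, inverse_matrix, fill=0.0):
    """Warp an image with a 2x3 matrix mapping output (x, y) to input coordinates."""
    h, w = len(image), len(image[0])
    (a, b, tx), (c, d, ty) = inverse_matrix
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src_x = round(a * x + b * y + tx)
            src_y = round(c * x + d * y + ty)
            if 0 <= src_x < w and 0 <= src_y < h:
                out[y][x] = image[src_y][src_x]  # nearest-neighbour sampling
    return out


def gamma_correct(image, gamma):
    """Apply gamma correction to pixel values in the range [0, 255]."""
    return [[(p / 255.0) ** gamma * 255.0 for p in row] for row in image]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip the result to [0, 255]."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in image]


def subtract_mean(image, mean):
    """Subtract a fixed mean value from every pixel."""
    return [[p - mean for p in row] for row in image]


def augment(image, rng):
    """Chain the three augmentations with hypothetical parameter ranges."""
    shift = ((1, 0, rng.randint(-2, 2)), (0, 1, rng.randint(-2, 2)))
    image = affine_transform(image, shift)
    image = gamma_correct(image, rng.uniform(0.7, 1.5))
    return add_gaussian_noise(image, sigma=5.0, rng=rng)
```

A call such as `augment(image, random.Random(0))` returns a new image of the same size, so the bounding-box labels would need the same affine transform applied in a real detection pipeline.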
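Anchor generation by k-means can be sketched as below. This follows the commonly used formulation in which ground-truth boxes are clustered by width and height, with IoU as the similarity measure; the initialization, iteration limit, and function names are assumptions, not details taken from the text.

```python
import random


def wh_iou(box, centroid):
    """IoU of two (width, height) boxes aligned at a common corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the centroid with the highest IoU.
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
                continue
            new_centroids.append((
                sum(b[0] for b in members) / len(members),
                sum(b[1] for b in members) / len(members),
            ))
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return sorted(centroids, key=lambda c: c[0] * c[1])
```

For YOLOv3, `k` would typically be 9 (three anchors per detection scale), and the box sizes would be expressed in pixels of the 608 × 608 input.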

C. HUMAN TRACKING
In this paper, the experiment is performed on the challenging VOT2015 tracking dataset [31], which consists of 60 sequences. Performance is evaluated in terms of accuracy (average overlap while tracking successfully) and robustness (number of failures). The overall performance is evaluated using Expected Average Overlap (EAO), which takes both accuracy and robustness into account. In addition, speed is evaluated with a normalized speed measure (EFO). We do not report EFO in our experiment because this metric depends heavily on the hardware configuration.
We use a modified AlexNet pre-trained on ImageNet, keep the parameters of the first three convolution layers fixed, and fine-tune only the last two convolution layers in Siamese-RPN. The weight decay is 0.0001 and the momentum is 0.9. Training runs for 80 epochs in total, and the learning rate is decreased in log space. The final performance is slightly better than that reported in the original paper (see TABLE 4), which could be due to data augmentation and longer training.
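The accuracy and robustness measures can be illustrated with a simplified sketch. It assumes boxes in `(x, y, width, height)` form and counts a failure whenever the overlap drops to zero; it omits the tracker re-initialization and burn-in rules of the official VOT protocol, so it should be read as an approximation rather than the benchmark's evaluation code.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x, y, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predictions, ground_truth):
    """Return (mean overlap on tracked frames, number of failure events)."""
    overlaps = [box_iou(p, g) for p, g in zip(predictions, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    failures = 0
    previously_lost = False
    for o in overlaps:
        lost = o == 0.0
        if lost and not previously_lost:
            failures += 1  # count each loss of the target once
        previously_lost = lost
    return accuracy, failures
```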

D. SIGNATURE ACTION POSITIONING
In this experiment, Kinetics-400 [32] was used for model training. Kinetics-400 contains 400 categories, and each video segment is about 10 seconds long. Each category contains about 400 to 1,150 videos; the writing category includes 735 segments. We spatially resize each sample to 112 × 112, so the size of each sample is 3 × 16 × 112 × 112 (channels, frames, width, height), and each sample is horizontally flipped with 50% probability. All generated samples retain the same class labels as their original videos. Since we did not train the networks from scratch, the training parameters include a weight decay of 1e-5 and a momentum of 0.9. The initial learning rate is 0.001 and is divided by 10 after the validation loss saturates.
For evaluation, we used a mobile phone to capture 50 video sequences containing 50 writing clips in total; each sequence is about 10 seconds long at 25 FPS, and the signature process in each video lasts about 3 seconds. At the inference stage, we sample every 8 frames and take 16 consecutive frames each time as the network input.
As illustrated in Figure 8, we use the action recognition model to locate the 'begin' and 'end' of the writing action. If the model predicts 'writing' 3 times in a row, we take this as the 'begin' timestamp; likewise, if it predicts 'non-writing' 3 times in a row, we take this as the 'end' timestamp. Finally, the IOU between the ground-truth and predicted time windows of the writing action is calculated; a prediction with an IOU larger than 0.5 is considered a good prediction. The average precision is 74% at IOU=0.5.
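The inference-time sampling can be sketched as a sliding window over frame indices. The sketch assumes that "every 8 frames" means a stride of 8 between the starts of consecutive 16-frame clips, which is one plausible reading of the description.

```python
def clip_windows(num_frames, clip_length=16, stride=8):
    """Return (start, end) frame index pairs, end exclusive, for each clip."""
    return [(start, start + clip_length)
            for start in range(0, num_frames - clip_length + 1, stride)]


# A 10-second video at 25 FPS has 250 frames.
windows = clip_windows(num_frames=10 * 25)
print(len(windows), windows[0], windows[-1])  # 30 (0, 16) (232, 248)
```

Under this reading, consecutive clips overlap by 8 frames, and each prediction covers 0.64 seconds of video at 25 FPS.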
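The begin/end rule and the temporal IOU check can be sketched as follows. The sketch assumes that the timestamp of the first clip in each run of three is used as the boundary, and that the window extends to the last clip if no 'non-writing' run is found; the text does not specify either detail.

```python
def locate_writing(predictions, times, run_length=3):
    """Locate the writing window from per-clip boolean predictions.

    predictions[i] is True when clip i is classified as 'writing';
    times[i] is the timestamp of clip i. Returns (begin, end) or None.
    """
    begin = end = None
    run = 0
    for i, is_writing in enumerate(predictions):
        if begin is None:
            run = run + 1 if is_writing else 0
            if run == run_length:
                begin = times[i - run_length + 1]  # first clip of the run
                run = 0
        else:
            run = run + 1 if not is_writing else 0
            if run == run_length:
                end = times[i - run_length + 1]
                break
    if begin is None:
        return None
    if end is None:
        end = times[-1]  # writing continues to the end of the video
    return begin, end


def temporal_iou(a, b):
    """IoU of two time windows given as (begin, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

A predicted window would then count as a good prediction when `temporal_iou(predicted, ground_truth)` exceeds 0.5.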

V. CONCLUSION AND FUTURE WORK
The progress of deep learning over the last decade has been impressive, owing to the strides it has enabled in the field of computer vision. In this work, we have introduced a deep learning-based video content analysis framework for compliance audit in stock brokerage that reduces the workload of auditors and speeds up the audit process. The proposed framework integrates several state-of-the-art deep learning models to make the system stable and efficient. Although the system has not yet been fully deployed in large-scale business scenarios, the prototype has so far demonstrated its potential for real-world commercial applications.
For future work, to address the fundamental challenge of a relatively small training dataset and the varying drift of the objects under analysis in real-world visual audit settings, we plan to resort to image matching and registration [33], [34], as well as structure matching-based approaches [34]-[36].
MINGYU WU received the bachelor's degree in electronics engineering from Sun Yat-sen University, Guangzhou, China, in 2018. He is currently pursuing the master's degree in electrical engineering with Shanghai Jiao Tong University, Shanghai, China. His current research interests include computer vision and machine learning, especially deep learning for image classification and object detection.
MINZE TAO received the master's degree in software engineering from Nanjing University. He is currently the Director of artificial intelligence with E-Capital Transfer Company Ltd. His research spans many aspects of programming language implementation, software development, high-performance computing, natural language processing, computer vision, and machine learning. In particular, his current research focuses on natural language processing, computer vision, and deep learning. He has recently been working on the development of an intelligent robot application and has played a central role in the development of a conversational AI framework for building contextual assistants.
QIN WANG received the bachelor's degree from Hunan University, in 2004. He is currently the Assistant General Manager and the Technology Director of the Business Unit of Intelligent Technology, E-Capital Transfer Company Ltd. His research interests include software engineering and machine learning, especially with applications in computer vision and natural language processing. He previously worked as a key developer in both the finance and Internet industries.
LUYE HE received the bachelor's degree in mathematics from Zhejiang University, in 2002. He is currently the General Manager of the Business Unit of Intelligent Technology, E-Capital Transfer Company Ltd. His research interests include software engineering and artificial intelligence, especially with applications in speech recognition and natural language processing. He once worked as the Vice General Manager of the Department of Information Technology, Shanghai Rural Commercial Bank (SRCB).
GUOLIANG SHEN is the Director of the Department of Technology and a Senior Engineer of automatic control with Zhejiang Zheneng Natural Gas Operation Company Ltd. He is mainly engaged in the research and application of automation-related advanced algorithms, intelligent control, and model identification.
KAI CHEN (Member, IEEE) received the bachelor's degree from the Huazhong University of Science and Technology, the master's degree from Xi'an Jiao Tong University, and the Ph.D. degree from Shanghai Jiao Tong University. He is currently an Assistant Professor with the Department of Electrical Engineering, Shanghai Jiao Tong University. His research interests include data mining and computer vision, especially optical character recognition and its applications in the industry.