Foldover Features for Dynamic Object Behaviour Description in Microscopic Videos

A behavior description helps analyze tiny objects, similar objects, objects with weak visual information, and objects with similar visual information. It plays a fundamental role in the identification and classification of dynamic objects in microscopic videos. To this end, we propose foldover features to describe the behavior of dynamic objects. Foldover is defined as follows: each frame of an object's motion is superimposed on the same spatial plane in the space-time order of the motion, and the result of the superposition is the foldover of the object's motion. The foldover of an object contains temporal information, spatial information, behavior features and static features, and the features extracted from it are the foldover features. In this work, we first generate a foldover for each object in microscopic videos in the X, Y and Z directions. Then, we extract foldover features from the X, Y and Z directions with statistical methods. The core content of this paper is the construction of the foldovers and the extraction of the foldover features. Through these two steps, the temporal information, spatial information, behavior features and static features of the object are enhanced and included in the foldover features, which strengthens their description of the behavior of dynamic objects. Finally, we use four different classifiers to test the effectiveness of the proposed foldover features. In the experiment, we evaluate the proposed foldover features on a microscopic sperm video dataset containing 1374 sperms of three types, and obtain a highest classification accuracy of 96.5%.


I. INTRODUCTION
In computer vision, a video is made up of many frames and video analysis is basically image analysis [1]. In addition, we tend to focus on a specific target object or class of video rather than the whole video. Therefore, image feature extraction is very important for video analysis [2]. Currently, static features [2] and dynamic features [3] are mainly used to identify or classify different objects in images as shown in TABLE 1.
From TABLE 1 we can see that existing static and dynamic features describe objects easily under the following three conditions (similarity corresponds to distinction: if the similarity of two objects is high, the distinction between them is low; conversely, if the similarity is low, the distinction is high): (1) the distinction in both static and dynamic features is high; (2) there are obvious differences in static features and little difference in dynamic features; (3) there is little difference in static features but a large difference in dynamic features. However, objects in microscopic videos are hard to identify or classify in the following two cases: (1) two objects with very similar static and dynamic features; (2) two objects with very weak static features and very similar dynamic features. To this end, we propose new foldover features to describe the behavior of objects in microscopic videos.
The associate editor coordinating the review of this manuscript and approving it for publication was Jeon Gwanggil.
In microscopic videos, the following difficulties usually arise in identifying or classifying different individuals of the same class of tiny objects. Firstly, because most tiny objects are colorless or transparent, they have little color or texture information. Secondly, when tiny objects have similar morphological characteristics, it is difficult to distinguish them by shape features. Thirdly, if the size of an object is only several pixels, it is hard to obtain usable information. Fourthly, if two objects have both similar static and similar dynamic features, it is hard to identify or classify them. Hence, we select microscopic sperm videos as the experimental material, where sperms have little color information, weak shape information, tiny sizes, and similar static and dynamic features.
Foldover is defined as follows: each frame of an object's motion is superimposed on the same spatial plane in the space-time order of the motion, and the result of the superposition is the foldover of the object's motion. The foldover of an object contains temporal information, spatial information, behavior features and static features. Foldover features are a kind of behavior feature based on dynamic targets. The workflow diagram of the proposed algorithm is presented in FIGURE 1.
There are five steps in FIGURE 1: (1) Video preprocessing, the purpose of which is to obtain the motion foldover of each tiny object in the video. (2) Foldover detection, which detects the foldover of each object. (3) Foldover feature extraction, which extracts the foldover features from the X, Y and Z directions. (4) Foldover feature optimization, in which a Convolutional Neural Network (CNN) [4] removes the redundant information in the foldover features and further enhances them. (5) Classifier design, in which four different classifiers are used to verify the superiority of the foldover features.
The core content of this paper is the construction of the foldovers and the extraction of the foldover features. The contributions of the foldover features are as follows: (1) The foldover features provide a feature extraction method for the behavior classification of tiny objects. (2) The foldover features provide a method to extract feature information for objects with little feature information. (3) In the behavior classification of similar objects, applying the foldover features yields good classification results.

II. RELATED WORK
This section summarizes the existing work related to our study. II-A summarizes the static feature extraction methods, including various classical feature extraction methods and deep learning feature extraction methods. II-B summarizes the dynamic feature extraction methods, including several common dynamic feature extraction methods and deep learning feature extraction methods. II-C summarizes the technologies of target detection and feature engineering, including several common feature engineering and target detection methods. II-D summarizes the classifier design, including some well-known algorithms.

A. STATIC FEATURES
Static features usually include color, texture and shape features. Color features describe the surface properties of the scene corresponding to an image region based on pixel information [5]. However, when images have little color information (as in microscopic sperm videos), the color features of different objects are almost identical. For example, in [6], a brightness histogram is used to retrieve images with good results. However, when multiple objects have similar brightness, this method is bound to lose effectiveness.
Texture features reflect the organization and arrangement of surface structures that change slowly or periodically [7]. One example is the Histogram of Oriented Gradient (HOG) feature [8], whose advantage is that geometric and optical deformations of images have little influence on it. However, HOG handles occlusion poorly. For example, in microscopic sperm videos, when two sperms collide and overlap, the extracted HOG feature contains errors. Another example is the Gray-Level Co-occurrence Matrix (GLCM) [9], which is used to calculate uniformity and strength values to identify candidate areas of Ground Glass Opacity (GGO) nodules. However, GLCM describes the gray-level relationship between a pixel and another pixel at a certain distance, and this description cannot distinguish two very similar objects.
There are many effective shape features, such as geometric features, Hu moments [10], shape signatures [11] and Scale-Invariant Feature Transform (SIFT) features [12]. The geometric features mainly include perimeter, area, long axis, short axis, length-width ratio and complexity, which can be used for motion analysis. However, geometric features are not effective in the analysis and recognition of similar targets (such as those in microscopic sperm videos). Hu moments are higher-order geometric features used to reflect the distribution of random variables in statistics. Translation, scaling and rotation do not affect these invariant moments, so they have good invariance. However, Hu moments depend heavily on image segmentation, and their application fields are limited. The SIFT feature [12] is a local image feature that is invariant to rotation, scaling and brightness change, and is highly stable under viewpoint change, affine transformation and noise. However, key point detection is an essential step in SIFT feature extraction, and the features that can be extracted from tiny targets are limited. A shape signature is a boundary-based shape descriptor formed by a set of one-dimensional signals [11], which is robust to environmental conditions (partial occlusion) and image transformation (scaling, rotation, translation). However, the purpose of a shape signature is to identify objects by their shape, so it is not effective at recognizing objects with similar shapes (such as sperms).
With the development of deep learning technology, we can adopt different neural network frameworks to extract the in-depth features of target objects. The Convolutional Neural Network (CNN) [4] is an efficient identification method because it avoids complicated pre-processing steps and can take the original images directly as input. The VGG16 network [13] is a classical CNN that explores the relationship between the depth of a convolutional neural network and its performance, and it significantly reduces the error rate. Therefore, we can use the VGG16 network to directly extract the in-depth features of static targets. Deep learning features can be used for further data statistics at the pixel level. However, when objects are tiny (such as sperms in microscopic videos), the feature extraction ability of a CNN is very limited.

B. DYNAMIC FEATURES
With the development of pattern recognition and intelligent video processing technology, there is much research on dynamic target analysis. Dynamic texture is an extension of static texture in the time domain and includes both static and dynamic information [14]. For example, in the Motion Energy Model [15], a video sequence is regarded as the direction in the three-dimensional space-time, and a directionally selective filter is used to extract the motion information at each position. In [16], based on an extension of separable guided filtering theory, a 3D filter is decomposed into three independent one-dimensional filters that filter along the horizontal, vertical and time directions, which significantly improves the filtering efficiency. The Gaussian Mixture Model (GMM) [17] is widely used to model the background of complex dynamic scenes, especially scenes with periodic movement, such as shaking branches, turbulent water, snowstorms and fountains. GMM can steadily and quickly detect suspected moving foreground. Mixtures of Dynamic Textures (MDT) [18] are used for video frame sequence modeling. MDT can use dynamic textures to model a collection of video sequences as samples, and it has excellent performance in motion clustering and segmentation. The above four examples are target motion analyses in video based on dynamic texture. However, dynamic texture is built on static texture, being its extension in the time domain. In microscopic video analysis, we encounter the following difficulties: (1) Multi-object analysis: there are many objects to analyze in each frame. (2) All the objects are very tiny, and there is no significant difference in the appearance of different objects (such as sperms). (3) Tiny objects have little texture information. (4) Interference of impurities: some impurities are similar in appearance to the objects we analyze. The above difficulties cannot be solved by dynamic texture features.
The acquisition of motion parameters is also useful for object motion analysis. For example, a series of motion parameters of each sperm are continuously collected to analyse sperm motion [19] and achieve good results. However, in the case that there are many sperm targets in the camera lens, different sperm targets have similar motion patterns and little difference in motion parameters. Therefore, it is not enough to rely on motion parameters alone.
In recent years, deep learning methods have been successfully applied in the object tracking field and have gradually surpassed traditional methods in performance. A typical strategy is to train CNNs on a large-scale classification database such as ImageNet [20], use them to obtain the feature representation of a target, and finally use the trained CNNs to classify and track the objects. This approach not only avoids the problem of insufficient training samples for large-scale CNNs, but also makes full use of the strong representation ability of deep learning features.
FCNT mainly analyses the conv4-3 and conv5-3 output feature maps of VGG-16 [21]. FCNT constructs a feature screening network and two complementary heat-map prediction networks based on the analysis of features of different CNN layers, which makes tracking more robust to target deformation. The work of [22] uses the outputs of conv3-4, conv4-4 and conv5-4 in a pre-trained VGG-19 [13] as the feature extraction layers. The features extracted from these three layers are each learned with correlation filters to obtain different templates, and the three results are then fused to obtain the final target position. However, the above methods are not applicable to the identification and analysis of multi-target motion in microscopic sperm videos. The difficulties in using deep learning in the field of target tracking and recognition are appearance deformation, light change, fast movement, motion blur, interference from similar objects, scale change, occlusion and target movement out of the field of view. These difficulties are also the problems that we encounter in the microscopic sperm videos. In addition, the four difficulties listed in this paper in the discussion of dynamic texture are still not well solved by the above methods. These four difficulties are also the key problems to be solved in this paper.

C. FEATURE ENGINEERING AND TARGET DETECTION
In the recognition and analysis of dynamic objects, image processing and object detection are very important, because the accuracies of image processing and object detection affect the results of recognition and analysis. Specifically, image segmentation and feature extraction are two important steps in image processing.
Image segmentation is critical to the effectiveness of feature extraction. Mask-Refined R-CNN (MR R-CNN) [23] adjusts the stride of ROIAlign (region of interest align), and the feature fusion is realized by replacing the full convolutional layer with a new semantic segmentation layer. Combining with the feature layer of global and detail information, the segmentation accuracy is greatly improved. Article [24] presents an automated data augmentation method for synthesizing labeled medical images, learning a model of transformations from the images, and using the model along with the labeled example to synthesize additional labeled examples. Each transformation is comprised of a spatial deformation field and an intensity change, enabling the synthesis of complex effects.
Noise removal and contrast enhancement are important topics in image processing, and they can improve the accuracy of image segmentation. Article [25] proposes a noise-level estimation method, whereby the noise level is estimated by computing the standard deviation and variance in a local block. The obtained noise level is then used as an input parameter for the block-matching and 3D filtering (BM3D) algorithm, which performs the denoising. The method converts low-contrast data into high-contrast data and reduces high noise levels. Article [26] removes both impulse and Gaussian noise and enhances contrast: to enhance image contrast, low-contrast pixels become even lower, and high-contrast pixels become even higher.
Correspondingly, the quality of feature extraction depends on the result of image segmentation. For example, the two-dimensional discrete cosine transform (2DDCT) is used to extract the features of left and right palmprints to constitute a dual-source space [27]. More discriminant coefficients can be preserved and retrieved from the dual-source space with discrimination power analysis (DPA), which improves the accuracy. As another example, PalmHash Code and PalmPhasor Code are proposed as two cancelable palmprint coding schemes to balance the conflict between security and verification performance [28].
In addition, we can manipulate features to improve their quality, for example by selecting, weighting and combining them. In [29], dynamic weighted discrimination power analysis (DWDPA) enhances the discrimination power (DP) of the selected discrete cosine transform coefficients (DCTCs) without a premasking window; in other words, it does not need to optimize the shape and size of a premasking window. Dynamic weighting gives larger weights to the DCTCs with larger discrimination power values (DPVs), which optimizes and enhances the recognition performance. Conjugate 2DPalmHash Code (CTDPHC) [30], constructed from the 2DPalmHash Codes (2DPHCs) of palmprint and palmvein, is proposed as a cancelable multi-modal biometric. CTDPHC achieves higher verification accuracy and stronger anti-counterfeiting ability without sacrificing computational complexity or storage cost.
Object tracking is a component of dynamic object detection, and its accuracy affects the result of dynamic object recognition and analysis. The difficulties of object tracking are background interference and object occlusion. For occlusion and scale variation, article [31] proposes a scale-adaptive target tracking method with good performance. It also proposes an update strategy based on occlusion detection, which provides an effective method for detecting occluded objects. Double-channel object tracking (DCOT) is proposed in [32]. The discriminative correlation filter (DCF), which has strong discriminative power on low-level features, is employed to suppress the position deviation of the samples generated from MDNet. This method effectively guarantees the accuracy of the tracked positions.
In target recognition, saliency detection has important application value and can bring significant improvements to visual information processing. Article [33] proposes a new salient property of part-object relationships provided by the Capsule Network (CapsNet) for salient object detection, and presents a deep Two-Stream Part-Object Assignment Network (TSPOANet). The proposed model requires a smaller computation budget while obtaining better wholeness and uniformity of the segmented salient object. The Deep Conditional Random Field network (DCRF) [34] takes into account both the depth features and the neighbor information. DCRF combines low-level internal context with high-level semantic information, keeping object boundaries clear and suppressing background noise. As another example, article [35] proposes a novel end-to-end network for multi-modal salient object detection, which turns the challenge of RGB-T saliency detection into a CNN feature fusion problem. Under challenging conditions, such as poor illumination, complex background and low contrast, the network performs the saliency detection task well. Article [36] proposes an approach that considers the internal color and saliency properties of the image. It changes the saliency map via an optimization framework that relies on patch-based manipulation, using only patches from within the same image to maintain its appearance characteristics. This method achieves significant results in both the saliency manipulation and the realistic appearance of the resulting images. Article [37] proposes a framework to learn deep salient object detectors without requiring any human annotation, which addresses the problem that providing pixel-level ground-truth masks for each training image is expensive and time-consuming. As another example, article [38] proposes a two-stage mechanism for robust unsupervised object saliency prediction. It refines the pseudo-labels from different unsupervised handcrafted saliency methods in isolation, and improves the supervisory signal for training the saliency detection network. The two-stage mechanism is crucial for improving the quality of pseudo-labels and hence achieving competitive performance on object saliency detection tasks.

D. CLASSIFIER DESIGN
There are several applications for Machine Learning (ML), the most significant of which is data mining. People are often prone to making mistakes during analyses or, possibly, when trying to establish relationships between multiple features. This makes it difficult for them to find solutions to certain problems. Machine learning can often be successfully applied to these problems, improving the efficiency of systems and the designs of machines [39].
One family of well-known algorithms is based on the notion of the perceptron, such as multilayered perceptrons (Artificial Neural Networks) [39]. The advantages of Artificial Neural Networks (ANNs) are strong parallel distributed processing ability, strong distributed storage and learning ability, and strong robustness and fault tolerance to noise [40]. Another family of well-known algorithms is based on ensemble learning, such as Random Forests (RFs) [41]. The advantages of random forests are that they process high-dimensional data well, generalize well, train quickly, and can handle unbalanced data [42]. A third family of well-known algorithms is based on Support Vector Machines (SVMs) [39]. The advantages of SVMs are that they can solve machine learning problems with small samples, improve generalization performance, solve nonlinear problems, and avoid the problems of neural network structure selection and local minima [43].

III. FOLDOVER FEATURES
In this section, we introduce the proposed foldover feature extraction method, covering foldover construction (III-A) and foldover feature extraction (III-B).
For convenience, the variables used in this paper are defined as follows: (1) We define a data set of videos as $\chi = \{X_1, X_2, \dots, X_i, \dots, X_n\}$, $i = 1, 2, 3, \dots, n$, where $X_i$ is the video variable, $i$ is the video number, and $n$ is the total number of videos in $\chi$. Furthermore, $X_i = \{x_{(i,1)}, x_{(i,2)}, \dots, x_{(i,j)}, \dots, x_{(i,m)}\}$ is a set of frames (static images), where $x_{(i,j)}$ is the frame variable, $j$ is the frame number, and $m$ is the total number of frames in $X_i$. In addition, $x_{(i,j)} = \{x_{(i,j,1)}, x_{(i,j,2)}, \dots, x_{(i,j,k)}, \dots, x_{(i,j,h)}\}$, where $x_{(i,j,k)}$ is the image pixel, $k$ is the pixel number, $h$ is the total number of pixels in a frame, $h = h_1 \times h_2$, $h_1 = 1, 2, 3, \dots$ is the number of pixels in a row, and $h_2 = 1, 2, 3, \dots$ is the number of pixels in a column. (2) We define the intensity (pixel value) at pixel $x_{(i,j,k)}$ as $p_{x_{(i,j,k)}} \in [0, 255]$.
(3) We define the set of sperms in each frame as $\varsigma_{(i,j)} = \{s_{(i,j,1)}, s_{(i,j,2)}, \dots, s_{(i,j,l)}, \dots, s_{(i,j,q)}\}$, where $s_{(i,j,l)}$ is one sperm, $l$ is the sperm number, and $q$ is the total number of sperms in this frame.

A. CONSTRUCTION OF FOLDOVERS
A semen microscopic video contains many sperms, and we construct a foldover for each sperm in the following six steps. The workflow of foldover construction is shown in FIGURE 2: each frame of an object's motion is superimposed on the same spatial plane in the space-time order of the motion, and the result of the superposition is the foldover of the object's motion. From the foldover, we can extract the temporal information, spatial information, behavioral features and static features of the object.

1) VIDEO DECOMPOSITION
We decompose a semen microscopic video $X_i$ into frames $X_i = \{x_{(i,1)}, x_{(i,2)}, \dots, x_{(i,j)}, \dots, x_{(i,m)}\}$, where each frame $x_{(i,j)}$ is a static gray-scale image. An example is shown in FIGURE 3.

2) IMAGE SEGMENTATION
We define the threshold value of the image $x_{(i,j)}$ as $T_{x_{(i,j)}}$, the segmentation result of $x_{(i,j)}$ as $x^{\mathrm{seg}}_{(i,j)}$, and the value of the $k$-th pixel of $x^{\mathrm{seg}}_{(i,j)}$ as $p_{x^{\mathrm{seg}}_{(i,j,k)}}$, which is given by Eq. (1).
$$p_{x^{\mathrm{seg}}_{(i,j,k)}} = \begin{cases} 0, & p_{x_{(i,j,k)}} < T_{x_{(i,j)}} \\ 1, & \text{otherwise} \end{cases} \tag{1}$$
In Eq. (1), when the pixel value $p_{x_{(i,j,k)}}$ is lower than the threshold $T_{x_{(i,j)}}$, the result of threshold segmentation $p_{x^{\mathrm{seg}}_{(i,j,k)}}$ is 0 (black); otherwise, it is 1 (white). Finally, we get the image segmentation result $x^{\mathrm{seg}}_{(i,j)}$, and all the sperms $\varsigma_{(i,j)}$ in each frame $x_{(i,j)}$ are obtained. An example of the threshold segmentation result is shown in FIGURE 4.
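The thresholding rule of Eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame is held as a plain list of rows to avoid external dependencies, the function name `threshold_segment` is hypothetical, and the threshold is passed in by the caller because the text above does not say how it is chosen.

```python
def threshold_segment(frame, threshold):
    """Binarize a gray-scale frame following Eq. (1).

    frame: list of rows, each a list of intensities in [0, 255].
    threshold: the threshold for this frame (its selection is not shown here).
    Returns an image of the same shape holding 0 (black) or 1 (white).
    """
    # A pixel below the threshold becomes 0; every other pixel becomes 1.
    return [[0 if pixel < threshold else 1 for pixel in row] for row in frame]


# Toy 3x3 frame with a bright blob in the upper-right corner.
frame = [
    [10, 12, 200],
    [11, 180, 190],
    [9, 10, 12],
]
segmented = threshold_segment(frame, threshold=128)
print(segmented)
```

A pixel exactly equal to the threshold falls into the "otherwise" branch and is therefore set to 1.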

3) BARYCENTER COORDINATES EXTRACTION
Based on the image segmentation results $x^{\mathrm{seg}}_{(i,j)}$, we define the set of barycenter coordinates of all sperms over all frames in the video $X_i$ as $\psi_{(i)} = \{C_{(i,1)}, C_{(i,2)}, \dots, C_{(i,j)}, \dots, C_{(i,m)}\}$, where $C_{(i,j)}$ is the barycenter coordinate variable, $i$ is the video number, $j$ is the frame number, and $m$ is the total number of frames. Furthermore, $C_{(i,j)} = \{c_{s_{(i,j,1)}}, c_{s_{(i,j,2)}}, \dots, c_{s_{(i,j,l)}}, \dots, c_{s_{(i,j,q)}}\}$ is the set of barycenter coordinates of all sperms $\varsigma_{(i,j)}$ in the frame $x_{(i,j)}$, where $c_{s_{(i,j,l)}}$ is the barycenter coordinate of the $l$-th sperm in the $j$-th frame of the $i$-th video. In conclusion, we extract all barycenter coordinates $\psi_{(i)}$ of all sperms in the video $X_i$.
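As a rough illustration of this step, the sketch below finds each white region of a binary frame and returns the mean position of its pixels as the barycenter. Several points are assumptions rather than details from the text: regions are grouped with 4-connectivity, the barycenter is taken as the unweighted mean of pixel coordinates, and the function name `barycenters` is hypothetical.

```python
from collections import deque


def barycenters(binary):
    """Return the barycenter (row, col) of every 4-connected white region."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    centers = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill one region and collect its pixel coordinates.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Barycenter = mean row and mean column of the region's pixels.
            mean_y = sum(p[0] for p in pixels) / len(pixels)
            mean_x = sum(p[1] for p in pixels) / len(pixels)
            centers.append((mean_y, mean_x))
    return centers


binary = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
centers = barycenters(binary)
print(centers)
```

Calling the function once per segmented frame gives one list of barycenters per frame, which corresponds to the per-frame sets collected for a whole video.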

4) TARGET MATCHING
Currently, the commonly used sperm quality test method is computer-assisted sperm analysis (CASA) [44]. CASA applies computer technology and advanced image processing technology to the analysis of sperm dynamics.
The quantitative data of sperm dynamics are provided by analyzing the sperm motility images. Nearly all commercial CASA instruments use the nearest neighbor (NN) tracking scheme [19], in which the initial image processing provides a centroid for each spermatozoon in the first frame of a scene, the location of the most probable centroid of each cell in successive frames is deduced, and connecting the centroids for a spermatozoon provides its actual trajectory [44].
Our challenge is to match the same target from the current frame to the next frame. We choose the classical k-nearest neighbor (k-NN) algorithm [45] to solve this problem, and an example of k-NN is shown in FIGURE 5. Based on the results of $\psi_{(i)}$, we obtain the barycenter coordinates of all sperms $\varsigma_{(i,j)}$ in the video $X_i$. Then, we use the k-NN algorithm with the Euclidean distance as follows. Consider a barycenter coordinate $c_{s_{(i,j,l)}}$ in frame $x_{(i,j)}$ and the next frame $x_{(i,j+1)}$, in which all the barycenter coordinates are $C_{(i,j+1)} = \{c_{s_{(i,j+1,1)}}, c_{s_{(i,j+1,2)}}, \dots, c_{s_{(i,j+1,l)}}, \dots, c_{s_{(i,j+1,q)}}\}$. We calculate the Euclidean distance between $c_{s_{(i,j,l)}}$ and every barycenter coordinate in $C_{(i,j+1)}$, and obtain the set of Euclidean distances in Eq. (2), where $d(\cdot,\cdot)$ denotes the Euclidean distance between two coordinates.
$$D_{(i,j)} = \left\{ d\big(c_{s_{(i,j,l)}}, c_{s_{(i,j+1,1)}}\big),\; d\big(c_{s_{(i,j,l)}}, c_{s_{(i,j+1,2)}}\big),\; \dots,\; d\big(c_{s_{(i,j,l)}}, c_{s_{(i,j+1,q)}}\big) \right\} \tag{2}$$
We find the minimum in $D_{(i,j)}$ and define this minimum as $d_{\min D_{(i,j)}}$. We use the k-NN algorithm to assign every barycentric coordinate to its corresponding coordinate in the previous frame of the video. The result of the classification is that all barycentric coordinates in $\psi_{(i)} = \{C_{(i,1)}, C_{(i,2)}, \dots, C_{(i,j)}, \dots, C_{(i,m)}\}$ that belong to the same sperm target in the video $X_i$ are classified into one category. An example of a classification is shown in FIGURE 6.
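The core of this matching step, finding the barycenter in the next frame with the smallest Euclidean distance to a given barycenter, can be sketched as follows. The function name `nearest_barycenter` and the sample coordinates are hypothetical and serve only as an illustration.

```python
import math


def nearest_barycenter(current, next_centers):
    """Return (index, distance) of the entry of next_centers closest to current."""
    # Euclidean distance from the current barycenter to every candidate.
    distances = [math.dist(current, candidate) for candidate in next_centers]
    # Index of the minimum distance.
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


current = (10.0, 10.0)  # a barycenter in frame j
next_centers = [(30.0, 5.0), (11.0, 12.0), (10.0, 40.0)]  # barycenters in frame j+1
index, distance = nearest_barycenter(current, next_centers)
print(index, distance)
```

With a single nearest neighbor per query, this is the k = 1 case of k-NN, which matches the nearest neighbor scheme described for CASA instruments.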
As shown in the example in FIGURE 6, we define the set of classification results as $\varphi_{(i)} = \{S_{(i,1)}, S_{(i,2)}, \dots, S_{(i,g)}, \dots, S_{(i,\tau)}\}$, where $S_{(i,g)} = \{I_{(i,j,g)}, I_{(i,j+1,g)}, \dots\}$ is the set of all barycentric coordinates of one sperm in the video $X_i$, $I$ is the barycentric coordinate variable, $j$ is the frame number, $g$ is the index number of the classification result, $\tau$ is the total number of classification results, and $i$ is the video number.
In the video $X_i$, sperms constantly swim into or out of the visual field, so the sperm count differs between frames. For this practical situation, we adopt the following strategy:
• Case-I: If a sperm swims into the visual field, we define this sperm as a new target, and it receives a new classification result of its own from the k-NN classifier.
• Case-II: If a sperm swims out of the visual field, we stipulate that the motion of this sperm is over.
Based on Case-I and Case-II, we can conclude that the number of classification results in $\varphi_{(i)} = \{S_{(i,1)}, S_{(i,2)}, \dots, S_{(i,g)}, \dots, S_{(i,\tau)}\}$ is the total number of sperms in the video $X_i$, where $\tau$ is the total number of sperms. FIGURE 7 shows an example of the sperm count statistics from frame 36 to frame 80 in video $X_i$.
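One possible way to turn frame-to-frame matching plus Case-I and Case-II into complete per-sperm coordinate sets is sketched below. It is not the authors' implementation. In particular, the gating distance `max_jump`, which decides whether a detection continues an existing target or starts a new one, is an assumption introduced here, as are the greedy matching order and the name `link_tracks`.

```python
import math


def link_tracks(frames, max_jump=15.0):
    """Group per-frame barycenters into one track per target.

    frames: list (one entry per frame) of lists of (y, x) barycenters.
    max_jump: ASSUMED gating distance in pixels, not a value from the paper.
    Returns a list of tracks, each a list of (frame_index, (y, x)).
    """
    finished, active = [], []
    for j, centers in enumerate(frames):
        unmatched = list(centers)
        still_active = []
        for track in active:
            last = track[-1][1]
            # Nearest unmatched barycenter to the track's last position.
            best = min(unmatched, key=lambda c: math.dist(last, c), default=None)
            if best is not None and math.dist(last, best) <= max_jump:
                track.append((j, best))
                unmatched.remove(best)
                still_active.append(track)
            else:
                # Case-II: the target left the visual field, its motion is over.
                finished.append(track)
        # Case-I: every remaining detection starts a new target.
        still_active.extend([(j, c)] for c in unmatched)
        active = still_active
    return finished + active


frames = [
    [(10.0, 10.0), (50.0, 50.0)],
    [(11.0, 12.0), (50.0, 53.0)],
    [(12.0, 14.0), (80.0, 5.0)],  # second target left, a new one entered
]
tracks = link_tracks(frames)
print(len(tracks))
```

In this toy input the number of returned tracks equals the number of distinct targets that appeared, which mirrors the statement that the number of classification results is the total number of sperms in the video.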

5) CONSTRUCTION OF THE FOLDOVER
According to the result of k-NN classification, we get the barycentric coordinates $\varphi_{(i)} = \{S_{(i,1)}, S_{(i,2)}, \dots, S_{(i,g)}, \dots, S_{(i,\tau)}\}$ of all the sperms in the video $X_i$. The following operations are performed for each k-NN classification result $S_{(i,g)}$ in $\varphi_{(i)}$. First, according to the k-NN classification result $S_{(i,g)}$, we determine the range of frames in which the sperm moves. Then, we extract these frames from the segmentation results according to this range. In each extracted frame, we take the barycentric coordinate $I_{(i,j,g)}$ from $S_{(i,g)} = \{I_{(i,j,g)}, I_{(i,j+1,g)}, \dots\}$ as the center and $r$ pixels as a standard radius, and we calculate the distance between $I_{(i,j,g)}$ and every other pixel in the frame. The resulting pixel value is defined as Eq. (3), where $d(\cdot,\cdot)$ denotes the distance between two pixel positions.
$$p_{L^{(i,j,g)}_{x^{\mathrm{seg}}_{(i,j,k)}}} = \begin{cases} 0, & d\big(x_{(i,j,k)}, I_{(i,j,g)}\big) > r \\ p_{x^{\mathrm{seg}}_{(i,j,k)}}, & \text{otherwise} \end{cases} \tag{3}$$
The pixel value $p_{L^{(i,j,g)}_{x^{\mathrm{seg}}_{(i,j,k)}}}$ is 0 for every pixel that is more than $r$ pixels away from the barycentric coordinate $I_{(i,j,g)}$; otherwise, it is $p_{x^{\mathrm{seg}}_{(i,j,k)}}$. In this way, we target each sperm in the segmentation results, and we define the result as $L^{(i,j,g)}_{x^{\mathrm{seg}}_{(i,j)}}$, which is the image segmentation result of the $g$-th sperm in the $j$-th frame of the $i$-th video. An example of $L^{(i,j,g)}_{x^{\mathrm{seg}}_{(i,j)}}$ is shown in FIGURE 8.
Second, according to the $L^{(i,j,g)}_{x^{\mathrm{seg}}_{(i,j)}}$ results, we can get a set $\theta_{(i,g)} = \{L^{(i,j,g)}_{x^{\mathrm{seg}}_{(i,j)}}, L^{(i,j+1,g)}_{x^{\mathrm{seg}}_{(i,j+1)}}, \dots\}$ for the same sperm. We use $L^{(i,j,g)}_{x^{\mathrm{seg}}_{(i,j)}}$ to localize the sperm region in the original frame (image) according to Eq. (4).
$$p_{o^{(i,j,g)}_{x_{(i,j,k)}}} = \begin{cases} 0, & p_{L^{(i,j,g)}_{x^{\mathrm{seg}}_{(i,j,k)}}} = 0 \\ p_{x_{(i,j,k)}}, & \text{otherwise} \end{cases} \tag{4}$$
In Eq. (4), we define the extracted result as $o^{(i,j,g)}_{x_{(i,j)}}$. If the pixel value $p_{L^{(i,j,g)}_{x^{\mathrm{seg}}_{(i,j,k)}}}$ is equal to 0, the pixel value $p_{o^{(i,j,g)}_{x_{(i,j,k)}}}$ is 0; otherwise, it is the pixel value $p_{x_{(i,j,k)}}$ of the original image. According to Eq. (4), we obtain the image of each sperm, $o^{(i,j,g)}_{x_{(i,j)}}$, which is the image of the $g$-th sperm in the $j$-th frame of the $i$-th video with a black background. FIGURE 9 is an example of $o^{(i,j,g)}_{x_{(i,j)}}$.
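The two masking rules, Eq. (3) and Eq. (4), can be applied in a single pass over a frame, as in the sketch below. It assumes that the distance is Euclidean and measured in pixel units, that frames are plain lists of rows, and that coordinates are given as (row, column); the function name `isolate_target` is hypothetical.

```python
import math


def isolate_target(frame, segmented, center, radius):
    """Cut one target out of a gray-scale frame.

    Eq. (3): segmentation values farther than `radius` from `center` become 0.
    Eq. (4): original intensities are kept only where that masked value is non-zero.
    """
    cy, cx = center
    isolated = []
    for y, (gray_row, seg_row) in enumerate(zip(frame, segmented)):
        out_row = []
        for x, (gray, seg) in enumerate(zip(gray_row, seg_row)):
            inside = math.hypot(y - cy, x - cx) <= radius
            mask = seg if inside else 0               # Eq. (3)
            out_row.append(gray if mask != 0 else 0)  # Eq. (4)
        isolated.append(out_row)
    return isolated


frame = [
    [10, 12, 200],
    [11, 180, 190],
    [210, 10, 12],
]
segmented = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
]
# Keep only the blob around (row 1, col 2); the pixel at (2, 0) is too far away.
isolated = isolate_target(frame, segmented, center=(1.0, 2.0), radius=1.5)
print(isolated)
```

The bright pixel in the lower-left corner belongs to another region and is removed because it lies more than `radius` pixels from the chosen barycenter.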
Thirdly, a set of the o (i,j,g) x (i,j) is denoted as O (i,g) , where we define (i,g) O (i,g) as the total number of extracted results in O (i,g) , and O (i,g) is defined as Eq. (5).
According to Eq. (5), we obtain a set O (i,g) that contains all the images of the g-th sperm in the i-th video. These images have a black background, as in the example in FIGURE 9. Based on Eq. (5), we define (i,g) as the foldover of the g-th sperm in the i-th video, and p (i,g) x (i,j,k) is expressed by Eq. (6).
As defined in Eq. (6), we add up the k-th pixel of each frame in O (i,g) , and the sum is p (i,g) x (i,j,k) . Here, k = 1, 2, 3, . . . , h is the pixel number, h = h 1 × h 2 is the total number of pixels in a frame, h 1 is the number of pixels in a row, and h 2 is the number of pixels in a column. In this way, we add the corresponding pixel values in different frames to obtain (i,g) . Here, (i,g) is the foldover of the g-th sperm in the i-th video, p (i,g) x (i,j,k) is the pixel value of the k-th position in (i,g) , and an example of (i,g) is shown in FIGURE 10.
With the accumulation method in FIGURE 10, images of the same sperm in different frames are placed on the same spatial plane. In this spatial plane, the images in O (i,g) construct the foldover of the g-th sperm in the i-th video.
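The masking, extraction and accumulation steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`mask_object`, `extract_object`, `build_foldover`), the list-of-lists image representation and the toy frames are assumptions made for illustration, and the functions only mirror the verbal descriptions of Eq. (3), Eq. (4) and Eq. (6).

```python
import math

def mask_object(seg, centre, r):
    """Keep segmentation pixels within r pixels of the barycentre; zero the rest."""
    cy, cx = centre
    return [[seg[y][x] if math.hypot(y - cy, x - cx) <= r else 0
             for x in range(len(seg[0]))]
            for y in range(len(seg))]

def extract_object(frame, mask):
    """Copy original pixels where the mask is non-zero; black (0) elsewhere."""
    return [[frame[y][x] if mask[y][x] != 0 else 0
             for x in range(len(frame[0]))]
            for y in range(len(frame))]

def build_foldover(object_images):
    """Sum the pixel at each position over all images of one object."""
    rows, cols = len(object_images[0]), len(object_images[0][0])
    foldover = [[0] * cols for _ in range(rows)]
    for img in object_images:
        for y in range(rows):
            for x in range(cols):
                foldover[y][x] += img[y][x]
    return foldover

# Toy data (hypothetical): a one-pixel object moving right over three 3x5 frames.
frames, segs, centres = [], [], []
for j in range(3):
    frame = [[10] * 5 for _ in range(3)]   # uniform grey background
    frame[1][j + 1] = 200                  # the object in the original frame
    seg = [[0] * 5 for _ in range(3)]
    seg[1][j + 1] = 1                      # its segmentation result
    frames.append(frame)
    segs.append(seg)
    centres.append((1, j + 1))             # barycentre as (row, column)

objects = [extract_object(f, mask_object(s, c, r=1))
           for f, s, c in zip(frames, segs, centres)]
foldover = build_foldover(objects)
```

In this toy case the middle row of `foldover` holds the trace of the object across the three frames, while the background stays black because it is removed before the accumulation.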

6) CONSTRUCTION OF 3D IMAGES
In the video X i , the swimming directions of sperms are uncertain. Therefore, we need to unify the swimming directions of sperms to facilitate our experimental analysis. We define the direction from the starting barycentric coordinate of a sperm to its ending barycentric coordinate as the positive direction (forward direction), and the horizontal direction is defined as the X direction. In order to unify the swimming directions, we rotate the foldover (i,g) so that this positive direction coincides with the X direction, and we define the rotated foldover as R (i,g) . An example of R (i,g) is shown in FIGURE 11. FIGURE 11 is only a two-dimensional visualization of R (i,g) and cannot show all of its information. Therefore, we show a 3D view of R (i,g) in FIGURE 12 to reflect all the information of the foldover.
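The rotation step can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names `alignment_angle` and `rotate_points` and the toy track are assumptions, and the sketch rotates only the barycentre coordinates rather than resampling the foldover image.

```python
import math

def alignment_angle(start, end):
    """Angle (radians) that rotates the start-to-end direction onto the +X axis."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    return -math.atan2(dy, dx)

def rotate_points(points, angle, origin):
    """Rotate (x, y) points about `origin` by `angle` radians."""
    ox, oy = origin
    c, s = math.cos(angle), math.sin(angle)
    return [(ox + c * (x - ox) - s * (y - oy),
             oy + s * (x - ox) + c * (y - oy)) for x, y in points]

# Hypothetical barycentre track of one sperm swimming along a diagonal.
track = [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
angle = alignment_angle(track[0], track[-1])
aligned = rotate_points(track, angle, origin=track[0])
```

After the rotation, the first and last points share the same y coordinate, so the net motion lies along the X direction. Applying the same angle to the foldover image would additionally require an image-rotation routine with interpolation, which is left out here.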

B. FOLDOVER FEATURES EXTRACTION
Foldover feature extraction computes statistics of the information in R (i,g) and is the focus of our whole method. It consists of the following four steps.

1) FOLDOVER PROCESSING IN THE X, Y, AND Z DIRECTIONS
Foldover processing in the X, Y and Z directions is shown in FIGURE 13. First, we define the length of R (i,g) in the X, Y and Z directions as R (i,g) ( ), where the direction index is defined in Eq. (7): it is X along the X axis, Y along the Y axis and Z along the Z axis.
VOLUME 8, 2020
FIGURE 13. An example of foldover processing in the X, Y and Z directions. (a) is the foldover processing in the X direction, (b) is the foldover processing in the Z direction and (c) is the foldover processing in the Y direction.
Second, we cut the foldover R (i,g) along each direction with a step length of ν , and we obtain a set of slices defined in Eq. (8), where u is the slice number, running from 1 to the total number of slices.
Third, in the X and Y directions, R (i,g) reflects the time information and movement information of sperms, but it cannot reflect the information of pixel accumulation. For slices (R, ) (i,g) in the X or Y direction, we set the pixel values of the areas where the foldover exists to 1 and those of other areas to 0, and we add these slices together as the result of R (i,g) in the X and Y directions. Unlike the foldover slices in the X and Y directions, the foldover slices in the Z direction truly reflect the effect of pixel accumulation. Therefore, it is not necessary to reset the pixel values, and the pixel values of the areas where the foldover slices in the Z direction exist are added directly. Finally, we get three cumulative results of the foldover slices, U (R,X) (i,g) , U (R,Y) (i,g) and U (R,Z) (i,g) .
According to the human sperm quality assessment proposed by the World Health Organization (WHO) [46], sperm motility is grouped into four categories as shown in TABLE 2.
So, the grade of FIGURE 15 (a) is D (immotility), and the grade of FIGURE 15 (b) is A (rapid progressive motility). In the X direction, the foldover contains the range along the moving direction, which is the length of the foldover along the X direction, R (i,g) (X). The total number of frames (i,g) O (i,g) that make up the foldover (i,g) is the time information, i.e., the movement time of the sperm, and from (i,g) O (i,g) we calculate the frame rate of (i,g) in the X direction. We define this frame rate as v (FPS,X) (i,g) in Eq. (10), and 2D visualizations of two foldovers in the X direction are shown in (a) and (b) of FIGURE 16.
In the Y direction, the foldover contains the range along the direction orthogonal to the moving direction, which is the length of the foldover along the Y direction, R (i,g) (Y). Similar to the X direction, we calculate the frame rate of (i,g) in the Y direction.
In the Z direction, the foldover contains trajectory, shape, and brightness information. From the trajectory of the foldover, we calculate the motion distance, the motion displacement and the average path length. Specifically, we calculate the motion distance and the motion displacement from φ (i) = S (i,1) , S (i,2) , . . . , S (i,g) , . . . , S (i,τ ) (S (i,g) = I (i,j,g) , I (i,j+1,g) , . . . ). By fitting a third-degree polynomial to φ (i) , we obtain an equation of the motion path, and the average path length of the sperm is calculated from this equation. We define the motion distance as A (i,g) , the motion displacement as B (i,g) , the fitted equation as I (i,j,g) and the average path length as M (i,g) . The formulas of A (i,g) , B (i,g) and M (i,g) are expressed by Eq. (12), Eq. (13) and Eq. (14), respectively.
In Eq. (12), we add up the distances between consecutive barycentric coordinates in S (i,g) = I (i,j,g) , I (i,j+1,g) , . . . contained in the foldover as the motion distance A (i,g) .
In Eq. (13), we calculate the distance between the first position I (i,j,g) and the last position I (i, (i,g)( O (i,g)) ,g) of the foldover as the motion displacement B (i,g) .
In Eq. (14), using the fitted equation I (i,j,g) , we calculate the new coordinates corresponding to S (i,g) = I (i,j,g) , I (i,j+1,g) , . . . , and by adding the distances between these new coordinates we obtain the average path length M (i,g) .
According to the motion distance A (i,g) , motion displacement B (i,g) and average path length M (i,g) , we can further calculate curvilinear velocity (VCL), straight line velocity (VSL) and average path velocity (VAP) in Eq. (15).
We define the VCL as v (VCL) (i,g) , which is obtained from A (i,g) and (i,g) O (i,g) . The VSL is defined as v (VSL) (i,g) , which is calculated from B (i,g) and (i,g) O (i,g) . The VAP is defined as v (VAP) (i,g) , which is calculated from M (i,g) and (i,g) O (i,g) .
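The quantities A, B and M and the three velocities can be sketched in Python as follows. This is a hedged illustration rather than the paper's exact formulas: it assumes that the cubic is fitted as y = f(x) on a track already aligned with the X direction, that the elapsed time is the number of frames divided by the frame rate, and that LIN, STR and WOB follow the usual CASA ratios (VSL/VCL, VSL/VAP and VAP/VCL). The names `path_length`, `fit_cubic` and `kinematics` are hypothetical.

```python
import math

def path_length(points):
    """Sum of Euclidean distances between consecutive (x, y) points."""
    return sum(math.hypot(q[0] - p[0], q[1] - p[1])
               for p, q in zip(points, points[1:]))

def fit_cubic(xs, ys):
    """Least-squares c0..c3 of y = c0 + c1*x + c2*x^2 + c3*x^3 (needs >= 4 distinct x)."""
    n = 4
    a = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    for col in range(n):                      # Gaussian elimination, partial pivoting
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):            # back substitution
        s = sum(a[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (b[r] - s) / a[r][r]
    return coef

def kinematics(track, fps):
    """Motion distance A, displacement B, average path M and derived velocities."""
    xs, ys = [p[0] for p in track], [p[1] for p in track]
    c = fit_cubic(xs, ys)
    smooth = [(x, c[0] + c[1] * x + c[2] * x ** 2 + c[3] * x ** 3) for x in xs]
    A = path_length(track)
    B = math.hypot(track[-1][0] - track[0][0], track[-1][1] - track[0][1])
    M = path_length(smooth)
    duration = len(track) / fps               # assumption: frames / frame rate
    vcl, vsl, vap = A / duration, B / duration, M / duration
    ratio = lambda num, den: num / den if den else 0.0
    return {"A": A, "B": B, "M": M, "VCL": vcl, "VSL": vsl, "VAP": vap,
            "LIN": ratio(vsl, vcl), "STR": ratio(vsl, vap), "WOB": ratio(vap, vcl)}

# Hypothetical zigzag track of barycentres, sampled at 30 FPS.
track = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (3.0, 1.0), (4.0, 0.0)]
k = kinematics(track, fps=30.0)
```

For this zigzag track the fitted path is smoother than the raw track, so the displacement B, the average path length M and the motion distance A come out in increasing order, which is the expected relation between VSL, VAP and VCL.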
Regarding the shape information, foldovers can capture the deformation of a sperm during its movement. The brightness information mainly reflects the pixel accumulation process: a brighter area indicates that the sperm stays in this area for a longer time. 2D visualizations of two foldovers in the Z direction are shown in (e) and (f) of FIGURE 16.
Although U (R,X) (i,g) , U (R,Y) (i,g) and U (R,Z) (i,g) include the information of foldovers, they are three matrices of an object (such as a sperm) with a lot of redundant information. Therefore, we compute statistics over all the information in U (R,X) (i,g) , U (R,Y) (i,g) and U (R,Z) (i,g) to optimize them. Specifically, we apply convolutional operations to achieve the optimization, where we define the result of the convolution optimization as H for each of the X, Y and Z directions, H (i,g,k) is the k-th pixel of the g-th foldover in the i-th video, and H is defined in Eq. (17).
In Eq. (17), G is the convolution kernel, and e is the dimension of G. Because a single convolution cannot consolidate all the useful information and remove all the redundant information, we apply the convolution multiple times.
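Repeated convolution with an e × e kernel G can be sketched as below. The text does not specify the kernel weights, the stride or the number of passes, so the mean kernel, the stride equal to e and the two passes are assumptions chosen only to make the sketch concrete. The names `convolve2d` and `repeated_convolution` are hypothetical.

```python
def convolve2d(image, kernel, stride=1):
    """Valid-mode 2D convolution (correlation form, no kernel flip) on lists of lists."""
    e = len(kernel)
    rows, cols = len(image), len(image[0])
    out = []
    for y in range(0, rows - e + 1, stride):
        out_row = []
        for x in range(0, cols - e + 1, stride):
            out_row.append(sum(kernel[a][b] * image[y + a][x + b]
                               for a in range(e) for b in range(e)))
        out.append(out_row)
    return out

def repeated_convolution(image, kernel, times, stride=1):
    """Apply the same kernel several times, shrinking the matrix at each pass."""
    for _ in range(times):
        image = convolve2d(image, kernel, stride)
    return image

e = 2
G = [[1.0 / (e * e)] * e for _ in range(e)]          # assumed mean kernel
U = [[float(4 * y + x) for x in range(4)] for y in range(4)]  # toy 4x4 cumulative result
H = repeated_convolution(U, G, times=2, stride=e)
```

With these assumed settings, each pass halves the width and height of the matrix, so the 4 × 4 toy input is reduced to a single value after two passes.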
Furthermore, the velocity features, LIN (i,g) , STR (i,g) , WOB (i,g) and H (i,g,k) are joined together to form three foldover feature vectors: v (FPS,X) (i,g) and H X (i,g,k) are concatenated to form the foldover feature F X (i,g) of the X direction; v (FPS,Y) (i,g) and H Y (i,g,k) are concatenated to form the foldover feature F Y (i,g) of the Y direction; v (VCL) (i,g) , v (VSL) (i,g) , v (VAP) (i,g) , LIN (i,g) , STR (i,g) , WOB (i,g) and H Z (i,g,k) are concatenated to form the foldover feature F Z (i,g) of the Z direction. The algorithm of the foldover features is shown in Algorithm 1.

Algorithm 1 Generation of H (i,g,k)
Input: Videos χ , preprocessed video X i
Output: H (i,g,k) for the X, Y and Z directions
1: video decomposition:
6: rotate the foldover: R (i,g) ⇐ (i,g)
7: foldover processing:

Finally, we obtain the foldover feature vectors F X (i,g) , F Y (i,g) and F Z (i,g) . They are extracted from the single foldover R (i,g) of the same sperm, and they represent the foldover features from the three directions of X, Y and Z, respectively. F X (i,g) , F Y (i,g) and F Z (i,g) represent the visual information of the foldover R (i,g) in each direction. They contain temporal information, spatial information, behavior features and static features, and foldover features are a kind of behavior feature based on foldover for dynamic targets. With the foldover features, we can address the following difficulties encountered in microscopic videos: (1) Multi-object recognition, (2) Similar object recognition, (3) Tiny object recognition, (4) Impurity interference and (5) Little feature information.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, the experimental results and analysis are discussed, including the experimental setting (Section IV-A) and the experimental results (Section IV-B).
A. EXPERIMENTAL SETTING

1) EXPERIMENTAL DATA
In this paper, a practical microscopic video set χ = {X 1 , X 2 , . . . , X i , . . . , X 59 } with 59 semen videos is used to test our method. The videos are grey-scale mp4 files, the size of each frame is 698 × 528 × 3 pixels and the frame rate is 30 frames per second (FPS). There are 1,374 sperms in set χ . For all the sperms, ground truth (GT) images are prepared manually by four biomedical engineers and two medical doctors, where the sperms are labeled as foreground objects with 1 (white) and other regions are labeled as background with 0 (black). We mark the number of each sperm in the video and propose the following strategy for sperm numbering:
• Case-I: All the sperms in the video (moving or stationary) are numbered, with the numbers increasing from 1, and the sperms are numbered horizontally from the top of the visual field.
• Case-II: If there is a sperm swimming out of the visual field, we stipulate that the motion of this sperm is over.
• Case-III: If there is a sperm swimming into the visual field, we assume we have a new sample and give it a new number.
• Case-IV: A malformed sperm is considered a sample, for example a sperm with two heads or two tails.
Furthermore, based on the diagnosis of the medical doctors, all sperms are grouped into three classes: poor motion state, good motion state and excellent motion state. There are 950 samples of poor motion state, 262 samples of good motion state and 162 samples of excellent motion state. The training set and the testing set have the same size: 687 samples are used for training and 687 samples are used for testing. In the training set, there are 462 samples of poor motion state, 138 samples of good motion state and 87 samples of excellent motion state. In the testing set, there are 488 samples of poor motion state, 124 samples of good motion state and 75 samples of excellent motion state. An example of the video frames and their GT images is shown in FIGURE 17.

2) EVALUATION INDEX
We use classifiers to evaluate foldover features with a three-class classification task of sperms, and the classification evaluation indicators are shown in TABLE 2 [46]. Specifically, four classifiers are tested in this paper, including Artificial Neural Networks [40] (ANNs), Random Forests [42] (RFs) and Support Vector Machines [43] (linear-SVM and RBF-SVM). Because there are more motionless, slow-swimming sperms and fewer fast-swimming sperms in the videos, we calculate multiple indexes to evaluate the proposed foldover features. Firstly, we calculate the confusion matrix of all classification results. Then, based on the confusion matrices, we can further calculate the accuracy, precision, recall, specificity and F1-measure as shown in TABLE 3.
The number of actual negative samples is N = TN + FP, the number of actual positive samples is P = FN + TP, and the total sample size is C = N + P, where TP is True Positive, TN is True Negative, FP is False Positive and FN is False Negative. Recall (also known as sensitivity) measures the proportion of actual positive samples that the model identifies correctly; a higher recall means that an algorithm returns most of the relevant results. Precision measures the accuracy of the model in predicting positive samples; a higher precision means that an algorithm returns substantially more relevant results than irrelevant ones. Specificity (also called the TN rate) measures the proportion of actual negatives that are correctly identified as such. F1-measure is a measure of the accuracy of a test that considers both the precision and the recall of the test to compute the score. Because our experiment has three categories, each class has its own precision, and we define the three values as Precision1, Precision2 and Precision3. In the same way, there are three values of recall, defined as Recall1, Recall2 and Recall3. Based on the confusion matrices, we calculate the macro precision, the macro recall and the macro F1-measure as shown in TABLE 4.
Since our experiment is a three-class experiment, Macro_P is the mean of Precision1, Precision2 and Precision3, and Macro_R is calculated in the same way from the recall values. Finally, based on the accuracy of each category, we calculate the variance as shown in TABLE 4.
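The evaluation indexes can be computed from a three-class confusion matrix as in the following sketch. The confusion matrix is a made-up toy example, not a result from this paper, and the function names are hypothetical. Macro_F1 is computed here from Macro_P and Macro_R, which is one common convention and an assumption about TABLE 4.

```python
def per_class_metrics(cm):
    """cm[t][p] = number of samples of true class t predicted as class p."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                           # class c predicted as another class
        fp = sum(cm[t][c] for t in range(n)) - tp      # other classes predicted as c
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall,
                        "specificity": specificity, "f1": f1})
    return results

def summary(cm):
    """Accuracy and macro-averaged precision, recall and F1-measure."""
    per_class = per_class_metrics(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[c][c] for c in range(len(cm))) / total
    macro_p = sum(m["precision"] for m in per_class) / len(per_class)
    macro_r = sum(m["recall"] for m in per_class) / len(per_class)
    # Assumption: Macro_F1 is the harmonic mean of Macro_P and Macro_R.
    macro_f1 = (2 * macro_p * macro_r / (macro_p + macro_r)
                if macro_p + macro_r else 0.0)
    return {"accuracy": accuracy, "macro_p": macro_p,
            "macro_r": macro_r, "macro_f1": macro_f1}

# Toy confusion matrix (hypothetical counts), rows = true class, columns = predicted.
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 1, 9]]
per_class = per_class_metrics(cm)
overall = summary(cm)
```

Each class is treated in a one-vs-rest manner, so TP, FP, FN and TN are recomputed for every class before the macro averages are taken.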

B. EXPERIMENTAL RESULTS

1) EVALUATION FOR FOLDOVER FEATURES
Artificial Neural Networks [40] (ANNs), Random Forests [42] (RFs) and Support Vector Machines [43] (linear-SVM and RBF-SVM) are used to test the effectiveness of the foldover features. The parameters are set as follows. For the ANNs, the number of network layers is 2, the number of hidden nodes is 10, and the activation function is log-sigmoid. For the RFs, the number of decision trees is 200. For the Support Vector Machines, the linear-SVM uses a linear kernel and the RBF-SVM uses a radial basis function kernel.
The foldover features, F X (i,g) , F Y (i,g) and F Z (i,g) are classified by ANNs, RFs, linear-SVM and RBF-SVM, and the confusion matrices of classification results are shown in FIGURE 18.
F Z (i,g) obtains the best results across the four classifiers. In particular, with ANNs the accuracy is 91.8%, and the classification accuracy of each category is also high: 93.5%, 87.1% and 89.2%, respectively.

2) COMPARISON WITH STATIC FEATURES
Firstly, according to φ (i) = S (i,1) , S (i,2) , . . . , S (i,g) , . . . , S (i,τ ) , each sperm is detected to a size of 26 by 26 pixels in the corresponding frame. After repeated experiments, 26 by 26 pixels was found to be a suitable size to ensure that the detected image contains only the sperm we want, and an example of some detected sperms is shown in FIGURE 19.
Secondly, we extract the static features of sperms after detection, including Histogram of Oriented Gradient [8] (HOG), Grey-Level Co-occurrence Matrix [9] (GLCM), the geometric invariant moment proposed by Hu [10], Scale-Invariant Feature Transform [12] (SIFT) and gray histogram [6]. All static features are extracted from detected sperm images. However, because the movement of a sperm spans multiple frames, we adopt a multiple-extraction method and, each time, randomly select one image from all the images of this sperm to extract the static features. Thirdly, the static features are extracted ten times, so the classification is also performed ten times. We use Artificial Neural Networks [40] (ANNs), Random Forests [42] (RFs) and Support Vector Machine [43] (linear-SVM and RBF-SVM) classifiers to classify the static features and construct the total confusion matrices of the ten experiments to represent the classification results. The classification results of static features with the four classifiers are shown in FIGURE 20.
According to the confusion matrices of static features in FIGURE 20, we calculate the evaluations, and the comparison between static features and foldover features is shown in TABLE 5. According to TABLE 5, the accuracies of F X (i,g) , F Y (i,g) and F Z (i,g) are significantly higher than those of the static features. The reason for the low accuracy of static features is that they are extracted from detected sperm images, in which there is little difference between stationary sperms and moving sperms. Therefore, it is difficult to distinguish different categories of sperms by static features.
TABLE 5. Evaluation of static features with four classifiers. The first column shows the types of static features, the second column shows the types of classifiers, the third to the last columns show the calculated evaluations. We abbreviate each evaluation metric: Acc is accuracy, Pre is precision, Mac_P is Macro_P, Rec is recall, Mac_R is Macro_R, Spe is specificity, F1-mea1 is F1-measure, Mac_F1 is Macro_F1 and Var is variance. The red font value means that the value is the maximum value in the column (Unit: %).
Because there are few differences between stationary sperms and moving sperms, static features easily misclassify all sperms in different motion states into one category. Consequently, one of the precision values (Precision1, Precision2 and Precision3) of a static feature is very high and the others are very low. The case for the recall values is similar to that of the precision. The differences between Precision1, Precision2 and Precision3 further affect the macro precision, macro recall and macro F1-measure. In contrast, F X (i,g) , F Y (i,g) and F Z (i,g) distinguish the three categories well thanks to the advantage of foldover in classification; in particular, the information of the foldover in the Z direction is very beneficial for distinguishing sperms in different motion states. Therefore, the precision and recall values are higher than those of the static features. Furthermore, foldover features perform well in the macro precision, macro recall and macro F1-measure.

3) COMPARISON WITH DYNAMIC FEATURES
Three dynamic features are selected for the comparative experiment: dynamic texture features and features extracted by two CNNs (VGG-16 and VGG-19 networks). The first step is the same as for the static features, where each sperm is detected to a size of 26 by 26 pixels in the corresponding frame. The difference is that we need the entire movement of the detected sperm. Therefore, the detected sperm images are combined into a video of the corresponding sperm. Secondly, we follow [14], [21], [22] to extract dynamic texture and deep learning (VGG-16 and VGG-19 networks) features from the detected semen videos. The third step is the same as for the static features, where we use ANNs [40], RFs [42] and SVM [43] (linear- and RBF-SVM) classifiers to classify the dynamic features, and the classification results are shown in FIGURE 21.
According to the classification results of dynamic features in FIGURE 21, we compare evaluations between dynamic features and foldover features in TABLE 6.
According to the comparison in TABLE 6, the accuracies of F X (i,g) , F Y (i,g) and F Z (i,g) are significantly higher than those of the dynamic features. The reason for the low accuracy of dynamic features is that sperms are tiny and carry very little dynamic information. Therefore, it is difficult to distinguish different categories of sperms by dynamic features. When dynamic features are used to classify sperms in different motion states, most sperms are easily assigned to one category; consequently, the resulting true positive (TP) values further affect the calculation results of all evaluations.

4) ADDITIONAL EXPERIMENT PART A: COMPARISON WITH DEEP CONVOLUTIONAL NEURAL NETWORKS
Currently, deep convolutional neural networks (DCNNs) are applied successfully in various applications, in which depth is a major factor in increasing the efficiency of the network. In particular, DCNNs have clear advantages in the field of image classification [47]. Therefore, we evaluate two well-known DCNNs (VGG-16 and VGG-19 networks) on the classification task of 1374 sperms, and the experimental results are shown in FIGURE 22. There are several reasons for the poor results of DCNNs:
(1) The high similarity of different sperms makes it difficult to extract effective features for DCNN classification.
(2) DCNNs offer little control over the selection of visual information. In the process of deep convolution, DCNNs discard some visual information judged as redundant, which is harmful for sperms with little visual information.
(3) Due to the lack of sperm visual information, the depth of DCNNs is required to be relatively high. A larger depth may leave sperms with no visual information to extract, while a smaller depth may yield extracted features that are not discriminative.
TABLE 6. Evaluation of dynamic features with four classifiers. The first column shows the types of dynamic features, the second column shows the types of classifiers, the third to the last columns show the calculated evaluations. We abbreviate each evaluation metric: Acc is accuracy, Pre is precision, Mac_P is Macro_P, Rec is recall, Mac_R is Macro_R, Spe is specificity, F1-mea1 is F1-measure, Mac_F1 is Macro_F1 and Var is variance. The red font value means that the value is the maximum value in the column (Unit: %).

5) ADDITIONAL EXPERIMENT PART B: FOLDOVER FEATURE FUSION
In order to enhance the discriminative ability of features, one important method is feature fusion, including early fusion and late fusion [48]. Early fusion integrates unimodal features before learning concepts, whereas late fusion first reduces unimodal features to separately learned concept scores and then integrates these scores to learn concepts. Because early fusion is easy to operate and requires only one learning phase, it is widely used in video analysis tasks [48]. Hence, for the foldover features F X (i,g) , F Y (i,g) and F Z (i,g) , we adopt the early fusion method for feature fusion. We classify the 1374 sperms using the fused foldover features, and the experimental results are shown in FIGURE 23.
According to FIGURE 23, the results on the four classifiers are excellent after the early fusion of the foldover features (F X (i,g) , F Y (i,g) and F Z (i,g) ). After early fusion, the accuracy of three classifiers (ANNs, RFs and linear-SVM) reaches more than 97%, nearly 6 percentage points higher than the highest accuracy of 91.8% without early fusion in FIGURE 18. The accuracy of the RBF-SVM is 81.8%, which is higher than that of the RBF-SVM in FIGURE 18 (79.5%). The recall of the three classifiers (ANNs, RFs and linear-SVM) is excellent, mostly above 90%, with the highest reaching 100% in ANNs, and the recall of the RBF-SVM is also much better than that of the RBF-SVM in FIGURE 18.
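Early fusion in its simplest form is a concatenation of the per-direction feature vectors before a single classifier is trained. The sketch below assumes plain concatenation with no re-weighting or normalisation, which the text does not specify. The function name `early_fusion` and the toy feature values are hypothetical.

```python
def early_fusion(f_x, f_y, f_z):
    """Concatenate the three directional feature vectors of one object."""
    return list(f_x) + list(f_y) + list(f_z)

# Hypothetical toy features for two sperms (values are illustrative only).
samples = [
    {"F_X": [0.2, 1.0], "F_Y": [0.1, 3.0], "F_Z": [5.0, 4.0, 0.8]},
    {"F_X": [0.9, 2.0], "F_Y": [0.4, 1.0], "F_Z": [9.0, 7.0, 0.7]},
]

# One fused vector per sperm; a single classifier is then trained on `fused`.
fused = [early_fusion(s["F_X"], s["F_Y"], s["F_Z"]) for s in samples]
```

Because the fusion happens before learning, only one training phase is needed, which is the practical advantage of early fusion noted above.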

6) EXPERIMENTAL ANALYSIS
There are two main reasons why the classification results of foldover features are superior to those of classical static and dynamic features. First, there is a high degree of similarity between different sperms. When two sperms are very similar in shape, size, color and texture, static features cannot effectively distinguish them. However, the foldover features can solve this problem well, because the differences between the foldovers of two sperms are very obvious, as in the example shown in FIGURE 24. Second, because sperms are very tiny and carry very little visual information, it is very difficult to distinguish two different sperms. However, because the foldover features contain not only the original shape and texture information of sperms but also their movement information, they can capture more useful visual information. Furthermore, we analyse the X, Y and Z directions of the sperm foldovers to expand the movement information of the sperms, and the 3D visualizations of two sperms are shown in FIGURE 24 (c) and (f).
FIGURE 24. (b) is the foldover of (a) in the Z direction, (e) is the foldover of (d) in the Z direction, (c) is the foldover of (a) in the 3D visualization, and (f) is the foldover of (d) in the 3D visualization.
According to FIGURE 24 (c) and (f), although FIGURE 24 (a) and (d) contain little visual information, the information contained in their foldovers is abundant, and the differences between the foldovers are obvious. In addition, even when the static and dynamic features of two sperms are very similar, foldover features contain a lot of visual information to distinguish the sperms, as shown in FIGURE 25.
The two sperms in FIGURE 25 (a) and (d) are very similar in shape, color, size and texture. In FIGURE 25 (b) and (e), there is a high similarity between the motility states of the two sperms. However, according to FIGURE 25 (c) and (f), when both static and dynamic features are similar, the information contained in the foldovers is significantly different. This indicates that the foldover features are superior in distinguishing tiny objects, similar objects, objects with little visual information and objects with similar visual information.

V. CONCLUSION AND FUTURE WORK
In this paper, we propose novel foldover features, which are applied to dynamic object behavior description in microscopic videos. Compared with classical static and dynamic features, the foldover features show obvious advantages in distinguishing tiny objects, similar objects, objects with little visual information and objects with similar visual information. In the experiment, we use four different classifiers (ANN, RF, linear-SVM and RBF-SVM) to test the effectiveness of the foldover features, and an overall outstanding classification accuracy is obtained, indicating the effectiveness and potential of the proposed foldover features.
In the future, we plan to increase the amount of data in a single category, allowing the same doctors to expand the data and address the imbalance in our experimental data. Then, although we have tested the foldover features on the semen microscopic videos, we will test them on more highly similar objects to improve the generalization of the foldover features.

In recent years, he has been conducting productive studies in intelligent medical imaging computing and modeling, machine learning, brain networks, and brain models. He has published more than 80 papers in peer-reviewed journals and international conferences. He has won many academic awards, such as the Chinese Excellent Ph.D. Dissertation Nomination Award and the Award for Outstanding Achievement in Scientific Research from the Ministry of Education.
TAO JIANG was born in 1975. He received the Ph.D. degree from the University of Siegen, Germany, in 2013. He is currently a Professor with the Chengdu University of Information Technology (CUIT), China, where he is also the Dean of the Control Engineering College. His research interests include machine vision, artificial intelligence, robot control, self-driving auto, and membrane computing.