Classification of Occluded Images for Large-Scale Datasets With Numerous Occlusion Patterns

Large-scale image datasets with numerous occlusion patterns prevail in real applications. The classification scheme based on subspace decomposition-based estimation with squared $l_{2}$-norm regularization (SDBE_L2) has shown promising performance for the classification of partially occluded images. For large-scale image datasets with numerous occlusion patterns, however, it suffers from a high labor intensity in acquiring extra image pairs and a large consumption of computational resources in the training stage. To reduce the labor intensity, this paper enumerates several useful types of extra image pairs to guide the collection of extra images and introduces an intra-class random pairing method to semi-automatically form the extra image pairs. To alleviate the consumption of computational resources, this paper proposes two dictionary compression approaches: 1) uncentered PCA-based single partition compression (UPSPC), which compresses the dictionary to a size not larger than twice the column vector length without affecting the classification accuracy, and 2) uncentered PCA-based intra-class partition compression (UPIPC), which can further shrink the occlusion error dictionary (or class dictionary) when it has a small number of occlusion classes (or image classes). The proposed approaches are based on the property that SDBE_L2 is invariant to the uncentered PCA of sub-dictionaries. Extensive experiments on the Caltech-101 dataset and the Oxford-102 flower dataset demonstrate that the enumerated examples and the intra-class random pairing method facilitate acquiring the extra images and forming the extra image pairs with only a small loss in the classification accuracy.
The experimental results on a large-scale occluded image dataset synthesized from the ILSVRC 2012 classification dataset with numerous occlusion patterns show that the proposed dictionary compression approaches reduce the dictionary size by over 11 times and shorten the training time by more than 39 times without loss in the classification accuracy.


I. INTRODUCTION
Classification of partially occluded images is a long-standing challenge in computer vision [1]–[6]. Recently, many research attempts have been made to introduce the rapidly developing deep learning techniques [7]–[15] into this field. Under the deep learning framework, occlusion can be tackled in a low-level representation, e.g., the image itself [16]–[22] or a low-level deep feature map [23], or in a high-level representation, e.g., the deep feature vector (DFV) [24], of the image. None of these approaches, however, can handle classification on large-scale generic image datasets with numerous occlusion patterns, which prevail in real applications.

The associate editor coordinating the review of this manuscript and approving it for publication was Wenming Cao.
To cope with occlusion, the variations introduced by occlusion usually need to be modeled. For the low-level representation, the diversity of the occlusion-related variations is much richer than for the high-level representation, since the low-level representation contains more details than the high-level representation. Consequently, for the low-level representation, many more training images, especially occluded training images, are usually required to model the occlusion-related variations than for the high-level representation.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

For a large-scale image dataset with numerous occlusion patterns, the diversity of the occlusion-related variations in the low-level representation is extremely rich, and thus a huge number of occluded training images is usually required by approaches based on the low-level representation. Acquiring and labeling occluded training images for the classification task, however, is labor-intensive and time-consuming. The lack of sufficient task-specific occluded training images discourages the development and application of low-level representation based approaches. In contrast, far fewer occluded training images are needed for the high-level representation due to its lower diversity. Therefore, handling occlusion in the high-level representation seems more practicable for large-scale datasets with numerous occlusion patterns. In [24], a high-level representation based approach, namely subspace decomposition-based estimation (SDBE), was proposed to improve the classification performance against occlusion on occluded generic image datasets. The SDBE_L2 implementation of SDBE, which employs a squared l_2-norm to regularize the decomposition coefficients and introduces very low extra computational cost in testing, is applicable to large-scale datasets.
However, for large-scale datasets with numerous occlusion patterns, the SDBE_L2-based classification scheme presented in [24] can exhaust the computational resources of many computers in the training stage. In [24], for each image class or occlusion pattern, at least tens of DFVs or occlusion error vectors (OEVs) were employed to construct the class dictionary (CD) or occlusion error dictionary (OED), respectively. For numerous image classes or occlusion patterns, this dictionary construction method gives rise to a dictionary of catastrophically large size, which requires a huge amount of memory to store and an extremely long time to process. Besides, constructing an OED from exact-matching extra image pairs, as adopted in [24], is a labor-intensive task in real applications. Each exact-matching extra image pair is composed of a non-corrupted and a corrupted version of the same image. In practice, this type of image pair usually requires intensive manual labor to prepare, due to the lack of an automatic approach to precisely identify the corrupted versions of an occlusion-free image among plenty of occluded images.
In this paper, we focus on overcoming the above-mentioned deficiencies of the SDBE_L2-based classification scheme on the large-scale datasets with numerous occlusion patterns. We improve the SDBE_L2-based classification scheme from two aspects: 1) reducing the labor intensity in acquiring and forming the extra image pairs and 2) alleviating the computational resource consumption in the training stage.
To reduce the labor intensity, we first investigate which types of extra image pairs can contribute positively to the improvement in the classification of occluded images. Then, we enumerate several examples that can achieve a high overlap between the linear span of the generated OED and the occlusion error subspace, as empirical guidelines for collecting the extra images. We also introduce an intra-class random pairing method to effectively exploit the extra images that lack exact-matching counterparts and to reduce the labor intensity in forming the extra image pairs.
To alleviate the computational resource consumption, we propose two novel dictionary compression approaches: 1) uncentered PCA-based single partition compression (UPSPC) and 2) uncentered PCA-based intra-class partition compression (UPIPC). The proposed dictionary compression approaches are based on the uncentered principal component analysis (PCA). We prove that SDBE_L2 is invariant to the uncentered PCA of sub-dictionaries. Two methods are introduced to divide the CD and OED into sub-dictionaries: 1) single partition and 2) intra-class partition. The single partition treats the CD or OED as a single sub-dictionary, while the intra-class partition divides the CD or OED into sub-dictionaries according to the image classes or occlusion classes, respectively. In UPSPC, the compression is achieved by reserving the non-zero uncentered principal components (PCs) of each sub-dictionary of the single partition. The maximum size of the dictionary compressed by UPSPC is only twice the length of the column vectors. When the number of image classes or occlusion classes is small, the size of the CD or OED can be further decreased by UPIPC. In UPIPC, the intra-class partition is employed and the compression is achieved by reserving only the first few PCs of each sub-dictionary.
In summary, our main contributions to improve the SDBE_L2-based classification scheme are as follows.
• We expand the admissible types of the extra image pairs and provide the empirical guidelines for the collection of the extra images to diminish the difficulty in acquiring the extra images.
• We introduce the intra-class random pairing method to reduce the workload in pairing the extra images.
• We propose two novel dictionary compression approaches, UPSPC and UPIPC, to alleviate the consumption of computational resources in the training stage.

The paper is organized as follows. Section II summarizes the related work. Section III briefly reviews the SDBE_L2-based classification scheme. Section IV investigates the useful types of extra image pairs and introduces the intra-class random pairing method. Section V describes the proposed dictionary compression approaches. Section VI presents the experimental results, and Section VII concludes the paper.

II. RELATED WORK
In the image space, deep generative models are commonly employed to restore the occluded image regions. A large number of exact-matching occluded and occlusion-free image pairs that can cover the variations caused by the occlusions are usually required to train the generative models. In [16], [18]–[20], generative adversarial networks (GANs) were used to generate the missing portions of the image. Without particular consideration of classification, however, images recovered purely by inpainting are unsuitable for image classification due to the inter-class information introduced by the generated patches, which degrades the classification accuracy. Furthermore, these approaches need to know the shapes and locations of the occlusions or missing portions in advance. Unfortunately, automatic occlusion detection in generic images is still a tough challenge in computer vision.
To avoid providing prior knowledge about the occlusion structure, learning the occlusion structure from data is developed for the generative model based approaches. In [25], an average relative difference between the activations of the occluded and the clean image was introduced to discriminate between the corrupted and the non-corrupted elements of the DFVs for the stacked sparse denoising auto-encoder. The corrupted elements are then replaced with the averaged non-corrupted elements to reconstruct the occlusion-free face image. Because of the holistic attribute of the DFVs and inter-class information introduced by the averaged non-corrupted elements, this approach only exhibits improvement for small-scale face datasets with a few types of synthetic occlusions.
To improve the classification performance of the generative model based approaches, in [26], an identity-based supervised CNN was proposed to provide extra guidance for the training of robust LSTM-autoencoders so as to preserve the identity information. The trained robust LSTM-autoencoders, which have an LSTM branch to learn the structure of the occlusion, were then employed to de-occlude face images. However, in order to cover the variations caused by the occlusions, massive numbers of exact-matching occluded and occlusion-free face image pairs are required to train the robust LSTM-autoencoders. Moreover, the robust LSTM-autoencoders only show improvement in face recognition.
In the deep feature space, the correspondence between the occluded image regions and the corrupted elements of the deep feature map, training with data augmentation, and recovering uncorrupted DFVs have been investigated to improve the robustness of deep features to occlusion. In [23], a pairwise differential Siamese network (PDSN) was proposed to learn the correspondence between the occluded facial regions and the corrupted activations of the top convolution layer. The occlusion-associated feature elements, which are discarded in classification, are indicated by a discarding mask generated according to the learned correspondence. This approach, however, relies heavily on an accurate occlusion detector and well-aligned face images, because the correspondence is built on a predefined image grid and the generation of the feature discarding mask requires an accurate detection of the occlusion. Consequently, this approach is not applicable to the classification of generic occluded images. In [27], a strategy of data augmentation with synthetic occluded face images was proposed to train the CNNs to extract features more locally and equally. The occluded face images were synthesized by placing the occluder with a high probability at the face regions sensitive to the classification result. The sensitive face regions were identified by an occlusion map obtained via the visualization technique proposed in [28]. The occlusion map, however, is only effective for well-aligned images, e.g., aligned face images. In addition, to cover the variations caused by numerous occlusion patterns, an extremely large number of synthetic occluded images would have to be employed in training, which would prolong the training time dramatically. In [27], the improvement was only shown on small face datasets with a few occlusion patterns. In [24], the SDBE-based image classification scheme was proposed.
The SDBE approach recovers the non-corrupted DFV by projecting the DFV of the occluded image onto the linear span of a CD along that of an OED. Two implementations of SDBE were introduced in [24]: SDBE_L1 and SDBE_L2. For a large-scale dataset, SDBE_L2 has much lower computational complexity in testing than SDBE_L1 but achieves similar classification accuracy. However, the implementation of SDBE_L2 in [24] is computationally expensive in training for large-scale datasets with numerous occlusion patterns. This is because the similarities between the OEVs associated with the same occlusion class and between the DFVs of the same image class are not effectively exploited. In addition, the redundancy in the dictionary is not removed.
From the above analysis, we can see that a practicable classification scheme for large-scale generic image datasets with numerous occlusion patterns remains a challenge.

III. SDBE_L2-BASED CLASSIFICATION
In this section, we briefly describe the SDBE_L2-based classification scheme for the sake of completeness. SDBE_L2 is designed to estimate the non-corrupted DFV by projecting the corrupted DFV onto the class subspace A along the occlusion error subspace B. Because the projection is usually non-unique in practice, SDBE_L2 constrains the projection by minimizing the squared l_2-norm of the projection coefficients.
Suppose the CD A = [A_1, ..., A_{K_A}] and the OED B = [B_1, ..., B_{K_B}] are constructed from the DFVs of the training images and the OEVs of the extra image pairs, respectively. Then, the DFV of the ith test image, v_i, is decomposed as

v_i = Aα + Bβ + n, (1)

where n is a noise term and α = [α_1^T, ..., α_i^T, ..., α_{K_A}^T]^T and β = [β_1^T, ..., β_i^T, ..., β_{K_B}^T]^T are the coefficients. SDBE_L2 employs the squared l_2-norm regularized LS estimate to solve equation (1):

ω̂ = arg min_ω ||v_i − Dω||_2^2 + λ||ω||_2^2 = (D^T D + λI)^{-1} D^T v_i, (2)

where λ is a positive hyperparameter, D = [A, B], and ω = [α^T, β^T]^T.

The SDBE_L2-based classification scheme has two stages: a training stage and a testing stage. The primary purpose of the training stage is to construct the CD and OED and then compute the projection matrix. The details of the training stage are listed below.
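As a concrete illustration, the regularized LS estimate of equation (2) followed by reconstruction with the class-dictionary coefficients can be sketched in a few lines of numpy. This is only a toy sketch with random dictionaries; the function name `sdbe_l2` and all dimensions are ours, not the paper's implementation:

```python
import numpy as np

def sdbe_l2(v, A, B, lam=0.1):
    """Squared l2-norm regularized LS estimate: solve
    (D^T D + lam*I) omega = D^T v with D = [A, B], then return
    A @ alpha, the estimated occlusion-free DFV."""
    D = np.hstack([A, B])
    n = D.shape[1]
    omega = np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ v)
    alpha = omega[:A.shape[1]]   # coefficients over the class dictionary
    return A @ alpha             # estimated occlusion-free DFV

# toy dimensions: m-D DFVs, small CD and OED
rng = np.random.default_rng(0)
m, nA, nB = 32, 10, 8
A = rng.standard_normal((m, nA))
B = rng.standard_normal((m, nB))
v = rng.standard_normal(m)
v0_hat = sdbe_l2(v, A, B)
print(v0_hat.shape)  # (32,)
```

In practice the projection matrix O mentioned below is precomputed in the training stage, so the per-test-image cost reduces to a single matrix-vector product.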

7) Train the classifier C with the column vectors of A (optional).
• Output: O and C.

The function of the testing stage is to predict the image class of the test image. The details of the testing stage are listed as follows.
4) Predict the image class of v̂_0i with the classifier C.
• Output: the image class of v̂_0i.

While the testing stage is independent of the size of the dictionary D because O ∈ R^{m×m}, the computational resources consumed to compute O in the training stage, such as memory and computing time, increase with the size of D.
To calculate O, the memory should be able to store the matrix D^T D, whose size is (N_A + N_B) × (N_A + N_B). For a large-scale image dataset with numerous occlusion patterns, (N_A + N_B) is so large that the memory required to compute O can easily exceed the memory capacity of many computers. For instance, for a dataset with 1000 image classes and 2000 occlusion patterns, adopting the setting used in [24] (20 DFVs for each image class and 100 OEVs for each occlusion pattern), the size of D^T D is 220000 × 220000. With the commonly used double-precision representation of a real value (8 bytes per value), the minimum memory required to store a 220000 × 220000 matrix is over 360 GB, which exceeds the memory capacity of most desktop computers.
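The 360 GB figure can be reproduced with a quick back-of-the-envelope computation (a sketch; the variable names are ours):

```python
# Memory needed to store D^T D in double precision for the example above.
n_classes, dfvs_per_class = 1000, 20
n_patterns, oevs_per_pattern = 2000, 100
N = n_classes * dfvs_per_class + n_patterns * oevs_per_pattern
bytes_needed = N * N * 8                 # 8 bytes per double-precision value
print(N, round(bytes_needed / 2**30))    # 220000 columns, ~361 GiB
```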
The most time-consuming procedure in the training stage is inverting D^T D + λI, which has a computational complexity of over O((N_A + N_B)^{2.37}) [31], [32]. Therefore, for a matrix of such huge size as in the above example, the computing time will be extremely long.
Besides, in [24], employing the exact-matching extra image pairs to generate the OED restrains the practical application of the SDBE_L2-based classification scheme. In many applications, the exact-matching counterparts of the extra images are difficult to acquire.

IV. ACQUIRING EXTRA IMAGE PAIRS
In this section, we investigate the types of the extra image pairs that can contribute positively to the classification of occluded images. Acquiring the extra image pairs includes two steps: 1) collecting the occluded images and occlusion-free images and 2) matching the occluded images to the occlusion-free images to yield the extra image pairs. As mentioned in [24], the linear span of the OED B is regarded as an approximation of the occlusion error subspace B. To achieve a small estimation error, the linear span of B should heavily overlap with B, i.e., the overlapping ratio δ should be sufficiently large. In addition to the type of the extra image pairs used in [24], i.e., covering all the occlusion patterns in the test images and having exact matching between the occluded and the occlusion-free images, it is apparent that many alternatives can yield OEDs with a high δ. Through a toy example, we first take a close look at the variation of the DFV caused by occlusion. As shown in Fig. 1, the DFVs of the occluded images, which reflect the features of both the occlusion patch and the original images, move from the class subspaces A_i's to the linear span A_oj associated with the jth occlusion pattern. For instance, the black arrow line v_0i (the DFV of the original occlusion-free beaver image) moves from A_i to A_oj and becomes the black arrow line v_i (the DFV of the beaver image with 25% occlusion). The OEV between v_i and v_0i, which falls into the occlusion error subspace B, is drawn as a solid red arrow line. In Fig. 1, the blue arrow lines are additional OEV examples associated with B.
From Fig. 1, we can see that, for a given occlusion pattern, the relative position between the class subspaces associated with the occlusion-free extra images, e.g., the A_i's, and the linear span associated with the occlusion pattern, e.g., A_oj, determines the linear span of the OEVs. This indicates that any occluded and occlusion-free image pairs with appropriate relative positions in the deep feature space can be adopted as extra image pairs, because the linear span of their generated OEVs can have a high δ.
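To make the geometry concrete: an OEV is simply the difference of the two DFVs of an extra image pair, and one plausible way to quantify how well the span of an OED covers a set of test OEVs is the fraction of test-OEV energy captured by an orthonormal basis of the OED's span. The paper's δ is defined elsewhere, so this measure, like all names and dimensions below, is only an illustrative assumption:

```python
import numpy as np

def oev(v_occluded, v_clean):
    """Occlusion error vector of an extra image pair."""
    return v_occluded - v_clean

def overlap(oed, test_oevs, tol=1e-10):
    """Fraction of test-OEV energy lying in span(oed) -- a hypothetical
    stand-in for the overlapping ratio delta used in the paper."""
    U, s, _ = np.linalg.svd(oed, full_matrices=False)
    Q = U[:, s > tol * s[0]]              # orthonormal basis of span(oed)
    return np.linalg.norm(Q.T @ test_oevs) ** 2 / np.linalg.norm(test_oevs) ** 2

rng = np.random.default_rng(4)
m = 32
basis = rng.standard_normal((m, 4))          # a 4-D occlusion error subspace
oed = basis @ rng.standard_normal((4, 20))   # OED columns inside that subspace
test = basis @ rng.standard_normal((4, 10))  # test OEVs in the same subspace
print(round(overlap(oed, test), 6))          # 1.0 when the spans coincide
```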
In practice, many types of occluded and occlusion-free image pairs can meet the requirement of a high δ. To guide the acquisition of useful extra image pairs in practice, we enumerate some examples from three aspects: the occlusion-free extra image, the occluded extra image, and the pairing method. The effectiveness of these examples is demonstrated in Section VI-A.

A. OCCLUSION-FREE EXTRA IMAGE
The occlusion-free extra image determines the start point of the OEV. Suppose the test images are drawn from an image class set T . The occlusion-free extra images can be drawn from either inside or outside T .
For the case inside T, the OEVs start close to the start points of the OEVs of the occluded test images.^1 For the case outside T, by contrast, the start points of the OEVs lie far from those of the OEVs of the occluded test images. The linear span of the OED is therefore more likely to deviate from B for the case outside T than for the case inside T. Consequently, drawing from outside T usually yields smaller improvement than drawing from inside T.
Despite the smaller improvement, drawing from outside T facilitates the collection of extra images. For instance, in a flower classification task, animal images can be adopted as the extra images. Here, we give two examples regarding the case outside T.
1) Intra-dataset occlusion-free extra images: the occlusion-free extra images are drawn from the same dataset as the test images but from image classes distinct from those of the test images.
2) Inter-dataset occlusion-free extra images: the occlusion-free extra images are drawn from datasets distinct from the task-specific dataset. For instance, the OED used for classification on the Oxford-102 dataset [33] is constructed by drawing the occlusion-free extra images from the Caltech-101 dataset [30] (see the experiment in Section VI-A2).

Since images from both inside and outside T can be adopted as occlusion-free extra images, the occluded extra images and the pairing method become the deterministic factors for the usefulness of the extra image pairs.

B. OCCLUDED EXTRA IMAGE
The occluded extra image determines the endpoint of the OEV. Suppose the occluded test images and the occluded extra images are associated with the jth and j'th occlusion patterns, respectively. If A_oj', the linear span of the DFVs of the occluded extra images, heavily overlaps with A_oj, the OEVs of the extra image pairs mainly end within the space that can produce B_j', the linear span of the OEVs of the extra image pairs, of high δ.
In addition to the jth occlusion pattern itself, it is easy to infer that the following occlusion patterns can give rise to an A_oj' that heavily overlaps with A_oj.
1) The intra-class occlusion pattern: the occlusion patch is distinct from but belongs to the same occlusion class as the occlusion patch of the jth occlusion pattern.
2) The linear transformation occlusion pattern: the occlusion pattern is a translation, rotation, or scaling of the jth occlusion pattern.
3) The noised occlusion pattern: the occlusion patch is a noised version, e.g., corrupted by Gaussian noise, of the occlusion patch of the jth occlusion pattern.
4) The cropped occlusion pattern: the occlusion patch is cropped from the occlusion patch of the jth occlusion pattern, or vice versa.
For an extreme scaling in the second item, e.g., enlarging the occlusion patch from 25% occlusion to 80% (see the experiment in Section VI-A), A_oj' and A_oj may not overlap with each other. In this situation, however, both A_oj' and A_oj lie in the same direction from A_i toward the linear span associated with the image class of the occlusion patch. This means that the OEVs associated with the extremely scaled occlusion pattern point in a direction similar to that of the OEVs associated with the jth occlusion pattern. Therefore, B_j and the linear span of B_j' can have a large overlap.

C. PAIRING METHOD
In practical applications, many extra images may not have exactly matched counterparts. For instance, for face recognition in the wild, it is almost impossible to find an occlusion-free face image and an occluded face image having the same head pose, facial expression, and illumination. In such a situation, the exact-matching method used in [24] cannot efficiently exploit the extra images. In particular, for a small set of extra images, the extra image pairs generated by the exact-matching method will be insufficient to produce an OED that approximates B with a small error. On the other hand, finding exactly matched occlusion-free and occluded images among plenty of extra images usually consumes a vast amount of labor.
To alleviate these problems, we introduce an intra-class random pairing method, where each extra image pair is formed by randomly combining an occlusion-free extra image with an occluded extra image of the same image class. The intra-class random pairing method is semi-automatic, requiring only the image classes of the occluded extra images to be determined manually. Therefore, much less labor is needed to pair the extra images. It can also effectively exploit extra images that lack exactly corresponding counterparts in the set of collected extra images.
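The pairing step itself is then trivial to automate. A minimal sketch, assuming each image is represented by an (id, class label) tuple (the function name and data layout are ours):

```python
import random
from collections import defaultdict

def intra_class_random_pairing(occ_free, occluded, seed=0):
    """Form extra image pairs by randomly matching an occlusion-free
    image with an occluded image of the same image class.
    occ_free / occluded: lists of (image_id, class_label) tuples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for img, cls in occ_free:
        by_class[cls].append(img)
    pairs = []
    for img, cls in occluded:
        if by_class[cls]:                  # skip classes with no clean image
            pairs.append((rng.choice(by_class[cls]), img))
    return pairs

pairs = intra_class_random_pairing(
    occ_free=[("f1", "beaver"), ("f2", "beaver"), ("f3", "lotus")],
    occluded=[("o1", "beaver"), ("o2", "lotus")])
print(pairs)
```

Only the class labels of the occluded images are needed, which is exactly the manual effort the method requires.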
For a well-trained CNN, the DFVs of the occlusion-free images of the same image class cluster together, and thus the image class determines the approximate location of the start point of an OEV. Meanwhile, the DFVs associated with the same occlusion pattern lie close to each other, since these DFVs reflect the features of the same occlusion patch. For a given occlusion pattern, it is apparent that the start points and endpoints of the intra-class random pairing OEVs deviate only slightly from the respective points of the exact-matching OEVs, since these two types of OEVs are associated with the same image class and the same occlusion pattern. Therefore, the intra-class random pairing OED, which is constructed by using the extra image pairs formed with the intra-class random pairing method, can have a high δ.

V. DICTIONARY COMPRESSION
In this section, we introduce the uncentered PCA-based dictionary compression approaches. As the basis of the proposed approaches, the uncentered PCA of sub-dictionaries is first described, and then, the strategies of dictionary partition and compression are presented.
A. UNCENTERED PCA OF SUB-DICTIONARIES
Let D_i denote a sub-dictionary of D. The SVD of D_i is given by

D_i = L_i Σ_i R_i^T, (7)

where L_i and R_i are the orthogonal matrices of the left and right singular vectors, respectively, and Σ_i is a diagonal matrix of singular values in descending order. The column vectors of R_i are actually the eigenvectors of D_i^T D_i, and can thereby be viewed as the principal axes of the uncentered PCA [36] of D_i.
To avoid confusion with the standard PCA [37], [38], where the matrix is column-centered, in this paper the notation for the uncentered PCA is qualified by ''uncentered'', e.g., the principal axes of the uncentered PCA are named uncentered principal axes. Equation (7) can be rewritten as D_i R_i = L_i Σ_i. The left term represents the projection of D_i onto the uncentered principal axes. The columns of L_i Σ_i can therefore be interpreted as the uncentered PCs.^3 Let

D_i^pca = L_i Σ_i and ω̂_i = R_i^T ω_i,

where ω̂_i is the projection of ω_i onto the uncentered principal axes. Then, we have

D_i ω_i = L_i Σ_i R_i^T ω_i = D_i^pca ω̂_i.

Recall that R_i is an orthogonal matrix, i.e., R_i R_i^T = I, where I denotes the identity matrix. With ω̂ denoting the vector formed by stacking the ω̂_i's, we obtain

||ω̂||_2^2 = ω̂^T ω̂ = ω^T ω = ||ω||_2^2.
It is easy to infer from Lemma 1 that in the uncentered principal space of D_i, which is defined by R_i, equation (2) remains unchanged.
By repeatedly replacing each D_i with D_i^pca, we have the following theorem.

Theorem 1: The solution of equation (2) with each D_i replaced by D_i^pca can be obtained by replacing the ω̂_i's in ω̂ with R_i^T ω̂_i's.

Theorem 1 indicates that SDBE_L2 is invariant to the uncentered PCA of sub-dictionaries. Moreover, the projection of a sub-dictionary onto the uncentered principal axes gives the sub-dictionary a significant-first form, since the uncentered PCs carrying more information come before those capturing less information.
These properties facilitate the dictionary compression. First, the size of D_i^pca is not larger than m × m, because the number of non-zero columns of D_i^pca is equal to the rank of D_i, which is smaller than or equal to m. Second, if the original column vectors of a sub-dictionary are highly correlated, the information of the sub-dictionary concentrates in the first few uncentered PCs, which correspond to the large singular values. Consequently, the sub-dictionary can be compressed by reserving only the first few uncentered PCs.

^3 In the literature, some authors confusingly call the column vectors of R_i ''principal components'' (e.g., [38]), but we reserve this name for D_i R_i in keeping with the terminology in [37].
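The invariance can be checked numerically. The following sketch treats the whole dictionary as one sub-dictionary and verifies, on random data, that replacing it with its uncentered PCs leaves both the reconstruction Dω and the coefficient energy ||ω||_2^2 of the regularized LS solution unchanged (dimensions and names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 16, 40
D = rng.standard_normal((m, n))
v = rng.standard_normal(m)
lam = 0.5

# uncentered PCA of the (single) sub-dictionary via SVD: D = L S R^T
L, s, Rt = np.linalg.svd(D, full_matrices=False)
D_pca = L * s                          # uncentered PCs, size m x min(m, n)

def solve(Dm):
    """Squared l2-norm regularized LS solution for dictionary Dm."""
    k = Dm.shape[1]
    return np.linalg.solve(Dm.T @ Dm + lam * np.eye(k), Dm.T @ v)

w = solve(D)
w_pca = solve(D_pca)
# identical reconstruction and identical coefficient energy
print(np.allclose(D @ w, D_pca @ w_pca), np.allclose(w @ w, w_pca @ w_pca))
```

Note that `w_pca` has only min(m, n) entries while `w` has n, which is exactly where the compression comes from.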

B. DICTIONARY PARTITION AND COMPRESSION
The above analysis encourages us to compress the dictionary via the uncentered PCA of sub-dictionaries. However, how to partition the dictionary D into sub-dictionaries has yet to be investigated.
An intuitive partition is to consider the CD A or OED B as a single sub-dictionary. We name this type of partition single partition. The single partition can ensure that A or B is compressed to a size not larger than m × m. Let m A or m B denote the numbers of non-zero PCs of A or B, respectively. Because the rank of A or B is not larger than m, we have m A ≤ m or m B ≤ m. By dropping off the zero PCs, A or B is shrunk to a size smaller than or equal to m × m in the unentered principal space. We call this type of compression uncentered PCA-based single partition compression (UPSPC). The UPSPC algorithm is presented in Algorithm 1. UPSPC is a type of lossless compression due to the invariability of SDBE_L2 to the uncentered PCA.

Algorithm 1 The Proposed UPSPC Algorithm
Input: A or B.
1) Let D_1 = A or B.
2) Conduct SVD on D_1 according to equation (7).

UPSPC provides a lower bound for the compression. For a D associated with a small number of image classes or occlusion classes, a higher compression ratio can be achieved. We divide the CD or OED into sub-dictionaries according to the image classes or occlusion classes, respectively, i.e., we put the CD column vectors of the same image class into the same sub-dictionary, or the OED column vectors associated with the same occlusion class into the same sub-dictionary. We name this type of partition intra-class partition. The column vectors of the same image class or occlusion class are usually highly correlated. Therefore, for the intra-class partition, the first few uncentered PCs of each sub-dictionary capture most of the information of the sub-dictionary and can represent it approximately.
Suppose there are K_A image classes and K_B occlusion classes in D. Let m_{A_i} and m_{B_j} denote the numbers of reserved uncentered PCs for the ith sub-CD and the jth sub-OED, respectively. If

Σ_i m_{A_i} < m_A or Σ_j m_{B_j} < m_B,

then A or B can be compressed to a size smaller than that achieved by UPSPC by reserving only the first few uncentered PCs of each sub-dictionary. We call this type of compression uncentered PCA-based intra-class partition compression (UPIPC). The UPIPC algorithm is presented in Algorithm 2. Unlike UPSPC, the classification accuracy for UPIPC deviates slightly from that for the uncompressed dictionary, because some uncentered PCs are discarded in each sub-dictionary.
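The per-class truncation can be sketched as follows. The example builds three highly correlated (rank-5) class sub-dictionaries, so keeping five uncentered PCs per class captures each sub-dictionary exactly; with real DFVs the truncation would be lossy, as noted above. All names and dimensions are ours:

```python
import numpy as np

def upipc(sub_dicts, n_keep):
    """Compress each class sub-dictionary to its first n_keep uncentered
    PCs (lossy in general: small uncentered PCs are discarded)."""
    compressed = []
    for Di in sub_dicts:
        L, s, _ = np.linalg.svd(Di, full_matrices=False)
        k = min(n_keep, len(s))
        compressed.append(L[:, :k] * s[:k])
    return np.hstack(compressed)

rng = np.random.default_rng(3)
m = 64
# 3 image classes, each with 200 highly correlated column vectors
subs = []
for _ in range(3):
    base = rng.standard_normal((m, 5))
    subs.append(base @ rng.standard_normal((5, 200)))  # rank-5 sub-dictionary
A_c = upipc(subs, n_keep=5)
print(A_c.shape)  # (64, 15)
```

With 3 classes and 5 PCs each, the compressed CD has 15 columns, below the m = 64 columns UPSPC would retain.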

C. CLASSIFICATION SCHEME
The overall SDBE_L2-based classification scheme integrated with the proposed dictionary compression approaches is illustrated in Fig. 2.

D. ANALYSIS OF COMPUTATIONAL RESOURCE CONSUMPTION
For large-scale datasets with numerous occlusion patterns, it is apparent that N_A ≫ m and N_B ≫ m. Therefore, the maximum size of the matrices that need to be stored is m × (N_A + N_B), the size of the dictionary before compression. This maximum size is much smaller than that of the original SDBE_L2-based scheme, which is (N_A + N_B) × (N_A + N_B).

Regarding the computational complexity, unlike in the original SDBE_L2-based classification scheme, inverting D^T D + λI is no longer a computationally intensive procedure, because the size of D is not larger than m × m after dictionary compression. The most time-consuming procedure is the SVD of the uncompressed CD or OED. The SVD has a computational complexity of O(m^2 N_A) or O(m^2 N_B) for the CD or OED, respectively [39]. This complexity is much lower than that of inverting D^T D + λI in the original SDBE_L2-based scheme, since N_A ≫ m and N_B ≫ m for large-scale datasets with numerous occlusion patterns.

VI. EXPERIMENTS
In the experiments, the proposed algorithms were implemented with Matlab. Like [24], the MatConvNet [40] implementation of the ResNet-152 network [8], which was pre-trained on the ILSVRC2012 classification dataset [41], was adopted as the base-CNN to extract the 2048-D DFVs. The experiments were conducted on a PC with an i7 CPU and 64 GB of memory, without GPU acceleration.

A. EXPERIMENTS ON ACQUIRING EXTRA IMAGE PAIRS
In this section, we illustrate the effectiveness of the various examples enumerated in Section IV. The experiments were conducted on the Caltech-101 dataset [30] and Oxford-102 flower dataset [33].
The $l_2$-regularized $l_2$-loss linear SVM [42] was adopted as the classifier and was trained on the normalized DFVs of the training images for each experiment. The regularization parameter of the linear SVM was drawn from the grid set $\{2^{-15}, \ldots, 2^{0}, \ldots, 2^{15}\}$. To match the input size of the base-CNN, all of the images were directly resized to 224 × 224. The occluded images were synthesized by superimposing occlusion patches, scaled to specific occlusion ratios, on the resized images.
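The synthesis step can be sketched as below. The patch content and placement here are illustrative assumptions; the paper only specifies that a patch is scaled to a target occlusion ratio and superimposed on the 224 × 224 resized image.

```python
import numpy as np

def occlude(image, patch, occlusion_ratio, top_left):
    """Superimpose `patch`, rescaled to cover `occlusion_ratio` of the image
    area, onto `image` at position `top_left` (row, col)."""
    H, W = image.shape[:2]
    # Side length of a square patch covering the requested fraction of the area.
    side = int(round(np.sqrt(occlusion_ratio * H * W)))
    # Nearest-neighbour rescale of the patch to side x side.
    rows = np.arange(side) * patch.shape[0] // side
    cols = np.arange(side) * patch.shape[1] // side
    scaled = patch[rows][:, cols]
    out = image.copy()
    r, c = top_left
    out[r:r + side, c:c + side] = scaled
    return out

# 25% occlusion at the image centre, as in the first Caltech-101 experiment.
img = np.zeros((224, 224, 3), dtype=np.uint8)
patch = np.full((50, 50, 3), 255, dtype=np.uint8)
side = int(round(np.sqrt(0.25 * 224 * 224)))   # 112
occluded = occlude(img, patch, 0.25, ((224 - side) // 2, (224 - side) // 2))
print(int(occluded.sum()) // (255 * 3))        # 12544 occluded pixels (112 x 112)
```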
The optional $l_2$ normalization procedures (step 4 in the training stage and step 2 in the testing stage) were adopted in the experiments due to a strong preference of the linear SVM for normalized vectors. The hyperparameter λ of SDBE_L2 was drawn from the grid set $\{10^{-6}, \ldots, 0.5, 1, \ldots, 10\}$.

1) CALTECH-101 DATASET
The Caltech-101 dataset, excluding the ''background'' image class, was adopted for evaluation. The settings of the training images, occlusion-free test images, and occlusion-free extra images are the same as in [24]. In particular, the dataset was split into a class set (80 image classes, with names from ''accordion'' to ''schooner'' in alphabetical order) and an extra set (the remaining 21 image classes).

In the first experiment, the cropping, noising, scaling, rotation, and translation of the occlusion patch and the intra-class random pairing method with intra-dataset occlusion-free images (example 1 in Section IV-A, examples 2, 3, and 4 in Section IV-B, and the example in Section IV-C) were evaluated. The test images were synthesized by using a single occlusion pattern (25% occlusion at the center of the image), named the original occlusion pattern, to contaminate the occlusion-free test images. An example of an occluded image for the original occlusion pattern is indexed by 1 in Fig. 3a.
For each trial, the OED was constructed by employing the occluded extra images synthesized only with the testing occlusion pattern. In Fig. 3, each trial, except the 22nd, is numbered by the index of the occlusion pattern used to synthesize the occluded extra images. The occluded extra images of the first and 22nd trials were corrupted by using the original occlusion pattern. The first 21 trials adopted exact-matching OEDs, while the last trial employed the intra-class random pairing OED. The OED used in each trial includes 630 OEVs associated with 21 image classes, 30 OEVs each.
Suppose the test images are contaminated with the $k$th occlusion pattern and the OED is constructed by using the extra image pairs associated with the $l$th occlusion pattern. The intersection between the occlusion error subspace of the $k$th occlusion pattern and the linear span of $B_l$ cannot be measured directly, since the exact occlusion error subspace is unavailable. Instead, we appraise the intersection between the linear spans of $B_l$ and $B_k$ as an approximation, since the linear span of $B_k$ apparently has the smallest deviation from the occlusion error subspace of the $k$th occlusion pattern among all of the occlusion patterns.
The normalized mean correlation between the column vectors of $B_k$ and $B_l$ is employed to assess the degree of the intersection between their linear spans. The normalized mean correlation is defined as
$$\rho(B_k, B_l) = \frac{1}{N_k N_l} \sum_{i=1}^{N_k} \sum_{j=1}^{N_l} \rho_{ij}(B_k, B_l),$$
where $N_k$ and $N_l$ are the numbers of column vectors of $B_k$ and $B_l$, respectively, and $\rho_{ij}(B_k, B_l)$ is the Pearson correlation coefficient [43] between the $i$th OEV of $B_k$ and the $j$th OEV of $B_l$. A high magnitude of $\rho(B_k, B_l)$ indicates a strong correlation and hence a large intersection.
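A straightforward NumPy computation of the normalized mean correlation might look like the sketch below; averaging the pairwise Pearson coefficients over all column pairs is our reading of the definition, and the toy matrices are illustrative.

```python
import numpy as np

def normalized_mean_correlation(Bk, Bl):
    """Mean Pearson correlation between every column of Bk and every column of Bl."""
    def standardize(B):
        B = B - B.mean(axis=0, keepdims=True)          # center each column
        return B / np.linalg.norm(B, axis=0, keepdims=True)
    Ck, Cl = standardize(Bk), standardize(Bl)
    return (Ck.T @ Cl).mean()    # entry (i, j) is rho_ij(Bk, Bl); average them

rng = np.random.default_rng(1)
base = rng.standard_normal(128)
Bk = base[:, None] + 0.1 * rng.standard_normal((128, 20))  # correlated columns
Bl = base[:, None] + 0.1 * rng.standard_normal((128, 30))  # same occlusion "class"
Br = rng.standard_normal((128, 30))                        # unrelated columns
print(normalized_mean_correlation(Bk, Bl) > abs(normalized_mean_correlation(Bk, Br)))
```

Columns built around the same underlying vector score near 1, while unrelated columns score near 0, matching the intuition that a high magnitude signals a large subspace intersection.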
The result for each trial is shown in Fig. 3b. Compared with the approach without SDBE_L2, which achieves 55.4% classification accuracy, all of the evaluated variants of the original OED (index 1) improve the performance significantly, though by less than the original OED. Therefore, in practice, we can collect occluded images associated with occlusion patterns that are croppings, noisings, scalings, rotations, and translations of the original occlusion patterns. We also note that the intra-class random pairing method achieves an improvement similar to that of the exact-matching method (index 22 vs. index 1). Thus, in practice, we can employ the intra-class random pairing method to save much labor.
By comparing the dashed (blue) line and the solid (red) line in Fig. 3b, we can see that the improvement is highly related to the normalized mean correlation ρ. The results manifest that the overlap between the linear span of the OED and the occlusion error subspace is a decisive factor in the performance of SDBE_L2. This also indicates that, in practice, ρ can be employed to estimate the effectiveness and contribution of the acquired extra image pairs to the performance improvement.
In the second experiment, we evaluated the performance for the intra-class occlusion pattern (example 1 in Section IV-B). The classification results are plotted in Fig. 4. For each testing occlusion ratio, the OED was constructed by using the extra image pairs associated merely with the testing occlusion ratio; e.g., for the test of 25% occlusion, only the occluded extra images with 25% occlusion were employed to construct the OED. At each testing occlusion ratio, the intra-class occlusion pattern $z_2$ achieves a significant improvement over that without SDBE_L2, close to that achieved by the original occlusion pattern $z_1$, whereas the inter-class occlusion pattern $z_3$ achieves a very small improvement.
From the above two experiments, we know that if the occlusion patterns in the extra image pairs are associated with the same occlusion classes as those in the occluded test images, the OED can make a positive contribution to the classification. This indicates that the OEVs associated with the same occlusion class are highly correlated.

2) OXFORD-102 FLOWER DATASET
We conducted a performance comparison between the intra-dataset occlusion-free extra images and the inter-dataset occlusion-free extra images (examples 1 and 2 in Section IV-A) on the Oxford-102 flower dataset.
The first 83 image classes of the Oxford-102 flower dataset were regarded as the class set. The images of the class set in the training set, validation set, and testing set, which are defined in [33], were adopted as the training images (10 images per image class), occlusion-free validation images (10 images per image class), and occlusion-free test images, respectively.

(Fig. 4 caption: Comparison of the classification accuracies with respect to the occlusion ratio for the OEDs associated with the original occlusion patch $z_1$, intra-class occlusion patch $z_2$, and inter-class occlusion patch $z_3$. $z_1$ and $z_2$ are drawn from the same occlusion class, while $z_1$ and $z_3$ are from distinct occlusion classes. The occluded test images are synthesized by contaminating the occlusion-free test images with the occlusion patch $z_1$ scaled to the testing occlusion ratios.)
Two types of extra sets were evaluated for comparison. The first type, in which the occlusion-free extra images were randomly drawn from the remaining 19 image classes, 30 images each, is a case of intra-dataset occlusion-free extra images. The second type is a case of inter-dataset occlusion-free extra images, which adopts the occlusion-free extra images used in Section VI-A1, except for those in ''sunflower'' and ''water_lilly'', which are two image classes of flowers.
For each occlusion ratio, eight occlusion patterns were employed to synthesize the occluded images: four occlusion patches (shown in Fig. 5a) at two occlusion positions, center and off-center (illustrated in the examples of occluded images in Fig. 5b).
Similar to the experiment in Fig. 4, for each testing occlusion ratio, the occluded extra images were synthesized with the occlusion patterns of the testing occlusion ratio. The exact-matching method was employed to form the extra image pairs. The occluded and occlusion-free validation images were employed to determine the regularization parameter for the linear SVM and the hyperparameter λ for SDBE_L2. After the hyperparameters were obtained, the training images and the occlusion-free validation images were combined as the set of training images to train the linear SVM for testing.
The experimental results are reported in Fig. 5c. From the results, we can observe that the extra sets of the second type achieve a significant improvement for the occluded test images, though smaller than that of the first type. This result demonstrates that both the intra-dataset and the inter-dataset occlusion-free extra images are applicable to SDBE_L2. It is worth noting that the inter-dataset occlusion-free extra images can be irrelevant to the task. In practice, exploiting task-irrelevant images dramatically alleviates the difficulty of collecting the extra images.

B. EXPERIMENTS ON DICTIONARY COMPRESSION
In this section, we concern ourselves with dictionary compression. We evaluated both UPSPC and UPIPC on a large-scale synthetic occluded image dataset with numerous occlusion patterns. The original ''fc1000'' and ''prob'' layers of the pre-trained ResNet-152 network, which together constitute a softmax classifier, were adopted as the classifier in the experiments.
Due to the lack of publicly available large-scale datasets with numerous occlusion patterns, we adopted the ILSVRC2012 dataset as the set of occlusion-free images and synthesized the occluded images from the occlusion-free images. To match the input size of the base-CNN, each image was first resized so that its shorter side was 256 pixels, and then a 224 × 224 center crop was taken as the occlusion-free image for further processing.
The ILSVRC2012 dataset was split into the class set, extra set, and occlusion patch set. The splitting configuration is the same as in [24], i.e., 900 image classes in the class set, 20 image classes in the extra set, and 80 image classes in the occlusion patch set.
To make the uncompressed CD tractable for comparison on our computer, a subset of the original training images, formed by randomly drawing 20 original training images from each image class of the class set, was adopted as the training set for SDBE_L2.
The original validation images in the class set were treated as the occlusion-free test images. To reduce the evaluation workload, the occluded test images were synthesized from a subset of the occlusion-free test images, obtained by randomly drawing five images per image class. We compared the classification accuracies of the subset and the whole set of the occlusion-free test images with the original ResNet-152 network and observed merely a very small increase (0.34%) in classification accuracy for the subset. This indicates that the statistics of the subset deviate negligibly from those of the whole set, and thus the occluded test images synthesized from the subset can statistically represent those generated from the whole set.
A total of 288 occlusion patterns were employed to synthesize the occluded test images. 36 occlusion patches, as shown in Fig. 6a, were segmented from images randomly drawn from 12 image classes of the occlusion patch set, three images per class. Each occlusion patch has two occlusion ratios (10% and 20%) and eight corruption positions (four positions for each occlusion ratio), which were randomly and independently generated for each occlusion patch.
A total of 100 occlusion-free extra images, randomly drawn from the 20 image classes of the extra set, five images per image class, were used in the experiment. 28800 occluded extra images were synthesized from the occlusion-free extra images with all 288 occlusion patterns. Two types of original OEDs, the exact-matching OED and the intra-class random pairing OED, denoted by the superscripts e and i, respectively, were evaluated in the experiment.
By employing the OED construction method presented in [24], the generated OED is a 2048 × 28800 matrix. Taking the CD of 2048 × 18000 into account, we obtain a matrix D of size 2048 × 46800. Accordingly, $D^T D$, which is used to compute P, is a 46800 × 46800 matrix. In double-precision representation, $D^T D$ requires over 17 GB of memory to store, which cannot be handled by many desktop computers.
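The memory figure can be checked with a quick back-of-the-envelope computation (double precision, 8 bytes per matrix entry):

```python
# D is 2048 x 46800 (CD of 2048 x 18000 plus OED of 2048 x 28800),
# so D^T D is 46800 x 46800 in double precision.
n = 18000 + 28800
bytes_dtd = n * n * 8
print(f"{bytes_dtd / 1e9:.1f} GB")  # 17.5 GB
```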
We first evaluated UPSPC for the CD and OED. To show the advantage of the uncentered PCA over the centered PCA, we also evaluated the centered PCA-based single partition compression (CPSPC), in which the uncentered PCA of UPSPC is replaced with the centered PCA. We employ the superscripts ''UPSPC'' and ''CPSPC'' to denote the CDs or OEDs compressed by using UPSPC and CPSPC, respectively.
The classification results on the synthetic occluded test images are tabulated in TABLE 1. The original CDs or OEDs are indicated by the superscript ''org''. UPSPC successfully reduces the size of the dictionary to m × 2m (over 11 times smaller than the original dictionary) without affecting the classification accuracy for both the exact-matching OED and the intra-class random pairing OED.

Then, we evaluated UPIPC. Similarly, the centered PCA-based intra-class partition compression (CPIPC), in which the uncentered PCA of UPIPC is substituted with the centered PCA, was evaluated for comparison. By the superscripts ''UPIPC'' and ''CPIPC'', we denote the CDs or OEDs compressed by using UPIPC and CPIPC, respectively. In addition, the random selection-based compression approach, denoted by the superscript ''rnd'', was also evaluated for the intra-class partition. In the random selection-based compression approach, $m_{A_i}$ (or $m_{B_j}$) column vectors are randomly drawn from the $i$th (or $j$th) intra-class sub-CD (or sub-OED) to construct the CD (or OED).
We first evaluated the compression for the CD. Without loss of generality, only the exact-matching OED was adopted for evaluation. For simplicity, the $m_{A_i}$'s were set to the same value for all $i$'s. The classification accuracy averaged over all of the occluded test images is shown in Fig. 7, where only the results for $B^{\mathrm{UPSPC},e}$ are plotted because $B^{\mathrm{org},e}$ and $B^{\mathrm{UPSPC},e}$ yield the same results.
From Fig. 7, we can observe that the uncentered PCA-based approach is slightly better than the centered PCA-based approach and much better than the random selection-based approach. Although, for the intra-class partition, the compressed CD can achieve a classification accuracy close to that of the uncompressed CD by reserving only 5 PCs for each sub-dictionary, the size of the compressed CD is much larger than for the single partition (2048 × 4500 vs. 2048 × 2048) owing to the large number of image classes. This result shows that for a large number of image classes, the intra-class partition is worse than the single partition.
Next, we evaluated the compression for the OED. Similar to the experiment for the CD, the $m_{B_j}$'s were set to the same value for all $j$'s. UPIPC, CPIPC, and the random selection-based compression approach were evaluated. Both the exact-matching OED and the intra-class random pairing OED were adopted for compression. The classification accuracies averaged over all of the occluded test images are shown in Fig. 8.
From Fig. 8, we note that by reserving only 5 PCs for each sub-dictionary, i.e., a total of 60 column vectors for the compressed OED (far fewer than for the single partition, 60 vs. 2048), the OEDs compressed by using UPIPC and CPIPC both achieve a slightly higher classification accuracy than that compressed by using UPSPC, which yields the same results as the original OED. This result demonstrates that for a small number of occlusion classes, the intra-class partition can achieve a higher compression ratio than the single partition.
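The lesson of the two compression experiments can be summarized with a quick size check; the sketch below uses the counts reported in this section (900 image classes, 12 occlusion classes, 5 reserved PCs, m = 2048).

```python
# Intra-class partition beats single partition only when
# (number of classes) x (PCs per class) < m (the single-partition bound).
m = 2048
cd_classes, oed_classes, pcs = 900, 12, 5

print(cd_classes * pcs < m)   # False: 4500 columns, worse than single partition
print(oed_classes * pcs < m)  # True: 60 columns, better than single partition
```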
The slightly higher classification accuracy can be attributed to the mitigation of the influence of the outlier OEVs. The original OEDs usually contain some outlier OEVs that are harmful to SDBE_L2 and are mainly captured by the PCs associated with small singular values. By ignoring these less important PCs, the compressed OED can alleviate the influence of the outlier OEVs, thus achieving a better result. We can also observe that the uncentered PCA-based approach, UPIPC, achieves the best results for both the exact-matching OED and the intra-class random pairing OED. This indicates that the advantage of the proposed approach is irrelevant to the pairing method.
In the above compression experiments, we should note that if a centered PCA-based compression approach is applied to both the CD and the OED, the classification accuracy degrades dramatically, whereas if it is applied to only one of them, the classification result is just slightly worse than that achieved by the uncentered PCA-based compression approach. The SDBE approach is based on the uncorrelatedness or independence between the CD and the OED [24]. In the centered principal subspaces, the CD and the OED are no longer nearly uncorrelated or independent. Therefore, applying the centered PCA to both the CD and the OED gives rise to low classification accuracy.
We should also note that the intra-class random pairing OED performs similarly to the exact-matching OED (a small performance loss of less than 0.5% in classification accuracy). This result demonstrates that the intra-class random pairing method is an effective alternative for reducing the construction workload of the OED.

VII. CONCLUSION
To reduce the difficulty and workload in acquiring the extra image pairs for the SDBE_L2-based classification scheme, we have given several examples of useful types of occlusion-free extra images and occluded extra images and introduced the intra-class random pairing method to semi-automatically form the extra image pairs. The extensive experiments on various synthetic occluded image datasets show that, compared with the original OED in [24], the enumerated examples and the intra-class random pairing method result in only a small loss in classification accuracy. We have also observed that the classification performance for a variant of the OED is highly related to its normalized mean correlation to the original OED.
In addition, in order to decrease the dictionary size, we have proved that SDBE_L2 is invariant to the uncentered PCA and proposed two novel uncentered PCA-based dictionary compression approaches, UPSPC and UPIPC. For UPSPC, the size of the dictionary can be reduced to not larger than twice the column vector length. For the OED (or CD) associated with a small number of occlusion classes (or image classes), the dictionary can be shrunk further by UPIPC. The proposed dictionary compression approaches facilitate the application of the SDBE_L2-based classification scheme to the large-scale datasets with numerous occlusion patterns. The experiments conducted on the large-scale synthetic occluded image dataset have demonstrated the effectiveness of the proposed dictionary compression approaches.
Although UPIPC can achieve better compression under the specific conditions identified in this paper, the adoption of UPIPC is manually determined. An automatic adoption approach needs to be investigated in the future. In addition, we have only proposed improvements from the perspective of dictionary generation. Promoting the classification accuracy on occluded images from the perspective of base-CNN training deserves investigation in the future. Integrating the SDBE_L2-based approach into other computer vision tasks, such as object detection and semantic segmentation, to improve performance against occlusion is also a future research direction.