An Interpretable Compression and Classification System: Theory and Applications

This study proposes a low-complexity, interpretable classification system. The proposed system contains three main modules: feature extraction, feature reduction, and classification, all of which are linear. Thanks to this linearity, the extracted and reduced features can be inverted back to the original data, as in a linear transform such as the Fourier transform, so that one can quantify and visualize the contribution of individual features to the original data. The reduced features and the reversibility also naturally endow the proposed system with data compression ability: the system can significantly compress data with only a small percent deviation between the compressed and the original data, and when the compressed data is used for classification, it still achieves high testing accuracy. Furthermore, we observe that the extracted features of the proposed system can be approximated by uncorrelated Gaussian random variables. Hence, classical estimation and detection theory can be applied to classification, which motivates us to propose a MAP (maximum a posteriori) based classification method. As a result, the extracted features and the corresponding performance are statistically meaningful and mathematically interpretable. Simulation results show that the proposed classification system not only enjoys significantly reduced training and testing time but also achieves high testing accuracy compared to conventional schemes.


I. INTRODUCTION
Classification for multimedia has been studied and applied to a variety of applications such as security, entertainment, and forensics for years. The development of intelligent classification algorithms results in efficiency improvements, innovations, and cost savings in several areas. However, classification based on visual content is a challenging task because there is usually a large amount of intra-class variability caused by different lighting conditions, misalignment, blur, and occlusion. To overcome this difficulty, numerous feature extraction modules have been studied to extract the most significant features so that the developed models can achieve higher accuracy. Selecting a suitable classification module to handle the extracted features according to their properties is also important. A good combination of feature extraction and classification modules is the key to attaining high accuracy in image classification problems.
In general, classification models can be divided into two categories: 1) convolutional neural network (CNN) structures and 2) non-CNN based structures. The CNN structures are the mainstream [1]-[6]; they are usually stacked up with several convolution layers, max-pooling layers, and ReLU layers, so the structure can be deep, as in ResNet [6]. Due to the deep structure, CNN models can extract features well and achieve high accuracy. However, CNN models usually share several disadvantages, including mathematical intractability, irreversibility, and long training time. Most CNN models use handcrafted feature extraction or backpropagation to train the convolution layers, which makes the models mathematically inexplicable and time-consuming. In addition, the max-pooling and ReLU layers are nonlinear, so the structures become irreversible. These disadvantages increase the difficulty of designing and adjusting suitable structures for various types of data sources. Therefore, research has been conducted to design a model that is easily and fully interpretable mathematically, like the second category introduced below.
Classification models using non-CNN based structures are mainly built with two parts: feature extraction modules and classification modules. To design a model that is mathematically interpretable, several machine learning techniques for feature extraction or classification have been developed, including various versions of principal component analysis (PCA) [7]-[17], linear discriminant analysis (LDA) [18]-[25], the support vector machine (SVM), the nearest neighbor (NN) classifier, etc. For the feature extraction modules, the authors of [7]-[17] used PCA, while the authors of [18]-[25] used LDA to extract features from images. For the classification modules, [8], [9], [13], [17] and [21] applied the nearest neighbor classifier, while [7] and [20] utilized the SVM classifier. The combinations of feature extraction and classification modules are generally nonlinear. Thus, it is difficult to invert the extracted features to recover the original data.
Inverting the features back to the original data reveals important information for classification. For instance, if certain significant features are extracted and they result in high classification accuracy, one would be interested in visualizing or quantifying these features in the original multimedia if this is feasible. The reversibility is analogous to a Fourier series: one can invert individual series terms, quantify how they contribute to the original signal, and understand the importance of each term. Due to the difficulty of data reversibility in most existing solutions, methods to compress data efficiently for classification purposes have not been well addressed yet.
Data compression for classification purposes leads not only to reduced storage size but also to decreased computational complexity. For instance, in a classification problem, if the extracted features can be reduced while the reduced features still achieve satisfactory testing accuracy and the image inverted from the reduced features keeps the important characteristics of the original image, one may store the reduced features instead of all features. The reduced features can also be used for classification to decrease computational complexity. Furthermore, such data compression is interpretable because it keeps the most significant features for classification.
In this study, we propose an interpretable classification system that achieves high testing accuracy with low complexity. The proposed system is linear and invertible; hence data compression via the system is possible. The proposed system can be divided into three parts: feature extraction, feature reduction, and classification of the reduced features. Thanks to the linear property of the proposed system, the extracted features can be inverted back to the original data to gain insight into individual features. Thus, it is like a linear transform, such as the Fourier transform, where both the forward and backward directions are feasible. The proposed system can also significantly reduce the extracted features while maintaining high classification accuracy. As a result, data compression for classification purposes is realizable via the proposed system. Moreover, we find that the extracted features in the proposed system can be approximated by uncorrelated Gaussian random variables; the Gaussian approximation was also observed in [7]. This result endows the proposed system and the extracted features with statistical meaning, so classical estimation and detection theory can be applied to classify the data [26]. It motivates us to use the concept of maximum a posteriori (MAP) estimation to detect the class of input images. Therefore, for every input image there is a probability that the image belongs to each class, and the detected class is the one with the maximum probability. Since every class has a probability for the input data, one can also determine top candidate classes for the data and develop more sophisticated algorithms to refine the classification results. Consequently, the proposal is an interpretable compression and classification system.
To verify the classification ability of the proposed system, we conduct experiments in face recognition ("who is this person?"). We use the Labeled Faces in the Wild (LFW) dataset [27], which is widely used by face recognition models such as [1]-[4] and [8]. Experimental results show that the proposed scheme outperforms conventional systems in terms of both testing accuracy and computational complexity. Moreover, the training and testing of the proposed system are much faster than conventional schemes: on a standard PC platform, two datasets with 2804 and 6592 images take only 11 and 16 seconds for training, respectively, and recognizing the class of one image takes less than 1 msec. The accuracy reaches 97.61% for a 19-person dataset and 84.71% for a 158-person dataset. Furthermore, thanks to the linear property, the proposed system can invert the reduced features to the original image with a significant compression ratio. In our experiments, the proposed system reduces the number of features from 12288 to 270, a compression ratio of up to 45.51:1; at the same time, the average deviation between the original and compressed images is only 9.14%, and more importantly the testing accuracy is 97.61% for the 19-person dataset and 84.71% for the 158-person dataset.
The remainder of the paper is organized as follows. In Section II, we present the linear recognition model and the corresponding algorithms. In Sections III-V, the individual modules of the proposed system are explained in detail: Section III explains the linear feature extraction module, Section IV describes the linear feature reduction module, and Section V introduces the linear classification module. In Section VI, we provide experimental results to show the advantages of the proposed system. Conclusions and future work are given in Section VII.

II. PROPOSED RECOGNITION MODEL AND METHODS
A block diagram of the proposed recognition model is shown in Fig. 1; it consists of preprocessing, feature extraction, feature reduction, and classification modules. The preprocessing module handles the original defects of the dataset, such as noise, contrast, and brightness of the images. It contains operations including object detection, image processing, and data augmentation, and it is an adjustable module designed according to the dataset. The feature extraction module extracts the significant features from the dataset and has multiple stages. Each stage is a linear transformation; hence the transformation is invertible and the images can be reconstructed from the extracted features. This point will become clearer later.
After the feature extraction module, the features are reduced for data compression purposes and to avoid overfitting. Finally, the reduced features are passed through the classification module, where we propose to use a linear discriminant classifier because a Gaussian statistical model can be assumed after the proposed linear transformation. Because the proposed system is linear and based on statistics, many results can be explained and improved mathematically. We explain the individual modules in more detail in the following sections.

III. FEATURE EXTRACTION WITH VARIABLE CUBOID SIZE
In this section, we introduce the feature extraction module, which is linear; hence both the forward direction (data to extracted features) and the inverse direction (features to original data) are feasible. The extracted features can be assumed to be uncorrelated Gaussian random variables. These points are explained in the following subsections.

A. Forward Direction
The proposed feature extraction module is a fast, linear, data-driven feedforward transformation with low space complexity. Referring to the block diagram in Fig. 1, let $Q(n) \in \mathbb{R}^{I\times J}$ be the $n$th preprocessed image of size $I \times J$, where $n = 1, 2, \cdots, N$ and $N$ is the total number of input images. In the testing phase, the value of $N$ can be one, while in the training phase it has to be more than one to find the transform kernels. Then, we feed all of the preprocessed images into the proposed multi-stage scheme and let the initial global cuboid $G^0(n) \in \mathbb{R}^{I^0\times J^0\times K^0}$ be $Q(n)$, where the superscript is the stage index and $I^0 = I$, $J^0 = J$, and $K^0 = 1$.
Assume that there are $P$ stages. At each stage, we reshape the global cuboid into multiple non-overlapping local cuboids, perform principal component analysis on the individual cuboids, collect the results, and reshape them into another global cuboid for the next stage. According to the size of the input images, one can adjust the side lengths of the local cuboids in the vertical and horizontal directions at each stage. Let the side lengths of the local cuboids at Stage $p$ be $l_i^p \times l_j^p \times l_k^p$, where $l_i^p$ and $l_j^p$ are adjustable. The values of $l_i^p$, $l_j^p$, and $l_k^p$ satisfy $\prod_{p=1}^{P} l_i^p = I$, $\prod_{p=1}^{P} l_j^p = J$, and $l_k^p = K^{p-1}$, so that the local cuboids exactly tile the global cuboid at every stage. When the initial input and the side lengths $l_i^p$, $l_j^p$ are given, the input can be processed by the multi-stage scheme. The dataflow and the dimension conversion at Stage $p$ are shown in Fig. 2 and explained in Steps 1-3 below. The stage index $p$ is in increasing order, i.e., $p = 1, 2, \cdots, P$.

Step 1. Partition the global cuboid into non-overlapping local cuboids. The global cuboid $G^{p-1}(n)$ is partitioned into $I^p \times J^p$ local cuboids of size $l_i^p \times l_j^p \times l_k^p$, where $I^p = I^{p-1}/l_i^p$ and $J^p = J^{p-1}/l_j^p$. For instance, let the side lengths $l_i^p$ and $l_j^p$ be 4: if the image at Stage $p-1$ has size $64 \times 64$, the size becomes $16 \times 16$ at Stage $p$.
Step 2. Principal component analysis on the local cuboids. The principal components are used as the transform kernels. Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of possibly correlated observations into a linearly uncorrelated set. When data vectors are projected onto the principal components, the leading PCA coefficients have larger variances, which helps separate the data into different classes.
First, we reshape each local cuboid into a vector
$$f_{i,j}^p(n) \in \mathbb{R}^{K^p}, \quad (6)$$
where $K^p$ is equal to the volume of the local cuboid, $K^p = l_i^p l_j^p l_k^p$. Second, the principal components of the vectors $f_{i,j}^p(n)$ are calculated. The principal components are the eigenvectors of the covariance matrix of the data vectors. Let the mean of all data vectors be
$$\bar{f}_{i,j}^p = \frac{1}{N}\sum_{n=1}^{N} f_{i,j}^p(n).$$
The covariance matrix $R_{i,j}^p$ is then calculated as
$$R_{i,j}^p = \frac{1}{N}\sum_{n=1}^{N}\left(f_{i,j}^p(n) - \bar{f}_{i,j}^p\right)\left(f_{i,j}^p(n) - \bar{f}_{i,j}^p\right)^T.$$
We then find the $K^p$ eigenvectors of $R_{i,j}^p$ and collect them as the columns of $B_{i,j}^p$, where each column $[B_{i,j}^p]_k$ is a principal component of $f_{i,j}^p(n)$. Third, we project the vectors $f_{i,j}^p(n)$ onto the transform kernels (principal components) and obtain the PCA coefficients
$$g_{i,j}^p(n) = B_{i,j}^{p\,T} f_{i,j}^p(n). \quad (9)$$
Step 3. Reshape the PCA coefficients and form another global cuboid $G^p(n)$. We reshape all the PCA coefficients obtained in Step 2 so that each coefficient vector is placed in the spectral direction,
$$g_{i,j}^p(n) \rightarrow g_{i,j,:}^p(n). \quad (10)$$
Combining all $g_{i,j,:}^p(n)$ yields the global cuboid
$$G^p(n) \in \mathbb{R}^{I^p\times J^p\times K^p}. \quad (11)$$
The dimension conversion of the global cuboids from Stage $p-1$ to Stage $p$ can thus be written as $I^{p-1}\times J^{p-1}\times K^{p-1} \rightarrow (I^{p-1}/l_i^p) \times (J^{p-1}/l_j^p) \times (l_i^p l_j^p K^{p-1})$.
Stop criterion. The whole process stops when the last stage, Stage $P$, is reached and the global cuboids degenerate in the spatial directions, i.e., $I^P = J^P = 1$. At the final stage, one obtains the final PCA coefficients, where the constraints on $l_i^p$, $l_j^p$, and $l_k^p$ guarantee that $K^P = I \times J$. As a result, there is no growth in space complexity, which remains $O(1)$ throughout the whole process; note that the space complexity doubles at each stage for the system in [7]. The global cuboid at the final stage can then be reshaped to the PCA coefficient vector $g^P(n) \in \mathbb{R}^{K^P}$. The PCA coefficients at the final stage are also called features; they are used to determine the class of the input images.
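As an illustration, one stage of the multi-stage extraction (Steps 1-3) can be sketched in NumPy as follows; the function name, toy sizes, and the dictionary of kernels are our own choices, not part of the original description.

```python
import numpy as np

def pca_stage(G, li, lj):
    """One stage of the multi-stage linear feature extraction (a sketch).

    G      : array of shape (N, I, J, K) -- global cuboids for N images
    li, lj : local cuboid side lengths in the vertical/horizontal directions
    Returns the next global cuboid of shape (N, I//li, J//lj, li*lj*K)
    and the per-position transform kernels B[(i, j)] (principal components).
    """
    N, I, J, K = G.shape
    Ip, Jp, Kp = I // li, J // lj, li * lj * K   # dimension conversion
    G_next = np.empty((N, Ip, Jp, Kp))
    kernels = {}
    for i in range(Ip):
        for j in range(Jp):
            # Step 1: cut out the (i, j)-th local cuboid and flatten it
            block = G[:, i*li:(i+1)*li, j*lj:(j+1)*lj, :].reshape(N, Kp)
            # Step 2: principal components = eigenvectors of the covariance
            mean = block.mean(axis=0)
            R = (block - mean).T @ (block - mean) / N
            eigvals, B = np.linalg.eigh(R)
            B = B[:, ::-1]                       # descending eigenvalue order
            kernels[(i, j)] = B
            # project onto the kernels to obtain the PCA coefficients
            G_next[:, i, j, :] = block @ B
    return G_next, kernels

# toy run: 8 images of size 4x4, local cuboid 2x2x1
rng = np.random.default_rng(0)
G0 = rng.standard_normal((8, 4, 4, 1))
G1, kernels = pca_stage(G0, 2, 2)
print(G1.shape)   # (8, 2, 2, 4)
```

Note that the number of coefficients is preserved at every stage (here $8 \times 4 \times 4 \times 1 = 8 \times 2 \times 2 \times 4$), which is the source of the $O(1)$ space complexity claimed above.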

B. Gaussian Approximation of Features
As mentioned in Step 2 of the previous subsection, PCA is a procedure that converts a set of possibly correlated coefficients into uncorrelated ones. Hence the coefficients at the final stage should be almost uncorrelated. To see this, the $(i,j)$th element of the correlation coefficient matrix $\rho_{g^P}$ of the PCA coefficients at the final stage is defined as
$$[\rho_{g^P}]_{i,j} = \frac{[C_{g^P}]_{i,j}}{\sqrt{[C_{g^P}]_{i,i}\,[C_{g^P}]_{j,j}}},$$
where $C_{g^P} = \frac{1}{N}\sum_{n=1}^{N}\left(g^P(n) - \bar{g}^P\right)\left(g^P(n) - \bar{g}^P\right)^T$ and $\bar{g}^P$ is the mean of $g^P(n)$,
$$\bar{g}^P = \frac{1}{N}\sum_{n=1}^{N} g^P(n). \quad (18)$$
For example, let $g^P$ be the feature vector that will be detailed in Experiment 1 in Sec. VI, where the dimension is 90. The obtained correlation coefficient matrix is shown in Fig. 5.
From the figure, we see that the PCA coefficients at the final stage are almost uncorrelated. In addition to the uncorrelatedness among features, we find that the statistics of individual features can be approximated by a Gaussian distribution. The histograms of some randomly picked samples are shown in Fig. 4, in which the Gaussian approximation matches the histograms well. Beyond this dataset, we have also verified various other datasets and observed similar Gaussian approximation results for the proposed system. Knowing that the features are nearly uncorrelated and approximately Gaussian, we can treat the features as a Gaussian mixture model (GMM) across classes. The GMM phenomenon was also reported in [7]. Approximating the features using the GMM is useful and motivates us to use the linear discriminant classifier, a MAP-based detector, introduced in Sec. V.
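The uncorrelatedness claim is easy to check numerically. The following sketch uses simulated data standing in for the real features $g^P(n)$; on real multi-stage features the off-diagonal entries would only be approximately zero, whereas for exact single-block PCA they vanish up to floating-point error.

```python
import numpy as np

# Numerical check that PCA coefficients are (nearly) uncorrelated.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 6))        # 500 samples of 6 variables
X[:, 1] += 0.8 * X[:, 0]                 # inject correlation between two of them
mean = X.mean(axis=0)
R = (X - mean).T @ (X - mean) / len(X)   # covariance matrix of the raw data
_, B = np.linalg.eigh(R)                 # principal components
G = (X - mean) @ B                       # PCA coefficients

# correlation coefficient matrix rho of the coefficients
C = np.cov(G, rowvar=False)
rho = C / np.sqrt(np.outer(np.diag(C), np.diag(C)))
off_diag = rho - np.diag(np.diag(rho))
print(np.abs(off_diag).max())            # ~0: the coefficients are uncorrelated
```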

C. Inverse Direction
Since the proposed scheme is a linear transformation, both the forward and inverse transformations can be conducted. The inverse transformation reconstructs the preprocessed image $Q(n)$ from the PCA coefficients $g^P(n)$ at the final stage. We elaborate on the inverse transformation in the following steps, where the stage index $p$ is now in decreasing order, i.e., $p = P, P-1, \cdots, 1$.
Step 1. Reshape the PCA coefficients back to vectors $g_{i,j}^p(n)$. This is a simple inverse of the reshape operation in (10).
Step 2. Project the vectors onto the inverse transform kernel. Referring to (9), the inverse kernel of $B_{i,j}^{p\,T}$ is simply its inverse matrix; since $B_{i,j}^p$ is orthogonal, the backward result of this step is
$$f_{i,j}^p(n) = B_{i,j}^p\, g_{i,j}^p(n).$$
Step 3. Reshape $f_{i,j}^p(n)$ back to the local cuboids. Referring to (6), this step inverts the vectorization of the local cuboids.
Step 4. Form the global cuboid and take the PCA coefficients. The local cuboids are collected to form the global cuboid by inverting (4), and the PCA coefficients for the previous stage are taken by inverting (11).
By conducting Steps 1-4 for all $P$ stages, the preprocessed images $Q(n)$ can be recovered losslessly.
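Because each kernel $B_{i,j}^p$ is orthogonal, inversion amounts to multiplying by the transpose. A minimal single-block sketch of the forward and inverse directions (synthetic data and our own variable names):

```python
import numpy as np

rng = np.random.default_rng(2)
F = rng.standard_normal((100, 16))   # flattened local cuboids f(n)
R = np.cov(F, rowvar=False)
_, B = np.linalg.eigh(R)             # principal components (orthonormal columns)
G = F @ B                            # forward:  g = B^T f
F_rec = G @ B.T                      # inverse:  f = B g
print(np.allclose(F, F_rec))         # True: the transform is lossless
```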

IV. FEATURE REDUCTION
Invertibility is an important property of the proposed system. One can reduce the features, keeping only the significant elements of $g^P(n)$. By inverting the reduced version of $g^P(n)$, the original image can still be reconstructed with the important features preserved. One may treat this as lossy data compression dedicated to classification purposes, or as feature filtering. Next, we introduce how to reduce the features and how to recover the data from the reduced features.

A. Proposed Methods for Feature Reduction
To retain the features that have the largest discriminative power, PCA can again be applied. When data are projected onto the significant principal components, the corresponding coefficients have larger variances; hence coefficients with larger variances can be considered more significant and more discriminative than those with smaller ones. The problem therefore reduces to keeping the features with the largest variances. The variance vector of the features is defined elementwise as
$$[\sigma_{g^P}]_k = \frac{1}{N}\sum_{n=1}^{N}\left([g^P(n)]_k - [\bar{g}^P]_k\right)^2,$$
where $\bar{g}^P$ is the mean defined in (18). When $\sigma_{g^P}$ is obtained, one finds the set of indices $\mathcal{I}$ corresponding to the $D$ largest variances. Once the set $\mathcal{I}$ is found, the features are reduced accordingly: the feature vector after feature reduction is $x(n) = [g^P(n)]_{\mathcal{I}}$, a size reduction from $K^P$ to $D$. The vector $x(n)$ contains the top significant features of the $n$th image and represents this image in the training and testing phases. Determining a suitable value of $D$ is important. This value can be determined in the training phase: given a set of training data, one sets $D$ so that the classifier achieves the best performance, as will be shown in the simulation results later.
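A minimal sketch of the variance-based reduction (the helper name and toy dimensions are ours):

```python
import numpy as np

def reduce_features(gP, D):
    """Keep the D features of gP (shape (N, K_P)) with the largest variances.

    Returns the reduced features x (shape (N, D)) and the kept index set I.
    """
    var = gP.var(axis=0)                # variance of each feature over the data
    idx = np.argsort(var)[::-1][:D]     # indices of the D largest variances
    return gP[:, idx], idx

# toy data: 50 images, 4096 features with increasing spread
rng = np.random.default_rng(3)
gP = rng.standard_normal((50, 4096)) * np.linspace(0.1, 10, 4096)
x, I = reduce_features(gP, 90)
print(x.shape)    # (50, 90)
```

In the training phase, one would sweep $D$ and keep the value giving the best classification accuracy, as described above.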

B. Recover Images from Reduced Features
As mentioned in Sec. III-C, the features before feature reduction, $g^P(n)$, can be used to reconstruct the preprocessed images losslessly. Reconstruction is also possible from the features after feature reduction, $x(n)$, although the image reconstruction is then lossy. The inverse transform differs only at Stage $P$, as elaborated below.
For Stage $P$:
Step 1. Form the transform kernel at Stage $P$ with feature reduction. The reduced kernel $\hat{B}_{i,j}^P$ is obtained by extracting from $B_{i,j}^P$ the columns indexed by the set $\mathcal{I}$. (25)
Step 2. Project the features after feature reduction. The feature vector $x(n)$ is projected onto the Moore-Penrose pseudo-inverse of the transform kernel, i.e., $(\hat{B}_{i,j}^{P\,T})^{\mathrm{pinv}} \in \mathbb{R}^{K^P\times D}$, which yields
$$\hat{f}_{i,j}^P(n) = (\hat{B}_{i,j}^{P\,T})^{\mathrm{pinv}}\, x(n),$$
where $(\hat{B}_{i,j}^{P\,T})^{\mathrm{pinv}}$ is the pseudo-inverse matrix of the $\hat{B}_{i,j}^{P\,T}$ in (25).
Step 3. Reshape to local cuboids. The vector $\hat{f}_{i,j}^P(n)$ is reshaped back to the local cuboid as in the lossless case. After Steps 1-3 for Stage $P$, the same procedure of Steps 1-4 introduced in Sec. III-C is conducted for the other stages, from $P-1$ down to 1. After the $P$-stage inverse transformation, the initial global cuboid $\hat{G}^0(n)$ is reconstructed. Due to the energy loss from the eliminated features, image processing is needed to reduce the reconstruction loss. First, we compensate the energy loss using a brightness gap $h(n)$, defined as the gap between the mean brightness of the original image and that of the reconstructed one. (28) Second, we apply histogram equalization to enhance the contrast and fix the range span. Histogram equalization is widely used, see e.g. [28] and [29]; here, we use the function histeq provided by Matlab. Finally, the recovered image $\hat{Q}(n)$ is obtained with a slight loss. (29) In Fig. 6, we provide an example using an image from the testing dataset that will be introduced in Sec. VI. The red layer of the original image has 4096 features, and we recover the image using only 90 features. In row (a), the red layer of the original image and its histogram are shown.
In row (b), we show the recovered image $\hat{G}^0(n)$ without any image processing; the image is darker than the original one and the contrast is poor, though one can still recognize it. In row (c), the brightness has been compensated by adding $h(n)$ defined in (28); hence the mean of the histogram shifts. In row (d), we show the recovered image $\hat{Q}(n)$ using (29). Both the image and its histogram are the closest to those in row (a). Thus, the image processing in (28) and (29) really helps when recovering images from a few reduced features. When $D < K^P$, reconstructing images from the reduced features can be regarded as lossy data decompression. Such data compression and decompression are meaningful for data classification. More specifically, if only the $D$ most significant features are needed to achieve a target classification performance, one can simply store the dataset with $D$ features instead of $K^P$ features, because both have comparable classification performance. Namely, we only store the top $D$ significant features $x(n) \in \mathbb{R}^D$ instead of the original data $Q(n) \in \mathbb{R}^{I\times J}$. This concept may be used to reduce the data size and computational complexity in classification algorithms. The compression ratio $r$ with feature reduction can then be written as $r = (I \times J) : D$. (30) The examples in Fig. 6 (a) and (d) thus have a compression ratio of $4096 : 90 = 45.51 : 1$.
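The lossy reconstruction and brightness-gap compensation can be sketched as follows on synthetic vectors; the toy sizes and the exact form of the compensation (matching each image's mean) are our assumptions, standing in for (28).

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, D = 40, 64, 12
F = rng.standard_normal((N, K)) + 2.0        # flattened image data with a DC level
R = np.cov(F, rowvar=False)
eigvals, B = np.linalg.eigh(R)
B_hat = B[:, np.argsort(eigvals)[::-1][:D]]  # keep the D leading principal components
X = F @ B_hat                                # reduced features x(n), as in the forward pass

# inverse via the Moore-Penrose pseudo-inverse of the reduced kernel B_hat.T;
# because B_hat has orthonormal columns, pinv(B_hat.T) equals B_hat
F_rec = X @ np.linalg.pinv(B_hat.T).T        # lossy reconstruction, shape (N, K)

# brightness-gap compensation h(n): shift each image back to its original mean
h = F.mean(axis=1, keepdims=True) - F_rec.mean(axis=1, keepdims=True)
F_comp = F_rec + h

dev = 100 * np.linalg.norm(F - F_comp) / np.linalg.norm(F)
print(f"percent deviation: {dev:.1f}%, compression ratio {K}:{D} = {K/D:.2f}:1")
```

Matching each reconstructed image's mean can only decrease the Frobenius-norm error, which mirrors the improvement from row (b) to row (c) in Fig. 6.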

V. LINEAR DISCRIMINANT CLASSIFIER
After the feature reduction, the reduced features can be fed into the classifier, which is introduced here. As discussed in Sec. III-B, the features can be approximated as uncorrelated Gaussian random variables. Hence classical estimation and detection theory can be used, and the resulting performance has theoretical explanations. Given the reduced feature vector $x(n)$, we would like to determine which class it belongs to. The classical maximum a posteriori (MAP) estimator provides a good reference for this problem. In machine learning and pattern recognition, this concept is usually called linear discriminant analysis (LDA), which finds a linear combination of features that separates two or more classes. Here we combine the two concepts, LDA and the MAP estimator, to design the linear discriminant classifier as follows. First, let $y(n)$ be the output of the classifier, i.e., the class that $x(n)$ belongs to, and let $C_m$ be Class $m$. The MAP estimator is given by
$$y(n) = \arg\max_{m=1,\ldots,M} P(C_m \mid x(n)),$$
where $M$ is the number of classes. The posterior probability is defined as
$$P(C_m \mid x(n)) = \frac{P(x(n)\mid C_m)\, P(C_m)}{P(x(n))}, \quad (32)$$
where
$$P(x(n)) = \sum_{m=1}^{M} P(x(n)\mid C_m)\, P(C_m).$$
Since $P(x(n))$ is a constant for all classes, one can ignore the denominator in (32). As for the numerator in (32), the prior probability is
$$P(C_m) = \frac{n_m}{N},$$
where $n_m$ is the number of training images in class $m$ and $N = \sum_{m=1}^{M} n_m$ is the number of all training images. The mean of the feature vectors $x_m(n)$ in class $m$ is defined as
$$\mu_m = \frac{1}{n_m}\sum_{n \in C_m} x(n),$$
and the covariance matrix $\Sigma_m$ is defined as
$$\Sigma_m = \frac{1}{n_m}\sum_{n \in C_m}\left(x(n) - \mu_m\right)\left(x(n) - \mu_m\right)^T.$$
In the linear discriminant classifier, a common covariance matrix is needed so as to keep the likelihood function linear. In this work, we propose to use the pooled covariance matrix $\Sigma$ as the common matrix, defined as
$$\Sigma = \frac{1}{N}\sum_{m=1}^{M} n_m \Sigma_m. \quad (37)$$
The likelihood function $P(x(n)\mid C_m)$ for multivariate Gaussian random variables, with the common covariance, is then given by
$$P(x(n)\mid C_m) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}\left(x(n)-\mu_m\right)^T \Sigma^{-1} \left(x(n)-\mu_m\right)\right),$$
where recall that $D$ is the number of reduced features.
To simplify the calculation, referring to (32), we take the natural logarithm and ignore the denominator of the posterior probability $P(C_m\mid x(n))$, i.e.,
$$y(n) = \arg\max_{m}\; \ln\!\left[P(x(n)\mid C_m)\, P(C_m)\right].$$
The term $\ln P(x(n)\mid C_m) P(C_m)$ can be rewritten as
$$\ln P(x(n)\mid C_m) P(C_m) = -\frac{1}{2}\left(x(n)-\mu_m\right)^T \Sigma^{-1} \left(x(n)-\mu_m\right) + \ln P(C_m) - \frac{D}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma|.$$
From (37), since the covariance is common, the terms $\ln|\Sigma|$ and $\ln(2\pi)$ are identical for every class and can be ignored; expanding the quadratic, the term $x(n)^T \Sigma^{-1} x(n)$ is also common to all classes and can be dropped. The remaining terms can be written in a linear form consisting of a weight $w_m$ and a bias $b_m$ for each class, given by
$$w_m = \Sigma^{-1}\mu_m,$$
where the term
$$b_m = -\frac{1}{2}\mu_m^T \Sigma^{-1} \mu_m + \ln P(C_m)$$
is the bias. Finally, the classifier can be written in a linear equation form as follows:
$$y(n) = \arg\max_{m=1,\ldots,M}\; \left(w_m^T x(n) + b_m\right).$$
We summarize the proposed scheme for training and testing, including the linear feature extraction, feature reduction, and linear discriminant classifier, in Algorithms 1 and 2, respectively.
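A compact sketch of the linear discriminant classifier described above (function names are ours; the pooled covariance is computed as the sample-size-weighted average of the class scatters):

```python
import numpy as np

def fit_ldc(X, y, M):
    """Fit the linear discriminant classifier: per-class weights w_m and biases b_m.

    X : (N, D) reduced feature vectors; y : (N,) class labels in 0..M-1.
    """
    N, D = X.shape
    mus, priors = [], []
    Sigma = np.zeros((D, D))                  # pooled covariance matrix
    for m in range(M):
        Xm = X[y == m]
        mu = Xm.mean(axis=0)
        mus.append(mu)
        priors.append(len(Xm) / N)            # P(C_m) = n_m / N
        Sigma += (Xm - mu).T @ (Xm - mu)      # accumulate class scatter
    Sigma /= N
    Si = np.linalg.inv(Sigma)
    W = np.array([Si @ mu for mu in mus])                  # w_m = Sigma^-1 mu_m
    b = np.array([-0.5 * mu @ Si @ mu + np.log(p)
                  for mu, p in zip(mus, priors)])          # b_m
    return W, b

def classify(X, W, b):
    # y(n) = argmax_m  w_m^T x(n) + b_m
    return np.argmax(X @ W.T + b, axis=1)

# toy example: two Gaussian classes sharing a covariance
rng = np.random.default_rng(5)
X0 = rng.standard_normal((200, 3)) + np.array([2.0, 0.0, 0.0])
X1 = rng.standard_normal((200, 3)) + np.array([-2.0, 0.0, 0.0])
X = np.vstack([X0, X1]); y = np.array([0] * 200 + [1] * 200)
W, b = fit_ldc(X, y, 2)
acc = (classify(X, W, b) == y).mean()
print(acc)            # near 1.0 for well-separated classes
```

Since the scores $w_m^T x + b_m$ are monotone in the posterior probabilities, they can also be used to rank the top candidate classes for each image, as exploited in Experiment 3.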

VI. EXPERIMENTAL RESULTS
In this section, experimental results are provided to show the performance of the proposed classification system.
Dataset and Hardware. Labeled Faces in the Wild (LFW) is a well-known academic test set for face verification [1]. Two datasets are used in the experiments. For the first dataset, 158 classes are selected from the whole dataset, each with more than 6 images for training and 2 images for testing. For the second dataset, 19 classes are selected out of the 158 classes in the first dataset, each with more than 30 images for training and 10 images for testing. Table I details the sizes of the two datasets. An Intel(R) Core(TM) i7-8700 CPU and 16 GB of RAM are used to conduct the experiments.
Image Preprocessing. First, the face is detected and the background of the image is blacked out via an open-source Python file [30]. Then all the images are resized to 64 × 64. An overview of the images after the face detection operation and resizing is shown in Fig. 7. Second, the training images are augmented. The reason is that in the first dataset, some classes have only 6 images for training, which is insufficient to train a good model. Also, some faces do not face the same direction, as shown in Fig. 7. Therefore, we augment the training images by flipping them horizontally, which doubles the number of images. Third, we separate the three primary colors of the images into R, G, and B layers and then equalize the histogram of each layer individually. This step enhances the contrast of the images and enables the separation of the three primary color layers.

Algorithm 1: Training Process
1: Let the global cuboid size be $I^0 \times J^0 \times K^0$, where $I^0 = I$, $J^0 = J$, and $K^0 = 1$.
2: for $p = 1 : P$ do
3:   Partition the global cuboid into local cuboids of size $l_i^p \times l_j^p \times l_k^p$.
4:   $I^p = I^{p-1}/l_i^p$ and $J^p = J^{p-1}/l_j^p$
5:   for $i = 1 : I^p$ do
6:     for $j = 1 : J^p$ do
7:       Find the transform kernel (principal components) $B_{i,j}^p$ using the eigenvector-based method.
8:       Project $f_{i,j}^p(n)$ onto $B_{i,j}^p$ to obtain the PCA coefficients $g_{i,j}^p(n)$ using (9).
9:       Reshape $g_{i,j}^p(n)$ to $g_{i,j,:}^p(n)$ to form the global cuboid $G^p(n)$ using (11).
10:     end for
11:   end for
12: end for
13: Obtain the final PCA coefficients at the final stage and reshape them to the vector $g^P(n)$.
------- Linear Discriminant Classifier -------
14: Keep the $D$ features with the largest variances (index set $\mathcal{I}$) to obtain $x(n)$.
15: Compute the weight $w_m$ and bias $b_m$ for each class.
After preprocessing, the training data are passed through the proposed system, and Algorithm 1 is used to process the data.
Algorithm 2: Testing Process
Input: Local cuboid spatial sizes $l_i^p \times l_j^p$ (the same as in Algorithm 1), transform kernels $B_{i,j}^p$, weights $w_m$ and biases $b_m$, and remaining indices $\mathcal{I}$.
Output: The classification accuracy.
1: Let the global cuboid size be $I^0 \times J^0 \times K^0$, where $I^0 = I$, $J^0 = J$, and $K^0 = 1$.
------- Linear Discriminant Classifier -------

Experiment 1: Testing accuracy as a function of the number of features. As discussed in Sec. IV, the number of features is reduced from $K^P$ to $D$ to avoid overfitting when training the classifier, where $K^P = 4096$ for one primary color layer in the experiments. For a face recognition model, the main goal is to achieve high testing accuracy, and thus $D$ is decided mainly according to the testing accuracy. The local cuboid sizes are set to form a 3-stage scheme. Table II shows the testing accuracy for 8 different numbers of features. Several observations are summarized as follows. First, more features do not necessarily lead to better testing accuracy, because more features may cause overfitting; hence feature reduction is needed. Second, for both datasets, the best testing accuracy occurs when $D = 90$ for one primary color layer, i.e., a total of 270 features for three layers. Consequently, the compression ratio is 45.5:1, which is a significant reduction of data size.
Experiment 2: Testing accuracy and computations as functions of the local cuboid sizes. In this experiment, the number of extracted features is fixed to 270, which yields the best testing accuracy according to Experiment 1, and the local cuboid sizes are varied. The results are shown in Table III. Observe from the table that Settings 2 and 3 both achieve the best accuracy; however, Setting 2 has a lower computational cost than Setting 3. Hence Setting 2 is an efficient setting of local cuboid sizes for this experiment.
Experiment 3: Performance with multiple candidates. Following Experiments 1 and 2, we show all the images that are incorrectly recognized from the 19-class testing dataset in Fig. 8. There are some problematic images, which are marked with red and blue rectangles. In the images with red rectangles, the faces are covered by hands, while in the image with a blue rectangle, the error actually comes from the image preprocessing (the face detection operation). When we eliminate those problematic images or select a better face detection operation, the testing accuracy improves from 97.6% to 98.5%. Similarly, in Fig. 9, we display the incorrectly recognized problematic images from the 158-class testing dataset. Some of them are seriously affected by sunglasses, hands, and even another person's shoulder. When we exclude these errors, the accuracy improves from 84.71% to 86.3%.
Additionally, because the proposed scheme classifies the images using a MAP-like estimator, for each image the classifier outputs the individual probabilities for individual classes that this image belongs to.Hence it is possible to determine several candidates with the highest probabilities that one image belongs to.Refined and improved algorithms can be developed to determine the best decision from the candidates with the highest probabilities.It is worth mentioning that some classifiers cut space into regions and do not have the cluster    images from 12288 (full features), 9000, 6000, 3000, 270 features which respectively have 4096, 3000, 2000, 1000, 90 features for recovering one primary color layer.To compare the recovery result, we use the percent deviation to calculate the loss defined as Table IV shows the corresponding percent deviations and compression ratio.It is observed that using more features does not necessarily lead to a small percent deviation due to overfitting issue.In addition, reconstructing images from 270 features has a very high compression ratio and a satisfactory low percent deviation.Note that the deviation for 12288 is very small but nonzero due to the accumulated computational error of multiple stages in the platform.Let us see how the recovered images look like.Fig. 
11 shows some reconstructed sample images. Row (a) contains the samples recovered from the full 12288 features; the average percent deviation is 1.7226e-04%, which shows that we can recover the images losslessly from the full features. Row (b) contains the samples recovered from 9000 features, with an average percent deviation of 5.48%. Rows (c) and (d) contain the samples recovered from 6000 and 3000 features, with average percent deviations of 12.41% and 15.51%, respectively. Observe that as the features are further reduced at this level, the deviations grow. Also, in rows (c) and (d), the reconstructed images contain insignificant details that look like noise. As a result, the reconstruction quality and the testing accuracy are poor due to overfitting, as we saw in the previous example in Table II. When the number of features is reduced to 270, however, as shown in row (e), the insignificant details disappear, and the effect is as if the images had passed through a smoothing process. Consequently, the deviation is only 9.14%, which is even lower than that of rows (c) and (d). This result also reflects that 270 features are suitable for training the classifier, since they capture most of the significant information of an image while avoiding the overfitting that occurs when using 6000 or 3000 features. Moreover, in the top and bottom samples of the first column of Fig. 11, the glasses disappear in the images reconstructed from the reduced 270 features. This implies that the proposed scheme can remove redundancy that is irrelevant or unimportant to classification. To see this point more clearly, Fig.
12 shows more reconstructed images from the 270 features. Observe from the figure that the glasses and fingers disappear. That is, the significant features extracted by the proposed scheme capture information about face structure rather than disturbing objects such as glasses and fingers. Therefore, when we reconstruct the images from a suitable number of features, the disturbing items disappear. From the viewpoint of data compression, the compressed data is dedicated to achieving better classification performance. This effect may be regarded as the "feature filtering" property of the proposed system in classification.
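The reduce-then-invert behavior described above can be sketched with a generic orthonormal linear transform (PCA via SVD here, standing in for the paper's transform kernels). The percent-deviation formula below is an assumption, the usual relative error 100·‖x − x̂‖/‖x‖, since the paper's exact definition is not reproduced in this excerpt:

```python
import numpy as np

def percent_deviation(original, recovered):
    # Assumed form of the loss: relative error in percent.
    return 100.0 * np.linalg.norm(original - recovered) / np.linalg.norm(original)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))               # 200 samples, 64 features each
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

def reduce_then_invert(x, k):
    basis = Vt[:k]                           # top-k orthonormal directions
    coeffs = (x - mean) @ basis.T            # forward: feature reduction
    return coeffs @ basis + mean             # inverse: linear reconstruction

x = X[0]
print(percent_deviation(x, reduce_then_invert(x, 64)) < 1e-8)   # True: lossless from full features
print(percent_deviation(x, reduce_then_invert(x, 8)) > 0)       # True: lossy but still invertible
```

Because the transform is linear and orthonormal, keeping all coefficients reconstructs the data exactly, while keeping only the top few acts as the smoothing "feature filter" seen in rows (e) of Fig. 11.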
Experiment 5: Comparisons between conventional and proposed schemes. In this experiment, we compare the proposed system with AlexNet and the Saak transform in [7]. The same preprocessed training and testing images are used. We apply the AlexNet provided by Matlab; the CNN model is trained using exactly the same layers as those in AlexNet. Regarding hardware, we allow the CNN provided by Matlab to use a GPU (an NVIDIA GeForce GTX 1050 Ti) for speedup, while the proposed scheme uses only a CPU. For the Saak transform, 600 features are used, which achieves its best accuracy. The results are shown in Table V. We see that the proposed model achieves better accuracy while its total computational time, including training and testing, is only one twenty-fifth of that of the CNN.

VII. CONCLUSION AND FUTURE WORK
We have proposed a linear classification system that can invert the extracted and reduced features back to the original data, and that achieves data compression for classification purposes as well. Experimental results show that the proposed system outperforms conventional classification schemes in terms of not only computational complexity but also testing accuracy.
From the viewpoint of data compression, the proposed system compresses the data in a way that is beneficial for classification. That is, when the images are recovered from the reduced features via the proposed system, several feature redundancies unimportant for classification, such as glasses and hands covering the faces, are naturally filtered out. We call this effect feature filtering. When the compressed data is used for classification, the testing accuracy remains high while the recovered images still achieve a small percent deviation. These properties, including linearity, reversibility, data compression, and feature filtering, make the proposed system worth further investigation. Developing sophisticated algorithms to refine the detection results among the top guesses, and efficiently exploiting the advantages of data compression for classification, remain open problems.

Fig. 1. The proposed linear classification system, consisting of the preprocessing, feature extraction, feature reduction, and classification modules.

Step 1. Global cuboid to several local cuboids. Let the global cuboid of the nth image at Stage p−1 be G_{p−1}(n), with size I_{p−1} × J_{p−1} × K_{p−1}. At Stage p, the global cuboid is cut into several local cuboids G_{i,j}^{p−1}(n) of size l_i^p × l_j^p × l_k^p. With the constraints on the local cuboid side lengths, one can perfectly cut the global cuboid into I_p × J_p non-overlapping local cuboids.
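A minimal sketch of this cutting step, assuming the side lengths divide the global dimensions exactly (as the constraint requires) and cutting only along the two spatial dimensions to form the I_p × J_p grid; the array sizes are illustrative:

```python
import numpy as np

def cut_into_local_cuboids(G, li, lj):
    """Cut a global cuboid of size I x J x K into an (I/li) x (J/lj) grid
    of non-overlapping local cuboids, each of size li x lj x K."""
    I, J, K = G.shape
    assert I % li == 0 and J % lj == 0, "side lengths must divide exactly"
    return {(i, j): G[i * li:(i + 1) * li, j * lj:(j + 1) * lj, :]
            for i in range(I // li) for j in range(J // lj)}

G = np.arange(4 * 6 * 2).reshape(4, 6, 2)    # toy global cuboid, I=4, J=6, K=2
blocks = cut_into_local_cuboids(G, 2, 3)
print(len(blocks), blocks[(0, 0)].shape)     # 4 (2, 3, 2)
```

Because the blocks are non-overlapping slices, the cut is trivially invertible: stacking the local cuboids back in grid order reproduces the global cuboid exactly.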

Fig. 2. The dataflow, reshape operations, and dimensions of cuboids between Steps p−1 and p.

Fig. 3. The dimension conversion of the global cuboid through stages. S_p and inv(S_p) are the forward and inverse operations of the proposed scheme between Stages p and p−1, respectively.

Fig. 5. The correlation matrix of the extracted features, which is nearly diagonal, with diagonal elements equal to 1 and off-diagonal elements smaller than 1.0e-05.

Algorithm 1: Training Process. Input: N preprocessed training images Q(n) ∈ R^{I×J}; local cuboid spatial sizes l_i^p × l_j^p in stage p, for p = 1, 2, …, P. Output: transform kernels B_{i,j}^p, weights w_m, biases b_m, and remaining indices I.

16: Feed x(n) and the labels into the linear discriminant classifier, and obtain the weights w_m and biases b_m.
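A hedged sketch of what this step produces, for the two-class case: the classical Gaussian/Fisher linear discriminant with a shared covariance estimate and equal priors yields a weight vector w and bias b (the multi-class training in the paper yields one w_m and b_m per class m; the data below is synthetic):

```python
import numpy as np

def linear_discriminant(X0, X1):
    """Weight w and bias b for a two-class linear discriminant with a
    shared covariance estimate and equal priors."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    cov = np.cov(np.vstack([X0 - mu0, X1 - mu1]).T)  # pooled covariance
    w = np.linalg.solve(cov, mu1 - mu0)   # weight vector
    b = -0.5 * w @ (mu0 + mu1)            # bias: threshold at the midpoint
    return w, b

rng = np.random.default_rng(1)
X0 = rng.normal(loc=-1.0, size=(300, 3))  # synthetic class-0 features
X1 = rng.normal(loc=+1.0, size=(300, 3))  # synthetic class-1 features
w, b = linear_discriminant(X0, X1)
acc = ((X1 @ w + b > 0).mean() + (X0 @ w + b < 0).mean()) / 2
print(acc > 0.9)  # True: the linear rule separates the synthetic classes well
```

This is consistent with the paper's observation that the extracted features behave like uncorrelated Gaussians: under that model, the MAP decision rule reduces to exactly such linear score functions w_m · x + b_m.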

Fig. 7. An overview of images after face detection and resizing.

Fig. 8. All 11 incorrectly recognized images from the 19-class dataset. Some problematic images are boxed in red and blue rectangles; if they are eliminated, the accuracy reaches 98.5%.

Fig. 9. All the incorrectly recognized problematic images from the 158-class dataset. If these unavoidable errors are eliminated, the accuracy reaches 86.3%.

Fig. 10. The CDF of top-X guesses for both testing datasets. With top-3 guesses, the accuracies of both datasets exceed 90%: 99.1% for the 19-class dataset and 93.1% for the 158-class dataset.

Fig. 12. The image recovery method eliminates disturbing objects. The top image originally has fingers in front of the man's chin, which disappear after recovery. The middle and bottom images both have glasses at first, which disappear after recovery.

TABLE II. TESTING ACCURACY AS A FUNCTION OF VARIOUS NUMBERS OF FEATURES.

TABLE III. TESTING ACCURACY AND COMPUTATIONS AS FUNCTIONS OF VARIOUS LOCAL CUBOID SIZES.

TABLE V. ALEXNET VS. SAAK TRANSFORM VS. PROPOSED RECOGNITION MODEL.