A Category-Specific Dictionary Learning Method Tailored for Reconstruction-Based Feature Coding

Bag-of-Visual-Words (BoVW) is still a useful image classification model when there is not enough data to use Deep Learning. In BoVW model, the practice of reducing the reconstruction errors of local features can improve the classification accuracy owing to the decrease of information loss. Many reconstruction-based coding methods are proposed to learn a visual dictionary and encode local features via minimizing the reconstruction errors of local features with constraints. Besides this, the accuracy can also be improved by learning the category-specific dictionaries and then encoding features based on these dictionaries. By considering the two practices together, we propose a simple category-specific dictionary learning method tailored for reconstruction-based feature coding. Our method can be used as a universal one to improve the classification accuracies of many reconstruction-based coding methods, which is the highlight of our method. Concretely, a universal dictionary is learned by employing a reconstruction-based coding method and then refined for each category to obtain the category-specific dictionary of this category. When encoding a feature by a category-specific dictionary, the visual words for encoding it are decided in advance by the indices, which correspond to the non-zero elements of its coding vector obtained with the universal dictionary. The effectiveness of our method is validated by observing whether there is an accuracy improvement after applying our method. Our results on Scene-15, Caltech-101, and UIUC-Sports datasets show that the accuracies of four representative coding methods are improved by about 0.3% to 2.7%, which experimentally demonstrates the universality and effectiveness of our method.


I. INTRODUCTION
In recent decades, many methods for image classification have been presented in the field of computer vision. Challenges such as changes in viewpoint and illumination, partial occlusion, clutter, and inter- and intra-category visual diversity make image classification a difficult task. To date, there are two representative image classification models, i.e., Bag-of-Visual-Words (BoVW) and convolutional neural network (CNN), which have achieved many encouraging results in the past ten years.
The BoVW model divides the process of converting an image to a vector into five stages [1]. In the beginning, image patches are extracted from training images in a dense or random manner. Then, image patches are described as feature descriptors (local features) via statistical analysis over the pixels of the image patches. Scale-invariant feature transform (SIFT) [2] is widely used to describe image patches as 128-dimensional vectors. Next, a visual dictionary is learned using local features from training images by a learning algorithm such as K-means [3] or sparse coding [4]. After this, local features are encoded as coding vectors by the learned dictionary. In the end, all coding vectors are pooled together to form an image representation vector by maximum pooling or average pooling [5].
CNN [6] is a deep neural network that exploits the spatial structure of images. It consists of convolutional layers, pooling layers, non-linear activations, and fully connected layers. Convolutional layers capture the existence of various patterns, which are detected by different convolutional kernels. Pooling layers preserve the maximum saliencies in local regions by downsampling the convolutional layers. Fully connected layers are generally appended at the end and simply represent a multi-layer perceptron. Non-linear activation functions are necessary to learn a complex function. A predominant difference between BoVW and CNN is that BoVW works with hand-crafted features such as SIFT and Histogram of Oriented Gradient (HOG), while CNN can automatically extract image features after training on a significant amount of data.
The associate editor coordinating the review of this manuscript and approving it for publication was Junchi Yan.
In recent years, CNN has achieved many superior results on some challenging image datasets such as ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [7]. However, when there is not enough training data, CNN shows a poor performance due to over-fitting [8]. To solve this, an effective method is to modify a CNN already trained on another large dataset, known as transfer learning [9]. The first layers of a pre-learned CNN are used as a mid-level feature extractor, and the last layers are modified to fit a certain target task. When training the modified CNN, only the weights of the modified layers are updated, while the weights of other layers are fixed. Some methods [10]- [12] based on transfer learning have reported obvious improvement over BoVW and achieved significant results in medical image classification.
Despite the significant effectiveness of transfer learning, its success depends on a pre-learned CNN, which requires a huge amount of data and time for training. It is worth noting that the choice of the source dataset used for pre-learning a CNN and the number of images in the target dataset influence the classification result [13]. If there is a large visual difference between the source dataset and the target dataset, negative transfer is likely to happen, resulting in poor performance. In addition, some popular CNNs have a large number of parameters, leading to considerable memory consumption, such as 520MB for VGG-16 [14]. In contrast, the BoVW model is a plug-and-play method that can be used without any prior initialization or very time-consuming training [15]. Hence, the BoVW model might work well when dealing with classification tasks that only provide a small amount of training data. Moreover, the BoVW model has evolved in an understandable way in the past 15 years or so. By analyzing the target classification task, it is feasible to use the human knowledge obtained to improve the BoVW model from the aspects of feature extraction, feature description, dictionary learning, feature coding, and feature pooling. In view of this, for some simple tasks, the BoVW model is probably capable of attaining satisfactory results. Besides, the BoVW model can also be used jointly with CNN to acquire higher classification accuracies, especially when training data are lacking, as done in very recent research [16]. Consequently, we advocate the conventional yet effective BoVW model in this article.
In the BoVW model, reducing the reconstruction errors of local features can improve classification accuracy because less information is lost. To this end, reconstruction-based coding methods reconstruct local features by solving a least-square optimization problem with constraints. At the training stage, a reconstruction-based coding method also learns a visual dictionary using local features extracted from training images. A number of reconstruction-based coding methods have been proposed, such as sparse coding [4], locality-constrained linear coding (LLC) [17], and Laplacian sparse coding (LSC) [18]. The main difference among these methods lies in the constraint. Apart from this practice, another effective way of improving classification accuracy is to learn category-specific visual dictionaries and then encode features by the learned dictionaries. Various category-specific dictionary learning methods [19]-[22] have been presented in the last decade.
By considering the above two practices together, we propose a simple category-specific dictionary learning method tailored for reconstruction-based feature coding in this article. We aim to reduce the reconstruction errors of local features from positive samples and increase the errors of features from negative samples via encoding based on category-specific dictionaries. Specifically, a universal dictionary is learned by employing a reconstruction-based coding method and then refined for each category to obtain the category-specific dictionary of this category. For each category, its category-specific dictionary is learned using only the local features of this category. When encoding a local feature by a category-specific dictionary, the visual words for encoding it are decided in advance by the indices that correspond to the non-zero elements of its coding vector obtained with the universal dictionary. In principle, our method can be used as a universal one to improve the classification accuracies of many reconstruction-based coding methods, which is the highlight of our method. In this article, we apply our method to four representative coding methods, i.e., sparse coding, approximated LLC (aLLC) [17], LLC, and LSC. The effectiveness of our method is validated by observing whether there is an accuracy improvement after applying our method. We also investigate the computation time spent by our method to evaluate its practicability. Besides, our method is carefully compared to a common method. By comparison, we observe that the two methods have different performance characteristics according to the mixability of spatial distributions, even though they follow very similar ideas of learning category-specific dictionaries, which to our knowledge has not yet been illustrated in existing works. The experiments are conducted on three small datasets, i.e., Scene-15 [3], Caltech-101 [23] and UIUC-Sports [24].
Our results show that the classification accuracies of the four coding methods are improved by about 0.3% to 2.7%, and the computation time spent by our method is acceptable. This phenomenon implies that our method is capable of improving the classification accuracies of many reconstruction-based coding methods with added yet acceptable computation time.
The remainder of this article is organized as follows. Section II reviews the related works. Section III illustrates our work in detail. Experimental evaluation and analysis are reported in Section IV, and the conclusion is drawn in Section V.

II. RELATED WORKS
Many works have focused on dictionary learning in the past ten years. An early method calculates the clustering centers using local features from training images by K-means and takes each clustering center as a visual word [3]. To reduce the quantization errors of local features, the reconstruction-based coding method is proposed to learn a visual dictionary using local features from training images via resolving a least-square optimization problem with constraints. Different constraints result in different dictionaries. Sparse coding adds an $\ell_1$-norm constraint on the coding vectors of local features to keep them sparse. Wang et al. [17] added a locality constraint to project the local features into their local coordinate systems. Gao et al. [18] required spatially close and similar local features to have similar coding vectors. An extension [25] of [18] considered the similarity among local regions instead of local features. Bengio et al. [26] encouraged local features from the images of the same category to be encoded with fixed visual words, by imposing a mixed-norm regularization.
Some works are devoted to learning category-specific dictionaries. In [19], the authors used Gaussian Mixture Model (GMM) to learn a universal dictionary and adapted it for each category to generate the category-specific dictionary of this category. Kong et al. [20] proposed a category-specific dictionary learning method named DL-COPAR, which aims at separating commonality and particularity. In their method, local features are encoded with the union of a universal dictionary and all category-specific dictionaries. In [21], the authors learned a universal dictionary and multiple category-specific dictionaries jointly by adding a discriminative constraint according to the Fisher discrimination criterion. Gao et al. [22] also proposed to learn a universal dictionary and multiple category-specific dictionaries for fine-grained classification, by imposing a cross-dictionary incoherence constraint and self-dictionary incoherence terms. Yang and Xiong [27] employed K-means and K-SVD to learn a category-specific dictionary for each category using the local features of this category. Based on the learned category-specific dictionaries, the authors proposed a kind of feature named category-sensitive saliency feature and used it to obtain image representation vectors. This method achieved comparable or better results in comparison to many advanced BoVW methods.
In order to preserve the relationship among neighboring local features, [28] and [29] proposed visual phrase and visual local graph, respectively. Accordingly, visual phrase dictionary and visual graph dictionary are learned in [28] and [29], respectively. In [5] and [30], the authors combined spatially close local features into joint features, and then learned a dictionary using joint features by sparse coding.
In recent years, Analysis Dictionary Learning (ADL) has shown excellent performance in image classification tasks, such as [31] and [32]. It is worth noting that ADL is often applied to image representation vectors to acquire more discriminative vectors. Therefore, the dictionary obtained by ADL is not used for encoding local features, which is different from the one in BoVW model.

III. OUR WORK
In this section, we first illustrate our work under the framework of BoVW model. Afterward, the detail on our method is presented clearly.

A. PROCESS OF IMAGE CLASSIFICATION
In the last decade, the BoVW model has formed a unified framework consisting of five basic steps [1]: image patch extraction, image patch description, dictionary learning, feature coding, and feature pooling. Our work involves only dictionary learning and feature coding. Fig. 1 shows the process of image classification including our work. As shown in Fig. 1, the input image is classified through the six stages (a) to (f).
The first stage (a) is to extract the image patches from the input image. This process is implemented via sampling local areas of the image usually in a dense manner, e.g., the dense patches of 16 × 16 pixels with the step of 8 pixels.
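The dense sampling described above can be sketched as follows. This is a minimal illustration, assuming a 2-D grayscale image stored as a NumPy array; the function names and the synthetic image are hypothetical and only the patch size (16) and step (8) come from the text.

```python
import numpy as np

def dense_patch_origins(height, width, patch=16, step=8):
    """Return the (row, col) top-left corners of all patches that fit in the image."""
    rows = range(0, height - patch + 1, step)
    cols = range(0, width - patch + 1, step)
    return [(r, c) for r in rows for c in cols]

def extract_patches(image, patch=16, step=8):
    """Stack the dense patches of a 2-D grayscale image into shape (n, patch, patch)."""
    origins = dense_patch_origins(image.shape[0], image.shape[1], patch, step)
    return np.stack([image[r:r + patch, c:c + patch] for r, c in origins])

# Synthetic 32 x 40 image used only as a stand-in for a real input image.
image = np.arange(32 * 40, dtype=float).reshape(32, 40)
patches = extract_patches(image)
```

Patches that would cross the image border are simply dropped in this sketch; other border policies are equally plausible.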
Then, the image patches are described as the feature descriptors (local features) at the second stage (b). Many methods such as Scale-invariant Feature Transform (SIFT), Local Binary Pattern (LBP), and HoG, can be used for describing image patches. For example, SIFT is widely employed to describe image patches as 128-dimensional vectors.
Next, we obtain the indices of the visual words for encoding local features at the third stage (c). For each feature, its indices indicating which visual words are involved in the coding process are recorded by a 0-1 vector, where the 1 elements correspond to the non-zero coding coefficients of its coding vector obtained with a universal dictionary. The universal dictionary is learned using local features from training images via solving a least-square optimization problem with constraints (illustrated in Section III(B.2)).
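As a small illustration of stage (c), the 0-1 vector can be derived directly from a coding vector. The sketch below assumes the coding vector has already been computed with the universal dictionary; the helper name and the optional tolerance `tol` are assumptions introduced here, not part of the method description.

```python
import numpy as np

def support_vector(x_u, tol=0.0):
    """0-1 vector marking the non-zero coefficients of a universal-dictionary code."""
    return (np.abs(x_u) > tol).astype(np.uint8)

# Hypothetical coding vector over a 5-word universal dictionary.
x_u = np.array([0.0, 0.7, 0.0, -0.2, 0.0])
v = support_vector(x_u)          # 1 where the coefficient is non-zero
indices = np.flatnonzero(v)      # indices of the visual words that encode the feature
```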
Afterward, at the fourth stage (d), each local feature is encoded as C coding vectors by C category-specific dictionaries, respectively (illustrated in Section III(B.4)). At this stage, the visual words for encoding each feature are decided in advance by the 1 elements in its 0-1 vector. The category-specific dictionary of each category is learned by refining the universal dictionary with the local features from the training images of this category (illustrated in Section III(B.3)).
At the next stage (e), the coding vectors are aggregated into one vector by performing a pooling operation, such as maximum pooling or average pooling. In many existing methods, Spatial Pyramid Matching (SPM) [3] is widely used to incorporate the spatial information of the local features from an image into an image representation vector. It partitions the whole image region into multiple blocks at the resolution levels of 1×1, 2×2 and 4×4. The coding vectors in each block are pooled together to form a pooling vector, and then the pooling vectors of all the blocks are concatenated into one vector, namely, the image representation vector.
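The pooling stage can be sketched as below. This is a simplified illustration of SPM with maximum pooling, assuming each coding vector comes with the (x, y) coordinate of its patch centre; the function name, argument layout, and the zero vector used for empty blocks are assumptions made for this sketch.

```python
import numpy as np

def spm_max_pool(codes, xy, width, height, levels=(1, 2, 4)):
    """Max-pool coding vectors inside each pyramid block and concatenate the results.

    codes: (n, K) coding vectors; xy: (n, 2) patch-centre coordinates (x, y).
    """
    n, K = codes.shape
    pooled = []
    for g in levels:
        # Block index of every feature at this pyramid level (g x g grid).
        bx = np.minimum((xy[:, 0] * g / width).astype(int), g - 1)
        by = np.minimum((xy[:, 1] * g / height).astype(int), g - 1)
        for j in range(g):
            for i in range(g):
                mask = (bx == i) & (by == j)
                # Empty blocks contribute a zero vector.
                block = codes[mask].max(axis=0) if mask.any() else np.zeros(K)
                pooled.append(block)
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
codes = rng.random((50, 8))                 # 50 non-negative coding vectors, K = 8
xy = rng.uniform(0, 100, size=(50, 2))      # patch centres in a 100 x 100 image
z = spm_max_pool(codes, xy, width=100, height=100)
```

With the three levels above there are 1 + 4 + 16 = 21 blocks, so the representation has 21·K entries.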
At the last stage (f), the image representation vector obtained for each category is fed into the SVM trained for this category to obtain a score denoting how likely the input image is to belong to this category. The category indicated by the maximum score is taken as the category of the input image. For any category, its SVM is a one-versus-rest linear SVM, which is trained on the image representation vectors obtained with the category-specific dictionary of this category. The training images of this category are taken as the positive samples and the training images of other categories are the negative samples.
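The decision rule of stage (f) can be sketched as follows. The sketch assumes each one-versus-rest linear SVM is summarized by a weight vector and a bias that were trained elsewhere; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def classify(reps, weights, biases):
    """Pick the category whose own linear SVM gives the highest score.

    reps[s] is the image representation built with the s-th category-specific
    dictionary; weights[s] and biases[s] are the parameters of the s-th SVM.
    """
    scores = np.array([w @ z + b for z, w, b in zip(reps, weights, biases)])
    return int(np.argmax(scores)), scores

# Toy example with C = 3 categories and 2-dimensional representations.
reps = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
weights = [np.array([0.5, 0.5]), np.array([2.0, -1.0]), np.array([1.0, 1.0])]
biases = [0.0, 0.0, -0.5]
label, scores = classify(reps, weights, biases)
```

Note that each SVM scores a different representation of the same image, because every category has its own dictionary.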

B. CATEGORY-SPECIFIC DICTIONARY LEARNING
In theory, classification accuracy can be improved by reducing the reconstruction errors of the local features extracted from positive samples and increasing the errors of the local features from negative samples. The reason is that the images restored from the coding vectors of positive samples become ''clear'' (information loss decreases), while the images restored from the coding vectors of negative samples become ''blurred'' (information loss increases). Hence, the difference between positive and negative samples increases, which is beneficial to classification tasks.
To achieve this, for each category we learn a category-specific dictionary using the local features extracted from the training images of this category. The category-specific dictionary is obtained by resolving a least-square optimization problem. In this case, for any category, the visual words of its category-specific dictionary are specifically learned to minimize the reconstruction errors of the local features of this category. As a result, the category-specific dictionary of any category tends to generate low reconstruction errors for the local features of this category and high reconstruction errors for the features of other categories.
Concretely, we first use the local features extracted from all training images to learn a universal dictionary via resolving a least-square problem $P$. Then, the universal dictionary is refined for each category to obtain the category-specific dictionary of this category, by resolving another least-square problem $\tilde{P}$ using the local features of this category. The problem $\tilde{P}$ is non-convex and includes two kinds of variables, i.e., the dictionary and the coding coefficients, so they need to be solved alternately. When solving the coding coefficients of a local feature, only the visual words indicated by its indices are used for encoding it. The indices of a local feature correspond to the non-zero coding coefficients of its coding vector, which is obtained by encoding the feature based on the universal dictionary. After attaining the category-specific dictionaries of all categories, a local feature is encoded in the following steps. Given a local feature $f$, firstly, it is encoded by the universal dictionary as a coding vector. Afterward, the indices corresponding to the non-zero coefficients of its coding vector are recorded, which indicate which visual words of the universal dictionary encode $f$. At last, for each category, only the visual words from its category-specific dictionary indicated by the indices of $f$ are used to encode $f$ to obtain the coding vector for this category. Fig. 2 illustrates the principle of our method by a toy example.
FIGURE 2. Toy example of category-specific dictionary learning and feature coding. In these square boxes denoting the same 2-dimensional feature space, the diamonds represent the visual words and the small circles are the local features. Each local feature is encoded only by the nearest word to it, and its color indicates which word encodes it. (a) for each local feature of each category, obtaining the index (denoted by a kind of color) of the nearest word to it; (b) refining the universal dictionary for each category using the local features of this category according to their indices; (c) for the local feature $f$, obtaining the index of the word from the universal dictionary for encoding it; (d) encoding the feature $f$ with the word from each category-specific dictionary indicated by its index.
From the viewpoint of feature space, for a local feature, the visual words indicated by the non-zero elements in its coding vector (obtained with a universal dictionary), define a hyperplane it projects onto. Different coding methods give different projection schemes for the same feature. When solving the coding coefficients of a feature at the two stages of refining dictionary and encoding features by any category-specific dictionary, the feature is always projected onto the ''same'' hyperplane, which is constructed by the visual words from a non-universal dictionary indicated by the non-zero elements in its coding vector (obtained with a universal dictionary).
In the following, the unified representation of reconstruction-based coding methods is given in Section III(B.1). On the basis of the unified representation, the details on how to learn a universal dictionary are presented in Section III(B.2). Our category-specific dictionary learning method is illustrated in Section III(B.3), and the coding method is explained in Section III(B.4).

1) UNIFIED REPRESENTATION OF RECONSTRUCTION-BASED CODING METHOD
The core idea of reconstruction-based coding is to reconstruct a feature with visual words via resolving a least-square optimization problem with constraints [1]. The unified representation of reconstruction-based coding can be generally written as
$$\arg\min_{x} \; \|f - Bx\|_2^2 + \lambda\,\phi(x) \tag{1}$$
where $x$ is the coding vector of the local feature $f$, $B$ is the visual dictionary, and $\lambda$ balances the least-square term $\|f - Bx\|_2^2$ and the constraint term $\phi(x)$. The least-square term $\|f - Bx\|_2^2$ pursues accurate reconstruction, and the constraint term $\phi(x)$ makes the coding vector have a certain characteristic, for example, similar/different features obtain similar/different coding vectors [18]. The main difference among various reconstruction-based coding methods lies in the constraint term $\phi(x)$. Table 1 lists the constraints of sparse coding [4], non-negative sparse coding [33], LLC [17] and LSC [18] as examples, which have different purposes as illustrated in Section II.
At the stage of dictionary learning, the reconstruction-based coding method is also used to learn a visual dictionary using the local features extracted from training images. In this case, the optimization problem includes not only the variable $x$ but also the variable $B$. It can be written as
$$\arg\min_{X,\,B} \; G(X, B) \tag{2}$$
where
$$G(X, B) = \|F - BX\|_F^2 + \lambda \sum_{i} \phi(x_i),$$
and the columns of $F$ and $X$ are the local features $f_i$ and their coding vectors $x_i$, respectively. Since the target function $G(X, B)$ is non-convex, the variables $X$ and $B$ are solved alternately. In particular, the constraint $\|b_k\|_2 \le 1$ on each visual word $b_k$ is required in some methods such as [17] and [18]. After obtaining the dictionary $B$, the coding vector $x$ of a local feature can be obtained by resolving formula (1).
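As a rough illustration of the alternating scheme, the two sub-problems can be written out as follows. This is a sketch under the assumption that the constraint term is evaluated per coding vector as in formula (1); the iteration superscript $(t)$ is notation introduced here for illustration only, and individual coding methods may use different update rules.

```latex
% X-step: with the dictionary fixed, each feature is coded independently.
x_i^{(t+1)} = \arg\min_{x} \; \| f_i - B^{(t)} x \|_2^2 + \lambda\,\phi(x)
% B-step: with the coding vectors fixed, the dictionary is a constrained least-square fit.
B^{(t+1)} = \arg\min_{B} \; \| F - B X^{(t+1)} \|_F^2
\quad \text{s.t.} \quad \| b_k \|_2 \le 1 \;\; \forall k
% If each sub-problem is solved exactly, neither step increases the objective:
G\big(X^{(t+1)}, B^{(t+1)}\big) \le G\big(X^{(t+1)}, B^{(t)}\big) \le G\big(X^{(t)}, B^{(t)}\big)
```

The last line holds provided $B^{(t)}$ already satisfies the norm constraint, so that it is a feasible point of the B-step.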

2) UNIVERSAL DICTIONARY LEARNING
In our method, we need to learn a universal dictionary at first. Given a training image set with $C$ categories, we extract a fixed-size set of local features from the training images of each category, and gather all the sets $\{S_1, S_2, \ldots, S_C\}$ into a set $S^u$ for learning a universal dictionary. The set $S^u$ is formed as the matrix $F^u$. The universal dictionary $B^u$ is learned by resolving formula (2) with the input $F^u$.

3) CATEGORY-SPECIFIC DICTIONARY LEARNING
In order to obtain the category-specific dictionaries of all categories, the universal dictionary $B^u$ is refined for each category using the local features of this category individually. There are two necessary steps, i.e., obtaining the indices of the visual words from $B^u$ for encoding each local feature, and learning the category-specific dictionaries $B^c_1, B^c_2, \ldots, B^c_C$ via resolving a least-square problem. Concretely, for the $s$th category, the set $S_s$ of local features is extracted from the training images of the $s$th category. For each feature $f_{s,i}$ in $S_s$, its coding vector $x^u_{s,i}$ is calculated by resolving formula (1) with the universal dictionary $B^u$. The 0-1 vector $v_{s,i}$ recording the indices of the visual words for encoding $f_{s,i}$ is obtained by
$$v_{s,i}(j) = \begin{cases} 1, & x^u_{s,i}(j) \neq 0 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
where $v_{s,i}(j)$ is the $j$th element in $v_{s,i}$, and $x^u_{s,i}(j)$ is the $j$th coding coefficient in $x^u_{s,i}$. After obtaining the 0-1 vectors of all the local features in $S_s$, the category-specific dictionary $B^c_s$ of the $s$th category is learned by minimizing the target function below:
$$\arg\min_{B^c_s} \sum_{i} \min_{\tilde{x}_{s,i}} \big\| f_{s,i} - \tilde{B}^c_{s,i}\,\tilde{x}_{s,i} \big\|_2^2 \tag{4}$$
where $\tilde{B}^c_{s,i}$ is the small dictionary consisting of the visual words from $B^c_s$ indicated by the 1 elements in $v_{s,i}$, and $\tilde{x}_{s,i}$ holds the corresponding coding coefficients. In this step, considering that almost all reconstruction-based coding methods include the least-square term $\|f - Bx\|_2^2$, we directly minimize this term (resolving formula (4)) to learn a dictionary that provides the smallest lower bound of reconstruction error for $F_s$ under the con- Although encoding local features is performed with constraints (as shown in Table 1) at the stage of feature coding, the dictionary obtained by resolving formula (4) leads to lower reconstruction error compared with the universal dictionary (demonstrated in Section IV(C)).
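One possible way to carry out this refinement is sketched below. It is not necessarily the authors' solver: it assumes plain alternating least squares (a restricted least-square fit of the coefficients, then a least-square update of the used visual words via the pseudo-inverse), omits any norm constraint on the visual words, and builds the 0-1 vectors from the two nearest words as a stand-in for a real coding method. All function names are hypothetical.

```python
import numpy as np

def code_on_support(F, B, V):
    """Coefficient step: least-squares code of each feature on its fixed support."""
    X = np.zeros((B.shape[1], F.shape[1]))
    for i in range(F.shape[1]):
        idx = np.flatnonzero(V[:, i])
        if idx.size:
            X[idx, i] = np.linalg.lstsq(B[:, idx], F[:, i], rcond=None)[0]
    return X

def refine_dictionary(F, B_u, V, n_iter=10):
    """Refine the universal dictionary B_u on the features F of one category."""
    B = B_u.copy()
    for _ in range(n_iter):
        X = code_on_support(F, B, V)
        used = X.any(axis=1)  # words that encode at least one feature
        # Dictionary step: least-squares update of the used words only;
        # unused words keep their values from the universal dictionary.
        B[:, used] = F @ np.linalg.pinv(X[used])
    return B

def recon_error(F, B, V):
    """Total squared reconstruction error with supports fixed by V."""
    X = code_on_support(F, B, V)
    return float(np.sum((F - B @ X) ** 2))

rng = np.random.default_rng(0)
B_u = rng.normal(size=(8, 6))    # toy universal dictionary: 6 words in 8 dimensions
F_s = rng.normal(size=(8, 40))   # toy features of one category
# Squared distances between every word and every feature, shape (6, 40).
d2 = ((F_s[:, None, :] - B_u[:, :, None]) ** 2).sum(axis=0)
V = np.zeros((6, 40), dtype=np.uint8)
V[np.argsort(d2, axis=0)[:2], np.arange(40)] = 1   # two nearest words per feature

B_c = refine_dictionary(F_s, B_u, V)
err_universal = recon_error(F_s, B_u, V)
err_specific = recon_error(F_s, B_c, V)
```

Because each of the two steps solves its least-square sub-problem exactly, the error on the category's own features cannot increase from one iteration to the next in this sketch.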

4) FEATURE CODING
At the stage of feature coding, the universal dictionary $B^u$ and all the category-specific dictionaries $B^c_1, B^c_2, \ldots, B^c_C$ are used jointly to encode local features. Given a local feature $f_i$, firstly, its coding vector $x^u_i$ is calculated by resolving formula (1) with the universal dictionary $B^u$, and then its 0-1 vector $v_i$ is obtained by formula (3). Next, we use the category-specific dictionary of each category to encode the local feature. For the $s$th category, its coding vector $x^c_{i,s}$ for this category is attained by resolving the following formula with the category-specific dictionary $B^c_s$ of this category:
$$\arg\min_{\tilde{x}_{i,s}} \; \big\| f_i - \tilde{B}^c_{i,s}\,\tilde{x}_{i,s} \big\|_2^2 + \lambda\,\phi(\tilde{x}_{i,s}) \tag{5}$$
where $\tilde{B}^c_{i,s}$ is a small dictionary consisting of the visual words from $B^c_s$ indicated by the 1 elements in $v_i$, and the element values in $\tilde{x}_{i,s}$ are the values in $x^c_{i,s}$ indicated by the 1 elements in $v_i$. Provided there are $C$ categories, $C$ coding vectors will be generated for each local feature. In comparison to the computational time spent on encoding by the universal dictionary $B^u$, the time spent on encoding by a category-specific dictionary $B^c$ is much less owing to the small dictionary $\tilde{B}^c$ constructed from $B^c$ (demonstrated in Section IV(F)).
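The coding procedure above can be sketched as follows. For brevity the sketch drops the constraint term and solves only the least-square part on the selected visual words, which is an assumption of this illustration; a real implementation would plug in the constraint of the chosen coding method. The function name and toy data are hypothetical.

```python
import numpy as np

def encode_with_category_dicts(f, x_u, dicts):
    """Encode f with each category-specific dictionary on the support of x_u.

    Returns a (C, K) array holding one coding vector per category.
    """
    idx = np.flatnonzero(x_u)                 # words selected by the universal code
    codes = np.zeros((len(dicts), x_u.size))
    for s, B_c in enumerate(dicts):
        B_small = B_c[:, idx]                 # small dictionary for this feature
        # Unconstrained least squares stands in for the constrained problem.
        codes[s, idx] = np.linalg.lstsq(B_small, f, rcond=None)[0]
    return codes

rng = np.random.default_rng(1)
dicts = [rng.normal(size=(4, 5)) for _ in range(3)]   # C = 3 toy dictionaries
x_u = np.array([0.0, 0.6, 0.0, 0.4, 0.0])             # toy universal code
f = 2.0 * dicts[0][:, 1] - 1.0 * dicts[0][:, 3]       # feature built from words 1 and 3
codes = encode_with_category_dicts(f, x_u, dicts)
```

Each per-category problem involves only as many unknowns as there are selected words, which is why this step is cheap compared with coding against the full dictionary.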

IV. EXPERIMENTS

A. DATASETS
In our experiments, three small datasets are used to evaluate the classification performance of our proposed method.
Scene-15: Scene-15 dataset consists of 15 scene categories. There are 4492 images in total. The number of images per category varies from 260 to 440. We consider 100 training images per category. The remaining images are used as testing images.
Caltech-101: Caltech-101 dataset is a challenging object recognition dataset, which contains 9,144 images in 101 object categories and one background category. The number of images per category ranges from 31 to 800. We choose randomly 30 training images from each category to form the training set, and up to 30 testing images from each category to form the testing set.
UIUC-Sports: It consists of 8 sport event categories. There are 1579 images in total, and each category has from 137 to 250 images. 70 and 60 images from each category are used for training and testing, respectively.

B. IMPLEMENTATION DETAILS
In this article, we apply our proposed method to four representative coding methods, i.e., sparse coding, aLLC, LLC and LSC. For aLLC, the visual words for encoding a local feature are the K-nearest words to it in feature space, which are decided by calculating the Euclidean distance between the feature and each word. Therefore, the indices of the words for encoding a feature are the indices of its K-nearest visual words from the universal dictionary. For LLC, when encoding a feature $f$ by any category-specific dictionary, the distance $\|f - b_i\|_2$ (as shown in Table 1) needs to be recalculated. We only need to compute the distances from the feature to the words of the category-specific dictionary indicated by the indices of $f$. For LSC, before encoding feature by the sth category-specific dictionary (s = 1, 2, . . . , C), the coding vector of each template feature (introduced in [18 Table. 1) when encoding feature by the sth category-specific dictionary.
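For the aLLC case, the index selection can be sketched as below. The sketch assumes visual words are stored as the columns of the dictionary; the function name and the toy dictionary are hypothetical.

```python
import numpy as np

def k_nearest_word_indices(f, B, k=5):
    """Indices of the k visual words (columns of B) closest to f in Euclidean distance."""
    dists = np.linalg.norm(B - f[:, None], axis=0)
    return np.argsort(dists)[:k]

# Toy dictionary with 5 words on a line, and one feature close to the second word.
B = np.array([[0.0, 1.0, 2.0, 3.0, 10.0],
              [0.0, 0.0, 0.0, 0.0, 0.0]])
f = np.array([0.9, 0.0])
idx = k_nearest_word_indices(f, B, k=2)
```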
For images from all the datasets, we extract dense patches of 16 × 16 pixels. The step between two neighboring patches is set to 8 pixels for Scene-15 and UIUC-Sports, and 6 pixels for Caltech-101. Each patch is described as a SIFT descriptor (128-dimensional vector). The dictionary size is set to 1024 for Scene-15 and UIUC-Sports, and 2048 for Caltech-101. For aLLC, K-means is employed to learn a universal dictionary. As suggested in [17], the number of visual words used to encode a local feature is set to 5. SPM is applied to incorporate the spatial information of local features into image representation vectors. For all the datasets, a one-versus-rest linear SVM is trained for each category. All the experiments are conducted on a 64-bit Windows 10 machine with an Intel Core i5-4590 (3.30 GHz, 4 cores) and 16 GB RAM.
To evaluate more accurately whether the classification accuracy is improved after applying our method, the following setup is used. For each dataset, we randomly split it 6 times to obtain 6 training sets and 6 testing sets, and conduct the experiment 6 times on these sets for each experimental setup. The average of the classification accuracies of the 6 experiments is reported in this article. The accuracies obtained with universal dictionaries (as done in [4], [17], [18]) are treated as baselines for comparison.
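The evaluation protocol can be sketched as follows. This is a simplified illustration in which all images not chosen for training are used for testing (as described for Scene-15; the other datasets cap the number of test images). The callback `run_experiment`, which would train and test a full pipeline and return an accuracy, is a hypothetical placeholder.

```python
import numpy as np

def random_split(labels, n_train, rng):
    """Pick n_train training images per category; the rest are test images."""
    train = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        train.extend(rng.choice(members, size=n_train, replace=False))
    train = np.sort(np.array(train))
    test = np.setdiff1d(np.arange(labels.size), train)
    return train, test

def mean_accuracy(run_experiment, labels, n_train, n_splits=6, seed=0):
    """Average the accuracy returned by run_experiment over random splits."""
    rng = np.random.default_rng(seed)
    accs = [run_experiment(*random_split(labels, n_train, rng))
            for _ in range(n_splits)]
    return float(np.mean(accs))

# Toy labels: 3 categories with 10 images each.
labels = np.repeat(np.arange(3), 10)
train_idx, test_idx = random_split(labels, 4, np.random.default_rng(0))
```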

C. RECONSTRUCTION ERROR
In this section, we investigate on Scene-15 whether the reconstruction errors of local features are reduced after applying our method. To achieve this, ten thousand local features are randomly extracted from the images of each category, and two reconstruction errors are calculated for each feature: one obtained with the universal dictionary and one obtained after applying our method. For each kind of reconstruction error, the average of the errors of all the local features is computed and reported in Table 2. As shown, the average errors of the four coding methods all decrease after applying our method. Among these coding methods, aLLC achieves the largest error drop since the universal dictionary used by aLLC is generated by K-means instead of minimizing a target function including the term $\|F - BX\|_F^2$. We compare the two reconstruction errors of local features when sparse coding is used as the coding method. Fig. 3 shows the errors of 15000 local features (1000 features per category). Overall, relative to the errors obtained with the universal dictionary, the errors of most local features decrease after applying our method. However, the errors of about 3000 features increase a little, and about 600 features have no error drop. In addition, we also note that, among the local features with a high reconstruction error under the universal dictionary (e.g., 0.4 to 0.7), most have a relatively obvious error drop, and only a small number show a slight increase in reconstruction error.
FIGURE 3. Comparison between the two reconstruction errors of 15000 local features.
We further investigate the reconstruction errors obtained when the local features of each category are encoded by the category-specific dictionary of every other category, as shown in Fig. 4. For the local features of the ith category, we compute their average error e ij when they are encoded by the jth category-specific dictionary, and show the value max(e i1 , e i2 , . . . , e iC ) − e ij at the ith row and the jth column. As shown, any element at the main diagonal is the brightest one at the row and the column it locates at. This phenomenon means that, for any category, its category-specific dictionary learned by our method tends to generate relatively low errors for the features of this category and relatively high errors for the features of other categories.
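The quantity displayed in this comparison can be computed as sketched below, assuming the average errors are already collected in a matrix whose entry (i, j) is the error of the ith category's features under the jth category-specific dictionary. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def error_gap_matrix(E):
    """Row-wise gap max_k E[i, k] - E[i, j]; larger (brighter) means lower error."""
    return E.max(axis=1, keepdims=True) - E

# Toy average-error matrix for C = 3 categories.
E = np.array([[0.1, 0.3, 0.4],
              [0.5, 0.2, 0.6],
              [0.7, 0.8, 0.3]])
gap = error_gap_matrix(E)
```

In this toy matrix every diagonal entry of `gap` is the largest value in both its row and its column, which mirrors the pattern described for the real data.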

D. EFFECTIVENESS VALIDATION
In this section, we evaluate whether classification accuracy is improved after applying our method. Table 3 reports the experimental results. As shown, for all the coding methods, the category-specific dictionaries learned by our method yield better accuracies than the universal dictionaries on the three datasets. Although the different coding methods lead to different classification accuracies, all of the accuracies improve slightly after applying our method. This demonstrates the universality and effectiveness of our method. Among these coding methods, aLLC always obtains the largest gain since it has the largest reconstruction error drop (as shown in Table 2).
Based on sparse coding, we further investigate the classification accuracy of each category in Caltech-101. Fig. 5 reports the two accuracies of each category, obtained with the universal dictionary and with the category-specific dictionaries, respectively. As shown, the accuracies of some categories improve noticeably, such as "ant" (5.6%) and "anchor" (5.6%), but not all categories improve. Some categories show a noticeable drop, such as "crab" (−4.4%) and "platypus" (−6.4%). Overall, the large accuracy improvements are achieved mostly on categories with low accuracies (e.g., less than 60%). By analyzing the confusion matrix, we note that the category "crab" becomes more confused with the category "pizza" after applying our method. Fig. 6 lists some similar images of the two categories. The images of the category "crab" resemble those of the category "pizza" to some extent: each of these images contains an elliptical shape. This means that they share some similar local features, which are extracted on the edges of the ellipses. In this case, the category-specific dictionary $B^{c_{\mathrm{crab}}}$ refined for the category "crab" also reduces the reconstruction errors of the local features from the edges of ellipses. In other words, the edges of the ellipses in the "pizza" images (restored from the coding vectors obtained with $B^{c_{\mathrm{crab}}}$) become clearer, while the interior zones of the ellipses become blurred. Consequently, the restored "pizza" images become more similar to some "crab" images, which reduces the difference between the category "crab" and the category "pizza".

E. COMPARISON WITH A COMMON METHOD
There is a common method (Com. Method) that follows a very similar idea to our method. In this method, the local features from the training images of each category are used to learn the category-specific dictionary of this category by directly solving formula (2). The accuracies obtained by the common method on the three datasets are reported in Table 4. In addition, the average reconstruction error of local features is calculated as in Section IV(C). LLC and LSC are not adopted in this section because of their huge computation time. As shown, the highest accuracy always corresponds to the lowest error. This means that reducing the reconstruction errors of local features is beneficial to image classification tasks. Moreover, the category-specific dictionary (com. method) results in the best accuracy on Caltech-101 but performs worse than the category-specific dictionary (our method) on Scene-15. We also note that, for Scene-15, although the error (0.387) achieved by the category-specific dictionary (com. method) is smaller than the one (0.394) achieved by the universal dictionary, its corresponding accuracy (82.98%) is still lower than the one (83.00%) achieved by the universal dictionary. This implies that reducing reconstruction error alone does not necessarily improve classification accuracy. The errors reported in Fig. 7 further support this conclusion: the highest accuracy of each category in Scene-15 does not always correspond to the lowest reconstruction error.
Here, we report in Fig. 7 the three classification accuracies of each category in Scene-15 when sparse coding is used as the coding method. As shown, our method leads to better results than the common method on the indoor categories (e.g., PARoffice, bedroom, kitchen, living room, and store), whose images include a number of similar local features. This can be explained as follows. In our method, when the local features of a category are encoded by the category-specific dictionary of this category (or of any other category), the visual words indicated by their indices are refined specifically to decrease (or increase) their overall reconstruction error. As shown in Fig. 2, for the representative local feature $f$ of the category $c_2$, the lowest error is obtained when it is encoded by the category-specific dictionary of the category $c_2$.
In contrast, the common method does not have this trait. Compared with the accuracies obtained with the universal dictionary, the accuracies obtained by the common method decrease slightly on the categories "PARoffice", "industrial", "kitchen", and "store". The reason is that the images of these categories include a large number of similar features, so the dictionaries learned for these categories tend to reduce the reconstruction errors of these similar features, which in turn increases the similarity among these categories. On the other hand, the common method performs better than our method on some categories, such as MITforest and MITmountain. Each of these two categories has a visual appearance that is clearly different from those of the other categories. In other words, the spatial distributions of the local features of the two categories both have a low mixability with the distributions of the features of other categories. Based on this observation, we infer that, if the spatial distributions in feature space of the features from positive samples and the features from negative samples have a low mixability, the dictionary learned by the common method for positive samples is better at reducing the reconstruction errors of the features from positive samples. As shown in Fig. 8, the category-specific dictionary learned by the common method gives rise to the lowest reconstruction errors for the features (denoted by the small triangles) from positive samples. In contrast, in our method, the indices of the visual words used for encoding a local feature do not change during the refining process, which prevents the refined dictionary from achieving a lower error.
We also report in Fig. 9 the three classification accuracies of each category in Caltech-101. In comparison to the universal dictionary, the category-specific dictionary learned by our method increases the accuracies of 33 categories but decreases the accuracies of 14 categories, while the dictionary learned by the common method increases the accuracies of 45 categories but decreases the accuracies of 26 categories. Overall, although our method is more stable than the common method, the average classification accuracy of the common method is higher than that of our method. This can be explained as follows. For most of the categories in the object recognition dataset Caltech-101, each image includes only one object, as shown in Fig. 10. This means that the spatial distributions of the local features of these categories in feature space have relatively low mixability compared with Scene-15. In this case, the common method can achieve low reconstruction errors for the features highly relevant to these object categories. Consequently, the common method performs better on Caltech-101 than on Scene-15. Nevertheless, some categories have a considerable accuracy drop after applying the common method, as shown in Fig. 10. Our analysis shows that these categories have a large intra-category visual diversity (e.g., mayfly), a large change in viewpoint (e.g., water lily), or only a small number of key features highly relevant to the object category (e.g., strawberry). This means that, for any one of these categories, the dictionary learned using the features from training images is not capable enough of reducing the errors of the features from testing images. As a result, the images (restored from coding vectors) of this category and of other categories all become more blurred, which increases the similarity of this category to the others.
From the above analysis, we conclude that, although our method and the common method follow very similar ideas for learning category-specific dictionaries, they have different performance characteristics depending on the mixability of the two spatial distributions of the features from positive samples and the features from negative samples. By and large, our method can perform better than the common method when the mixability is high, and the common method can perform well when the mixability is low.
FIGURE 11. Comparison of the computation time spent on dictionary learning and feature coding. Fig. 11(a) compares the time used for learning a universal dictionary and the time used for refining the universal dictionary. Fig. 11(b) reports the average time spent on encoding the features from an image by a universal dictionary (solving formula (1)) and by a category-specific dictionary (solving formula (6)). Fig. 11(c) lists the time spent on learning a universal dictionary, on learning the category-specific dictionaries of all the categories by our method, and on learning the category-specific dictionaries of all the categories by Com. Method. Fig. 11(d) reports the average time consumed on encoding the features from an image by the universal dictionary, by all the category-specific dictionaries learned by our method, and by all the category-specific dictionaries learned by Com. Method.

F. COMPUTATION TIME
In this section, the computation time spent on dictionary learning and feature coding is investigated separately on Scene-15 when sparse coding is used as a coding method. The results are shown in Fig. 11.
As shown in Fig. 11(a) and Fig. 11(b), learning a category-specific dictionary takes much less time than learning a universal dictionary, and encoding features with a category-specific dictionary also takes noticeably less time than encoding them with a universal dictionary. The reason is that, when refining a universal dictionary (i.e., learning a category-specific dictionary) and when encoding local features, small dictionaries are constructed according to the indices of the features to solve for the coding coefficients. However, our method needs to learn a category-specific dictionary for each category and then encode features with all the category-specific dictionaries. For our method, the time spent on dictionary learning is the sum of the time for universal dictionary learning and the time for category-specific dictionary learning, and the time spent on encoding the features of an image is the sum of the time for obtaining the indices of the features and the time for encoding the features with all the category-specific dictionaries. Hence, the total time consumed by our method on dictionary learning and feature coding increases considerably, as shown in Fig. 11(c) and Fig. 11(d). Nonetheless, the time required by our method is much less than that required by Com. Method. The reason is that, for Com. Method, learning a category-specific dictionary for each category takes almost as long as learning a universal dictionary, and encoding with a category-specific dictionary takes almost as long as encoding with a universal dictionary. For both our method and the common method, the time spent on dictionary learning and feature coding increases linearly with the number of categories.
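The idea of solving for coding coefficients on a small dictionary selected by the indices of a feature can be sketched as follows. This is only an illustration under a strong assumption: the coefficients are obtained by unconstrained least squares on the selected visual words, whereas formula (6) may carry additional constraints or regularization that depend on the coding method. The function name `encode_on_support` and the toy dictionary are hypothetical.

```python
import numpy as np

def encode_on_support(f, B_c, x_universal):
    # Indices of the visual words that were active (non-zero) when the
    # feature was encoded with the universal dictionary.
    support = np.flatnonzero(x_universal)
    x = np.zeros(B_c.shape[1], dtype=float)
    if support.size == 0:
        return x
    # Small dictionary: only the selected columns of the
    # category-specific dictionary B_c.
    B_small = B_c[:, support]
    # Assumption: plain least squares stands in for formula (6).
    coeffs, *_ = np.linalg.lstsq(B_small, f, rcond=None)
    x[support] = coeffs
    return x

# Toy dictionary with 4 visual words in a 3-dimensional feature space.
B_c = np.array([[1.0, 0.0, 0.0, 1.0],
                [0.0, 1.0, 0.0, 1.0],
                [0.0, 0.0, 1.0, 1.0]])
f = np.array([2.0, 0.0, 3.0])
x_universal = np.array([0.5, 0.0, 0.7, 0.0])  # active words: 0 and 2

x_c = encode_on_support(f, B_c, x_universal)
print(x_c)
```

Because the least-squares problem involves only as many columns as there are active visual words, rather than the full dictionary, each per-dictionary encoding is cheap, which is consistent with the timing behaviour described for Fig. 11(a) and Fig. 11(b).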

G. ACCURACY COMPARISON WITH OTHER METHODS
In this section, we compare the classification accuracies achieved by our method and other methods, including deep learning methods (e.g., [34], [36], [38], [39]) and BoVW methods (e.g., [27], [35]). The results of these methods are obtained without the help of transfer learning, i.e., without using a pre-trained CNN to extract image features. As shown in Table 3 and Tables 5-7, the classification performance of our method relies on the performance of the coding method it adopts. Our method achieves high accuracies on the three datasets when LSC is adopted as the coding method. Although the accuracies obtained by our method are not the most competitive among the compared methods, our method can easily be combined with a number of BoVW methods to obtain higher classification accuracy (as discussed in Section IV(H)). We also note that the deep learning methods do not achieve an obvious improvement over the BoVW methods, owing to the lack of training data.

H. DISCUSSION
The characteristics of our method are listed as follows:
• Our method can be used as a universal method to improve the classification accuracies of many reconstruction-based coding methods at the cost of additional yet acceptable computation time. Besides the four coding methods adopted in our work (i.e., sparse coding, aLLC, LLC, and LSC), many other coding methods, such as extensions of LSC [25], nonnegative sparse coding [33], hierarchical sparse coding [40], and local coordinate coding [44], can theoretically also gain accuracy by applying our method.
• Despite the universality and effectiveness of our method, it incurs additional computation time for dictionary learning and feature coding, as shown in Fig. 11. However, the time required by our method is much less than that required by the common method. This trait ensures the practicability of our method.
• Although the classification accuracies obtained by our method are not the most competitive in comparison to some existing methods, our method can easily be combined with a number of BoVW methods to achieve higher classification accuracy. The reason is that it is just a reinforcement method tailored for reconstruction-based coding methods, and it is involved only in the stages of dictionary learning and feature coding. Therefore, in addition to employing more advanced coding methods, advanced methods focusing on feature extraction, feature description, and feature pooling can also be used jointly to further improve the classification accuracy. Besides, works on Analysis Dictionary Learning (ADL) can be applied to the image representation vectors obtained by our method, resulting in more discriminative image representation vectors.
• Some works can benefit from our method owing to the relatively small computation time it spends on category-specific dictionary learning. For example, the method proposed in [27] achieves results comparable to many advanced BoVW methods, but it suffers from the huge computation time spent on category-specific dictionary learning. In that method, the category-specific dictionary of each category is learned by K-means or K-SVD using the local features of this category. Therefore, our method can be used to replace the category-specific dictionary learning method adopted in [27], thereby improving the practicability of [27].
Besides, some methods such as [45] and [46] learn block-specific dictionaries for the blocks divided by SPM. For each block, its block-specific dictionary is learned by applying a reconstruction-based method to the local features extracted in this block. Similarly, the time consumed on block-specific dictionary learning can be reduced by a modified version of our method, in which the universal dictionary is refined for each block instead of for each category.

V. CONCLUSION
In this article, we proposed a category-specific dictionary learning method tailored for reconstruction-based feature coding. The category-specific dictionary of each category is learned by refining a universal dictionary using the local features of this category. When a local feature is encoded by a category-specific dictionary, the visual words for encoding it are decided in advance by the indices that correspond to the non-zero elements of its coding vector obtained with the universal dictionary. Our results on three small datasets show that the classification accuracies of four representative coding methods are improved by about 0.3% to 2.7%, which experimentally demonstrates the universality and effectiveness of our method. Furthermore, by comparing our method with a common method, we found that they have different performance characteristics depending on the mixability of the two spatial distributions of the features from positive samples and the features from negative samples. By and large, our method can perform better than the common method when the mixability is high, and the common method can perform well when the mixability is low. Our future work includes: 1) taking into account the discriminability of local features when refining a universal dictionary, so that features with high discriminability have low reconstruction errors; 2) applying the idea of our method to transfer learning.