Learning Contour-Based Mid-Level Representation for Shape Classification

This article proposes a novel contour-based mid-level shape description method for shape classification. The method resolves the shortcomings of low-level shape descriptors in dealing with object shapes that exhibit large intra-class changes and nonlinear deformation (articulation, occlusion and noise), thus improving the accuracy of shape classification. First, we extract the outer contour of an object and sample it. Next, we describe each sampling point on the shape contour with a triangular feature and regard it as a local feature. Then, a shape codebook is learned, and the Fisher vector encoding method is used to produce a compact mid-level shape feature. Finally, the learned mid-level shape features are fed to a linear support vector machine (SVM) classifier for shape recognition. The proposed method has been extensively tested on several standard shape datasets, and the experimental results show that our approach attains high shape classification accuracy. Comparisons with other state-of-the-art shape classification approaches further demonstrate the superiority and effectiveness of our method.


I. INTRODUCTION
Shape is an important visual cue of a target object, and it is an important basis for the human visual system to recognize and classify objects. Unlike other image features, shape remains stable under illumination changes and other imaging conditions. Because of these advantages, shape-based object recognition has always been a hot topic in computer vision and image processing [1]-[6], and it has been widely used in text analysis, neuroscience, agriculture, biomedicine and engineering technology [7]. Shape recognition is generally formulated as a classification task: given a set of training shapes, the goal is to predict the class label of a test shape. The main challenge in shape recognition is extracting a discriminative shape feature of the target object that is invariant to geometric changes (rotation, translation and scale), intra-class variation and nonlinear deformation of the shape.
Although many shape representation approaches have been presented over the past decade, they can mainly be divided into contour-based and region-based approaches [8]-[10]. The contour-based methods use the sampling points on the contour of an object to represent the shape, such as curvature scale space [11], the shape context (SC) [1], the shape tree [12], hierarchical string cuts (HSCs) [13], and multiscale triangular centroid distance (MTCD) [14]. The region-based methods employ the interior region of the contour to represent the shape, such as Zernike moments [15], generic Fourier descriptors [16], polar harmonic transforms [17], and multi-scale bisector integrals [18]. In the past few years, there has been more research activity on contour-based shape description methods than on region-based ones. There are two main reasons for this: human beings can readily recognize an object from its contour information alone, and the contour of a shape is more important than the inner region in many applications, such as contour-based object detection [19], [20]. Thus, this article primarily constructs the mid-level representation based on the contour information of a shape.

FIGURE 1. The pipeline of building the mid-level shape representation by the proposed approach: (a) original shape; (b) the outer contour (black boundary) of the shape (a) and the sampling points of the shape contour (red points), which are usually taken as the input of a contour-based shape recognition approach; (c) the triangular feature of each point of the shape contour; and (d) the final mid-level shape representation.
However, the above shape descriptors are low-level representations, and their main disadvantage is that they cannot handle the shapes of objects with large intra-class variation and nonlinear deformation. In addition, for shape recognition these methods use one-to-one matching, such as the dynamic programming (DP) approach, to obtain the matching cost of every pair of shapes; they then adopt the nearest neighbor (NN) classifier to identify the shapes. The main drawback of this recognition scheme is that it is extremely time consuming when the number of training samples is very large. Unlike the above low-level shape description methods, we develop a mid-level shape descriptor to deal with large changes of an object's shape. In recent years, there has been some related work on the construction of mid-level features for shape-based object recognition [21]-[23]. Of these, the method most similar to the work of this article is that of [22], which is also based on the contour information of a shape and uses the bag of words (BOW) model to construct the mid-level shape representation. However, there are four main differences between the proposed method and the method in [22]. First, because the method in [22] is based on the spatial pyramid matching (SPM) strategy, it needs to normalize the shapes using the major axis to keep the representation invariant to rotation. However, due to the large variation in object shapes, the major-axis method is unstable for shape normalization and therefore affects the accuracy of shape recognition. Second, our method does not need to divide the contour into segments; we use each sampling point of the shape contour as a local feature. This is because the shape of an object has an emergent property that becomes apparent only once all the object boundary contours have been grouped.
If we divided a complete shape contour into a number of contour segments, a large number of redundant segments would be generated, and the accuracy of shape recognition would be affected. Third, our local feature description method differs from the shape context used in [22]: we adopt a triangular feature description, which describes the global and local information of the shape contour very well and has a lower feature dimension. Finally, unlike the BOW model used in [22] to construct mid-level features, the proposed method adopts the Fisher vector (FV) encoding method [24], [25], which describes the first-order and second-order statistics of the data. The shape descriptor obtained by this encoding method has better shape discrimination ability, thereby improving the performance of shape classification. In brief, the proposed mid-level shape description method is simpler and more straightforward and can describe the shape information of objects very effectively.
The pipeline of the proposed contour-based mid-level shape representation method is shown in Fig. 1. First, for a binary image, we extract the outer contour of an object's shape and sample it, as shown in Fig. 1(b). Second, we use the triangular feature to characterize each point of the shape boundary, as shown in Fig. 1(c). This shape descriptor adequately captures the fine details and global information of a shape, and it is invariant to translation, scale and rotation. Third, a Gaussian mixture model (GMM) is learned to obtain a shape codebook, the FV encoding method is adopted to obtain the mid-level shape representation, and power normalization and L2 normalization are subsequently performed on it. The final mid-level shape representation is shown in Fig. 1(d). Finally, the learned mid-level shape feature is fed into the linear support vector machine (SVM) classifier for shape recognition.
The rest of the article is structured as follows. We briefly review some related work in Section II. In Section III, we introduce the proposed contour-based mid-level shape representation method in detail. Section IV provides the experimental results of our approach on some shape datasets. In Section V, we present our conclusions.

II. RELATED WORK
Many shape classification approaches have been presented over the past decade [2], [22], [23], [26]- [35]. These methods are divided into exemplar-based and model-based methods. Below, we briefly review the most important of these methods.
The exemplar-based methods [2], [26], [28], [30], [31], [33], [35] mainly extract informative and robust shape descriptors, then adopt the shape matching method to compute the similarity between two shapes, and finally classify them according to the NN classifier. For example, Ling [2] proposed an inner distance shape context (IDSC) descriptor for shape classification. This descriptor is an extension of the SC descriptor, which uses the inner distance instead of the Euclidean distance of the SC to better capture the articulation changes of objects. Daliri and Torre [28] converted the contour points of a shape into symbolic representations, adopted the edit distance to compare the similarity between two symbols, and finally classified the shapes according to the distance values. Wang et al. [31] proposed a height function descriptor for shape matching and classification. This method adopted a dynamic programming matching approach to calculate the dissimilarity between two shapes and then classified the shapes based on the calculated distance. To better deal with the large deformation of objects, some researchers used the skeleton description method to identify shapes. For instance, Sebastian et al. [26] presented a shape recognition framework based on a shock graph description. This method is very effective for detecting the visual deformation of shapes. Macrini et al. [30] proposed a shape matching method based on bone graphs and achieved good shape recognition performance. To further improve the performance of shape classification, some new exemplar-based methods have been proposed. For example, Bicego and Lovato [33] presented a bioinformatics method for 2D shape classification. They first converted the 2D shape matching into a sequence matching problem, and then used the sequence matching approach in bioinformatics to calculate the dissimilarity between the two shapes. Finally, the classification task of shapes was completed using the NN framework. 
In [35], a hexagonal grid-based triangulated feature descriptor was proposed for shape retrieval and classification. This method achieved better shape retrieval and classification performance and had lower computational complexity. However, because the exemplar-based methods use low-level shape representation, they cannot handle complexly deformed objects very well, and at the same time they have very high matching cost when the amount of data in the training set is quite large.
The model-based approaches [22], [23], [27], [29], [36], [37] are not based solely on matching every pair of shapes but instead learn a classification model to identify the shape. For example, Sun and Super [27] presented a Bayesian approach that employed the normalized contour segments of a shape for shape recognition. Bai et al. [29] used a Gaussian mixture model to integrate the skeleton and contour information of a shape, and then trained a generative model for shape recognition. Daliri and Torre [36] presented a kernel-based shape recognition method. They first converted the sample points on the contour into symbol strings, then used the edit distance to evaluate the dissimilarity between the symbol strings. Next, they converted the distance to an appropriate kernel before adopting the SVM classifier for shape recognition. Wang et al. [22] presented a mid-level shape description method called bag of contour fragments (BCF) for shape classification. This method uses the BOW model to build the mid-level shape representation and attains better recognition accuracy in shape classification. Ramesh et al. [23] proposed a shape classification method based on invariant characteristics and context information in the BOW model. First, they used the spectral magnitude of the log-polar transformation as a local feature. Next, they integrated the context information into the BOW framework and proposed a method to select the appropriate codebook size in the BOW model. Finally, they trained the SVM classifier to identify shapes. Shen et al. [37] proposed a learnable pooling function to effectively combine the features of the BCF and bag of skeleton-associated contour parts (BoSCP) methods. Then, they adopted the SVM classifier for shape recognition based on this combination of features.
Recently, deep learning methods have attained significant success in computer vision [38], [39], speech recognition [40], [41], natural language processing [42] and other fields [43]. Some researchers have begun to adopt deep learning for 2D shape analysis and recognition [32], [34], [44]. Eslami et al. [44] presented a Shape Boltzmann Machine method for shape recognition. To model binary shapes, they used the Deep Boltzmann Machine, which can produce very realistic shape samples that differ from the training samples; as a result, this method has strong generalization ability. Ke and Li [32] proposed a convolutional neural network (CNN) method to learn robust high-level shape features and used it to handle the rotation issue of objects in shape recognition. Li et al. [34] introduced a deep neural network optimization method that uses stochastic gradient Markov chain Monte Carlo (SG-MCMC) to learn the weight uncertainty in a deep neural network, providing accurate estimates of the uncertainty of the model; with this method, they achieved excellent performance in both 2D and 3D shape classification. However, the deep learning method also has several shortcomings, such as the need for a great deal of training data, the need for substantial computing resources for model learning, and the poor interpretability of the learned model. The proposed method overcomes these shortcomings: it requires only a small amount of data to learn, the model learning requires few computing resources, and the learned model has better interpretability. Furthermore, our subsequent experiments show that the deep learning method is not the most effective method for 2D shape recognition and that our proposed method achieves the best performance in the shape classification task, superior not only to traditional shape classification methods but also to deep learning-based ones.

III. PROPOSED CONTOUR-BASED MID-LEVEL SHAPE REPRESENTATION METHOD
This section describes the proposed contour-based mid-level representation method for shape classification in detail. First, we define the triangular feature description of the shape contour sampling points and regard it as a local feature. Next, we describe how the FV is used to build the mid-level shape representation based on the triangular features. Finally, we use the linear SVM classifier for shape recognition based on the constructed mid-level features.

A. DEFINITION OF TRIANGULAR FEATURE REPRESENTATION
The triangular feature description of the contour sampling points introduced in this subsection is based on our previous work [45]. However, unlike the method in [45], we do not use the final multiscale Fourier descriptor obtained via the Fourier transform; we use only the triangular features of the shape contour sampling points. A brief explanation of these triangular features follows.
For a binary shape image, we use an edge-tracking algorithm to obtain the outer contour S of an object, and we then uniformly sample it to acquire the contour sampling points S = {P_1, ..., P_N}. Here, N is the number of sampling points, and P_1 and P_N are the starting and end points of the contour S, respectively. Each sampling point P_i = (x_i, y_i), i ∈ [1, N], of the shape contour S is represented by T_s triangles at different scales, where T_s = ⌊log_2(N/2)⌋ is the number of triangles and ⌊·⌋ denotes the floor operation. Fig. 2 shows this more specifically. Each sampling point P_i can thus be represented by the T_s triangles Tr_i^1, Tr_i^2, ..., Tr_i^{T_s}, where the triangle Tr_i^k is formed by the contour points P_{i-l(k)}, P_i and P_{i+l(k)} (with indices taken modulo N). Here, l(k) is the logarithmic distance between sampling points of the contour S; l is an increasing function with l(T_s) < N/2, and in our experiments we set l(k) = 2^{k-1} (cf. Fig. 2). Three triangular features are used to characterize the sampling point P_i: TAR(Tr_i^k) denotes the signed area of the triangle Tr_i^k; TCD(Tr_i^k) is the distance from the point P_i to the centroid of the triangle Tr_i^k; and TASL(Tr_i^k) is composed of two side lengths and the intersection angle at the point P_i of the triangle Tr_i^k.

Let (x_{i-l(k)}, y_{i-l(k)}), (x_i, y_i) and (x_{i+l(k)}, y_{i+l(k)}) be the coordinates of the points P_{i-l(k)}, P_i and P_{i+l(k)}, respectively. First, we compute the signed area of the triangle Tr_i^k:

$$\mathrm{TAR}(Tr_i^k) = \frac{1}{2}\Big[x_{i-l(k)}\big(y_i - y_{i+l(k)}\big) + x_i\big(y_{i+l(k)} - y_{i-l(k)}\big) + x_{i+l(k)}\big(y_{i-l(k)} - y_i\big)\Big]$$

The TAR value characterizes the concave and convex properties of a shape. Next, we calculate the centroid distance TCD(Tr_i^k):

$$\mathrm{TCD}(Tr_i^k) = \sqrt{\big(x_i - x_{g_{ik}}\big)^2 + \big(y_i - y_{g_{ik}}\big)^2}$$

where (x_{g_{ik}}, y_{g_{ik}}) are the coordinates of the centroid g_{ik} of the triangle P_{i-l(k)} P_i P_{i+l(k)}, with i ∈ [1, N] and k ∈ [1, T_s]:

$$x_{g_{ik}} = \frac{x_{i-l(k)} + x_i + x_{i+l(k)}}{3}, \qquad y_{g_{ik}} = \frac{y_{i-l(k)} + y_i + y_{i+l(k)}}{3}$$

Lastly, we calculate the two side lengths and the intersection angle at P_i of Tr_i^k. Let L_i^{1k}, L_i^{2k} and L_i^{3k} denote the lengths of the three sides of the triangle Tr_i^k, arranged from small to large (L_i^{1k} ≤ L_i^{2k} ≤ L_i^{3k}). The intersection angle α_i^k at the point P_i of Tr_i^k follows from the law of cosines applied to the two sides meeting at P_i:

$$\alpha_i^k = \arccos\!\left(\frac{\|P_iP_{i-l(k)}\|^2 + \|P_iP_{i+l(k)}\|^2 - \|P_{i-l(k)}P_{i+l(k)}\|^2}{2\,\|P_iP_{i-l(k)}\|\,\|P_iP_{i+l(k)}\|}\right)$$

and the TASL feature at scale k collects the two sorted side-length ratios and the normalized angle:

$$\mathrm{TASL}(Tr_i^k) = \left(\frac{L_i^{1k}}{L_i^{3k}},\; \frac{L_i^{2k}}{L_i^{3k}},\; \frac{\alpha_i^k}{\pi}\right)$$

From the definition of the above three triangular features, their sizes are T_s, T_s and 3 × T_s, respectively. Therefore, the multiscale triangular feature (MTF) of the point P_i is

$$\mathrm{MTF}(P_i) = \Big[\mathrm{TAR}(Tr_i^1), \ldots, \mathrm{TAR}(Tr_i^{T_s}),\ \mathrm{TCD}(Tr_i^1), \ldots, \mathrm{TCD}(Tr_i^{T_s}),\ \mathrm{TASL}(Tr_i^1), \ldots, \mathrm{TASL}(Tr_i^{T_s})\Big]$$

The size of MTF(P_i) is 1 × L, where L = 5 × T_s. As a result, we obtain the multiscale triangular feature of the shape S:

$$\mathrm{MTF}(S) = \big[\mathrm{MTF}(P_1);\ \mathrm{MTF}(P_2);\ \ldots;\ \mathrm{MTF}(P_N)\big]$$

The size of MTF(S) is N × L, where row i is the feature MTF(P_i) of the sampling point P_i of S. Next, we analyze the geometric invariance (translation, scale and rotation) of the MTF. The TAR and TCD descriptors remain unchanged under translation and rotation of a shape but vary with scale. The TASL descriptor remains unchanged under translation, rotation and scaling of a shape. To keep the included angle of the TASL descriptor in the interval [0, 1], we normalize it by π. Finally, we normalize the TAR and TCD using the local normalization method given in [46].
The calculation formula is

$$\overline{\mathrm{TAR}}(Tr_i^k) = \frac{\mathrm{TAR}(Tr_i^k)}{\max_{j \in [1,N]} \big|\mathrm{TAR}(Tr_j^k)\big|}, \qquad \overline{\mathrm{TCD}}(Tr_i^k) = \frac{\mathrm{TCD}(Tr_i^k)}{\max_{j \in [1,N]} \big|\mathrm{TCD}(Tr_j^k)\big|}$$

That is, we normalize the TAR and TCD descriptors by dividing by the maximum absolute value at each scale. After normalization, all values of the MTF matrix lie between -1 and 1. We thus obtain the triangular feature of each sampling point of the shape boundary, which is employed as the local feature for the subsequent construction of the mid-level shape representation. Moreover, the descriptor has four advantages: (1) it is invariant to the geometric transformations of a shape; (2) its multiscale structure describes both the local and global features of a shape; (3) its compact feature description requires little memory to store; (4) it is computed from neighboring contour points and therefore encodes the spatial layout information between sampling points. This last property is why we do not need the SPM method to add spatial layout information in the subsequent construction of the mid-level shape representation.
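To make the construction concrete, the following NumPy sketch computes the MTF matrix of a sampled closed contour. This is an illustration, not the authors' code: the function name, the modulo indexing of the closed contour, the small epsilon guards, and the ratio form of TASL are our assumptions based on the definitions above.

```python
import numpy as np

def multiscale_triangular_feature(contour):
    """Sketch of the multiscale triangular feature (MTF) for a closed
    contour of N uniformly sampled points, given as an (N, 2) array."""
    N = len(contour)
    Ts = int(np.floor(np.log2(N / 2)))        # number of triangle scales
    idx = np.arange(N)
    tar = np.empty((N, Ts))
    tcd = np.empty((N, Ts))
    tasl = np.empty((N, 3 * Ts))
    for k in range(1, Ts + 1):
        l = 2 ** (k - 1)                      # l(k) = 2^(k-1)
        a = contour[(idx - l) % N]            # P_{i-l(k)}
        b = contour                           # P_i
        c = contour[(idx + l) % N]            # P_{i+l(k)}
        # TAR: signed triangle area
        area = 0.5 * (a[:, 0] * (b[:, 1] - c[:, 1])
                      + b[:, 0] * (c[:, 1] - a[:, 1])
                      + c[:, 0] * (a[:, 1] - b[:, 1]))
        # TCD: distance from P_i to the triangle centroid
        g = (a + b + c) / 3.0
        cdist = np.linalg.norm(b - g, axis=1)
        # TASL: sorted side lengths and angle at P_i (law of cosines)
        s1 = np.linalg.norm(b - a, axis=1)    # side P_i P_{i-l(k)}
        s2 = np.linalg.norm(b - c, axis=1)    # side P_i P_{i+l(k)}
        s3 = np.linalg.norm(a - c, axis=1)    # opposite side
        cosang = (s1**2 + s2**2 - s3**2) / (2 * s1 * s2 + 1e-12)
        ang = np.arccos(np.clip(cosang, -1.0, 1.0)) / np.pi
        lens = np.sort(np.stack([s1, s2, s3], axis=1), axis=1)
        tar[:, k - 1] = area
        tcd[:, k - 1] = cdist
        # assumed scale-invariant form: two length ratios + angle/pi
        tasl[:, 3 * (k - 1):3 * k] = np.column_stack(
            [lens[:, 0] / (lens[:, 2] + 1e-12),
             lens[:, 1] / (lens[:, 2] + 1e-12), ang])
    # per-scale normalization of TAR and TCD by the max absolute value
    tar /= np.abs(tar).max(axis=0, keepdims=True) + 1e-12
    tcd /= np.abs(tcd).max(axis=0, keepdims=True) + 1e-12
    return np.hstack([tar, tcd, tasl])        # shape (N, 5 * Ts)
```

For N = 1024 this yields Ts = 9 scales and a 45-dimensional local feature per sampling point, matching L = 5 × T_s in the text.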

B. CONSTRUCTION OF THE CONTOUR-BASED MID-LEVEL SHAPE REPRESENTATION
After extracting the local features of a shape, we use the FV to build the mid-level shape representation. The FV was first proposed for the recognition of natural images [24], [25], where it achieves good classification accuracy. Below, we briefly describe how this method is applied in 2D shape analysis. For a given shape contour S, we obtain its multiscale triangular local features S = {f_t ∈ R^L, t = 1, 2, ..., N}. We assume that S is generated by a probability density function u_λ with parameter λ. Following [24], [25], we choose u_λ to be a GMM:

$$u_\lambda(f) = \sum_{i=1}^{K} w_i\, u_i(f)$$

Here, w_i, μ_i and Σ_i represent the mixture weight, the mean vector and the covariance matrix of the Gaussian u_i, respectively. We further assume that Σ_i is a diagonal matrix and denote its variance vector by σ_i^2. The GMM u_λ is obtained by maximum likelihood (ML) estimation, trained on a set of local shape features from the training shapes. Let γ_t(i) be the soft assignment of the local feature f_t to the Gaussian i:

$$\gamma_t(i) = \frac{w_i\, u_i(f_t)}{\sum_{j=1}^{K} w_j\, u_j(f_t)} \qquad (10)$$

Since the dimension of the local feature f_t is L, we obtain two L-dimensional gradients with respect to the mean μ_i and the standard deviation σ_i, respectively:

$$G^S_{\mu,i} = \frac{1}{N\sqrt{w_i}} \sum_{t=1}^{N} \gamma_t(i)\,\frac{f_t - \mu_i}{\sigma_i}, \qquad G^S_{\sigma,i} = \frac{1}{N\sqrt{2w_i}} \sum_{t=1}^{N} \gamma_t(i)\left[\frac{(f_t - \mu_i)^2}{\sigma_i^2} - 1\right]$$

where the division between vectors is element-wise. We then concatenate the gradient vectors G^S_{μ,i} and G^S_{σ,i} (i = 1, ..., K) to form the mid-level shape representation FS^S_λ, whose dimension is 2KL. Next, we apply power normalization [24] through an element-wise nonlinear operation to suppress the problem of burstiness. For each component fs_i (i = 1, ..., 2KL) of the mid-level shape representation FS^S_λ, the element-wise nonlinear operation is defined as

$$fs_i \leftarrow \mathrm{sign}(fs_i)\,|fs_i|^{\alpha}$$

where the value of α lies in the interval [0, 1]. In the experiments, we set α = 0.2.
Then, we apply L2 normalization to obtain the final mid-level shape representation used for shape classification.
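The encoding stage can be sketched as follows using scikit-learn's diagonal-covariance GMM as the codebook. The function name and the use of `GaussianMixture` are our choices for illustration, but the gradient, power-normalization and L2 steps follow the formulas above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(features, gmm, alpha=0.2):
    """Minimal Fisher vector of an (N, L) local-feature matrix under a
    diagonal-covariance GMM, with power and L2 normalization."""
    N, L = features.shape
    K = gmm.n_components
    gamma = gmm.predict_proba(features)       # soft assignments, (N, K)
    w, mu = gmm.weights_, gmm.means_          # (K,), (K, L)
    sigma = np.sqrt(gmm.covariances_)         # (K, L) for diagonal covariances
    G_mu = np.empty((K, L))
    G_sigma = np.empty((K, L))
    for i in range(K):
        diff = (features - mu[i]) / sigma[i]  # element-wise division
        G_mu[i] = (gamma[:, i, None] * diff).sum(0) / (N * np.sqrt(w[i]))
        G_sigma[i] = ((gamma[:, i, None] * (diff**2 - 1)).sum(0)
                      / (N * np.sqrt(2 * w[i])))
    fv = np.concatenate([G_mu.ravel(), G_sigma.ravel()])  # length 2*K*L
    fv = np.sign(fv) * np.abs(fv) ** alpha    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)  # L2 normalization
```

In practice, a `GaussianMixture(covariance_type='diag')` model would be fitted on the pooled rows of the training shapes' MTF matrices to play the role of the shape codebook.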

C. SHAPE CLASSIFICATION USING MID-LEVEL FEATURE
The resulting contour-based mid-level shape feature is a vector, so we can employ the linear SVM to classify shapes directly. We adopt the multi-class SVM formulation introduced by Crammer and Singer [47]. Let {(f_i, y_i)}_{i=1}^{M} be the training set, composed of M shapes from C categories, where f_i and y_i represent the contour-based mid-level feature and the category label of the i-th shape, respectively. Next, we train a multi-class SVM classifier:

$$\min_{w_1, \ldots, w_C}\ \sum_{l=1}^{C} \|w_l\|^2 + \beta \sum_{i=1}^{M} \max\!\big(0,\ 1 - w_{y_i}^{\mathsf T} f_i + w_{l_i}^{\mathsf T} f_i\big) \qquad (14)$$

where l_i = arg max_{l ∈ [1,2,...,C], l ≠ y_i} w_l^T f_i. In Eq. (14), the first term is the regularization term and the second term is the hinge loss; the parameter β controls the relative weight between the two terms. To solve Eq. (14), we employ the off-the-shelf SVM solver LibLinear, introduced by Fan et al. [48]. For a test shape feature f, its class label is predicted by

$$\hat{y} = \arg\max_{l \in [1, \ldots, C]}\ w_l^{\mathsf T} f$$
Here, we select the linear SVM classifier because its efficiency is relatively high; moreover, the method can be extended to large-scale shape retrieval and recognition when the linear SVM classifier is used. In addition, the accuracy of shape classification could be further improved by using nonlinear kernels, such as the intersection kernel or the Gaussian kernel function.
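As a sketch of this classification stage, scikit-learn's `LinearSVC` (which wraps LibLinear) supports the Crammer-Singer multi-class loss directly. The toy features below are hypothetical stand-ins for the 2KL-dimensional mid-level descriptors; `C` plays the role of the weight β from the objective above.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical stand-in data: two well-separated Gaussian blobs in place
# of the real mid-level shape features.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (40, 16)), rng.normal(5, 1, (40, 16))])
y_train = np.array([0] * 40 + [1] * 40)

# LibLinear-backed linear SVM with the Crammer-Singer multi-class loss.
clf = LinearSVC(multi_class='crammer_singer', C=10).fit(X_train, y_train)

# Prediction is arg max over l of w_l^T f for the learned weight vectors.
pred = clf.predict(rng.normal(5, 1, (5, 16)))
```

Swapping `LinearSVC` for a kernel SVM (e.g., `sklearn.svm.SVC`) would correspond to the nonlinear-kernel variant mentioned above, at higher computational cost.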

IV. EXPERIMENTAL RESULTS
We evaluated the classification performance of our method on several standard shape datasets and compared it with the state-of-the-art shape classification approaches discussed in this section. First, we describe the experimental setup. Then, we provide the classification results of our method on four standard shape datasets: the Animal dataset [29], the MPEG-7 dataset [49], the Swedish leaf dataset [50], and the ETH-80 dataset [51]. Next, we study the robustness of our method and analyze the influence of the parameters introduced in our method on shape classification performance. Finally, we test the performance of our method for object classification in real-world images.

A. EXPERIMENTAL SETUP
For a given shape S, we sampled N = 1024 points. Thus, the feature dimension of each sampling point of the shape was L = 5 × T_s = 5 × ⌊log_2(N/2)⌋ = 5 × 9 = 45. We regarded the feature of each sampling point as a local feature of the shape. When learning the shape codebook, the number of clustering centers K was set to 256; we also studied the effect of different numbers of clustering centers on shape classification performance. Then, we employed the FV encoding method, which yields a feature of dimension D = 2 × L × K = 2 × 45 × 256 = 23,040 for each shape. Finally, we used the linear SVM for shape classification, where the weight parameter β was set to 10 in all experiments. The algorithm was implemented in MATLAB, and the experiments were conducted on a PC with an Intel Core i7-8700K 3.7 GHz CPU and 32 GB RAM. Calculating the triangular feature descriptor of a shape took only around 28 ms. Learning the shape codebook took around 5 min, encoding the feature vector of a shape took about 10 ms, and testing a shape took around 22 ms.
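The dimension bookkeeping above can be verified in a few lines:

```python
import math

N = 1024                              # number of contour sampling points
T_s = math.floor(math.log2(N / 2))    # number of triangle scales: 9
L = 5 * T_s                           # local feature dimension: 45
K = 256                               # number of GMM clustering centers
D = 2 * L * K                         # Fisher vector dimension: 23040
```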
We also used the principal component analysis (PCA) method to reduce the dimensionality of the shape features, thereby reducing memory usage. PCA is one of the most widely used dimensionality reduction algorithms in computer vision. In the experiments, we applied PCA directly to the contour-based mid-level shape feature, reducing its dimension from D to D̃ (as shown in Table 1). The influence of different feature dimensions on shape classification performance was then analyzed. As shown later, the classification rate of the proposed contour-based mid-level feature was not greatly affected; in fact, PCA sometimes benefits the classification result due to its feature selection ability.
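The PCA step can be sketched with scikit-learn; the matrix sizes here are small hypothetical stand-ins for the M × D feature matrix (D = 23,040 in the paper, reduced to D̃ = 512 or 2048).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 1000))   # stand-in for M x D mid-level features

# Fit the orthogonal projection on training features, then map D -> D~.
pca = PCA(n_components=128).fit(features)
reduced = pca.transform(features)
```

At test time, the projection fitted on the training set would be reused via `pca.transform` on each new mid-level feature.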

B. ANIMAL DATASET
We tested the proposed approach on the Animal dataset, which was introduced by Bai et al. [29]. This dataset is composed of 2,000 shapes, which are divided into 20 categories, each containing 100 shapes. Several samples of the Animal dataset are presented in Fig. 3. Shape classification is challenging on this dataset because the shapes contain very large intra-class variations and gesture changes. For direct comparison with other methods, we used the same performance evaluation criteria as in [22], [23], [27], [29], [33]: we randomly chose 50 shape images per class for training and used the remaining shapes for testing. To avoid deviations caused by randomness, we repeated the experiment 10 times and averaged the classification accuracy. We first analyzed the effect of PCA on shape classification performance, as shown in Table 1, and then compared our approach with other shape classification approaches. Table 2 compares the experimental results of our approach with those of other state-of-the-art shape classification approaches.
We can observe from Table 1 that the classification accuracy of our method remains stable after using the PCA dimension reduction algorithm. At the same time, we can see that our method obtains the highest classification performance when the feature dimension is reduced to D̃ = 2048. A possible reason for this is that the feature selection ability of PCA is amplified by the SVM classifier. In addition, there is almost no loss of classification accuracy when the dimension is reduced to D̃ = 512. In the subsequent experiments, we therefore only report the classification accuracy when using PCA to reduce the dimension to D̃ = 512. The PCA dimension reduction algorithm thus reduces memory usage and improves the efficiency of shape classification.
We can see from Table 2 that our approach achieves the highest classification performance among all of the competing approaches except for the BoSCP-LP method [37]. The classification accuracy of our method is 4% higher than that of the deep learning method pSGLD [34]. In addition, the classification accuracy of our approach is nearly 6% higher than that of the popular model-based shape classification method BCF [22] and nearly 3% higher than that of the Contextual BOW method [23]. Our method also achieves classification performance comparable to the recently proposed model-based BoSCP-LP method [37]. The classification accuracy of our approach is 5% higher than that of the advanced exemplar-based Bioinformatics method [33]. We also see that the recognition accuracy of our method is 21% higher than that of the multiscale Fourier descriptor [45]. Although our mid-level shape feature is built on the same triangular features that underlie the multiscale Fourier descriptor, the FV encoding method captures higher-level semantic information of an object's shape, so the constructed mid-level shape feature has better shape discrimination ability. Low-level shape features have great difficulty dealing with large intra-class variations and inter-class similarities of objects; because the shapes of objects in the Animal dataset vary greatly, the recognition accuracy of the low-level multiscale Fourier descriptor on this dataset is low. We adopt the 1-NN approach to obtain the classification result of the multiscale Fourier descriptor method. Unless specified otherwise, the subsequent experimental results of the multiscale Fourier descriptor are obtained with the 1-NN classifier.
We also provide the classification accuracy of each class of the Animal dataset, as shown in Table 3. It can be seen that our approach greatly improves the recognition accuracy of the cat, leopard and monkey classes. The results indicate that our approach can handle objects with large intra-class variations and non-linear deformations. At the same time, our approach achieves the highest recognition accuracy in most categories of the Animal dataset. We therefore deem our method more appropriate for shape recognition than the other methods.

C. MPEG-7 DATASET
The MPEG-7 dataset [49] is extensively used to evaluate the recognition accuracy of various shape classification methods. It is made up of 1400 shapes, which are divided into 70 categories, each containing 20 shapes. Fig. 4 shows several examples of this dataset. Some shapes of the MPEG-7 dataset have very large intra-class variation and complex deformation, so the dataset is very challenging for shape classification. Several shape classification methods have reported experimental results on this dataset [22], [27], [33]. In order to compare our approach with these methods, we employed the same performance assessment criteria, namely half training and leave-one-out, both of which are commonly used to assess classification performance. Under the half training criteria, we randomly select half of the shapes in each category (i.e., 10 shapes per category) to train the model and use the other half to test it; the experiment is repeated 10 times, and the average classification accuracy and standard deviation are reported. Under the leave-one-out criteria, we use all shapes except the current shape for training and the current shape for testing, and the average shape recognition accuracy is reported. Table 4 compares the classification results of our method with those of the other shape classification approaches. We can see from Table 4 that our approach achieves the highest classification accuracy of all the approaches under the half training evaluation criteria. At the same time, our approach also achieves the best classification performance among all the competing approaches under the leave-one-out evaluation criteria. In addition, the performance of our method remains stable even when PCA is used to reduce the shape features to 512 dimensions.
Even with the PCA-reduced features, the classification accuracy of our method exceeds that of all the other methods under both the half-training and the leave-one-out evaluation criteria.
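The half-training protocol described above can be sketched as follows. This is a minimal illustration using scikit-learn's linear SVM; the function name and seeding are our own additions, not part of the original experiments:

```python
import numpy as np
from sklearn.svm import LinearSVC

def half_training_accuracy(features, labels, n_repeats=10, seed=0):
    """Half-training protocol: per class, half the shapes train, half test."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_repeats):
        train_idx, test_idx = [], []
        for c in np.unique(labels):
            # random per-class split into two halves
            idx = rng.permutation(np.flatnonzero(labels == c))
            half = len(idx) // 2
            train_idx.extend(idx[:half])
            test_idx.extend(idx[half:])
        clf = LinearSVC().fit(features[train_idx], labels[train_idx])
        accs.append(clf.score(features[test_idx], labels[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))
```

The same loop, with 25 training leaves per species, matches the Swedish leaf protocol used later in the paper.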

D. SWEDISH LEAF DATASET
The third dataset used in our shape classification experiments was the Swedish leaf dataset [50]. It is composed of 15 species of leaves, with 75 images per species for a total of 1125 leaf images. Fig. 5 shows several sample images from this dataset. Because the leaves in this dataset have very large intra-class variations and inter-class similarities, the dataset is very challenging for leaf classification. We can see from Fig. 5 that several species are very similar in terms of leaf shape, such as the first, third and ninth species. The last image shows an example of a mask image that we used to extract the shape contour of a leaf.
To compare our approach with existing methods, we followed the same experimental setup as in [2], [12], [14], [22], [45], [50]. For the 75 leaf images in each species, we randomly chose 25 leaves for training and used the remaining 50 for testing. We conducted the experiment 10 times and report the average recognition accuracy and standard deviation on this dataset. Table 5 presents the recognition results of our method and the other shape classification approaches; the results of the other approaches are taken directly from the published literature. We can see from Table 5 that our method achieves the highest classification accuracy of all the methods compared. The recognition accuracy of our approach is nearly 3% higher than the deep-learning-based hierarchical feature using CNN [32], nearly 2% higher than the model-based BCF method [22] and nearly 4% higher than the model-based Manifold MKL SVM method [54]. Furthermore, the recognition accuracy of our approach is higher than the two exemplar-based methods of pattern counting [60] and multiscale Fourier descriptor [45] by 1.2% and 1%, respectively. We also see that our approach still achieves very high classification accuracy when PCA is used to reduce the shape features to 512 dimensions. Even though the shapes of the different species are highly similar, our approach still shows a strong ability for shape discrimination.

E. ETH-80 DATASET
The final dataset utilized in our experiment was the ETH-80 [51], which contains 80 high-resolution 3-D objects from eight classes. Each object contains 41 images captured from various angles, so the dataset is composed of 3,280 images. The segmentation masks of all images are provided, so it is very easy to assess shape classification approaches on the ETH-80 dataset. Fig. 6 presents several samples from this dataset. The leave-one-object-out performance assessment standard introduced in [51] is utilized on the ETH-80 dataset: the images of 79 objects are used to train the classifier, the images of the remaining object are used for testing, and this procedure is repeated for every object. We compared the recognition result of our method with other shape classification methods, and Table 6 lists the classification results of the different approaches on this dataset.
It can be seen that our approach achieves the highest classification accuracy of all the methods. The classification accuracy of our approach is 6% higher than the best exemplar-based method of Bioinformatics [33], 4% higher than the best model-based Manifold MKL SVM method [54], and more than 2% higher than the advanced deep-learning-based hierarchical feature using CNN method [32]. We can also see that the classification accuracy of our approach exceeds that of all the other shape classification approaches when PCA is used to reduce the shape feature to 512 dimensions. Therefore, it can be deduced that our method has a very strong ability for shape recognition.
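The leave-one-object-out protocol described above can be sketched as follows; this is an illustrative helper of our own (not the authors' code) that holds out all views of one object at a time:

```python
import numpy as np
from sklearn.svm import LinearSVC

def leave_one_object_out(features, class_labels, object_ids):
    """ETH-80 style protocol: hold out all views of one object at a time."""
    correct = total = 0
    for obj in np.unique(object_ids):
        test = object_ids == obj                      # all views of one object
        clf = LinearSVC().fit(features[~test], class_labels[~test])
        pred = clf.predict(features[test])
        correct += int(np.sum(pred == class_labels[test]))
        total += int(test.sum())
    return correct / total
```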

F. ROBUSTNESS TO NOISE
We also tested the shape classification performance of our method in the presence of noise. We added noise to the shape contours and performed shape classification on these noisy contours, using the entire MPEG-7 dataset [49] as the source of the initial shape contours. Noise was added by perturbing the x- and y-coordinates of the contour points with values drawn from a Gaussian random variable with zero mean and standard deviation σ. Fig. 7 shows an example of noisy contours with different σ. As σ becomes larger, the shape of an object is affected more and the shape contour becomes rougher. Fig. 8 shows the contour-based mid-level features under different noise conditions. We can see that the contour-based mid-level features remain relatively stable, and increasing σ has very little effect on our method.
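The coordinate perturbation described above can be sketched in a few lines (the function name and seeding are our own; the paper only specifies zero-mean Gaussian noise with standard deviation σ on both coordinates):

```python
import numpy as np

def add_contour_noise(contour, sigma, seed=None):
    """Perturb the x/y coordinates of an (N, 2) contour with
    zero-mean Gaussian noise of standard deviation sigma."""
    rng = np.random.default_rng(seed)
    contour = np.asarray(contour, dtype=float)
    return contour + rng.normal(0.0, sigma, size=contour.shape)
```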
To measure the dissimilarity of the features at different noise levels, we use the L1 distance between the features of the noisy shape and the original shape. Fig. 9 shows the L1 distance at five different noise levels. We can see that the proposed mid-level feature maintains overall stability under different noise levels. We also analyzed the mid-level features of different classes of object shapes to measure the discriminability of the proposed method, again using the L1 distance to compute the difference between mid-level features. Fig. 10 shows a total of five shapes: Fig. 10(a) shows a beetle image and Fig. 10(b) a scale- and rotation-changed version of it; Fig. 10(c) shows a beetle image with a large intra-class change compared with Fig. 10(a); Fig. 10(d) and 10(e) show the shapes of apple and elephant objects, respectively. We then calculated the L1 distances between the shape in Fig. 10(a) and the other four shapes in Fig. 10(b) to 10(e), obtaining L1(a, b) = 0.0310, L1(a, c) = 0.0391, L1(a, d) = 0.0601 and L1(a, e) = 0.0620. The L1 distance between shapes of the same class is much smaller than that between shapes of different classes. Therefore, the proposed mid-level feature can clearly distinguish objects of different classes and achieves a strong ability to discriminate shapes.
Finally, we analyzed the classification accuracy of our method under different noise levels σ. We used the evaluation criteria of the MPEG-7 shape dataset, adopting half of the shapes in each category for training and the remaining shapes for testing. Fig. 11 shows the classification results of our method as the noise level σ varies from 0 to 1. The classification accuracy of our approach is almost unaffected at σ = 0.2, and is only slightly reduced as the noise grows from σ = 0.4 to σ = 1.0; overall, the accuracy drops by only 0.8% when σ increases from 0 to 1. This experimental result further shows the robustness of our approach to noise.

G. PARAMETER STUDY
In this section, we study the influence of the parameters introduced by our method on shape classification performance. The two important parameters in our approach are the number of shape contour sampling points N and the number of clustering centers K. We employed the Animal dataset for this parameter analysis. In each experiment, one parameter serves as a variable while the other remains fixed. The performance evaluation criteria are the same as for the Animal dataset: half of the shapes in each category are used for training, and the rest for testing. Table 7 lists the classification results for different N and K values. We can see from Table 7 that our method obtains the highest classification accuracy on the Animal dataset when N = 1024 and K = 256. The classification result of our method increases with N, but the gain becomes smaller and smaller as N grows further, as can be seen from N = 512 to N = 1024. This is mainly because increasing the number of sampling points N enhances the ability of the triangular features to capture the local details and spatial information of a shape, which in turn strengthens the discriminative ability of the final mid-level shape feature and improves classification performance. We can also see that the recognition results of our method first increase and then decrease as K grows, with the best result obtained at K = 256. Therefore, we set N = 1024 and K = 256 in all the experiments.
To analyze the influence of local shape features on the final shape classification performance, we studied the classification accuracy of different local shape features on the Animal dataset. First, we analyzed the influence of different combinations of the proposed triangular features on classification performance; then we compared the proposed triangular feature with other classical local shape descriptors. Table 8 shows the classification results of different combinations of the triangular features on the Animal dataset for N = 1024 and K = 256. Each row in Table 8 indicates that one feature or a combination of features is used: for instance, the first row indicates that only the TAR feature is used as the local shape feature, and the last row indicates that TAR, TCD and TSLA are combined. It can be seen from Table 8 that the best classification result on the Animal dataset is obtained when TAR is combined with TSLA. The classification accuracy is slightly lower when the TCD feature is added, but the overall performance remains relatively stable. The classification accuracy of a single triangular feature is low: for example, the TCD feature alone obtains only 74.61% on the Animal dataset, which is 14% lower than the best classification performance. Combining multiple triangular features achieves higher performance: for example, combining the TCD and TAR features achieves 85.81% on the Animal dataset, which is higher than TAR alone and TCD alone by 7% and 11%, respectively. These experimental results indicate that the local shape feature has a significant influence on shape classification performance.
To further verify the effect of local shape features on shape classification performance, we compared our proposed triangular feature with the classical shape context (SC) [1], inner-distance shape context (IDSC) [2] and height function [31] local shape descriptors. In the experiment, all the parameters were kept the same except the local shape feature. We first used the FV method to encode these local shape features into the corresponding mid-level shape features, and then used the mid-level features for shape classification. Table 9 shows the comparison between our method and these classical shape descriptors. Our method achieves the best classification accuracy on the Animal dataset among all the compared methods: its recognition rate is higher than that of SC [1], IDSC [2] and the height function [31] by 13.9%, 13.2% and 24.2%, respectively. The height function method obtains the lowest recognition accuracy of all the approaches because of its inability to characterize the complex shape changes of an object. The recognition accuracy of the IDSC method is slightly higher than that of the SC method, primarily because it deals better with articulation changes of an object's shape. This experiment further shows that the multiscale triangular feature can better handle the complex changes of an object's shape, and that the accuracy of shape recognition can be improved by combining it with the FV encoding method.
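The FV encoding step used throughout these experiments can be sketched as follows. This is a simplified Fisher vector (gradients with respect to the means and variances of a diagonal-covariance GMM, followed by the standard power and L2 normalization) built on scikit-learn's GaussianMixture; it is an illustrative sketch, not the authors' exact implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(local_feats, gmm):
    """Encode a set of (T, D) local descriptors with a fitted diagonal GMM
    into a 2*K*D Fisher vector (mean and variance gradients)."""
    X = np.atleast_2d(local_feats)
    T = X.shape[0]
    q = gmm.predict_proba(X)                      # (T, K) soft assignments
    mu = gmm.means_                               # (K, D)
    sigma = np.sqrt(gmm.covariances_)             # (K, D), 'diag' covariance
    w = gmm.weights_                              # (K,)
    diff = (X[:, None, :] - mu[None]) / sigma[None]           # (T, K, D)
    g_mu = (q[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_sig = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))        # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)      # L2 normalization
```

With K = 256 clusters, as selected in the parameter study, each shape is mapped to a single fixed-length vector that can be fed to the linear SVM.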
Next, we analyze the relationship between shape complexity and the number of contour sampling points N. Here, we select five categories from the Animal dataset: birds, cats, deer, fish, and monkeys. For each category, we choose six representative images; Fig. 12 shows the resulting 30 shapes. When the complexity of a contour is low, humans are usually more sensitive to deformation of the shape contour. The shape complexity (SC) is defined as follows: for each sampling point, we compute the absolute difference between the maximum and minimum TAR, TCD, and TSLA values over all scale levels, and we then average these differences over all sampling points and over the three features to obtain the final shape complexity. First, we calculate the complexity of each shape for different numbers of sampling points N, as shown in Fig. 13. We can see that the complexity of each shape becomes larger as N increases. Shape complexity also differs between shapes, which shows that it is distinctive and reflects the complexity of each shape. Next, we calculate the average complexity of each category, as shown in Fig. 14. The complexities of different categories are different, and they also grow as N increases. Finally, we analyze the relationship between image size and shape complexity. Fig. 15 shows the bird image at different image sizes, and Table 10 presents the shape complexities of the different image sizes for different numbers of contour sampling points N. The shape complexities of all image sizes increase gradually with N. They increase with image size when N = 64, N = 128, and N = 1024, whereas for N = 256 and N = 512 they first increase and then decrease as the image size grows.
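The shape complexity defined above can be sketched as follows, assuming each triangular feature has already been evaluated at every sampling point and scale level (the dictionary layout is our own illustration; the paper's exact normalization may differ):

```python
import numpy as np

def shape_complexity(feature_maps):
    """feature_maps: dict mapping feature name (e.g. 'TAR', 'TCD', 'TSLA')
    to an (N, S) array of values at N sampling points over S scale levels.
    Returns the average per-point max-min spread, averaged over features."""
    per_feature = []
    for vals in feature_maps.values():
        # per sampling point: spread between max and min across scales
        spread = np.abs(vals.max(axis=1) - vals.min(axis=1))
        per_feature.append(float(np.mean(spread)))
    return float(np.mean(per_feature))
```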
At the same time, we can observe that the shape complexity remains stable as the image size increases once N is sufficiently large. The experimental results above show that the PCA method can sometimes improve the accuracy of shape classification. A possible reason is that PCA acts as a form of feature selection whose effect is then exploited by the SVM classifier. A natural question is whether distinguishing features can also be selected by explicit feature selection methods: on the one hand, this would reduce the feature dimension and improve the efficiency of shape classification; on the other hand, it could eliminate some noisy data and improve the accuracy of shape classification. We therefore adopted three feature selection methods, namely the two classic Relief-F [61] and Laplacian score [62] methods and the recently proposed infinite latent feature selection (ILFS) [63] method, to study the influence of different features on shape classification performance. We again use the Animal dataset for the experimental analysis and select five feature dimensions: 128, 256, 512, 1024 and 2048. Table 11 shows the classification results of the three feature selection methods on the Animal dataset for the five feature dimensions. The Laplacian score method achieves the best classification accuracy among the three feature selection methods. The ILFS method outperforms the classic Relief-F method when the feature dimension is greater than or equal to 1024, while Relief-F outperforms ILFS when the feature dimension is less than or equal to 512. However, the classification accuracy of all three feature selection methods is far lower than that of the PCA method.
A possible reason is that the PCA method obtains better low-dimensional features through feature transformation, whereas the three feature selection methods merely select a subset of features from the high-dimensional data. Some important features are therefore lost when few feature dimensions are selected, which degrades the classification result. This experiment shows that PCA is a highly effective dimension reduction method.

H. OBJECT CLASSIFICATION ON REAL-WORLD IMAGES
All of the above experiments were performed on manually extracted shapes, and one might wonder how the proposed method detects and recognizes objects in real-world images. To examine this, we combined the Weizmann Horse dataset [64], which has 328 horse images, with the ETHZ Cow dataset [51], which has 111 cow images, for a total of 439 images. We adopted an unsupervised game-theoretic saliency detection approach [65] to extract the foreground region of each image. Fig. 16 shows the saliency detection results of several images. The first two columns in Fig. 16 show several typical results, while the other columns show difficult images with severe segmentation errors.

FIGURE 16. Foreground extraction results of test images from the Weizmann Horse and ETHZ Cow datasets. The first row shows several test images, the second row shows the saliency detection results of the unsupervised game-theoretic method, and the third row shows the binarization results of the saliency maps. The first two columns show several typical results, and the other columns show several unsatisfactory foreground extraction results.
To evaluate the performance of the various shape classification methods, we randomly selected 60 images in each category to train the classification model and used the remaining images to test it. We carried out 10 experiments and report the average classification accuracy. We compared the proposed method with the BCF method [22] and the multiscale Fourier descriptor method [45]; these two are representative of mid-level shape features and low-level shape descriptors, respectively, while our method is based on the multiscale triangular feature descriptor. The comparison results are shown in Table 12. Our method achieves the highest classification accuracy of all the methods: 10% higher than the BCF method [22] and 11% higher than the multiscale Fourier descriptor method [45]. It is worth noting that adding the TCD descriptor improves the overall classification accuracy in this experiment. A possible reason is that foreground extraction introduces noise into the dataset, yet the overall classification performance of our method remains relatively stable. This experiment proves that our method can be applied to object classification in real-world images and is sufficiently robust to foreground extraction errors and locally noisy boundaries.
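After binarizing the saliency map, the object contour must be extracted and sampled to N points before the triangular features can be computed. A uniform arc-length resampling step can be sketched as follows (a hypothetical helper of our own; the paper does not specify its exact resampling scheme):

```python
import numpy as np

def resample_contour(contour, n_points=1024):
    """Uniformly resample a closed (M, 2) contour to n_points by arc length."""
    pts = np.vstack([contour, contour[:1]])              # close the loop
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # segment lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative arc length
    targets = np.linspace(0.0, cum[-1], n_points, endpoint=False)
    x = np.interp(targets, cum, pts[:, 0])
    y = np.interp(targets, cum, pts[:, 1])
    return np.stack([x, y], axis=1)
```

Uniform arc-length sampling keeps the spacing between neighboring points stable even when the binarized boundary is rough, which helps the triangular features tolerate the segmentation errors visible in Fig. 16.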

V. CONCLUSION
We have proposed a novel contour-based mid-level shape description method for shape classification. This method can deal with the intra-class variation and non-linear deformation of an object's shape and enhance the discriminability of the shape representation. Compared with previous work, we offer three contributions. First, we regard each sampling point of the shape boundary as a local feature, unlike previous methods that use contour fragments as local features. Second, we present an effective local feature description method, named the triangular feature, which can describe both the details and the global characteristics of a shape and represent the spatial layout information between sampling points; for this reason, our method does not require spatial pyramid matching (SPM) to add spatial layout information. Finally, we build a mid-level shape descriptor with good discrimination performance based on the FV encoding method for shape classification. The performance of our method has been extensively evaluated on a large number of shape benchmarks, and the experimental results show that our approach exceeds existing state-of-the-art shape classification approaches. In future work, we will focus on how to apply this method to the recognition of target objects in complex scenes.