Efficient Feature Mapping in Classifying Proportional Data

In image classification, traditional kernels or feature mapping functions of the Support Vector Machine (SVM) use discriminative features without considering the true nature of the data. Our work in this paper is motivated by the need to consider the intrinsic distribution of L1-normalized histograms and to develop a flexible feature mapping technique by combining histogram-based features and distribution-based density features. The proposed mapping technique incorporates prior knowledge about the data, which provides a flexible representation and thus increases the discriminative power of the classifier. Such flexibility is achieved due to the explanatory capabilities of the Dirichlet, generalized Dirichlet and Beta-Liouville distributions in modelling proportional data. In addition, we present a general framework to estimate the parameters of these distributions using maximum likelihood estimation (MLE). Experimental results show that the proposed technique increases the effectiveness of SVM kernels for different computer vision tasks such as natural scene recognition, satellite image classification and human action recognition in videos.


I. INTRODUCTION
Appropriate and accurate representation of the data for classification models is one of the open problems in machine learning. Several classification and hybrid models have been developed. However, little attention has been given to obtaining a proper representation of the data through distribution-based feature mapping in discriminative approaches [1]. In this paper, we address this issue in supervised learning problems for proportional data. A popular image representation is the Bag of Visual Words (BoVW), which quantizes similar patches of an image to the corresponding cluster center; the set of cluster centers is known as the codebook [2], [3]. Modelling such data after normalization in a probabilistic manner needs to satisfy the constraints of non-negativity and unit sum. Examples of such data include L1-normalized histograms for images and normalized bag-of-words representations of text (or image) data. In particular, we are motivated by the problem of modelling features in images and videos where each feature represents a portion of the total features considered. For example, an image can be represented by a normalized histogram of bag of vectors where each vector element represents a sub-region of the image. Knowledge about the statistical characteristics of such representations has to be used effectively in order to get better accuracy in classification tasks. The Dirichlet, generalized Dirichlet and Beta-Liouville distributions can model this type of data to obtain prior information which can be used as a feature. The advantage of such distributions is that they can capture the nature of the data and provide flexibility. For SVM, traditional kernels do not take into account the nature of the data. Incorporating our proposed feature mapping technique increases the classification accuracy of these kernels.

The associate editor coordinating the review of this manuscript and approving it for publication was Bohui Wang.

The performance of machine learning algorithms depends on the representation of the input data. Incorporating invariance in the representation using prior knowledge is a common technique to make the learning task more efficient and, in general, prior information makes it possible to generalize from training examples to novel test examples [1]. In supervised learning, hyperparameters of the classifiers work as prior information. Another approach is to select features that convey the most relevant information regarding the data or the task. Such features are automatically incorporated in some kernels, such as the polynomial kernel for SVM [4]. On a different note, distribution-based flexible feature mapping can be efficient in different classification tasks [5]. Our contribution falls into the second category. For SVM, input data are represented as points in a high-dimensional space. This representation needs to be linearly separable for the model to work properly. Therefore, for data that are not linearly separable, a linear SVM model lacks accuracy. However, the kernel trick or a feature mapping technique makes it possible to model such data by taking the data space to a higher dimension where the data become linearly separable. It is a common idea to extract new features from the input variables through a feature mapping function which increases the separability between the data classes. However, feature mapping without any statistical measure of the data does not guarantee an improvement in the model's performance. Selecting the most informative attributes from a set of redundant attributes is sub-optimal for a classifier, and it may also exclude some relevant features [6]. Therefore, extracting or creating new features from the data with prior information using a parameterized feature mapping function can be incorporated in a classification model with a certain degree of confidence.
The histogram representation of the extracted data can be modelled in a probabilistic way by performing L1-normalization, and Dirichlet or Liouville type distributions are the natural choice to estimate the density of such data. Therefore, a parametric distribution-based mapping function can be developed to increase the flexibility of the data points in the feature space.
The rest of the paper is organized as follows. Section II highlights some related works. Section III introduces the Dirichlet, generalized Dirichlet and Beta-Liouville distributions along with the parameter estimation method for these distributions. Support vector machines and kernel tricks are discussed in Section IV. Our proposed feature mapping technique is discussed in Section V. In Section VI, we show the experimental results of the proposed methodology for image and video classification tasks. Concluding remarks are given in Section VII.

II. RELATED WORKS
Many researchers have focused on improving the traditional SVM by implementing new kernels or new feature mapping functions. To teach the machine to differentiate between different images, frequencies of local features of an image or video frame are quantized into a histogram [7]-[9]. The feature mapping function proposed by [5], based on the Dirichlet distribution, has proved to be efficient in different classification and regression tasks for proportional data. In addition, the authors indicated a data normalization technique for proportional data to be used in the feature mapping function. Histogram-based feature transformation with probabilistic modelling is addressed in [10]. Kernel-based methods have been applied in different learning tasks, such as Gaussian kernels with different distance measures, which proved to be efficient in image classification [11]. In addition, a combination of generative and probabilistic learning is shown to be effective in image recognition and segmentation tasks [12], [13]. In such approaches the kernel is generated by learning the generative process of the data using probabilistic models such as the Gaussian mixture model [14] or mixture models based on the Dirichlet and related distributions [12], [15]-[17]. In contrast, we consider a feature mapping function which uses both discriminative features and density features. In supervised learning, measuring the similarity of L1-normalized histograms using the Euclidean distance is not effective [11]. In such cases, histogram distances such as the $\chi^2$ distance have proved to be effective [18]-[21]. Several histogram-based distances and their derivatives have been proposed by many researchers, such as [22]-[26]. Apart from this, [27] proposed a non-linear mapping technique based on the polar coordinate system. A modification of the RBF kernel using a first-order Taylor series approximation, proposed by [28], has achieved better accuracy for semanteme data.
However, in contrast to these approaches, we are interested in increasing the discriminative power of SVM using a more flexible feature mapping technique for proportional data.

A. DIRICHLET DISTRIBUTION
The Dirichlet distribution is the multivariate generalization of the Beta distribution and the most common choice in probability and statistics for modelling proportional data [29]. It is a distribution over the multinomials in a simplex with support [0, 1]. If a vector $p = (p_1, p_2, \ldots, p_D)$ of length $D$ resides in a $D$-dimensional closed simplex of $\mathbb{R}^D$, then the simplex is defined as,

$$C(1) = \left\{ (p_1, \ldots, p_D) : p_d \ge 0, \; \sum_{d=1}^{D} p_d = 1 \right\} \qquad (1)$$

If the proportional vector $p \in C(1)$ (see footnote 1), then the joint probability density function of $p = (p_1, p_2, \ldots, p_D)$ is defined as,

$$\mathrm{Dir}(p \mid \alpha) = \frac{\Gamma\left(\sum_{d=1}^{D} \alpha_d\right)}{\prod_{d=1}^{D} \Gamma(\alpha_d)} \prod_{d=1}^{D} p_d^{\alpha_d - 1} \qquad (2)$$

where $\Gamma$ denotes the gamma function and $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_D)$ is a positive parameter vector which defines the shape of the distribution in $D$-dimensional space. The total mass $\alpha_0 = \sum_{d} \alpha_d$ is the concentration or scale parameter, and the base measure is $(\alpha'_1, \alpha'_2, \ldots, \alpha'_D)$ with $\alpha'_d = \alpha_d / \alpha_0$. In the case of a symmetric distribution, the mean of the distribution is determined by the base measure. In addition, altering the magnitudes in $\alpha$ affects the variance of the distribution. The mean, variance and covariance are,

$$E[p_d] = \frac{\alpha_d}{\alpha_0}, \qquad \mathrm{Var}(p_d) = \frac{\alpha_d (\alpha_0 - \alpha_d)}{\alpha_0^2 (\alpha_0 + 1)}, \qquad \mathrm{Cov}(p_d, p_f) = -\frac{\alpha_d \alpha_f}{\alpha_0^2 (\alpha_0 + 1)}, \; d \neq f \qquad (3)$$

Footnote 1: $C(n) = C(1)$; $n$ = sum of the multinomials.
VOLUME 9, 2021

It should be noted that small values of the concentration parameter $\alpha_0$ favor the extreme values of the density function; as a result, data are spread toward the boundary of the simplex and are more compact at its corners. The shape parameter $\alpha$ makes it possible to model data in linear, convex and concave hulls [5]. Figure 1 shows the flexibility of the distribution when the parameters are changed. $\alpha_0$ controls the peak of the distribution and $\alpha_d$ determines the location of the peak. If the expected values of the parameters are equal, then data are distributed symmetrically over the simplex. The higher the parameter value, the more confident we are about that parameter, and hence density values are more peaked on that side.
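The effect of the concentration parameter described above can be checked numerically. The following is a minimal sketch, assuming the standard Dirichlet density in (2); the function name `dirichlet_logpdf` and the two evaluation points are illustrative choices, not part of the original work.

```python
import math


def dirichlet_logpdf(p, alpha):
    """Log-density of Dir(alpha) at a point p in the interior of the simplex."""
    if len(p) != len(alpha):
        raise ValueError("p and alpha must have the same length")
    if any(x <= 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must have positive entries that sum to 1")
    # Log of the normalizing constant Gamma(alpha_0) / prod Gamma(alpha_d).
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p))


# Hypothetical evaluation points: the center and a near-corner of the 3-simplex.
center = [1 / 3, 1 / 3, 1 / 3]
corner = [0.98, 0.01, 0.01]

for alpha in ([0.5, 0.5, 0.5], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [20.0, 20.0, 20.0]):
    at_center = dirichlet_logpdf(center, alpha)
    at_corner = dirichlet_logpdf(corner, alpha)
    print(alpha, round(at_center, 3), round(at_corner, 3))
```

With a small total mass the log-density is larger near the corner than at the center, and the ordering reverses as the total mass grows, which matches the qualitative behaviour described for Figure 1.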

B. GENERALIZED DIRICHLET DISTRIBUTION
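From (3), we see that any two random variables following a Dirichlet distribution are negatively correlated. If the variables are positively correlated, then the Dirichlet prior is not a proper choice. An alternative in such cases is the generalized Dirichlet (GD) distribution, which accommodates both negatively and positively correlated random variables [30]. In a $D$-dimensional closed simplex, the generalized Dirichlet distribution with parameter vector $\theta = (\alpha_1, \beta_1, \alpha_2, \beta_2, \ldots, \alpha_D, \beta_D)$ is defined as,

$$\mathrm{GD}(p \mid \theta) = \prod_{d=1}^{D} \frac{\Gamma(\alpha_d + \beta_d)}{\Gamma(\alpha_d)\Gamma(\beta_d)} \, p_d^{\alpha_d - 1} \left(1 - \sum_{j=1}^{d} p_j\right)^{\gamma_d}$$

Here, $\sum_{d=1}^{D} p_d < 1$, $\gamma_d = \beta_d - \alpha_{d+1} - \beta_{d+1}$ for $d = 1, \ldots, D-1$ and $\gamma_D = \beta_D - 1$. If $p$ follows a generalized Dirichlet distribution, then it can be transformed to follow independent Beta distributions for each dimension using the following transformation proposed by [31],

$$q_1 = p_1, \qquad q_d = \frac{p_d}{1 - \sum_{j=1}^{d-1} p_j}, \quad d = 2, \ldots, D$$

where each $q_d$ follows a Beta distribution with parameters $(\alpha_d, \beta_d)$.
It is evident that the generalized Dirichlet distribution has $2D$ parameters. Unlike the Dirichlet distribution, where the expectation is fixed, in the GD distribution the expectation of each dimension $d$ depends on the preceding dimensions $1, \ldots, d-1$. The mean, variance and covariance are,

$$E[p_d] = \frac{\alpha_d}{\alpha_d + \beta_d} \prod_{k=1}^{d-1} \frac{\beta_k}{\alpha_k + \beta_k}$$

$$\mathrm{Var}(p_d) = E[p_d] \left( \frac{\alpha_d + 1}{\alpha_d + \beta_d + 1} \prod_{k=1}^{d-1} \frac{\beta_k + 1}{\alpha_k + \beta_k + 1} - E[p_d] \right)$$

$$\mathrm{Cov}(p_d, p_f) = E[p_f] \left( \frac{\alpha_d}{\alpha_d + \beta_d + 1} \prod_{k=1}^{d-1} \frac{\beta_k + 1}{\alpha_k + \beta_k + 1} - E[p_d] \right), \quad d < f$$

where $d, f = 1, 2, \ldots, D$. The flexible covariance structure of the GD distribution enables it to have different degrees of belief on random variables while keeping the same expectation [30]. From Fig. 1, it is evident that for the Dirichlet distribution, symmetrically distributed random variables are less concentrated at the center (for example, $\alpha = [2, 2, 2]$) than random variables following the generalized Dirichlet distribution, which are more concentrated at the center asymmetrically ($\alpha = [2, 4]$; $\beta = [4, 4]$). It can be shown that the generalized Dirichlet distribution reduces to the Dirichlet distribution when $\beta_d = \alpha_{d+1} + \beta_{d+1}$ (see [12] for details). If the expectation is varied, for example when $\alpha = [2, 6]$; $\beta = [6, 8]$ in Fig. 1, the generalized Dirichlet distribution captures the variation of the data more flexibly.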
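The reduction of the generalized Dirichlet distribution to the Dirichlet distribution, and the transformation to independent Beta variables, can be illustrated with a short sketch. It assumes the standard Connor-Mosimann form of the GD density; the function names and the test point are hypothetical.

```python
import math


def gd_logpdf(p, alpha, beta):
    """Log-density of the generalized Dirichlet at p (entries > 0, sum < 1)."""
    D = len(p)
    if any(x <= 0 for x in p) or sum(p) >= 1.0:
        raise ValueError("p must have positive entries that sum to less than 1")
    total, cum = 0.0, 0.0
    for d in range(D):
        cum += p[d]
        # gamma_d = beta_d - alpha_{d+1} - beta_{d+1}, and beta_D - 1 for the last one.
        if d == D - 1:
            gamma_d = beta[d] - 1.0
        else:
            gamma_d = beta[d] - alpha[d + 1] - beta[d + 1]
        total += (math.lgamma(alpha[d] + beta[d]) - math.lgamma(alpha[d])
                  - math.lgamma(beta[d]) + (alpha[d] - 1.0) * math.log(p[d])
                  + gamma_d * math.log(1.0 - cum))
    return total


def to_independent_betas(p):
    """Map p to q with q_1 = p_1 and q_d = p_d / (1 - sum of earlier entries)."""
    q, cum = [], 0.0
    for x in p:
        q.append(x / (1.0 - cum))
        cum += x
    return q


def dirichlet_logpdf(p, alpha):
    """Standard Dirichlet log-density, used here only for comparison."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p))


# With beta_1 = alpha_2 + beta_2 the GD density matches Dir(alpha_1, alpha_2, beta_2).
p = [0.2, 0.3]
print(gd_logpdf(p, [2.0, 3.0], [7.0, 4.0]))
print(dirichlet_logpdf(p + [1.0 - sum(p)], [2.0, 3.0, 4.0]))
# Parameters quoted for Fig. 1 in the text, evaluated at the same point.
print(gd_logpdf(p, [2.0, 4.0], [4.0, 4.0]))
print(to_independent_betas(p))
```

The first two printed values should agree, since the chosen parameters satisfy the reduction condition stated above.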

C. BETA-LIOUVILLE DISTRIBUTION
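While the generalized Dirichlet distribution is more flexible than the Dirichlet distribution, it requires twice the number of parameters. An efficient alternative to both is the Beta-Liouville (BL) distribution, which overcomes the limitations of the Dirichlet distribution and requires fewer parameters than the generalized Dirichlet distribution [32]. This distribution is a special case of the Liouville family of distributions. For a vector $p = (p_1, \ldots, p_D)$ with $\sum_{d=1}^{D} p_d < 1$, the sum $u = \sum_{d=1}^{D} p_d$ is a random variable that follows a Beta distribution with parameters $\alpha$ and $\beta$ and is independent of the normalized vector $p / u$. The joint probability density function of this distribution is given by [32],

$$\mathrm{BL}(p \mid \alpha_1, \ldots, \alpha_D, \alpha, \beta) = \frac{\Gamma\left(\sum_{d=1}^{D} \alpha_d\right) \Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \prod_{d=1}^{D} \frac{p_d^{\alpha_d - 1}}{\Gamma(\alpha_d)} \left(\sum_{d=1}^{D} p_d\right)^{\alpha - \sum_{d=1}^{D} \alpha_d} \left(1 - \sum_{d=1}^{D} p_d\right)^{\beta - 1}$$

It is evident that the Beta-Liouville distribution has two more parameters than the Dirichlet distribution. With $s = \sum_{l=1}^{D} \alpha_l$, the mean, variance and covariance of the Beta-Liouville distribution are expressed as follows [32].

$$E[p_d] = \frac{\alpha}{\alpha + \beta} \, \frac{\alpha_d}{s} \qquad (11)$$

$$\mathrm{Var}(p_d) = \frac{\alpha(\alpha + 1)}{(\alpha + \beta)(\alpha + \beta + 1)} \, \frac{\alpha_d(\alpha_d + 1)}{s(s + 1)} - \left(\frac{\alpha}{\alpha + \beta}\right)^2 \frac{\alpha_d^2}{s^2} \qquad (12)$$

$$\mathrm{Cov}(p_l, p_k) = \frac{\alpha_l \alpha_k}{s} \left( \frac{\alpha(\alpha + 1)}{(\alpha + \beta)(\alpha + \beta + 1)(s + 1)} - \frac{\alpha^2}{(\alpha + \beta)^2 \, s} \right), \quad l \neq k \qquad (13)$$

From (13), we see that the Beta-Liouville distribution has a more general covariance structure compared to the strictly negative covariance of the Dirichlet distribution. In addition, two random variables with the same expectation can have different variances. Such properties of the Beta-Liouville distribution make it more flexible for estimating the density of proportional data.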
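As a sanity check on the Beta-Liouville density and mean, the sketch below evaluates the log-density and compares it with a Dirichlet density in the special case where the Beta parameter alpha equals the sum of the alpha_d, in which case the two coincide. The function names and the numeric values are assumptions chosen for illustration.

```python
import math


def beta_liouville_logpdf(p, alpha_vec, a, b):
    """Log-density of the Beta-Liouville distribution at p (entries > 0, sum < 1).

    alpha_vec holds alpha_1..alpha_D; a and b are the Beta parameters alpha and beta.
    """
    u = sum(p)
    if any(x <= 0 for x in p) or u >= 1.0:
        raise ValueError("p must have positive entries that sum to less than 1")
    s = sum(alpha_vec)
    log_norm = (math.lgamma(s) + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                - sum(math.lgamma(x) for x in alpha_vec))
    return (log_norm
            + sum((al - 1.0) * math.log(x) for al, x in zip(alpha_vec, p))
            + (a - s) * math.log(u)
            + (b - 1.0) * math.log(1.0 - u))


def beta_liouville_mean(alpha_vec, a, b):
    """Mean of each component: E[u] times the Dirichlet base measure."""
    s = sum(alpha_vec)
    return [a / (a + b) * al / s for al in alpha_vec]


def dirichlet_logpdf(p, alpha):
    """Standard Dirichlet log-density, used here only for comparison."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(x) for x in alpha)
    return log_norm + sum((al - 1.0) * math.log(x) for al, x in zip(alpha, p))


# Special case a = alpha_1 + alpha_2: the BL density equals Dir(alpha_1, alpha_2, b).
p = [0.2, 0.3]
print(beta_liouville_logpdf(p, [2.0, 3.0], 5.0, 4.0))
print(dirichlet_logpdf(p + [1.0 - sum(p)], [2.0, 3.0, 4.0]))
print(beta_liouville_mean([2.0, 3.0], 5.0, 4.0))
```

Choosing `a` different from the sum of `alpha_vec` changes the variance of the components while the ratio between their means stays fixed, which is the extra flexibility the text refers to.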

D. PARAMETER ESTIMATION
The parameter vector $\alpha$ can be determined from the observed proportional data $D_{obs}$, which consists of $N$ observations, each a $D$-dimensional proportional vector. The joint probability of the whole dataset can be computed as follows,

$$p(D_{obs} \mid \alpha) = \prod_{n=1}^{N} \frac{\Gamma(\alpha_0)}{\prod_{d=1}^{D} \Gamma(\alpha_d)} \prod_{d=1}^{D} p_{nd}^{\alpha_d - 1} \qquad (14)$$

In order to maximize (14), we need to take the gradient and set it to zero. It is cumbersome to apply the chain rule to the product terms in (14). Therefore, we take the maximum likelihood estimation (MLE) approach and maximize the logarithm of (14). Since the distributions discussed above belong to the exponential family, taking the logarithm turns the problem into a convex optimization problem [33], and thus an iterative method such as the Newton-Raphson method or fixed point iteration can be applied [34]-[36]. The log-likelihood is,

$$\mathcal{L}(\alpha) = N \left[ \log \Gamma(\alpha_0) - \sum_{d=1}^{D} \log \Gamma(\alpha_d) + \sum_{d=1}^{D} (\alpha_d - 1) \, \overline{\log p_d} \right], \qquad \overline{\log p_d} = \frac{1}{N} \sum_{n=1}^{N} \log p_{nd}$$

The derivative for one $\alpha_d$ is,

$$\frac{\partial \mathcal{L}}{\partial \alpha_d} = N \left[ \psi(\alpha_0) - \psi(\alpha_d) + \overline{\log p_d} \right]$$

where $\psi$ is the digamma function. The gradient $g$ for the dataset is $D \times 1$ and can be written as $g = \left( \partial \mathcal{L} / \partial \alpha_1, \ldots, \partial \mathcal{L} / \partial \alpha_D \right)^T$. In the exponential family of distributions, when the gradient is set to zero, the observed and expected sufficient statistics become equal. Since the Dirichlet distribution belongs to the exponential family, it is possible to formulate an equation and solve it as a fixed point iteration problem to determine the parameters $\alpha$ (see [34] for details). For each component, this can be expressed as follows,

$$\psi(\alpha_d^{new}) = \psi\left( \sum_{k=1}^{D} \alpha_k^{old} \right) + \overline{\log p_d}$$

Fixed point iteration converges only when the magnitude of the derivative of the update map is less than 1, and it is linearly convergent, meaning that the error in each step is roughly proportional to the error in the previous step. In contrast, the Newton-Raphson method has a quadratic convergence rate and is guaranteed to converge provided that the initial guess is close to the final estimate. The Hessian of the log-likelihood function is,

$$H = N \left( \psi'(\alpha_0) \, \mathbf{1}\mathbf{1}^T - B \right)$$

where $B = \mathrm{diag}\left( \psi'(\alpha_1), \ldots, \psi'(\alpha_D) \right)$ and $\psi'$ is the trigamma function. For Newton's algorithm, the Hessian needs to be inverted, and [37] provided the following inversion technique using the Sherman-Morrison formula. Writing $Q = -N B$ and $z = N \psi'(\alpha_0)$,

$$H^{-1} = Q^{-1} - \frac{Q^{-1} \mathbf{1}\mathbf{1}^T Q^{-1}}{1/z + \mathbf{1}^T Q^{-1} \mathbf{1}}$$

Therefore, the update for Newton's algorithm becomes,

$$\alpha^{new} = \alpha^{old} - H^{-1} g \qquad (22)$$

As discussed, it is important to estimate the initial values of the parameters accurately, rather than taking a random initial guess, so that (22) converges to the global optimum. There are several proposals for the initial estimation of these parameters. The method of moments provides a good initial guess of the parameters. The first and second moments of the data can be calculated from the moment generating function. The moment generating function of a random variable $X$ is given by $E(e^{tX})$ and is denoted by $M_X(t)$.
Using a Taylor series expansion, solving the general moment equation for the Dirichlet distribution results in the first and second moments,

$$E[p_d] = \frac{\alpha_d}{\alpha_0}, \qquad E[p_d^2] = \frac{\alpha_d (\alpha_d + 1)}{\alpha_0 (\alpha_0 + 1)}$$

Solving the above equations gives $\alpha_0 = \left( E[p_1] - E[p_1^2] \right) / \left( E[p_1^2] - E[p_1]^2 \right)$ and $\alpha_d = \alpha_0 E[p_d]$, which can be used as an initial guess for Newton's algorithm.
Other techniques such as Expectation Maximization and the Expectation Maximization gradient algorithm can also be employed to determine the parameters of the Dirichlet distribution [38].
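The estimation procedure in this subsection can be sketched in pure Python. This is a minimal illustration, not the authors' implementation: it assumes the Dirichlet case only, approximates the digamma and trigamma functions with a recurrence plus an asymptotic series, uses the method of moments for initialization, and applies the Newton step with the rank-one Hessian inversion. The helper names, the synthetic data and the step-halving safeguard are all assumptions.

```python
import math
import random


def digamma(x):
    """Digamma via psi(x) = psi(x + 1) - 1/x and an asymptotic series for x >= 6."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def trigamma(x):
    """Trigamma via psi'(x) = psi'(x + 1) + 1/x^2 and an asymptotic series for x >= 6."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))


def moment_init(data):
    """Method-of-moments initial guess from the first two moments of component 1."""
    n, D = len(data), len(data[0])
    means = [sum(row[d] for row in data) / n for d in range(D)]
    m2 = sum(row[0] ** 2 for row in data) / n
    alpha0 = (means[0] - m2) / (m2 - means[0] ** 2)
    return [m * alpha0 for m in means]


def gradient(alpha, log_pbar, n):
    """Gradient of the Dirichlet log-likelihood with respect to alpha."""
    a0 = sum(alpha)
    return [n * (digamma(a0) - digamma(a) + lp) for a, lp in zip(alpha, log_pbar)]


def fit_dirichlet(data, tol=1e-10, max_iter=100):
    """Newton-Raphson MLE; the Hessian is inverted with the rank-one update formula."""
    n, D = len(data), len(data[0])
    log_pbar = [sum(math.log(row[d]) for row in data) / n for d in range(D)]
    alpha = moment_init(data)
    for _ in range(max_iter):
        g = gradient(alpha, log_pbar, n)
        if max(abs(x) for x in g) / n < tol:
            break
        q = [-n * trigamma(a) for a in alpha]   # diagonal part of the Hessian
        z = n * trigamma(sum(alpha))            # constant rank-one part
        b = sum(gk / qk for gk, qk in zip(g, q)) / (1.0 / z + sum(1.0 / qk for qk in q))
        step = [(gk - b) / qk for gk, qk in zip(g, q)]
        scale = 1.0
        while any(a - scale * s <= 0 for a, s in zip(alpha, step)):
            scale *= 0.5                        # assumed safeguard: keep alpha positive
        alpha = [a - scale * s for a, s in zip(alpha, step)]
    return alpha


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


rng = random.Random(0)
data = [sample_dirichlet([2.0, 5.0, 3.0], rng) for _ in range(2000)]
print([round(a, 3) for a in fit_dirichlet(data)])
```

On synthetic data of this size the estimate should land close to the generating parameters, and the moment-based start usually means only a handful of Newton steps are needed.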

IV. CLASSIFIER-SUPPORT VECTOR MACHINE
SVM is a well-known and common choice for supervised machine learning problems. Empirically it has shown good generalization performance in different fields of research and applications [39]-[41]. The aim of using this classifier is to find the support vectors that maximize the margin between class labels, where the number of support vectors is proportional to the generalization error [42]. Assume the dataset $D_{obs} = \{(p_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of images and each image is represented by an L1-normalized histogram $p_i$ and the corresponding label $y_i \in \{-1, +1\}$. Considering the primal representation of the optimization problem, we have

$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(p_i) + b \right) \ge 1 - \xi_i, \; \xi_i \ge 0 \qquad (26)$$

The objective is to determine, among the infinite number of linear classifiers, the one that maximizes the geometric margin between the classes with the lowest generalization error. In the case of non-separable data, we look into a higher-dimensional space, through some feature mapping technique, to find the hyperplane that maximizes the geometric margin and minimizes the misclassification error. To control the trade-off between a large margin and the error rate, the hyperparameter $C$ is incorporated.
The above is a convex quadratic optimization problem with linear constraints. Solving this problem results in the maximum geometric margin between classes. Here, $\phi(p_i)$ is the embedding or feature mapping function from the input space $\mathcal{X}$ to the feature space $\mathcal{H}$. If no extra features are extracted from the data, then this function represents the original input data, known as the attributes, and the kernel $K$, which is the inner product between each pair of data points, becomes $\langle p_i, p_j \rangle$ instead of $\langle \phi(p_i), \phi(p_j) \rangle$. For data that are not linearly separable, slack variables $\xi_i$ are introduced in the objective function and the constraints are modified accordingly. $C$ is a hyperparameter that regularizes the objective function for misclassification, and $\sum_{i=1}^{N} \xi_i$ is an upper bound on the generalization error. For a hard margin classifier, $C$ is set to a high value to lower the misclassification error, and for a soft margin classifier, $C$ is set to low values to provide flexibility at the boundary region for some data to be misclassified.
Solving the dual problem is computationally convenient for large datasets. Relaxing the constraints with the help of Lagrange multipliers $\gamma_i$, the dual problem becomes,

$$\max_{\gamma} \; \sum_{i=1}^{N} \gamma_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \gamma_i \gamma_j y_i y_j K(p_i, p_j) \quad \text{subject to} \quad 0 \le \gamma_i \le C, \; \sum_{i=1}^{N} \gamma_i y_i = 0 \qquad (27)$$

Only the support vectors have non-zero $\gamma$ values; elsewhere $\gamma$ is zero. Given the support vectors, the decision function classifies the data by evaluating the kernel between the input and the support vectors. The decision function of the support vector machine becomes,

$$f(p) = \mathrm{sign}\left( \sum_{i=1}^{N} \gamma_i y_i K(p_i, p) + b \right)$$
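To make the role of the support vectors concrete, the sketch below evaluates the decision function for a toy two-point problem whose hard-margin solution can be written by hand. The support vectors, multipliers and bias are hypothetical values chosen so that both points sit exactly on the margin; no solver is involved.

```python
def linear_kernel(p, q):
    """Inner product <p, q>, the kernel obtained when no feature map is applied."""
    return sum(a * b for a, b in zip(p, q))


def decision_value(p_new, support_vectors, labels, gammas, bias, kernel=linear_kernel):
    """Sum over support vectors of gamma_i * y_i * K(p_i, p_new), plus the bias."""
    score = sum(g * y * kernel(sv, p_new)
                for sv, y, g in zip(support_vectors, labels, gammas))
    return score + bias


def predict(p_new, support_vectors, labels, gammas, bias, kernel=linear_kernel):
    """Class label given by the sign of the decision value."""
    value = decision_value(p_new, support_vectors, labels, gammas, bias, kernel)
    return 1 if value >= 0 else -1


# Hypothetical toy problem: one support vector per class on the 2-simplex.
support_vectors = [[0.8, 0.2], [0.2, 0.8]]
labels = [1, -1]
# For two points the hard-margin multipliers are 2 / ||x_plus - x_minus||^2.
gamma = 2.0 / sum((a - b) ** 2 for a, b in zip(*support_vectors))
gammas = [gamma, gamma]
bias = 0.0

for p in ([0.8, 0.2], [0.6, 0.4], [0.3, 0.7]):
    print(p, predict(p, support_vectors, labels, gammas, bias))
```

Passing a different `kernel` callable is all that is needed to evaluate the same decision rule in a mapped feature space.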

V. FEATURE MAPPING: DIRICHLET SVM, GENERALIZED DIRICHLET SVM, BETA-LIOUVILLE SVM
In this section we focus on the primal and dual forms of the optimization problem in (26) and (27) to modify the feature mapping function $\phi(p)$. As discussed, the performance of SVM depends on the choice of the kernel function, and there is no structured procedure to select the kernel function or feature mapping [43]. One of the advantages of embedding input vectors into the feature space is the flexibility in choosing the mapping function $\phi(p)$ depending on the structure of the data. Taking advantage of the Dirichlet, generalized Dirichlet and Beta-Liouville distributions for proportional data modelling, a new feature map can be constructed as follows,

$$\phi_j(p_i) = \begin{cases} p_{ij}, & j = 1, 2, \ldots, D \\ \mathrm{Dir}(p_i \mid \alpha), & j = D + 1 \; \text{(DSVM)} \\ \mathrm{GD}(p_i \mid \theta), & j = D + 1 \; \text{(GDSVM)} \\ \mathrm{BL}(p_i \mid \alpha_1, \ldots, \alpha_D, \alpha, \beta), & j = D + 1 \; \text{(BLSVM)} \end{cases} \qquad (29)$$

To estimate the parameters in (29), a technique similar to the one described by [5] is followed. Using the kernel trick, the proposed feature mapping technique can be used with the traditional non-linear kernels to map the input space into the feature space implicitly, without explicit knowledge of the feature space. The dimension of the input space is increased by 1 by the feature mapping in (29). The Dirichlet SVM (DSVM) is formulated by substituting the first and second forms of (29) for $\phi(p)$ in (26) and (27). In a similar fashion, the generalized Dirichlet SVM (GDSVM) and Beta-Liouville SVM (BLSVM) can be formulated. For a new data point $p'$, the trained Dirichlet parameter $\alpha$ is used to determine the feature mapping according to (29). The decision function for this new data point becomes,

$$f(p') = \mathrm{sign}\left( \sum_{i=1}^{N} \gamma_i y_i K\left( \phi(p_i), \phi(p') \right) + b \right)$$

Applying the flexible mapping function $\phi(p)$ in (29) changes the similarity measure and thus enables us to modify the base kernel. Apart from the regular kernels such as RBF, polynomial, sigmoid and $\chi^2$, which are discussed extensively in the literature, we combine our proposed feature mapping technique with the following kernels as well.

• Bhattacharyya
The Bhattacharyya coefficient is a divergence-type measure between distributions and is defined as [44],

$$BC(p_i, q_i) = \sum_{d=1}^{D+1} \sqrt{p_{id} \, q_{id}}$$

Considering a $(D + 1)$-dimensional vector, it can be geometrically interpreted that the Bhattacharyya coefficient measures the cosine of the angle between the vectors $(\sqrt{p_{i1}}, \ldots, \sqrt{p_{i,D+1}})$ and $(\sqrt{q_{i1}}, \ldots, \sqrt{q_{i,D+1}})$.
Since $p_i$ and $q_i$ represent probability distributions, the coefficient is 1 if they have the same density function. However, this coefficient cannot be used as a distance since it violates the axioms of a distance metric [45]. To obtain a proper distance metric, [44] proposed a modified form of the coefficient, and the corresponding kernel is built on this distance with hyperparameter $\gamma$.

• Generalized Histogram Intersection
The histogram intersection kernel is a positive definite kernel and satisfies Mercer's condition to be used in SVM [2], [46]. Global or low-level features are commonly used with this kernel; however, local features work well with it too. Given two vectors $p_i$ and $p_j$ containing the elements of two normalized histograms, histogram intersection measures the similarity between them using (34) [47].

• Jeffrey Divergence
KL-divergence is non-symmetric and sensitive to histogram binning [48]. In addition, it is not robust and does not qualify as a metric of the spread since it violates the triangle inequality. In response to this, the Jeffrey divergence is empirically derived, and it is mostly invariant to noise and histogram binning [49].

• Rational Quadratic
From the probabilistic graphical point of view, several squared error kernels are derived, and the rational quadratic kernel is one of them. This kernel is a scale mixture of squared exponential kernels with different characteristic length scales [50]. It is useful for modelling data which vary at multiple scales.

$$K(p, q) = \left( 1 + \frac{\|p - q\|^2}{2 \alpha l^2} \right)^{-\alpha}$$

Here, $\alpha$ is the scale mixture parameter and $l$ is the length scale.

• Inverse Multiquadric
The inverse multiquadric function is a member of the generalized multiquadric (GMQ) family of radial basis functions defined by $K(p, q) = \left( c^2 + (\epsilon r)^2 \right)^{\beta}$ [51], where $r = \|p - q\|$, $\epsilon$ is the shape parameter, and the parameter $\beta$ determines the positive definiteness of the function [52]. Unlike the multiquadric kernel, the inverse multiquadric kernel is positive definite [53]. Setting $\beta = -\frac{1}{2}$, we get the following expression for this kernel,

$$K(p, q) = \frac{1}{\sqrt{c^2 + (\epsilon r)^2}}$$

• ANOVA
The ANOVA kernel is one of the examples of convolution kernels [54]. This kernel uses a factor $d$ to get higher-order interactions of the features that we are interested in and then sums over the terms to get the similarity score.

• Generalized T-student Kernel
This is a positive semi-definite kernel and satisfies the condition of Mercer's theorem [55]. It has a form similar to the inverse multiquadric kernel.

• MinMax
MinMax is a graph kernel proposed by [56] which is similar to the Tanimoto kernel when applied to a binary dataset. The MinMax kernel models count data and thus takes into account the values between 0 and 1. Therefore, this kernel is suitable for proportional data modelling.

• Cauchy
Derived from the long-tailed Cauchy distribution, the Cauchy kernel puts more weight on the interaction of distant non-zero values [57]. [58] applied the Cauchy kernel for sparse coding of natural scene data.
Unlike the Gaussian kernel, in this kernel moving away from the center gives more weight to the features. A combination of these two kernels showed good classification performance on some datasets [57].

• Cosine Similarity
In an inner product space, cosine similarity measures the similarity between two vectors by comparing their directions [59]. This is a non-metric measure since it does not satisfy all the conditions of a metric.

$$\mathrm{cossim}(p_i, q_i) = \frac{\langle p_i, q_i \rangle}{\|p_i\| \, \|q_i\|} \qquad (42)$$

• Tanimoto or Extended Jaccard Similarity
A modification of the cosine similarity function results in the Tanimoto similarity index [56]. It represents the proportion of attributes shared by the vectors.

$$T(p_i, q_i) = \frac{\langle p_i, q_i \rangle}{\langle p_i, p_i \rangle + \langle q_i, q_i \rangle - \langle p_i, q_i \rangle}$$

Here, $\langle p_i, q_i \rangle = \sum_{d=1}^{D} p_{id} \times q_{id}$, and the terms $\langle p_i, p_i \rangle = \|p_i\|^2$ and $\langle q_i, q_i \rangle = \|q_i\|^2$ are the squared Euclidean norms (lengths) of the vectors. [60] derived a modified Tanimoto coefficient expressed in terms of the cosine similarity, where $\mathrm{cossim}(x_i, y_j)$ is calculated from (42).

• Sorensen Similarity
Similar to cosine similarity, the Sorensen similarity index is a non-metric measure as it does not satisfy all the axioms of a metric. This measure is more appropriate than the Euclidean distance for retaining the sensitivity of heterogeneous data, and it is used in image segmentation and lexical association [61], [62].

Algorithm 1 shows the steps for the Dirichlet SVM, generalized Dirichlet SVM and Beta-Liouville SVM using (29).

Algorithm 1 (excerpt):
• Base kernel: Compute $K(p, q)$ from (33) to (45) for $\phi_j(p_i)$ in (29) only when $j = 1, 2, \ldots, D$.
• BLSVM: Use the first and fourth forms of (29) for $\phi_j(p_i)$ and apply (33) to (45) to compute the BLSVM kernel $K(p, q)$.
5. Optimization: Solve the primal problem in (26) or the dual problem in (27) to get the support vectors.
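The mapping and kernel steps of this section can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the extra feature is the raw Dirichlet density of the histogram under already-fitted parameters (whether a log or rescaled density is used in practice is not specified here), and it uses the commonly cited forms of the MinMax and cosine kernels. All function names and the toy histograms are hypothetical.

```python
import math


def dirichlet_pdf(p, alpha):
    """Dirichlet density at p, computed through the log-density for stability."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return math.exp(log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p)))


def dirichlet_feature_map(p, alpha):
    """Append one density feature to the D histogram features (dimension D + 1)."""
    return list(p) + [dirichlet_pdf(p, alpha)]


def minmax_kernel(u, v):
    """MinMax similarity: sum of element-wise minima over sum of element-wise maxima."""
    return sum(min(a, b) for a, b in zip(u, v)) / sum(max(a, b) for a, b in zip(u, v))


def cosine_kernel(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gram_matrix(vectors, kernel):
    """Kernel (similarity) matrix that a dual SVM solver would consume."""
    return [[kernel(a, b) for b in vectors] for a in vectors]


# Hypothetical L1-normalized histograms and an assumed, already-fitted alpha.
histograms = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1], [0.25, 0.25, 0.5]]
alpha_hat = [2.0, 2.0, 3.0]

mapped = [dirichlet_feature_map(p, alpha_hat) for p in histograms]
for row in gram_matrix(mapped, minmax_kernel):
    print([round(k, 3) for k in row])
```

The resulting matrix can be handed to any SVM solver that accepts a precomputed kernel; swapping `minmax_kernel` for `cosine_kernel` changes only the similarity measure, not the mapping.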

VI. EXPERIMENTAL RESULTS
In this section, we evaluate the proposed feature mapping technique for natural scene classification, satellite image classification and human action recognition in videos. The dual form of the SVM optimization problem is solved using [63]. For multi-class classification, the one-vs-all technique is applied, a tolerance of $10^{-3}$ is used as the stopping criterion, and a hard limit on the solver is imposed by setting the maximum number of iterations to 5000. All the models are evaluated using 10-fold cross validation: 9 folds are used for training and the remaining fold for testing the model. Similar to [5], for image classification the best score is reported for each kernel, and for action recognition the average score with standard deviation is reported for all kernels. To control misclassification, the hyperparameter $C$ in the objective function is varied from 1 to 15 on a base-10 logarithmic scale, and the best models are found by a simple grid search and reported. For the polynomial kernel, degree 3 is considered, and for the RBF kernel the similarity measurements are scaled down by dividing by the vocabulary size. In general, BLSVM performs better than the DSVM and GDSVM approaches. As mentioned, the generalized Dirichlet distribution has twice the number of parameters of the Dirichlet distribution, and its density values are more concentrated around the mean compared to the Dirichlet distribution (Fig. 1). Since our approach is to perform feature mapping after combining discriminative features with distribution-based features, we assume that the feature pair similarity values in the similarity matrix for the generalized Dirichlet distribution are hard to separate after solving the dual form of the SVM in (27). Therefore, the Beta-Liouville distribution proves to be a better generalization of the Dirichlet distribution in our proposed approach.

A. 15 SCENES DATASET CLASSIFICATION
Scene recognition is essential for reasoning in navigation and recognition tasks. Especially in robotics and automation, it is important for enhancing a machine's visual understanding [64]. The 15 scene dataset consists of 15 different scene categories. The first 13 categories were collected jointly by [65] and [66]. For our experiment, 100 images were selected from each category, totalling 1500 images. Local features are extracted using the Scale Invariant Feature Transform (SIFT) [67] algorithm, as it is invariant to scale and rotation. In our experiment, we calculate dense SIFT [65] for speed using [68]. Descriptors are computed for densely sampled keypoints with similar size and orientation. Each image is converted to gray-scale, and for each pixel, descriptors are computed over a patch of 16 × 16 pixels. The extracted features are quantized into a vocabulary of size 200. Table 2 shows the best results for the baseline SVM, DSVM, GDSVM and BLSVM.
We can see that, with our proposed feature mapping technique, the accuracy score for the classification task has significantly improved. This is due to the increased separability among the support vectors from each image category. In the case of the linear feature map, BLSVM shows a 2% improvement in accuracy score. Non-linear kernels with BLSVM perform better than with DSVM and GDSVM. We conduct statistical hypothesis testing (t-test) to investigate the scores of each approach. The results of DSVM and BLSVM are statistically significant (p-value < 5%). However, the performance difference between the baseline score and GDSVM is not statistically significant for this dataset (p-value of 0.40). The mean average accuracy of BLSVM is 74.67% compared to DSVM's 74.27%. Thus, BLSVM is the preferred method for this dataset, requiring only 2 more parameters to learn than DSVM. Such improvement is perhaps because of the Beta-Liouville distribution's better generalization capabilities, shown in (11)-(13), to capture the data distribution with fewer parameters. Fig. 2 shows the probability distribution of the experimental results presented in Table 2. It is evident that Beta-Liouville distribution-based feature mapping can be used more confidently with traditional kernel functions. The wider region around the mean in the violin plot represents a higher probability of getting a consistent average result.
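The quantization step used in this subsection, where local descriptors are assigned to a vocabulary and counted, can be sketched as below. The sketch assumes a codebook is already available (for example from clustering) and uses tiny 2-dimensional descriptors for readability, whereas the experiments use SIFT descriptors and a vocabulary of size 200. The function names are hypothetical.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codebook entry with the smallest squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bovw_histogram(descriptors, codebook):
    """L1-normalized histogram of codeword assignments for one image."""
    if not descriptors:
        raise ValueError("at least one descriptor is required")
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Toy codebook with three visual words and four descriptors from one image.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 0.1], [0.8, -0.1], [0.1, 0.9]]
print(bovw_histogram(descriptors, codebook))
```

The output is non-negative and sums to one, so it satisfies the constraints of proportional data. A bin that receives no descriptors is exactly zero, which a density-based feature map has to handle, for instance by smoothing; that choice is left open here.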

B. SATELLITE IMAGE CLASSIFICATION
This dataset has 19 categories of Google satellite images collected from http://www.escience.cn/people/yangwen/WHU-RS19.html. Each category has 50 images and the resolution of each image is 600 × 600. The challenge in classifying high-resolution satellite image data is that the dominance of structures and objects in the image leads to misclassification [69]. For feature extraction, we use the same configuration as described in the previous section. For all the kernels, BLSVM outperforms the baseline SVM, DSVM and GDSVM, except for the exponential kernel, where the generalized Dirichlet SVM achieves a higher accuracy of 88.991% (Table 2). Considering the linear form of the SVM, BLSVM gives the highest accuracy of 90.196%, whereas the linear SVM achieves 86.364% accuracy. A small p-value (less than 0.005) from Student's t-test confirms that the performance results obtained from BLSVM are statistically significant, and thus we reject the null hypothesis of equal averages with the other approaches. Fig. 6 shows that the distribution of accuracy scores for BLSVM has less variance than the other approaches, which indicates that it can be used confidently with the selected kernels for this dataset.

C. HUMAN ACTION RECOGNITION
Recognizing human actions in videos is an interesting learning task for surveillance and navigation. To evaluate our model on videos, we choose the KTH human action recognition data introduced by [7]. This dataset contains 6 categories, each having 100 videos with 4 different scenarios, and each action is performed by 25 different persons with different variations, e.g. different color of clothing, different motion of the person, camera angle, zooming, jittering, etc. In total, there are 2391 sequences in this dataset. We are interested in dense features as they are more accurate than sparse features. Thus, we use the dense optical flow algorithm proposed by [70]. The open-source computer vision library [71] is used with default values to extract features with a codebook size of 500. Each frame is resized to 160 × 120 and further downsampled to 16 × 12 by taking the pixel values at the positions which are divisible by 10. $L_2$ normalization is used for feature invariance. To create the Dirichlet, generalized Dirichlet and Beta-Liouville SVM, the whole dataset is normalized as proposed in [5]. For 10-fold cross validation, the mean accuracy with standard deviation is reported in Table 3. In total, 384 videos are used for training and 216 videos are used for testing. In the test data, each class has 36 videos.
From Table 3, the highest average accuracy of the baseline SVM is 92.034% for the MinMax kernel, which is increased to 94.104% when we combine the MinMax kernel with the Dirichlet feature mapping function. Fig. 7 shows the confusion matrix for the DSVM MinMax kernel, which achieves 87.50% accuracy on the test data compared to the base MinMax kernel's score of 86.11%. Significance testing using Student's t-distribution shows that the results of DSVM and BLSVM are statistically significant (p-values between 0.0009 and 0.007). The GDSVM score is not significantly different from the baseline SVM (p-value of 0.7). The heavy right tail of BLSVM's performance distribution in Fig. 8 signifies that there is a greater probability of getting a higher accuracy score than with DSVM.
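The frame downsampling and normalization described in this subsection can be illustrated with plain lists. This is a sketch under assumptions: the frame is stored as 120 rows of 160 values, "positions divisible by 10" is read as every tenth row and column starting at index 0, and the normalization is applied to a flattened vector. The function names are hypothetical.

```python
import math


def downsample_frame(frame, step=10):
    """Keep the values at row and column indices divisible by `step`."""
    return [row[::step] for row in frame[::step]]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return list(vector) if norm == 0 else [v / norm for v in vector]


# Synthetic 160 x 120 frame (120 rows, 160 columns) with distinct values per pixel.
frame = [[r * 160 + c for c in range(160)] for r in range(120)]
small = downsample_frame(frame)
print(len(small), len(small[0]))

flat = [float(v) for row in small for v in row]
unit = l2_normalize(flat)
print(round(math.sqrt(sum(v * v for v in unit)), 6))
```

Index-based subsampling of this kind is cheap but does not average neighbouring pixels, so it keeps exact values from the resized frame rather than a smoothed estimate.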

VII. CONCLUSION
This paper presents a novel feature mapping technique for proportional data based on the Dirichlet, generalized Dirichlet and Beta-Liouville distributions, which shows good accuracy in classifying images and videos. Such data are prevalent in data mining, image processing and pattern recognition problems, which motivated us to exploit the statistical representation of the data in order to enhance the discriminative power of the traditional SVM kernels. In particular, we have introduced three feature mapping functions for proportional data to be used in the SVM learning algorithm. Our experiments show good performance of the proposed technique in classifying natural and satellite images and also in recognizing human actions in videos. The results also show that each of the proposed distribution-based feature mapping functions increases the accuracy of the corresponding SVM kernel.
MD. HAFIZUR RAHMAN received the bachelor's degree in industrial engineering from the Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh, in 2017, and the master's degree in quality systems engineering from Concordia University, Montreal, Canada, in 2020. He is currently working as a ML Researcher and Developer at Heyday.ai. His research interests include deep learning, Bayesian deep learning, and graph representation learning and their multidisciplinary applications.
NIZAR BOUGUILA (Senior Member, IEEE) received the Engineering degree in computer science from the University of Tunis, Tunis, Tunisia, in 2000, and the M.Sc. and Ph.D. degrees in computer science from the University of Sherbrooke, Sherbrooke, QC, Canada, in 2002 and 2006, respectively. He is currently a Professor with the Concordia Institute for Information Systems Engineering, Concordia University, Montréal, QC, Canada. His research interests include image processing, machine learning, data mining, 3-D graphics, computer vision, and pattern recognition.