Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled Corpus

Short texts are omnipresent in real-time news, social network commentaries, etc. Traditional text representation methods have been successfully applied to self-contained documents of medium size. However, information in short texts is often insufficient, due, for example, to the use of mnemonics, which makes them hard to classify. Therefore, the particularities of specific domains must be exploited. In this article we describe a novel system that combines Natural Language Processing techniques with Machine Learning algorithms to classify banking transaction descriptions for personal finance management, a problem that was not previously considered in the literature. We trained and tested that system on a labelled dataset with real customer transactions that will be available to other researchers on request. Motivated by existing solutions in spam detection, we also propose a short text similarity detector to reduce training set size based on the Jaccard distance. Experimental results with a two-stage classifier combining this detector with a SVM indicate a high accuracy in comparison with alternative approaches, taking into account complexity and computing time. Finally, we present a use case with a personal finance application, CoinScrap, which is available at Google Play and App Store.


I. INTRODUCTION
Financial companies need to develop new strategies to keep and expand their customer base. Their product portfolios have diversified over the years and customer behaviour has shifted from long-term loyalty to online interaction.
The fierce competition between banks has led to a growing need to convert customer data, which include short-text banking transaction (BT) descriptions, into information relevant for decision making.
Data mining has been successfully applied to finance in various ways: identifying likely candidates for loan disbursement [1] and product acceptance [2]; characterizing product segments [3]; and analysing customer attrition and retention [4]. However, to the best of our knowledge, the problem of automatic classification of short-text BT descriptions (according to a predefined set of labels) has not yet been tackled.
From a broader perspective, automated text classification has become a popular research area due to the many public digital text sources available. Text classification is useful for a wide range of applications, such as web searching [5], opinion mining [6] and event detection [7]. Nevertheless, most text classification methods are valid for long texts. Some distinctive aspects of short texts are:
1) Sparsity: Short texts often have fewer than 150 words and are usually organized in few sentences. They convey very little effective information. Since sparsity affects the quality of short-text semantics, traditional techniques such as those used for long texts are impractical [8], [9], as it is difficult to extract key features from large feature spaces for accurate classification training.
2) Real-time generation: Nowadays vast amounts of information are continuously produced in the form of short messages. Consider, for example, chat and micro-blog information and news comments, among others. They reflect real-time reactions to outside-world events and are therefore difficult to collect. Consequently, short-text classification methods must be highly efficient.
3) Irregularity: Short-text terminology is not standardized and vocabularies are informal or specific (in our case, related to banking).
Two key aspects are that words are seldom repeated in a given BT description and that few words are irrelevant. The level of significance of a word cannot be simply determined by its repetition within the text. However, for the same reasons, short texts are less noisy than long texts.
Our proposal is based on Natural Language Processing (NLP) and Machine Learning (ML). It characterizes financial short messages with features such as character and word n-grams, which feed a supervised Support Vector Machine (SVM) classifier. Motivated by existing solutions in spam detection, we also propose a short text similarity detector to reduce training set size based on the Jaccard distance. Therefore, our proposal consists of a two-stage classifier combining this detector with a SVM. In any case, the sizes of short-text banking description datasets discourage the application of deep learning techniques [10].
The rest of this article is organized as follows. In Section II we review the state of the art in short-text classification. In Section III we describe the classification problem. Subsections III-A1 to III-A4 explain the modules of our system. In particular, Section III-A4 describes the short text similarity detector to reduce training set size based on the Jaccard distance. Section IV presents the experimental text corpora and evaluates our approach with real data. Section V cites a real-world solution based on our approach. Finally, Section VI concludes the paper.

II. RELATED WORK

A. CUSTOMER ANALYSIS
BT data have grown considerably with the expansion of electronic banking [11]. The banking sector is well aware of the value of customer information covering demographics, leisure, wealth, insurance, financial transactions, and so on.
Several studies have been conducted on the analysis of customer attrition and retention. Some focus on aspects influencing customer choices, such as customer care, speed and quality of service, variety of services, fees, online accessibility, etc. [12]-[14]. Other studies have focused on customer churn (that is, leaving one bank for another) [12], [15]-[17], fraud [18], [19] and even spatial distribution from transaction activity in commercial areas [20].

B. PERSONAL FINANCE MANAGEMENT
Personal finance management (PFM) aggregates household bank accounts and offers users a view of their day-to-day personal finances. It involves planning and budgeting, cash flow control, investment, taxation, and insurance [21]. It is becoming increasingly popular and many PFM resources such as BudgetBuddy, AccBiz, Prosper, Finn and Figo exploit PFM by recommending personalized insurance products or long-term financing plans. These applications also provide budgeting and credit scoring tools to help households track their expenses and credit score.

C. OPEN BANKING EUROPEAN REGULATION
The European path to digitization is based on four pillars [22]: (1) extensive reporting requirements to control systemic risk and change financial sector behaviour; (2) strict data protection rules; (3) open banking to enhance competition; and (4) a legislative framework for digital identification. In this line, the Second Payment Services Directive (PSD 2) empowers customers to make their banking data available to third parties such as FinTech companies. In essence, it paves the way for new banking products and services by promoting competition without compromising security.

D. TEXT CLASSIFICATION
Most existing approaches for text classification rely on simple document representations in word-oriented input spaces. Despite considerable efforts to introduce more sophisticated techniques for document representation such as those based on higher-order word statistics [23], NLP [24], string kernels [25] and word clusters [26], simple bag-of-words (BOW) approaches [27] are still popular.
Different ML methods, such as Naive Bayes [28], logistic regression [29] and SVMs [30], have been proposed for text classification. In particular, linear classifiers, which are efficient, robust and easy to interpret, have been successful at sentiment analysis [31].
Diverse complex features have been added to these text classification models. Some examples are parts-of-speech and phrase information [32], syntax integration by means of explicit features and implicit kernels [33], and, for sentiment analysis, dependency tree features [34] and semantic composition models [35]. In [36] it was shown that BOW and bigram features are more productive than much more complex features. Distributed word representations [37]-[39] have enriched discrete models for semi-supervised learning. Word embeddings have mostly been used to feed neural network models such as recursive tensor networks [40], dynamic pooling networks [41] and deep convolutional neural networks [42]. Finally, direct learning of distributed vector representations of paragraphs and sentences for text classification was discussed in [43].
As previously mentioned, unlike normal text classification, short-text classification must tackle the problem of sparsity [44]. Rare and even missing words in training texts may appear in testing data. Most words only appear once in the texts that include them. Therefore, the term frequency-inverse document frequency (TF-IDF) metric is not representative. To address this issue, some researchers enrich data contexts with information from Wikipedia [45] and ontologies [46]. However, this requires solid NLP knowledge and high-dimensional representations that may be expensive in terms of memory and computing time and, thus, inefficient for real-time solutions. The more sophisticated approach in [47] applied a Dirichlet multinomial mixture model for short-text classification. The approach in [48] clustered texts using the Locality Preserving Indexing (LPI) algorithm. In recurrent neural network (RNN) approaches, textual trees are also computationally expensive [49]. Therefore, the design of efficient models is still challenging.
Two well-known methods for short-text classification are Probase Bag-of-Concepts short-text classification (Probase BOC STC) [50] and Entity Explicit Semantic Analysis (ESA) [51]. ESA is based on semantic relation degrees [52]-[54] from Wikipedia. It associates all words in a Wikipedia page to the corresponding Wikipedia entry (concept) using the TF-IDF value as correlation metric and produces indexes that map each word in a short text to the concepts considered. Note that the short text may not mention the concept explicitly. ESA uses the vector representation of a short text as the input of a SVM classifier. Probase BOC STC, in turn, is based on the Probase knowledge base of entity relationships and other related information that Microsoft extracted from massive Internet data using the is-a relationship. A key difference with ESA is that Probase is a knowledge base by itself that has been produced with an automatic extraction algorithm. However, as in the case of ESA indexing, is-a relationships may lack relevant information for short-text classification.
To the best of our knowledge, no previous research has considered short-text BT classification. We propose a simple and efficient approach that could be easily adapted to other application domains.

III. SYSTEM DESCRIPTION
We seek to develop a simple and efficient short-text classification system by taking advantage of the particularities of BT descriptions, with high macro-average precision, recall and F measures. Our approach has three stages, as described in Figure 1.

1) Data retrieval
Data was retrieved with the CoinScrap coin scraper embedded into electronic banking apps of real users, who granted us permission. Section IV-A describes the resulting dataset.

2) Text tokenization and stopwords
The language of BT descriptions is quite particular because it must be concise. The meaning of the message is condensed in few characters. In most cases verbs are totally absent. Nevertheless, BT descriptions may still contain useless information that may affect text classification.
First, each BT description is split into tokens and, in some cases, into sentences. Then meaningless words or stopwords, such as determiners and prepositions ('el'/'the', 'en'/'in', 'entonces'/'so', 'aunque'/'although', 'pero'/'but' and so on), are removed. Table 1 shows some stopword examples. Next, all punctuation marks apart from '.' and ',' are also removed. Finally, proper names are detected using lists of names and surnames and replaced by a tag. Taking the real BT description 'Compra en supermercado Elvira Madrid 28.TARJ.:*320546' as an example, after text tokenization, stopword removal and proper name extraction, the result is 'Compra # supermercado #PNegi# Madrid 28.TARJ.#320546'. The "#" symbol marks the place where a word is removed. Note that each proper name is substituted by "#PN" followed by a set of characters ('egi' in the example) and "#". Thus, a given name is always replaced by the same identifier ('Elvira' by #PNegi# in the example). Credit card numbers were always anonymized.
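The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopword and name lists are tiny hypothetical stand-ins, the three-character name identifier is derived from a hash (the text does not say how it is obtained), and runs of removed punctuation are replaced by "#" to mirror the worked example.

```python
import hashlib
import re

# Hypothetical, tiny resources for illustration only; the real stopword and
# name/surname lists are not reproduced here.
STOPWORDS = {"el", "en", "entonces", "aunque", "pero"}
PROPER_NAMES = {"elvira"}


def name_tag(name: str) -> str:
    """Return a stable tag '#PNxxx#' for a proper name.

    Assumption: the three characters come from an MD5 digest so that the
    same name always maps to the same identifier.
    """
    code = hashlib.md5(name.lower().encode("utf-8")).hexdigest()[:3]
    return f"#PN{code}#"


def preprocess(description: str) -> str:
    """Tokenize on whitespace, mark removed stopwords, tag proper names."""
    out = []
    for token in description.split():
        low = token.lower()
        if low in STOPWORDS:
            out.append("#")  # '#' marks the place where a word was removed
        elif low in PROPER_NAMES:
            out.append(name_tag(token))
        else:
            # Drop punctuation other than '.' and ','; the gap is marked '#'
            out.append(re.sub(r"[^\w.,]+", "#", token))
    return " ".join(out)


print(preprocess("Compra en supermercado Elvira Madrid 28.TARJ.:*320546"))
```

The printed string has the same shape as the example in the text, except that the name identifier differs because it depends on the hashing choice made here.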

4) Training sample reduction with similarity detection stage
We take advantage of the fact that many BT descriptions are similar to reduce the size of the training set. For that purpose, we insert a similarity detector based on the Jaccard distance [55] before the classifier. This is inspired by spam detection techniques that use this distance to search for characteristic sentences [56]-[58].
The similarity detector only considers the text of the descriptions. When the Jaccard similarity between a new labelled description and a previous entry in our dataset exceeds 85%, and both belong to the same category, the new description is not added to the SVM training set. Otherwise, we keep it. When the similarity between a new unlabelled description and a previous entry exceeds 85%, we assign to the description the class of the entry. Otherwise, the description is passed to the SVM for classification.
Figure 2 illustrates the architecture of the system including the Jaccard similarity detector. The SVM classifier is explained in Section III-B3.
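A minimal sketch of this two-stage logic is given below. It assumes that the Jaccard similarity is computed over sets of whitespace-separated words and that new descriptions are compared against the entries already kept; the text does not fix either detail. The `svm_predict` argument is a hypothetical stand-in for the trained SVM, and the category names are illustrative.

```python
THRESHOLD = 0.85  # 85% similarity threshold stated in the text


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def reduce_training_set(labelled):
    """Drop a labelled description when a kept entry of the same
    category is more than 85% similar; keep it otherwise."""
    kept = []
    for text, label in labelled:
        duplicate = any(
            lab == label and jaccard_similarity(text, t) > THRESHOLD
            for t, lab in kept
        )
        if not duplicate:
            kept.append((text, label))
    return kept


def classify(text, kept, svm_predict):
    """Stage 1: reuse the class of a near-duplicate entry.
    Stage 2: otherwise defer to the SVM (passed in as a callable)."""
    for t, lab in kept:
        if jaccard_similarity(text, t) > THRESHOLD:
            return lab
    return svm_predict(text)


data = [
    ("compra supermercado madrid", "shopping"),
    ("compra supermercado madrid", "shopping"),
    ("recibo luz", "household"),
]
kept = reduce_training_set(data)
print(len(kept), classify("recibo luz", kept, lambda t: "svm"))
```

Comparing only against kept entries keeps the detector cost proportional to the reduced training set rather than to the full corpus.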

B. MACHINE LEARNING ANALYSIS
In this section we explain the knowledge-based linguistic extraction as well as the feature selection.

1) Linguistic knowledge extraction
In this step we create lexica whose entries are related to the categories of the classification problem. Figure 3 represents the lexicon generation procedure.
First, starting from the preprocessed BT descriptions in the training set, which are labelled according to the classification categories, all non-alphabetic characters such as numbers, punctuation marks and symbols are cleared. Next, useful final elements for the lexica are extracted. These are the unigrams that appear at least five times in the text corpora for each category (all others are excluded) and the bigrams that are present at least three times in the corpora. Single-character alphabetic elements are also discarded. The final result is a set of lexica with unigrams and bigrams and their corresponding categories.
For example, let us suppose that the training set only has the following entries for a given category: The resulting lexicon would only contain the words 'compra' and 'supermercado' and the bigram 'compra supermercado' followed by the categories.
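The lexicon generation procedure can be sketched as below. The training entries are hypothetical (the original example entries are not reproduced), and two details are assumptions: bigram counts are taken per category, and single-character elements are dropped before bigrams are formed.

```python
import re
from collections import Counter, defaultdict

MIN_UNIGRAM = 5  # unigrams must appear at least five times
MIN_BIGRAM = 3   # bigrams must appear at least three times


def build_lexica(training_set):
    """Build one lexicon per category from (description, category) pairs."""
    unigrams = defaultdict(Counter)
    bigrams = defaultdict(Counter)
    for text, category in training_set:
        # Clear numbers, punctuation marks and symbols
        cleaned = re.sub(r"[^a-záéíóúüñ\s]", " ", text.lower())
        # Discard single-character alphabetic elements
        words = [w for w in cleaned.split() if len(w) > 1]
        unigrams[category].update(words)
        bigrams[category].update(zip(words, words[1:]))
    return {
        category: {
            "unigrams": {w for w, c in unigrams[category].items()
                         if c >= MIN_UNIGRAM},
            "bigrams": {" ".join(b) for b, c in bigrams[category].items()
                        if c >= MIN_BIGRAM},
        }
        for category in unigrams
    }


# Hypothetical training set: the same description repeated five times
training = [("Compra supermercado 123", "shopping")] * 5
lexica = build_lexica(training)
print(lexica["shopping"])
```

With these hypothetical entries the lexicon contains the words 'compra' and 'supermercado' and the bigram 'compra supermercado', which is the kind of result the example in the text describes.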

2) Feature selection and weight calculation
The system uses a standard SVM algorithm for modelling and prediction. Short texts are encoded according to the vector space model in [59]. The smallest data unit in the model corresponds to a feature. A text T may be seen as an n-dimensional vector in the vector space, as follows:

$$T = \big((t_1, w_1), (t_2, w_2), \ldots, (t_n, w_n)\big) \qquad (1)$$

where $t_i$ is the value of the i-th feature of text T and $w_i$ its weight. The greater the weight, the more information the feature carries [60]. Many different types of features are possible, such as Boolean, word frequency (number of times a word appears in the text) and TF-IDF. Note that classification results depend greatly on feature selection [61], [62]. An efficient feature selection method not only reduces the dimension of the feature space but also avoids useless features. The features in our system are the following:
1) Lexicon data. These features count the words in the BT descriptions that appear in the lexica for each existing category.
2) Amount. The range of the BT amount field, since ranges are more significant for our application than exact values. Specifically, we consider non-overlapping intervals limited by 20, 60, 200, 800, 1500 and 3000 euros.
3) Sign of the amount. This feature indicates if the BT is an income (positive) or an expense (negative).
4) Date. The information in the date field of each BT. Again, we use ranges. This is because some events occur on specific days of the month (e.g. salary at the end), whereas other events (e.g. purchases) may happen anytime during the month. The selected ranges were the last five, ten, twenty and twenty-five days of the month.
5) Word n-grams. N-gram representation is language-independent. It transforms documents into high-dimensional feature vectors where each feature corresponds to a contiguous sub-string. Formally, an n-gram consists of n adjacent items from alphabet A. Items can be phonemes, syllables, letters, words or base pairs depending on the application. Hence, the number of different n-grams in a text is $|A|^n$ at most. The dimension of an n-gram feature sub-vector may therefore be very high even for moderate values of n. However, since not all n-grams are present in a document, the dimension is substantially reduced. During the formation of an n-gram feature sub-vector, all upper-case characters are converted into lower-case characters and punctuation marks are converted to spaces. Sub-vectors are then normalized. The optimal n depends on the text corpora. We explain feature sub-vectors with an example that computes the n-grams from one to four words for the BT description 'Operación tarjeta débito Amazon' ('Amazon debit card transaction'). The resulting vector consists of the following components: 'operación' ('transaction'), 'tarjeta' ('card'), 'débito' ('debit'), 'amazon'; 'operación tarjeta' ('card transaction'), 'tarjeta débito' ('debit card'), 'débito amazon' ('amazon debit'); 'operación tarjeta débito' ('debit card transaction'), 'tarjeta débito amazon' ('amazon debit card'); 'operación tarjeta débito amazon' ('amazon debit card transaction').
6) Character n-grams. Character n-grams have proven useful for a variety of ML problems, such as language detection. Simple models based on them have outperformed convolutional and recursive deep neural networks (CNNs and RNNs) [63]-[65]. They have been applied in scenarios with misspelling errors [66], [67]. Character n-grams may also capture other effects of language usage, such as re-named entities and abbreviations, e.g. 'maths' instead of 'mathematics'. In our case, they are justified by the many shortened words in BT descriptions.
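Some of the features listed above can be sketched in a few lines. The snippet is illustrative only: the function names are hypothetical, the assignment of boundary values (e.g. exactly 20 euros) to the upper interval is an assumption, and the default n ranges (1-4 for words, 3-5 for characters) are the values reported later in the experiments. Sub-vector normalization and the date ranges are omitted.

```python
import bisect
import re

AMOUNT_LIMITS = [20, 60, 200, 800, 1500, 3000]  # euros, from the text


def amount_bucket(amount: float) -> int:
    """Index (0..6) of the interval containing the absolute amount."""
    return bisect.bisect_right(AMOUNT_LIMITS, abs(amount))


def amount_sign(amount: float) -> int:
    """+1 for income (positive), -1 for expense (negative)."""
    return 1 if amount >= 0 else -1


def normalize(text: str) -> str:
    """Lower-case the text and convert punctuation marks to spaces."""
    return re.sub(r"[^\w\s]", " ", text.lower())


def word_ngrams(text: str, n_min: int = 1, n_max: int = 4):
    """All contiguous word sequences of length n_min..n_max."""
    words = normalize(text).split()
    return [
        " ".join(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]


def char_ngrams(text: str, n_min: int = 3, n_max: int = 5):
    """All contiguous character sequences of length n_min..n_max."""
    s = normalize(text)
    return [
        s[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(s) - n + 1)
    ]


grams = word_ngrams("Operación tarjeta débito Amazon")
print(len(grams), grams[-1])
```

For the four-word description used in the text, this yields 4 + 3 + 2 + 1 = 10 word n-grams, matching the components enumerated there.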

3) SVM classifier
We decompose the overall problem into pairwise two-class problems, following a one-versus-one approach. Therefore, k(k − 1)/2 SVM classification models are necessary for k text classes. The category is decided by majority voting.
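The one-versus-one decomposition and the voting step can be illustrated with the sketch below. The `pairwise_decide` callable is a hypothetical stand-in for the trained two-class SVMs, the category names are illustrative, and ties are broken arbitrarily (by first occurrence), which the text does not specify.

```python
from collections import Counter
from itertools import combinations


def pairwise_problems(classes):
    """Enumerate the k(k - 1)/2 two-class problems for k classes."""
    return list(combinations(classes, 2))


def ovo_predict(x, classes, pairwise_decide):
    """Each pairwise model votes for one of its two classes;
    the class with the most votes is returned."""
    votes = Counter(
        pairwise_decide(x, a, b) for a, b in combinations(classes, 2)
    )
    return votes.most_common(1)[0][0]


# With the fifteen categories of the dataset, 15 * 14 / 2 models are needed
print(len(pairwise_problems(range(15))))


# Hypothetical stub: every model that involves 'payroll' votes for it
def stub_decide(x, a, b):
    return "payroll" if "payroll" in (a, b) else a


print(ovo_predict("nomina", ["shopping", "taxes", "payroll"], stub_decide))
```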

A. DATASET
The dataset comprises 30,844 BT descriptions from customer accounts of major Spanish banks, written mostly in Spanish and issued between August 2017 and February 2018. They were collected during the CatCoin project with the collaboration of CoinScrap Finance S.L., Spain, using the CoinScrap platform. The entries of the dataset have the following attributes:
1) ID: a unique numeric identifier.
2) Description: the BT short-text description.
3) Amount: the amount in euros of the BT, either positive (income) or negative (expense).
4) Date: the date when the BT occurred.
Every entry has an extra field with the category label that determines the classification goal. The dataset may be requested from the authors by e-mail. Table 2 shows the numerical distributions of the fifteen categories in the dataset. Table 3 shows some examples of dataset entries.

B. EVALUATION METRICS
Due to the issues of accuracy with class asymmetries [68], [69], we employed precision, recall and F metrics using a macro-average approach.
Macro-averaged results were computed as indicated by [70]. Consider a binary evaluation metric $B(tp, tn, fp, fn)$ that is calculated based on the number of true positives ($tp$), true negatives ($tn$), false positives ($fp$) and false negatives ($fn$). Let $tp_\lambda$, $fp_\lambda$, $tn_\lambda$ and $fn_\lambda$ be the numbers of true positives, false positives, true negatives and false negatives, respectively, after binary evaluation for label $\lambda$. The macro-average evaluation metric is calculated as follows:

$$B_{macro} = \frac{1}{k} \sum_{\lambda=1}^{k} B(tp_\lambda, tn_\lambda, fp_\lambda, fn_\lambda) \qquad (2)$$

where $k$ is the number of labels. Macro-averaging weights all classes equally, whereas micro-averaging weights all document classification decisions equally. Since F ignores true negatives and its magnitude is mostly determined by the number of true positives, large classes dominate over small classes in micro-averaging [71]. For this reason we preferred the macro-average approach.
To calculate precision, recall and F rates we first computed each of these measures separately for each category $q$ using expressions (3)-(5):

$$\mathrm{Precision}_{micro_q} = \frac{tp_q}{tp_q + fp_q} \qquad (3)$$

$$\mathrm{Recall}_{micro_q} = \frac{tp_q}{tp_q + fn_q} \qquad (4)$$

$$F_{micro_q} = 2 \cdot \frac{\mathrm{Precision}_{micro_q} \cdot \mathrm{Recall}_{micro_q}}{\mathrm{Precision}_{micro_q} + \mathrm{Recall}_{micro_q}} \qquad (5)$$

These metrics were then averaged by category using expression (2) to produce the macro-averaged metrics.
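The computation can be sketched as follows: per-category precision, recall and F are obtained first and then averaged with equal weight per category. The sketch assumes that F is the harmonic mean of precision and recall and that a metric with a zero denominator counts as 0; neither convention is stated explicitly in the text, and the labels in the toy example are hypothetical.

```python
def per_class_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn


def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall and F over the labels in y_true."""
    labels = sorted(set(y_true))
    precisions, recalls, fs = [], [], []
    for label in labels:
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        fs.append(f)
    k = len(labels)  # every category has the same weight
    return sum(precisions) / k, sum(recalls) / k, sum(fs) / k


# Toy example with two hypothetical categories
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
precision, recall, f_measure = macro_metrics(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f_measure, 4))
```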

C. NUMERICAL RESULTS
We performed cross-validation in different dataset splits of training and testing subsets (in all cases the first and second percentages correspond to training and testing subset sizes, respectively): 30%-70%, 40%-60%, 60%-40% and 70%-30%. The purpose was to check the robustness of our system when less training data was available. In each experiment we extracted the lexica of the set as explained in Section III-B1. Table 4 shows the distributions of words in the lexica for all categories before applying the similarity detector. We added features incrementally to the model to assess their significance. Therefore, we first used only word n-grams and lexica, then we added BT amount and date, and finally character n-gram features.
Given the target sector (finance), precision may be more important than recall. This is because banking campaigns prefer to obtain fewer positives for key categories. By doing so, they maximize the probability that customers will be receptive to personalized products. We compared our system with three competitor approaches: All-In-1 [72] and two variants of the method by IITP (Indian Institute of Technology Patna) [73]. These approaches analyzed customer feedback to manufacturers, which also consisted of short texts, although with more elaborate sentences than BT descriptions. Note that no other researchers have considered BT to date. For the sake of fairness, we applied the Jaccard distance detector stage to the competitors as well.
The All-In-1 approach in [72] is based on a classic SVM classifier that takes character n-grams and monolingual word embeddings as input. Logically we only used the monolingual version.
The two IITP variants [73] are based on CNNs. The second variant combines a CNN with an RNN. Specifically, a convolutional feature extractor is applied to the input, a recurrent network is applied to the CNN output, an optional fully connected layer is applied to the RNN output, and finally a softmax layer delivers the result.
Tables 5 and 6 show average elapsed training and testing times for our system and its competitors for the different splits and selections of features, obtained with cross-validation (five different dataset samplings in each experiment). For our system, the values of n in character and word n-grams were adjusted to 3-5 and 1-4 respectively.
Note that, even though testing times were comparable, the training times of the competitors were significantly higher. This is due to their greater computational complexity. Specifically, the SVM classifier of All-In-1 uses word embeddings and the two variants of IITP are based on CNNs.
Table 7 shows the average training set reductions that the similarity detector achieved for the different splits. Note that they exceeded 56% in all cases.

1) 30%-70% split
In each experiment the training dataset had on average 4,031 entries (after similarity detection) and the testing dataset had 21,590 entries.
Tables 8 and 9 show the results of BT classification. Note that we did not modify the design or the implementation of the selected competitors. Thus, for a fair comparison, only word and character n-gram features were enabled in Table 8. Our system outperformed the competitors in terms of precision and All-In-1 was the best option in recall and F-measure.
In Table 9 we observe that, after activating the lexicon feature, the precision, recall and F of our system increased by about 15%, 38% and 35%, respectively, so the lexicon feature was crucial. Another key result is that meta-information features yielded a precision increase of 8% in our system. After activating all features, precision, recall and F further improved by around 3%, 19% and 13%, respectively.
Our system attained the best precision, but only attained a better F than All-In-1 when all features were activated. On the other hand, All-In-1 was better in recall, but the difference with our system in that regard was only about 3%. Note, however, that regardless of the fact that precision is more important in our scenario, our system is simpler than its competitors (based, depending on the case, on CNN, CNN+RNN or a SVM with word embeddings), especially in terms of training time, as shown in Tables 5 and 6.

2) 40%-60% split
In this case, in each experiment the training dataset had on average 4,849 annotated entries (after similarity detection) and the testing dataset had 18,506 entries.
Tables 10 and 11 show the results. In this case our improvement in precision with the basic features was about 2% compared to the competitors. The precision, recall and F of our system increased by about 17%, 37% and 34%, respectively, after activating the lexica feature. By adding meta-information, the improvements were 7%, 3% and 5%, respectively. Total precision, recall and F improved compared to the baseline (word n-grams) by about 2%, 18% and 12%, respectively, after activating all the features.
We again attained the best precision performance, outperforming the best competitor by almost 4% when all features were activated.

3) 60%-40% split
The training dataset had on average 6,209 annotated entries (after similarity detection) and the testing dataset was composed of 12,338 entries.
Tables 12 and 13 summarize the results. Table 12 shows that our system had the best precision performance when both word and character n-gram features were enabled, but All-In-1 was still the best alternative in terms of recall and F for the basic combinations of features. In this case, the precision, recall and F metrics of our system improved versus the baseline by about 27%, 58% and 50%, respectively, after activating all the features. We again attained the best precision, outperforming the best competitor by almost 2% when all features were activated. Our system ranked second in recall after All-In-1 by a narrow margin of about 3%, again when all features were activated. We remark again that we are not using semantic information from word embeddings.

4) 70%-30% split
The training dataset had on average 6,780 annotated entries (after similarity detection) and the testing dataset had 9,253 entries.
Tables 14 and 15 show the results. With the basic features all systems achieved similar results (Table 14). In this case, the precision, recall and F of our system improved by about 28%, 57% and 49%, respectively, after activating all the features. We again attained the best precision and almost matched All-In-1 in terms of recall and F when all features were activated.
Table 16 shows the performance of our system by BT category when all the features are enabled. In general the performance was satisfactory. The worst performance corresponded to the categories with fewer entries in the training set (according to Table 2).

D. SUMMARY OF NUMERICAL RESULTS
To evaluate the performance of our system we applied cross-validation in four dataset splits between training and testing subsets (30%-70%, 40%-60%, 60%-40% and 70%-30%), to check the robustness of our approach as the sizes of the testing subsets decreased. In these experiments we added features to the model incrementally to assess their significance, in the following order: word n-grams, lexica, amount, date and character n-grams. We compared our system with three competing approaches from the state of the art: All-In-1 and two variants of the IITP method. The Jaccard similarity detector achieved reductions of training data exceeding 56% for all splits.
For the 30%-70% split, our system attained the best precision. It was inferior to All-In-1 in recall and F unless all features were enabled. If they were, our system also outperformed its competitors in F. For the 40%-60% split, our system outperformed the competitors in terms of precision, recall and F when all features were enabled. It was better in precision even with the basic combination of features. For the 60%-40% and 70%-30% splits, our system again outperformed the competitors in terms of precision, and the performance gap with All-In-1, in the cases where it existed, was reduced. Moreover, our approach is simpler than the competitors, which allowed a significant reduction in training time.

V. USE CASE: COINSCRAP
CoinScrap launched its mobile app for iOS and Android in November 2016, and since then it has had thousands of downloads. A new version of the application was launched in October 2018. It includes journey improvement for product fulfilment; dynamic "gamified" saving rules (e.g. saving when your favourite team wins, or when you have a coffee); and personalised recommendations for financial management.
The latter rely on our system to classify BTs. In this line, CoinScrap recommends personalized services and products based on financial necessities and objectives. Figure 4 shows a screenshot of the app.

VI. CONCLUSIONS
Compared to normal texts, short-text analysis is challenging due to sparsity, irregularity and real-time data generation. In this paper we describe a short-text SVM BT classification system using a combination of meta-information and linguistic knowledge (by relying on specialized lexica).
Motivated by existing solutions in spam detection, we achieved a significant reduction of training information with a short text similarity detector based on the Jaccard distance.
Experimental results, obtained by comparing our approach with three state-of-the-art competitors with higher computational complexity, are very promising. Our lexicon feature is crucial to attain high precision, especially if the training dataset is small. The effectiveness of the proposed system was demonstrated on a real dataset reflecting the activity of real customers of Spanish banks, organized in fifteen different classes including means of transport, shopping, household expenses, taxes, charges and payroll. This labelled dataset is a valuable asset that will be available to other researchers on request.
Our system attained the best precision (which is the most relevant metric in PFM) and performed similarly in terms of recall and F if enough features were enabled, especially when the methods were stressed by reducing the training-to-test subset size ratio.
Given the encouraging results in this work, we are currently expanding it to obtain sub-categorisations of the descriptions. Our approach has been put into production in a real PFM application, CoinScrap.

FIGURE 2: Flow diagram of the system with Jaccard similarity detector.
TABLE 1: Some examples of stopwords.
TABLE 2: Distribution of categories in the labelled dataset.
TABLE 4: Average word distribution in the lexica for the different training-testing splits before applying the similarity filter.
TABLE 5: Elapsed training and testing times of our system for different dataset splits.
Word n-grams: 2.20 ± 0.40
Word n-grams + char n-grams: 5.00 ± 1.55
Word n-grams + lexica: 2.20 ± 0.40
Word n-grams + lexica + amount + date: 2.20 ± 0.40
Word n-grams + lexica + amount + date + char n-grams: 6.20 ± 1.47
TABLE 6: Elapsed training and testing times of the competitor systems for different dataset splits.
Word n-grams: 2.20 ± 0.40
Word n-grams + char n-grams: 5.00 ± 1.55
All-In-1, word n-grams: 1.83 ± 0.08
All-In-1, word n-grams + char n-grams: 30.04 ± 1.66
TABLE 7: Training sample reduction for different dataset splits.
TABLE 9: Average evaluation metrics of the proposed system for all combinations of features, 30%-70% split.
TABLE 10: Average evaluation metrics for the basic combinations of features, 40%-60% split.
TABLE 11: Average evaluation metrics of the proposed system for all combinations of features, 40%-60% split.
TABLE 12: Average evaluation metrics for the basic combinations of features, 60%-40% split.
TABLE 13: Average evaluation metrics of the proposed system for all combinations of features, 60%-40% split.
TABLE 14: Average evaluation metrics for the basic combinations of features, 70%-30% split.
TABLE 15: Average evaluation metrics of the proposed system for all combinations of features, 70%-30% split.
TABLE 16: Performance of our system by category with all features enabled, 70%-30% split.