Ensemble Methods for Instance-Based Arabic Language Authorship Attribution

Authorship Attribution (AA) is a subfield of authorship analysis and an important problem, as the amount of anonymous information has increased with the fast-growing use of the Internet worldwide. In other languages such as English, Spanish, and Chinese, the problem is quite well studied. In the Arabic language, however, AA has received less attention from the research community due to the complexity and nature of Arabic sentences. This paper presents an intensive review of previous studies on the Arabic language. Based on that review, this study employs the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) method to choose the base classifier of the ensemble methods. In terms of attribution features, hundreds of stylometric features and distinct words were extracted using several tools. Then, the AdaBoost and Bagging ensemble methods were applied to an Arabic enquiries (fatwa) dataset. The findings show an improvement in the effectiveness of the authorship attribution task in the Arabic language.


I. INTRODUCTION
From a linguistic analysis perspective, authorship attribution (AA) aims to identify the original author of an unseen text. The idea is basically formulated as follows: for each author, there is a set of features that distinguishes his writing style from others. Although an author's writing style can change from topic to topic, some persistent, uncontrolled habits and writing traits remain stable over time. The author of an anonymous text can then be recognized by matching the observed writing style to one of a set of candidate authors. Since the 19th century, several approaches have been proposed to tackle the AA problem. The early approaches had a statistical background [1][2][3][4], where the length and frequency of words, characters, and sentences were used to characterize the writing style. These approaches, in general, were human expert-based [5], and the applications covered literary, religious, and legal texts [6]. From the 1960s until the 1990s, both the approaches and the applications shifted to cover new challenging problems such as source code attribution [7][8][9], spam detection [10,11], and plagiarism [12][13][14][15]. The approaches of that time aimed to quantify the writing style by extracting features from the text. Although statistical approaches are good at identifying the author of long documents, they suffer when the text under investigation is short. The main challenges in such cases include: are the few extracted features sufficient to make a fair attribution? How can the precision of authorship attribution be improved? Does the size of the training set affect the result? What happens if the dataset is unbalanced? What is the optimum data size? Recently, studies in authorship attribution have benefited from the explosion of the machine learning domain [16], where the AA task can be considered a multi-class, single-label classification problem [17].
Basically, the machine-learning approach tackles the AA problem by assigning class labels to text samples. Surveying the literature, we found a large number of methods and approaches developed to tackle the AA problem, such as Support Vector Machines (SVM) [18][19][20][21][22][23], Naive Bayes [4,20,24,25], Bayesian classifiers [25][26][27], k-nearest neighbor [28,29], and decision trees [29,35]. Although ensemble methods have shown good performance in improving machine learning results, few studies, such as [30][31][32][33][34], have employed them in the AA area. Ensemble methods combine several classifiers in order to decrease variance (bagging) and bias (boosting); new data are then classified by taking a (weighted) vote of the classifiers' predictions. Arabic is the mother tongue of more than 250 million people residing mainly on two continents. However, the works on AA for Arabic are still far less numerous than those for English [5,23,35-45]. Thus, this paper aims to bridge the gap and investigates whether applying ensemble methods improves the accuracy of the AA task in the Arabic language, in addition to selecting the base classifier for the ensemble methods and the optimal combination of features. Furthermore, since appropriate tuning of the training set size and feature set can render machine-learning processing significantly lighter [17], this paper gives some recommendations for selecting the dataset size that maximizes the accuracy of the classifiers. The rest of the article is structured as follows: Section 2 presents the related studies on authorship attribution; it also reviews the studies on Arabic Language Authorship Attribution (ALAA), from which a set of base classifiers was chosen. Section 3 presents the experimental setup, the datasets used, and the techniques employed. The results and their discussion are given in Section 4. Finally, we conclude the study in Section 5.

II. RELATED STUDIES
While AA can be considered a particular type of authorship analysis, ensemble methods are a well-known approach in machine learning in which the outputs of a set of classifiers are fused in some way to obtain better decisions [47]. In this section, we briefly describe what authorship attribution is, the features used, and the typical machine-learning based attribution process. Then, we present some techniques for improving the classification accuracy on class-imbalanced data. In addition, a review of Arabic Language Authorship Attribution (ALAA) is presented.

A. AUTHORSHIP ATTRIBUTION
As stated earlier, authorship attribution can be considered a subfield of authorship analysis. It is about identifying the author(s) of an anonymous text document based on the document's characteristics or features. In the literature, such characteristics or features are known as the author's writing style or stylo-features [25]. These features are extracted in different ways depending on how the AA algorithm covers the training samples. In general, these ways fall into two major groups: profile-based and instance-based approaches [16]. The former group extracts stylo-features after concatenating all the samples of a particular author within the training set into one big file, whereas the latter group handles each sample in the training corpus of each author separately and consequently extracts the writing-style features from each document (see Fig. 1). The profile-based approaches are able to catch the most persistent and uncontrolled habits in an author's writing style, whilst the instance-based approaches are able to detect variation in the writing style. Thus, a combination of both ways is a practical instrument for improving the accuracy of the attribution process.
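The two ways of organizing the training data can be sketched as follows; the corpus dictionary and author names here are hypothetical placeholders, not the paper's actual data:

```python
# Hypothetical toy corpus: author -> list of that author's training documents
corpus = {
    "author1": ["first text by author1", "second text by author1"],
    "author2": ["text one by author2", "text two", "text three"],
}

def profile_based(corpus):
    # Profile-based: concatenate all of an author's samples into one big "profile"
    return {author: " ".join(docs) for author, docs in corpus.items()}

def instance_based(corpus):
    # Instance-based: keep every document as a separate labeled training instance
    return [(doc, author) for author, docs in corpus.items() for doc in docs]

profiles = profile_based(corpus)    # one profile text per author
instances = instance_based(corpus)  # one (text, label) pair per document
```

Profile-based learning thus yields as many training texts as there are authors, while instance-based learning preserves per-document variation.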

1) AUTHORSHIP ATTRIBUTION PROCESS
Typically, authorship attribution goes through two main stages: feature acquisition and attribution model construction. Feature acquisition is the process in which the author's writing style is extracted, regardless of the way the training text corpus is handled. The earliest attempts to handle stylo-features go back to the 19th century. Most of these methods were statistical in nature, with researchers trying to quantify the writing style. However, with the emergence of the Internet, a vast amount of electronic text was produced, and the need for handling these texts has increased. In the shadow of these needs, domains such as machine learning, natural language processing, and information retrieval have had an impact on guiding authorship attribution research directions.
Going back to the earlier era of authorship attribution, we can classify the features used in the attribution stage into two main classes, unitary invariant and multivariate analysis, both of which are classified as human expert-based approaches. The unitary invariant class uses only a single feature, such as word length, word frequency, or sentence length, to distinguish between authors; these methods gave unreliable results. The multivariate analysis methods, in contrast, deal with a set of features to statistically attribute texts. Methods such as Bayesian statistical analysis [4], principal component analysis (PCA) [49], linear discriminant analysis (LDA) [50], and distance-based methods [25,51-54] are used to attribute texts. Attribution model construction aims to build an adequate model that can classify anonymous texts and match them to the right author. With the development of machine-learning techniques, the accuracy of attribution models has been noticeably enhanced [16]. Machine learning is a branch of artificial intelligence concerned with computer systems that learn directly from examples, data, and experience. Learning methods can be categorized into two groups: supervised and unsupervised methods. In supervised methods, the dataset is divided into two sets: a training set and a testing set. The former is used to teach classifiers how to predict class labels, whilst the data outside the training set (the testing set) is used to evaluate how well the model does. Classification and regression analysis are the common supervised learning tasks. Unsupervised methods are a type of learning method used to find patterns in data; they require neither splitting nor labeling the data. Data visualization and clustering are classified as unsupervised learning methods.
The goal of applying machine-learning methods to the AA task is to build a vector of features extracted from the training text corpus and then build a classifier that can attribute the anonymous texts in the testing corpus. Figure 2 shows a typical machine-learning based authorship attribution process.
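As a rough illustration of this pipeline (not the paper's actual WEKA setup), the sketch below trains a linear SVM on character n-gram features with scikit-learn; the tiny corpus and author labels are invented for demonstration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented training corpus: each text labeled with a hypothetical author
train_texts = [
    "indeed the matter is settled, indeed it is so",
    "indeed we proceed, indeed without delay",
    "perhaps the matter may be otherwise, perhaps not",
    "perhaps we shall see, perhaps tomorrow",
]
train_labels = ["A", "A", "B", "B"]

# Feature acquisition: character 2-3-grams as the stylometric feature vector
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
X_train = vectorizer.fit_transform(train_texts)

# Attribution model construction: train the classifier on the feature vectors
clf = LinearSVC()
clf.fit(X_train, train_labels)

# Attribute an unseen text by mapping it into the same feature space
pred = clf.predict(vectorizer.transform(["indeed the case is clear"]))[0]
train_accuracy = float((clf.predict(X_train) == train_labels).mean())
```

Any of the classifier families surveyed above (NB, k-NN, decision trees) could be substituted for the SVM in the final step without changing the overall process.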

2) AUTHORSHIP ATTRIBUTION FEATURES
As stated earlier, the authorship attribution process begins with building a vector of features elicited from the text under consideration. The aim of this step is to extract "writing style" features, which are internal characteristics of the text. Surveying authorship attribution studies, these features can be categorized into: lexical, character, syntactic, semantic, content-specific, structural, and language-specific [35,47,16].
• Lexical features are among the most common features used to attribute authorship [5]. Such features can be extracted from a text by tokenizing it into a list of words, sentences, numbers, and even punctuation marks. Indeed, when lexical features are applied, the AA results depend on the ability of the tokenizer to detect the boundaries of words and sentences.
• Character features can be considered a subset of lexical features in which the text content is treated as a sequence of characters. Character features are partially language-dependent, which means features such as uppercase and lowercase characters cannot be counted in, e.g., Arabic.
• Syntactic: from one text to another, an author may tend to use similar syntactic patterns unconsciously. These patterns can be a more reliable authorial fingerprint than lexical features; however, they require a specific parser to analyze the text. The most common syntactic measure is part-of-speech (POS) [16].
• Semantic: in contrast to the aforementioned features, extracting semantic features is a high-level natural language processing task. Surveying the literature, only a few attempts address semantic features.
• Application-specific: these features can be structural, content-specific, or language-specific. An author's signature, font colors, and font size are obvious structural features used for attributing authorship [55]. Content-specific features can be extracted from the available texts only if all the authors in the corpus write on the same topic. Language-specific features are also common in attributing authorship; however, to measure them, they have to be defined manually.
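As a small illustration of the lexical and character feature families (the feature lists used later in this paper are much larger), the following sketch computes a few common stylometric measures on an English example; a tokenizer suited to the target language would be substituted in practice:

```python
import re
from collections import Counter

def lexical_features(text):
    # Tokenize into words and sentences; AA quality hinges on boundary detection
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_len": sum(map(len, words)) / len(words),
        "type_token_ratio": len(set(words)) / len(words),  # vocabulary richness
        "avg_sentence_len": len(words) / len(sentences),
        "punct_freq": sum(ch in ".,;:!?" for ch in text) / len(text),
    }

def char_ngrams(text, n=3):
    # Character features: treat the text as a raw sequence of characters
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

feats = lexical_features("The cat sat. The cat ran!")
trigrams = char_ngrams("The cat sat. The cat ran!")
```

Each measure becomes one entry in the numerical feature vector fed to the classifier.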

B. ENSEMBLE LEARNING
Improving the accuracy of a classification model is a critical task. One way to do so is by fusing the outputs of a set of classifiers, which is known in the data mining domain as "ensemble methods". It is obvious that classifiers vary in their accuracy and that some of them perform better than others in some cases. Thus, combining them tends to be more accurate than working with each classifier separately. Ensemble methods are a type of learning algorithm that combines a set of classifiers and then uses a (weighted) vote of their predictions to classify new data points. The current section highlights some aspects of ensemble methods and gives a brief introduction to the most common ones: bagging, boosting, and random forests.

1) ENSEMBLE METHODS
As stated earlier, an ensemble combines a set of "base classifiers". The ensemble applies, e.g., majority voting over the class labels output by the individual classifiers and returns the class in the majority. Although any single classifier may make a mistake, the ensemble misclassifies only if over half of the base classifiers are in error; thus, an ensemble is typically more accurate than its base classifiers [56].
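This majority-vote argument can be made concrete: if the n base classifiers err independently with probability ε, the ensemble errs only when more than half of them do. A minimal sketch (the specific numbers are illustrative):

```python
from math import comb

def ensemble_error(eps, n):
    # Probability that a strict majority of n independent base classifiers err
    k_min = n // 2 + 1
    return sum(comb(n, k) * eps**k * (1 - eps) ** (n - k)
               for k in range(k_min, n + 1))

better = ensemble_error(0.35, 25)  # base error < 0.5: ensemble error drops sharply
worse = ensemble_error(0.60, 25)   # base error > 0.5: voting makes things worse
```

The independence assumption is the crux: the gain holds only when the base classifiers are diverse and each is better than random guessing.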

2) SELECTION OF THE BASE CLASSIFIER OF ENSEMBLE METHODS
The diversity of existing machine learning classifiers that one can select as a base/weak classifier of an ensemble method makes such a selection a challenging task. Zhou et al. [77] proposed a genetic algorithm-based selective ensemble approach aimed at selecting the appropriate classifiers for composing an ensemble from a set of available ones. However, like any optimization-based approach, falling into a local optimum is probable. Hence, researchers have proposed other approaches. Lazarevic and Obradovic proposed a clustering-based approach [78] that uses k-means to identify groups of similar classifiers and then eliminates the redundant classifiers within each cluster. A similar approach is found in [79], where a hierarchical agglomerative clustering algorithm is used. However, empirical analysis shows that clustering-based selective ensemble techniques can perform poorly [80]. A ranking-based method was proposed in [81]; the results showed an improvement in the performance of the ensemble, but ranking-based techniques are time-consuming and require a large amount of storage. To this end, selecting the right base classifier plays a vital role in minimizing the total misclassification error as well as the cost of training. The selection of the base classifier can be guided by several factors: classification accuracy, the ability of the base classifier to deal with high-dimensional data and its performance as the dataset size increases, and sensitivity to noisy data. Decision trees, in particular C4.5, are considered robust learners against noisy data, whereas the support vector machine (SVM) is more noise-sensitive [82]. Sáez et al. [82] showed that SVM performs better than C4.5 without noise; however, the situation is reversed when noisy data are added.
The average performance of C4.5 is better, which indicates that the C4.5 method globally behaves better with noisy data. In terms of sensitivity to increasing dataset size, the SVM shows notable robustness compared with C4.5. Nikam [83] provided a comparative study of many classification methods, including k-NN, NB, and artificial neural networks. In conclusion, the k-NN classifier sometimes shows robustness with regard to noisy data; however, its performance is significantly influenced by the number of dimensions used as well as the dataset size and number of records. NB also shows great computational efficiency and classification rate when the dataset size is increased.

3) ENSEMBLE WITH IMBALANCED DATA SETS
To deal with the imbalanced dataset problem, there are four general methods: oversampling, under-sampling, threshold moving, and ensemble techniques. The first three do not change the construction of the classification model. Oversampling and under-sampling change only the distribution of the data in the training sets, whereas threshold moving affects the final stage of deciding how to classify new data. The ensemble methods can apply, as stated earlier, bagging, boosting, and random forests to build a composite model. In the case of imbalanced data, oversampling increases the number of positive (minority) tuples in the training set until the numbers of positive and negative tuples are equal; on the contrary, under-sampling decreases the number of negative tuples until the two are equal. The threshold moving technique does not involve any sampling; the classification decision is returned based on the output values. In its simplest form, tuples whose output value satisfies a minimum threshold are considered positive, whilst the others are negative.
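Threshold moving can be sketched as follows; the scores and labels are invented, with class 1 as the minority (positive) class:

```python
import numpy as np

# Invented classifier output scores and ground truth (1 = minority/positive class)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
scores = np.array([0.45, 0.62, 0.38, 0.30, 0.20, 0.15, 0.41, 0.10, 0.05, 0.25])

def classify(scores, threshold):
    # Tuples whose output value meets the threshold are considered positive
    return (scores >= threshold).astype(int)

def recall(y_true, y_pred):
    positives = y_true == 1
    return float((y_pred[positives] == 1).mean())

r_default = recall(y_true, classify(scores, 0.50))  # default cut-off
r_moved = recall(y_true, classify(scores, 0.35))    # threshold moved toward minority
```

Lowering the threshold recovers the minority class at the cost of extra false positives; no resampling or model retraining is involved.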

C. ARABIC AUTHORSHIP ATTRIBUTION
The authorship attribution problem in languages such as English, Spanish, and Chinese is quite well studied. However, authorship attribution for Arabic texts has received much less attention [45]. In this section, we present some issues that have a direct impact on AA in the context of Arabic, and we highlight some challenges that complicate researchers' work in Arabic. Next, we present a deeper review of recent works on Arabic authorship attribution, covering the period from 2005 to 2018.

1) ARABIC CHARACTERISTICS
From a morphological point of view, Arabic is a very rich language. The nature and structure of Arabic words make it a highly derivational and inflectional language [46]. In addition, the compound structure of Arabic words adds further complexity and challenges, especially for the machine translation task, where such words should syntactically be regarded as phrases rather than single words. The orientation of writing in Arabic is, as is well known, right-to-left, and the letters are connected to each other, which makes Arabic writing differ distinctly from Latin-based languages such as English and French. Arabic has a quite small set of productive prefixes and suffixes; however, the number of possible produced words is very high. In many cases, it is enough to change a letter's position or its diacritic to produce a new word. Since inflection and diacritics increase the number of words, stylometric features such as vocabulary richness measures might be influenced [47].

2) CHALLENGES IN ARABIC CONTEXT
Arabic is a very rich and challenging language. As stated above, it is a highly derivational and inflectional language [46]. Due to that, several challenges have to be dealt with before working on the authorship attribution task: diacritics, morphological characteristics, the structure and orientation of writing, elongation, word length, and word meaning [57].
• Diacritics are special marks placed above or below letters. They play an essential role in representing short vowels and can change a word's meaning and pronunciation.
• Morphological characteristics: one of the distinguishing features of Arabic is the number of words produced from a common root. This process is known as inflection, where a word is derived by adding affixes (prefixes, infixes, and suffixes) [5]. Arabic word forms, in general, are grouped into four groups: word, morpheme, root, and stem [58].
• Structure and orientation of writing: Arabic sentences are written right to left, there are no upper-case letters, and the shape of a letter changes based on its position in the word.
• Elongation: to emphasize a feeling or meaning, special dashes are inserted between two letters. These dashes also play a stylistic role.
• Word length and meaning: an Arabic word can have a triliteral, quadriliteral, pentaliteral, or hexaliteral root. Moreover, a single letter might play the role of a word, and a word might have several different meanings depending on the context [57].
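A minimal sketch of handling the diacritics and elongation challenges during preprocessing; the Unicode range below covers the common short-vowel marks, and the example word is illustrative:

```python
import re

DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan ... sukun (short-vowel marks)
TATWEEL = "\u0640"                          # the elongation (kashida) character

def count_elongation(text):
    # The elongation count itself can be kept as a stylistic feature
    return text.count(TATWEEL)

def normalize(text):
    # Strip diacritics, then the elongation dashes
    return DIACRITICS.sub("", text).replace(TATWEEL, "")

# "kataba" written with fatha diacritics and three kashidas inside
word = "\u0643\u064E\u0640\u0640\u0640\u062A\u064E\u0628\u064E"
n_elongation = count_elongation(word)
clean = normalize(word)  # bare consonant skeleton
```

Counting elongation before stripping it mirrors the idea, discussed below, of turning the removed characters into an attribution feature rather than discarding them outright.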

D. MACHINE LEARNING METHODS IN ARABIC AUTHORSHIP ATTRIBUTION
In the context of authorship attribution, various methods for attributing Arabic texts have been used. Abbasi and Chen [47] were the first to address authorship attribution in the Arabic context. Support vector machines (SVM) and C4.5 decision trees were applied to Arabic web forum messages. To cope with the elongation challenge, they proposed a filter that removes elongation from the text; however, the number of elongation characters is calculated and used later as a feature. In [35], Abbasi and Chen repeated the experiment with the same machine learning methods (SVM and C4.5) on Arabic web forum messages; however, the word roots were extracted by de Roeck and Al-Fares's algorithm [59]. Stamatatos [37] proposed an SVM-based model for solving the class imbalance problem; the dataset was collected from Alhayat newspaper reports. Ellen and Parameswaran [60] applied k-NN with cosine distance and SVM with two kernel functions to classify 2636 Arabic-language forum posts from 9 different website forums. Ouamour and Sayoud [39,40,69] used the SMO-SVM, linear regression (LR), and multilayer perceptron (MLP) methods for attributing the authors of very old Arabic texts. Features such as character n-grams and word n-grams were used as input. The best precision they reached was 80%.
Alam and Kumar [61] also used the SVM method to identify the authors of Arabic articles. Several stylometric features were extracted, and they followed the method adopted by Abbasi and Chen [35] to conduct the experiments. The best accuracy obtained was 98%, achieved when the SVM was applied with all feature combinations.
Alwajeeh et al. [42] used Naive Bayes (NB) and SVM classifiers for automatically attributing Arabic articles. The dataset was collected and labeled manually. Through the experiment, the authors examined the effect of stop words and stemming. The findings were interesting: whilst applying the Khoja stemmer was expected to enhance the performance of the classifiers, the accuracies degraded. In addition, the SVM classifier outperformed NB on most subsets. The best accuracy obtained was 99.8%. Howedi and Mohd [62] investigated the effectiveness of NB and SVM classifiers in attributing short historical Arabic texts written by 10 different authors. In contrast to the findings in [42], NB exceeded SVM in terms of accuracy. In addition, the character-based features gave better results than the word-based features. Among the character-based features, the punctuation marks showed a significant improvement in the performance of the classifiers: the accuracies increased from 67.5% to 74.99%. Otoom et al. [63] introduced a hybrid approach based on 27 stylometric features. An ensemble classifier consisting of many decision trees, as well as MultiBoostAB, NB, SVM, and BayesNet classifiers, was employed on a dataset of 456 Arabic newspaper instances. The best accuracy was 88%, achieved by the MultiBoostAB classifier with the hold-out test, and 82% with the cross-validation test.
Sayoud [64] addressed the problem of authorship discrimination. For this purpose, the Quran and the Prophet's statements were used, and the SMO-SVM, linear regression (LR), and multi-layer perceptron (MLP) classifiers were employed. All classifiers proved able to discriminate the author of the text under consideration with 100% accuracy.
Al-Falahi et al. [65] applied a Markov chain classifier to Arabic poetry by 33 different poets belonging to the same era. The feature set used in [65] includes content-specific features such as the metre of the poem and the rhyme. In the testing phase, the features were partitioned into different sets as follows:
• set1: five single features (F1: character features, F2: word length, F3: sentence length, F4: first word in sentence, F5: rhyme)
• set2: character features + word length
• set3: character features + word length + sentence length
• set4: character features + word length + sentence length + first word in sentence
• set5: character features + word length + sentence length + first word in sentence + rhyme
The best accuracy obtained was 96.7%. They also repeated the experiment applying NB, SVM, and SMO [23]. The feature set consisted of the features used in [65] plus the metre of the Arabic poetry, following the same methodology as in [65]. The best average accuracy they obtained was 72.83%, reached when the set of all features was used and SMO was applied.
Bourib and Khennouf [66] addressed the authorship attribution problem when the genre and topic are quite similar. The text size in the training set varied from 100 to 3000 words per text. Character n-grams and words were employed as features, and SMO-SVM, MLP, and LR were used. The findings show that the performance of the classifiers depends mainly on the text size, on the one hand, and on the features used and the classification techniques themselves, on the other.
Social media posts have also been considered. Rabab'ah et al. [67] investigated the performance of authorship attribution classifiers on tweets written in Arabic. The feature set consisted of 57 morphological features (MF), most of which are POS-based, and 340 stylometric features (SF). NB, SVM, and decision trees were used. The highest accuracy was 68.67%, achieved by applying the SVM classifier to the combined feature sets. In [45], they extended the experiment to include features extracted by the bag-of-words approach, and several reduction techniques were used. The findings show that the SVM classifier outperforms all the other methods in terms of accuracy, and the SubEval feature selection technique reduced the classifier's running time. The work in [64] was extended in [68] by fusing two approaches: feature-based decision fusion, which combines three different features, namely character tetra-grams, words, and word bigrams; and classifier-based decision fusion, which fuses the Manhattan centroid, SMO-SVM, and MLP classifiers.
Finally, AL-Sarem and Emarra [44] addressed the attribution problem in the context of modern Islamic fatwas. In terms of attribution classifiers, the locally weighted learning (LWL) classifier, the C4.5 decision tree, and Random Forest (RF) were used. The feature set used in [44] consists of 10 stylometric features. Similar to the work of Al-Ayyoub [45], they investigated the effect of feature selection techniques on the performance of the classifiers; SubEval, GainRatioEval, and PCA were used. The findings show that applying the C4.5 method with the SubEval technique gives the best accuracy, 51.70%.

III. MATERIALS AND METHODS
At the end of the previous section, we saw that different classifiers have been applied to solve the authorship attribution problem. The SVM with a linear kernel (LinearSVM), the SMO optimizer for SVM (SMO-SVM), and naïve Bayes (NB) are the most commonly used classifiers. Investigating the performance of all the classifiers mentioned earlier would be time-consuming and labor-intensive; instead, we propose to use the Analytic Hierarchy Process (AHP)-weighted TOPSIS method to prioritize the classifiers, while taking care to avoid topic-oriented biases. This section is organized as follows: first, we describe the method used to select the base classifier of the ensemble model. Then, we test the effect of ensemble techniques on Arabic authorship attribution using the best TOPSIS alternative. In addition, the corpus used, the main phases of authorship attribution, and the experimental evaluation are described in detail.

1) AHP-WEIGHTED TOPSIS METHOD
The Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) orders alternatives by their similarity to a so-called ideal solution. TOPSIS is a widely used technique for scoring, ranking, and choosing the best alternative; its ability to handle both subjective and objective attributes makes it one of the most used multi-attribute decision-making methods. In our setting, the AHP is used to choose the weights for each attribute. To employ the TOPSIS method (see Fig. 5), the following steps are followed:
(i) Determine attributes and alternatives. To make our TOPSIS model more reliable with respect to selecting authorship attribution classifiers, we use the following attributes: the average accuracy A of each classifier as stated in published papers (shown in Table II), a commonness indicator C, a high-dimensionality indicator D, a performance-sensitivity indicator P, and the sensitivity to noisy data S.
(iii) Assign weights to attributes. Following the Saaty scale [70], the importance of the attributes is assigned by making pair-wise comparisons, which might suffer from subjective opinion; thus, we invited three experts to assign the weights of the attributes. The relative importance matrix is produced by following the algorithm stated in [72], and its principal eigenvalue is computed. Regarding the alternatives listed earlier, the average accuracy A, the commonness indicator C, the high-dimensionality indicator D, and the performance sensitivity P are taken as entries of the positive ideal solution, whereas the sensitivity to noisy data S is an entry of the negative ideal solution.

(vii) Calculate the Euclidean distance
The Euclidean distance measures how far each alternative is from the ideal solutions. For an alternative i with weighted normalized values v_ij, the distances to the positive and negative ideal solutions are D_i^+ = sqrt(Σ_j (v_ij − v_j^+)^2) and D_i^− = sqrt(Σ_j (v_ij − v_j^−)^2), and the closeness score is C_i = D_i^− / (D_i^+ + D_i^−). The alternative with the highest closeness score is considered the best-preferred alternative. In our case, the SMO classifier turns out to be the best-preferred classifier among those considered in this work, followed by the SVM and Naive Bayes classifiers.
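The full TOPSIS computation can be sketched end-to-end as follows; the decision matrix, weights, and attribute directions below are illustrative placeholders, not the actual expert-elicited values used in this study:

```python
import numpy as np

# Rows = candidate classifiers, columns = attributes (A, C, D, P, S).
# Values and weights are invented for illustration only.
X = np.array([
    [0.90, 5.0, 5.0, 4.0, 2.0],   # e.g. SMO-SVM
    [0.88, 4.0, 4.0, 3.0, 3.0],   # e.g. SVM
    [0.80, 3.0, 2.0, 2.0, 4.0],   # e.g. Naive Bayes
])
w = np.array([0.35, 0.15, 0.20, 0.15, 0.15])         # AHP-style weights (sum to 1)
benefit = np.array([True, True, True, True, False])  # S (noise sensitivity) is a cost

R = X / np.sqrt((X**2).sum(axis=0))                      # vector normalization
V = R * w                                                # weighted normalized matrix
v_pos = np.where(benefit, V.max(axis=0), V.min(axis=0))  # positive ideal solution
v_neg = np.where(benefit, V.min(axis=0), V.max(axis=0))  # negative ideal solution
d_pos = np.sqrt(((V - v_pos)**2).sum(axis=1))            # Euclidean distance to v+
d_neg = np.sqrt(((V - v_neg)**2).sum(axis=1))            # Euclidean distance to v-
closeness = d_neg / (d_pos + d_neg)                      # higher = better alternative
best = int(np.argmax(closeness))
```

With these placeholder values the first row dominates every attribute, so its distance to the positive ideal is zero and its closeness score is 1.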

2) CORPUS
The absence of a benchmark dataset for authorship attribution in Arabic creates additional difficulties for evaluating attribution classifiers' performance: most publications in the Arabic authorship attribution domain use different datasets (see Table II). Our dataset was gathered from the Dar Al-Ifta Al-Misriyyah website. The website contains a huge set of fatwas written in Arabic and 9 other languages. Typically, a fatwa follows a well-defined structure; apart from that, we deal with it as regular textual content. We limit our corpus to only those fatwas written in Arabic. To extract the fatwas' content from the website, the OctoParse 7.0.2 web scraping tool was used. Octoparse is an easily configurable visual tool that allows running an extraction on the cloud as well as on the local machine; the scraped data can be exported in TXT, CSV, HTML, or Excel formats. The main challenge was scraping the right data. Thus, we first explored the website pages manually to group similar pages and ensure that each page contains the required texts, and then fed the scraper the right URLs. The output was an Excel sheet with the following useful information: (i) the fatwa's title, which describes its message briefly; (ii) the fatwa's date, which gives the period when the fatwa was published; (iii) the mufti's name, i.e., the Islamic scholar who interprets and expounds the law; (iv) the fatwa's question, posed by a questioning person, which contains a lot of helpful information that helps the mufti derive his opinion and final decision; and (v) the fatwa's answer, which contains the details of the scholar's opinion. Among the aforementioned information, the mufti's answer (the fatwa answer) is the most important. The fatwa answer varies in length depending on the nature of the fatwa and the detailed explanation given by the mufti.
It should be mentioned here that the corpus can be unbalanced regarding the distribution of fatwas per author (mufti). Thus, the training set has to be managed before employing an attribution classifier.

3) DATA PRE-PROCESSING
Before any preprocessing, the corpus is first divided into two sub-corpora. This step allows us to investigate the impact of training set size on the performance of the SMO classifier: (i) a balanced sub-corpus in which the number of fatwas per mufti is equal, and (ii) an unbalanced sub-corpus in which the distribution of texts per author differs. In addition, each sub-corpus is grouped into sets by text size; this grouping is necessary to test the effect of increasing the training set size on overall performance. Once the dataset is organized, the other necessary preprocessing steps are performed: • Function words and non-letters: unlike in text mining tasks, we kept these features in order to provide more authorial evidence [5]. • Stemming: to find the roots of words, we used Khoja's stemmer 6. We used Alwajeeh's ArabicSF tool 7 to carry out the above preprocessing steps on both sub-corpora before extracting attribution features.
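The division into sub-corpora described above can be sketched as follows. This is a minimal illustration in Python with invented names (the paper's own pipeline relies on the ArabicSF tool): a balanced sub-corpus is built by sampling an equal number of fatwas per author, while the unbalanced sub-corpus keeps the original uneven distribution.

```python
import random
from collections import defaultdict

def split_subcorpora(fatwas, per_author):
    """Group fatwas by author, then build a balanced sub-corpus by
    sampling an equal number of texts per author; authors with too few
    texts appear only in the unbalanced sub-corpus."""
    by_author = defaultdict(list)
    for author, text in fatwas:
        by_author[author].append(text)
    balanced = {a: random.sample(ts, per_author)
                for a, ts in by_author.items() if len(ts) >= per_author}
    unbalanced = dict(by_author)  # full, uneven distribution
    return balanced, unbalanced

# toy corpus: (author, text) pairs with an uneven distribution
corpus = [("mufti_A", f"text {i}") for i in range(5)] + \
         [("mufti_B", f"text {i}") for i in range(3)]
bal, unbal = split_subcorpora(corpus, per_author=3)
```

Grouping by text length (the size sets used later) would follow the same pattern, bucketing each fatwa by its word count instead of its author.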

4) FEATURE EXTRACTION
Since the instance-based approach [16] treats each text in the training set individually, the result of the feature extraction step is a vector of numerical values per text. Our feature set consists of: (i) 392 stylometric features, 335 of which are extracted by Alwajeeh's ArabicSF tool and 56 of which are morphological features extracted by the MADAMIRA tool 8, and (ii) 350 distinct words extracted by the WEKA tool 9.
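The shape of such a per-text vector can be sketched as below. The stylometric measures and distinct-word list here are illustrative stand-ins (the paper's actual 392 ASFM features and 350 WEKA-selected words are far richer); the point is that each text maps to one fixed-length numeric vector.

```python
import re

# illustrative distinct words; the paper selects 350 of them with WEKA
DISTINCT_WORDS = ["قال", "يجوز", "الحمد"]

def feature_vector(text):
    """Build one numeric vector per text, as in instance-based AA:
    a few sample stylometric measures followed by distinct-word counts."""
    words = text.split()
    stylometric = [
        len(words),                                        # token count
        sum(len(w) for w in words) / max(len(words), 1),   # avg word length
        len(re.findall(r"[،؛.!?]", text)),                 # punctuation count
    ]
    dw_counts = [words.count(w) for w in DISTINCT_WORDS]
    return stylometric + dw_counts

vec = feature_vector("قال العلماء يجوز ذلك، والله أعلم.")
```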

5) ENSEMBLE METHODS
As stated earlier, SMO-SVM is used as the base classifier of the ensemble methods. The ensemble methods are trained and tested within WEKA 3.6.12 on a personal computer with an Intel Core(TM) i7-4600U CPU @ 2.70 GHz, 8 GB of RAM, and a 64-bit Windows 8 operating system. In addition, 10-fold cross-validation was employed, and accuracy, precision, recall, and F1-score are used to measure the effectiveness of the attribution model. To answer the second posed question, the features were partitioned into three different sets and the classifier was trained and tested on four group sizes, as follows.
Balanced groups: the training set is partitioned into subsets with 50, 100, 200, and 300 texts per author, denoted β1, β2, β3, and β4 respectively. The number of words within a text is not taken into consideration.
Unbalanced groups: group1 (U1): the training set has instances of 11 authors, varying from 11 to 975 fatwas per author. The number of words within a fatwa ranges from very short (31 words per text) to fairly long (400 words per text). group2 (U2): the training set has instances of eight authors, with between 13 and 401 texts per author. The number of words within a fatwa is between 400 and 800. group3 (U3): the training set is quite small, with instances of five authors and from 7 to 80 fatwas per author. The number of words within a fatwa is limited to between 800 and 1200. group4 (U4): the training set is also quite small, with instances of eight authors and rather long fatwa texts; it contains those texts whose lengths exceed 1200 words.
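The Bagging scheme used above can be illustrated in a few lines of pure Python. This is a sketch only: a simple 1-nearest-neighbour learner stands in for the SMO-SVM base classifier that the paper trains inside WEKA, and the toy vectors stand in for the extracted feature vectors.

```python
import random
from collections import Counter

def nn_classifier(train):
    """A stand-in base learner (1-nearest-neighbour on squared Euclidean
    distance); the paper uses SMO-SVM inside WEKA instead."""
    def predict(x):
        return min(train, key=lambda t: sum((a - b) ** 2
                   for a, b in zip(t[0], x)))[1]
    return predict

def bagging(train, n_estimators=11, seed=0):
    """Bagging: fit each base learner on a bootstrap resample of the
    training set and combine predictions by majority vote."""
    rng = random.Random(seed)
    models = [nn_classifier([rng.choice(train) for _ in train])
              for _ in range(n_estimators)]
    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return predict

# toy (feature_vector, author) pairs, e.g. from the feature extraction step
train = [([0.0, 0.1], "A"), ([0.1, 0.0], "A"),
         ([1.0, 0.9], "B"), ([0.9, 1.0], "B")]
clf = bagging(train)
label = clf([0.05, 0.05])
```

AdaBoost differs in that the resampling is not uniform: each round reweights the training instances so the next base learner concentrates on previously misclassified texts, and the final vote is weighted by each learner's error.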

A. FEATURE-BASED LEVEL
To investigate the performance of the different stylometric feature sets (ASFMs, DWs, and ASFMs+DWs), Tables VII-XV summarize the results obtained by the two ensemble methods on the balanced and unbalanced datasets in terms of accuracy, recall, precision, and F1-score. The results show that the combined feature set (ASFMs+DWs) obtained the best performance with both Bagging and AdaBoost on the balanced datasets, except for subset β1. The dataset size of β1 is only 50 texts per author, which makes the DW features more effective than the ASFMs, whose feature vectors may contain more zeros. For the unbalanced datasets, the ASFMs obtained the best results (5 out of 8 cases). Similar to the case of β1, the DW features obtained better results for subset U1. For the balanced datasets, the tables show that the AdaBoost classifier gives the highest performance in most cases, achieving a best accuracy of 99.83%. In addition, the results show that the classifiers' performance improves as the number of authors in the dataset decreases. We therefore recommend the AdaBoost method for solving the authorship attribution problem on balanced datasets. For unbalanced datasets, however, the Bagging method outperformed AdaBoost on all dataset subsets. The results also show that as the size of the unbalanced dataset increases, the performance of the Bagging classifier decreases.

V. CONCLUSION AND FUTURE WORK
The Authorship Attribution (AA) problem in the Arabic language has been addressed in relatively few studies, and several analysis methods have been applied to tackle it. However, the performance of these methods needs to be improved. This work is distinguished from existing works in employing ensemble techniques, which had not previously been investigated for ALAA. In addition, the TOPSIS method has been used for scoring, ranking, and choosing the best alternative base classifier. In order to make the TOPSIS model more reliable for selecting authorship attribution base classifiers, several attributes were used: (i) the average accuracies of classifiers reported in published papers, (ii) the prevalence, or commonness of use, of the classifier in publications, (iii) the ability to deal with high-dimensional data, (iv) performance, and (v) sensitivity to noisy data. Indeed, adding other attributes could further enhance the TOPSIS method. As a result, the SMO-SVM classifier was chosen as the base classifier of the ensemble methods. In addition, two types of features were used: 397 stylometric features (ASFMs) extracted by Alwajeeh's ArabicSF tool and the MADAMIRA tool, and 350 distinct words extracted by the WEKA tool. These features were extracted from Arabic texts (Islamic fatwas) collected from the Dar Al-ifta AL Misriyyah website using the OctoParse 7.0.2 web scraping tool. Then, the Bagging and AdaBoost methods were applied, and their performance was examined for balanced and unbalanced training datasets. The results showed different characteristics for the two ensemble methods: AdaBoost obtained the highest accuracy on the balanced datasets, whereas Bagging obtained the highest accuracy on the unbalanced sets. The findings also showed that fusing the ASFMs and DW features yielded the best results. In future work, new attributes will be researched and examined with the TOPSIS method, and other ensemble methods will be investigated for ALAA.
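The TOPSIS ranking used to select the base classifier can be sketched as follows. The decision matrix, weights, and criterion directions here are illustrative placeholders, not the paper's actual scores: rows are candidate classifiers, columns are the attributes listed above, and the last column (noise sensitivity) is treated as a cost criterion where lower is better.

```python
import math

def topsis(matrix, weights, benefit):
    """TOPSIS: vector-normalise each criterion column, apply weights,
    find the ideal-best and ideal-worst alternatives, and score each
    row by its relative closeness to the ideal solution."""
    cols = list(zip(*matrix))
    norms = [math.sqrt(sum(v * v for v in c)) for c in cols]
    weighted = [[w * v / n for v, n, w in zip(row, norms, weights)]
                for row in matrix]
    wcols = list(zip(*weighted))
    best = [max(c) if b else min(c) for c, b in zip(wcols, benefit)]
    worst = [min(c) if b else max(c) for c, b in zip(wcols, benefit)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)    # distance to ideal solution
        d_worst = math.dist(row, worst)  # distance to anti-ideal
        scores.append(d_worst / (d_best + d_worst))
    return scores

# illustrative matrix: columns = (accuracy, prevalence in publications,
# high-dimensional ability, noise sensitivity); values are made up
matrix = [[0.95, 0.8, 0.9, 0.2],   # e.g., SMO-SVM
          [0.90, 0.9, 0.6, 0.4],   # e.g., a Bayesian classifier
          [0.88, 0.7, 0.5, 0.6]]   # e.g., a decision tree
scores = topsis(matrix, weights=[0.4, 0.2, 0.2, 0.2],
                benefit=[True, True, True, False])
```

The alternative with the highest closeness score is selected as the base classifier; with the toy values above, the first row wins, mirroring the paper's choice of SMO-SVM under its real attribute scores.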