Evaluating Author Attribution on Emirati Tweets

Author Attribution (AA) is a critical stylometry problem that tries to deduce the identity of the authors of electronic texts (e-texts) by only examining the texts. AA is essential for enhancing various application domains, such as recommender systems and forensics. Nevertheless, existing techniques in AA have not been assessed with Emirati social media e-texts. The reason is that no suitable dataset exists for evaluating AA techniques in this context. This paper introduces the Khonji-Iraqi Emirati Tweets Author Identification (AID) dataset with 30 authors (KIT-30), and detailed evaluations. Compound grams, a new definition of grams, are introduced, which allows us to achieve higher classification accuracy. Also, when the number of suspect authors increases, the classification accuracy degradation is not as severe as previously reported, when using suitable data representation. Furthermore, in order to work towards addressing the lack of conveniently-available implementations of stylometry methods, we have developed an extensive e-text feature extraction library, namely <italic>Fextractor</italic>, with a highly intuitive API. The library generalizes all existing <inline-formula> <tex-math notation="LaTeX">$n$ </tex-math></inline-formula>-gram-based feature extraction methods under the <italic>at least</italic> <inline-formula> <tex-math notation="LaTeX">$l$ </tex-math></inline-formula> <italic>-frequent,</italic> <inline-formula> <tex-math notation="LaTeX">$\texttt {dir}$ </tex-math></inline-formula> <italic>-directed,</italic> <inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula> <italic>-skipped</italic> <inline-formula> <tex-math notation="LaTeX">$n$ </tex-math></inline-formula> <italic>-grams</italic>, and allows grams to be diversely defined, including definitions that are based on high-level grammatical aspects, such as Part of Speech (POS) tags, as well as lower-level ones, such as the distribution of function words and word shapes.


I. INTRODUCTION
E-text stylometry is concerned with analyzing the writing styles of input e-texts in order to extract information about their authors. Such inferred information could be the identity of the authors, their genders, age groups, personality types, or even the diagnosis of certain illnesses [1], [2]. Author Attribution (AA) is an important problem in e-text stylometry and is defined as follows: given a set of texts with known authors, find a classification model that predicts which of these known authors also wrote the input test texts whose authors are not known. The target classification label, in this case, is the identity of the author [3]-[5]. This is a closed-set classification task, which means that the classification model expects the actual author of the input test text to be represented in the learning set.
While various stylometry problem solvers have been evaluated against texts of various domains, the accuracy of AA techniques on Emirati social media texts is unknown. (The associate editor coordinating the review of this manuscript and approving it for publication was Victor S. Sheng.) This work aims to address the following two challenges that face e-text stylometry problems:
• The lack of evaluation datasets for stylometry problem solvers when executed against e-texts written in Emirati Arabic, a dialect of the Arabic language natively spoken in the United Arab Emirates (UAE). This casts uncertainty on the performance of all stylometry methods when they are evaluated against electronic texts written in this dialect. As a result, the applicability of e-text stylometry methods to Emirati texts, for enhancing forensics, anti-forensics, or market analysis, is unknown.
• The lack of conveniently-available and extensive software implementing the many existing stylometry methods and feature extraction functions. Researchers are often forced to re-develop the many proposed methods or functions, and because of the sheer amount of effort required to do so, most of the methods or functions are not adequately evaluated. As a result, the actual value of the numerous independent contributions, relative to each other, is often not adequately known.
Hence, this work has these main contributions:
• The construction of an original AA assessment dataset made of Emirati tweets (the KIT-30 dataset).
• A new category of grams, namely compound grams, which allows a significant increase in classification accuracy.
• The extensive assessment of AA classification techniques against the introduced dataset.
• The implementation of an extensive stylometry feature extraction library with an easy-to-use interface in Python. While alternative feature extraction libraries exist [6], to the best of our knowledge, our library Fextractor is, by far, the most extensive library of its kind to date. Our library supports language-independent features, as well as language-dependent features for the following languages: Arabic, English, Chinese, French, German, and Spanish.
• The generalization of numerous feature extraction methods. This allows us to define novel variants of the existing feature extraction methods, in addition to simplifying the implementation.
• The release of the library under a permissive open-source license. We hope that this will enable other researchers to conveniently study the feature extraction methods, or evaluate their methods against the existing ones, without facing the time and effort barrier that is currently required to implement the many methods.
The results of the performance evaluation of more than 10,000 AA models show that the techniques using the introduced compound grams have significantly higher accuracy than those using other types of grams. The results also show that even when the number of suspect authors increases to 30, some AA models can achieve high accuracy in the context of Emirati tweets when using suitable text vectorization methods.
These results are remarkable, as they also imply that the decrease in accuracy when adding more authors is not as severe as previously reported [7]. More specifically, while the most accurate Twitter AA models in [7] achieve an accuracy below 0.8 even with two authors, our top-performing technique achieves 0.98 accuracy with 30 authors.
The remainder of this paper is organized as follows. Section II discusses related works. Section III introduces the KIT-30 dataset. Section IV presents the technique used for solving the AA problems, while compound grams are introduced in Section V. Sections VI and VII show the evaluation approach and the results, respectively. Section VIII introduces Fextractor, our extensive feature extraction library, and the conclusion is given in Section IX.

II. RELATED WORKS
The most relevant investigation to the present work is the PAN'12 closed-set AA challenge [8], where a number of AA models are evaluated against several problems, including closed-set AA ones. Although the considered datasets were only in the English language, this investigation is important as it shows the performance of leading AA algorithms. Khonji et al. have shown in [9] that the Random Forests (RFs) classification algorithm can achieve an accuracy that is equivalent to that of the best AA models of the closed-set AA evaluation of PAN'12.
Other AID evaluation efforts in the literature, including the following editions of the PAN competitions, expanded the set of languages considered within their datasets. For example, the following languages were added in recent PAN evaluations: Dutch, Greek, and Spanish. However, evaluation of stylometry methods, such as AA solvers, against Emirati texts remained absent in the literature. The most closely related evaluation dataset to our constructed KIT-30 is perhaps the Arabic Sentiment Tweets Dataset (ASTD) by Nabil et al. [11]. Still, while our techniques for collecting the tweets are influenced by their work, the ASTD dataset has the authors' identifiers removed (rightfully so, due to the nature of the Twitter terms of use and the nature of the study targeted by the ASTD dataset). Therefore, ASTD is not suitable for evaluating AA techniques.

III. THE KHONJI-IRAQI EMIRATI TWEETS AID EVALUATION DATASET (KIT-30)
This section introduces the objective of our dataset, the methods used to construct it, and its various statistics.
The goal of the KIT-30 dataset is to provide e-texts fit for generating and answering Emirati AID questions in order to evaluate AID models. AID can be the AA problem addressed in this paper, the Author Verification (AV) problem (verifying whether the same author wrote a pair of texts), or the Author Diarization (AD) problem (grouping sections in a particular document according to their authors).
To accomplish these goals, we execute the tasks in Algorithm 1. This resulted in obtaining a total number of 30 Emirati Twitter accounts [10]. The first two steps of the algorithm were inspired by Nabil et al. [11].
Algorithm 1 Obtaining a Set of Twitter User Accounts
1) Detect the most active accounts in the UAE by using SocialBakers.
2) Detect more accounts by looking for specific tags that are unique to the UAE.
3) Manually examine the saved accounts to drop non-UAE or non-Arabic accounts.
Next, Algorithm 2 was used to save, discard, and preprocess the tweets as deemed appropriate for the objectives of the evaluation dataset at hand. This resulted in the finalized KIT-30 dataset, which comprises over 50,000 tweets in total.

Algorithm 2 Downloading and Preprocessing Tweets
1) Save the maximum number of tweets allowed by the Twitter application programming interface.
2) Drop all reposted tweets, as the owner of the account did not write them.
3) Use placeholders to replace all tags, user names, and URLs. For example, every hashtag is replaced by the single placeholder ''#TAG''. This ensures that the evaluated AID models cannot solve AID problems by simply memorizing specific tags, user names, or URLs that happen to correlate strongly with author identities.
4) Save the author identifiers.
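As a rough illustration of step 3, the placeholder substitution can be sketched with regular expressions. This is a minimal sketch, not the exact preprocessing code used for KIT-30; the ''#USER'' and ''#URL'' placeholder names are our own assumptions, while ''#TAG'' follows the algorithm above. A single alternation pattern is used so that an inserted placeholder is never re-matched by a later rule:

```python
import re

# one pass over the text: URLs, then user names, then hashtags
PATTERN = re.compile(r"(?P<url>https?://\S+)|(?P<user>@\w+)|(?P<tag>#\w+)")

def preprocess_tweet(text):
    """Replace URLs, user names, and hashtags with fixed placeholders."""
    def repl(m):
        if m.group("url"):
            return "#URL"
        if m.group("user"):
            return "#USER"
        return "#TAG"
    return PATTERN.sub(repl, text)

# e.g. preprocess_tweet("@bob see https://t.co/x #uae")
# yields "#USER see #URL #TAG"
```

Doing all substitutions in a single pass is a deliberate choice: sequential `re.sub` calls would risk rewriting an already-inserted ''#URL'' placeholder when the hashtag rule runs.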
To allow meaningful comparisons between the evaluation results on this Emirati tweets dataset and those of other languages, we repeated the same process, adapting it for the Dutch, Greek, Spanish, and English languages. Table 1 shows the statistics of KIT-30, while Table 2 presents the per-author statistics of the Emirati tweets subset.

IV. AUTHOR ATTRIBUTION MODEL
In the context of this work, we adopt RFs as the learning technique, as it was shown in [9] that this algorithm achieves competitively high classification accuracy when solving AA problems.
RFs require the input samples to be described as vectors. We follow a vector representation approach similar to that of Khonji et al. [9]. Every text x is represented by a vector x, where x[i] denotes the frequency of a unique k-skip n-gram [12] in the text x. The index i always refers to the frequency of the same unique k-skip n-gram pattern. For example, given a pair of vectors x1 and x2 that represent different texts x1 and x2, respectively, x1[i] is the frequency of a unique pattern in text x1, while x2[i] is the frequency of the same pattern in text x2. This allows a meaningful comparison of the pair of vectors.
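This shared-index vectorization can be sketched as follows (a minimal illustration, not the exact pipeline of [9]; here the patterns are plain word unigrams, but any k-skip n-gram extractor could be plugged in as `extract_patterns`, which is a name we introduce for illustration):

```python
from collections import Counter

def vectorize(texts, extract_patterns):
    """Represent each text as a frequency vector over a shared,
    consistently indexed pattern vocabulary."""
    counts = [Counter(extract_patterns(t)) for t in texts]
    vocab = sorted(set().union(*counts))  # index i -> the same pattern for every text
    vectors = [[c[p] for p in vocab] for c in counts]
    return vectors, vocab

# e.g. with plain word unigrams as the patterns:
vectors, vocab = vectorize(["the quick fox", "the lazy dog the"],
                           lambda t: t.split())
```

Because every text is projected onto the same sorted vocabulary, position i carries the same meaning in every vector, which is exactly the property needed for the comparison described above.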
Then the learning set of texts represented as vectors and their author identities are used to train an RF model. The trained model is subsequently used for predicting the author of texts in the testing set. Before we define k-skip n-gram patterns, we define n-gram patterns, and then expand the definition of n-grams by the addition of k-skips.
An n-gram pattern is a series of n neighboring grams in a given text. A gram is a parameter that defines the most fundamental unit of the processed text. For example, if grams are words, then the most basic unit of any text is considered to be words. Figure 1 depicts all n-grams for the text ''The quick fox jumped over the lazy dog'' when grams are words, and n = 3. The list of common definitions of grams in the literature includes characters, letters, punctuation marks, words, word shapes, and POS tags.
The only novelty that k-skip n-grams bring relative to n-grams is that they expand each n-gram into multiple n-grams such that the gram adjacency constraint can be violated for up to k skips [12].
For example, if k = 2, and the starting gram is ''The'', then we not only identify the 3-gram ''The quick fox'', but also all of its 3-gram variants as listed in Table 3.   TABLE 3. k-skip n-grams in text ''The quick fox jumped over . . . '' for when k = 2, n = 3, grams are words, and the first gram is ''The''.
Similar inflation affects all other n-grams, except those near the end of the string, for which fewer skips are possible in order to avoid overrunning the end of the string.
It can be seen that the concept of k-skip n-grams is a generalization of the concept of n-grams. This makes n-grams a special case of k-skip n-grams for when k = 0. I.e., 0-skip n-grams and n-grams are identical (for any value of n, and any definition of what constitutes a gram).
Note that since k-skip n-grams inflate each n-gram into multiple variants with skips ranging from 0 up to k (inclusive of 0 and k), the total number of n-gram parameters increases combinatorially. Therefore, large values of k are sometimes computationally infeasible.
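Under the definition above, the inflation of each n-gram into its skip variants can be sketched as follows (an illustrative implementation, not the library's actual code). For each starting gram, it enumerates all in-order selections of n − 1 grams from a window of the n + k − 1 following grams, which bounds the total number of skips by k:

```python
from itertools import combinations

def kskip_ngrams(grams, k, n):
    """All k-skip n-grams of a gram sequence: each variant keeps the
    starting gram and chooses n-1 later grams, tolerating up to k skips."""
    out = []
    for start in range(len(grams) - n + 1):
        # near the end of the sequence the window shrinks, so fewer
        # skips are possible (no overrun past the end of the string)
        window = grams[start + 1 : start + n + k]
        for rest in combinations(window, n - 1):
            out.append((grams[start],) + rest)
    return out
```

For the running example with k = 2 and n = 3, the variants starting with ''The'' are exactly the six listed in Table 3, and setting k = 0 reduces the output to the classical n-grams.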
Additionally, it is customary to disregard less-frequent k-skip n-grams. This reduces dimensionality, and such infrequent measures are also often found to be too noisy for the purpose of solving AA problems. A successful rule in the literature is to drop every k-skip n-gram that occurs fewer than l times in every single text in the dataset. In our preliminary evaluations of the proposed models, we found that l = 5 was optimal for the evaluation. I.e., if a pattern fails to appear five or more times in any text, it is ignored and therefore not used in subsequent analysis.
Table 3 presents 2-skip 3-grams when grams are defined to be words. Another definition of grams known in the stylometry literature is to define them as POS tags, or dependency tags. For example, Table 4 presents the case when grams are defined to be POS tags. Note that each word is substituted by its corresponding POS tag, as defined by the Penn Treebank project. The same could be trivially extended to dependency tags, word lengths, and word shapes.
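The at least l-frequent rule can be sketched as follows (illustrative only): a pattern is kept if and only if it occurs at least l times in at least one text of the dataset.

```python
from collections import Counter

def frequent_patterns(texts_patterns, l=5):
    """Keep a pattern iff it occurs at least l times in some single text.

    texts_patterns: one list of extracted patterns per text."""
    keep = set()
    for patterns in texts_patterns:
        counts = Counter(patterns)
        keep.update(p for p, c in counts.items() if c >= l)
    return keep
```

Note that the threshold is applied per text, not over the whole corpus: a pattern scattered thinly across many texts is still dropped.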

V. COMPOUND GRAMS
Compound grams essentially aim to aggregate multiple definitions of grams that refer to the same text segment. TABLE 5. k-skip n-grams in text ''The quick fox jumped over . . . '' for when k = 2, n = 3, grams are word-POS tag tuples, and the first gram is the tuple that corresponds to ''The''. Table 5 presents examples of some compound grams, when aggregating the definitions ''word'' and ''POS tag'' into one gram.
Compound grams capture more information than the classical ones. For example, measuring the frequencies of grams, as shown in Tables 3 and 4, identifies the tendency of words or POS tags to independently occur in a given text. On the other hand, as shown in Table 5, measuring the frequency of compound grams identifies the tendency of certain words to jointly take certain POS tags in a given text. This can be valuable information for identifying authors, as authors can be distinguished not only by the independent frequency of certain grams (words or POS tags), but also by their tendency to choose certain words in certain positions of their sentences. For example, the word ''saw'' can be used as both a verb and a noun, as in the sentence ''I saw the saw''.
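A compound gram can be sketched as a tuple of co-indexed gram definitions. The snippet below pairs words with hand-written POS tags for the ''I saw the saw'' example above (the tags are our own illustrative annotations; a real pipeline would obtain them from a POS tagger):

```python
def compound_grams(words, pos_tags):
    """Pair each word with its POS tag to form word-POS compound grams."""
    return list(zip(words, pos_tags))

# hypothetical hand-tagged example, Penn Treebank tag set:
words = ["I", "saw", "the", "saw"]
tags  = ["PRP", "VBD", "DT", "NN"]
grams = compound_grams(words, tags)
# the two occurrences of "saw" now yield distinct grams:
# ("saw", "VBD") versus ("saw", "NN")
```

Whereas a word-only gram conflates the two occurrences of ''saw'', the compound grams keep the verb and noun uses apart, which is precisely the extra signal described above.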

VI. EVALUATION METHODOLOGY
Once AA models are trained as described in Section IV, we evaluate them by using 10-fold cross-validation. However, in order to ensure that each fold comprises realistic learning and testing samples, we add a constraint that limits the free mixing of tweets written at different times. Specifically, each author's tweets are grouped chronologically into 10 chunks, such that no two chunks cover overlapping time intervals.
This constraint increases the difficulty of the AA problems, as it substantially reduces the possibility of test tweets being chronologically too close to their learning counterparts.
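The chronological grouping per author can be sketched as follows (illustrative; the actual fold construction may balance chunk sizes differently). Tweets and their timestamps are passed as parallel lists, and chunk membership is assigned by chronological rank so that each chunk is contiguous in time:

```python
def chronological_folds(tweets, times, n_folds=10):
    """Split one author's tweets into n_folds contiguous chronological chunks."""
    order = sorted(range(len(tweets)), key=lambda i: times[i])  # oldest first
    folds = [[] for _ in range(n_folds)]
    for rank, i in enumerate(order):
        # integer arithmetic maps chronological rank -> fold index
        folds[rank * n_folds // len(tweets)].append(tweets[i])
    return folds
```

Cross-validation then holds out one whole chunk per author at a time, so a model is never tested on tweets interleaved in time with its training tweets.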
The statistics of the evaluation dataset, after grouping the tweets into 10 chronological chunks on a per-author basis, are presented in Table 6. Recall from earlier sections that the at least l-frequent k-skip n-grams have the following parameters: the minimum frequency l, the number of tolerated skips k, the gram count n, and the definition of what constitutes a gram. For completeness, we repeat the evaluation many times, each with a distinct AA RF model, such that each makes use of a unique data representation function. Specifically, we exhaustively implement all possible definitions of the at least l-frequent k-skip n-grams for the following sets of parameter values: l ∈ {1, 2, . . . , 9}, k ∈ {0, 1}, n ∈ {1, 2, 3}, and gram ∈ {word, word length, POS tag, word-POS tag tuple, dependency tag, word-dependency tag tuple, POS-dependency tags tuple}. The tuple grams essentially represent a special case of our proposed compound grams with two components. This process results in 9 × 2 × 3 × 7 = 378 unique text vectorization methods, each of which is used by an RF model that is evaluated by 10-fold cross-validation.
The only exception to this is the Dutch and Greek datasets, for which the gram can only be word or word length. This is because the POS tagger that we use (Stanford CoreNLP: http://stanfordnlp.github.io/CoreNLP/#human-languages-supported) does not support these languages. Additionally, since the accuracy of AA problem solvers is sensitive to the number of considered authors, we repeat the entire evaluation 29 times, each time evaluating against a unique suspects space size. I.e., we evaluate for all suspect space sizes in {2, 3, . . . , 30}. Therefore, the total number of evaluations is 378 × 29 = 10,962 10-fold cross-validations.
Subsequently, to investigate the statistical significance of the different performance results, Approximate Randomization (AR) [13] is applied to compute the p-values. The labels of the various significance levels are shown in Table 7. For example, if 0.01 < p ≤ 0.05, we view the difference between the considered classification accuracies as statistically significant, as is the case in [14], and indicate it by one asterisk ''*''.
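An AR test can be sketched as follows (an illustrative two-sided version over paired per-fold accuracies; the names `scores_a` and `scores_b` are our own, and [13] describes the method in full). The idea is to repeatedly swap the paired scores of the two systems at random and count how often the shuffled difference is at least as large as the observed one:

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided AR test: p-value for the observed mean-accuracy difference
    between two systems evaluated on the same folds."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b)) / len(scores_a)
    extreme = 0
    for _ in range(trials):
        sa = sb = 0.0
        for x, y in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # randomly swap the pair's labels
                sa += x; sb += y
            else:
                sa += y; sb += x
        if abs(sa - sb) / len(scores_a) >= observed:
            extreme += 1
    return (extreme + 1) / (trials + 1)  # add-one smoothing avoids p = 0
```

Identical score lists yield p = 1, while a consistently large gap across all folds yields a small p, matching the thresholds in Table 7.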

VII. EVALUATION RESULTS

A. ACCURACY OF AUTHOR ATTRIBUTION MODELS AS A FUNCTION OF SUSPECTS SPACE SIZE
Recall from earlier that, in total, 10,962 10-fold cross-validations are performed in order to evaluate RF AA models exhaustively with various parameter values of l, k, and n. Figure 2 depicts the empirical cumulative distribution function (ECDF) of all of the 10,962 classification accuracies found by 10-fold cross-validation using the Emirati tweets in KIT-30, such that each line i ∈ {2, 3, . . . , 30} (denoted by a unique color) represents the ECDF of the classification accuracy of all models assessed against problems with a suspects space of i authors. The results in Figure 2 show that the larger the number of suspect authors, the more text representation techniques there are that achieve lower RF AA accuracy. Nevertheless, even with 30 authors, there are specific text representation techniques that allow the RF AA models to achieve an accuracy very close to 1. More details on such successful configurations are outlined in the next subsection of this evaluation. Figure 3 presents the classification accuracy versus the number of authors. This accuracy is measured by considering the performance of all of the feature extraction functions. It can be seen that the performance of solving AA problems against Emirati tweets is superior to that of the Dutch and Greek datasets, and inferior to that of the Spanish and US English ones. However, this is not necessarily an indication that solving AA problems is more difficult with Emirati tweets than with Spanish or US English tweets, since some poorly performing features could degrade the overall classification accuracy and mask the effect of the well-performing features.
To demonstrate this, Figures 4, 5, 6 and 7 present the same results as those in Figure 3, except that specific feature extraction functions that tend to perform well under specific datasets are chosen. It can be seen that the performance of solving AA problems with Emirati tweets can be highly similar to that of the Spanish and US tweets datasets when certain feature extraction methods are chosen, namely, when defining grams as the tuple of word-POS tags. However, the performance on Emirati tweets degrades significantly when grams are words, as shown in Figures 5 and 6.
This suggests that, while the current methods of stylometry analysis were never previously assessed with Emirati social media texts (and rarely against Arabic texts in general), accurately solving AA problems with Emirati tweets is nonetheless possible by using compound grams formed by combining feature extraction methods found to be successful in the stylometry literature for other languages.

B. ACCURACY OF AUTHOR ATTRIBUTION MODELS AS A FUNCTION OF TEXT VECTORIZATION METHODS
Since this section discusses the effect of the various parameters of the feature extraction functions in greater detail, the discussion is focused on Emirati tweets and a suspects space of 30 authors for brevity. Figure 8 depicts the ECDFs of the evaluated RF AA classification models with varying values of l when tested against a set of 30 suspect authors. The ECDFs generally indicate that the most accurate classification models can be identified when l ∈ {3, 6}. Interestingly, this is close to the value l = 5 that was found by Khonji et al. [15] for the other languages (i.e., Dutch, English, Greek, and Spanish).
However, the most accurate AA classification models under each value of l ∈ {1, 2, . . . , 9} are not statistically significantly different from those found with different values of l. Table 8 presents the pair-wise statistical significance results against the most accurate models found under each value of l. Figure 9 depicts the ECDFs of the evaluated RF AA classification models with varying values of k. The ECDFs indicate that more accurate classification models can be identified when k = 0 than when k = 1, which suggests that tolerated violations of the gram adjacency assumption are detrimental to identifying authors of Emirati social media texts. However, Table 9 indicates that the difference between the best performing classifiers under each value of k is not statistically significant. Figure 10 depicts the ECDFs of the evaluated RF AA classification models with varying values of n. The ECDFs indicate that more accurate classification models can be identified when n = 1 than when n > 1, which suggests that observing the distribution of grams in relation to their adjacent ones is detrimental to identifying authors of Emirati social media texts. However, Table 10 indicates that the difference in accuracy between the most accurate models under each value of n is not statistically significant, with the exception of n = 1 versus n = 3, for which the difference is statistically significant. Figure 11 depicts the ECDFs of the evaluated RF AA classification models with varying definitions of grams. The ECDFs indicate that compound grams allow for the identification of more accurate classification models than the other definitions. Table 11 indicates that the increase in accuracy of the most accurate models using compound grams is always statistically significant, except against the gram pos, where the difference is not statistically significant (p = 0.2348), which may be due to the size of the dataset.
Table 12 presents a ranked list of the classification accuracies, and the parameters of the text vectorization methods, of the best performing classifiers whose differences in classification accuracy are small enough not to be statistically significant. It can be seen that the top 10 best performing classifiers exclusively make use of compound grams. This suggests that our novel definition of grams succeeds in allowing higher classification accuracy in the Emirati tweets domain than classically defined grams.
It is important to note that an accurate AA classifier is not necessarily an indication of the model's ability to identify the writing styles of authors. For example, if the dataset contains a significant author-topic bias, then a model that is originally intended to be an AID model can be partly an AID model and partly a topic identification model. Therefore, care must be taken to ensure that the used features do not contain too much topic information, as such information could confuse the learning algorithm and turn the model, to a larger degree than otherwise, into a topic classifier. This is specifically a concern when features that contain words are used, as such words could be content words (as opposed to function words).
If compound grams contain excessive topic information, this may lead the model to become a topic classifier instead of an author classifier. Therefore, to ensure that this is not the case for our best performing compound gram (word-POS or word-dep as shown in Table 12), Figure 12 lists the 20 most important features. The features were aggregated from each of the 10 evaluation folds as used in our RF models (duplicate entries are removed).
In this case, none of the recorded compound grams contains content or topic words. Interestingly, the identified Arabic words in the list above are also Arabic function words. The only arguable word is '' '', which translates to ''God''. However, since this word is often used in various expressions that are independent of the topic, we believe that it is fair to consider it a word that does not contain significant topic information. On the other hand, the least important features (i.e., features that contribute least to the AA model's decisions in solving AA problems) contain a significant amount of content or topic words. A list of such features is presented in Figure 13.
This supports the claims that the KIT-30 dataset does not include meaningful author-topic bias, and that the suggested compound grams are reasonably assisting the learning algorithm to find AID models, as opposed to topic classification models.

VIII. FEXTRACTOR: EXTENSIVE STYLOMETRY FEATURE EXTRACTION LIBRARY
One of the critical issues facing today's research on stylometry is the fact that implementations of most of the proposed stylometry-related methods are not released publicly. As a result, re-evaluating, or comparing newer methods against previous ones, is often extremely difficult due to the need to re-implement those methods (which requires a tremendous amount of time and effort).
A notable aspect of the research in e-text stylometry is the enhancement of feature extraction methods. Currently, such methods are highly diverse, and range from simple letter counts up to more sophisticated ones that use independent statistical models, such as POS taggers. However, it is quite common in the literature that a good portion of the considered feature extraction methods are evaluated in isolation, without adequate comparison against existing methods to truly justify their relative effectiveness. Another issue is the lack of adequate generalizations of the proposed methods, which leaves some of the novel variants unstudied.

A. SUPPORTED FEATURE EXTRACTION METHODS
The following feature extraction methods are supported:
• n-grams (classical n-grams), with parameters:
-Normalize (Boolean): if set to True, the library normalizes the raw number of occurrences of a pattern.
-Gram: Table 13 presents a list of supported grams.
-Cache (path): if set to None, caching is disabled. If set to a path, caching is enabled. This can be useful for expensive features, such as those that require making use of POS taggers (the cache saves time by avoiding parsing the same sentences twice).
• k-skip n-grams, with parameters:
-k: the total number of tolerated adjacency violations in an n-gram, in units of grams. E.g., k = 2 tolerates up to 2 adjacency violations, while k = 0 tolerates none, making the result identical to classical n-grams.
• Rewrite-rules. Unlike other rewrite-rule implementations, ours has the novelty of allowing the terminal words to be substituted by their alternative forms (e.g., word shape). For consistency, we refer to this as a ''gram''. Additionally, compound grams are also made available to the rewrite-rules feature extraction function. The parameters are:
-Normalize (Boolean).
-l.
-Gram: Table 13 lists all supported gram definitions.

B. GENERALIZATION OF n-GRAM METHODS
This section presents the mechanism by which our library implements n-grams, k-skip n-grams, and syntactic n-grams.
In order to simplify the implementation, enhance the ability to introduce more novel variants, and extend the coverage of the library, we have generalized all of the n-gram-based methods as the at least l-frequent dir-directed k-skip n-grams, and implemented this generalization instead. As a result, we get a more extensive library that is also simpler and allows superior code re-use. Further details are presented below. Consider the text example that is presented in Figure 14, and, for simplicity, suppose that grams are defined to be words. Then, if the parameter dir = spatial, the example text in Figure 14 is represented as the row matrix in (1).
The quick fox jumped over the lazy dog (1) Then, the sliding window, as depicted in Figure 1, will operate on the matrix in (1) on a row-by-row basis. Since there is a single (but long) row, the sliding window will move along that one row, depending on the chosen value of parameter n.
On the other hand, if the parameter dir = deptree, the example text in Figure 14 is represented as the row matrix in (2), where each row represents a path from the root node towards one of the leaf nodes as we walk down the dependency tree depicted in Figure 14. Then, similar to the dir = spatial case, the sliding window, as depicted in Figure 1, will operate on the matrix in (2) on a row-by-row basis. Since there are 5 rows, the sliding window will move along each row independently. It can be seen that, because of this design, we are able to re-use our sliding window code for both dir = spatial (classical n-grams) and dir = deptree (syntactic n-grams with dependency trees).
Alternatively, one may decide to construct a matrix similar to the one in (1), but while using a different type of trees, or methods that might not necessarily be based upon linguistics basis. Extending this library is as simple as introducing code that defines a matrix out of sentences.
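The row-by-row sliding window described above can be sketched as follows (illustrative, not the library's actual code). The rows come either from the spatial order, giving one long row, or from root-to-leaf dependency paths; the two-row matrix below is a made-up example, not the matrix in (2):

```python
def ngrams_over_matrix(rows, n):
    """Slide an n-gram window along each row of the gram matrix independently.
    Rows may come from the spatial order or from dependency-tree paths."""
    out = []
    for row in rows:
        out.extend(tuple(row[i:i + n]) for i in range(len(row) - n + 1))
    return out

# spatial direction: a single long row, as in matrix (1)
spatial = [["The", "quick", "fox", "jumped", "over", "the", "lazy", "dog"]]
trigrams = ngrams_over_matrix(spatial, 3)
```

Because the window never crosses a row boundary, the same function serves both directions, which is the code re-use the design aims for.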
As for the parameter k, which specifies the total number of permissible gram skips, it is implemented in the sliding window code and is therefore fully re-used, independent of the direction. Likewise, the rest of the code is re-used, independent of how the matrices are defined:

```python
# count raw k-skip n-gram patterns with raw frequencies
p = fextractor.getcount_ksngrams(m, k=2, n=2, normalize=False)

# print the score
print(p)
```

Additionally, if a vector representation is required, the represented texts can be trivially converted into vectors; for example, two distinct texts, text1 and text2, can be transformed into a vector space. Such vectors can then be used by other classifiers as required.

IX. CONCLUSION
A key contribution of this paper relates to the uncertainty associated with the applicability of stylometry problem solvers to the domain of Emirati Arabic texts. To work towards addressing this issue, we have constructed the KIT-30 dataset, which is the first Emirati social media author-identification evaluation dataset. Interestingly, our studies found that the scalability issues of AA problem solvers, as generally reported in the literature concerning the size of the suspect-author space, are noticeably more severe than what our findings indicate. For example, we were able to achieve a classification accuracy of over 0.98 when solving AA problems constructed from chunks of Emirati tweets, with a set of 30 suspect authors. This accuracy is notably higher than those reported in the literature for similar suspect-author space sizes [7], [17], especially given that our chunks of tweets, per author, remained relatively small (only a few hundred tweets per chunk).
Additionally, in order to work towards addressing the lack of conveniently-available implementations of stylometry methods, we have developed an extensive e-text feature extraction library with a highly intuitive API. This library offers, by far, the most extensive set of e-text stylometry feature extraction methods to date, which is partly thanks to our generalization of n-gram-based feature extraction methods. The library also contains a number of novelties, such as novel definitions of grams (e.g., compound grams) for both n-gram-based methods and CFG rewrite-rules. Interestingly, when using our feature extraction library, our evaluation of efficient AA solvers against Emirati tweets AA problems indicates that the use of compound grams allows for the identification of more accurate AA models.
• The KIT-30 dataset is available at https://gitlab.com/mmaakh/kit-30.git.