AGI-P: A Gender Identification Framework for Authorship Analysis Using Customized Fine-Tuning of Multilingual Language Model

In this investigation, we propose a solution for the author’s gender identification task called AGI-P. This task has several real-world applications across different fields, such as marketing and advertising, forensic linguistics, sociology, recommendation systems, language processing, historical analysis, education, and language learning. We created a new dataset to evaluate our proposed method. The dataset is balanced in terms of gender using a random sampling method and consists of 1944 samples in total. We use accuracy as an evaluation measure and compare the performance of the proposed solution (AGI-P) against state-of-the-art machine learning classifiers and fine-tuned pre-trained multilingual language models such as DistilBERT, mBERT, XLM-RoBERTa, and Multilingual DeBERTa. In this regard, we also propose a customized fine-tuning strategy that improves the accuracy of the pre-trained language models for the author gender identification task. Our extensive experimental studies reveal that our solution (AGI-P) outperforms the well-known machine learning classifiers and fine-tuned pre-trained multilingual language models with an accuracy level of 92.03%. Moreover, the pre-trained multilingual language models, fine-tuned with the proposed customized strategy, outperform the fine-tuned pre-trained language models using an out-of-the-box fine-tuning strategy. The codebase and corpus can be accessed on our GitHub page at: https://github.com/mumairhassan/AGI-P


I. INTRODUCTION
Author gender identification (AGI) is a task that involves determining the gender of an author based on their writing and has several real-world applications across different fields [1]. For example, in marketing and advertising, understanding the gender of authors can help tailor marketing strategies and advertisements to specific demographics [2]. It assists in creating content that resonates better with particular gender groups, leading to more effective advertising campaigns. In legal cases, determining the gender of an author based on written content can aid in forensic investigations. It might help identify potential suspects or verify the authenticity of documents [3]. For content curation and recommendation systems, platforms like social media, news aggregators, and recommendation engines can use author gender identification to personalize user content recommendations based on their preferences [4]. In natural language processing (NLP), AGI contributes to developing machine learning algorithms and models [5]. These models can aid sentiment analysis, chatbots, and other language-related AI applications.
The author gender identification task can be considered a binary classification problem. Several features are suggested to perform the author gender identification task, such as the most frequent function words, most frequent character n-grams, most frequent word n-grams, most frequent part-of-speech (POS) categories and their n-grams, sentiment lexicon, stylistic markers such as percentage of capital letters or punctuation, and mean sentence length [1], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. To define the target category (male or female), many machine learning classifiers such as Decision Trees, Support Vector Machine, Logistic Regression, K Nearest Neighbors, and Random Forest have been recommended [5], [6], [7], [8], [9], [14], [25], [26], [27], [28], [29], [30].
The associate editor coordinating the review of this manuscript and approving it for publication was Agostino Forestiero.
The AGI task has never been investigated for Punjabi texts. Punjabi belongs to the Indo-Aryan language family and is mainly spoken by the Punjabi people in the Punjab region of two countries, Pakistan and India. As of 2017, Punjabi is the most spoken language in Pakistan, with approximately 80.5 million speakers. It is the 11th most spoken language in India with 31.1 million speakers (as of 2011) and has official status in the Indian state of Punjab. A significant overseas diaspora speaks Punjabi in the United States, Canada, and the United Kingdom. The Punjabi language has approximately 113 million native speakers.¹ Due to the lack or inadequacy of several vital resources, such as gold standard datasets and fundamental natural language processing (NLP) toolkits, Punjabi can be classified as a low-resource language. However, the focus of our discussion is on the limitations of Punjabi in the context of author gender identification. The following are some major limitations.

A. LIMITATIONS OF PUNJABI IN CONTEXT OF AUTHOR GENDER IDENTIFICATION AND OUR RESEARCH OBJECTIVES
Limitation 1: Lack of Research: The author gender identification task has been extensively explored for resource-rich languages such as English [6] and Spanish [9], similarly to other NLP tasks such as part-of-speech (POS) tagging, text categorization, and named entity recognition (NER). However, this task has never been investigated for Punjabi texts. Therefore, one of the main objectives of this paper is to fill this research gap and present the first thorough investigation of the author gender identification task on Punjabi texts. We present an author gender identification solution that outperforms the well-known classifiers and the state-of-the-art fine-tuned pre-trained multilingual language models such as DistilBERT [31], mBERT [32], multilingual DeBERTa [33], and XLM-RoBERTa [34]. The findings of this investigation add new insights to existing knowledge (see Section III for more details).
¹ https://en.wikipedia.org/wiki/Punjabi_language
Limitation 2: Unavailability of Reliable NLP Resources: Gender identification of authors is a crucial NLP task. However, as previously stated, this task has never been investigated for Punjabi texts, and no dataset currently exists for it. To perform the author gender identification task, we require a dataset in which each text sample is associated with a gender label. As a result, in this paper, we built a new dataset containing 1944 samples, where the length of each text sample is fixed to 250 tokens, to evaluate the performance of our author gender identification solution (AGI-P). The dataset will be made publicly available to scholars in this field (see Section III for more details).
Limitation 3: Inapplicable Features: As previously stated, a comprehensive set of features has been used in resource-rich Western languages to perform the author gender identification task. However, many of these features, such as the number of capital letters, the number of sentences that begin with a capital letter, and the number of sentences that begin with a lowercase letter, cannot be extracted from texts written in Punjabi. Furthermore, due to the scarcity of credible NLP toolkits, several features are challenging to extract from Punjabi texts, such as the presence of sentiment, the frequency of POS tags, and the type of emotion, to name a few examples. In addition, Punjabi's morphological complexity and diversity make feature extraction more challenging. One of the main objectives of this paper is to identify the best features for the author gender identification task for Punjabi texts (the findings of an ablation study are given in Table 4).
Identifying the most compelling features for author gender identification in low-resource languages holds immense significance for several reasons:
1) Low-resource languages often lack extensive labeled datasets for model training. Selecting the right features is critical due to the scarcity of data, ensuring that the chosen features contribute meaningfully to gender identification.
2) Low-resource languages possess unique linguistic characteristics, different from widely studied languages.
3) Pinpointing features that reflect gender-specific linguistic nuances in these languages is essential for accurate identification.
4) Selecting optimal features directly impacts the model's performance in predicting gender accurately.
5) Identifying effective features aids in creating models adaptable to varying linguistic contexts and languages with limited resources.
6) Focusing on the most influential features maximizes the utility of limited resources available for feature extraction and model development.
7) Narrowing down feature sets minimizes computational overhead, especially in resource-constrained settings.
8) Identifying effective features paves the way for replicating successful approaches in similar low-resource language scenarios.
9) Optimal feature selection contributes to scalable solutions adaptable to other under-resourced linguistic domains.
Limitation 4: Missing Application of Deep Learning: Fine-tuning pre-trained language models has achieved state-of-the-art results for various NLP tasks. However, despite compelling evidence from the literature, no study has evaluated the performance of these models on the author gender identification task for Punjabi. In this investigation, we fine-tune the state-of-the-art pre-trained multilingual language models and compare their performance against our solution and well-known machine learning classifiers. We note that the proportion of Punjabi data used to train these language models is small. To make sure that these pre-trained multilingual models are fully adapted to Punjabi, we propose a new customized fine-tuning strategy for the pre-trained multilingual language models, which improves their accuracy (see Section IV-A for more details).

B. RESEARCH QUESTIONS
In addition to addressing the aforementioned limitations, we answer the following research questions, adding new insights to the existing knowledge.
• RQ 1: Do well-known machine learning classifiers outperform the fine-tuned pre-trained language models for the author gender identification task on texts written in a low-resource language such as Punjabi?
• RQ 2: Do well-known machine learning classifiers outperform the fine-tuned pre-trained language models for the author gender identification task on texts written in a resource-rich language such as English?
• RQ 3: What are the most important features that discriminate the texts written by a male and a female?

C. SUMMARY OF OUR CONTRIBUTIONS
The following are the main contributions of this paper.
• We propose an author gender identification solution that can outperform well-known classifiers as well as the fine-tuned pre-trained language models such as multilingual BERT, DistilBERT, XLM-RoBERTa, and multilingual DeBERTa [35], [36].
• We propose a new fine-tuning strategy for pre-trained multilingual language models, improving their accuracy for the author gender identification task.
• As stated earlier, no current dataset exists for the author gender identification task on Punjabi texts. Therefore, in this paper, we built the first dataset to perform this task, which will be publicly available.
• We present the first study on the author's gender identification task for the Punjabi texts, adding new insights to the existing knowledge.
• Given the limited availability of reliable NLP toolkits, we identify the best features to perform the author gender identification task for Punjabi.
We conducted extensive experimental studies on datasets from two languages, Punjabi and English, to compare our solution against well-known machine learning models and fine-tuned pre-trained language models.
The rest of the paper is organized as follows. Section II reviews the existing author gender identification studies. Our proposed solution, AGI-P, is described in Section III. Section IV presents the experimental studies and discusses their findings. The concluding remarks and future research directions are given in Section V.

II. LITERATURE REVIEW (STATE-OF-THE-ART)
Different demographic groups consistently behave differently, even in terms of language use. A substantial amount of research by sociolinguists has shown that differences in language are related to sociological factors, including age, gender, and educational attainment. Similarly, language psychologists have discovered connections between psychological characteristics and language use [37]. The field of author profiling, which has numerous applications in business and society, is currently generating a lot of interest. These techniques can be applied to business intelligence to assess demographically distinct attitudes regarding brands and businesses on social media and in targeted marketing and advertising. They can also be utilized in customer relations to make educated guesses about the personalities of customers who interact with a company's conversational agent, so that the chatbot's conversational style can be tailored to this personality profile. In forensics, language-based profiles can be examined, and the writers of letters, emails, and other documents used in an inquiry can be identified. Author gender identification is the initial step in author profiling investigations [38], [39], [40], [41], [42]. This task has been extensively investigated; however, automatically determining stylistic differences between men and women is still far from ideal [37], [43], [44], [45].
Since 2010, the CLEF PAN initiatives have been investigating numerous stylometric tasks, e.g., authorship identification, plagiarism detection, author profiling, etc. [46], [47], [48]. The tasks proposed correspond to various languages, with English being the most popular. The author's gender must be determined for the author profiling tasks. The best-performing approaches for the author gender identification task have used different feature types, including the most frequent n-grams of words or letters, the bag-of-words, the POS categories or their n-grams, mean sentence or word length, percentage of capital letters or punctuation, percentage of emojis, etc.
A logistic regression (LR) classifier reported the best accuracy in CLEF-PAN 2014 [49], whereas a support vector machine (SVM) classifier reported the best performance in 2015 [50]. The best result obtained with the gender identification corpus from the 2016 campaign was based on the LR classifier, which was trained on the most frequent words, most frequent n-grams, and stylistic cues [51]. A linear SVM classifier based on the most frequent word uni-grams and most frequent word bi-grams with the most frequent character 3-grams and the most frequent character 5-grams reported the top performance in CLEF-PAN 2017 [52]. Similarly, an SVM classifier was used to achieve first place in 2018 [53]. Finally, the best results were achieved in 2019 using a logistic regression strategy based on word and letter n-grams [43].
It has been reported that one gender uses some topical words more frequently than the other. Men employ more terms linked to technology and finance (e.g., software, game, Linux, money, sports), whereas women choose to write about their friends and social relationships (e.g., shopping, friends, cute, love, mom) [45], [54], [55]. Women have also been reported to employ emotions or certainty phrases (such as must, always) more frequently than men [56]. It is more challenging to extract these features than regularly used determiners or pronouns for all the languages [37]. Extracting these features may require the collection of terms provided by the Linguistic Inquiry and Word Count (LIWC) programme [57].
This study is focused on author gender identification for Punjabi texts. As discussed earlier, several features used in existing studies are not applicable to the Punjabi language. Furthermore, several of these features are difficult to extract from Punjabi texts due to the scarcity of credible NLP toolkits. The presence of sentiment, the frequency of POS tags, and the type of emotion are only a few examples. Extracting these features is difficult due to the Punjabi language's complexity, morphological diversity, and the unavailability of reliable NLP toolkits for Punjabi. One of the main objectives of this paper is to identify the best features for the author gender identification task for Punjabi texts.

A. AUTHOR GENDER IDENTIFICATION FOR LOW RESOURCE LANGUAGES
This section presents research focusing on author gender identification (AGI) tasks conducted in resource-scarce languages. Baseer et al. conducted this task on the Romanized Urdu dataset using 15 lexical features, visualizing the results through a two-dimensional graph [58]. The study revealed that male authors tend to use single characters more frequently. Conversely, females exhibit a more definitive conversational style, supported by their higher use of special characters and abbreviations. Additionally, it was noted that females employ more words from a specified Urdu corpus than males, reducing the usage of candidate words. Khandelwal et al. [59] addressed the challenge of predicting author gender in code-mixed content by introducing an English-Hindi Twitter dataset annotated with gender labels. Their study utilized machine learning methods, considering character and word-level features to infer an author's gender from the text.
Moreover, Sarwar et al. [1] recently conducted the AGI task on Urdu texts, employing 600 frequent multiword expressions and 300 frequent words as features along with a support vector machine classifier, achieving an accuracy of 93.79%.

III. PROPOSED SOLUTION (AGI-P)
This section describes our proposed solution for author gender identification for Punjabi text (AGI-P). As can be seen from Figure 1, AGI-P consists of four stages: (i) data collection, (ii) features extraction, (iii) machine learning, and (iv) author gender identification. Each stage is explained in the following subsections.

A. DATA COLLECTION
To perform the author gender identification task, we require a dataset in which each text sample is associated with a gender label. We created a new Punjabi dataset containing news articles (texts) extracted from a newspaper² and annotated the dataset with gender labels using author information retrieved from the website.
We begin with a seed URL, which is the website address of a newspaper. We send HTTP requests to the seed URL to retrieve the web page content. Once a page is fetched, we parse its HTML content, extracting various elements such as the author biography and the news article text. We identify and extract hyperlinks in the HTML that point to other web pages. We maintain a queue or list of URLs we discover during parsing. We then follow these extracted links, navigating to new web pages. This process continues recursively, crawling through multiple levels of linked pages, discovering new links, adding them to the queue, and extracting author information and news article text. After the data collection, we removed the emoji information from the texts because we aimed to build a solution based on textual information only. We would also like to highlight that punctuation was not removed, as it may contain linguistic cues of the author's gender.
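The crawling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler used in this work: the names `LinkExtractor`, `fetch_html`, and `crawl`, the restriction to the seed's host, and the `max_pages` limit are all hypothetical, and the sketch uses an explicit queue (breadth-first) rather than literal recursion. Extraction of the author biography and article text is site-specific and is omitted.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher: one HTTP GET, decoded as UTF-8."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, fetch=fetch_html, max_pages=100):
    """Breadth-first crawl restricted to the seed's host (an assumption).

    Returns a dict mapping each visited URL to its raw HTML.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop fragments before de-duplicating.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing the fetcher as a parameter keeps the traversal logic testable without network access; a real crawl would also need rate limiting and respect for the site's robots policy.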
We fixed the length of each text to 250 tokens, making this task more challenging. We also tested our solution on an English dataset to achieve all the research objectives. The English dataset was extracted from the Blog Corpus.³ To make a fair comparison, we also fixed the length of each text sample in the English corpus to 250 words. A summary of the datasets is given in Table 1. As can be seen, both datasets are balanced in terms of the number of text samples from each gender.
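A minimal sketch of the two preprocessing steps mentioned above, fixing each sample to 250 tokens and balancing the genders by random sampling. The helper names `fix_length` and `balance_by_gender` are hypothetical, and several details are assumptions rather than statements about this work: tokens are taken to be whitespace-separated, texts shorter than 250 tokens are discarded, and balancing is done by downsampling to the size of the smaller class.

```python
import random


def fix_length(text, n_tokens=250):
    """Return the first n_tokens whitespace tokens, or None if the text is too short."""
    tokens = text.split()
    if len(tokens) < n_tokens:
        return None
    return " ".join(tokens[:n_tokens])


def balance_by_gender(samples, seed=0):
    """Randomly downsample (text, label) pairs so every label is equally frequent."""
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced
```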

B. FEATURES EXTRACTION
After collecting data, we partitioned each dataset into training and test sets. Specifically, we used 80% of the data to train the probabilistic Light Gradient Boosted Machine (LightGBM) classifiers and 20% of the data for testing purposes. We extracted two types of features from each sample. The first type of feature is the 1800 most frequent variable length character n-grams (V.L.C), where the values of n are in the range of 2 to 10. The second type of feature is the 1800 most frequent words (W). As a result, each text sample results in two feature vectors.
Feature vectors were generated, comprising frequent character n-grams and words. The best accuracy stemmed from the 1800 most frequent character n-grams, varying in length from 2 to 10 characters, and the 1800 most frequent words. Using probabilistic LightGBM, a classification model was trained on the feature sets, providing gender predictions for authors. LightGBM's efficiency, faster training, and higher accuracy due to its leaf-wise decision tree methodology played pivotal roles. To ensure confident predictions, entropy was employed as an uncertainty measure for identifying the most certain gender prediction per sample. A threshold value of 0.280 for the Punjabi dataset and 0.870 for the English dataset was established to determine the gender of authors based on the final prediction.

TABLE 1. Summary of the datasets.
The motivation behind using these features for the gender identification task is that we tried different types of word and character-based features and found that the most frequent variable length character n-grams and the most frequent words resulted in the best accuracy (see Table 4 for more details). We also varied the number of character and word-based features, and 1800 features resulted in the best performance (see Table 5 for more details). Moreover, we also tried different range values for n in character n-grams, and the values in the range 2-10 resulted in the best accuracy (see Table 6 for more details). Once we identify the best word and character-based features, we move to the next step of our solution.
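The two feature types can be illustrated with a short sketch. The function names `char_ngrams`, `top_features`, and `vectorize` are hypothetical, and two details are assumptions: the vocabulary is built from the training texts only, and each feature value is a raw occurrence count (the text does not say whether counts or relative frequencies are used).

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=10):
    """Yield every character n-gram of text with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def top_features(texts, extract, k=1800):
    """Return the k most frequent items (n-grams or words) over all texts."""
    counts = Counter()
    for text in texts:
        counts.update(extract(text))
    return [item for item, _ in counts.most_common(k)]


def vectorize(text, vocabulary, extract):
    """Count how often each vocabulary item occurs in one text."""
    counts = Counter(extract(text))
    return [counts[item] for item in vocabulary]


# Usage: one vocabulary per feature type, hence two vectors per sample.
train_texts = ["abab cd", "abc cd"]
vlc_vocab = top_features(train_texts, char_ngrams, k=1800)   # V.L.C features
word_vocab = top_features(train_texts, str.split, k=1800)    # W features
vlc_vector = vectorize(train_texts[0], vlc_vocab, char_ngrams)
word_vector = vectorize(train_texts[0], word_vocab, str.split)
```

Each of the two resulting matrices would then be used to train its own classifier, as described in the next subsection.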

C. MACHINE LEARNING CLASSIFIER (LIGHTGBM)
After the features extraction process of our solution, we train a probabilistic LightGBM on each feature set, resulting in two author gender predictions. The motivation behind using probabilistic LightGBM is that it is a distributed, high-performance gradient-boosting framework for classification tasks. It is based on the decision tree method. It grows the tree leaf-wise with the best fit, whereas other boosting algorithms grow the tree depth-wise or level-wise. As a result, when growing on the same leaf, the leaf-wise method can reduce more loss than the level-wise strategy, which leads to an accuracy that can rarely be attained by the existing boosting algorithms.
LightGBM has several advantages, such as higher efficiency and faster training (i.e., LightGBM uses a histogram-based approach, which accelerates training by grouping continuous feature values into discrete bins), reduced memory usage (i.e., discrete bins are used in place of continuous values, which uses less memory), better accuracy (i.e., by using a leaf-wise split strategy rather than a level-wise split approach, which is the primary element in getting higher accuracy; it produces far more complicated trees), huge dataset compatibility (i.e., compared to other tree-based algorithms, it can handle large datasets while requiring significantly less training time), and it supports parallel learning.

D. AUTHOR GENDER IDENTIFICATION
Once we obtain two predictions for each test sample using probabilistic LightGBM, we use the entropy as the uncertainty measure to identify the most certain prediction for each test sample and use it as the final prediction for the author gender identification task. The average amount of ''uncertainty'' resulting from a random variable's potential outcomes is known as entropy in information theory. Given a discrete random variable X, which takes values in the alphabet $\mathcal{X}$ and is distributed according to $p : \mathcal{X} \to [0, 1]$, the entropy is
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
After we select the final prediction, we learn a threshold value to decide the gender of the text's author. The threshold value for the Punjabi dataset is 0.280, and for the English dataset, it is 0.870.
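The selection step can be sketched as below. This is an illustrative reading of the description, not a confirmed implementation: the names `binary_entropy` and `final_prediction` are hypothetical, each classifier is assumed to output the probability of one designated class, entropy is computed in bits, and the learned threshold is assumed to be applied directly to the selected probability.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a two-class distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain prediction carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def final_prediction(p_char, p_word, threshold):
    """Keep the less uncertain of the two probabilities, then threshold it.

    p_char and p_word are the probabilities from the character n-gram and
    word classifiers; the returned label is 1 if the kept probability
    reaches the threshold and 0 otherwise.
    """
    chosen = min((p_char, p_word), key=binary_entropy)
    label = 1 if chosen >= threshold else 0
    return chosen, label
```

For example, with the Punjabi threshold of 0.280, the pair of probabilities (0.45, 0.10) keeps 0.10, since it is further from 0.5 and therefore has lower entropy, and maps it to label 0.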

IV. PERFORMANCE EVALUATION
This section discusses the experimental setup and our extensive experimental studies for the author's gender identification task in two different languages.

A. EXPERIMENTAL SETUP
1) PARAMETER SETTINGS FOR OUR SOLUTION
We configured the LightGBM classifier primarily using its default settings, with specific adjustments made to selected parameters to optimize performance for the different language datasets. For the Punjabi dataset, we set the min_child_samples parameter to 10 and subsample_for_bin to 100. In the case of the English dataset, we set min_child_samples to 35 and n_estimators to 1000. These particular settings were determined after experimenting with various values and were ultimately chosen because they yielded the highest accuracy during our testing.
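The reported settings can be written down as a small configuration sketch. The dictionary `LIGHTGBM_OVERRIDES` and the helper `make_classifier` are hypothetical names; the sketch assumes the scikit-learn style `LGBMClassifier` interface of the `lightgbm` package, whose constructor accepts `min_child_samples`, `subsample_for_bin`, and `n_estimators`.

```python
# Overrides reported in the text; every other parameter keeps its default value.
LIGHTGBM_OVERRIDES = {
    "punjabi": {"min_child_samples": 10, "subsample_for_bin": 100},
    "english": {"min_child_samples": 35, "n_estimators": 1000},
}


def make_classifier(language):
    """Build an LGBMClassifier with the overrides for one dataset."""
    # Imported lazily so the configuration can be inspected without lightgbm installed.
    from lightgbm import LGBMClassifier
    return LGBMClassifier(**LIGHTGBM_OVERRIDES[language])
```

A classifier built this way would be fitted once per feature set with `fit(X_train, y_train)`, and `predict_proba(X_test)` would supply the class probabilities used in the entropy-based selection step.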

2) PROPOSED FINE-TUNING STRATEGY AND PARAMETER SETTINGS FOR PRE-TRAINED MODELS
We have developed a specialized fine-tuning strategy for pre-trained multilingual language models tailored for specific linguistic contexts. This approach comprises a standard fine-tuning procedure and a proprietary, customized method.
Standard fine-tuning is implemented using the TensorFlow framework provided by Huggingface [60], with the hyperparameter details available in Table 2. In contrast, our customized strategy begins with an initial adaptation phase (step 0), specifically for the Punjabi language. This involves using a Masked Language Model (MLM) task with the Punjabi subset of the CC-100 dataset to adjust the models to better handle Punjabi text, which is underrepresented in the CC-100 dataset [61].
Subsequently, the gender identification datasets are divided into five segments for cross-validation (step 1). In this phase, each segment is used in turn as a validation set while the model is fine-tuned over five epochs on the remaining data (step 2). The version of the model with the highest accuracy on the validation set is then selected for the final model ensemble.
For each input entry, the ensemble models generate a prediction probability, with the final prediction probability being the average of these five outputs (step 3).This method is applied to English without the initial priming step, as the volume of English data in pre-training is already substantial.
The ensemble approach aims to mitigate overfitting issues, ensuring robust model performance.The fine-tuned models for the custom strategy are indicated as MODEL C in Table 3.
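Steps 1 to 3 of the customized strategy can be outlined without any deep learning dependency. The sketch below only shows the bookkeeping around the models: the function names are hypothetical, the folds are assumed to be contiguous and unshuffled for simplicity, and the fine-tuning itself (which would produce the per-epoch validation accuracies and the per-fold probabilities) is left out.

```python
def k_fold_indices(n_samples, k=5):
    """Step 1: yield (train, validation) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size


def select_best_checkpoint(epoch_accuracies):
    """Step 2: index of the epoch with the highest validation accuracy."""
    return max(range(len(epoch_accuracies)), key=epoch_accuracies.__getitem__)


def ensemble_probability(fold_probabilities):
    """Step 3: average the probabilities of the per-fold models for one input."""
    return sum(fold_probabilities) / len(fold_probabilities)
```

With five folds, each input receives five probabilities, one from the best checkpoint of each fold, and their mean is the ensemble output.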
The parameter values for fine-tuning the pre-trained language models are given in Table 2. All models are base models, with 12 layers (with the exception of DistilBERT, which has 6 layers), a hidden size of 768, and 12 attention heads. DistilBERT has a vocabulary size of 31K tokens, mBERT has a vocabulary size of 120K tokens, and Multilingual DeBERTa and XLM-RoBERTa have a vocabulary size of 250K tokens.

3) EVALUATION MEASURES AND EVALUATION STRATEGY
We used accuracy as an evaluation measure for this task, which can be defined as follows.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where:
• a TP is an outcome where the classifier correctly predicts the positive class,
• a TN is an outcome where the classifier correctly predicts the negative class,
• an FP is an outcome where the classifier incorrectly predicts the positive class, and
• an FN is an outcome where the classifier incorrectly predicts the negative class.
It is appropriate to rely on accuracy as our dataset is gender-balanced in terms of the number of text samples [62]. The train-test split ratio is fixed to 80%-20%, respectively.
15404 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

TABLE 3. Accuracy comparison of our solution (AGI-P), machine learning classifiers, and fine-tuned pre-trained language models with different fine-tuning strategies.

TABLE 4. Ablation study: the performance of different feature types for the author gender identification task. For V.L.C and V.L.W, the values of n are in the range of 2-10.

B. EXPERIMENTAL STUDIES, RESULTS DISCUSSION AND IMPLICATIONS
In this subsection, we present our extensive experimental studies, answer the research questions listed in Section I, and discuss the experimental findings.

1) ANSWER TO RQ 1 AND RQ 2
In this study, we compare our solution's (AGI-P) performance against the performance of well-known machine learning classifiers and fine-tuned pre-trained language models.
The machine learning classifiers include Light Gradient Boosted Machine Classifier (LightGBM), Gradient Boosting Classifier (GBoost), Random Forest Classifier (RF), Ada Boost Classifier (AdaBoost), K Nearest Neighbors Classifier (KNN), Decision Tree Classifier (DT), and Support Vector Machine Classifier (SVM). For the well-known machine learning classifiers, we extract the 1800 most frequent variable length character n-grams (V.L.C) from each text sample, where the values of n range between 2 and 10, and use them to train all the classifiers using their default parameter settings. The main reason to use only the V.L.C as the features for the machine learning classifiers is that they are the best features to perform the author gender identification task for both languages (i.e., Punjabi and English, see the experimental results given in Tables 5 and 6). We also compare the performance of our solution against pre-trained language models such as DistilBERT, mBERT, XLM-RoBERTa, and Multilingual DeBERTa. These models are fine-tuned using two different strategies: out-of-the-box and customized (proposed).
As can be seen from Table 3, our solution (AGI-P) outperforms the machine learning classifiers and the out-of-the-box fine-tuned pre-trained language models on Punjabi texts, which is the main focus of this paper. We also note that, when using the out-of-the-box fine-tuning strategy, machine learning classifiers outperform the pre-trained language models for the Punjabi texts; however, the performance of the fine-tuned pre-trained language models is higher than that of the machine learning classifiers for English.⁴ When using customized fine-tuning, the best fine-tuned model (an ensemble of fine-tuned mBERT C) outperforms the rest of the pre-trained language models.
For English, it is unsurprising that fine-tuning pre-trained language models produces higher accuracy than our proposed solution. This is in line with recent state-of-the-art results [33]. These pre-trained models benefit from the information contained in the massive amount of textual data available to them in the pre-training phase and can utilize this information for the task. Furthermore, the vocabulary of these models is also skewed heavily towards English. Nevertheless, these models are computationally expensive.

2) ANSWER TO RESEARCH QUESTION 3
To answer this question, we extracted character-based and word-based features from each text sample. Specifically, we extracted two types of character-based features from each text sample: the 1800 most frequent characters (C) and the 1800 most frequent variable length character n-grams (V.L.C). Similarly, we extracted two types of word-based features from each text sample: the 1800 most frequent words (W) and the 1800 most frequent variable length word n-grams (V.L.W). We then applied well-known machine learning classifiers to the author gender identification task. The experimental results are given in Table 4. As for the character-based features, the 1800 most frequent variable length character n-grams (V.L.C) outperform the 1800 most frequent characters (C). On the other hand, the 1800 most frequent words outperform the 1800 most frequent variable length word n-grams (V.L.W). We also combined the best character and word-based features and found that the combined features perform poorly compared to the 1800 most frequent variable length character n-grams.
As can be seen from Table 4, the 1800 most frequent variable length character n-grams, where the values of n range between 2 and 10, result in the best accuracy. We also investigate the effects of varying the number of variable length character n-gram (V.L.C) features and the values of n on the accuracy of the author gender identification task as follows.

3) EFFECT OF NUMBER OF FEATURES ON THE GENDER IDENTIFICATION TASK
In this subsection, we investigate the effect of the number of features on the accuracy of the author gender identification task. We tried different numbers of features, including 100, 300, 600, 900, 1200, 1500, 1800, and 2100, while fixing the range of n at 2-10. As can be seen from Table 5, using 1800 features results in the best accuracy for the author gender identification task for both languages.

4) EFFECT OF VALUES OF N ON THE PERFORMANCE OF THE GENDER IDENTIFICATION TASK
In this subsection, we investigate the effect of different n values on the accuracy of the gender identification task. Specifically, we tried different ranges of n, including 2-5, 2-10, and 2-15, using the 1800 most frequent variable length character n-grams (V.L.C) as the features. As can be seen from Table 6, the range 2-10 resulted in the best accuracy using the LightGBM classifier. Therefore, we fixed n between 2 and 10 for the rest of the experimental studies.
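The two ablations above vary one setting at a time while holding the other fixed. The sketch below shows that protocol under stated assumptions: `evaluate` is a hypothetical callback, supplied by the reader, that trains a classifier for a given feature count and n range and returns its test accuracy. It is not a function from the AGI-P codebase.

```python
# Grids taken from the text; everything else here is illustrative.
FEATURE_COUNTS = [100, 300, 600, 900, 1200, 1500, 1800, 2100]
N_RANGES = [(2, 5), (2, 10), (2, 15)]

def sweep(evaluate, feature_counts=FEATURE_COUNTS, n_ranges=N_RANGES,
          fixed_n_range=(2, 10), fixed_count=1800):
    """Run the two one-factor-at-a-time sweeps.

    evaluate(num_features, n_range) is assumed to return an accuracy.
    Returns two dicts: accuracy by feature count and accuracy by n range.
    """
    by_count = {k: evaluate(k, fixed_n_range) for k in feature_counts}
    by_range = {r: evaluate(fixed_count, r) for r in n_ranges}
    return by_count, by_range

def best(results):
    """Return the setting with the highest accuracy."""
    return max(results, key=results.get)
```

Because each sweep fixes the other setting, this protocol does not explore every combination of feature count and n range. A full grid search would be needed to rule out a better joint setting.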

V. CONCLUSION AND FUTURE WORKS
This paper proposes a solution (AGI-P) for the author gender identification task for short Punjabi texts. To test our solution, we created a new dataset for the author gender identification task, which will be made publicly available. We conducted extensive experimental studies to show that our solution outperforms well-known machine learning classifiers and fine-tuned pre-trained language models. Specifically, despite the popularity of fine-tuned pre-trained language models for achieving state-of-the-art performance on several NLP tasks, our solution (AGI-P) achieves the best accuracy level of 92.03% on the Punjabi dataset. We also proposed a new fine-tuning strategy for the pre-trained language models, which outperforms the out-of-the-box fine-tuning strategy for a low-resource language. Moreover, we found that the 1800 most frequent variable length character n-grams are the best features for the author gender identification task on both the Punjabi and English datasets using well-known machine learning classifiers. In the future, we plan to extend this study and identify the words and phrases that are more likely to be used by an author of a specific gender.

FIGURE 1. Overview of the proposed solution. News articles were gathered from a newspaper website and annotated with author gender information retrieved from the site. The dataset was partitioned into training and test sets, with 80% used for training Light Gradient Boosted Machine (LightGBM) classifiers and 20% for testing. Feature vectors were generated from frequent character n-grams and words. The best accuracy was obtained with 1800 character n-grams and words, each varying in length from 2 to 10 characters. Using probabilistic LightGBM, a classification model was trained on the feature sets to provide gender predictions for authors. LightGBM was chosen for its efficiency, faster training, and higher accuracy, which stem from its leaf-wise decision tree methodology. To ensure confident predictions, entropy was employed as an uncertainty measure for identifying the most certain gender prediction per sample. A threshold value of 0.280 for the Punjabi dataset and 0.870 for the English dataset was established to determine the gender of authors based on the final prediction.
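The entropy-based uncertainty measure mentioned in the caption can be sketched as follows. The caption does not state the logarithm base or exactly how the threshold is applied, so this sketch makes two assumptions: entropy is computed in bits (base 2), so that it lies in [0, 1] for two classes, and a prediction is treated as confident when its entropy does not exceed the threshold. The function names are hypothetical.

```python
import math

# Thresholds reported in the caption of Figure 1.
THRESHOLDS = {"punjabi": 0.280, "english": 0.870}

def prediction_entropy(probabilities):
    """Shannon entropy in bits of a class-probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)

def is_confident(probabilities, threshold):
    """Assumed rule: a prediction is confident if its entropy is at or
    below the threshold. The paper may apply the threshold differently."""
    return prediction_entropy(probabilities) <= threshold

# Hypothetical class probabilities from a probabilistic classifier.
sample = [0.97, 0.03]
confident_punjabi = is_confident(sample, THRESHOLDS["punjabi"])
```

A uniform prediction such as `[0.5, 0.5]` has the maximum entropy of 1 bit and fails both thresholds under this rule, while a prediction concentrated on one class has entropy close to 0.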
RAHEEM SARWAR received the Ph.D. degree from the City University of Hong Kong. His research interests include technology development, artificial intelligence, NLP, data science, scientometrics, altmetrics, information retrieval, and text mining.

LE AN HA is working and publishing on a variety of subjects, including automatic terminology extraction, both monolingual and multilingual, multiple-choice question (MCQ) generation, analysis of multiple-choice test items, and multilingual preprocessing. He has extensive experience in developing commercial natural language processing (NLP) vertical solutions. He has served as an Acting Coordinator of an EU Leonardo Project (TELLME), which developed a range of products, including work-related language exercises, and showcased NLP technologies, such as automatic term extraction and MCQ generation. He has developed a Computer-Aided Patient Notes Scoring System for a well-known U.S. medical examiner organization. He has also been leading research activities on applying NLP technologies to licensing testing, funded by a U.S. organization on a yearly rolling research contract, including American-British transliteration, information extraction, item difficulty prediction, item response time prediction, and item distractor prediction.

PIN SHEN TEH has been teaching for more than a decade, mainly with higher education institutions. He has experience in teaching ICT and coding to students aged 6-16. His teaching focuses on programming and database subjects. He is the ManMet Minecraft Project Pioneer and a Minecraft Certified Trainer. His research interests include practical machine learning applications, biometrics systems, and metaverse.

TABLE 2. Parameter settings of the pre-trained multilingual language models.

TABLE 5. Effect of varying the number of V.L.C features on the accuracy of the author gender identification task.

TABLE 6. Effect of values of n for the V.L.C features on the accuracy of the author gender identification task.