Textual Pre-Trained Models for Gender Identification Across Community Question-Answering Members

Promoting engagement and participation is vital for online social networks such as community Question-Answering (cQA) sites. One way of increasing the contribution of their members is by connecting their content with the right target audience. To achieve this goal, demographic analysis is pivotal in deciphering the interest of each community fellow. Indeed, demographic factors such as gender are fundamental in reducing the gender disparity across distinct topics. This work assesses the classification rate of assorted state-of-the-art transformer-based models (e.g., BERT and FNET) on the task of gender identification across cQA fellows. For this purpose, it benefited from a massive text-oriented corpus encompassing 548,375 member profiles including their respective full-questions, answers and self-descriptions. This assisted in conducting large-scale experiments considering distinct combinations of encoders and sources. Contrary to our initial intuition, in average terms, self-descriptions were detrimental due to their sparseness. In effect, the best transformer models achieved an AUC of 0.92 by taking full-questions and answers into account (i.e., DeBERTa and MobileBERT). Our qualitative results reveal that fine-tuning on user-generated content is affected by pre-training on clean corpora, and that this adverse effect can be mitigated by correcting the case of words.


I. INTRODUCTION
The term demography is universally understood as the study of human populations and their changes. It seeks to describe people in relation to characteristics such as gender, age and religion. Therefore, demographic analysis is vital for identifying audiences and adapting content to their interests, levels of understanding, attitudes, and beliefs. In the case of cQA platforms, an audience-centered approach is crucial for maintaining an engaged community. It assists not only in encouraging increased participation by delivering attractive and targeted content according to personalized interests and motivations, but also in establishing effective connections between recently asked questions and community peers that can produce appropriate and timely responses. Intuitively, one way of achieving this is by designing landing/home pages tailored to each specific demographic segment and personality type.
Along with this, as might be expected, having easy access to demographic variables is useful for detecting identity theft and fraud, enforcing terms of service and local laws, and filtering and banning fake profiles. Simply put, these factors are particularly useful for properly dealing with assorted malicious activities. Incidentally, cQA sites also suffer from gender differences since they tend to reflect our daily lives. One way these differences manifest is in the disparities in the number of female and male authors across their distinct topics [1]. Gender analysis plays a pivotal role in giving members the opportunity of fair gender representation in categories with biased participation. Like most websites that require membership, online social networks ask their newcomers to fill out a form with their personal information when registering. In these forms, fields such as age and gender are optional. To a great degree, people choose ''rather not say'' out of discretion and/or because they just want to get through the registration process as fast as they can. To help with user profiling, the last two decades of advances in machine learning and Natural Language Processing (NLP) have made it possible to infer informative patterns from textual content.
The associate editor coordinating the review of this manuscript and approving it for publication was Agostino Forestiero. VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In the last couple of years, transformers have aroused intense interest due to their effectiveness in language understanding, vision and reinforcement learning [2], [3]. Consequently, extensive research has been undertaken to improve this class of models in terms of its adaptation, efficiency and generalization. As a result, a rich variety of architectures currently exists, some of which have been devised to work well under certain conditions and to target specific tasks.
In short, this work enhances the existing body of knowledge on cQA platforms by assessing assorted state-of-the-art encoders for text-based gender recognition. More precisely, our study makes the following contributions:
1) We fine-tune state-of-the-art pre-trained models capable of gender identification from writing on cQA sites.
2) By benefiting from a massive automatically annotated dataset, we conduct a comprehensive empirical assessment of a wide variety of pre-trained transformers.
3) We provide experimental evidence showing that dataset similarity between the pre-trained architecture and the downstream task influences the outcomes. Via NLP processing, transfer learning can be improved by modifying the target dataset to increase its similarity with the dataset utilized for pre-training the encoder.
Our results suggest that each gender has its own distinctive patterns of interaction within cQA platforms, and that most of these differences are expressed in a way that is recognizable using natural language understanding techniques.
The remainder of this paper is organized as follows. Section II discusses related work. Sections III and IV present the research questions and methods, respectively. Section V discusses the experiments, results and findings. Finally, Sections VI and VII draw conclusions, discuss limitations and outline future work.

II. RELATED WORK
First and foremost, the primary goal of this study is to compare the performance of various state-of-the-art transformers for automatically recognizing genders across cQA users. Due to the multifariousness of human behavior, community fellows engage with these platforms in different ways and exhibit varying levels of activity. For instance, some participants use the site to ask questions, whereas others interact with the site mainly to answer questions. Therefore, this work fine-tunes and assesses frontier encoders on several combinations of textual inputs, namely questions, answers and self-descriptions.

A. PRE-TRAINED DEEP NEURAL NETWORKS
The latest developments in neural networks allow the training of very deep architectures that can adequately cope with a vast variety of NLP tasks, such as text classification and machine translation. Beyond a shadow of a doubt, transformer models represent a significant breakthrough in this field [4], [5]. Their underlying idea has proven simple but very powerful. It consists of pre-training language model objectives on large networks with massive amounts of unlabeled data, and adjusting these networks to downstream tasks afterwards [6], [7]. OpenAI GPT and BERT are two pioneers of this approach [8], [9]. Since their inception, new variants have been devised to improve this first generation of encoders from different perspectives, including their adaptability, efficiency and generalization [10], [11], [12], [13].
Although these pre-trained models (PTMs) have achieved promising results in numerous difficult tasks, and thus turned into the de facto architecture for NLP [10], they still face many challenges: designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis [14]. It is an accepted fact that PTMs represent knowledge as real-valued vectors, in contrast to the symbolic representations used by human beings.
It has been discovered that architectures such as BERT capture linear word order and phrase-level information in their lower layers [15]. In particular, deeper tiers are needed to model long-distance dependencies (e.g., subject-verb agreements) [16]. Attention weights have been shown to be weak indicators of subject-verb agreements and reflexive anaphora [15]. While there is a wide consensus in studies with different tasks, datasets, and methodologies that syntactic information is most prominent in the middle layers [17], there are some disagreements regarding semantic features. Some studies suggest that semantic features are encoded at the top, whereas others suggest that they are spread throughout the entire model [18]. By contrast, surface features are codified at the bottom. Essentially, these models have been observed to imitate traditional tree structures [16] and to represent the steps of the traditional NLP pipeline [18]. However, it is yet to be seen how well these findings transfer to domains with higher variability in syntactic structures (e.g., noisy user-generated content) and/or with more flexible word orders, as in morphologically richer languages [16].
Despite enabling important breakthroughs in various conventional NLP benchmarks, an increasing number of studies are revealing that their language skills are not as impressive as initially thought [17]. For example, it has been demonstrated that they depend on shallow heuristics when classifying texts [19]. Although it is true that large PTMs are capable of holding a vast amount of knowledge, they typically fail if any reasoning is required on top of their stored facts [20], [21]. Moreover, some of this knowledge is lost after fine-tuning because of network capacity or under-representation of probing facts. Therefore, forgetting is not necessarily or significantly lessened by capitalizing on additional information harvested from larger corpora [22].

B. GENDER IDENTIFICATION ON cQA PLATFORMS
There are only a few studies addressing the detection of genders across cQA users, as evidenced by numerous recent surveys in this area [23], [24], [25], [26], [27]. Most of these studies relate to image processing, more specifically, to learn gender-informative visual patterns from profile avatars. For instance, heuristic methods have been utilized for automatically guessing genders on Stack Overflow [28]. Here, nonfacial avatars pose a tough challenge even for ocular inspections [28]. Therefore, image-based pre-trained neural network models have also been evaluated using multifarious profile pictures [29].
On the opposite side of the spectrum, the research of [1] addresses the problem of automatically discriminating the gender of the person who asked a question using the question texts and metadata, demographics, and web searches. By building a wide diversity of high-dimensional vector spaces and exploiting the genders entered when the user signed up, they trained three supervised approaches on top of a large-scale corpus. They discovered that age, industry and second-level question categories were salient features of the gender of an asker. Interestingly, the best text-only models sought to infer the same characteristics from semantic and dependency analyses.
On Yahoo! Answers, the investigation of [30] found some relationships between gender demographics and sentiment analysis, namely its synergy with attitude (i.e., inclination towards positive or negative sentiments) and sentimentality (i.e., number of sentiments). Women and men exhibited different attitudes across prompted questions and given answers: men were more neutral, whereas women were more positive in their questions and responses and were more sentimental when answering questions. Some gender differences across question types were found by [31] using data from the graphic design community on Stack Exchange and Quora. Women are more likely to respond to questions seeking opinions, while men produce more answers to factual questions on Stack Exchange. At both sites, responses from men had a more negative tone than women's answers, although this difference was not statistically significant.
Gender information helps to reduce male-female inequality in participation across distinct cQA categories. In this regard, it has been reported that, on Stack Overflow, females who encounter other members of the same gender are more likely to engage sooner than those who do not [32]. Another significant finding is a stronger tendency among women to post more questions and among men to provide more answers, which results in fewer thumbs-up and gives rise to lower average reputation scores for females [33], [34], [35]. Working from these findings, the authors designed a reputation strategy that lessens the gender gap by rewarding questions and answers with the same number of points. Along the same lines, the research conducted by [36] revealed that female users receive lower scores when responding, despite putting more effort into their contributions, revealing some gender bias in the scoring of answers on sites like Stack Overflow. This bias, combined with the fact that gamification strategies such as scores and badges are more appealing to men than to women [33], supports the need to devise alternative strategies to promote women's participation in cQA sites, especially when anonymity is allowed and gender information is not available.
Overall, recent studies point towards automatic gender identification as strategically vital to keep community members engaged with cQA websites.

III. RESEARCH QUESTIONS
By leveraging the power of transfer learning, we quantify and juxtapose the classification rate of assorted frontier pre-trained models when fine-tuned for text-based gender detection. To this end, we analyzed the performance of these state-of-the-art encoders by considering distinct combinations of the different textual contents found across member profiles (i.e., question titles and bodies, answers and self-descriptions).
Essentially, our predecessors have dealt with this subject by conducting analyses at the level of isolated questions only [1], or by targeting profile avatars [29]. In this work, we extend this notion to all texts within a member's profile, that is, we consider all questions posted by the same community peer together with all of his/her answers and self-descriptions.
Specifically, our primary goal is answering the following four research questions:
• RQ1: Is it possible to automatically detect gender across cQA members based on their textual interactions within the cQA site?
• RQ2: Are there any key differences in the performance among distinct encoders using similar input signals?
• RQ3: Are there any differences in the performance of the same model using different information?
• RQ4: What are the factors that influence the results obtained by the models?

IV. METHODOLOGY
In essence, our primary aim was to analyze and compare the performance of assorted PTMs on the task of automatic gender recognition on cQA websites. One of the pioneers, and at the same time one of the most widely used architectures, is BERT (Bidirectional Encoder Representations from Transformers) [8], [9]. It is based on a multi-layer bidirectional transformer, trained on clean plain text (i.e., the English Wikipedia and the BookCorpus) for masked word and next sentence prediction [6], [9]. BERT is able to understand the meaning of any word within a sentence in relation to ''the company it keeps'' [37], that is to say, all the remaining terms embodied within the same context. Its architecture consists of twelve transformer blocks and twelve self-attention heads with a hidden state of size 768. To classify textual content, it represents an entire sequence by the final hidden state h of its first token [CLS]. Then, a softmax classifier is appended on top as a means of predicting the probability of each category.
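As a rough illustration of the classification head just described, the sketch below applies a linear layer followed by a softmax to a toy [CLS] vector. The weight matrix `W`, the bias `b` and the three-dimensional `h` are made-up values chosen for readability (the real hidden size is 768), and none of it is taken from the authors' code.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h, W, b):
    """Linear layer on the [CLS] state h, followed by a softmax."""
    logits = [
        sum(w_i * h_i for w_i, h_i in zip(row, h)) + b_k
        for row, b_k in zip(W, b)
    ]
    return softmax(logits)

# Hypothetical toy values: one row of W per class.
h = [0.2, -1.0, 0.5]
W = [[1.0, 0.0, 0.5], [-0.5, 1.0, 0.0]]
b = [0.0, 0.1]
probs = classify(h, W, b)
```

During fine-tuning, the classifier weights and the encoder parameters are typically updated jointly, so `h` itself changes as training proceeds.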
Thus, BERT has been a source of inspiration for many other architectures. From this perspective, we considered the most representative models in our empirical settings. These are briefly described below:
• ALBERT (A Lite BERT) modifies its predecessor in two substantial ways: it uses a factorized embedding parametrization and it introduces a strategy for sharing cross-layer parameters [38]. The former facilitates the growth of the hidden size without markedly increasing the number of parameters of the vocabulary embedding. The latter prevents the number of parameters from growing in tandem with the depth of the network. Both proposals reduce the memory consumption and the training time of BERT.
• DeBERTa (Decoding-enhanced BERT with disentangled attention) represents each word via one vector that encodes its content and another that encodes its position. In addition, attention weights among terms are computed using disentangled matrices on their contents and relative positions. To predict masked tokens during pre-training, a mask decoder is utilized instead of an output softmax layer to incorporate absolute positions in the decoding layer. Furthermore, a new virtual adversarial training method was used for fine-tuning to improve generalization on downstream tasks [39].
• DistilBERT leverages knowledge distillation during pre-training to reduce the size of BERT, while maintaining almost all its language understanding capabilities. This reduction makes the model 60% faster, and by using a triple loss and distillation under the supervision of a larger transformer, it remains competitive on many downstream tasks [40].
• DistilRoBERTa is a distilled version of RoBERTa-base, obtained by following the same training procedure as DistilBERT. It has six layers, 768 dimensions, and twelve heads, decreasing the number of parameters from 125 to 82 million. On average, it is twice as fast as its predecessor.
• ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) pre-trains a discriminator (transformer) that determines whether every token is an original or a replacement, instead of only masking a fraction of tokens within the input [41]. A generator, another neural network, masks and substitutes tokens to generate corrupted samples. In practical terms, this model trains much faster than BERT, requiring significantly less compute, while at the same time, accomplishing a competitive accuracy on several downstream tasks.
• FNET replaces self-attention sub-layers with a simple unparameterized Fourier Transform on input tokens. It rivals efficient encoders while being much faster and lighter in memory demands. Because of its speed, the Fourier Transform proved to be an efficient mixing mechanism [42].
• Longformer tackles the quadratic growth in cost caused by self-attention when increasing the sequence length. As a substitute, its attention mechanism scales linearly via a drop-in replacement that amalgamates a locally windowed attention with a task-motivated global attention, making it easier to process much longer documents [43].
• MobileBERT is a thin version of BERT that is equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. It trains a specially designed teacher model, an inverted-bottleneck incorporated BERT model that enables effective layer-wise progressive knowledge transfer [44].
• RoBERTa is a robust strategy for training BERT models [45]. In short, it uses longer training times and sequences, bigger batches, and one order of magnitude more data than BERT for training. This battery of design choices additionally removes the goal of guessing the next sentence and dynamically changes the masking pattern employed on the training data.
• XLNet is a generalized autoregressive pre-training strategy that learns bidirectional contexts. Instead of exploiting a fixed forward or backward factorization order, it maximizes the expected likelihood over all permutations of the factorization order [46].
• XLM RoBERTa is a transformer-based multilingual masked language model that is pre-trained on texts harvested from one hundred languages. This encoder achieves state-of-the-art performance in cross-lingual classification, sequence labeling and question answering, and strong improvements have been observed when coping with low-resource languages. Interestingly, these outcomes have been achieved while remaining competitive with frontier monolingual models [47].
We benefited from the pre-trained models supplied by Hugging Face 1 as detailed in Table 1. We utilized the Simple Transformers 2 library for fine-tuning. All encoders used their default parameter settings to level the playing field. The number of epochs was kept at two so that a maximum training time of ten days was imposed. In practice, no substantial increase was experienced when going beyond one epoch, but we opted to give all transformers sufficient time to converge. The maximum sequence length was set to 512, and sliding windows used a stride of 0.95. The batch size was set to ensure that the GPU 3 memory was fully utilized. We used a batch size of eight except for large models (xlnet, xlmroberta, deberta, etc.), where a batch size of 128 was required to allow convergence.
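To make the sliding-window setting more concrete, the following sketch splits a long token sequence into overlapping windows. It assumes that the stride of 0.95 is a fraction of the maximum sequence length that determines how far the window moves at each step; this is one reading of the setting rather than a description of the library's internals, and special tokens are ignored for simplicity. The dictionary keys are illustrative names for the hyperparameters reported above.

```python
# Illustrative summary of the reported settings (key names are assumptions).
FINE_TUNING_SETTINGS = {
    "num_train_epochs": 2,
    "max_seq_length": 512,
    "sliding_window": True,
    "stride": 0.95,
    "train_batch_size": 8,
}

def sliding_windows(token_ids, max_seq_length=512, stride=0.95):
    """Split token_ids into windows of at most max_seq_length tokens.

    The window start advances by int(max_seq_length * stride) tokens,
    so consecutive windows overlap slightly.
    """
    step = max(1, int(max_seq_length * stride))
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_seq_length])
        if start + max_seq_length >= len(token_ids):
            break  # the current window already reaches the end
    return windows

# A hypothetical profile of 1,200 tokens yields three windows.
windows = sliding_windows(list(range(1200)))
```

Under these assumptions each window shares 26 tokens with its predecessor, so little computation is repeated while sentences cut at a window boundary still appear whole in one of the windows.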

V. EXPERIMENTS
To construct our dataset, we used 657,805 member profiles distilled from Yahoo! Answers, a corpus that was previously utilized in [48] and [49] for age analysis (see Figure 1). For annotation purposes (i.e., male or female), we capitalized on a series of heuristics [28]. First, we checked for user aliases contained in any of a group of seven publicly available gender-by-name collections, including WGND 4 and Howarder. 5 As a means of improving the matching, nicknames were lowercased and trimmed at their first space, hyphen, at sign, underscore or dot. Subsequently, characters outside the ASCII range from 97 to 122 (i.e., a to z) were removed. In the event of no alignments, the end of the alias was systematically trimmed one character at a time until a match was found or its length reached five characters. The final decision was made by counting the overall frequency of each gender.
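The alias-matching heuristic can be sketched as follows. The two tiny name dictionaries stand in for the seven gender-by-name collections and are purely hypothetical, as are the function names; the stopping rule at five characters follows one plausible reading of the description above.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the gender-by-name collections.
NAME_LISTS = [
    {"daniel": "M", "emily": "F"},
    {"daniel": "M", "emily": "F", "alexis": "F"},
]

def normalize_alias(alias):
    """Lowercase, cut at the first separator, keep only letters a-z."""
    alias = alias.lower()
    alias = re.split(r"[ \-@_.]", alias, maxsplit=1)[0]
    return re.sub(r"[^a-z]", "", alias)

def guess_gender(alias, name_lists=NAME_LISTS, min_len=5):
    """Return the majority gender for the alias, or None if no list matches."""
    candidate = normalize_alias(alias)
    while candidate:
        votes = Counter(
            names[candidate] for names in name_lists if candidate in names
        )
        if votes:
            return votes.most_common(1)[0][0]
        if len(candidate) <= min_len:
            break  # stop trimming once the alias is five characters long
        candidate = candidate[:-1]  # drop the last character and retry
    return None
```

For instance, `guess_gender("Daniel_1990")` matches after normalization, whereas `guess_gender("emilyxx")` only matches after two characters have been trimmed from the end.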
Second, lowercased n-grams 6 were extracted from all textual content by substituting numbers with a placeholder, and were afterwards ranked according to their entropy. After a manual inspection of low-ranked elements, almost 1,500 gender-indicative phrases were compiled and later used to revise the previous frequency counts. Thus, a final label was assigned to 548,375 (83.36%) out of the 657,805 fellows in the corpus (see statistics in Table 2). The overall distribution was as follows: 343,661 (62.67%) are women and 204,714 (37.33%) are men. The dataset was divided into 329,025 training, 109,675 evaluation and 109,675 testing instances using a random stratified sampling strategy, maintaining in each set the proportions of women and men. Of the 109,675 instances in the test set, 68,676 (62.6%) were women and 40,999 (37.4%) were men. Every piece of text used by the heuristics was removed from the respective profile of the user. Note that held-out evaluations were carried out in all our experiments by keeping these splits unchanged. The following abbreviations denote different empirical settings, designed to allow the identification of the individual contribution of each piece of information to classifier performance:
• T (question titles only)
• TB (question titles and question bodies)
• TBA (full questions and answers)
• TBAD (full questions, answers and self-descriptions)
Finally, we used the test samples exclusively to make an unbiased assessment of the final model fit on the training/evaluation instances.
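Returning to the phrase-mining step described above, the sketch below ranks n-grams by the entropy of their gender distribution, so that phrases used almost exclusively by one gender appear first. It assumes that the entropy is computed over the gender labels of the profiles containing each n-gram, which the text does not state explicitly; the placeholder `<num>`, the `min_count` threshold and the sample texts are illustrative choices.

```python
import math
import re
from collections import Counter, defaultdict

def ngrams(text, n):
    """Lowercased word n-grams with numbers replaced by a placeholder."""
    tokens = re.sub(r"\d+", "<num>", text.lower()).split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def entropy_ranking(labelled_texts, n=2, min_count=2):
    """Rank n-grams from lowest to highest gender entropy."""
    counts = defaultdict(Counter)
    for text, gender in labelled_texts:
        for gram in set(ngrams(text, n)):  # count each n-gram once per text
            counts[gram][gender] += 1
    ranking = []
    for gram, by_gender in counts.items():
        total = sum(by_gender.values())
        if total < min_count:
            continue  # too rare to be informative
        h = -sum((v / total) * math.log2(v / total) for v in by_gender.values())
        ranking.append((h, -total, gram))
    return [(gram, h) for h, _, gram in sorted(ranking)]

# Hypothetical labelled snippets.
sample = [
    ("my husband is 30", "F"),
    ("my husband likes tea", "F"),
    ("my wife is 28", "M"),
    ("my wife likes tea", "M"),
]
ranking = entropy_ranking(sample)
```

In this toy sample, "my husband" and "my wife" have zero entropy and would be candidates for manual inspection, while "likes tea" is evenly split and carries no gender signal.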
To answer RQ1 & RQ2, we fine-tuned several pre-trained transformers 1 using each of the four datasets described earlier, namely TBAD, TBA, TB and T. Our outcomes point towards MobileBERT and DeBERTa as the best options for this task, since both achieved a superior performance regardless of the metric and configuration (see Tables 3, 4 and 5). Adding self-descriptions does not significantly improve the classification quality of MobileBERT and DeBERTa. It is worth stressing here that only about 7% of the members within this corpus provide a self-description. We conjecture that this low proportion of examples is, in part, one of the reasons for the observed effect on the performance. Overall, these two fine-tuned encoders achieved an Accuracy, F1-Score, and AUC greater than 86.4%, 0.894 and 0.921, respectively. The worst performance can, most of the time, be attributed to one of the following three models: XLNet, BERT and ELECTRA. In particular, XLNet obtained the lowest scores when trained on question titles (73.82%) and on TBAD (74.15%). Our results suggest that one reason for this may be their sensitivity to the distinct input signals. Interestingly, these three alternatives are more competitive under a TBA configuration, but on the other hand, their performance drops significantly when considering self-descriptions or when discarding answers. For the most part, the gap between the best and the worst systems is the narrowest when fine-tuning using full questions and answers (approximately 6.6% accuracy). In contrast, training solely on full questions brings about the widest gap (approximately 14.92% accuracy). Furthermore, Table 3 indicates that the average accuracy was 80.51%, with a maximum of 86.66% (DeBERTa) and a minimum of 69.93% (ELECTRA).
Because our dataset was imbalanced, in addition to accuracy we report the F1 metric (Table 5) and the AUC score (Table 4), since these metrics are appropriate for comparing classifiers in the presence of imbalanced data [50]. Table 4 shows that the average AUC score was 0.8562, ranging from 0.6745 (XLNet) to 0.9069 (MobileBERT). Overall, these results show that it is possible to detect gender differences across community peers using textual interactions within the cQA site. They also reveal that models that are designed for efficiency and are case insensitive, such as MobileBERT, obtain the best average results (AUC: 0.9069). To be more specific, this cost-efficiency was observed when juxtaposing the classification rates accomplished using the following settings: DeBERTa (TBA) with an AUC value of 0.9247, closely [...] using random oversampling and random undersampling [51]. Both models achieved an AUC score of 0.92 and an accuracy of 0.85 on the balanced test set, similar to MobileBERT trained on the imbalanced dataset.
In summary, testing assorted fine-tuned transformers is pertinent to gender recognition across cQA platforms, as our experiments revealed a high variability in the classification rates among distinct encoders under the same input signals. We deem this to be a result of their sharply different specialized designs with different vocabulary sizes. Owing to their architectural differences, there is no rule for determining the right transformer and configuration for a particular downstream task.
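To illustrate why accuracy alone can be misleading on this dataset, the sketch below implements the three reported metrics in plain Python and contrasts them with a majority-class baseline. The toy labels and scores are invented for illustration; only the test-set counts (68,676 women out of 109,675 instances) come from the text.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented toy example: 1 = woman, 0 = man.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
y_pred = [int(s >= 0.5) for s in scores]

# Always predicting the majority class on the reported test set.
majority_accuracy = 68676 / 109675
majority_auc = roc_auc(y_true, [1.0] * len(y_true))
```

A classifier that always answers "woman" would reach roughly 62.6% accuracy on the test set while its AUC stays at 0.5, which is why the AUC and F1 figures are the more informative points of comparison.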
To answer RQ3, the data in Tables 3, 4 and 5 show that the best individual results were obtained using a combination of questions and answers (TBA, TBAD), and that the best results on average were obtained by the TBA models, which use only question titles, question bodies and answers (AUC: 0.8819, accuracy: 0.8293, F1: 0.8683). These models achieve a balance between accuracy, precision and recall. Models that included question bodies performed about 5% better than models trained on question titles only (AUC: 0.8651 vs AUC: 0.8184). With the inclusion of answers, the models exhibited an average increase of about 2% in performance (AUC: 0.8819). The inclusion of profile descriptions led to a decrease in the average performance (AUC: 0.8594), although some models exhibited a small positive effect. Considering that only 7% of the profiles had descriptions, these results suggest that profile descriptions may be omitted without significantly affecting performance.
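The average AUC values quoted above can be turned into explicit gains with a few lines of arithmetic. The sketch below reports both the absolute difference (in AUC points) and the relative change; which of the two the rounded percentages in the text refer to is not stated, so both are shown. The dictionary and function names are illustrative.

```python
# Average AUC per input configuration, as quoted in the text.
AVERAGE_AUC = {"T": 0.8184, "TB": 0.8651, "TBA": 0.8819, "TBAD": 0.8594}

def gain(before, after, scores=AVERAGE_AUC):
    """Return (absolute gain in AUC points, relative gain in percent)."""
    delta = scores[after] - scores[before]
    return delta * 100, delta / scores[before] * 100

bodies = gain("T", "TB")           # adding question bodies
answers = gain("TB", "TBA")        # adding answers
descriptions = gain("TBA", "TBAD") # adding self-descriptions
```

Adding bodies yields about 4.7 points (5.7% relative), adding answers about 1.7 points (1.9% relative), and adding self-descriptions lowers the average by about 2.3 points (2.6% relative).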
Regarding RQ4, uncased models (mobilebert-uncased, fnet-base) tend to perform better on our cQA dataset, which suggests that the writing of questions and answers on the cQA site is different from, and perhaps less formal than, that of the clean corpora on which most models were originally trained (e.g., BookCorpus and English Wikipedia). The use of uncased models appears to mitigate some of the differences in writing between the datasets. Table 6 shows a comparison of the results obtained by the cased and uncased versions of the models with higher average performance, trained on the raw dataset and on a dataset with corrected case (true case). Distilbert-base-uncased performed slightly better than its cased counterpart on the raw dataset. When the case is corrected, the distilbert-base-cased model performs better than when trained on the raw dataset, but it is still below the performance of distilbert-base-uncased. This pattern also occurs with distilroberta and xlm-roberta: the models performed better when trained on the dataset with the corrected case.
The results obtained using the deberta-base cased model require further analysis. The Deberta results may be explained by the fact that it is trained on OpenWebText and STORIES. OpenWebText is a corpus generated from Reddit, a social media platform where users can add their own content and other users can rate that content, probably leading to writing that is more similar to that of a cQA site. This relative similarity in content may lead Deberta to produce a language representation that is better suited to learning how to represent cQA content. The Deberta-base model does not perform significantly better when trained on the true case dataset. The AUC of the TBA model did not improve in relation to the raw dataset, and the TBAD model performance improved from 0.921 to 0.926. The Deberta-base model trained on the true case dataset achieved the best score of all trained models, with an average AUC of 0.9085, slightly above that of the mobilebert-base model trained on the raw dataset, with an AUC of 0.9069. Nevertheless, the complexity of Deberta is five times larger than that of MobileBERT. Figure 3 shows confusion matrices for deberta-base trained on the true case dataset. Its error rates are very similar to those of MobileBERT; only the TB model shows a 1% increase in its female detection capability. Regarding ROC curves, Figures 4 and 5 show that the performance of both models is indistinguishable.
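As an illustration of what correcting the case of words can involve, the sketch below implements a naive frequency-based truecaser. The text does not say which tool or method was used to build the true case dataset, so this is a stand-in technique: the lexicon is learned from a few hypothetical clean sentences, and the function names are invented for this example.

```python
import re
from collections import Counter, defaultdict

def build_case_lexicon(clean_sentences):
    """Map each lowercased word to its most frequent casing in clean text."""
    forms = defaultdict(Counter)
    for sentence in clean_sentences:
        for i, token in enumerate(sentence.split()):
            if i == 0:
                continue  # sentence-initial casing is uninformative
            forms[token.lower()][token] += 1
    return {low: c.most_common(1)[0][0] for low, c in forms.items()}

def truecase(text, lexicon):
    """Restore casing word by word and capitalize sentence starts."""
    out, start = [], True
    for token in text.split():
        match = re.match(r"^(\w+)(\W*)$", token)
        if match:
            core, tail = match.groups()
            fixed = lexicon.get(core.lower(), core.lower()) + tail
        else:
            fixed = token  # leave emoticons and mixed tokens untouched
        if start:
            fixed = fixed[:1].upper() + fixed[1:]
        out.append(fixed)
        start = token.endswith((".", "?", "!"))
    return " ".join(out)

# Hypothetical clean reference sentences.
lexicon = build_case_lexicon(
    ["We met in London last year", "She likes Yahoo Answers"]
)
fixed = truecase("i MET her IN london. is that WEIRD?", lexicon)
```

Uncased models sidestep this step by lowercasing all input, which may explain why they are less sensitive to the irregular casing of user-generated text.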
Based on these results, we selected mobilebert-uncased TBA as the model with the best balance between performance and complexity.

A. VISUALIZATION AND EXPLANATION OF MODEL BEHAVIOR
Explainable AI (XAI) [52], [53], [54] and explainability of transformers [55], [56], [57], [58], [59], [60] are active areas of research. There are multiple approaches to constructing explanations for transformer models, many of them relying on the use of attention weights. To understand the classifications provided by our models, we analyzed the mobileBERT model (T and TB versions) to explain the classification of the samples for users Daniel and Emily shown in Figure 1, which were not included in the samples used for training.
We selected the sample for Daniel because it was mislabeled by the T model but correctly classified by the TB model, allowing us to analyze the reason behind the improvement. The sample for Emily was correctly classified by both models and allows us to analyze whether the reason for the classification changed between the T and TB models. For model understanding, we used attention visualization [55], [57], [58] and attribution [56] using the Python package transformers-interpret. 7 Figures 6 and 7 show the attention weights in the first layer of the mobileBERT T model for head 1 (top) and head 2 (bottom). In both figures we see two common attention patterns [58]: head 1 has a heterogeneous pattern and head 2 a diagonal pattern. The heterogeneous pattern of head 1 suggests that the model learned semantic relations, like the relation between the word 'is' and the words 'weird' and 'interesting' in Figure 6. The diagonal pattern appears when attention is between previous, current and next words, like the relation between 'am' and 'pregnant' on head 2 depicted in Figure 7.
The sample for the masculine user was incorrectly classified by the mobileBERT T model and correctly classified by the TB model. To explain the difference, Figure 8 shows the attribution scores [56] of the T model (top) and TB model (bottom). Highlighted in green are the words that influenced the conclusion reached by the model, while in red are the words supporting the other option, discarded by the model. The T model classified the masculine sample as feminine based mainly on the use of the word interesting and the emoticon :) (attributions 0.16 and 0.32, respectively). This behaviour is consistent with previous analyses [30] that found an association between female interventions and positive sentiments. The TB model made the correct classification as masculine, influenced by the words met, girl and baseball (attributions 0.53, 0.39 and 0.23). For the feminine sample, the words healthy, pregnancy and pregnant (attributions 0.28, 0.28 and 0.47) influenced the classification in both models, while the words period and couple (attributions 0.20 and 0.34) complemented the decision in the TB model. These behaviors are more related to the topic discussed by the users, and to the use of certain words when talking about something (meeting a woman, pregnancy) or when referring to oneself (pregnant). In both cases, the results suggest that the models learned a relationship between some topics and gender, based on the information inside the corpus used for training. Previous works had found topic and intent differences between masculine and feminine participants in cQA sites [31].
7 https://github.com/cdpierse/transformers-interpret.git

B. CAVEATS
Genders were assigned according to how members identified themselves on the website. As a rough check, we manually assessed 100 randomly chosen labelled profiles and obtained an error rate of 10%. Aside from errors attributable to the intrinsically shallow nature of our heuristics, some individuals run fake profiles. Our preliminary manual inspection did not find that other sexual orientations made up a substantial share of this dataset. Owing to their discretion and/or low participation, it is also difficult to compile a comprehensive list of their typifying names and phrases.

VI. LIMITATIONS AND FUTURE RESEARCH
Apart from the aforementioned considerations, there are some additional aspects that must be weighed carefully. First, self-descriptions suffer severely from data sparseness: only a low percentage of the members (7%) provide this short biography on Yahoo! Answers. Intuitively, one can expect a high probability of finding pieces of information conveying demographics in this sort of text. Therefore, its real impact should be quantified by studying platforms whose users are more likely to describe themselves, for example services such as Reddit and Stack Exchange. In the same spirit, different ways of integrating this class of input into a (probably joint) model can be explored in future work.
Second, if significant computational power and massive cQA collections are accessible, one could think about pre-training frontier transformers on user-generated cQA texts. Doing this poses several exciting challenges, for instance, whether or not to clean the corpus. When these resources are inaccessible, the transfer of knowledge can still be improved by resolving community jargon, spellings, aliases, entities and acronyms. Additionally, we conjecture that pre-training title-only models will be beneficial, but this will need special adjustments, since the grammar exhibited in question titles is sharply different from what we find across question bodies and answers.
Lastly, we also envision that exploiting multilingual architectures and texts written in different languages can help to enhance the gender detection rate, especially across users linked to very few questions and answers posted in English. On top of that, multilingualism might assist in dealing with the data sparseness in self-descriptions. In the case of Yahoo! Answers, extra biographies can be harvested from some Spanish-speaking members. Recall that our study focused only on textual content in English, which was singled out via a language detector.

VII. CONCLUSION
Regarding RQ1, we concluded that it is possible to infer the gender of a community peer on a cQA site from their interactions with the site. Better results were obtained by models using full questions (title and body) combined with the answers provided by the person. Uncased models (i.e., MobileBERT and FNET) and models trained on varied datasets, like DeBERTa, performed better than models trained on clean corpora such as BookCorpus or English Wikipedia, showing that not every pre-trained model leads to the same classification results (RQ2, RQ3). Another important conclusion is that adding more information does not always lead to better results, since some TBA models performed better than their TBAD counterparts. The differences in results may be explained by differences in writing across datasets (RQ4), because correcting the case of the words improved the results of cased models. This claim could be further investigated in future work by training the models on an updated dataset in which misspelled words are corrected, to ease transfer learning from the clean corpus.
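As a rough illustration of what correcting the case of user-generated text can look like, consider the following heuristic sketch. It is an assumption made for illustration and does not reproduce the procedure used in our experiments: the function name `correct_case` and its three rules are hypothetical, and the first rule will also lowercase genuine acronyms.

```python
import re


def correct_case(text):
    """Apply three crude case-correction rules to informal text."""
    # 1. Lowercase "shouted" words written entirely in capitals (2+ letters).
    text = re.sub(r"\b[A-Z]{2,}\b", lambda m: m.group(0).lower(), text)
    # 2. Capitalize the standalone pronoun "i" (also covers "i'm", "i've").
    text = re.sub(r"\bi\b", "I", text)
    # 3. Capitalize the first letter of the text and of each sentence.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text


cleaned = correct_case("i am PREGNANT. is this normal? i think so")
```

In practice a trained truecaser would likely handle proper nouns and acronyms better; the sketch only shows where such a step would sit before fine-tuning a cased model.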
Model selection appears to be an important issue in the context of natural language understanding applied to cQA sites. We summarize our findings in the following guidelines for model reuse:
• To improve transfer learning, select a model trained on a dataset with the closest similarity in writing (formal, informal) to the dataset used for the downstream task.
• If the selected model is cased, preprocess the dataset to correct case before training.
• When using a model pre-trained on a clean corpus, consider fine-tuning the model using a dataset where misspelled words are corrected.
We conclude that gender recognition based on writing may be helpful in profiling users in cQA sites, and as a tool to design interventions to promote equal engagement and participation in online communities.