A Topology-Based Approach for Predicting Toxic Outcomes on Twitter and YouTube

The benefits of an information ecosystem based on social media platforms came at the cost of the rise of several antisocial behaviours, including the use of toxic speech. To assess the aspects that concur with the formation of toxic conversations, we provide a multi-platform comparison on Twitter and YouTube between the 2022 Italian Political Elections, representing a potentially polarising topic, and the Italian Football League, a topic close to the country's popular culture. We first probe structural and conversational toxicity differences by analyzing 257 K conversations (3.7 M posts, 1 M users) on both platforms. Then, we provide a machine learning approach that, by leveraging the previous features, identifies the presence of the following toxic comment in different stages of conversations. We show that football tends to exhibit lower toxicity levels than politics, with the latter producing more extended conversations that attract a broader audience and consequently fostering the polarization phenomenon. The implemented classifiers resulting from the conversation stage-based approach achieve state-of-the-art performances despite a restricted set of features. Furthermore, our cross-topic comparison shows that models trained on a divisive topic can be applied to other discussions without causing a degradation of their performance.

potential benefits, social media are also considered responsible for a number of issues, such as: fostering the spreading of online misinformation [6], the emergence of echo chambers [7], and increasing intolerance expressed in online debates [8], [9], [10]. These debates are characterized by several antisocial behaviours like cyberbullying [11], sexual harassment [12], trolling [13], and hate speech [14] which, in turn, contribute to the rise of individual and societal problems [15], [16], [17]. Therefore, to pursue the development of safer digital environments, it is crucial to identify early warnings of emergent toxicity and adequately moderate them. Many scholars have already faced this challenge with a mixture of approaches that ranged from the analysis of conversation cascades [9], [10], [18], [19] to Machine Learning (ML) [20], [21], [22], [23], [24], [25], [26], [27], [28]. A potential driver for the emergence of toxic speech may be the topic of discussion; another one is the platform on which the topic is debated. Yet, another one may be how the conversation evolved from the point of view of both structure and tone.
To explore these mechanisms, we provide a multi-platform comparison regarding the rise and the prevalence of toxic speech that results in developing a machine learning model able to predict the emergence of toxicity in a conversation. As a case study, we consider online debates on Twitter and YouTube around two topics of national interest: the Italian Political Elections held in 2022 and the 2022 Italian Football League.
In more detail, we first compare how toxicity evolved on the two topics and platforms, understanding which factors may have contributed to its prevalence. Then, we exploit conversation trees made up of comments to understand their structural properties using a set of cascade metrics. Finally, we provide a ML approach that, by leveraging the previous features, predicts the appearance of the following toxic comment in different stages of conversations.
From a conversational perspective, our findings suggest that a divisive topic, namely politics, tends to exhibit higher toxicity levels than a topic close to popular culture, such as football, producing more extended conversations that attract a broader audience. Lastly, the classifiers resulting from the stage-based approach achieve state-of-the-art performances despite a restricted set of features. Furthermore, our cross-topic comparison shows that models trained on a more toxic topic, namely political elections, can be generalized to other discussion arguments without causing a degradation of their performance. Our results suggest that both the cultural context and the conversation stage should be considered when developing tailored automatic moderation tools.¹

A. Defining Toxicity on Social Media
The definition of online toxicity and toxic behaviors has evolved over the years due to its interdisciplinary nature and the role of context. Prior work on this topic coined the term hateful speech, referring to any speech expressing hatred by the author against a person or people based on their identity [29]. Similar definitions from the juridical literature defined hateful speech as any form of expression that can increase harassment towards individuals or groups due to some characteristics they share or affiliation [30]. A further advance in the definition of toxicity was made in recent years by the United Nations, which formalized the concept of hate speech as "any kind of communication in speech, writing or behavior, that attacks or uses pejorative or discriminatory language regarding a person or a group based on who they are, in other words, based on their religion, ethnicity, nationality, race, color, descent, gender or other identity factors" [31]. More recently, researchers from Google Jigsaw, contextually with the introduction of their Perspective Application Programming Interface (API) [32], defined toxic content as any content characterized by "rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion" [33].

B. Conversation Cascades and Toxicity Dynamics
Conversation cascades are an instance of the so-called information cascades whose properties and insights have been observed for years [19]. Despite the prior knowledge, the problem of curating online conversations has attracted increasing interest due to the societal implications it has [23], [34], [35]. Prior research efforts on this topic investigated the topological structures of conversations [36], [37] and proposed new generative models [29], [38] for their reconstruction. From a social media perspective, scholars made an extensive effort to analyze conversations and their role in anti-social behaviors like harassment [12], [39], spreading of misinformation [9] and trolling [8], [13]. Moreover, it was found that users tend to concentrate their anti-social efforts on a small number of threads [40], providing no evidence for the presence of "pure haters" [35]. From a dynamic perspective, it was observed how discussions on YouTube tend to degenerate towards increasingly toxic exchanges of views [35]. Such exchanges, however, have been demonstrated not to nourish misinformation spreading on social media [9]. Finally, a stream of work investigated the predictive power of structural content and user features to identify toxic comments and anti-social behaviors [10], [29], [41], [42], achieving important results in the selection of features to employ in the automatic identification of toxic elements.

¹ Additional information for this paper can be found at https://osf.io/qdr7f

C. ML for Toxicity Identification
From a ML perspective, the non-trivial task of identifying the presence of toxicity in online conversations has attracted increasing interest due to its implications for society and the technical challenge it poses. Researchers achieved promising results by applying architectures that ranged from traditional classifiers [20], [21], [22], [23], [24] to deep learning approaches, including Recurrent Neural Networks (RNN) [25] and Natural Language Processing (NLP) [24], [26], [27], [28]. Along this path, in 2017, Google Jigsaw introduced Perspective API [32], [43], a ML system that detects toxicity of online comments [44]. Despite its initial criticism [45], [46], the API was employed by multiple research works [9], [10], [47], [48], [49], being recognized as a state-of-the-art tool for quantifying online toxicity. Its extensive use in research has nevertheless been accompanied by various criticisms. Indeed, it was shown how adversarial approaches can alter toxic content so that the system assigns it significantly lower scores [45], [46]. Moreover, recent benchmarking tools [50] revealed pitfalls in Perspective API's toxicity recognition capabilities on many categories, demonstrating instead how GPT-3 approaches may perform better. The reasons can be found in toxicity detection models' limited ability to contextualize conversations [51], [52]. Indeed, these models often struggle to incorporate factors other than text, such as the participant's personal characteristics, relationships and the overall tone of the conversation [51]. Consequently, what is considered toxic content can vary significantly among different groups, such as ethnicities or age groups [53], leading to potential biases. These biases may stem from the annotators' backgrounds and the datasets used for training, which might not adequately represent cultural heterogeneity. Additionally, subtle forms of toxic content, like sarcasm or memes that target specific groups, can be particularly challenging to detect. Therefore, recent advances in applying transformer-based models to identify toxicity show how specific feature combination strategies [54] and ensemble models [55] achieve promising results. Finally, researchers evaluated the ability of Generative Pre-trained Transformers (GPTs) to create synthetic datasets which can serve as input for deep learning architectures [56].

A. Data Collection
We collect social media data concerning the 2022 Italian Political Elections and Football League. The first topic, Italian Elections, is known for being a polarising topic, especially in the case of the 2022 Italian Elections, where a strongly conservative party participated and won the elections, nourishing phenomena like echo chambers and polarization [57] and, eventually, offline disorders. Instead, the motivation for choosing the Italian Football League as a proxy for Italian popular culture is twofold. From a relevance perspective, football in Italy has the highest number of teams, thus a large geographical and media coverage, and it receives the highest number of public investments [58]. From a toxicity perspective, we chose football due to its ability to spark anti-social behaviours, including tumults and brutal acts of violence [59], [60], which have the potential to be correlated with division and anti-social behaviors online.
The collection of posts and comments was performed on Twitter and YouTube to compare two regulated environments that rely on different media types, namely text messages for Twitter and videos for YouTube. The analysis includes all posts published from 25/08/2022 to 25/12/2022 with the corresponding comments. This period was suitable to capture both the social media debate around the Italian electoral campaign and the Italian Football League. Indeed, the chosen time window captured the election day, which was on September 25, 2022, the following debate between the political parties involved once the winners were announced, and the first stages of the Football League, which started on 21/08/2022.
For the Football topic, we look for all posts containing at least one hashtag that refers to the Italian Serie A League team names and their slogans. Then, for each obtained post, we collect all the corresponding comments. The same approach was applied to the Elections topic, the only difference being that the search hashtags refer to political parties, exponents and general terms used by newspapers.
On Twitter, the data collection was performed by using the Twitter API for Academic Research [61], producing a total of 3.6 M posts for both topics, published by 300 K users, and 8.2 M Italian comments, identified by using Google's Compact Language Detector 3 (CLD3), from 550 K users (see Table I for further details). On YouTube, instead, posts with their comments were collected using the YouTube Data API [62], resulting in a dataset of 87 K posts for both topics published by 10 K channels, which produced 2.6 M Italian comments, again identified with CLD3, from 381 K users commenting (see Table I for further details).

B. Toxicity Labelling
In the current paper, we refer to toxic content using the definition provided by Google Jigsaw, which identifies as toxic any content that is "rude, disrespectful, or unreasonable language likely to make someone leave a discussion" [33]. Consistently with the authors of this definition, the toxicity content classification is based on Google Jigsaw Perspective API [32]. This API uses a ML model [44] to provide a score ranging from 0 to 1, indicating the probability that a reader would perceive the comment as toxic [63]. To define an appropriate threshold, we draw from the existing literature [9], [10], [63], indicating that any content with a toxicity score ≥ 0.6 is considered toxic. To assess the validity of this threshold, we also performed content classification with thresholds of 0.5 and 0.7. Among all topics and platforms, the 0.6 threshold provided the best tradeoff between the percentage of classified elements and the size of the resulting dataset to employ for the training of toxicity classifiers.
By applying Perspective API, we quantify the toxicity of 98.6% of the total number of posts and comments in the dataset (see Table I for further details). The remaining 1.4% comprises all those contents for which the model failed to produce a toxicity score. This scenario may happen with texts containing only emojis, special characters or lexical elements for which the API did not quantify their toxicity [44].
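The thresholding step can be illustrated with a minimal sketch. It assumes the toxicity scores have already been retrieved; the function name `label_toxicity`, the label strings, and the use of `None` for content the API could not score are hypothetical conventions chosen for illustration, not part of the original pipeline.

```python
from typing import Optional

TOXICITY_THRESHOLD = 0.6  # threshold adopted in the text


def label_toxicity(score: Optional[float],
                   threshold: float = TOXICITY_THRESHOLD) -> str:
    """Map a score in [0, 1] to a label.

    `None` stands for content without a score (hypothetical convention).
    """
    if score is None:
        return "unscored"
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return "toxic" if score >= threshold else "non-toxic"


# Toy scores, invented for illustration only.
scores = [0.05, 0.6, 0.59, None, 0.91]
labels = [label_toxicity(s) for s in scores]

# Share of elements that received a score (98.6% in the paper's dataset).
coverage = sum(label != "unscored" for label in labels) / len(labels)
```

Re-running the same function with `threshold=0.5` or `threshold=0.7` mirrors the sensitivity check described above.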

C. Conversation Cascade Reconstruction
We model a conversation cascade as a directed tree graph $T = (V, E)$, where $V = \{1, \ldots, n\}$ represents the set of nodes and $E = \{1, \ldots, m\}$ the set of links. Each node $v \in V$ can be either an original post that started the conversation, representing the tree's root, or a comment. On both platforms, the tree's root is characterized by an identifier (ID) that uniquely defines the conversation, shared by other nodes through the conversation_id attribute on Twitter and by the video_id on YouTube. The edges $e \in E$ instead represent the act of replying that links a node $v_j$ to a node $v_i$, with $j > i$. For instance, the edge $e_1 = (v_1, v_2)$ means that the comment made by node $v_2$ replied to the node $v_1$, which can be another comment or the root.
We implement the following procedure to reconstruct the conversation trees on each social media platform. On Twitter, we start from the root node and iterate on its children whose parent, represented by the in_reply_to_id attribute, corresponds to the root ID. For each identified node, we recursively look at their children with the same rationale until we reach all the tree leaves. The same procedure is applied on YouTube. However, in case of sub-conversations starting from a comment node $v_i$, YouTube will always indicate $v_i$ as the parent of these nodes, despite the fact they may have replied to a child node $v_j$. Such limitations may prevent the algorithm from reconstructing the actual cascade structure. To overcome this problem, we apply a heuristic to reconstruct the tree by looking at the latest comment posted by the user mentioned in a message (referring to its username indicated by @Username). If no username is found in the text, we indicate as the parent of the comment the root of the tree, i.e., the original post. Otherwise, we assign as the parent of the comment the ID of the most recent comment node posted by the user identified by its username in the sub-conversation. Finally, we label the nodes on both platforms based on the toxicity score of the element, as described in Section III-B. The resulting structure from this process is represented in Fig. 1.

Fig. 1. A node in green represents content whose text was identified by Perspective API with a toxicity score < 0.6, whilst a red node identifies an element with a toxicity score ≥ 0.6. Finally, grey nodes represent all those contents for which the API could not quantify their toxicity.
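The reconstruction procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: comments are plain dictionaries with hypothetical keys (`id`, `author`, `text`, `platform_parent`), mentions are matched with a simple `@\w+` pattern, the traversal is iterative rather than recursive, and the fallback to the platform-reported parent when the mentioned user has no earlier comment is an assumption added here.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")  # assumption: usernames match \w+


def collect_tree(root_id, parents):
    """Collect every node reachable from the root via reply links.

    `parents` maps node_id -> parent_id (e.g. the reply-to attribute).
    """
    children = defaultdict(list)
    for node, parent in parents.items():
        children[parent].append(node)
    seen, stack = [], [root_id]
    while stack:
        node = stack.pop()
        seen.append(node)
        stack.extend(children.get(node, []))
    return seen


def youtube_parent(comment, thread, root_id):
    """Heuristic parent of a reply inside a YouTube sub-conversation.

    `thread` holds the earlier comments of the same sub-conversation,
    sorted by timestamp (oldest first).
    """
    match = MENTION.search(comment["text"])
    if match is None:
        return root_id  # no mention: attach to the original post
    user = match.group(1)
    for earlier in reversed(thread):  # most recent comment first
        if earlier["author"] == user:
            return earlier["id"]
    # Assumption: mentioned user not found, keep the reported parent.
    return comment["platform_parent"]


# Toy sub-conversation, invented for illustration.
thread = [
    {"id": "c1", "author": "anna", "text": "first", "platform_parent": "video"},
    {"id": "c2", "author": "luca", "text": "@anna no", "platform_parent": "c1"},
]
reply = {"id": "c3", "author": "anna", "text": "@luca why?", "platform_parent": "c1"}
parent_of_reply = youtube_parent(reply, thread, "video")
```

In the toy thread, the platform reports `c1` as the parent of `c3`, while the mention-based heuristic re-attaches it to `c2`, the latest comment by the mentioned user.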

D. Cascade Metrics
To provide a comparison between cascades, we define two categories of metrics. The first one, called structural metrics, refers to those features that only depend on the number of nodes and links in a graph and their toxicity score. The second, named conversational metrics, refers to additional information that is not strictly related to the topology of each conversation tree.
b) Depth: The Depth $D(T)$ is the distance $d$ of the deepest node in the conversation, which also coincides with the tree's diameter, i.e., the longest shortest path between the root node and any other node in the graph. The depth can be expressed as
$$D(T) = \max_{v \in V} d(r, v),$$
where $r$ is the root node. The deeper a conversation tree is, the more direct exchanges happen in the discussion.
c) Wiener Index: The Wiener Index $W(T)$ measures the structural complexity of the conversation tree $T$ and its potential virality [64]. It is the average shortest path between each pair of nodes $i, j$. In the case of a directed tree, the Wiener Index can be defined as
$$W(T) = \frac{2}{n(n-1)} \sum_{i < j} d(i, j),$$
where $\frac{2}{n(n-1)}$ is a normalization factor to account for all paths among couples of nodes. The Wiener index ranges between $[1, \infty)$ and, in general, it is minimized for broadcast structures and maximized for low branching structures [64]. In a conversation tree, a lower Wiener Index indicates that comments are more reachable from each other compared to comments in a tree with a higher Wiener Index.
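As a worked illustration of Depth and Wiener Index, the sketch below computes both on two four-node trees. It assumes that path lengths are measured on the tree with link directions ignored and that the tree has at least one edge; the function names and the edge-list representation are illustrative choices, not the authors' code.

```python
from collections import deque


def undirected_adjacency(edges):
    """Build an undirected adjacency map from (parent, child) pairs."""
    adj = {}
    for parent, child in edges:
        adj.setdefault(parent, set()).add(child)
        adj.setdefault(child, set()).add(parent)
    return adj


def bfs_distances(adj, source):
    """Shortest-path length from `source` to every node (breadth-first)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def depth(edges, root):
    """Distance of the deepest node from the root."""
    return max(bfs_distances(undirected_adjacency(edges), root).values())


def wiener_index(edges):
    """Average shortest-path length over all unordered node pairs."""
    adj = undirected_adjacency(edges)
    n = len(adj)
    # Summing over every source counts each unordered pair twice,
    # so total / (n * (n - 1)) equals 2 / (n (n - 1)) * sum_{i<j} d(i, j).
    total = sum(sum(bfs_distances(adj, u).values()) for u in adj)
    return total / (n * (n - 1))


star = [(0, 1), (0, 2), (0, 3)]   # broadcast: every comment replies to the root
chain = [(0, 1), (1, 2), (2, 3)]  # low branching: each comment replies to the last
```

On these toy trees the star has depth 1 and a Wiener Index of 1.5, while the chain has depth 3 and a Wiener Index of about 1.67, in line with the statement that broadcast structures minimize the index.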
d) Toxicity Ratio: The Toxicity Ratio $TR(T)$ is the share of toxic comments in the conversation tree $T$, i.e., the number of toxic replies out of the total number in the conversation. The toxicity ratio can be defined as
$$TR(T) = \frac{1}{n} \sum_{v \in V} \mathbb{1}\left[\mathrm{tox}(v) \geq 0.6\right],$$
where $\mathrm{tox}(v)$ is the toxicity score of the comment $v \in V$ and 0.6 is the toxicity threshold value. The higher the ratio, the more toxic the discussion is. The rationale behind this measure is to quantify the toxicity of the conversation up to the moment a comment node takes place. Therefore, it does not represent the toxicity of the parent comment but, instead, it describes a conversational state up to comment $c \in T$.
e) Average Toxicity Distance: The Average Toxicity Distance $TD(T)$ is the average normalized distance of toxic comments from the root $r \in V$, defined as
$$TD(T) = \frac{1}{|V_{tox}|} \sum_{v \in V_{tox}} \frac{d(r, v)}{D(T)},$$
where $V_{tox} \subseteq V$ is the set of toxic comments. $TD(T)$ is bounded in $(0, 1]$, and low values of this quantity imply that toxic comments are, on average, located close to the root.
f) Assortativity: The assortativity coefficient $r$ measures the extent to which similar nodes tend to be connected with each other [65]. It is defined in the $[-1, 1]$ range: values close to $-1$ indicate disassortativity (i.e., nodes with different features tend to be interconnected more than expected at random), whilst values close to 1 indicate assortativity (i.e., nodes with similar features tend to be interconnected more than expected at random). A value close to 0 means the distribution of node features is close to random. We consider as node feature their toxicity label, and to compute the assortativity coefficient, we ignore the direction of the links, obtaining the following equation:
$$r = \frac{\sum_{i,j} \left(a_{ij} - \frac{k_i k_j}{2m}\right) x_i x_j}{\sum_{i,j} \left(k_i \delta_{ij} - \frac{k_i k_j}{2m}\right) x_i x_j},$$
where $a_{ij}$ are elements of the adjacency matrix $A = (a_{ij})_{i,j \in V}$ in which $a_{ij} = 1$ ($a_{ij} = 0$) indicates the presence (absence) of a link between nodes $i$ and $j$; $k_i = \sum_{j=1}^{n} a_{ij}$ is the node degree; $x_i$ is the feature assigned to node $i$; $m$ is the number of links; and $\delta_{ij}$ is the Kronecker delta.
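The three toxicity-related structural metrics can be illustrated with a small sketch. Several choices here are assumptions rather than the authors' implementation: the ratio is taken over the scored comments passed in, distances from the root are normalized by the tree depth (which keeps the value in (0, 1]), the root is excluded from the distance average, and the assortativity uses the standard covariance-based coefficient with a binary toxic/non-toxic node feature on undirected links.

```python
def toxicity_ratio(scores, threshold=0.6):
    """Share of comments whose toxicity score is >= threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores.values()) / len(scores)


def average_toxicity_distance(dist_from_root, scores, threshold=0.6):
    """Mean of d(r, v) / depth over toxic comments v (root excluded)."""
    tree_depth = max(dist_from_root.values())
    toxic = [v for v, s in scores.items()
             if s >= threshold and dist_from_root[v] > 0]
    if not toxic or tree_depth == 0:
        return None  # undefined without toxic comments
    return sum(dist_from_root[v] / tree_depth for v in toxic) / len(toxic)


def toxicity_assortativity(edges, is_toxic):
    """Assortativity of a binary node feature, ignoring link direction."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    x = {node: float(is_toxic[node]) for node in degree}
    # sum_ij a_ij x_i x_j counts every undirected link twice.
    edge_term = 2 * sum(x[u] * x[v] for u, v in edges)
    kx = sum(degree[i] * x[i] for i in degree)
    kxx = sum(degree[i] * x[i] ** 2 for i in degree)
    numerator = edge_term - kx ** 2 / (2 * m)
    denominator = kxx - kx ** 2 / (2 * m)
    return numerator / denominator if denominator else None


# Toy conversation: root "r", comments "a", "b", "c" (values invented).
dist = {"r": 0, "a": 1, "b": 2, "c": 2}
scores = {"a": 0.7, "b": 0.1, "c": 0.9}

chain = [(0, 1), (1, 2), (2, 3)]
clustered = toxicity_assortativity(chain, {0: 1, 1: 1, 2: 0, 3: 0})
alternating = toxicity_assortativity(chain, {0: 1, 1: 0, 2: 1, 3: 0})
```

In the chain example, grouping the toxic nodes together yields a positive coefficient (1/3), whereas strictly alternating toxic and non-toxic nodes yields −1.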
2) Conversational Metrics
a) Average Comment Intertime: To quantify the average time, in seconds, elapsing between the appearance of a comment and its successor in a conversation, we introduce a measure called Avg. Comment Intertime $CI(T)$. Given a tree graph $T$, it is defined as
$$CI(T) = \frac{1}{m} \sum_{e \in E} \Delta t(e),$$
where $\Delta t(e) = t(w) - t(v)$ represents the difference between the timestamps associated to the nodes $w$ and $v$, with $e = (w, v) \in E$. The rationale behind this measure is to assess whether heated discussions tend to have shorter waiting times between responses than discussions that do not present toxicity traits.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

b) Number of Unique Users: The number of unique users $U(T)$ is the number of distinct users appearing in a conversation by posting or commenting. It is lower than or equal to the Tree Size $TS(T)$, i.e., $U(T) \leq n$.
c) Root Toxicity: To account for the influence that the text of the initial post can have on the conversation, we assign a toxicity label to the root of each tree T , as described in Section III-B.

E. Permutation Test
To assess differences in the distribution of cascade metrics between different topics, we perform permutation tests whose algorithm is described in Algorithm 1. For each metric, we consider the two distributions $X_{ele}$ and $Y_{foot}$ relative to the Elections and Football topics, keeping track of which population an observation is taken from. We begin by computing the test statistic $m$, defined as the absolute difference between the means of $X_{ele}$ and $Y_{foot}$. Then, we unify the cascade distributions of the two topics into a new one, called $Z$, and we shuffle the labels of the measures, obtaining $Z^*$, a set containing the same observations but (possibly) with different labels. Such operation allows us to perform the permutation tests by extracting the two shuffled distributions, i.e., $X^*_{ele}$ and $Y^*_{foot}$, based on their labels in $Z^*$ and computing the absolute difference of their means, $m^*$. We repeat the procedure 1000 times and, as a result, we compute the probability that the test statistic $m^*$, observed in our null model, is higher (in absolute value) than $m$. We decide to use the permutation test since it can reduce the effects of imbalances in the sample sizes that may interfere with other tests, such as the Kolmogorov-Smirnov (KS) test.
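A minimal sketch of this permutation test is given below. It follows the steps described above (pool, shuffle labels, recompute the absolute difference of means, repeat 1000 times); the function name, the fixed random seed, and the use of `>=` instead of a strict inequality when counting extreme permutations are assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_test(x, y, n_permutations=1000, seed=0):
    """Two-sample permutation test on the absolute difference of means.

    Returns the observed statistic m and the estimated p-value, i.e. the
    fraction of label shuffles whose statistic m* is at least m.
    """
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)          # Z: the unified distribution
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)             # Z*: same values, shuffled labels
        x_star, y_star = pooled[:len(x)], pooled[len(x):]
        if abs(mean(x_star) - mean(y_star)) >= observed:
            hits += 1
    return observed, hits / n_permutations


# Toy metric values for two topics (invented for illustration).
elections = [10, 11, 12, 13, 14, 15, 16, 17]
football = [0, 1, 2, 3, 4, 5, 6, 7]
m_obs, p_value = permutation_test(elections, football)
```

When several metrics are tested at once, the resulting p-values can be compared against a Bonferroni-corrected level (significance level divided by the number of tests).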

F. Toxicity Comment Prediction in a Conversation
Content moderation algorithms play a crucial role in the maintenance of online ecosystems. On the one hand, they must promptly limit the diffusion of harmful content. At the same time, too much limitation can prevent the emergence of vibrant discussions, impacting freedom of speech. Recent approaches to designing effective moderation tools [10], [29], [41], [42] focused on structural aspects of the conversations without effectively considering the relationship between the topic discussed and the community involved. To address this gap, we propose a ML approach that differs from the current literature for two main reasons. First, we aim to provide a minimal yet effective feature set based on previously computed cascade metrics. Second, since it is known that the importance of structural features decays as the tree size grows [19], we implement 4 different classifiers, each trained with comments belonging to specific stages of a conversation. In terms of toxicity, we hypothesise such a solution will capture its evolution in the different stages of a conversation.

1) Dataset Creation a) Computing cascade metrics at comment-level:
We begin the dataset creation procedure by reconstructing, for each topic and platform, the conversation cascades as described in Section III-C. During the reconstruction, we filter out all those conversations with less than one comment to ensure the existence of at least a pair of toxic/non-toxic comments. Next, we compute the evolution of the features described in Section III-D at the insertion time of each comment.
b) Creating a dataset for the toxicity prediction task: In ML tasks involving cascades, it is mandatory to account for the decaying importance of their features as the size grows [18], [19], [66]. If not, the predictions produced by models trained on these data may be biased by the tree's current state. From a structural perspective, previous results [67] showed how logarithmic binning enhances differences in the evolution of structural measures concerning the cascade size. Given these motivations, we apply a dataset creation strategy that performs a logarithmic binning on the cascade size. Indeed, each unfolded conversation is split into four intervals, i.e., (1, 10], (10, 100], (100, 1000], (1000, 10 000], according to the position assigned to a comment when it enters the conversation (comment index). This approach allows the creation of subsets that describe the different stages at which a conversation evolves, potentially helping the emergence of topological or conversational dynamics.
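The logarithmic binning on the comment index can be sketched as a simple lookup. The four intervals are those listed above, read as open on the left and closed on the right; the function name and the choice to return `None` for indices outside every interval are illustrative assumptions.

```python
# Conversation stages as (low, high] intervals on the comment index.
STAGES = [(1, 10), (10, 100), (100, 1000), (1000, 10_000)]


def conversation_stage(comment_index):
    """Return the (low, high] stage containing a comment index, if any."""
    for low, high in STAGES:
        if low < comment_index <= high:
            return (low, high)
    return None  # index outside the four stages considered


# Group toy comment indices by stage (values invented for illustration).
by_stage = {}
for index in [2, 10, 11, 250, 1000, 1001]:
    by_stage.setdefault(conversation_stage(index), []).append(index)
```

Using integer comparisons rather than a floating-point logarithm avoids rounding issues at the interval boundaries (10, 100, 1000).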
To optimize the separation between toxic and non-toxic elements, on each subset, we retain only those comments whose Perspective API toxicity score is either less than 0.2, representing elements with a low presence of toxic language, or greater than or equal to 0.6, representing the toxic elements.
For each conversation in a subset, we create pairs of comments that include a toxic and a non-toxic element until all toxic comments have a unique counterpart. To account for those toxic comments left without a counterpart, we randomly assign them a non-toxic element chosen from the subset under examination. Then, we extract the features of both comments from all pairs, obtaining a cascade snapshot from a structural and conversational perspective at the moment a toxic and a non-toxic comment are posted in the different conversations. Finally, we end the dataset creation by performing an 80/20 split to obtain the train and test sets for the model training and testing phase.²
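The pairing and splitting steps can be sketched as follows. This is only one plausible reading of the procedure: the data layout (a dictionary of conversations with `toxic` and `non_toxic` lists), the function names, and the fact that a randomly drawn counterpart may already belong to another pair are assumptions made for illustration. The sketch also assumes the subset contains at least one non-toxic comment.

```python
import random


def build_pairs(conversations, seed=0):
    """Pair each toxic comment with a non-toxic one.

    `conversations` maps conv_id -> {"toxic": [...], "non_toxic": [...]}
    for a single stage subset.
    """
    rng = random.Random(seed)
    # Pool of all non-toxic comments in the subset, used as a fallback.
    pool = [c for conv in conversations.values() for c in conv["non_toxic"]]
    pairs = []
    for conv in conversations.values():
        clean = list(conv["non_toxic"])
        rng.shuffle(clean)
        for i, toxic in enumerate(conv["toxic"]):
            # Same-conversation counterpart if available, else a random
            # non-toxic comment drawn from the whole subset.
            partner = clean[i] if i < len(clean) else rng.choice(pool)
            pairs.append((toxic, partner))
    return pairs


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle and split items into train and test lists (80/20 default)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy subset, invented for illustration.
subset = {
    "conv_a": {"toxic": ["t1", "t2"], "non_toxic": ["n1"]},
    "conv_b": {"toxic": ["t3"], "non_toxic": ["n2", "n3"]},
}
pairs = build_pairs(subset)
train, test = train_test_split(range(10))
```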

2) Model Training:
To predict the occurrence of a toxic comment in a conversation, we implement an ensemble approach that consists of four ML sub-models, each specialized for a specific conversation stage as described in Section III-F1. We train these models on a set of structural and conversational features, defined in Section III-D, to capture the different aspects that can lead to the production of toxic content in a conversation. We implement several supervised ML models to assess the consistency of results and identify the most suitable model for this task, namely Logistic Regression (LR) models, Random Forests (RF), Decision Trees (DT), AdaBoost (AB), Support Vector Machines (SVM) and Gradient Boosted Regression Trees (GBRT). For each model, we tune its hyper-parameters through a 10-fold CV. The best model is refitted on the entire training set based on its accuracy score. For each dataset interval, we choose the model with the highest F1 score, using the Accuracy score as a tie-breaker.
To estimate the predictive power of singular features, we proceed as follows. We first compute the F1 score $s$ obtained by fitting the model $m$ on the original dataset $X$. Next, we randomly shuffle its values for each feature $j \in [1, P]$ of the dataset, where $P$ is the total number of features. For every shuffle $k \in [1, 10]$, we fit the model $m$ on the dataset $X_{j,k}$ with the $j$-th column shuffled, obtaining a new score $s_{k,j}$. The importance of the feature $i_j$ is defined as
$$i_j = s - \frac{1}{10} \sum_{k=1}^{10} s_{k,j}.$$

IV. RESULTS

A. Toxicity Evolution
We begin the analysis by comparing the toxicity evolution for the Italian Football League, representing a topic close to the Italian popular culture, and the 2022 Italian Political Elections, representing a potentially polarising topic. Fig. 2(a) represents the average toxicity scores observed for each topic and social media platform during the analysis period. We observe that conversations about Italian Elections display higher toxicity levels than those about Italian Football. Indeed, on Twitter, Elections conversations produce an average daily toxicity score of 0.18 compared to 0.09 for Football. The same behavior is found on YouTube, where the Elections topic attracts more toxicity than Football, with an average score of 0.22 against 0.13 for its counterpart. This result complies with the toxicity labelling results described in Table I in which, on both social media, Elections contents have the highest percentage of toxic elements, and is in line with previous studies [10], [68] reporting a low, but still problematic, prevalence of toxic speech in online social media. We statistically assess this result by applying the KS test on both topic distributions for each social media platform, obtaining a p-value p < 0.05 for both cases. Ultimately, we provide the first evidence of how the topic of Football produces conversations characterized by a lower presence of toxic language compared to political Elections. The lower prevalence of toxic speech in the football debate, which is instead usually associated with events of hate and violence both offline and online, represents a counterintuitive aspect emerging from the data.

TABLE II PEARSON CORRELATION SCORES BETWEEN AVERAGE DAILY METRICS CONCERNING THE TOXICITY SCORES, THE NUMBER OF POSTS AND COMMENTS AS WELL AS THE NUMBER OF USERS COMMENTING AND HOW MANY TIMES THEY COMMENTED DAILY

Next, we quantify the rate at which toxicity evolved during the analysis period, assessing whether events, such as the Italian Elections voting day on September 25th, 2022, may produce an effect on the toxicity of the corresponding debate. To achieve this goal, we estimate the evolution of toxicity on each topic through Ordinary Least Squares (OLS) regression models, defined as
$$y_t = \beta_0 + \beta_1 t + \varepsilon_t,$$
where $y_t$ is the average toxicity score on day $t$. Results from the fitting procedure show that YouTube is characterized by a decreasing trend of the toxicity scores for both topics ($\beta_1 = -2.59 \times 10^{-4}$ for Elections and $\beta_1 = -5 \times 10^{-4}$ for Football), whilst Twitter presents a stationary trend for the Elections topic ($\beta_1 = 5.06 \times 10^{-5}$) and an increasing one for Football ($\beta_1 = 1.59 \times 10^{-4}$).
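For a single explanatory variable, the OLS slope and intercept have a closed form, which the sketch below applies to a daily toxicity series. It assumes the regressor is simply the day index 0, 1, 2, and so on, and that the series has at least two observations; the function name and the toy series are illustrative and do not reproduce the paper's data.

```python
def ols_trend(y):
    """Fit y_t = beta0 + beta1 * t by least squares, with t = 0, 1, 2, ...

    Returns (beta0, beta1); beta1 is the average daily change.
    """
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (yt - y_mean) for t, yt in enumerate(y))
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * t_mean
    return beta0, beta1


# Toy series with a slowly decreasing toxicity level (invented values).
daily_toxicity = [0.2 - 0.001 * t for t in range(10)]
beta0, beta1 = ols_trend(daily_toxicity)
```

A negative slope corresponds to a decreasing toxicity trend, a slope close to zero to a stationary one.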
In terms of differences found in coincidence of the voting day, Fig. 2(b) reports a toxicity decrease of 21.98% for Football and 3.57% for Elections on YouTube, whilst on Twitter we note an increase of 3.03% for Football and a decrease of 0.42% for Elections. However, by conducting a KS test on the samples concerning the pre- and post-event periods, we observed that the only significant change in toxicity happened on YouTube, with p-value < 0.05 for both topics against 0.22 and 0.35 in the case of Twitter.
To conclude our analysis, we look at the possible factors explaining the observed toxicity trends. To do so, we compute the Pearson correlation coefficient between the toxicity score and a set of measures related to the volume of content and the user's behaviour, namely the number of posts (Posts), comments (Comments), the number of users commenting (Users Commenting) and the comments they produce (User Comments). From the results reported in Table II, we observe that, on YouTube, the evolution of toxicity is positively linked with all the measures taken into account, identifying the role of the content volume in the production of online hate for both topics. More specifically, the positive correlation between the number of comments and users involved provides evidence of how online toxicity is closely associated with the length of discussions (represented by the number of comments) and with the commenting activity of users (represented by the number of user comments). On Twitter, toxicity in Football conversations appears to be linked to the number of posts generated about the topic, without being influenced by the commenting perspective. For the Elections topic instead, results confirm what was observed on YouTube, i.e., toxicity has a strict direct relationship with the commenting activity.
Ultimately, we provide evidence of how a popular culture topic like football tends to attract less hate than those inherently divisive, such as political elections. From a social media perspective, we observe how the topic may be a discriminant in the evolution of toxicity, even in unexpected cases such as football.

B. Structural Analysis
We continue our comparison by investigating how the structure of conversations diverges according to their topic and platform. We first compute a set of structural metrics, described in Section III-D; then, we assess the statistical validity of the obtained distributions using a permutation test, described in Section III-E, with a Bonferroni correction to account for multiple comparisons, considering p-values less than 0.00625 (0.05/8) as significant. Fig. 3 reports the Complementary Cumulative Distribution Functions (CCDFs) computed on the previous cascade metrics for both topics and social media. We observe how, on Twitter, the Elections topic tends to attract bigger (Tree Size), wider (Max Width) and deeper (Max Depth) conversations. From a content perspective instead, Elections conversations are more likely to carry more toxic tweets (Toxicity Ratio) than those from Football. Conversely, users participating in Football conversations have a higher chance of finding a toxic comment earlier than in the Elections ones (Avg. Toxicity Distance). The p-values of the statistical tests show that Max Width and Number of unique users are the only metrics on Twitter having no differences regardless of the topic.

C. Predicting the Following Toxic Comment in a Conversation
As a final step, we predict the probability that the following comment in a conversation is toxic. Our results show that GBRT models achieve the highest performance on most configurations; their results are reported in Fig. 4(a). We report results for the (1, 10] interval for the sake of completeness, but we do not include them in the discussion. The reason is that newborn conversations with few comments may not have established proper conversational dynamics yet, and therefore do not represent an adequate basis for toxicity predictors. The F1 scores reported for the Elections topic range within [0.72, 0.78] on Twitter and [0.70, 0.76] on YouTube. For Football instead, F1 scores range within [0.79, 0.84] on Twitter and [0.77, 0.84] on YouTube. Next, we create a baseline by training each model on datasets obtained by unifying all intervals for each topic-platform combination. The resulting metrics reveal that, in all configurations, the (10, 100] interval produces F1 scores greater than or equal to the baseline, providing evidence of how accounting for the different stages of a conversation may produce models with better performance and, therefore, with the ability to keep digital ecosystems safer.
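The stage-based setup described above relies on assigning each observation to a right-closed interval of conversation size, so that one model can be trained per interval. A minimal sketch of that assignment follows. Only the (1, 10] and (10, 100] intervals are named in this section; the two upper intervals used below, (100, 1000] and (1000, ∞), are assumptions made for illustration, as are the function names.

```python
import math
from bisect import bisect_left
from collections import defaultdict

# Interval edges: the first two intervals come from the text,
# the last two are ASSUMED here to reach four stages.
EDGES = (1, 10, 100, 1000, math.inf)


def stage_label(n_comments, edges=EDGES):
    """Return the right-closed interval (lo, hi] containing n_comments,
    or None when n_comments <= edges[0]."""
    idx = bisect_left(edges, n_comments)
    if idx == 0 or idx == len(edges):
        return None
    return (edges[idx - 1], edges[idx])


def split_by_stage(observations):
    """Group (n_comments, features, label) tuples by conversation stage."""
    groups = defaultdict(list)
    for n_comments, features, label in observations:
        stage = stage_label(n_comments)
        if stage is not None:
            groups[stage].append((features, label))
    return dict(groups)


# Hypothetical observations: (comments so far, feature vector, next-is-toxic).
obs = [(5, [0.1], 0), (10, [0.2], 1), (11, [0.3], 0), (5000, [0.4], 1)]
groups = split_by_stage(obs)
print({stage: len(rows) for stage, rows in groups.items()})
```

Each group would then feed a separate classifier, while the baseline corresponds to training a single model on the union of all groups.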
Next, we investigate the generalizing power of the models with respect to the topics they were trained on. To do so, we perform a cross-topic evaluation for each social media: each stage model is trained on one topic and tested on its counterpart. Fig. 4(b) displays the result of this comparison. On YouTube, training on Football data and testing against the Elections test set decreased the F1 score by an average of 7%. A similar result is observed when training on Elections data and testing against the Football test set, with an average decrease of the F1 score equal to 9%. Conversely, on Twitter, we observe a twofold effect. Whilst training on the Football comments and testing on Elections produced an average reduction of the F1 score equal to 8%, the opposite scenario produced an average increase of the metric equal to 8%. Such a result indicates that models trained on Football, whose conversations are less toxic and less participated, cannot generalize to the toxicity dynamics occurring in a more toxic topic like Elections, resulting in a drop in performance. Instead, the models trained on cascades with a more articulated structure, like the Elections ones, tend to generalize better to unknown observations in their feature space, achieving higher performance on a cross-topic benchmark. Finally, we assess the importance of each employed metric in this prediction task by measuring how the F1 score would be impacted if a feature were removed. Results displayed in Fig. 5 show, as expected, that the toxicity ratio (Toxicity Ratio) is the most significant feature for predicting the toxicity of a comment, leading to an average reduction of 22% in the F1 score on both platforms, followed by the average toxicity distance (Avg. Toxicity Distance) (2%) and the assortativity (Assortativity) (1%). This result describes how combining cascade features with domain-specific information can be relevant in predicting harmful content.
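To make the feature-removal analysis concrete, the sketch below computes a binary F1 score and the drop observed when a model is re-evaluated without one feature. It is an illustration only: the text does not state whether the reported reductions are relative or absolute, so the relative drop used here is an assumption, and the label and prediction vectors are invented.

```python
def f1_score(y_true, y_pred):
    """Binary F1 score, treating label 1 (toxic) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_drop(f1_full, f1_ablated):
    """Relative F1 reduction after removing a feature (ASSUMED definition)."""
    return (f1_full - f1_ablated) / f1_full


# Hypothetical labels and predictions (illustrative only).
y_true = [1, 1, 0, 0, 1]
pred_full = [1, 1, 0, 0, 1]      # model trained with all features
pred_ablated = [1, 0, 0, 1, 1]   # same model retrained without one feature

full = f1_score(y_true, pred_full)
ablated = f1_score(y_true, pred_ablated)
print(f"F1 full={full:.2f}, ablated={ablated:.2f}, "
      f"drop={relative_drop(full, ablated):.0%}")
```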

V. CONCLUSION
In this paper, we proposed a Twitter and YouTube comparison between the Italian football championship and the 2022 Italian general elections. We first assessed their differences in toxicity evolution, investigating which factors induce changes in the prevalence of toxic speech. Then, we compared conversations from a topological perspective by employing a set of structural metrics typical of cascades. Finally, we employed an ML approach which, by creating four sub-models accounting for the different stages of a conversation, predicted the presence of the following toxic comment in a conversation. Our findings provide a counterintuitive example of how football, a topic close to popular culture that is usually associated with episodes of extreme hate and violence, tends to exhibit lower toxicity levels than politics, a potentially divisive topic. This comparison also sheds light on a trend towards affective polarisation, which implies increased negativity towards the members of the opposing political parties [69], [70] at a national level. From a structural perspective, conversations from the Elections are broader, more toxic and involve more users. Moreover, the classifiers resulting from the stage-based approach achieved state-of-the-art results despite a minimal set of features, with models from early stages of conversations performing as well as those trained on the entire datasets. Our findings could be employed to support human moderators by providing a warning signal for conversations that display a higher likelihood of generating toxic exchanges.
Despite positive aspects such as its multi-platform and multi-topic nature, our study presents some limitations. The first is that the conversations cover only one language, Italian. Our results may also suffer from the lack of deleted content, despite our data collection being performed with a short delay (a few days at most) with respect to the posting time.
In future work, we aim to generalize this approach by extending the number of topics chosen and the list of platforms, including unmoderated social media platforms. Finally, to advance the quality of predictions, we also aim to define newer structural and conversational metrics to include in our models.
Gabriele Etta received the M.Sc. degree in data science from the University of Padua, Padua, Italy, in 2020. He is currently working toward the Ph.D. degree with the Sapienza University of Rome, Rome, Italy, working on data-driven modeling of social dynamics. His research interests include complex networks, information diffusion, and computational social science.

Matteo Cinelli is currently an Assistant Professor in Computer Science with the University of Rome "La Sapienza", Rome, Italy. His research interests include network and data science, mostly in the context of information diffusion and social media. He is a member of the Center of Data Science and Complexity for Society (CDCS).

Fig. 1. Graphical representation of a conversation tree on YouTube and Twitter. The root node representing the post is a square, while the children nodes (comments) are represented as circles, with the number representing the comment ID. The nodes' colours represent the toxicity category assigned from their text. A node in green represents content whose text was identified by Perspective API with a toxicity score < 0.6, whilst a red node identifies an element with a toxicity score ≥ 0.6. Finally, grey nodes represent all those contents for which the API could not quantify their toxicity.
1) Structural Metrics
a) Tree Size: We define as Tree Size the number of nodes in a tree graph, denoted as n = |V|, where |·| is the cardinality of the set V. Conversation trees with high Tree Size values indicate a widely participated discussion. We assume a user can post multiple replies and interact with different users within the conversation.
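As a minimal sketch of how Tree Size can be computed, the snippet below represents a conversation tree as a mapping from each node ID to its parent ID (the root post maps to `None`) and returns n = |V|. It also derives a maximum depth and maximum width; the definitions used for those two (longest root-to-leaf path in edges, and largest number of nodes at the same depth) are assumptions made for illustration, and the representation and function name are hypothetical.

```python
from collections import Counter


def tree_metrics(parent):
    """Compute size, max depth and max width of a conversation tree.

    `parent` maps every node ID to its parent ID; the root maps to None.
    """
    depth = {}
    for start in parent:
        # Walk up until a node with known depth (or the root) is found.
        path = []
        node = start
        while node not in depth:
            if parent[node] is None:
                depth[node] = 0
                break
            path.append(node)
            node = parent[node]
        base = depth[node]
        # Assign depths back down the path, closest-to-root first.
        for offset, visited in enumerate(reversed(path), start=1):
            depth[visited] = base + offset
    per_level = Counter(depth.values())
    return {
        "tree_size": len(parent),            # n = |V|
        "max_depth": max(depth.values()),    # ASSUMED definition
        "max_width": max(per_level.values()),  # ASSUMED definition
    }


# Hypothetical tree: post 0 with comments 1..5 (illustrative only).
example = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 3}
metrics = tree_metrics(example)
print(metrics)
```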

Algorithm 1: Permutation Test Algorithm to Assess Statistical Differences in the Cascade Metrics of Two Topics.
Input: Two topic metric distributions X_ele and Y_foot, where each measure possesses a label identifying its provenance
Parameter: N, number of permutations
Output: p, the p-value resulting from the permutation test
c = 0
N = 1000
Calculate the test statistic m = |X_ele − Y_foot|
Z = X_ele ∪ Y_foot (maintaining the label of each observation)
let i = 1
while i ≤ N do
    Z* = shuffle the labels of the observations in Z
    Extract X*_ele and Y*_foot from Z* according to their label in Z*
    m* = |X*_ele − Y*_foot|
    if m* ≥ m then
        c = c + 1
    end if
    i = i + 1
end while
p = c / N
return p
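The permutation test of Algorithm 1 can be sketched in runnable form as follows. Two points are assumptions rather than details confirmed by the text: the test statistic is taken to be the absolute difference between the sample means of the two distributions, and label shuffling is implemented by shuffling the pooled values and re-splitting them at the original group sizes, which is equivalent. The function name, the `seed` parameter and the example data are hypothetical.

```python
import random


def permutation_test(x_ele, y_foot, n_permutations=1000, seed=None):
    """Two-sample permutation test; returns the p-value c / N."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    # ASSUMPTION: the statistic is the absolute difference of the means.
    observed = abs(mean(x_ele) - mean(y_foot))
    pooled = list(x_ele) + list(y_foot)
    n_x = len(x_ele)
    count = 0
    for _ in range(n_permutations):
        # Shuffling pooled values and re-splitting permutes the labels.
        rng.shuffle(pooled)
        permuted = abs(mean(pooled[:n_x]) - mean(pooled[n_x:]))
        if permuted >= observed:
            count += 1
    return count / n_permutations


# Hypothetical metric samples for the two topics (illustrative only).
elections = [12, 15, 30, 22, 18, 40, 25, 33]
football = [10, 14, 28, 20, 19, 35, 24, 30]
p = permutation_test(elections, football, n_permutations=1000, seed=0)
print(f"p-value: {p:.3f}")
```

The returned p-value would then be compared against the Bonferroni-corrected threshold rather than against 0.05 directly.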

Fig. 2. Left panel: Average daily toxicity score reported on Twitter (left) and YouTube (right). The straight horizontal lines represent the linear fit performed on each trend. The red vertical line represents the date of the voting day for the Italian Elections (September 25, 2022). Right panel: toxicity score distributions for each social media and topic before and after the date concerning the Italian Elections voting.

Fig. 4. Left panel: prediction results of the GBRT model trained on intervals from each social media and topic. Right panel: prediction results from a cross-topic comparison on each social media. We observe how performing out-of-topic prediction reduces prediction scores.


Fig. 5. Representation of the importance of the features employed in the model, quantified by the average drop in F1 score corresponding to removing a specific feature.