Sarcasm Over Time and Across Platforms: Does the Way We Express Sarcasm Change?

Sarcasm is a sophisticated form of speech used to convey a message other than the apparent one. Numerous papers have discussed automatic sarcasm detection and how it could be used to improve sentiment analysis. The objective of this paper is to provide non-experts with a comprehensive overview of the state of research in this field and the main findings regarding sarcasm detection. To that end, we survey the state-of-the-art work in this field, recapitulate the research effort with a focus on the more recent works, and present the performance that can be expected from the proposed approaches. In addition, we study in detail how this form of speech is used on different platforms, and how the way we express it evolves over time. We also discuss the proposition that sarcasm is a polarity switcher for sentiment analysis. To achieve these goals, we run experiments on 3 data sets, collected from 3 different platforms, and compare how sarcasm is employed in each. These platforms are Twitter, Reddit, and some news websites. Our experiments show that the way sarcasm is expressed is highly dependent on language mastery and the platform used. For instance, in the Twitter data set, whose users vary widely in age, language mastery, and understanding of what sarcasm means, the overall precision of detection of sarcastic statements reaches 89.31%. In the Reddit data set, the precision of detection of such statements is about 55.33%, and in the news data set, the precision reaches over 96.67%. Our experiments also show that, to a great extent, it is safe to affirm that sarcasm, when employed, switches the polarity of a given piece of text: for the 3 platforms presented above, sarcasm has been a polarity switcher for 89.3%, 89.1%, and 92.0% of their respective instances.


I. INTRODUCTION
With the rapid growth of user-generated content on the Internet, companies, organizations, and research institutions and centers have been studying this type of data for several purposes. Part of this work has focused on the interaction between Internet users, the types of information exchange they engage in, and even the nature of the relationships they build. However, most of the interest has targeted the content of the data they share, as it is the richest in terms of embedded information. Several studies have been conducted on the content of user-generated data. One particular type of study performed on these data is referred to as sentiment analysis. Sentiment analysis refers to the process of automatically identifying the opinion embedded within a given piece of text. Roughly speaking, sentiment analysis has as a first goal the detection of the sentiment polarity of the text. By sentiment polarity, we mean identifying whether the author of the text has a positive attitude towards its subject or a negative one (or sometimes a neutral one). Sentiment analysis has several usages, varying from the identification of users' opinion on a product or service [1]-[3] to their voting intent in upcoming elections. With its maturity, sentiment analysis-related research has shifted from proposing novel approaches to perform the task towards applications of this technique in cases such as the US presidential elections [4], the Coronavirus pandemic [5]-[7], and critical events [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Joey Tianyi Zhou.
That being said, despite how sophisticated the approaches proposed in the literature are, sentiment analysis, after all, relies mostly on the words and expressions used in a text to identify its polarity. However, appearances might be misleading. This is the case when non-straightforward and indirect forms of speech such as sarcasm are employed.
Sarcasm has had an increase in usage in social media over the last few years, with a multitude of accounts named after it spreading sarcastic statements, which are shared and re-posted by millions of users. Sarcasm has been used by normal users as well as public figures in online debates or when addressing a public event or hot and controversial topics.
The Collins online dictionary defines sarcasm as ''speech or writing which actually means the opposite of what it seems to say''. The Cambridge dictionary defines it as ''the use of remarks that clearly mean the opposite of what they say''. Several previous works have shown that sarcasm is one of the most common reasons for misclassification when sentiment analysis is performed [9], [10]. With reference to the definitions mentioned above, sarcasm can roughly be defined as saying the opposite of what is meant, an idea which we discuss in more detail later on in this paper. Sarcasm is widely used for several reasons, the most important among them being how pertinent and expressive sarcastic statements sound. As discussed by R. Giora [11], direct negation can sometimes be vague and not very expressive. It can also sound very serious and face-threatening, and can sometimes sound dull, or fail to convey the feeling of the person talking. Sarcasm, and irony in general, is less serious, yet very expressive. It also conveys more than just the idea which one wants to negate. For instance, when one wants to express being annoyed by someone else, one might use the expression ''You are so funny!''. This expression does not only tell the other party that he is not funny, but also gives him the impression that the speaker is annoyed by his stories. Likewise, Camp [12] analyzed sarcasm in terms of meaning inversion, and distinguished 4 sub-classes of sarcasm, individuated in terms of the target of inversion:
• Propositional sarcasm: which is closest to what the traditional model suggests, where sarcasm is as simple as saying the contrary of a proposition that would have been expressed by a sincere utterance.
• Lexical sarcasm: which delivers an inverted compositional value for only a single expression or part of the sentence.
• ''Like''-prefixed sarcasm: which commits the speaker to the emphatic epistemic denial of a declarative utterance's focal content.
• Illocutionary sarcasm: which expresses an attitude which is the opposite of one that a sincere utterance would have expressed.
She also concluded that 3 of these classes raise serious challenges for a standard implicature analysis.
With that in mind, despite the common thought that a person's way of expressing himself is an idiosyncrasy, complex and unique to that person, it is arguably more accurate to assume that the way we behave is learned from others, and that the way we talk is, more or less, a combination of what we have heard and expressed in the past [13]. This has been addressed by developmental psychologists and shown to be very accurate [14]. Sarcasm, for instance, is one of the most sophisticated forms of speech, yet, ironically, many people are less creative when trying to employ it. Some suggest that such a form of speech requires a high Intelligence Quotient (IQ) to express, let alone to catch and understand [15]. In [16], the authors suggested that people tend to rely on cheap or lazy cues to detect it. Consequently, many so-called ''sarcastic statements'' on social media are simple iterations on already-established sarcastic statements. In other words, most of what casual Internet users create as sarcastic statements are modifications of previously created ones, adapted to fit a given context. This idea of lack of creativity is the basis of several previously proposed works on automatic sarcasm detection in texts collected from social media and microblogging websites such as Facebook and Twitter [17]-[19]. These works rely on what they refer to as ''sarcastic patterns'' to identify such common expressions used to express sarcasm.
The use of sarcastic patterns to locate sarcastic statements has produced very good results on data collected from online social networks and microblogging websites such as Twitter. However, the question that remains to be answered is whether such an idea can be used to identify sarcasm in more structured data types or in texts written by people with higher language mastery.
To recapitulate, the objective of this survey paper is to provide non-experts with a comprehensive overview of the state of research in this field and the main findings reported by researchers regarding sarcasm detection. In addition, in this paper, we try to answer the following 3 questions:
[Q1] Does the way people express sarcasm differ from one platform to another, and does it depend on the level of ''mastery'' of the language?
[Q2] Does the way people express sarcasm evolve over time, in particular, on social media where sarcastic statements are ''driven'' by some influential users?
[Q3] Is it safe to affirm that, for a given piece of text, if sarcasm is employed, the overall polarity of that text is the opposite of the apparent one?
The remainder of this paper is structured as follows: Section II briefly describes the state of the art of existing work that dealt with the tasks of sentiment analysis and sarcasm detection. In Section III, we present some of the work related to sarcasm in other fields, as well as the main findings that could hint at possible ways to understand sarcasm, and thus to detect it. In Section IV, we explore in more detail the existing works on automatic sarcasm detection, covering the data sets built for this task, the methods used, the features extracted from the data to identify sarcasm, and the reported results. In Section V, we summarize the main challenges and problems that are still open for research in this field. In Section VI, we present our experiment specifications including a description of the data sets we have used and the software and hardware environments. We describe our
VOLUME 10, 2022
Finally, in Section VII, we conclude this paper and propose possible directions for future work. For more readability, the outline of this article is shown in Figure 1. In addition, the most used acronyms and their full forms are shown in Table 1.

II. RELATED WORK
A. SENTIMENT ANALYSIS
As described in Section I, sarcasm detection has almost completely been associated with the idea of sentiment analysis enhancement. Sentiment analysis has a long history that goes back to ancient Greece [20], [21]. However, this kind of analysis was very basic and non-robust and does not qualify as scientific, because it did not follow the scientific method, which was established centuries later. Moreover, it did not benefit from the currently existing technology, which has allowed for the massive application of sentiment analysis to real large-scale problems. From the scientific point of view, the first journal on public opinion mining was published in the year 1931 [22]. However, sentiment analysis as we know it now was defined in the early 2000s by Lee, who later co-authored the work [23] and who is considered to be one of the founders of the field of ''Sentiment Analysis''. Pioneered by the work of Pang et al. [23], the idea of using machine learning for sentiment analysis has been massively adopted, and the vast majority of works in the field have opted for the use of machine learning. Research on sentiment analysis has since then known an exponential growth, with many of the approaches proposed afterwards revolving around the same basic idea. According to [20], over 99% of scientific papers on sentiment analysis have been published after the year 2004.
The spread of social media over the last two decades has resulted in an exponential growth of user-generated data, a perfect material for the application of sentiment analysis. This is because user-generated data are regarded as raw opinions of Internet users, which can be analyzed for various objectives. For instance, sentiment analysis has typically been used for collecting, analyzing, and aggregating people's opinions about products [24], [25], movies [2], or services [26]-[28]. In addition, works such as that of Akcora et al. [29] were proposed to identify major changes in public opinion over time, and to spot the news that led to breakpoints in public opinion.
Twitter, being one of the most popular platforms for people to share their thoughts in relatively short texts, has attracted most of the attention in the last few years. Approaches such as that of Boia et al. [30] and that of Manuel et al. [31] used non-textual features (such as emoticons and slang) to classify tweets and online texts or to attribute sentiment scores to them.

B. SARCASM DETECTION
Sarcasm detection for sentiment analysis improvement is a relatively novel field. To the best of our knowledge, the first published work to introduce this task was that of Tepperman et al. [41]. However, the fact that it was applied to vocal data makes it somewhat different from the rest of the works discussed here, and from the task we will undertake later on. Kreuz and Caucci [42] introduced this task for written text. In their work, they used unigrams to identify sarcastic phrases and sentences present in excerpts from long narratives. Their approach, despite being naive, was a starting point for several works in the following years.
Tsur et al. [43] and Davidov et al. [17] have introduced a semi-supervised approach to detect sarcastic statements on Twitter and Amazon. They introduced the concept of sarcastic patterns to refer to generic expressions that are commonly used in sarcastic statements. This idea has been polished further in other works such as those of Lukin and Walker [44], Liebrecht et al. [45], Barbieri et al. [46] and Bouazizi and Ohtsuki [47].
Other works have been introduced in the following years. Some of them used n-grams [48], while others used other types of features such as sentiment features [49], [50]. More advanced ones make use of the context within which the text message was posted, be it temporal, conversational, psychological or behavioral [51], [52].
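To make the notion of n-gram features mentioned above more concrete, the following is a minimal sketch of how unigram and bigram counts could be extracted from a short text. It is not the feature extraction of any specific cited work: the whitespace tokenizer and the function names (tokenize, ngrams, ngram_counts) are assumptions made for illustration only.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=2):
    """Count all n-grams of order 1 to max_n in a text."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical example sentence, not taken from any data set.
features = ngram_counts("oh great another monday oh great")
```

Such counts would typically be turned into a fixed-length vector and passed to a classifier; a real system would also need a tokenizer that handles punctuation, hashtags, and emoticons.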
In addition, with the advances in the field of Deep Learning (DL), several approaches were proposed to detect sarcasm using this technology which has proven to outperform conventional Machine Learning (ML) in classification tasks. Poria et al. [53] have proposed a model to extract sentiment, emotion, and personality features for sarcasm detection.
In a related context, Twitter has been the main platform studied for sarcasm detection. This is because of the reasons we have introduced in the previous section, in addition to the openness of this platform and the ease of access to its users' generated content via its streaming Application Programming Interface (API). However, several works were introduced to detect sarcasm on other platforms and types of texts, such as Amazon reviews [43], Reddit posts and comments [54], news articles [55], etc.
Research on automatic sarcasm detection is still ongoing, and the results obtained in this field are promising, and have real-world applications to improve sentiment analysis.
In the next Section, we will address further the current state of the studies on sarcasm in different fields, before we tackle the works in the field of automatic sarcasm detection.
We will describe in more detail the techniques used, the results obtained and the main findings.

III. THE PHENOMENON OF SARCASM
A. SARCASM FROM MEDICAL AND SOCIOLOGICAL POINTS OF VIEW
Sarcasm detection, as addressed in this paper, relates to the process of using Natural Language Processing (NLP) techniques and tools to automatically detect sarcasm in social media and other online user-generated content sources. However, sarcasm has also been studied in other fields such as the medical field, in particular from a neurological perspective. For instance, damage to brain cells and mental deficiency largely limit one's ability to capture sarcasm [56], which might lead to undesirable consequences. Sarcasm is by definition used to express criticism, quite often in a nonaggressive way. Not being able to understand it does not only reveal mental deficiency, but also leads to miscommunication and incorrect interpretations of intentions. With that in mind, people with mental health issues such as dementia or even non-dementia conditions [57] share common behavior regarding the processing of indirect forms of speech. Staios et al. [58] explored sarcasm detection in amyotrophic lateral sclerosis using ecologically valid measures. They have shown that Amyotrophic Lateral Sclerosis (ALS) patients exhibit cognitive deficits, including being unable to understand and detect sarcastic and paradoxical sarcastic statements, both being sophisticated forms of speech. This is in line with other observations that suggest that sarcasm requires a high IQ to understand [15], even though a low IQ does not necessarily imply neurological problems.
Sarcasm has also been studied from psychological and sociological perspectives. Sarcasm usage could imply a certain degree of closeness between the speaker and the hearer [59]. Not only does it reflect the nature of the speaker as indirect and humorous, but it also has effects on the hearer, whether this effect is positive [60] or negative [59].
Whether sarcasm is a polarity switcher has also been addressed in a few works [47]. In addition, although sarcasm is confused in many works with the concept of irony, Littman and Mey [61] suggested that sarcasm and irony are not necessarily conjoined in speech. In other words, even though many researchers have used the terms ''irony'' and ''sarcasm'' interchangeably [62], the two are not to be mixed with one another, as sarcasm has a certain degree of aggression and criticism. This will be addressed in more detail in the next subsection.
In Table 2, we summarize some of the findings related to sarcasm and sarcasm detection from which the research on the automatic detection of sarcasm on social media has benefited.

B. SARCASM AND IRONY
The automatic identification of sarcastic statements has been the subject of several research works conducted for different purposes. The most common usage of sarcasm detection is to enhance the performance of sentiment analysis systems, which lag behind when sarcasm is employed [9], [71]. In this sense, sarcasm has quite often been confused (and fused) with irony when it comes to their detection in written text. This is because sarcasm is indeed one form of irony. In the Collins online dictionary, irony is defined as ''a subtle form of humor which involves saying things that you do not mean.'' In the context introduced above, the definition of irony is no different from that of sarcasm as defined in Section I. However, it is important to emphasize the fact that sarcasm has a criticism aspect and a more ''aggressive'' attitude added to it. Giora [11] defined sarcasm as a form of ''irony that is especially bitter and caustic.'' Rajadesingan et al. [52] suggest that sarcasm is more of a ''caustic and derisive'' type of humor. More often than not, the type of irony addressed in the literature in works such as [72]-[74] is the one where the apparent meaning is the opposite of the actual one conveyed by the speaker/writer. The target for identifying this form of irony overlaps with the objective of sarcasm detection as addressed in this paper, as well as others: knowing the original intentions conveyed in the text.

C. TYPES OF SARCASM
While the term ''type'' might not be the proper way of defining the classification of instances of sarcasm as proposed in the literature, we will be using the term as authors of previous works [9], [47], [52], [75] used it as well.
In [9], [47], sarcasm has been identified as used for 3 main purposes:
• Sarcasm as wit: sarcasm, when used as wit, aims to be funny. In this context, sarcasm is closer to irony. The person employs some special forms of speech, tends to exaggerate, or uses a tone that is different from his usual one to make the sarcasm easy to recognize.
• Sarcasm as whimper: sarcasm, when used as whimper, aims to show how annoyed or angry the person is, while remaining polite as suggested by the theory of politeness [63].
• Sarcasm as evasion: sarcasm, when used as evasion, aims to avoid giving a clear answer. In other words, rather than criticizing explicitly or saying something that might clearly offend the hearer, the speaker makes use of sarcasm to convey his intentions while remaining polite.
Bharti et al. [75] opted for a less complex classification, and defined 7 simple ''types'' of sarcasm. Their definition of sarcasm types is highly correlated with the usage of sarcasm in social media, as some of these types, by definition, refer to features extracted from social media. The 7 types are:
• T1: The contrast between positive sentiment and negative situation,
• T2: The contrast between negative sentiment and positive situation,
Similarly, Rajadesingan et al. [52] classified sarcasm into sub-classes based on how it is employed, rather than what it reflects:
• Sarcasm as a contrast of sentiments: This goes along with many observations made by previous researchers. In this sense, sarcastic utterances use sentimental/emotional words (e.g., ''I love'') to address or refer to situations that are incompatible with their context (e.g., ''being sick'').
• Sarcasm as a complex form of expression: This type is based on Rockwell's [76] observation that there is a small but significant correlation between cognitive complexity and the ability to produce sarcasm.
• Sarcasm as a means of conveying emotion: Here, sarcasm is treated as a means to convey one's emotions. In other words, in addition to being a form of aggressive humor [77] or verbal aggression [78], sarcasm is an indirect means of self-expression as well.
• Sarcasm as a possible function of familiarity: As suggested by [59], [76], sarcasm is more or less used by people towards ones that they are more familiar with. Nevertheless, having a shared knowledge of the language [79] and culture [80] is important to recognize and use sarcasm.
• Sarcasm as a form of written expression: While sarcasm has classically been addressed as a spoken form of expression, with the exponential growth of social media, people started conveying sarcasm within written texts, by including subtle markers that indicate that the phrase might be sarcastic.
Camp [12], in turn, proposed a different categorization of sarcasm. She suggested that ''different types of sarcasm take different 'scopes', and thereby produce different illocutionary and rhetorical results.'' Therefore, she addressed the conventional claim that sarcasm is straightforwardly an inversion of the meaning and suggested that sarcastic utterances rather ''pretend to undertake one commitment [. . . ] and they thereby communicate some sort of inversion of this pretended commitment.'' She then went ahead and identified 4 types of sarcasm:
• Propositional sarcasm: Here, a proposition is the target of sarcasm and implicit sentiment is conveyed, making the detection of sarcasm in this case quite hard without context. An example of this is ''He's a fantastic guy!'' which, without context, might be seen simply as a compliment.
• Lexical sarcasm: Here, sarcasm is quite clear and identifiable even without context. An embedded incongruity within the text itself makes the listener/reader identify the sarcasm without resorting to understanding the context itself. For instance, in the sentence ''Sam is such a gentleman that no girl wants even to give him a chance!'', the contradiction between the two pieces of information is a clear hint of sarcasm.
• ''Like''-prefixed sarcasm: This is also another instance of sarcasm that is easily detectable and quite often used. ''Like''-prefixed sarcasm is simply sarcasm where the word ''like'' precedes a piece of information that is not correct. Expressions such as ''Like I care.'' or ''He was like.. I am Bill Gates, aren't I?'' are quite often understood as ''I don't care'' and ''I don't have money,'' respectively.
• Illocutionary sarcasm: This type of sarcasm is less commonly used, yet it is the ''ultimate form of sarcasm''. Here, ''the speaker 'makes as if' to undertake a certain speech act S, where S would be appropriate in some counterfactual situation X that contrasts with the current situation Y.'' For instance, given a first date between two people (situation Y), where someone acted so poorly, the other person might say ''We should definitely go out again!'' (speech act S), which might be more suited if the first person had acted more appropriately (situation X).
Other categorizations of sarcasm and sarcastic statements have been proposed as well. The categorization of sarcasm has been used as the basis for the types of features and subtle markers within texts that could be used to locate sarcastic statements in written text. In the next section, we will discuss in further detail these features and the methods that have been proposed to automatically detect sarcasm.
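As an illustration of how a categorization can translate into a textual feature, the sentiment-contrast type discussed in this subsection (a positive expression such as ''I love'' applied to a negative situation such as ''being sick'') can be sketched as a simple rule. This is only a toy sketch: the two word lists and the function name has_sentiment_contrast are hypothetical and are not taken from any of the cited works, which rely on far richer resources.

```python
import re

# Hypothetical, tiny lexicons used only for illustration.
POSITIVE_SENTIMENT = {"love", "enjoy", "adore"}
NEGATIVE_SITUATIONS = {"being sick", "getting ignored", "waiting in line"}


def has_sentiment_contrast(text):
    """Return True if a positive word co-occurs with a negative situation phrase."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z']+", lowered))
    has_positive = bool(words & POSITIVE_SENTIMENT)
    has_negative_situation = any(phrase in lowered for phrase in NEGATIVE_SITUATIONS)
    return has_positive and has_negative_situation


flagged = has_sentiment_contrast("I love being sick on my day off")
```

A rule of this kind only captures the most explicit cases; types such as propositional or illocutionary sarcasm depend on context that is not visible in the text itself.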

IV. AUTOMATIC SARCASM DETECTION: METHODS AND APPROACHES
A. RESEARCH WORK ACQUISITION
As stated previously, works on sarcasm detection for sentiment analysis improvement have appeared relatively recently compared to other similar fields. To the best of our knowledge, the first published work to introduce this task was that of Tepperman et al. [41] in 2006. Since then, several works have been published, and a large proportion of them was addressing sarcasm detection in social media.
To acquire the different research work done in this field, we queried 3 different search engines for sarcasm detection papers. The 3 search engines we queried are: Google Scholar, IEEE Xplore and ACM Digital Library. We used the following expressions for the search: ''sarcasm detection,'' ''sarcasm recognition,'' and ''sarcasm in social media''.
In total, 264 papers were collected. Several of these were duplicates or irrelevant to the context of our work and were dismissed. We applied the following set of rules to filter the papers:
1) Only papers whose title or abstract directly indicates the idea of sarcasm detection as discussed in this paper are kept.
2) Duplicate papers, or papers from arXiv (or other preprint websites) whose final versions are found elsewhere, are removed.
3) Papers with very poor quality and no significant contribution are removed.
4) We browsed the references of some of the collected papers to find any significant work which we might have missed and included it as well.
The total number of papers directly related to the task of sarcasm detection in texts, in the sense addressed in this paper, is 153.
In the remainder of this section, we aim to summarize the existing work, focusing mainly on the data used, the methods implemented, and the results obtained.

B. SARCASM DATA SETS
Building corpora for automatic sarcasm detection has also been investigated in depth, as it is one of the challenges in the field. This is because sarcasm is very hard to identify and recognize, even for human annotators, and the disagreement between annotators is quite often noticeable [17], [81]. In other words, if the annotators largely disagree on what constitutes sarcasm, it might be hard to build a corpus with well-annotated data, unless the sarcasm within the texts is clear and the indicators in the text are very relevant. That being said, users of social media have invented explicit ways to indicate whether what they say is what they mean or not. In particular, on platforms like Twitter, hashtags such as ''#sarcasm'', ''#irony'' and ''#not'' are still used with sarcastic or ironic statements. By using such key hashtags to collect tweets, one could collect tweets that were manually ''labeled'' by their own writers as sarcastic.
Obviously, similarly to all buzz words and hashtags, these hashtags are quite often abused and/or used by bots to appear in the search results. However, they are still useful to collect an initial set of potentially sarcastic tweets, which needs to be cleaned afterwards.
The hashtag ''#sarcasm'' in particular was used to build several data sets [47], [71] of tweets collected from Twitter. Other works, such as that of Liebrecht et al. [45], suggested that ''#not'' is the way to go for sarcastic statement collection. However, E. Sulis [82] has shown that the 3 hashtags ''#sarcasm'', ''#irony'' and ''#not'' are quite different, and should not be confused with each other. Through their experiments on real data, they supported the arguments for the separation between ''#sarcasm'' and ''#irony''. More interestingly, the hashtag ''#not'' was qualified as a distinct phenomenon, separate from sarcasm and irony in their classical meanings.
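The hashtag-based collection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline of any cited work: tweets are assumed to be already available as plain strings (no API call is shown), and the function names self_label, strip_markers, and collect are hypothetical. The marker hashtag is kept as a separate label, since the three hashtags are reported to behave differently, and it is removed from the text so that a classifier cannot simply memorize it.

```python
import re

SARCASM_HASHTAGS = ("#sarcasm", "#irony", "#not")
HASHTAG_RE = re.compile(r"#\w+")


def self_label(tweet):
    """Return the marker hashtag carried by the tweet, or None.

    Markers are checked in the order of SARCASM_HASHTAGS, case-insensitively.
    """
    tags = [tag.lower() for tag in HASHTAG_RE.findall(tweet)]
    for marker in SARCASM_HASHTAGS:
        if marker in tags:
            return marker
    return None


def strip_markers(tweet):
    """Remove marker hashtags from the text and normalize whitespace."""
    def drop_marker(match):
        tag = match.group(0)
        return "" if tag.lower() in SARCASM_HASHTAGS else tag
    return " ".join(HASHTAG_RE.sub(drop_marker, tweet).split())


def collect(tweets):
    """Keep only self-labeled tweets, as (cleaned text, marker) pairs."""
    collected = []
    for tweet in tweets:
        label = self_label(tweet)
        if label is not None:
            collected.append((strip_markers(tweet), label))
    return collected


# Hypothetical example tweets, not taken from any data set.
sample = collect([
    "I just love Mondays #sarcasm",
    "Nice weather today",
    "Great, another meeting #Not #work",
])
```

The resulting set is only an initial pool of potentially sarcastic tweets; as noted above, it still needs cleaning to remove abused hashtags and bot-generated content.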

1) DATA SETS SOURCE
Throughout the years, several data sets have been built to train and evaluate sarcasm detection approaches. However, despite their diversity, the sources of the data are quite few. Following are the top sources of text data sets used for automatic sarcasm detection in the literature:
1) Twitter: Twitter has long been the first option for NLP tasks related to information extraction such as sentiment analysis and automatic sarcasm detection. This is because Twitter is an open platform allowing people to query its API to collect tweets with specific keywords. In particular, as stated above, a few hashtags could be very useful to collect sarcastic and sarcasm-related tweets. Several papers have used data sets collected from Twitter, such as [83]-[96].
2) Online shops and review websites: Review websites have also been a source of several data sets for tasks such as sentiment analysis. This is because reviews are by definition opinionated texts that show the writer's sentiment toward the product/service he/she is reviewing. Moreover, thanks to the scoring system available in many shopping and review websites, no manual annotation is required, as the author summarizes his review in a score which can roughly be evaluated as positive if it is high, negative if it is low, and neutral if it is in between. A simple, yet effective way to collect sarcastic texts from such websites is to collect texts whose sentiment opposes the score attributed by the author. Works that used data sets collected from review websites include, among others, [97], [98].

3) Reddit: Reddit is introduced by its creators as ''a network of communities based on people's interests.''
Reddit has increasingly been a source of information for people with different interests allowing them to rate and discuss any kind of topic, it being a product, service, public figure, etc. While not as straightforward as the previous sources of data, Reddit has the particularity of having more or less a conversation-like structure. This has attracted the interest of researchers as it offers more than just one-sided opinions of people, but rather more detailed discussions where people can ask for clarifications or argue against an opinion, etc.
Fewer works have used Reddit as a source of their data. This is because, unlike the previous sources, data collected from this platform require manual annotation. It is worth mentioning, however, that most of these works used the data set offered by Khodak et al. [99], the first of its kind, which contains 1.3 million sarcastic statements. Works that used data sets collected from Reddit include, among others, [96], [100], [101].
4) Others: In addition to the previously mentioned sources of data, a few works have used data from sources such as Facebook, TV series transcripts (e.g. ''Friends'', ''Daria''), texts extracted from Google Books, online forums, blogs, etc. [102]-[106].
In Table 3, we describe which works used these sources of data. As can be observed, most of the works used Twitter as their primary source of data for sarcasm detection. In addition, in Table 4, we give examples of some of the data sets available online which have been used in these works.
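The review-based heuristic mentioned above for online shops and review websites (collecting texts whose sentiment opposes the score given by the author) can be sketched as follows. This is a toy illustration under stated assumptions: the word lists, the 1-to-5 star scale with its thresholds, and the function names are all hypothetical, and a real system would use a proper sentiment analyzer rather than word counting.

```python
import re

# Hypothetical, tiny lexicons used only for illustration.
POSITIVE_WORDS = {"great", "love", "excellent", "perfect", "amazing"}
NEGATIVE_WORDS = {"terrible", "awful", "broken", "hate", "worst"}


def text_polarity(text):
    """Estimate the apparent polarity of a text by counting lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def score_polarity(stars):
    """Map a 1-to-5 star rating to a polarity (thresholds are an assumption)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


def sarcasm_candidates(reviews):
    """Return review texts whose apparent polarity opposes the author's score."""
    return [
        text
        for text, stars in reviews
        if {text_polarity(text), score_polarity(stars)} == {"positive", "negative"}
    ]


# Hypothetical example reviews as (text, stars) pairs.
candidates = sarcasm_candidates([
    ("Great product, love how it arrived broken twice. Amazing.", 1),
    ("Excellent phone, I love it", 5),
    ("Terrible battery", 1),
    ("It is fine", 3),
])
```

Only the first review is flagged here, since its wording looks positive while its score is negative. Texts flagged this way are candidates only, as a mismatch can also come from a mistaken rating or from a review that mixes praise and complaints.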

2) LANGUAGE
In terms of the language used in the data collected, most of the work presented above dealt with English texts. Not only is English the language used the most on the Internet, but also the tools for text processing and feature extraction are more mature for English than for other languages. Part-of-Speech tagging, lemmatization, stemming, automatic summarization, named entity recognition and relationship extraction are some of the basic NLP tasks that have reached impressive performance for English, while still struggling in other languages, in particular non-Latin derived ones.
In the context of sarcasm detection, works that addressed languages other than English are quite few. Latin descendant languages that have been addressed include, but are not limited to French [83], Italian [46], [83], Dutch [45], Czech [111], etc.
Non-Latin descendant languages also have been addressed in few works. These include the following ones. Lunando and Purwarianti [109] employed translated SentiStrength [161] to extract sentiment-related features from Indonesian text to perform sarcasm detection. Liu [112] introduced a set of features specifically for detecting sarcasm in social media for Chinese. Charalampakis et al. [121] compared supervised techniques with unsupervised ones for sarcasm detection in Greek. Dave and Desai [102] studied different classification techniques for sarcasm detection and experimented on Hindi blog reviews. Similarly, Bharti et al. [94] and Jain et al. [95] targeted Hindi in their work on sarcasm detection on Twitter. Al-Ghadhban et al. addressed the problem of sarcasm detection for Arabic, and evaluated their approach on a data set collected from Twitter and manually labeled. Suhaimin et al. [104] performed the same task for Malay on posts collected from Facebook. Samonte et al. [88] performed sentence-level sarcasm detection on tweets collected about government, politics, weather, social media, and public transportation for English and Filipino, and showed a significant difference in the results of their experiments.

C. SARCASM DETECTION: METHODS AND APPROACHES
The detection of sarcasm in the literature has mostly taken the form of a classification problem, with very few exceptions. The idea is roughly to run a classification task on a set of texts and identify which ones are sarcastic and which ones are not. That being the case, Artificial Intelligence (AI) has been the way to go to perform such a task.
Roughly speaking, as shown in Figure 2, AI could be thought of as the use of computers to mimic the behavior of the human brain in performing certain tasks. Machine Learning (ML) is a particular type of AI which, given a set of manually labeled data and a set of rules to extract patterns from these data, could learn how to deal reliably with new, unseen data. Deep Learning (DL) is a branch of ML in which learning the patterns and identifying which ones are relevant is automated and left for the computer itself to do.
In Figure 3, we show the overall flowchart of the use of ML and DL for classification. Basically, a manually annotated data set (a set of objects, i.e., texts, along with the class they belong to) is given to the ML or DL algorithm. This data set is usually referred to as the training set. The algorithm extracts specific patterns from these objects that allow it to recognize their classes. The process of learning these patterns and the relations between them is referred to as the training phase.
Upon training, the model is given an unknown object (i.e., does not belong to the training set), and is asked to identify its class by extracting the same features and comparing them to its knowledge. A good model should predict unseen objects with high accuracy. The main difference between ML and DL is the pattern extraction procedure itself. In ML, a human should teach the machine which features to extract from the input training data, upon which the machine builds its internal rules to recognize objects from these features. In DL, the human intervention is limited to the ''design'' of the neural network and its hyper-parameters. The network learns which features are relevant and which are not all by itself.
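To make the training and prediction phases described above concrete, the following is a minimal sketch of the conventional ML route: a toy multinomial Naive Bayes classifier over hand-chosen bag-of-words features. The four training sentences, their labels, and all function names are hypothetical and serve illustration only; they are not taken from any of the surveyed works.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (tokens, label) pairs.
TRAINING_SET = [
    ("oh great another monday".split(), "sarcastic"),
    ("i just love waiting in traffic".split(), "sarcastic"),
    ("the meeting is on monday".split(), "regular"),
    ("traffic was light today".split(), "regular"),
]


def train_nb(examples):
    """Training phase: count classes and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict_nb(model, tokens):
    """Prediction phase: pick the class with the highest log-probability."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        # Laplace (add-one) smoothing so unseen words do not zero the product.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb(TRAINING_SET)
prediction = predict_nb(model, "oh great more traffic".split())
```

In a DL pipeline, the manual choice of features (here, word counts) would instead be left to the network.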
In the context of this paper, ML, and more recently DL, have been the dominant methods to detect sarcasm. Nevertheless, other approaches that do not use supervised learning have been proposed. In the rest of this subsection, we summarize the methods used and approaches proposed.

1) RULE-BASED APPROACHES
Rule-based approaches define a set of rules according to which a statement is judged as sarcastic or not. For instance, Maynard and Greenwood [71] have used hashtags to identify sarcastic statements. Part of their work relied on explicit hashtags such as ''#sarcasm'' and ''#Irony'' to identify sarcasm. However, they also investigated more complex hashtags in more detail: they proposed an approach to re-tokenize the hashtags and use the information extracted from them to identify whether a statement is sarcastic or not. For example, in the text ''You are more than welcome! #notreally'', the hashtag is transformed into the expression it stands for, ''not really'', which contradicts the content of the tweet. Therefore, it could be concluded that this tweet is indeed sarcastic. Riloff et al. [110] proposed a method to detect a particular type of sarcasm in which the author uses a positive sentiment to describe his feelings towards a negative situation. Their method relies on a bootstrapping algorithm that starts with the seed word ''love'' and a set of sarcastic tweets to build a set of positive sentiment words and a set of negative situations, which they used to judge when there is sarcasm and when there is not. Other works such as [47] iterated further on the idea of contradiction between positive and negative components within a piece of text to decide whether or not it is sarcastic.
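The hashtag rule described above can be sketched as follows. This is only an illustration of the general idea, not the implementation of [71]: the word list, the set of negating phrases, and the greedy longest-match segmentation are all assumptions made for the example.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
EXPLICIT_TAGS = {"sarcasm", "sarcastic", "irony"}
KNOWN_WORDS = {"not", "really", "yeah", "right", "as", "if", "so", "true"}
NEGATING_PHRASES = {"not really", "yeah right", "as if"}


def split_hashtag(tag, vocabulary=KNOWN_WORDS):
    """Greedily segment a hashtag body into known words (longest match first)."""
    tag = tag.lower()
    words, start = [], 0
    while start < len(tag):
        for end in range(len(tag), start, -1):
            if tag[start:end] in vocabulary:
                words.append(tag[start:end])
                start = end
                break
        else:
            return None  # no known word starts here: segmentation failed
    return words


def is_sarcastic_by_hashtag(text):
    """Rule 1: explicit hashtag. Rule 2: hashtag that negates the statement."""
    for tag in re.findall(r"#(\w+)", text):
        if tag.lower() in EXPLICIT_TAGS:
            return True
        words = split_hashtag(tag)
        if words and " ".join(words) in NEGATING_PHRASES:
            return True
    return False
```

A real system would need a full dictionary and a more robust segmentation strategy; the greedy matcher here can fail on ambiguous hashtags.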

2) MACHINE LEARNING APPROACHES
Conventional machine learning, in particular, has been intensively explored. Most of the existing works to date have followed the same pattern: extract a set of features and use machine learning algorithms such as Naive Bayes [162], Support Vector Machine (SVM) [163], or Maximum Entropy [164]. Features are manually engineered and carefully chosen to highlight any sarcasm-related information.
In Table 5, we summarize the most common types of features that have been used to train such classifiers. These types of features are explained in more details below.

a: LEXICAL FEATURES
Lexical features are simply features that use the basic components of a given text, such as n-grams, hashtags, etc. These are the most basic types of features, yet they are employed more than any other type of features. They have been used not only in sarcasm detection, but also in sentiment analysis and hate speech detection. Lexical features have been used in most of the existing work [42], [47], [108], and have given promising results.
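As an illustration of lexical features, the sketch below counts word n-grams (unigrams and bigrams by default) while keeping hashtags as single tokens. The tokenization rule and the function names are assumptions made for this example; the surveyed works use a variety of tokenizers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep words and hashtags as tokens."""
    return re.findall(r"#?\w+", text.lower())


def ngram_counts(text, n_max=2):
    """Count all word n-grams of length 1..n_max in the text."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


features = ngram_counts("I love Mondays #sarcasm")
```

The resulting counts can be fed to any of the classifiers discussed in this subsection.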

b: PRAGMATIC FEATURES
Pragmatic features are features that exploit features other than the text itself, such as emoticons, user mentions and some hashtags. Pragmatic features are mostly used in social media-collected data sets, and have proven to be very efficient in detecting sarcasm on such data sets. Pragmatic features have been used in several works such as [47], [72], [108], [119], [121].

c: PATTERN FEATURES
Pattern features are features that exploit the repetitiveness in user-text when expressing sarcasm. Common expressions showing sarcasm have been widely used (e.g., ''I love it when + negative clause''). Multiple works [17], [44], [47] have used this family of features, and patterns are built either by exploiting the frequency of usage of words, their grammatical functions or pre-built expressions.

d: CONTEXTUAL FEATURES
Contextual features are features that exploit the context of the text, in addition to its content. Contextual features require a knowledge beyond the text itself. For instance, if a given message (typically a tweet in the case of Twitter data sets) is a reply to another one, the knowledge of the content of the original message could help identify sarcasm more accurately. Nevertheless, the knowledge of at least a sub-part of the content of the data set itself is required. Several approaches [41], [52], [74], [115] rely on the understanding of what makes a situation negative or positive, which requires either manual effort by the annotators or building a system to collect a data set of such negative situations.

e: SENTIMENT AND EMOTION FEATURES
Sentimental features are simply the same kind of features used typically in sentiment analysis classification. They include features related to the usage of sentimental words (i.e., positive and negative words), exaggeration of expressing emotions and contradiction between emotional words within the same sentence. Works that used this type of features include [73], [74].

f: BEHAVIORAL FEATURES
Behavioral features [52], [131] are features that make use of the understanding of the behavior of the Internet user to identify his typical behavior in normal situations and when he is employing sarcasm. Such understanding is built over a certain period of time, and is used to identify when sarcasm is employed and when it is not. Behavioral features could be seen as the observation over time of other types of features. This is because the way these other features change over time is the fundamental information used.

g: SYNTACTIC FEATURES
Syntactic features [104], [111], [112] are features related to the arrangement of words and phrases to create well-formed sentences. Several of the so-called memes are used to express sarcasm, and intentionally use grammatically wrong sentences (e.g., ''All your base are now belong to us'') to mock others' lack of knowledge, their bad language, and in general to emphasize any bad quality. Syntactic features are highly correlated with the idea of Part of Speech (PoS) tags, as typical sentences follow certain patterns of PoS tags.

h: METAPHORIC FEATURES
A metaphor is a figure of speech in which a word/expression is used to describe an object, event or idea where it is not literally applicable. For example, using the expression ''lone wolf'' to describe introverts is a common metaphor. Metaphoric features [47] are ones that make use of metaphor to express sarcasm. This also includes the use of commonly agreed on knowledge to ridicule something.

i: HYPERBOLE FEATURES
Hyperbole features [49], [97] are similar in nature to metaphoric features. They use extreme comparisons and exaggerations to make a point or show emphasis. In the context of sarcasm, such features are employed to tell the opposite of something quite obvious. For example, one might refer to an obese person by saying ''He's as skinny as a toothpick.''

j: SOCIOLINGUISTIC FEATURES
Sociolinguistic features are user-related features, which focus on information about the author rather than on information extracted from the text. They include, for example, the age, the gender, etc. Some works that used sociolinguistic features include [115] and [48].

k: PROSODIC FEATURES
Prosodic features [41], [104], [152] are features focusing on the elements of speech in a sentence as a whole such as the intonation, tone, stress and rhythm. Such features are usually used in vocal speech. However, they usually translate into other forms in written text, such as the repetition of a certain vowel or the use of capitalization to convey some intonation, etc.

l: PUNCTUATION FEATURES
Similar to how prosodic features translate into some particular use of capitalization, etc., punctuation can also reveal some sort of intention the user might want to convey. For instance, the excessive use of exclamation marks (e.g., ''That is amazing!!!!!''), or question marks (e.g., ''Oh really I didn't know!!'') could reveal the sarcastic intent of the user. Several works have used punctuation features, along with others, for sarcasm detection. They include [100], [113], [116], [122], etc.
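The written-text cues mentioned for prosodic and punctuation features (repeated vowels, capitalization, runs of exclamation or question marks) can be extracted with a few regular expressions. The sketch below is one possible feature set chosen for illustration; the feature names and thresholds (e.g., three or more repeated characters) are assumptions, not those of the cited works.

```python
import re


def surface_cues(text):
    """Count simple punctuation and written-prosody cues in a text."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        # Runs of two or more '!' or '?' characters, e.g. "!!!" or "?!".
        "repeated_punct": len(re.findall(r"[!?]{2,}", text)),
        # Words with a character repeated three or more times, e.g. "sooooo".
        "elongated_words": sum(1 for w in words if re.search(r"(.)\1{2,}", w)),
        # Fully capitalized words of more than one letter, e.g. "AMAZING".
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
    }


cues = surface_cues("That is AMAZING!!!!! Sooooo great?")
```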

m: SEMANTIC FEATURES
Semantic features [46], [112] are ones particular to languages, as they relate to the meaning in the language or the logic behind common expressions or phrases, etc.

n: RHETORICAL FEATURES
Rhetorical features [112] are specific to some languages such as East-Asian ones (e.g., Chinese and Japanese). They include extreme nouns, adjectives or adverbs, as well as titles of degrees and honorifics.

o: PERSONALITY FEATURES
Personality features [53], [82] are features related to the behavior and thought patterns of people. Models pre-trained for the automatic detection of personality traits (the Big 5, for example) could be applied to one's posts/comments/tweets to extract his personality traits, which in turn could be used as features to identify sarcasm.

p: STYLISTIC FEATURES
People possess their own idiolect and authorship styles, which is reflected in their writing. These styles are generally affected by attributes such as gender, diction, syntactic influences, etc. Works that used this type of features include [114], etc.

q: IDIOSYNCRATIC FEATURES
The term ''Idiosyncrasy'' refers to an odd habit or a peculiar way of behavior/thought. It is commonly used to express eccentricity or more generally strange and weird attributes. In the context of linguistics, the term could refer to very strange and unusual expressions, metaphors or comparisons that are not usually employed in conversations. If employed, they intend to bring a particular meaning or cue to the conversation, including sarcastic cues [104], [165].

r: EMBEDDINGS
Word embeddings [90], [92], [166] are numeric representations of words and expressions which are obtained through advanced techniques such as ''skip-gram'' [167]. While the meaning of these numerical values is hidden and not directly interpretable by humans, neural networks, in particular, make use of such representations to perform complex tasks related to NLP.
Classifier-wise, most of the works have used the following classifiers:
• Support Vector Machine: SVM is a robust prediction method based on statistical learning frameworks or VC (Vapnik-Chervonenkis) theory [168]. In an SVM, training examples are mapped into points in a multidimensional space with the objective of maximizing the gap between the two classes (for the case of binary classification). In the context of sarcasm detection, SVM is amongst the most used algorithms and among the best performing in terms of classification accuracy and precision. LibSVM [169] is probably the most used implementation of SVM.
• Naive Bayes: Naive Bayes is probably the simplest, yet one of the top performing classifiers in NLP-related classification tasks. A Naive Bayes classifier is a probabilistic classifier based on the idea of applying Bayes' theorem with strong independence assumptions between the features.
• Maximum entropy: A Maximum Entropy classifier [170] is a discriminative classifier based on the statement that suggests that the probability distribution which best represents the current state of knowledge is the one with the largest entropy. This classifier is widely used in NLP problems, including sarcasm detection.
• Logistic Regression: Logistic regression [171] is basically a statistical model that models a binary dependent variable, and thus at its core is not a classification operation. However, transforming it into a binary classifier could be done by defining some threshold for the continuous output, below which the input is judged as belonging to one class, and above which the input is judged as belonging to another class. Regression in the context of NLP has no clear meaning. However, Logistic Regression-based classifiers have shown great potential.
• Random Forest: Random Forest classifiers [172] are a common type of decision tree-based classifier used in a variety of tasks. In Random Forests, multiple decision trees are constructed, and an ensemble method is applied to them. Decision trees have shown great potential in tasks related to NLP, and are among the top performing classifiers.
• k-nearest neighbors (kNN): The k-nearest neighbors classifier is a non-parametric classification method in which the input consists of the k closest training examples in the data set and the output is a class membership. A definition of distance is required to identify what constitutes a ''near'' neighbor. Depending on the value of k and the weighting used to favor closer neighbors, the classification is basically done by averaging the weights of the classes of the k closest examples from the training set and picking the maximum one.
Other classifiers used include decision trees, SVM-Hidden Markov Models (SVM-HMM), Gradient boosting [173], and Searn [174].
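As an example of how one of the classifiers listed above operates, the following is a minimal kNN sketch over bag-of-words token sets. It assumes the Jaccard distance as the definition of ''near'' and an unweighted majority vote; both are choices made for this illustration, and the toy training examples and their labels are hypothetical.

```python
from collections import Counter

# Hypothetical toy training set: (tokens, label) pairs.
TRAIN = [
    ("i love being ignored".split(), "sarcastic"),
    ("i love waiting in line for hours".split(), "sarcastic"),
    ("great another monday".split(), "sarcastic"),
    ("the concert was great".split(), "regular"),
    ("see you at lunch".split(), "regular"),
    ("thanks for the lovely gift".split(), "regular"),
]


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the two token sets."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def knn_predict(train, query_tokens, k=3):
    """Return the majority label among the k nearest training examples."""
    neighbors = sorted(
        train, key=lambda example: jaccard_distance(example[0], query_tokens)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


label = knn_predict(TRAIN, "i love being stuck in traffic".split(), k=3)
```

Weighting each vote by the inverse of its distance, as mentioned above, would be a straightforward extension.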
In Tables 6 and 7, we show a summary of the works which used the sets of features introduced above. In Table 8, we show a summary of the works which used the machine learning algorithms described above.

3) DEEP LEARNING APPROACHES
With the recent advances in the field of Deep Learning (DL), mainly the contributions of Lecun et al. [176], Hinton et al. [177] and Krizhevsky et al. [178], it became possible to train big Neural Networks (NN) in a reasonable amount of time while keeping the training convergent most of the time. This has led to more interest in the use of NN in almost all learning-related fields, and DL has replaced older techniques, including conventional ML, in tasks varying from image recognition [179] to natural language translation [180], or even text generation [181] and style transfer [182].
Text mining has been one of the main domains that have profited from this technique. In particular, for the task of automatic sarcasm detection, several works have been introduced in the past few years that used deep learning. While the common stream suggests using Recurrent Neural Network (RNN)-based techniques for text processing and classification, as text is usually considered a sequence of words, a few works have used more conventional approaches that are essentially based on CNNs to perform the classification. In Table 9, we present a brief summary of these works.
The main families of DL approaches are the following:
• Long Short-Term Memory (LSTM): LSTMs were invented to solve the vanishing gradient problem that often occurs when training RNNs, in which long-term previous components have an exponentially decreasing effect on later components. LSTMs can learn order dependence in sequence prediction problems, including long-term dependence.
• Bidirectional-LSTM (Bi-LSTM): Bi-LSTMs are a particular type of LSTM in which two ''independent'' LSTMs are put together, one processing the time-dependent items (i.e., words in our case) in chronological order, and the other in reverse order. This allows the networks to have both backward and forward information about the sequence at every time step.
• Attention Networks: attention networks are basically an iteration over classic RNNs and LSTMs, in which the encoder-decoder architecture is ''freed'' from the fixed length internal representation. While a classic LSTM forces the encoder to take into account all the previous items, an attention model allows it to focus only on certain inputs in the input sequence for each output item.
With regards to the current work, we explore in the next sections two main streams of approaches: ones that employ conventional machine learning trained with several feature sets including patterns, and ones that use deep learning (LSTM and LSTM with attention). We also study sarcastic statements on 3 different platforms: Twitter, Reddit, and some News websites.
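The idea of letting a model focus on certain inputs, as described for attention networks above, can be sketched in a few lines. The example assumes dot-product scoring followed by a softmax, which is one common formulation; the text above does not specify which scoring function the surveyed works use, and the vectors are toy values.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Weight each hidden state by its dot-product similarity to the query."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, states))
        for d in range(dim)
    ]
    return weights, context


# Toy hidden states for a three-word input and a toy query vector.
weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The context vector is a weighted average of the hidden states, so states that match the query contribute more to the output.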

D. REPORTED RESULTS
In the literature, several Key Performance Indicators (KPIs) have been used to evaluate the efficiency of the proposed approaches. Following is a list of the most commonly used ones and how they are measured.
• True Positive Rate (TPR): refers to the ratio of correctly classified positive (here, sarcastic) elements over all the elements that are actually positive.
• Precision: refers to the ratio of relevant elements over the retrieved instances.
• Recall: refers to the ratio of relevant elements that are retrieved over the total amount of relevant elements. For the positive class, it is identical to the TPR.
• F1-score: a measure that combines both precision and recall, usually used to compare different approaches, and defined as follows:
$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The F1-score is an instance of a larger family of scores referred to as F_β, where β is a coefficient that sets the weight of the recall relative to that of the precision. In general, F_β is defined as follows:
$$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$$
The F1-score is sometimes referred to as the F-measure or simply, yet somewhat imprecisely, F1, all of which are commonly accepted.
• Accuracy: across an entire data set, accuracy measures the ratio of correctly classified instances, regardless of their class, over the entire set of input instances.
• Area under the curve (AUC): the AUC measures the ratio of the area under the Receiver Operating Characteristic (ROC) curve to the total TPR to False Positive Rate (FPR) area. The ROC curve itself is a graph that shows the performance of a classification model at different classification thresholds.
In Table 10, we show some of the results reported in some of the works existing in the literature. It is somewhat inaccurate, though, to directly judge these works by comparing them to one another, since the data sets that were used are quite different, and the results reported are highly correlated to their respective data sets. We limit the reported results to ones run on Twitter data sets. Note that for works that reported results for multiple approaches or on multiple data sets, we limit the shown results to the best reported ones.
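For reference, the standard definitions of precision, recall, the F_β family, and accuracy can be computed from the confusion counts as in the sketch below. The label vectors are made-up toy values (1 standing for sarcastic, 0 for non-sarcastic) and the function names are chosen for this example only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def f_beta(p, r, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta=1 gives F1."""
    b2 = beta * beta
    denominator = b2 * p + r
    return (1 + b2) * p * r / denominator if denominator else 0.0


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


# Toy labels: 1 = sarcastic, 0 = non-sarcastic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
p, r = precision(tp, fp), recall(tp, fn)
```

On these toy labels, precision is 0.6, recall is 0.75, and accuracy is 0.7, which shows how the indicators can diverge on the same predictions.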

E. SUMMARY
In Figure 4, we show a summary of the most relevant work proposed over the last decade or so regarding the automation of sarcasm detection. As can be seen, most of the work was performed on data sets collected from Twitter. In addition, unsurprisingly, English is the language on which most of the studies have been performed. Other languages have been addressed in very few works in the literature. Moreover, while the vast majority of these works have used machine learning (SVM, Random Forest, and Naive Bayes), recent years have seen an increase in the use of DL-based methods, in particular, CNN, LSTM and transformers. We can clearly observe that DL approaches started to appear mostly after 2015, with the advances that this field has seen, in particular ones related to NLP.

V. CHALLENGES AND OPEN RESEARCH PROBLEMS
Below are some of the challenges related to the task of sarcasm detection that are yet to be investigated further and present open problems for research.

A. ANNOTATION OF THE DATA
As previously stated, creating an annotated sarcasm data set remains a major challenge. Despite the efforts made by several researchers such as Khodak et al. [99], Riloff et al. [110], Filatova [81] and Bouazizi and Ohtsuki [47], a well-elaborated data set is yet to be built. On the one hand, sarcasm is in many cases contextual. This context comes at different levels: previous messages in the conversation that led to the sarcastic statement, the relation between the speaker and the listener, the fact (if it exists) that the sarcasm negates, etc. On the other hand, sarcasm, as suggested by Davidov et al. [17] and Bouazizi and Ohtsuki [47], comes with different ''intensities'' or different levels. Under such an assumption, when annotating a text, it might be important to attribute a score or an intensity level to each piece of text.
On a related topic, comparing works to one another is quite hard when each is experimenting with a self-made data set. Making a standard data set for sarcasm detection evaluation is very important. Here, a competition such as PAN at CLEF has proven to be a good reference, allowing researchers to compete and create a robust benchmark for future works in multiple NLP tasks such as hate speech detection, author profiling, etc. Similar competitions for sarcasm detection can be a starting point for such a robust reference for benchmarking and evaluating future works.

B. SARCASM AND IRONY
In several works in the literature, sarcasm was confused, intentionally or unthinkingly, with irony. However, as stated above, Littman and Mey [61] suggested that sarcasm and irony are not necessarily conjoined in speech. Distinguishing one from the other is a quite hard task, and is by far more challenging than distinguishing sarcastic statements from normal ones. An interesting task would be indeed to tackle this problem and see if sarcasm can really be detected or not.

C. SENTIMENT ANALYSIS: A TOOL OR A GOAL
When introduced in their respective papers, sarcasm detection is addressed by the authors as a means to correct misclassified instances in a sentiment analysis exercise. However, in most of these works, sentiment analysis is used to actually extract features related to the polarity of the text. This leads to the obvious question: is sentiment analysis a tool, among others, to help identify the sarcasm within a statement, or rather a target? In the latter case, where sarcasm is detected, the sentiment of the text is usually judged as the opposite of what it appears to be.
In a real use case, one would suggest that a data set is to be mined for sentiment analysis. Each piece of text is processed to identify its sentiment. Either sarcasm detection is first applied, and the set of texts judged as sarcastic is processed in a ''special'' way, whereas the rest is processed using the conventional sentiment analysis method, or sentiment analysis is applied regardless of whether sarcasm is present or not.
This brings the next question: how often is sarcasm used? In other words, is it really worthwhile to process a tremendous amount of data for two tasks (sentiment analysis and sarcasm detection) if sarcasm is employed in only a small fraction of the data? Other techniques such as sentiment quantification [36]-[39] were proposed in the literature to address partially wrong classified instances. The idea behind it is quite straightforward: when performed on a small data set, sentiment analysis will have a certain number of wrongly classified instances per class. Measuring how this error changes for different proportions of the classes in the data set could help learn how to rectify and interpret the results of classification when applied on a new data set. This could solve the problem of misclassification in a more global way, as usually identifying the polarity of the individual texts does not matter as much as identifying that of the entire data set.
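One well-known way to rectify class proportions from a classifier's known error rates is the so-called adjusted classify-and-count correction, sketched below. It is given here only as an illustration of the quantification idea; whether [36]-[39] use this exact correction is an assumption, and the rates in the example are made-up values.

```python
def adjusted_count(observed_positive_rate, tpr, fpr):
    """Estimate the true share of positives from the observed share.

    observed = prevalence * tpr + (1 - prevalence) * fpr, solved for
    prevalence. tpr and fpr are measured beforehand on labeled data.
    """
    if tpr == fpr:
        raise ValueError("classifier is uninformative (tpr equals fpr)")
    estimate = (observed_positive_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion


# Toy example: a classifier with TPR 0.8 and FPR 0.1 labels 31% of a new
# data set as positive.
estimated_prevalence = adjusted_count(0.31, tpr=0.8, fpr=0.1)
```

The correction works at the level of the entire data set: it does not say which individual texts were misclassified, only how many positives there are likely to be overall.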

D. SARCASM DETECTION AND EXPLAINABLE AI
Explainable Artificial Intelligence (XAI), also known as interpretable AI, is a sub-field of Artificial Intelligence (AI) in which the AI models and/or results are presented in a way that can be understood by humans. This is unlike the concept of ''black box'' models in machine learning, in which the human cannot, and sometimes is not even supposed to, know how the results are obtained and how the models are built. The concept of ''black box'' is not limited to the users of machine learning models, but also applies to the designers themselves as, most of the time, they do not understand how the decisions made by their models are produced and how they can explain them to their model users. XAI addresses the idea of helping users understand how models work, by explaining the decision making process with reference to how a human would make the decision himself, and dismantling their misconceptions of how AI works. In theory, not only does this make the models more relevant, but it also builds a certain level of confidence towards them, allowing for a wider acceptance of AI among non-experts.
Generally speaking, different families of models have different levels of explainability as shown in Figure 5. Models such as decision trees are commonly known for being much more explainable than ones that rely on deep learning. On the other hand, as shown previously and as commonly agreed on, deep learning models are much more powerful in classification tasks than conventional machine learning ones.
With regards to sarcasm detection, a very important question arises: the intuition that the human brain develops to understand and recognize sarcasm is much more complicated than that developed to recognize sentiments, for example. In the case of sentiment analysis, one would assume that the presence of multiple positive words with no negation means the text containing them is mostly a positive one, and vice-versa. However, when it comes to sarcasm, the intuition is quite different, and generally speaking, the context needs to be taken into account to be able to detect it.
A few works, such as those of Riloff et al. [110] and Bouazizi and Ohtsuki [47], used an intuition similar to that used for sentiment analysis to build their models: if a positive word co-exists in a text with a situation that is usually considered as negative, the text is likely to be sarcastic. The explainability of such models is, in that sense, much easier than that of transformer- or LSTM-based models.
That being said, tools for XAI have been receiving a great deal of attention in the past few years. Namely, tools such as Shapley Additive exPlanations (SHAP) [183], [184] and Local Interpretable Model-agnostic Explanations (LIME) [185] have been widely accepted and used as tools for XAI. The way these tools work, however, does not address the intuition behind feature engineering as much as it addresses the features themselves and their contribution to giving a good prediction, regardless of their meaning. They also operate at sample-level rather than model-level. In other words, the expected output of these tools is as follows: given an instance (e.g., tweet or text), how was the decision made to tell if it is sarcastic or not, and which features contributed to the identification of its class. With that in mind, an interesting task would be to see how much XAI could help narrow down the gap between how the model operates and how humans perceive, recognize and understand sarcasm, and whether or not these models follow intuitions similar to those of humans to tell if an instance is sarcastic or not.
A notable early attempt has been made by Kumar et al. [154], where the authors aimed to make the learning model used to detect sarcasm in conversations interpretable. However, in their work, the authors relied mostly on individual words as indicators of sarcasm, which contradicts the intuition that sarcasm requires knowledge of the relation between meanings in a sentence and the overall context for a better judgement as well as for a justified and convincing explanation.
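To illustrate what a sample-level explanation looks like, the sketch below uses a simple leave-one-word-out (occlusion) attribution: each word is removed in turn and the change in the model's score is recorded. This is a deliberately simpler technique than SHAP or LIME and does not use either library; the scoring function and its two word lists are hypothetical stand-ins for a trained sarcasm detector.

```python
# Hypothetical word lists standing in for a trained model's knowledge.
POSITIVE_WORDS = {"love", "great"}
NEGATIVE_SITUATIONS = {"ignored", "traffic"}


def toy_sarcasm_score(tokens):
    """Toy scorer: high when a positive word meets a negative situation."""
    has_positive = any(t in POSITIVE_WORDS for t in tokens)
    has_negative = any(t in NEGATIVE_SITUATIONS for t in tokens)
    if has_positive and has_negative:
        return 1.0
    return 0.2 if (has_positive or has_negative) else 0.0


def occlusion_attribution(tokens, score_fn):
    """Score drop caused by removing each token, one at a time."""
    base = score_fn(tokens)
    return [
        (token, base - score_fn(tokens[:i] + tokens[i + 1:]))
        for i, token in enumerate(tokens)
    ]


attributions = occlusion_attribution("i love being ignored".split(),
                                     toy_sarcasm_score)
```

In this toy case the two words forming the positive/negative contrast receive all the attribution, which matches the intuition of the contrast-based models discussed above; with a deep model, the attributions may or may not align with such human intuition.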

E. LIMITS OF SARCASM DETECTION MODELS
In the previous sections, we tried to summarize the majority of the existing approaches and methods for sarcasm detection which we found. However, one question is yet to be answered: do these approaches and methods indeed perform as efficiently as expected when used on data collected outside of the context of their respective data sets? In other words, it is important to measure the generalizability of these approaches and whether they can indeed be used on real-world data sets other than the ones they were optimized for. For instance, given a model that has been developed using a data set collected a few years ago on a platform like Twitter, would it be able to detect sarcasm in present-day tweets, or even in statements collected elsewhere such as YouTube comments, Facebook posts or news websites? This leads to the questions we previously raised in Section I:
[Q1] Does the way people express sarcasm differ from one platform to another, and does it depend on the level of ''mastery'' of the language?
[Q2] Does the way people express sarcasm evolve over time, in particular, on social media where sarcastic statements are ''driven'' by some influential users?
[Q3] Is it safe to affirm that, for a given piece of text, if sarcasm is employed, the overall polarity of that text is the opposite of the apparent one?
With that in mind, in the next Section, we investigate this challenge in more detail, and aim to answer these questions with a set of experiments we run on data sets collected from different platforms and at different timestamps.

VI. EVOLUTION OF SARCASM OVER TIME AND ACROSS PLATFORMS
As stated above, in this section, we try to answer the following 3 questions, which we have previously introduced in Section I: [Q1] Does the way people express sarcasm differ from one platform to another, and does it depend on the level of ''mastery'' of the language? [Q2] Does the way people express sarcasm evolve over time, in particular, on social media where sarcastic statements are ''driven'' by some influential users? [Q3] Is it safe to affirm that, for a given piece of text, if sarcasm is employed, the overall polarity of that text is the opposite of the apparent one? To do so, we will compare the results of some of the best-performing approaches on data sets collected from different sources at different time periods. We will start by introducing our experiment specifications and the data sets used. We then show the results of the different classification tasks that we run. Finally, we will discuss these results. Throughout the discussion, we will be answering the aforementioned questions.

A. EXPERIMENTAL SPECIFICATIONS
In the current work, we use both deep learning and machine learning approaches to run the classification. We use 3 different implementations of sarcasm detection systems: a pattern-based approach [47] and two deep learning approaches that we describe in more detail in this section. The first one uses [34] to identify sarcastic statements on 3 different data sets. However, the main focus of this work is not to compare these approaches to one another. In a first step, we try to identify whether sarcasm is expressed the same way on different platforms: a classification task is run to guess, for a given short text, whether it is a sarcastic tweet, a sarcastic news headline or a sarcastic Reddit comment. A similar classification task is performed to guess the temporal context of a given tweet: the classifier tries to guess whether a tweet was posted in 2015, 2017 or 2019. In a second step, we use the pattern-based approach for sarcasm detection to identify the most common patterns used on each platform/temporal context, how often they are used on each platform, and whether there are patterns that are commonly used across the different platforms. Lastly, we take a closer look at the different data sets: we take a random set of samples from each and identify whether the sarcastic texts are polarity switchers or not. For each text, we check whether its actual sentiment polarity is the same as the one returned by a sentiment analysis tool (which does not recognize sarcasm).

1) DATA
Two different manually annotated data sets are used in this work:
• A set of tweets collected at three different points in time: mid 2015, mid 2017 and early 2019. This set will be referred to as ''Set I''. These data have been collected using the Twitter streaming API by querying it for the hashtags ''#sarcasm'' and ''#not''. In total, over 180,000 tweets were collected. The data have been manually checked and cleaned up by removing duplicates. They are also cleaned by removing all sorts of noise (e.g., non-English tweets, and ones with images or URLs to external links). The resulting tweets are capped to a certain number to keep the data set balanced. The tweets from each time span are split into two sub-sets, as shown in TABLE 11: a training set and a test set. As can be seen, the data from 2017 are small in size compared to the other two time spans. This is because these data were collected over a shorter time span, resulting in a smaller quantity even before cleaning.
• A set of sarcastic statements collected on three different platforms: Twitter, Reddit and some news websites. This set is referred to as ''Set II''. The details of the different sub-sets of ''Set II'' are as follows:
- The data for Twitter include for the most part tweets from the set previously described (i.e., ''Set I'').
- Reddit data are available in several previous works, including IAC-SARC2 [158] and IAC-Subset [Walker et al.] [159].
- Sarcastic news headlines are available online and a version of them can be obtained from Kaggle.
The overall data set has been cleaned as well. To make sure the classifier does not rely on features such as hashtags or the length of sentences, we made sure that only the textual components of the texts are kept and that statements from the different platforms have a similar average number of words per sentence. We have split this data set into two sub-sets as shown in TABLE 12: a training set and a test set.
As mentioned above, patterns are used, later in this paper, to identify whether there are commonly used expressions to express sarcasm on a given platform, and whether these platforms share common sarcastic patterns. To generate patterns that are purely sarcastic, we need a set of non-sarcastic statements (for each type of data) against which we check the sarcastic ones. Therefore, we created 3 data sets (one for each platform: Twitter, Reddit and the news headlines, in that order) composed of sarcastic and non-sarcastic posts collected from the 3 platforms. These 3 data sets are referred to as Sets ''III-1'', ''III-2'' and ''III-3'' respectively. The data set details are given in TABLE 13.
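The cleaning and capping steps described for ''Set I'' can be sketched as follows. This is only a minimal illustration under stated assumptions, not the pipeline actually used: the function name `clean_and_cap`, the URL regular expression and the cap value are hypothetical, and the language filter and manual checks are omitted.

```python
import random
import re

# Assumed URL matcher; the exact noise filters used for the data sets are not given.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def clean_and_cap(texts, cap, seed=0):
    """Drop texts containing URLs and duplicates, then keep at most `cap` texts."""
    seen = set()
    kept = []
    for text in texts:
        if URL_RE.search(text):
            continue  # treated as noise: links to external content
        key = " ".join(text.lower().split())  # case/whitespace-insensitive key
        if key in seen:
            continue  # duplicate of a text already kept
        seen.add(key)
        kept.append(text)
    if len(kept) > cap:
        # Random down-sampling so that every sub-set has a comparable size.
        kept = random.Random(seed).sample(kept, cap)
    return kept
```

Applying the same cap to each time span (or platform) is what keeps the resulting sub-sets balanced.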

2) HARDWARE AND SOFTWARE CONFIGURATIONS
For this experiment, we use a machine running Windows 10 Pro, with the following hardware: The neural network is built using Python 3.7 and PyTorch. On the other hand, the manually engineered patterns are extracted using SENTA [35], an open-source tool to extract features from texts. The tool mechanics have been modified to distinguish sarcastic texts from non-sarcastic ones instead of identifying the sentiment of the texts.

B. PROPOSED APPROACH
In Figure 6, we show the architecture of our proposed approach for sarcasm detection. The approach itself is an extension of a previous work of ours [47], in which we propose to use patterns for sarcasm detection. We will explain in detail what each part of the diagram means.
Given a piece of text t, we start by cleaning it (part (a) of the diagram). By cleaning, we refer to the removal of URLs and tags, the replacement of slang words and abbreviations by their corresponding full expressions, etc. Two instances of the text are then created: the first one goes through the Neural Network (NN) shown in part (b) of the diagram, while the second one is processed using SENTiment Analyzer (SENTA) [35] to extract pattern features, as shown in part (c) of the diagram.
The upper part of the diagram, i.e., part (b), corresponds to the neural network, which takes the text as input and identifies its class. The lower part, i.e., part (c), corresponds to the ML part of the work, where engineered features (mainly patterns) are extracted.
The data sets ''I'' and ''II'' go through the neural network, where the aim of the classification task, given a piece of text, is to identify whether it is a tweet, a Reddit comment or a news headline (or in which time period it was posted), using only the textual information of the text.
The data sets ''III-1'', ''III-2'' and ''III-3'' are processed differently: the aim is to extract the patterns commonly used to express sarcasm on each platform.
In the rest of this section, we describe each of the steps in more detail.

1) PRE-PROCESSING
The pre-processing phase consists of several basic tasks to clean up the data sets used for training and testing. They include:
• Removing all the URLs, tags and all non-textual components,
• Replacing slang words and abbreviations by their corresponding English words and expressions,
• Fixing all detected typos and excessive usage of punctuation marks.
An additional pre-processing step is applied to texts that will be used for pattern extraction: all punctuation marks are removed and names are replaced by a simple expression that refers to them. In addition, as mentioned above, all the sets, whether they are used for training or for testing, are pre-processed the same way.
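The pre-processing steps listed above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the slang table `SLANG`, the regular expressions and the function name `preprocess` are assumptions, and typo correction and name replacement are left out.

```python
import re

# Hypothetical slang table; the dictionary actually used is not specified.
SLANG = {"u": "you", "gr8": "great", "idk": "i do not know"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[@#]\w+")                # assumed: user mentions and hashtags
REPEAT_PUNCT_RE = re.compile(r"([!?.,])\1+")   # e.g. "!!!" or "..."
WORD_RE = re.compile(r"\b\w+\b")


def preprocess(text, for_patterns=False):
    """Clean a short text; optionally strip punctuation for pattern extraction."""
    text = URL_RE.sub(" ", text)              # remove URLs
    text = TAG_RE.sub(" ", text)              # remove tags
    text = REPEAT_PUNCT_RE.sub(r"\1", text)   # collapse excessive punctuation
    # Replace slang words and abbreviations by their full expressions.
    text = WORD_RE.sub(lambda m: SLANG.get(m.group(0).lower(), m.group(0)), text)
    if for_patterns:
        text = re.sub(r"[^\w\s]", "", text)   # drop all punctuation marks
    return " ".join(text.split())             # normalize whitespace
```

Using a single function for both training and test texts is one simple way to guarantee that all sets are pre-processed the same way.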

2) NEURAL NETWORK
In this part of the work, we use the pre-trained language model implemented by Howard and Ruder [186], which we then fine-tune for our current task. The original model was trained on the WikiText-103 data set [187]. This corpus is composed of 28,595 pre-processed Wikipedia articles and contains 103 million tokens (words). Howard and Ruder [186] implemented an AWD-LSTM (Averaged Stochastic Gradient Descent Weight-Dropped LSTM), a regular LSTM with various tuned dropout hyperparameters [187]. Figure 7 shows the structure of the AWD-LSTM. The model is composed of an embedding layer, followed by 3 stacked LSTM layers and a softmax layer. The embedding size is 400 and each LSTM layer has 1152 activations.
As shown in the figure, the first step is to load the model as it is. This step is done by simply calling the pre-trained model. The model is downloaded along with its trained weights.
In the second step, we start the fine-tuning of the language model by continuing the training on Twitter, Reddit and news headline-like data. To do so, we used our whole data sets, along with more unlabeled data collected from Twitter and Reddit. The goal of this step is to make the language model learn the specific features of the language used on these platforms, so that it can recognize how sentences are structured and learn newer words such as slang.
In the final step, the language model is adjusted for classification. The softmax layer is cut off from the language model, and a linear block whose activation is set to softmax is added after the 3 LSTM layers. We fine-tune the model using gradual unfreezing, discriminative learning rates, and the slanted triangular learning rate.
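The three fine-tuning techniques named above can be illustrated with the schedules defined by Howard and Ruder [186]. The sketch below is written in plain Python rather than with any particular training library; the default values (`lr_max=0.01`, `cut_frac=0.1`, `ratio=32`, `factor=2.6`) are those suggested in [186] and are assumptions here, since the values used in our experiments are not listed in this section. The function names and the number of layer groups are hypothetical.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear warm-up, then long linear decay."""
    cut = max(1, math.floor(total_steps * cut_frac))  # step at which the peak occurs
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(lr_top, n_groups, factor=2.6):
    """One learning rate per layer group, lowest group first; each lower group
    gets the rate of the group above it divided by `factor`."""
    return [lr_top / factor ** (n_groups - 1 - i) for i in range(n_groups)]


def gradual_unfreezing_schedule(n_groups):
    """Indices of trainable layer groups at each stage: the last group is
    unfrozen first, then one more group per stage."""
    return [list(range(n_groups - 1 - stage, n_groups)) for stage in range(n_groups)]
```

For instance, with 4 layer groups (an assumption about how the embedding, LSTM and classifier layers are grouped), `gradual_unfreezing_schedule(4)` trains only the classifier block at first and the whole network in the last stage.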

3) PATTERN EXTRACTION
Patterns are extracted with SENTA [35] as we have described previously. We have referred to the work of Bouazizi and Ohtsuki [47] to extract them.
Pattern features identify and quantify the full expressions that are commonly used to express sarcasm. A pattern is defined as a generic sequence of words or expressions. Patterns are collected according to specific rules as described in [47]. All words are divided into 2 groups:
• ''CI'': this group contains the words whose content is important, and
• ''GFI'': this group contains the words whose grammatical function is important.
Words whose grammatical function is important are replaced by some expression [47], whereas words whose content is important are kept as they are. The classification of words into one of these categories is based on their Part-of-Speech (PoS) tag. The length of a pattern p must lie within a certain range [47]:
$$L_{Min} \leq Length(p) \leq L_{Max}$$
where $L_{Min}$ and $L_{Max}$ are the minimum and maximum length of a pattern in terms of words. Patterns are extracted from the sarcastic texts and filtered according to certain rules: a pattern p must occur at least $N_{occ}$ times, and must not appear, not even once, in a non-sarcastic text. Throughout our experiments, we use the same values for the parameters $L_{Min}$, $L_{Max}$ and $N_{occ}$ as in [47]. An example of a pattern extracted from the sentence ''I love it when people take me for granted'' would be [PRONOUN love PRONOUN when NOUN VERB].
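The extraction and filtering rules above can be sketched as follows. This is a simplified illustration, not the SENTA implementation: the input is assumed to be already PoS-tagged and already split into ''CI'' and ''GFI'' words (the assignment rules of [47] are not reproduced), the function names are hypothetical, and the parameter values used in the example are placeholders rather than the values of [47].

```python
from collections import Counter


def abstract(tagged_tokens):
    """Keep CI words as they are and replace GFI words by their PoS expression.
    Each token is a (word, pos, group) triple with group in {"CI", "GFI"}."""
    return [word.lower() if group == "CI" else pos
            for word, pos, group in tagged_tokens]


def ngrams(tokens, l_min, l_max):
    """All contiguous sub-sequences whose length lies in [l_min, l_max]."""
    for n in range(l_min, l_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def extract_patterns(sarcastic, non_sarcastic, l_min, l_max, n_occ):
    """Patterns seen at least n_occ times in sarcastic texts and never in
    non-sarcastic ones, mapped to their number of occurrences."""
    counts = Counter()
    for text in sarcastic:
        counts.update(ngrams(abstract(text), l_min, l_max))
    banned = set()
    for text in non_sarcastic:
        banned.update(ngrams(abstract(text), l_min, l_max))
    return {p: c for p, c in counts.items() if c >= n_occ and p not in banned}


# Beginning of the example sentence from the text; the CI/GFI split is inferred
# from the pattern given there. The second text is invented for illustration.
S1 = [("I", "PRONOUN", "GFI"), ("love", "VERB", "CI"), ("it", "PRONOUN", "GFI"),
      ("when", "ADVERB", "CI"), ("people", "NOUN", "GFI"), ("take", "VERB", "GFI")]
S2 = [("I", "PRONOUN", "GFI"), ("love", "VERB", "CI"), ("it", "PRONOUN", "GFI"),
      ("when", "ADVERB", "CI"), ("trains", "NOUN", "GFI"), ("stop", "VERB", "GFI")]
```

Calling `extract_patterns([S1, S2], [], 3, 6, 2)` keeps, among others, the pattern `("PRONOUN", "love", "PRONOUN", "when", "NOUN", "VERB")`; adding either text to the non-sarcastic list removes it.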
Patterns extracted from the different platforms are compared to each other, to see if there are some commonly used patterns on all the platforms, or whether a specific platform behaves differently from the others (''Set II''). Similarly, patterns extracted from Twitter at different points in time are compared to each other for the same purpose (''Set I'').

C. EXPERIMENTAL RESULTS
In the first set of experiments, we run a classification task using ''Set I'' and ''Set II''. The aim is to identify where (or when) a given piece of text was posted. This is of great importance to highlight, later on, to what extent the use of patterns is useful, and whether or not approaches for sarcasm detection on Twitter, the most studied platform in this context, can be applied to other ones.
To evaluate the performance of classification, we use the following KPIs:  • Recall, and • F-Score, which we have previously defined in Section IV-D.
These KPIs are measured at class-level as well as on the test sets in their entirety.
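As a reminder of how class-level and set-level KPIs relate to a confusion matrix, the following sketch computes them using the standard definitions of precision, recall and F-score. The function names and the toy matrix mentioned afterwards are illustrative only and are not taken from our results.

```python
def per_class_metrics(confusion, labels):
    """Per-class precision, recall and F-score.
    confusion[i][j] is the number of items of true class i predicted as class j."""
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # class k items predicted as another class
        fp = sum(row[k] for row in confusion) - tp   # other items predicted as class k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f_score": f_score}
    return metrics


def accuracy(confusion):
    """Share of correctly classified items over the whole test set."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total if total else 0.0
```

With a toy matrix such as `[[8, 2, 0], [3, 6, 1], [0, 0, 10]]` for the classes Twitter, Reddit and news, the first class has 8 true positives, 2 false negatives and 3 false positives.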

1) SARCASM ACROSS PLATFORMS
We initially run the classification of texts from different platforms against each other. The classification is done using the DL part of Figure 6. The classification TPR, precision, recall and F1-measure are given in TABLE 14. The confusion matrix of classification is given in TABLE 15.
As we can observe, it is relatively easy to distinguish sarcastic statements from different platforms from each other. Recall that this is not due to the use of non-textual components or punctuation, because these have been removed prior to the classification. Nor is it due to a difference in the length of the statements, as they are of similar average lengths. One reason for the distinction between sarcastic news headlines on the one hand and sarcastic tweets and Reddit comments on the other is how formal the language employed is. This explains why the sarcastic news headlines have such high precision and recall levels (96.67% and 98.25% respectively). However, being both casual and user-generated, tweets and Reddit comments share a lot in common, yet are distinguishable from each other to a certain level. Later in this work, we will discuss this particular point.
FIGURE 7. Architecture of the neural network used and the 3 steps to fine-tune it to our current task: 1) load the original language model, 2) fine-tune it on a corpus made of data similar to ours, 3) add a softmax layer, and train it to perform a classification task.

2) SARCASM OVER TIME
A more interesting task is to distinguish sarcastic statements posted at different points in time. While such a task can be highly biased by the trending topics at these different points in time, we still believe that the way sarcasm is expressed changes, or rather evolves, over time. The classification performance is shown in TABLE 16 and its confusion matrix is given in TABLE 17.
The classification results show that it is possible to distinguish between sarcastic tweets posted at different points in time. As we said above, this might be partially biased by the topic of the tweet itself. However, later on, we show that the patterns commonly used to express sarcasm change. In other words, people ''learn'' from influential users, who come up with new trends of how to express sarcasm.

3) MODEL GENERALIZATION
Given the results obtained above, our next target is to identify whether sarcasm detection models are generalizable. In other words, given a model trained and optimized on one data set, we want to see how good it is when evaluated on another data set. Given the major differences between the different platforms, we limit our study to a single platform, and evaluate the models trained on a data set from one time period in identifying sarcasm on data sets from the other two time periods.
We refer to the models trained on the Twitter data from 2015, 2017 and 2019 as $M_T^{2015}$, $M_T^{2017}$ and $M_T^{2019}$, respectively. Each model has been trained on its corresponding training set, and validated on its corresponding test set. In Table 18, we report the results of classification using each of the models on the 3 different test sets (including the one from its own time span). Note that the precision, recall and F-score are reported for the class ''sarcastic''. As can be seen, each model performs best when used on the data collected from its own time span. For instance, the accuracy of the model $M_T^{2015}$ on data collected in 2015 reached 89.20%. This accuracy drops significantly when the model is evaluated on data collected in 2017 and 2019, reaching 74.06% and 69.73%, respectively. This behavior is also observed when using the two other models (i.e., $M_T^{2017}$ and $M_T^{2019}$). To answer our question about the generalizability of the models, it is fair to conclude from the results obtained in Table 18 that a model trained on data from a certain time span can indeed be used to classify data from another time span. However, one should expect a performance far lower than that obtained on data from the model's own time span. Later, in subsection VI-C5, we address the main reasons for such a drop in performance.
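The evaluation protocol of this sub-section, in which each model is tested on every time span, can be sketched as follows. The function name `cross_period_accuracy` and the data layout are assumptions; the models are represented by arbitrary prediction functions, and no actual result from Table 18 is reproduced.

```python
def cross_period_accuracy(models, test_sets):
    """Accuracy of every model on every test set.
    models:    {period: predict_fn}, where predict_fn(text) returns a label
    test_sets: {period: [(text, label), ...]}
    Returns {(trained_on, tested_on): accuracy}."""
    grid = {}
    for trained_on, predict in models.items():
        for tested_on, samples in test_sets.items():
            correct = sum(predict(text) == label for text, label in samples)
            grid[(trained_on, tested_on)] = correct / len(samples)
    return grid
```

The diagonal entries of the returned grid, where `trained_on == tested_on`, correspond to the within-period performance, and the off-diagonal entries to the generalization across time spans.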

4) SARCASTIC PATTERNS
In this sub-section, we explore the idea that people use similar phrases and sentences to express sarcasm. This is the key point behind several works on the detection of sarcasm [17], [43], [47]. In their respective works, the authors used the term ''pattern'' to refer to a generalization of these expressions used to express sarcasm. As described previously in Section VI-B3, patterns are collected by transforming the texts in a way that abstracts them from their context, while keeping the overall grammatical construction of the sentences. Here we used the Sets ''III-1'', ''III-2'' and ''III-3'' to collect the most common sarcastic patterns on each platform of our data set (i.e., Twitter, Reddit and the news headlines).
In Figure 8, we show the top 20 patterns used in each platform ordered by their occurrence number in each platform, as well as their occurrence number in the other platforms. Figure 8-(a) shows patterns that are mostly used in Twitter, Figure 8-(b) shows those that are mostly used in Reddit, and Figure 8-(c) shows those that are mostly used in news headlines.
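The comparison reported in Figure 8 amounts to ranking the patterns of one platform by their number of occurrences and looking up the same patterns on the other platforms. A minimal sketch, assuming per-platform pattern counts are already available as `collections.Counter` objects (the function name and data layout are hypothetical):

```python
from collections import Counter


def top_patterns_across(platform_counts, platform, k=20):
    """Top-k patterns of `platform` with their occurrence counts on all platforms.
    platform_counts: {platform_name: Counter({pattern: occurrences})}"""
    rows = []
    for pattern, _ in platform_counts[platform].most_common(k):
        # A Counter returns 0 for patterns never seen on a platform.
        rows.append((pattern, {name: counts[pattern]
                               for name, counts in platform_counts.items()}))
    return rows
```

Each returned row pairs a pattern with its count on every platform, which is the information plotted for one panel of the figure.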
As we can observe, patterns extracted from tweets are the most abundant ones. By far, they outnumber those extracted from the other platforms. This goes along with our earlier intuition that sarcastic expressions are mostly learned and re-used. This also explains how approaches that rely on patterns to detect sarcasm are very good at detecting sarcasm on Twitter, but present lower performance on other platforms [47].
FIGURE 9. An in-depth look at a sample of texts manually labeled as ''Sarcastic.''
On the other hand, it is interesting to note that the top patterns are almost all common to all the platforms, meaning that these expressions are used on all the platforms regardless of how formal or informal the language used is. This means that some patterns are ''universally'' agreed on as sarcastic.

5) SARCASM: IS IT REALLY SARCASM?
As we have explained early on in this paper, sarcasm is a sophisticated form of speech which requires a higher-than-average intelligence to make and to understand [15]. Several works have questioned whether sarcasm, the way it is expressed on social media, is the ultimate form of sarcasm [44], [47]. Early in this paper, we introduced [Q1], which also asks whether sarcasm requires a certain mastery of the language for it to be employed correctly.
As we have seen in the previous sub-section, users of Twitter tend to overuse certain expressions, making them more of abused clichés and less of contemptuous phrases.
Here, we collected some random samples of texts from our data set, and went through them to identify whether or not they really are sarcastic. In addition, since polarity switching is one of the reasons sarcasm has been studied in the first place, we also checked, for each of the texts, whether it is a polarity switcher or not, answering the question [Q3].
In Figure 9, we classify each of the studied samples into 1 of 4 classes, identifying whether or not what the annotators labeled as sarcastic is indeed so. In the upper part of the figure (Figure 9-a), we show the proportion of tweets, Reddit posts and news headlines annotated as sarcastic which have been recognized by more rigorous annotators as really being sarcastic. Note that the data sets were partially collected from various sources found online and partially annotated through the services of CrowdFlower. As shown, a little less than 63% of these tweets are actually sarcastic. The remaining ones are mostly direct mockery (usually of a person) or jokes. Similarly, 72.2% of the Reddit comments annotated as sarcastic were indeed identified as sarcastic after a rigorous check, and 87.8% of the news headlines annotated as sarcastic were identified as such.
On the other hand, as shown in Figure 9-b, sarcasm is indeed a polarity switcher for 89.3% of the tweets where it is employed. In other words, the actual polarity of the tweet is the opposite of that returned by a sentiment analyzer. Sarcasm has also been identified as a polarity switcher for 89.1% and 92.0% of the sarcastic statements in Reddit and the news headlines, respectively. It is no surprise that news headlines are the ones with the highest proportion of genuinely sarcastic statements, and that sarcasm is more often a polarity switcher in them than on the other two platforms. This has been addressed in the previous sub-section.
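The polarity-switch check described above compares, for each sarcastic text, the manually assessed polarity with the polarity returned by a sentiment analyzer that is unaware of sarcasm. A minimal sketch of this computation, assuming both polarities are given as the strings "positive", "negative" or "neutral" (the function name and this encoding are hypothetical, and no specific sentiment analysis tool is implied):

```python
def polarity_switch_rate(samples):
    """Share of samples whose actual polarity is the opposite of the tool's.
    samples: list of (actual_polarity, tool_polarity) pairs."""
    if not samples:
        return 0.0
    opposite = {("positive", "negative"), ("negative", "positive")}
    switched = sum((actual, tool) in opposite for actual, tool in samples)
    return switched / len(samples)
```

Under this encoding, a pair involving "neutral" is never counted as a switch, which is a design choice of the sketch rather than a rule stated in the text.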
To recapitulate, the answers to the questions we have investigated could be as follows:
• [A1] The way sarcasm is expressed indeed differs from one platform to another. This is due to several reasons which include, but are not limited to, the mastery of language, how influenced users are by others, etc.
• [A2] The way sarcasm is expressed on Twitter does change over time. More interestingly, Twitter is indeed the platform where pattern-based approaches for sarcasm detection are the most effective.
• [A3] Within the context of the data sets we have explored and used in this work, it is safe to affirm that sarcasm is a polarity switcher with a very high probability.

VII. CONCLUSION
In this paper, we have investigated the topic of sarcasm detection on different platforms and over time. We have studied the different ways sarcasm is expressed on 3 different platforms: Twitter, Reddit and news headlines. Our experiments show that sarcasm is indeed expressed differently on these platforms. They have also shown that the way it is expressed differs between separate periods of time. Finally, we have explored the idea of using sarcasm as a polarity switcher, and confirmed that sarcasm can be used as a polarity switcher if detected.