Agree-to-Disagree (A2D): A Deep Learning-Based Framework for the Authorship Discrimination Task in a Corpus-Specificity-Free Manner

Authorship discrimination is the task of detecting whether two writings are authored by the same person. From literary study to forensic analysis, authorship discrimination makes a significant contribution to differentiating authorship. In this work, we propose Agree-to-Disagree (A2D), a novel framework for the authorship discrimination task. It is a two-stage deep learning-based framework consisting of an ‘Agree’ and a ‘Disagree’ network. In the first stage, it learns authorship attributes with its Agree network. Subsequently, through its Disagree network, the framework attempts to differentiate the authorship of a new dataset (completely unrelated to the training dataset), a novel use case that has not been systematically considered hitherto in the literature. We show that A2D does not depend on dataset-specific prior knowledge and can learn solely from the authorship attributes of the dataset to detect whether two different writings are from the same author. We demonstrate that the A2D framework can successfully reveal authorship hidden behind pseudonyms by tasking it with unmasking the pseudonyms of the famous American short story writer Washington Irving. We also apply our framework to a historical question: whether the authorship of the most respected book in Islam (the Holy Quran) can be attributed to the Prophet of Islam. Through experimental analysis, A2D indicates that the Prophet of Islam is not the author of the Holy Quran, a result in perfect alignment with the belief of 1.8 billion Muslims around the globe regarding the authorship of this holy book.


I. INTRODUCTION
Understanding the innate writing style of an author, and extracting useful information from it, has been a fascinating topic of research in the field of natural language processing. It fascinates us when such research identifies that the mysterious author named 'Robert Galbraith' of the crime novel ''The Cuckoo's Calling'' is none other than J.K. Rowling, the famous author of the ''Harry Potter'' series. From studying the authorship of Shakespeare's works to forensic linguistics, this topic covers a wide range of applications, and as a result, several research subfields have emerged.

(The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Shorif Uddin.)
In this article we focus on the authorship discrimination task, which is the process of detecting whether two given texts/manuscripts are written by the same author. This task is important in various fields and from different angles. Authorship discrimination reveals that Lyman Frank Baum, America's greatest writer of children's fantasy, did not author ''The Royal Book of Oz''; rather, it was written by Ruth Plumly Thompson [1]. On April 29, 1992, a young, healthy man, Michael Hunter, was found dead in his own bed by his roommate, Joseph Mannino, who notified the police. During the investigation of a potential homicide, Mannino gave the police suicide notes which he had found on the home computer. After analyzing the writings of both Hunter and Mannino, it was revealed that the suicide notes were most likely not written by Hunter but rather by Mannino [2], [3]. A famous dispute between Paul Ceglia and Mark Zuckerberg over the ownership of Facebook showed the vital role of authorship discrimination in forensic analysis and justice. In his 2010 lawsuit, Ceglia claimed that he and Zuckerberg had signed a contract in 2003 entitling Ceglia to 50 percent of Facebook. Ceglia submitted a set of disputed emails claiming Zuckerberg as their author. Later, in 2011, Gerald R. McMenamin, a professor at California State University, submitted a report showing that Zuckerberg's writing style differed from that of the disputed emails and concluded that Zuckerberg was not the probable author of the questioned emails [4]. Thus, authorship discrimination plays a significant role in discovering different authorship styles in historical literature and in forensic analysis, and this motivates us to revisit this challenging problem.
Prior works ([5]-[7]) on authorship discrimination and authorship attribution mostly focused on differentiating authors through lexical feature analysis. These works are vulnerable to human modification of the dataset, such as translation, which may change the lexical features of the writing. Perhaps the most important weakness of these works is that they are mostly dataset-dependent: features that prove useful on one dataset may not be useful on another because no generalizations can be made about the features [8]. Thus, the models of these works already receive prior information about the dataset.
In this work, we propose the Agree-to-Disagree (A2D) framework, a two-stage deep learning-based framework for the authorship discrimination task. Our framework does not suffer from the limitations of lexical-feature-based approaches mentioned above. We also show that our framework does not acquire any prior knowledge about a specific dataset, as it exhibits impressive performance on all the benchmark datasets considered in this study. We first take a dataset where we can agree that each writing is assigned to its actual author. We train our Agree network (first stage) on this dataset to extract generalized authorship attributes. Then, we take a new dataset where we may not agree that each writing is perfectly assigned to its actual author. We put this new dataset under experiment in our Disagree network (second stage) and observe how it performs based on the generalized authorship attributes. We expect the A2D framework to achieve high accuracy when writings with the same authorship style are assigned to one author. On the other hand, the A2D framework will show low accuracy if writings with the same authorship style are assigned to different authors. In this way, the A2D framework is able to detect whether two different writings are from the same author without having any prior knowledge about the writings. The A2D framework is evaluated on two benchmark datasets (Reuter_C50 [9] and Spooky_Author [10]) and shows impressive results in the authorship discrimination task.
It is important to mention that some prior works ([11], [12]) focused on identifying the author of an unknown text. In June 2017, Aleksandr A. Marchenko founded ''Emma Identity'' for identifying authors by their writings [13]. It used more than fifty parameters to define authorship. The task of identifying the author of a given text is authorship attribution or authorship identification. On the contrary, our A2D framework does not attempt to identify the author of a text; rather, it focuses on authorship discrimination, which means differentiating two texts based on authorship. An example showing the difference can be useful in this context. Suppose we are given a labelled dataset of X authors. Authorship attribution or identification tasks (e.g., those of [14]-[16] and [17]) first train their network on the above-mentioned labelled dataset. Now, in the former (latter) task, we have X = 1 (X >= 1), and given a new piece, their model can determine whether the given piece is written by that particular author (one of the X authors). On the other hand, the A2D framework considers a completely different task, hitherto unexplored to the best of our knowledge, as follows. Given two pieces without any author-related information, it can return whether the two pieces are authored by the same author or not. To do that, it simply needs a 'good' labelled dataset of multiple authors, and the authors of the above-mentioned two pieces need not be in that set of authors. The purpose of the labelled dataset is to learn the innate authorship style, unlike the authorship identification task, where the model is trained on a labelled dataset to later identify the writings of the authors mentioned in the dataset.
A concrete fictitious example, highlighting the speciality of the A2D network with respect to the usual authorship discrimination (ADis) task studied in the literature, is provided as follows. Suppose we have 5 authors (A, B, C, D, E) in a closed set of authors. Now suppose we have a labelled dataset (DatasetX) with n different literary pieces, each authored by one of A, B, C, D and E, and for each author there are enough literary pieces in DatasetX. In a typical authorship identification task (e.g., [14]-[16], and [17]), the generated models are trained using DatasetX, and then, given a new piece by one of A, B, C, D or E, the model should be able to correctly identify the author. If a piece written by a different author, say F, is fed to the model, the model has no way to identify F. Now we focus on a typical setting of an ADis task (e.g., [6], [18]). Suppose a model for this task, say ADis1, is trained using DatasetX and is then fed two literary pieces, L_R and L_S, authored by F and G respectively, i.e., DatasetX did not have any literary pieces written by these two authors. In this case ADis1 will fail. A2D's speciality is that it will work in this setting. If A2D is trained with DatasetX, the Agree network will try to learn the innate characteristics of authorship and use that knowledge to 'discriminate', and it is expected to be able to say that L_R and L_S are authored by two different authors. Notably, A2D does not aim to identify that L_R is authored by F and L_S is authored by G, nor does it have the capability to do so. Similarly, ADis1 will fail if it is fed L_R and L_U (both authored by F). But A2D is expected to output that these two pieces are authored by the same author.
Thus A2D works in a unique setting that overcomes the incapability of the existing models. This is why the current work is unique and, to the best of our knowledge, the first of its kind in this context.
In an effort to investigate the usefulness of the A2D framework, we conduct two separate case studies on two sets of historical books. In our first case study, we work on the books of Washington Irving, a famous American writer of the early 19th century. Although he wrote some of his books under pen names, we experimentally show that the A2D framework successfully reveals the common authorship of those books. For the second case study, we decide to work on an extremely sensitive topic which is very important in theological scholarship. In particular, we turn our attention to the theological debate over whether Muhammad (peace be upon him), the Prophet of Islam, can be attributed the authorship of the Holy Quran, which is believed in Islam to be the direct revelation of God (Allah) Himself. For this case study, we need samples of the Prophet's speech. In order to collect the Prophet's speech, we choose the Holy Bukhari (one of the six major Hadith collections). Notably, this book was compiled at a later period by a famous Islamic scholar named Muhammad al-Bukhari, hence the name. The reason behind choosing the Bukhari is that it is regarded as one of the most authentic collections of the Prophet's sayings and deeds [19].
In this context, a brief discussion of the relevant work done by Sayoud [20] is in order. Sayoud conducted three series of experiments on the authorship discrimination of the Holy Quran and the Hadith (records of the words, actions, and silent approvals of the Prophet) to prove that the Prophet is not the author of the Holy Quran. For the first series of experiments, the author analyzed the two books in a global form, where the text of each book is analyzed as a single large text. For the second series of experiments, he analyzed the two books in a segmental form: four different segments of text were extracted from each of the two books. In the last series of experiments, he performed automatic authorship attribution based on a multi-classifier and multi-feature segmental analysis. However, in every experiment, he placed great importance on lexical features (i.e., word frequency, word length frequency, character frequency, number citations, special ending bigrams, discriminative words, and vocabulary similarity). But the observed differences in lexical features may be due not only to different authors but also to different topics. The variance in the distribution of the topics can be observed in the results of the Latent Dirichlet Allocation (LDA) analysis (also mentioned by Sayoud [20]). As a result, one can argue that the impressive performance reported in [20] might not be due only to the authorship differentiation of the dataset.
In our work, we emphasize the fact that the performance we achieve is due only to the differentiation of authorship. To ensure this, we test our A2D framework from a bi-directional point of view. We find that our framework exhibits outstanding performance in the authorship discrimination task when each author possesses unique authorship characteristics, and poor performance when no specific authorship characteristics are found to distinguish an author. This shows that our A2D framework truly learns from the innate writing style of the authors.
Our contributions in this work can be summarized as follows: • We propose the A2D framework for the authorship discrimination task. To the best of our knowledge, this is the first framework that can detect whether any two writings are from the same author.
• Our framework exhibits expected performance regardless of the nature of the corpus and the authors' variation. It does not depend on corpus-specific linguistic features. Hence, our framework does not need to acquire any prior knowledge before differentiating the authors.
• We perform a case study on the books of an author with different pseudonyms. We show that the A2D framework is successfully able to identify the same authorship of the books.
• According to the 1.8 billion Muslims around the world, the author of their principal religious text, the Holy Quran, is God Himself, not the Prophet of Islam. This has been a topic of debate in theological scholarship. We perform a case study with our A2D framework to analyze this topic. Based on the results of the A2D framework, we can confidently claim that the Holy Quran is not authored by the Prophet.

The rest of this article is organized as follows. In Section II, we present the related research works; in Section III, we describe the architecture of the A2D framework in detail; in Section IV, we discuss the datasets used in our experiments, their preprocessing, and the experimental setup of the A2D framework; in Section V, we present the performance analysis of the A2D framework along with two case studies showing the application of the framework; in Section VI, we provide some insightful discussion of our work; and finally, we conclude this article, mentioning possible future work, in Section VII.

II. RELATED WORK
Significant research in the field of NLP has been conducted over the years to discover the writing styles of authors and to differentiate authors based on those styles. Different methods have been proposed for authorship discrimination, authorship attribution, and verification.
Hirst and Feiguina [6] proposed bigrams of syntactic labels from a partial parser as a feature for author discrimination. They performed their experiments on various ranges of text lengths and reported 99% accuracy. Gamon [15] used semantic features, such as tense, aspect, verb subcategorization, and semantic modification relations. Employing an SVM classifier, Gamon reported 97.6% accuracy. Kim et al. [7] proposed a new feature set of k-embedded-edge subtree patterns that could hold more syntactic information. They also proposed a novel approach to directly mining the information from a given set of syntactic trees. Chen et al. [23] performed experiments on 150 stylistic cues, including lexical, syntactic, structural, and content-specific features, to detect author similarity from email messages. They achieved 89% accuracy using the Enron email dataset. Kestemont and Van Dalen-Oskam [22] showed that, using a lazy machine learning technique, it is possible to discriminate between scribes. In their work, they noted that if the right features and weighting methods are used, the automated discrimination of both copyists and authors is possible for medieval texts.
Early works on authorship attribution [14] put importance on discovering authors' styles without much effort on text analysis. Diederich et al. [18] used SVMs for authorship identification and observed that author detection with SVMs on full word forms was remarkably robust even when the author wrote about different topics. Grieve, in his work on quantitative authorship attribution [21], performed thirty-nine different types of textual measurements. According to this work, a combination of the weighted performance of different features (i.e., words, punctuation, graphemes, sentence lengths) can achieve the best result.
Among the recent works, Sari et al. [29] provided an insight into the relationship between the effectiveness of different types of features for authorship attribution and the characteristics of datasets. They found that the most effective features for datasets can be predicted by applying topic modeling and feature analysis. Patchala and Bhatnagar [30] showed that templates formed from the sub-tree frequencies in the parse tree of an author's text reflected the innate writing style thereof. They found that syntactic features are best combined using Dempster's rule compared to other information fusion methods.
Juola and Mikros [25] used entropy, the repeat rate of words, the R index, and linguistic and Twitter-specific extra-linguistic variables to identify fourteen bilingual Twitter users. Fatima et al. [27] developed a multilingual (English and Roman Urdu, i.e., Urdu written in the Roman script) corpus for an authorship profiling task on Facebook. They suggested that existing stylometric features like n-grams, lexical, and content-based features can also be used on the multilingual corpus. Stamatatos [17] proposed a text distortion algorithm to enhance the robustness of extracting stylometric features for authorship attribution. The author showed that by compressing topic-based information, the personal style of the author can still be maintained, and this algorithm can be useful for authorship identification where the topics of documents by the same author vary. Rocha et al. [26] used a diverse set of features, including bag-of-words, stem-based match, tf-idf, and the distance between matching words, for the authorship attribution task in social media forensics. They suggested that researchers should look closely at the graphical structure of the social network to detect the associations of an author.
Evert et al. [28] performed a series of investigations into stylometric authorship attribution methods relying on Burrows's Delta distance measurement [32] to quantify stylistic similarity. They suggested that information for authorship identification relies on deviations across the most frequently used words rather than deviations of specific words. Overdorf and Greenstadt [24] used lexical, syntactic, n-gram, and misspelled-word features and proposed augmented Doppelgänger Finder algorithms to improve performance in cross-domain authorship attribution. Ramnial et al. [16] used the nearest neighbor and support vector machine algorithms with 446 features on ten Ph.D. theses. They leveraged the authorship attribution technique for plagiarism detection and achieved above 90% accuracy. Ding et al. [31] categorized the feature sets of authorship analysis into four modalities, i.e., topical, lexical, character-level, and syntactic. The authors suggested applying feature engineering methods to these modalities to characterize an author, rather than choosing features manually.
Luyckx and Daelemans [8] addressed the problem of overestimating system performance and the importance of linguistic features due to an unrealistic size of the training data and a very small set of authors. They showed the robustness of memory-based learning in coping with the problem. The proposed memory-based learning system correctly classified 50% of cases in authorship attribution on 145 authors and 56% of cases in the author verification task.
Before concluding this section, we present Table 2, which shows a concise and focused view of the related works in chronological order, clearly highlighting the specific task each work focuses on. A careful examination of Table 2 also justifies our motivation to propose the A2D framework, which performs the authorship discrimination task in a unique setting as follows. Given two literary pieces without any information thereof, the A2D framework can decide whether the pieces are authored by the same author or not. To do that, it only needs a good labelled dataset with literary pieces authored by some arbitrary authors to train its Agree network (further details below). The methodology and approach adopted here are completely independent of the authorship style/characteristics and the nature of the dataset.

III. MODEL ARCHITECTURE
The main constraint of authorship discrimination for the most fundamental writings is that we do not have any prior knowledge about the authors. Our Agree-to-Disagree framework is proposed precisely to tackle this limitation.

A. OVERVIEW OF THE A2D FRAMEWORK
The A2D framework is a combination of two identical networks: Agree (A) and Disagree (D). As has been illustrated in Fig. 1, the Agree network is trained over an author-identified dataset (i.e., a dataset containing labelled pieces authored by multiple authors). The one-dimensional convolutional layer of the Agree network (see details below) is thus trained to detect authorship characteristics from the dataset. In this way, we mold and build the convolutional section of this network to differentiate distinct author characteristics.
Once the Agree network is trained as mentioned above, our A2D framework is ready to take up an authorship discrimination task, i.e., given two (literary) pieces, it will decide whether they are authored by different authors. The convolutional section of the Agree network is now leveraged by the Disagree network, for which it works as an 'authorship filter'. We need to create a combined dataset using the two given literary pieces (discussed further in Section V-C with Fig. 7). The Disagree network is then fine-tuned on this combined dataset without updating the parameters of the convolutional layers. Therefore, the convolutional layers provide the features to the attention layer of the Disagree network, upon which this network decides on the authorship of the two literary pieces of the combined dataset.
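The freeze-then-fine-tune pattern described above can be illustrated with a minimal NumPy sketch. This is an illustrative stand-in, not the paper's implementation: a fixed random projection plays the role of the trained convolutional section (the 'authorship filter'), and a simple logistic-regression head plays the role of the Disagree classifier that is fine-tuned on the combined dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the Agree network's trained convolutional section: in A2D
# these weights are learned in stage one; here fixed random weights merely
# play that role so the sketch stays self-contained.
W_frozen = rng.normal(size=(4, 3))

def extract(x):
    """The 'authorship filter': applied to new data but never updated."""
    return np.maximum(x @ W_frozen, 0.0)

# Stage two: fine-tune only the Disagree head on a new toy dataset.
X = rng.normal(size=(64, 4))
y = (X[:, 0] > 0).astype(float)
F = extract(X)                       # features from the frozen extractor
w_head, b = np.zeros(3), 0.0

def loss():
    p = 1.0 / (1.0 + np.exp(-(F @ w_head + b)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

before = loss()
for _ in range(300):                 # gradient descent on the head only
    p = 1.0 / (1.0 + np.exp(-(F @ w_head + b)))
    g = p - y
    w_head -= 0.1 * F.T @ g / len(y)
    b -= 0.1 * g.mean()

assert loss() < before               # the head learned; W_frozen never changed
```

The key design point mirrored here is that only the head receives gradient updates, so whatever the frozen extractor has learned about authorship is carried over unchanged to the second stage.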

B. AGREE (A) NETWORK
Given a writing-author pair (x, y), the task of the Agree network (Fig. 2) is to predict $y^A$, where $y^A$ is the author of a given writing x at the Agree network. A one-dimensional convolutional neural network can extract features and classify texts after transforming the words of the sentence corpus into vectors [33]. Let x be represented with a maximum length of n words,

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n, \qquad (1)$$

where $\oplus$ is the concatenation operator and $x_i$ is the word vector of the i-th word. A filter $w^A$ is applied to a window of h words to produce a new feature $c_i^A$,

$$c_i^A = f_C(w^A \cdot x_{i:i+h-1} + b_C^A), \qquad (2)$$

where $b_C^A$ is a bias term and $f_C$ is a non-linear function such as the sigmoid, hyperbolic tangent, or ReLU. This filter is applied to each possible window of words in the sentence to produce a feature map vector

$$c^A = [c_1^A, c_2^A, \ldots, c_{n-h+1}^A]. \qquad (3)$$

A max-over-time pooling operation is applied over the feature map, taking the maximum value of the feature corresponding to a particular filter [34],

$$\hat{c}^A = \max\{c^A\}. \qquad (4)$$

Local feature vectors extracted by the convolutional layers have to be combined into a global feature vector with a fixed size, independent of the sentence length, in order to apply subsequent standard affine layers. Applying the max-pooling approach k times forces the network to capture the most useful local features produced by the convolutional layers.
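The convolution and max-over-time pooling steps, i.e., equations (2)-(4), can be sketched in NumPy as follows. The embedding dimension, filter weights, and window size are arbitrary toy values, not the paper's configuration.

```python
import numpy as np

def conv1d_feature_map(x, w, b, h):
    """Eq. (2)-(3): slide one filter w over every window of h word
    vectors in x, applying a ReLU as the non-linearity f_C."""
    n = x.shape[0]
    return np.array([max(0.0, np.dot(w, x[i:i + h].ravel()) + b)
                     for i in range(n - h + 1)])

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))   # a 10-word sentence with 4-dim embeddings
w = rng.normal(size=3 * 4)     # one filter spanning h = 3 words
c = conv1d_feature_map(x, w, b=0.1, h=3)
assert c.shape == (10 - 3 + 1,)   # one feature per window, eq. (3)
c_hat = c.max()                   # max-over-time pooling, eq. (4)
```

In the real network, many such filters of several window sizes run in parallel, each contributing one pooled value to the global feature vector.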
We use a bi-directional LSTM [35] to encode the features from max-pooling. For a given time step t, we determine the hidden states of the forward and backward directions, $\overrightarrow{h}_t^A$ and $\overleftarrow{h}_t^A$. Then we concatenate them to get

$$h_t^A = [\overrightarrow{h}_t^A ; \overleftarrow{h}_t^A]. \qquad (5)$$

We calculate the attention score $\alpha_t^A$ over the hidden states,

$$\alpha_t^A = \frac{\exp(\mathrm{score}(h_t^A))}{\sum_{k} \exp(\mathrm{score}(h_k^A))}, \qquad (6)$$

and concatenate the attention-weighted summary of the hidden states with the respective hidden state to generate the context vector $z_t^A$,

$$z_t^A = \Big[h_t^A \; ; \; \sum_{t'} \alpha_{t'}^A h_{t'}^A\Big]. \qquad (7)$$

For the final step of the Agree network, we apply a fully-connected layer on this context vector to predict the authorship output of this network.
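The attention step can be sketched as follows. Since the exact scoring function is not spelled out here, a simple dot-product score with an assumed learned vector v is used; the context vector concatenates each hidden state with the attention-weighted summary of all states.

```python
import numpy as np

def attention_context(H, v):
    """H: BiLSTM hidden states (T x 2d); v: assumed scoring vector (2d,).
    Returns the attention weights and per-step context vectors."""
    scores = H @ v                          # one score per time step
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha /= alpha.sum()
    summary = alpha @ H                     # attention-weighted summary
    # concatenate each hidden state with the shared summary
    Z = np.hstack([H, np.tile(summary, (H.shape[0], 1))])
    return alpha, Z

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 8))     # 6 time steps, 2d = 8
alpha, Z = attention_context(H, rng.normal(size=8))
assert np.isclose(alpha.sum(), 1.0)   # softmax weights sum to one
assert Z.shape == (6, 16)             # hidden state + summary
```

Any differentiable scoring function (e.g., an additive MLP score) could replace the dot product without changing the overall structure.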

C. DISAGREE (D) NETWORK
The Disagree network (Fig. 3) is identical to the Agree network except for the convolutional section: the Disagree network does not update the parameters of the convolutional layers. It applies the same convolutional operations (2), (3), and (4) of the Agree network to the word vectors of an experimental dataset to capture the most useful local features $c_m^D$. The Disagree network then runs a Bi-LSTM over these features and computes attention scores to build up the context vector.
As the last step, we compute the probability of the authorship y D (the author of an experimental writing x at the Disagree network) given the word vector for this Disagree network.

IV. EXPERIMENTS
In this section, we describe the datasets, dataset generation, preprocessing and experimental setup of our framework.

A. DATASETS
In this study, we have considered two benchmark datasets for the authorship discrimination task.

1) REUTER_50_50 (REUTER_C50) DATASET
The Reuter_50_50 (C50) dataset [9] is a subset of Reuters Corpus Volume I (RCV1), an archive containing over 800,000 manually categorized newswire stories made available by Reuters Ltd. for research purposes [36]. The corpus has already been used in author identification experiments. The top fifty authors with respect to the total size of their articles were selected. Fifty authors of texts labeled with at least one subtopic of the class CCAT (corporate/industrial) were selected, thereby making an attempt to minimize the topic factor in distinguishing among the texts. The corpus consists of a total of 5000 texts (100 per author). We split it in a non-overlapping 9:1 ratio to create the training set (90 per author) and the test set (10 per author) of our Reuter_C50 dataset.

B. DUMMY (RANDOM) DATASET GENERATION
As discussed previously, our principal goal is to ensure that the A2D framework depends only on authorship attributes for the differentiation task. Therefore, we expect the framework to perform poorly when the dataset is authored by the same authors (or, equivalently, by authors with identical writing styles). So, if we could build a dataset where the authorship distribution is not discriminatory, our model should perform poorly. To examine this, we have built some dummy (random) datasets from the actual datasets of our experiment. These dummy (random) datasets have the same topic distribution and linguistic features as the actual ones and differ only in the authorship distribution. If our framework fails to show performance on these dummy datasets similar to that on their respective actual datasets, we can infer that our framework indeed learns the authorship attributes for the prediction task. For this experiment, we have created a Spooky_Author_Binary dataset by taking only the writings of Edgar Allan Poe and Mary Shelley to check our framework's performance on a binary dataset. We have randomly redistributed the authorship of the Reuter_C50, Spooky_Author, and Spooky_Author_Binary datasets to create the Dummy_Reuter_C50, Dummy_Spooky_Author, and Dummy_Spooky_Author_Binary datasets, respectively.
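Concretely, building a dummy dataset amounts to shuffling the author labels while leaving the texts (and hence topics and lexical features) untouched. A minimal sketch with hypothetical toy data:

```python
import random

def make_dummy_dataset(texts, authors, seed=42):
    """Randomly reassign author labels to texts, destroying any genuine
    text-author association while preserving the label distribution."""
    shuffled = authors[:]                  # copy; keep the original intact
    random.Random(seed).shuffle(shuffled)
    return list(zip(texts, shuffled))

texts = ["text1", "text2", "text3", "text4"]
authors = ["Poe", "Shelley", "Poe", "Shelley"]
dummy = make_dummy_dataset(texts, authors)
assert [t for t, _ in dummy] == texts                  # texts unchanged
assert sorted(a for _, a in dummy) == sorted(authors)  # same label counts
```

Because only the pairing is randomized, any accuracy drop on the dummy dataset can be attributed to the loss of genuine authorship information.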

C. DATASET PREPROCESSING
We perform some preprocessing of the raw texts before feeding them into the A2D framework. We clean our corpus by correcting the spelling of words. We split every text on whitespace and obtain the base form of each word by lemmatizing it with the WordNet Lemmatizer from the NLTK library. Finally, we rejoin the word tokens with whitespace to produce our clean text corpus and tokenize it to feed our A2D framework. We use 100-dimensional pre-trained GloVe [37] embeddings to build our word embedding matrix. GloVe is an unsupervised learning algorithm for obtaining vector representations of words.
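The tokenize-lemmatize-rejoin pipeline can be sketched as follows. To keep the sketch self-contained, a tiny hand-made lemma table stands in for NLTK's WordNet Lemmatizer; the actual pipeline uses the real lemmatizer (and the GloVe embedding lookup, omitted here).

```python
# Hypothetical stand-in for NLTK's WordNetLemmatizer, used only so this
# sketch runs without downloading WordNet.
LEMMAS = {"writings": "writing", "authored": "author", "styles": "style"}

def preprocess(text):
    """Split on whitespace, lemmatize each token, rejoin with spaces."""
    tokens = text.split()
    lemmas = [LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]
    return " ".join(lemmas)

assert preprocess("Authored   Writings") == "author writing"
```

The cleaned strings are then integer-tokenized, and each index is mapped to its 100-dimensional GloVe vector in the embedding matrix.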

D. A2D FRAMEWORK SETUP
For the Agree network, we use one-dimensional convolutional layers with 512, 256, and 128 filters of window sizes 4, 3, and 2, respectively. A max-pooling layer of pool size 2 is placed on top of each of these convolutional layers. The bottom-most convolutional layer is initialized with 100-dimensional pre-trained GloVe embeddings. We use a bi-directional LSTM [35] with 100 hidden states to produce a 200-dimensional vector. The ADAM [38] optimizer with a learning rate of 0.001 is applied to minimize the categorical cross-entropy loss. A softmax activation function is used for the final fully-connected layer. This network is trained on the Reuter_C50 dataset over 50 epochs with batch size 128. For the Disagree network, we keep the architecture identical, except that we freeze the convolutional layers to stop updating their parameters. We use a sigmoid activation for binary-class and a softmax activation for multi-class output layers. We fine-tune the Disagree network on the experimental datasets over 15 epochs to predict the final outcome for the authorship classes. Our implementation of the A2D framework is publicly available at: https://github.com/Tawkat/A2D-Authorship-Discrimination-Framework

V. RESULTS AND CASE STUDIES
In this section, we summarize the experimental results of our A2D framework on the benchmark datasets and present two case studies showing the application of the framework.

A. A2D FRAMEWORK LEARNS TO DIFFERENTIATE AUTHORSHIP CHARACTERISTICS AUTHOR-INDEPENDENTLY
To check whether the Agree network indeed learns generalized authorship attributes, we train the Agree network on the Reuter_C50 dataset with 50 authors. We take the convolutional layers of this Agree network to use in the Disagree network and fine-tune this Disagree network on the original datasets (i.e., Spooky_Author_Binary, Spooky_Author, and Reuter_C50) and the corresponding dummy datasets (i.e., Dummy_Spooky_Author_Binary, Dummy_Spooky_Author, and Dummy_Reuter_C50). After fine-tuning, if we find that the performance of the Disagree network on an original dataset is significantly higher than that on the corresponding dummy dataset, we can claim that the Agree network of the A2D framework is indeed capable of capturing the authorship attributes. It can be observed from Fig. 4 that this is indeed the case. In particular, the Disagree network achieves more than 80% accuracy on the dataset with two authors (Spooky_Author_Binary) but only 54% accuracy on the dummy dataset with two authors (Dummy_Spooky_Author_Binary), registering a significant difference of around 26%. Similarly, for the dataset pair with three authors (i.e., Spooky_Author and Dummy_Spooky_Author), the difference in accuracy is around 35%, and finally, for the dataset pair with 50 authors (i.e., Reuter_C50 and Dummy_Reuter_C50), the difference in accuracy is around 54%.
These large differences between the actual and the dummy datasets show that our A2D framework learns to differentiate authorship characteristics in the Agree phase. More importantly, we can now claim that the A2D architecture is capable of differentiating authorship regardless of any specific authorship characteristics, because the authors with whom we train the Agree network and the authors with whom we fine-tune the Disagree network are completely different. Therefore, to leverage the A2D framework for the authorship discrimination task, we need to train the Agree network on an author-classified dataset to capture the innate authorship characteristics. In the Disagree phase, the framework uses these authorship characteristics to decide whether the writings are from the same or different authors.
It is important to note that our work emphasizes building a framework that can differentiate the writings of different authors without any prior information about those writings. Therefore, we design our A2D framework so that it achieves high accuracy if it detects distinct authorship styles in the dataset and low accuracy if it does not find a distinguishing authorship style. Thus, it is not meaningful to compare the accuracy of the Disagree network as reported in Fig. 4 with previous works.

B. AUTHORSHIP DIFFERENTIATION WITH ATTENTION
The distribution of normalized attention scores over the positions of the feature vector is shown in Fig. 5 and Fig. 6. When the dataset contains a unique authorship style for each author (Fig. 5), the authorship filter passes adequate information to the Disagree network, which learns where to pay attention through fine-tuning. As a result, the Disagree network produces different attention distributions for different authors, and these different distributions help the network differentiate the authors. Therefore, the framework shows high performance on this type of dataset. On the other hand, when the dataset is authored by the same author or, equivalently, by authors with identical writing styles (Fig. 6), the authorship filter fails to pass any discriminating information to the Disagree network. Consequently, the Disagree network produces similar attention distributions for all the authors and cannot differentiate the authors by examining the distributions. Therefore, the framework shows low performance on this type of dataset.
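The behaviour described above can be sketched with a simple distance between normalized attention distributions. The scores below are illustrative, not actual model output, and total-variation distance is our stand-in for the visual comparison of the distributions in Fig. 5 and Fig. 6.

```python
import numpy as np

def normalize(scores):
    """Normalize raw attention scores over the feature-vector positions."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum()

def attention_gap(dist_a, dist_b):
    """Total-variation distance between two attention distributions;
    a large gap suggests distinguishable authors, a gap near zero
    suggests similar (or identical) writing styles."""
    return 0.5 * np.abs(normalize(dist_a) - normalize(dist_b)).sum()

# Illustrative scores: distinct styles (Fig. 5) vs identical styles (Fig. 6)
author1 = [5, 1, 1, 1]
author2 = [1, 1, 1, 5]
print(round(attention_gap(author1, author2), 3))  # 0.5 -> clearly separated
print(round(attention_gap(author1, author1), 3))  # 0.0 -> indistinguishable
```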

C. CASE STUDIES
In this section, we conduct two case studies to illustrate how our A2D framework performs on experimental texts. To decide on the authorship discrimination of two texts, we need to choose an accuracy threshold. We set it to 0.75, based on the empirical performance of our framework on binary-labelled datasets. Therefore, if our framework shows accuracy under 0.75, we conclude that the two experimental texts are from the same author; the lower the accuracy, the higher the confidence of the A2D framework that the two texts share an author. Fig. 7 shows the workflow for determining whether two different texts (given as input) are authored by the same person. First, we assign a different label to each of the two experimental texts. Then, we combine both texts and shuffle them randomly to create a combined dataset. We split this dataset in an 8:2 ratio into a training set and a test set, respectively. We fine-tune our Disagree network on the training set and evaluate the accuracy on the test set. If the accuracy exceeds the threshold, we conclude that the two texts are from different authors. Here, it is assumed that the Agree network has already been trained on a labelled multi-authored dataset.
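The workflow of Fig. 7 can be sketched as follows. `make_discrimination_split` is a hypothetical helper that pools labelled segments from the two texts, shuffles them, and splits them 8:2; `same_author` applies the 0.75 threshold. The segmentation granularity and the actual fine-tuning step are abstracted away.

```python
import random

THRESHOLD = 0.75  # empirical accuracy threshold used in the paper

def make_discrimination_split(text_a, text_b, seed=0):
    """Label the two texts (0 and 1), pool and shuffle their segments,
    then split 8:2 into a training set and a test set."""
    data = [(seg, 0) for seg in text_a] + [(seg, 1) for seg in text_b]
    random.Random(seed).shuffle(data)
    cut = int(0.8 * len(data))
    return data[:cut], data[cut:]

def same_author(test_accuracy, threshold=THRESHOLD):
    # Accuracy below the threshold -> conclude the texts share an author.
    return test_accuracy < threshold

train, test = make_discrimination_split(["s1", "s2", "s3", "s4"],
                                        ["t1", "t2", "t3", "t4", "t5", "t6"])
print(len(train), len(test))   # 8 2
print(same_author(0.51))       # True  (cf. the Irving case study)
print(same_author(0.92))       # False (cf. the Quran case study)
```

In practice, the Disagree network would be fine-tuned on `train` and `test_accuracy` would come from evaluating it on `test`.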

1) CASE STUDY I: UNMASKING THE PSEUDONYMS OF WASHINGTON IRVING
Washington Irving was an American short-story writer, essayist, and biographer of the early 19th century who is best known for his short stories ''Rip Van Winkle'' and ''The Legend of Sleepy Hollow''. In 1809, he wrote ''A History of New York'', a satirical novel of historic New York, under the pen name Diedrich Knickerbocker. Irving is also the writer of ''The Sketch Book of Geoffrey Crayon, Gent.'', a collection of 34 essays and short stories published serially throughout 1819 and 1820. This book marks Irving's first use of the pseudonym Geoffrey Crayon. Hence, in our first case study, we investigate whether our A2D framework can detect that these two books are from the same author.
We obtain ''A History of New York'' and ''The Sketch Book of Geoffrey Crayon, Gent.'' from Project Gutenberg [39]. We assign different labels to the two books and shuffle randomly to create the dataset. We split it in an 8:2 ratio to create the training and the test set. After fine-tuning the framework on the training set, we evaluate the accuracy on the test set.
Our framework achieves only 51% accuracy in differentiating the authorship of this dataset. As ''A History of New York'' is a satirical novel and ''The Sketch Book of Geoffrey Crayon, Gent.'' is a collection of short stories, the two books represent different genres and topics. However, the innate characteristics of the purported authors, namely Diedrich Knickerbocker and Geoffrey Crayon, are still revealed to our framework through the identical pattern of the attention score distributions (Fig. 8). As a result, our framework shows accuracy below the accuracy threshold (0.75). Hence, we can conclude that ''A History of New York'' and ''The Sketch Book of Geoffrey Crayon, Gent.'' are from the same author.

2) CASE STUDY II: DID THE PROPHET OF ISLAM AUTHOR THE HOLY QURAN?
The Holy Quran is the central religious text of Islam. Its name means ''The Recitation''. It is widely regarded as the finest work in Arabic literature [40], [41]. The Holy Quran is organized into 30 parts, 114 chapters and a total of 6236 verses. Sahih al-Bukhari is one of the Kutub al-Sittah (six major hadith collections) of Sunni Islam. The Arabic word 'Sahih' translates as authentic or correct. The collection of prophetic traditions, or hadith, for Sahih al-Bukhari was performed by the Muslim scholar Muhammad al-Bukhari. Sunni Muslims view it as one of the two most trusted collections of hadith along with Sahih Muslim. About 24.1% [42] of the world population are Muslims, and they believe that the Holy Quran is the revelation from God (Allah) Himself to His Messenger Prophet Muhammad (Peace Be Upon Him) through the angel Gabriel (Jibril). Muslims also believe that the Bukhari contains direct speeches and traditions of the Prophet Muhammad. On the other hand, the authorship/origin of the Quran has long been a topic of debate in theology scholarship [43], [44]. This motivates us to examine the verdict of our A2D framework on the authorship of the Holy Quran and the speeches of the Prophet contained in the Bukhari, i.e., whether they reflect the same authorship.
We collect the English translation of the Holy Quran dataset from Kaggle [45] and consider each verse of this book as an instance. We parse the English translation of the Bukhari [46] and consider as instances only those entries that include the Prophet's statements. This is important because, apart from compiling the actual statements of the Prophet of Islam, the Bukhari also contains various 'metadata' that include (but are not limited to) the details of the chain of narration. The Bukhari entries contain the names of the narrators as sources. We remove these names to avoid introducing any unwanted bias and take only the exact speech of the Prophet Muhammad (please see Fig. 9). In this way, our Quran-Prophet's_statements dataset contains a total of 11808 instances (6236 verses of the Holy Quran and 5572 statements of the Prophet from the Bukhari). We split it into a training set and a test set in an 8:2 ratio. Thus, the test set includes 1259 verses from the Holy Quran and 1101 statements of the Prophet.
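As an illustration of this preprocessing step, a chain-of-narration prefix such as ''Narrated Abu Huraira:'' can be stripped with a simple pattern. The layout assumed below and the regular expression are illustrative assumptions about the translation's formatting, not the paper's actual parser.

```python
import re

def extract_prophet_statement(hadith_text):
    """Drop the chain-of-narration prefix (e.g. 'Narrated Abu Huraira: ...')
    and keep the remainder of the entry. The pattern is an assumption
    about the English translation's layout, for illustration only."""
    # Remove a leading "Narrated <name>:" metadata block, if present
    return re.sub(r"^Narrated\s+[^:]+:\s*", "", hadith_text.strip())

sample = 'Narrated Abu Huraira: The Prophet said, "Religion is easy."'
print(extract_prophet_statement(sample))
# The Prophet said, "Religion is easy."
```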
Our A2D framework achieves 92% accuracy in differentiating the authorship of the Holy Quran and the Prophet's statements. Fig. 10 shows the distribution of normalized attention scores over the positions of the feature vector for the Quran-Prophet's_statements dataset. The difference in authorship characteristics results in different distributions of the attention scores for the Holy Quran and the Prophet's statements. As a result, our framework is able to comprehensively discriminate the authorship. Therefore, based on the A2D framework, we conclude that the Prophet himself is highly unlikely to be the author of the Holy Quran.

A. UNIQUENESS OF THE A2D FRAMEWORK
The earlier works ([5]-[7], [11], [12]) focused on selecting a specific set of features depending on the nature of the dataset or the characteristics of a specific author; in this way, those works differentiated a particular author from a closed set of other authors. In contrast, the significant difference in accuracy between each benchmark dataset and its respective dummy dataset (Fig. 4) demonstrates the capability of our framework to detect whether two different writings are from the same author (applications of the A2D framework are discussed in more detail in Section V-C), regardless of the nature of the dataset or the characteristics of specific authors. To the best of our knowledge, the A2D framework is the first of its kind that does not depend on any corpus-specific features. Therefore, a comparison between the performance of the A2D framework and other existing works is not meaningful.

B. RATIONALE ON THE SELECTION OF ACCURACY THRESHOLD
We have set an empirical accuracy threshold (0.75) to decide whether two given writings are from the same author (see Section V-C and Fig. 7). While this is convenient, admittedly in some corner cases or on some other datasets such a threshold may become misleading, because the threshold may vary depending on the characteristics of the dataset and the authors. Hence, it is difficult to define a universal threshold applicable to all datasets. Therefore, instead of a quantitative approach, we can take a qualitative approach based on the attention distributions, as described in Section V-B. In particular, the more similar the attention distributions of the two writings are, the more confident our framework is about similar authorship.
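One hedged way to quantify this qualitative signal is a similarity score between the two normalized attention distributions. Cosine similarity is our choice for illustration; the paper itself compares the distributions visually.

```python
import numpy as np

def attention_similarity(dist_a, dist_b):
    """Cosine similarity between two normalized attention distributions.
    Values near 1 indicate near-identical distributions and hence higher
    confidence in shared authorship (an illustrative metric, not the
    paper's)."""
    a = np.asarray(dist_a, dtype=float)
    b = np.asarray(dist_b, dtype=float)
    a, b = a / a.sum(), b / b.sum()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative attention scores, not real model output
identical = attention_similarity([4, 2, 1, 1], [4, 2, 1, 1])
distinct = attention_similarity([4, 2, 1, 1], [1, 1, 2, 4])
print(round(identical, 2), round(distinct, 2))  # 1.0 0.55
```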

C. RATIONALE ON THE TRAINING OF AGREE NETWORK ON DATASETS
In the A2D architecture, the Disagree network utilizes the authorship filter created by the Agree network (as described in Section III) to determine whether any two writings are authored by the same person. Hence, it is important to create this authorship filter in such a way that it focuses only on capturing the innate authorship styles rather than other differentiating factors (e.g., the topic of the writings). This is why the Agree network should be trained on a dataset in which the authors differ mainly in their innate authorship styles: if we train the Agree network on a dataset that contains the writings of different authors on different topics, the network may create the authorship filter to differentiate the topics, not the authorship styles. Therefore, in our work, we train the Agree network on the Reuter_C50 dataset in an attempt to minimize the topic variance so that the network can learn only from the innate authorship styles.

D. A2D FRAMEWORK IN HANDLING THE AUTHORSHIP IDENTIFICATION TASK
We have principally tasked the A2D framework for the authorship discrimination task, and that too in a unique setting that is hitherto unexplored in the literature. However, it is worth pointing out that the A2D framework can also be tasked with the authorship identification task as follows. Suppose, there is a (target) book whose author needs to be identified. Now, assume that we have a large dataset of books whose authorships are known and it contains some other books authored by the author of the target book. To determine the authorship of the target book, the A2D framework is applied between the target book and each book in the dataset. If the A2D framework shows accuracy below the accuracy threshold or, equivalently, exhibits similar attention distributions between the target book and a book in the dataset, we can conclude that the target book and that book in the dataset are authored by the same author, thereby identifying the author of the target book.
To elaborate on how the A2D framework will work in this setting, we give a fictitious but concrete example as follows. Suppose, book_X is written by author_X but its authorship is not known. So, book_X is our target book. To identify author_X as the author of book_X, we need a dataset containing (author-labelled) books authored by various authors including author_X. Suppose, dataset_XYZ is such a dataset containing books authored by author_X, author_Y, and author_Z. First, we train the Agree network of the A2D framework on a different labelled dataset of multiple authors (similar to training the Agree network in the authorship discrimination task). Then, we employ the Disagree network to differentiate the authorships between book_X and each of the books in dataset_XYZ. As we show in Section V, the A2D framework will exhibit accuracy below the accuracy threshold and identical attention distributions while differentiating authorships between book_X and a book authored by author_X in dataset_XYZ. Therefore, we can then identify author_X as the author of book_X.
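The identification procedure described above can be sketched as a pairwise loop. Here, `discriminate` stands in for fine-tuning the Disagree network on a book pair and returning its test accuracy; the toy discriminator and all book/author names are illustrative, following the fictitious example in the text.

```python
THRESHOLD = 0.75  # empirical accuracy threshold from the paper

def identify_author(target_book, labelled_books, discriminate):
    """Run pairwise authorship discrimination between the target book and
    each labelled book; accuracy below the threshold signals shared
    authorship, which identifies the target book's author."""
    for author, book in labelled_books:
        if discriminate(target_book, book) < THRESHOLD:
            return author
    return None  # no match in the reference dataset

# Toy discriminator: low accuracy only for same-author pairs (illustrative)
truth = {"book_X": "author_X", "book_A": "author_X",
         "book_B": "author_Y", "book_C": "author_Z"}
fake_discriminate = lambda t, b: 0.52 if truth[t] == truth[b] else 0.90

dataset_XYZ = [("author_X", "book_A"), ("author_Y", "book_B"),
               ("author_Z", "book_C")]
print(identify_author("book_X", dataset_XYZ, fake_discriminate))  # author_X
```

In a real deployment, each call to `discriminate` would involve the full fine-tune-and-evaluate workflow of Fig. 7, so the attention distributions could also be inspected as a qualitative cross-check.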
On an ending note, although the A2D framework can be employed in the authorship identification task in this way, we stress that the uniqueness and principal contribution of this framework, as discussed earlier, lies in differentiating the authorship of any two writings regardless of the characteristics of the dataset or any specific author.

VII. CONCLUSION AND FUTURE WORK
In this study, we present the Agree-to-Disagree (A2D) framework for the authorship discrimination task. We show its performance on different datasets to prove that the framework does not rely on any dataset-specific prior information to distinguish among the authors. Additionally, we perform two case studies to show the applications of this framework. In the first case study, we show that our framework can successfully detect the writings of an author who uses different pen names. In the second case study, we focus on a sensitive topic and deploy our framework to determine whether the Holy Quran could have been authored by the Prophet of Islam. Our framework affirms the Muslim belief and reveals that the Prophet is highly unlikely to be the author of the Holy Quran.
Our framework is flexible and easily extendable. For example, it can serve as a baseline for future authorship discrimination research. Moreover, as we show in our case studies, it can be used to differentiate the authorship of historical writings whose authors are unknown.