Pattern-Based Syntactic Simplification of Compound and Complex Sentences

With the advent of new technologies, automatic text simplification has gained popularity and importance among natural language researchers over the last decade. The predominant research in Automatic Sentence Simplification (ASS) is inclined toward either lexical or syntactic simplification of sentences. The literature survey shows that existing research in lexical simplification relies on the word substitution technique, which causes word sense ambiguity when the substituted synonyms are inappropriate for a sentence in the given context. In contrast, syntactic simplification, though accurate and applicable to Natural Language Processing (NLP) tasks, requires tremendous effort to construct rules for a given domain. This research proposes a framework called Pattern-based Automatic Syntactic Simplification (PASS), which identifies sentences and applies rules based on grammatical patterns to simplify them, thereby making the approach more generic for NLP tasks. PASS is evaluated by human experts, who rate the usefulness of the framework based on the fluency, adequacy, and simplicity of the sentences. Furthermore, the framework is automatically evaluated on a publicly available corpus using the automatic metrics SARI, BLEU, and FKGL. The proposed approach generates promising results in the field of ASS and could be used as a preliminary module for NLP tasks as well as for other natural-language applications such as summarization, anaphora resolution, and question answering.


I. INTRODUCTION
Automatic Sentence Simplification (ASS) involves techniques that not only reduce the linguistic complexity of a sentence but also preserve its meaning and original information. Specifically, the simplified form of a sentence is obtained either by substituting words in the sentence or by transforming the sentence structure [1], [2]. ASS is important both for software applications and for certain NLP tasks [2], [3]. Software applications using sentence simplification models mainly assist language learners, people with reading disabilities, or low-literacy readers [3]-[5]. Furthermore, ASS improves the performance of NLP tasks such as anaphora resolution [6] and machine translation [7]. (The associate editor coordinating the review of this manuscript and approving it for publication was Sergio Consoli.)
Most of the early work in ASS extracted sentences that conveyed the meaning most similar to the original sentence, an approach called extractive sentence simplification. Later, with the availability of knowledge resources such as WordNet, research in abstractive sentence simplification increased compared to extractive sentence simplification [3]. Under the abstractive approach there are two techniques: lexical simplification and syntactic simplification [12], [13]. Lexical simplification identifies and replaces difficult phrases or words using the word substitution technique. In contrast, syntactic simplification converts the complex structure of a sentence into simple syntactic structures [14] while retaining the original words. The literature review in this paper discusses research targeting both syntactic and lexical simplification.
Though there have been substantial attempts in the literature to simplify sentences, most models have limitations in generating grammatically correct and simple sentences. Some existing approaches, though able to generate grammatically correct simple sentences, fail to retain the meaning of the original sentences. The majority of lexical simplification systems use neural networks that learn either from provided rules or by themselves. This approach is time consuming due to the extensive training required and works only for certain input sentences. Furthermore, lexical simplification systems use the word substitution technique, which causes word sense ambiguity: when a word has several meanings, the system cannot determine which one should be substituted for the original word in the given context. In the case of syntactic simplification, systems depend on automatically learned or hand-crafted rules specific to a given domain. This not only makes the system domain dependent but also requires tremendous effort to create the rules.
To overcome these limitations, the proposed research is based on grammatical patterns which can be further extended to cover complex constructs in a given sentence. The proposed Pattern-based Automatic Syntactic Simplification (PASS) utilizes grammatical patterns existing in sentences to apply rules, thereby simplifying compound and complex sentences. Furthermore, the identified grammatical patterns are generic to a given language and apply to any context. The system generates simple sentences with substantial improvements in grammatical correctness and simplicity, while also preserving the meaning of the original sentences. In this direction, the motivation is to propose a domain-independent system which:
• Uses grammatical patterns to identify different kinds of compound and complex sentences
• Applies rules based on grammatical patterns to transform and simplify sentences
• Evaluates PASS with respect to its usefulness and compares it with existing models using automatic metrics
The rest of the article is organized as follows. Basic terminology is presented in Section II, with related work in Section III. The methodology is explained in Section IV. Section V provides the analysis and discussion of the results, Section VI gives the evaluation done for PASS, and Section VII presents the conclusion.

II. BACKGROUND
A sentence is a set or group of words which conveys meaning in a given language. English sentences are classified as simple, compound, and complex sentences. The structure of a sentence comprises the information of a topic: the topic represents the subject, while the details or information about the topic represent the predicate, i.e., the verb and object in English grammar. S derives the sentence as a subject followed by a verb, as shown in Equation 1, while a subject can comprise a noun, pronoun, restrictor, or determiner, as represented in Equation 2. The verb constituent can comprise a verb, complement, or object, as shown in Equation 3.

S → Subject Verb (1)
Subject → Noun | Pronoun | Restrictor | Determiner (2)
Verb → verb | complement | object (3)
Sentences can be classified into declarative, imperative, yes-no question, and wh-question types [15]. Context-Free Grammar (CFG), in linguistic terminology, represents the structure of every English sentence in an equivalent mathematical form as stated in [16]. The structure is encoded with a rule, or production, which expresses the ordering of the words forming a pattern. CFG can represent sentence-level grammatical constructions in English for declarative sentences as shown in Equation 4, imperative as shown in Equation 5, yes-no questions as shown in Equation 6, and wh-question sentences as shown in Equation 7:

S → NP VP (4)
S → VP (5)
S → Aux NP VP (6)
S → Wh-NP VP (7)

These equations represent the nonterminal symbol S and its corresponding production rules for the derivation of each sentence type. Here NP refers to a Noun Phrase, VP to a Verb Phrase, Aux to the auxiliary verb in the yes-no question, and Wh-NP to the noun subject of an interrogative sentence containing a wh-word, e.g. what, who, and so on.
The grammatical structures of the different sentence types have been explored in [17]. That paper gives analyses and structural representations for simple and compound sentences, which paved the way to explore the grammatical patterns used in this research. The criteria for distinguishing simple from compound and complex sentences have been explored in [18]; that approach distinguishes them by the number of verbs in the sentence. Sentence classification has also driven an important improvement for sentiment classification in [19], where classification was done with a divide-and-conquer approach followed by sentiment classification using a one-dimensional convolutional neural network. That work discusses the different sentence types, their structures, and the different conjunctions (coordinating, correlative, and subordinating), along with their order, to illustrate the various types of declarative sentences in English: the coordinating conjunctions used in compound sentences and the correlative and subordinating conjunctions used in complex sentences. The present research targets the simplification of declarative sentences.
Declarative sentences can be categorized as simple, complex, compound, and compound-complex sentences. The categories are based on the presence of clauses: a sentence can comprise one or multiple clauses. If a clause expresses a complete thought, i.e., it has a subject and a predicate, it is called an independent clause (IC) [19]. If a clause cannot express a complete thought or is incomplete, i.e., it has only a subject and a verb, it is called a dependent clause (DC). The different categories of sentences are explained in the following subsections.
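These clause-based categories can be illustrated with a rough surface-level classifier. This is only a sketch under the assumption that conjunction words alone signal the clause structure; PASS itself relies on POS tagging and CFG-based chunking rather than word lookup, and the conjunction lists here are illustrative, not exhaustive.

```python
# Surface-level sketch of the sentence categories above. It only checks for
# conjunction words, so it is an approximation: PASS itself uses POS tagging
# and CFG chunking. The conjunction lists are illustrative assumptions.
CCONJ = {"for", "and", "nor", "but", "or", "yet", "so"}  # FANBOYS coordinators
SCONJ = {"because", "although", "while", "since", "whenever", "after", "that"}

def classify(sentence: str) -> str:
    words = [w.strip(",.;") for w in sentence.lower().split()]
    has_cc = any(w in CCONJ for w in words[1:])  # coordinators rarely start a sentence
    has_sc = any(w in SCONJ for w in words) or "even though" in sentence.lower()
    if has_cc and has_sc:
        return "compound-complex"
    if has_cc:
        return "compound"
    if has_sc:
        return "complex"
    return "simple"
```

On the running examples of this section, the sketch labels 'Cats are fond of balls' simple, 'He went to the party but she stayed home' compound, and 'Whenever it rains I like to wear my blue coat' complex.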

A. SIMPLE SENTENCES
Sentences which contain no DC, in other words only one IC with a subject and predicate, are termed simple sentences [20]. An example, 'Cats are fond of balls', is graphically represented with POS tagging in Figure 1 using SPACY [21].

B. COMPOUND SENTENCES
Compound sentences comprise at least two ICs connected by a coordinating conjunction (CConj) [20]. The coordinating conjunctions are 'For', 'And', 'Nor', 'But', 'Or', 'Yet', and 'So', remembered as 'FANBOYS', the connectors between two or more ICs [19]. POS tagging for the compound sentence 'He went to the party but she stayed home' is represented in Figure 2. Compound sentences, if not connected by a conjunction, can also be separated by a semicolon [22]. On observation, compound sentences can be categorized into three types:
1) Type I: Each IC comprises its own subject (same or different), each performing a different action, connected by CConj
• 'Cats like to drink milk and dogs like to chew bone'
• Here 'and' connects the IC 'Cats like to drink milk' with the IC 'dogs like to chew bone'
2) Type II: The ICs share a subject, with the CConj connecting two verb phrases
• 'Cats like to drink and play with the ball'
3) Type III: The ICs share a subject and verb, with the CConj connecting two objects
• 'Cats like to play with dogs and other cats'

C. COMPLEX SENTENCES
Complex sentences comprise one IC and at least one DC [20]. The IC and DC are joined in any order with a subordinating conjunction (SConj) such as 'even though', 'because', 'after', 'while', 'since', and so on [19]. POS tagging of the complex sentence 'Whenever it rains I like to wear my blue coat' is represented in Figure 3. If the DC comes before the IC, they are usually separated by a comma [23]. Based on the position of the subordinating conjunction and the number of DCs, complex sentences can be categorized into the following three types:
1) Type IV: Sentences having an SConj such as 'that' followed by a verb
• 'He bought the shoes that were in the shop window'
2) Type V: Sentences having an SConj at the beginning or in the middle of the sentence, e.g. 'Because' at the beginning or 'even though' in the middle
• 'Because my tea was too cold, I heated it in the oven'
• 'The servants were very obedient to him even though he was very harsh to them'
3) Type VI: Sentences having more than one DC
• 'Artificial intelligence research has been necessarily cross-disciplinary drawing on areas of expertise such as applied mathematics, thus emerging as the latest technology'
• DCs: 'drawing on areas of expertise such as applied mathematics' and 'thus emerging as the latest technology'

D. COMPOUND-COMPLEX SENTENCES
Compound-complex sentences are those having two ICs and at least one DC [24]. E.g. 'While Sandhya reads comics, Anu reads novels, but Anjali reads only magazines'. The POS tagging of this sentence is shown in Figure 4.

III. RELATED WORK
ASS as a preliminary module is important for NLP processes and has often been discussed from a linguistic perspective. A basic requirement of simplification is to obtain an easy and simple version of a given sentence. Based on this, existing approaches in syntactic simplification explore parsing and dependency linkages of a given language to simplify sentences. From another perspective, sentence simplification techniques based on machine learning have also been experimented with; these include statistical machine translation and Conditional Random Fields (CRF) to decompose sentences in the direction of lexical simplification. This section gives an insight into work on both lexical and syntactic simplification techniques.

A. SYNTACTIC SIMPLIFICATION
The initial work on syntactic simplification of sentences was done in [1], wherein the researcher explored two alternatives to full parsing for simplifying sentences: one approach clusters noun and verb groups through a Finite State Grammar, while the other generates dependency linkages using a supertagging model. The two methods were compared to determine which was better for simplifying sentences. Another researcher used semantic role labeling in [25] for sentence simplification. The approach used hand-written transformation rules corresponding to basic syntactic patterns; the preference among rules for a given sentence is parametrized by weights learnt by the model. The model applies the rules and transforms the original sentences until the resultant set comprises simplified sentences. A tree model based on statistical machine translation was proposed in [26] to derive a parse tree for complex sentences. This model simplified sentences by applying splitting, reordering, dropping, and substituting operations to the parse tree, and was targeted towards complex sentences. An approach exploiting syntactic parse trees for simplifying both compound and complex sentences was contributed in [27]. Once the parse tree of the sentence is obtained, the coordinating conjunctions connecting the compound sentence are identified, and the sentence is split recursively until simplified. Correspondingly, for complex sentences, the technique tries to remove the subordinate clause (SBAR) tag, thereby simplifying the complex sentence. Another researcher contributed a syntactic simplification system using a dependency parse tree for complex sentences [28]. Given the dependency tree of a complex sentence, the system produced structurally similar, simpler English sentences that preserved meaning.
Another approach, the Simplified Factual Statement Extractor in [29], extracted multiple simple sentences from text using rules and generated questions automatically. The extracted sentences were multiple, simple, and syntactic, and semantically correct factual questions were generated from complex sentences. The sentence simplification technique proposed in [30] focused on clauses and entities to extract relations in sentences. The approach defines two rules: one for clause selection, which removes the noise before and after the clause, and one for entity-phrase selection, which simplifies the entity without modifying the relations' truth value. The limitation of this approach was that some rules changed the modality of the sentence during simplification.
Another contribution to syntactic simplification, in [31], applied heuristics to syntactic parse trees, creating alternatives of the input sentence and several possible simplified versions. The limitation was that heuristic templates over the parser output were required, which in turn produced simplified sentences with grammatical errors. Syntactic simplification for French texts was explored in [4] using two stages: in the first stage, all possible simplifications for a given sentence were generated; in the second, the best simplified sentence satisfying certain criteria was selected. The limitation was the prerequisite of manually hand-crafted rules extracted from a French corpus. An approach named Simplified Statement Extraction (SSE) in [32] decomposed the original sentence into small simplified sentences. The experiment yielded simple sentences but with lower grammatical accuracy.
Complex sentences in Punjabi texts were identified in [33] using CRF. Here all accessible features, including interactions between sentences, were noted by the framework, but the work was limited to the Punjabi language. Simplification of compound-complex sentences in [24] was done using the Stanford dependency parser, which generated a Modified Stanford Dependency (MSD) structure derived from the Basic Stanford Dependency (BSD) and Collapsed Stanford Dependency (CSD) structures of the parser. The limitation of the system was that it generated incomplete sentences for some compound-complex test cases. Simplification of complex sentences for building a Spanish corpus has been done in [34]. Depending on the number of conjugated verbs, the methodology splits long and complex sentences into simple sentences. The limitation is that the splitting of complex sentences is only done for coordinate or subordinate sentences.
VOLUME 10, 2022

B. LEXICAL SIMPLIFICATION
Lexical simplification of sentences was tackled in [35] with a dataset comprising Resource Description Framework (RDF) triples. Using the RDF data, a sequence-to-sequence model named Split and Rephrase simplified complex sentences. The technique uses a probabilistic model to predict the splitting point for a given input sentence; in some cases, a wrongly predicted splitting point prevents the model from generating fluent, meaning-preserving rephrasings. Another lexical ASS approach in [36] generated a simplified sentence for a given input by utilizing a corpus. The technique is a variant of a phrase-based machine translation system that incorporates phrasal deletion to simplify text; though it obtained positive results, it produced grammatically incorrect sentences in some cases. A simplification model for children was proposed in [37], wherein difficult words were replaced by easier synonyms; however, the simplification introduced certain errors. A simplification model using unsupervised machine learning was implemented in [38]; its limitation was that it could not simplify sentences with parsing errors. A rule-based sentence simplification model was devised in [39], incorporating contextual information to replace the original words of the input sentence; however, it worked effectively only for frequently occurring words in the domain.
Sentences were simplified without rules in [38] by splitting them based on their semantic structure using an unsupervised approach. The approach, though less sensitive to parsing errors, required a large corpus of simplified and standard language with no alignment between them. Lexical simplification of text was also explored in [40] using a sequence-to-sequence neural network with two Long Short-Term Memory (LSTM) networks. Using a development set, the model not only predicted perplexity values but also selected the model parameters with the best perplexity; thus the model could perform both lexical simplification and content reduction.
Lexical simplification using reinforcement learning was experimented with in [41]. The model, called Deep REinforcement Sentence Simplification (DRESS), explored multiple possible simplifications of a sentence and learned a reward function favoring grammatical, fluent, and simple sentences. However, it required an optimal method to reward the sentences, leading to significantly longer training than other neural network models. A basic neural network model in [42] utilized a configuration with both multilayer and multihead attention architecture for sentence simplification. The model was trained through paraphrase detection covering all the simplification rules; although this helped leverage better rules, the system required an additional memory component to maintain them. Neural Semantic Encoders, contributed in [43], utilized greedy and beam search strategies to obtain the target sentence when simplifying complex sentences. The results showed a positive correlation between adequacy and fluency but a negative correlation between adequacy and simplification. Sentence simplification using an automatic semantic parser has been explored in [44], based on an effective and simple algorithm: after the sentences were split, the text was fine-tuned to tackle both structural and lexical simplification.
An approach exploring both syntactic and lexical simplification utilized unlabeled text corpora collected from an English Wikipedia dump. The architecture comprised a shared encoder along with a pair of attentional decoders [45]. The framework simplified complex sentences by training the model with complex sentences and their corresponding simple counterparts. A sequence-to-sequence model explored in [46] shared all encoder and decoder parameters for multi-task learning; the sentence simplification, entailment generation, and paraphrase generation tasks were trained in parallel to obtain the desired output. An approach for lexical simplification of complex sentences was experimented with in [47] that incorporated content-word complexities and a loss function into its word complexity model during training. The model then generated a large set of diverse candidate simplifications during testing to promote adequacy, simplification, and fluency. The model did not perform well for (1) long and complex sentences having multiple clauses, (2) sentences that needed anaphora resolution to preserve meaning, and (3) sentences where simplification was applied to the wrong part of the sentence.
One more approach, in [5], adapted a discrete mechanism on a sequence-to-sequence model by providing explicit control over sentence attributes such as paraphrasing detail and sentence length. The attributes were chosen carefully by the audience based on the requirements of the simplification; the model was hence termed AudienCe-Centric Sentence Simplification (ACCESS). In contrast to previous models, where the operations were not learned by the neural network, the model in [48] was explicitly trained to learn the operations of adding, deleting, and keeping words during sentence simplification. By training the intended edit operations on the target sentence, the approach resembled the human thinking process while simplifying a given sentence. An iterative edit-based unsupervised approach for simplifying complex sentences was proposed in [49]. The approach iteratively edited a given sentence at the word and phrase level, generating multiple sentences by performing edit operations that change the syntactic and lexical structure. However, the approach could not paraphrase the sentences properly, so the meaning of the original sentences was not always retained.
From the existing approaches, it is observed that:
• The majority of works concentrate on rules specific to a given domain and are therefore domain dependent
• Some existing works design their methods using specific language constructs of Basque, Spanish, Punjabi, and French which, though yielding promising results, cannot be reused for English
• Existing approaches are limited by generating sentences with incorrect grammatical structure
In this connection, PASS identifies different grammatical patterns in sentences and then applies rules to simplify them. Further, the system is not only manually evaluated on the basis of its usefulness, but is also evaluated automatically to compare the approach with existing approaches using a publicly available dataset and automatic metrics.

IV. METHODOLOGY
PASS comprises three stages: segmentation and tagging, chunk identification, and analysis and application. Figure 5 shows PASS, which simplifies sentences using grammatical patterns. PASS transforms or aligns an input sentence into two or three simplified sentences, so the sentence transformation is categorized as one-to-N as discussed in [3]. In the first stage, the input text is segmented into sentences and POS tagging is done for each segmented sentence; POS tagging labels each word of the sentence with its part of speech. Once tagging is done, chunks of words in the sentence are identified through CFG.
Each identified chunk comprises a VP or NP with its attached modifiers. The NP chunk also identifies the singular/plural number of the phrase, while the VP chunk identifies the tense and other aspects of the sentence. After chunk identification, the last stage, analysis and application, is performed: the identified chunks are analyzed and rules are applied to transform the input sentences into simplified sentences. Analysis and application require the grammatical patterns of the CFG to be analyzed, after which the rules based on those grammatical patterns are applied to the identified chunks.
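The grouping step of the chunk-identification stage can be sketched as follows. Pre-tagged (word, tag) pairs are assumed as input (the paper obtains tags via SPACY/NLTK); the Penn-style tag sets below are simplifying assumptions, not the paper's exact CFG.

```python
# Illustrative grouping step for chunk identification. Pre-tagged (word, tag)
# pairs are assumed as input; the tag sets are a simplified Penn-style
# assumption, not the paper's exact CFG.
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "PRP"}         # determiners, adjectives, nouns
VP_TAGS = {"VB", "VBD", "VBG", "VBP", "VBZ", "TO", "RB"}  # verbs, 'to', adverbs

def chunk(tagged):
    """Group consecutive (word, tag) pairs into labeled NP/VP chunks."""
    chunks, current, label = [], [], None
    for word, tag in tagged:
        kind = "NP" if tag in NP_TAGS else "VP" if tag in VP_TAGS else None
        if kind != label and current:   # chunk boundary reached: flush
            chunks.append((label, current))
            current = []
        if kind is not None:
            current.append(word)
        label = kind
    if current:
        chunks.append((label, current))
    return chunks
```

For the tagged tokens of 'Cats like to drink milk', the sketch yields the chunks NP ['Cats'], VP ['like', 'to', 'drink'], NP ['milk'], which is the subject/verb/object material the later rules operate on.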
Once the chunks are identified, the rules are chosen depending on the type of sentence encountered. A rule common to all compound sentences is shown in Equation 8:

P → S CConj Z ⇒ X, T (8)

It can be interpreted as follows. If P is a complete sentence S constituting NP and VP as shown in Equation 4, followed by CConj, followed by Z, where Z is an arbitrary sequence of words, the sentence is simplified into two sentences X and T. Here X is the existing sentence formed by separating out S, i.e. S1, while T forms another complete sentence S2 by identifying and appending the missing subject, predicate, or object from X. If Z is a complete sentence by itself after removing the CConj, then T = Z.
The example P → 'Cats like to play with dogs and other cats', shown as Type III, will be simplified based on the rule given in Equation 8 as follows.
• S → 'Cats like to play with dogs'
• CConj → 'and'
• Z → 'other cats'
• Z has a missing subject and predicate, which are referred and appended from X, i.e. subject → 'Cats like'
• Missing predicate → 'to play'
• Simplified sentences are:
-X → 'Cats like to play with dogs'
-T → 'Cats like to play with other cats'
• Similarly, the sentence shown as Type I will be simplified as
-X → 'Cats like to drink milk' and
-T → 'Dogs like to chew bone'.
• The sentence shown as Type II will be simplified as
-X → 'Cats like to drink' and
-T → 'Cats like to play with the ball'.
Compared to compound sentences, a single common rule or grammatical pattern is not observed in complex sentences due to their structure. The rule based on grammatical patterns applicable to sentences of Type IV is shown in Equation 9:

X → IC 'that' DC ⇒ X, T (9)

Equation 9 is understood as follows: if a complex sentence begins with an IC, followed by 'that', followed by the DC, the sentence is divided into two sentences X and T. X is the complete sentence S1, while T is the second complete sentence Z appended with the second NP referred to in X; the second NP in X happens to be the object of the sentence. The rules applied to sentences of Type V are given in Equations 10 and 11:

X → IC SConj DC ⇒ X, T (10)
X → SConj DC, IC ⇒ X, T (11)

In this case the complex sentence has the SConj in the middle of the sentence, as in rule 10, or at the beginning of the sentence, as in rule 11; the original sentence is then divided into two simple sentences, X having the IC and T having the DC. For complex sentences comprising more than one DC, as in Type VI, the split of the original sentence is done as shown in Equation 12:

X → NP VP1 SConj DC ⇒ X, T (12)

The rule states that if X is an IC (with NP along with the verb VP1) followed by the SConj followed by the DC, the sentence is divided into two sentences: the initial sentence X has the complete IC, while T is the second complete sentence with the NP of X appended with the changed tense of the verb VP1, represented as VPC1, and the DC of Z. For this type, a dictionary is built to generate the past tense of the verbs recurring in complex sentences.
Using Equation 9, the example X → 'He bought the shoes that were in the shop window', shown in Type IV, will be simplified as follows.
• IC → 'He bought the shoes'
• DC → 'were in the shop window'
• The DC has a missing subject (the object in X), which is referred from X, i.e. subject → 'The shoes'
• Simplified sentences are:
-X → 'He bought the shoes'
-T → 'The shoes were in the shop window'
• Similarly, the sentence X → 'Because my tea was too cold, I heated it in the oven', shown in Type V, will be simplified as
-X → 'My tea was too cold' and
-T → 'I heated it in the oven'.
• Another Type V sentence, 'The servants were very obedient to him even though he was very harsh to them', will be simplified as
-X → 'The servants were very obedient to him'
-T → 'He was very harsh to them'
• The sentence shown in Type VI will be simplified as
-X → 'Artificial intelligence research drew on areas of expertise such as applied mathematics' and
-T → 'Artificial intelligence research emerged as the latest technology'.
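The splitting rules worked through above can be sketched at the surface level as follows. The sketch assumes the conjunction and any subject/verb material to copy into T are already known (in PASS they come from CFG chunk analysis); the `subject_verb` parameter is a hypothetical stand-in for that missing-component identification, and special cases such as 'nor' are not handled.

```python
# Minimal surface-level sketch of two PASS-style rules. The conjunction and
# the subject/verb material to copy into T are assumed known in advance;
# 'subject_verb' is a hypothetical stand-in for PASS's chunk analysis.
def split_compound(sentence, cconj="and", subject_verb=None):
    """Compound rule: S CConj Z -> X, T; missing parts of Z borrowed from X."""
    x, _, z = sentence.rstrip(".").partition(f" {cconj} ")
    t = f"{subject_verb} {z}" if subject_verb else z.capitalize()
    return x, t

def split_complex_middle(sentence, sconj="even though"):
    """Type V rule: split before a mid-sentence SConj; DC becomes sentence T."""
    x, _, dc = sentence.rstrip(".").partition(f" {sconj} ")
    return x, dc[0].upper() + dc[1:]
```

On the Type I example the sketch reproduces the paper's split ('Cats like to drink milk' / 'Dogs like to chew bone'), and on the Type III example the borrowed subject and verb yield 'Cats like to play with other cats'.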

A. ALGORITHM
The algorithm PatternAnalysis() describes the steps of analyzing the grammatical patterns in the input sentences and applying the rules to simplify them. Since different rules are applied for different compound and complex sentences based on grammatical patterns, the algorithm has a separate function for each sentence type. Depending on the simplification required, each function is called separately or sequentially: the entire dataset can be simplified by executing the Compound function or the Complex function alone, or both sequentially in either order (Compound-Complex or Complex-Compound). The analysis of each kind of sentence is shown in Table 1, which lists the steps applied as per the algorithm together with the output for each input sentence. The data flow diagram of the entire procedure is shown in Figure 6. Here I denotes the input sentences given to both the Compound and Complex algorithms, while O1 and O2 represent the simplified sentences output by the Compound and Complex algorithms, respectively. The individual flowcharts of the Compound and Complex algorithms are shown in Figure 7 and Figure 8.
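The dispatch described above can be sketched as a small driver. The classifier and the Compound/Complex rule functions are passed in as callables; their names and signatures are assumptions mirroring the description, not the paper's exact code.

```python
# Sketch of the PatternAnalysis() driver. The classifier and the
# Compound/Complex rule functions are injected; their names and signatures
# are assumptions mirroring the description, not the paper's exact code.
def pattern_analysis(sentences, classify, compound_fn, complex_fn):
    simplified = []
    for s in sentences:
        kind = classify(s)
        if kind == "compound":
            simplified.extend(compound_fn(s))   # rule of Eq. 8
        elif kind == "complex":
            simplified.extend(complex_fn(s))    # rules of Eqs. 9-12
        else:
            simplified.append(s)                # already simple: pass through
    return simplified
```

Running Compound and Complex as injected callables is what allows the functions to be executed separately or sequentially over the whole dataset, as the algorithm requires.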

B. DATASET DESCRIPTION
The experimental dataset, named RandomCoCo (Random Compound and Complex), has been designed comprising compound and complex sentences taken from publicly available online sites. The dataset is available online on IEEE DataPort [50] and consists of 979 sentences: 478 compound sentences and 501 complex sentences. The algorithm works with compound-complex sentences, but currently the dataset has only compound and complex sentences, in line with the objective of the research. Of the 478 declarative compound sentences, 435 have conjunctions while the remaining 47 have a semicolon. Among the 431 sentences, 239 comprise more than two conjunctions. Among the remaining 192 sentences, about 90 use 'and', 20 use 'for', 8 use 'nor', 50 use 'but', 8 use 'or', 9 use 'yet', and 7 use 'so' as conjunctions.

Procedure Compound(S_j)
2: if S_j contains CConj then
3: Remove CConj
4: if CConj is 'nor' then
5: Change the second sentence structure
6: end if
7: if X and T simple then
8: Exit
9: else if missing components of IC then
10: Append missing components to the clause
11: Split to generate X and T

Procedure Complex(S_j)
2: if S_j contains SConj == 'that' then
3: Split S_j before 'that' to form X
4: NP before 'that' in S_j is appended to Z as subject of Z
5: Z forms the simple sentence T
6: else if S_j contains SConj in middle then
7: Split S_j before SConj to form X
8: T is S_j after SConj
9: else if S_j contains SConj in beginning then
10: Remove SConj
11: Split S_j before the second noun or delimiter, whichever comes first, to form X
12: T is the sentence after the second noun or delimiter, whichever comes first
13: else if S_j contains two DCs then
14: Split S_j at the first complete sentence to form X
15: Append the main subject of the IC to the first DC, with the changed tense of the verb of the first DC, to form the second sentence
16: Change the tense of VP to past tense and append the main subject to the second DC to form the third sentence
17: end if
18: End Procedure
Figure 9 shows the statistics of compound sentences in RandomCoCo. Considering complex sentences, RandomCoCo comprises 501 declarative sentences, of which 23 belong to Type IV, 20 to Type V, and 458 to Type VI. Figure 10 shows the analysis of complex sentences in RandomCoCo.
The different CFG patterns used for the identified types of sentences are shown in Table 2. The proposed set of patterns covers the general patterns of declarative English sentences comprising compound and complex sentences.

V. DISCUSSION OF RESULTS AND ANALYSIS
The approach is implemented using the Python libraries Natural Language Toolkit (NLTK) and spaCy. The compound algorithm was run on the entire RandomCoCo dataset and generated 1058 simple sentences from the 478 compound sentences. Table 3 summarizes sample outputs for different sentences. Some test cases of compound sentences simplified by PASS are shown as follows.
1) As the method deals with the basic form of the sentence, i.e., having subject, verb, and object, the meaning of the sentence is preserved after splitting.
• E.g., 'We have never been to Asia, nor have we visited Africa.'
• Using the algorithm, the sentence is split into X: We have never been to Asia and T: We have never visited Africa.
• Hence the meaning is preserved in the sentences after splitting.
2) The steps provided for compound sentences give a robust solution for sentences of the type 'He likes to run and swim'.
• Instead of splitting the first IC into subject, verb, and object, the first clause is used entirely, then the verb and words after the verb are combined to form the second sentence.
• This preserves the infinitives and the other adverbs and adjectives that might precede the verb.
• So the sentences split into X: He likes to run and T: He likes to swim.
• In this case, the verbs 'run' and 'swim' with the infinitive 'likes to' are appended to the subject to form the two simple sentences.
From the 501 complex sentences, 1048 simple sentences were generated by the complex algorithm. Table 4 summarizes sample outputs for different sentences.
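The verb-coordination case above ('He likes to run and swim') can be sketched as follows. This is a minimal illustration under simplifying assumptions (the coordination joins two bare verbs at the only ' and ' in the sentence), not the PASS implementation:

```python
def split_verb_coordination(sentence):
    """Use the first clause entirely; copy the shared subject and
    infinitive prefix onto the second coordinated verb."""
    sentence = sentence.rstrip(".")
    left, _, right = sentence.partition(" and ")
    if not right:
        return [sentence + "."]
    words = left.split()
    prefix = " ".join(words[:-1])          # e.g. "He likes to"
    return [left + ".", f"{prefix} {right}."]

print(split_verb_coordination("He likes to run and swim."))
# → ['He likes to run.', 'He likes to swim.']
```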
Some test cases of complex sentences simplified by the algorithm are shown below.
1) For sentences having 'that', the method not only simplifies the sentence but also retains the meaning by appending the object of the first sentence, with the required pronoun, to the second sentence.
• The complex sentence of the type: 'My mom asked me to find her sweater that was torn and required stitching.'
• This sentence is of Type IV, for which the steps give the simple sentences X: My mom asked me to find her sweater and T: Her sweater was torn and required stitching.
2) For complex sentences having a subordinating conjunction in the middle, the method simplifies the sentence by splitting off the first IC and then removing the conjunction to give the second sentence.
• The complex sentence: 'Let's go back to Shimla because there we had our first meeting.'
• The sentence is of Type V, wherein it is split into X: Let's go back to Shimla and T: There we had our first meeting.
3) For complex sentences with more than one DC, the method builds a dictionary and splits the sentence by appending the main subject to each subsequent clause.
• The complex sentence of the type: 'Teresa got Indian citizenship, spent many months in Pune to receive primary medical training at Holy Hospital, embarking into the slums.'
• The sentence is of Type VI, wherein it is split into X: Teresa got Indian citizenship, T: Teresa spent many months in Pune to receive primary medical training at Holy Hospital, and S3: Teresa embarking into the slums.
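The Type VI behaviour above can be sketched as a subject-propagation step. This is an illustrative simplification, not the PASS implementation: clause boundaries are taken to be commas, the subject is assumed to be the first word(s) of the first clause, and the tense adjustment described in the algorithm is omitted.

```python
def propagate_subject(sentence, subject_len=1):
    """Split on commas and copy the main subject of the first clause
    into each dependent clause. subject_len (a simplifying assumption)
    is how many leading words form the subject."""
    clauses = [c.strip() for c in sentence.rstrip(".").split(",")]
    subject = " ".join(clauses[0].split()[:subject_len])
    out = [clauses[0] + "."]
    for clause in clauses[1:]:
        out.append(f"{subject} {clause}.")
    return out

print(propagate_subject(
    "Teresa got Indian citizenship, spent many months in Pune, "
    "embarking into the slums."))
# → ['Teresa got Indian citizenship.', 'Teresa spent many months in Pune.',
#    'Teresa embarking into the slums.']
```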

VI. EVALUATION
PASS has been evaluated both by human experts and with automatic metrics. The manual evaluation checks the correctness and usefulness of the algorithm; the dataset output given to the evaluators is available online [50]. Automatic evaluation has been done with automatic metrics on a publicly available corpus to compare PASS with other existing systems. This section describes the basis on which the domain experts evaluated the system and the reasons for choosing the corpus and the automatic metrics.

A. HUMAN EVALUATION
The most reliable way to evaluate a simplification system is to have human experts rate its output, as stated in [3]. Therefore, PASS has been evaluated using the judgments of five linguistic experts to validate the correctness of the system. Since the research focuses on the syntactic simplification of compound and complex sentences, the human evaluators were given the simplified sentences, i.e., the results of running the Compound and Complex functions separately. The evaluators are presented with the input sentences along with the simplified sentences and are asked to rate them based on the following criteria.
• Fluency - whether the decomposed sentence is grammatical and fluent
• Simplicity - whether the decomposed sentences simplify the input sentences in terms of structural transformation, ignoring lexical simplification
• Adequacy - whether the decomposed sentences preserve the meaning
Evaluator 1 is asked to rate the output for fluency, while evaluator 2 rates simplicity. Table 6 tabulates the individual metric evaluated by each evaluator for the algorithm. The evaluation uses the 5-point Likert scale, a prominent and common technique for evaluating simplification systems [3]. The highest score on the Likert scale indicates the correctness of the sentence with respect to the metric, whereas the lowest indicates incorrectness. Table 5 shows the criteria of the 5-point Likert scale given for human evaluation. Sentences rated 3, 4, or 5 are taken to satisfy the given metric.
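Under the scoring convention above (Likert ratings of 3, 4, and 5 satisfy a metric), the per-metric satisfaction rate is a simple proportion. A sketch with hypothetical ratings:

```python
def satisfaction_rate(ratings, threshold=3):
    """Fraction of 5-point Likert ratings that satisfy a metric
    (scores of 3, 4, and 5 count as satisfying)."""
    satisfied = sum(1 for r in ratings if r >= threshold)
    return satisfied / len(ratings)

fluency = [5, 4, 3, 2, 5, 1, 4, 4]   # hypothetical ratings for 8 sentences
print(f"{satisfaction_rate(fluency):.2%}")  # → 75.00%
```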
Once the simplification is done, the system's output should be useful for other NLP tasks. One such task the approach considers, as an example supporting the usefulness criterion, is anaphora resolution, the task of resolving pronouns to their antecedents. In cases where a sentence after simplification is incorrect, there are high chances of anaphora resolution being erroneous in the simplified sentences [51]. For this, the system has been evaluated on two aspects: 1. Usefulness, and 2. Effect on anaphora resolution after simplification. The judgments of all evaluators on fluency and simplicity are considered when evaluating the usefulness of PASS. For evaluating the effect on anaphora resolution, it is very important that the meaning of the original sentence (adequacy), along with the fluency and simplicity of the output sentences, is preserved, because in the course of the simplification process some nuances of meaning may be lost. Hence, to evaluate the algorithm with respect to anaphora resolution, the judgments of all evaluators on simplicity, fluency, and adequacy are considered. Table 7 shows how the system is evaluated on the criteria of usefulness and effect on anaphora resolution. Among the 1058 simplified sentences generated from the 478 compound sentences in RandomCoCo, evaluator 1 found 944 sentences having the correct fluency. Table 8 gives the statistical details of the outputs for compound sentences for all the evaluators. Considering the fluency and simplicity judgments, the average number of sentences satisfying fluency is 943, while 967 are in their simplest form. With this, the usefulness of the system gives an accuracy of 90.26%, considering the judgments of four evaluators, as reflected in Table 9.
Similarly, based on Table 7, the effect of the simplified sentences on anaphora resolution is computed, giving an accuracy of 90.29%. Table 10 provides the statistical details for compound sentences with respect to the effect on anaphora resolution.
Among the 1048 output sentences for the 501 complex sentences in RandomCoCo, evaluator 1 found 928 sentences satisfying fluency, while 989 were in their simplest form as judged by evaluator 2. Table 11 gives the detailed analysis for each evaluator. The algorithm gives an accuracy of 91.65% for usefulness, considering the fluency and simplicity parameters from the four evaluators. Table 12 provides the statistical details for complex sentences concerning the usefulness of the algorithm. For the effect of simplification on anaphora resolution, the accuracy of the system considering all evaluators' judgments comes to 90.61%. Table 13 provides the corresponding statistical details for complex sentences.
To check the reliability between the evaluators, inter-rater reliability is tested for the system. The inter-rater agreement scores are weighted Cohen's kappa coefficients [52] computed between pairs of evaluators for each metric and type of sentence. All pairwise scores were above 0.9, indicating near-perfect agreement between the evaluators. Table 14 provides the weighted Cohen's kappa coefficients for the experiment.
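For reference, weighted Cohen's kappa between two raters on a 5-point scale can be computed from scratch as below. This is a generic sketch; the quadratic weighting scheme is an assumption, since the paper does not specify which weights were used.

```python
from collections import Counter

def weighted_kappa(r1, r2, categories=5):
    """Quadratic-weighted Cohen's kappa between two raters whose
    ratings are integers 1..categories."""
    n, k = len(r1), categories
    # Observed joint rating proportions
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[a - 1][b - 1] += 1 / n
    p1, p2 = Counter(r1), Counter(r2)
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) ** 2) / ((k - 1) ** 2)       # quadratic weights
            exp = (p1[i + 1] / n) * (p2[j + 1] / n)   # chance agreement
            num += w * obs[i][j]
            den += w * exp
    return 1 - num / den

# Perfect agreement yields kappa = 1
print(weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # → 1.0
```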
The evaluation of the system's output for both sentence types, concerning usefulness and effect on anaphora resolution, has given promising results, which suggests that PASS can be used for ASS. Moreover, the existing neural network models used for lexical simplification predict a probable word for word substitution but, unlike PASS, do not guarantee that the simplified sentences retain the same meaning. E.g., if the original sentence contains 'excellent', a superlative of 'good', there is a high chance that a neural network model will not capture this and will replace the word with 'good'. The sentence is simplified, but with a meaning that may not correlate with the meaning of the original sentence. In PASS, by contrast, the system utilizes the same words as in the original sentence, which assures that the meaning of the original sentence is retained in the simplified sentences generated by the approach.

B. AUTOMATIC EVALUATION
Though human evaluation is the most reliable method to rate a system's output, it is costly in practice and requires expert annotations, as mentioned in [3]. Therefore, to obtain a quick and standard means of evaluating the system, automatic evaluation metrics are used along with simplification corpora. Among the most popular and publicly available corpora, such as PWKP [26], TurkCorpus [53], and Newsela [54], the proposed approach has been evaluated on TurkCorpus. The motivation for using this dataset is as follows: TurkCorpus was built from sentences of the PWKP dataset having one-to-one alignments, with eight simplified references for each original sentence collected from Amazon Mechanical Turk.
Since this corpus provides one-to-one alignments, the chance of misalignment is low and no noisy data is created, so it is suitable for evaluating PASS, which also aligns one input sentence to one or N output sentences. Moreover, since the corpus has eight simplified references for each input sentence, PASS can be easily tested with it. Furthermore, the dataset comprises 2359 sentences, split into 2000 for training and 359 for testing, giving a large and diverse dataset with which to evaluate the system. Another major reason for using TurkCorpus is that it allows PASS to be compared with the other existing lexical models.
Existing approaches have been evaluated with different automatic metrics such as BLEU [55], SARI [53], and FKGL [56]. For BLEU and SARI, higher values indicate better systems, whereas for FKGL lower values are better. BLEU computes the modified n-gram precision by: (a) counting the maximum occurrences of each n-gram in the transformed sentence, (b) clipping each n-gram count by its greatest reference count, and (c) dividing the total of the clipped counts by the total unclipped count of candidate n-grams. With this computation, BLEU is a machine-translation-inspired metric that matches each n-gram in the transformed sentence independent of position. The BLEU metric is calculated with a Brevity Penalty (BP) as shown in equation 13, where p denotes the candidate translation length and q denotes the length of the reference corpus.
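The clipped-precision steps (a)-(c) and the brevity penalty of equation 13 can be sketched as follows. This is a simplified single-reference illustration, not a full BLEU implementation:

```python
import math
from collections import Counter

def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    clipped = sum(min(count, ref[g]) for g, count in cand.items())
    return clipped / max(1, sum(cand.values()))

def brevity_penalty(p, q):
    """Brevity penalty: p is the candidate length, q the reference
    length; candidates shorter than the reference are penalized."""
    return 1.0 if p > q else math.exp(1 - q / p)

# Over-generated candidate: every 'the' is clipped to the 2 in the reference.
print(modified_precision("the the the the the the the",
                         "the cat is on the mat"))  # → 0.2857...
```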
The BLEU metric, as shown in equation 14, computes the geometric average of the modified n-gram precisions u_n up to length T with positive weights x_n. As per the equation, the BLEU metric would inflate the precision of shorter sentences. BLEU correlates well with human judgments of fluency and adequacy, but its association with simplicity is weaker when sentences are split [57]. Hence this metric alone is not sufficient to evaluate ASS systems. Another widely used metric in ASS systems is System output Against References and Input sentence (SARI). This metric assesses how well the simplification system adds, deletes, or keeps words during transformation compared to the reference sentences. SARI thereby correlates with human judgments of simplicity, as mentioned in [53], and is therefore the most popular metric for ASS systems. Considering the system's output S, input sentence SI, reference sentences RS, and #f(.) as a binary indicator of the occurrence of n-gram f in a given set, the n-gram precision p(n) and recall r(n) for the add, keep, and delete operations are given in equations 15, 16, 17, 18, and 19.
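As a structural illustration of the add/keep/delete operations that SARI scores (not the full metric, which weights these sets against multiple references and several n-gram orders), the unigram operation sets can be derived as:

```python
def sari_operations(input_sent, output_sent):
    """Unigram add / keep / delete sets that SARI evaluates against
    the references (a structural sketch of the operations only)."""
    inp = set(input_sent.lower().split())
    out = set(output_sent.lower().split())
    return {
        "add": out - inp,       # words the system introduced
        "keep": out & inp,      # words the system retained
        "delete": inp - out,    # words the system removed
    }

ops = sari_operations("the quick brown fox jumps", "the fast fox jumps")
print(sorted(ops["add"]), sorted(ops["delete"]))  # → ['fast'] ['brown', 'quick']
```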
A limitation of SARI is the controlled number of transformations taken into account, so it evaluates only one-to-one paraphrased sentences [3]. Furthermore, SARI highly penalizes a system in cases where only one reference, similar to the initial sentence, is available: the model's result does not modify the initial sentence and thus receives a low score, as stated in [3]. Another popular metric for ASS systems, corresponding to a reading grade level, is the Flesch-Kincaid Grade Level (FKGL), originally developed for reading tests of Naval personnel. It is computed as shown in equation 20:

FKGL = 0.39 (total words / total sentences) + 11.8 (total syllables / total words) - 15.59    (20)

FKGL gives good scores to shorter sentences even if they are grammatically wrong or do not preserve meaning, as stated in [3]. Moreover, as mentioned in [57], this score can evaluate superficial simplicity but cannot serve as a metric for the overall comparison of ASS systems. To evaluate a system, as mentioned in [3], the individual metrics need to be used in conjunction with one another. Moreover, to correlate with human judgments of simplicity, fluency, and adequacy along with readability, PASS is evaluated on all three metrics: SARI (higher is better), BLEU (higher is better), and FKGL (lower is better). For this, the Easier Automatic Sentence Simplification Evaluation (EASSE) Python package is used, which aims at facilitating and standardizing the automatic comparison and evaluation of sentence simplification systems [58]. This package provides a single access point for evaluating system outputs. Using the EASSE package with TurkCorpus, the automatic metrics SARI, BLEU, and FKGL are computed for PASS and compared against the existing approaches. PASS has been evaluated on both the training and test sets of TurkCorpus. Moreover, the automatic evaluation of the simplification results is based on the sentences obtained after running all four functions: Compound, Complex, Compound-Complex, and Complex-Compound.
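Equation 20 can be sketched directly in Python; the syllable counter below is a crude vowel-group heuristic (an assumption for illustration; real implementations use pronunciation dictionaries). The second call illustrates the point made above: splitting into shorter sentences lowers FKGL regardless of content.

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of vowel letters (crude heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(sentences):
    """Flesch-Kincaid Grade Level over a list of sentences (Eq. 20)."""
    words = [w for s in sentences for w in re.findall(r"[A-Za-z]+", s)]
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

longer = fkgl(["The cat sat on the mat."])
shorter = fkgl(["The cat sat.", "On the mat."])   # same words, split in two
print(round(longer, 2), round(shorter, 2))  # → -1.45 -2.62
```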
In the training set, it is observed that there are more compound-complex and compound sentences than complex sentences alone. Based on this analysis, SARI values are expected to be lower for the former sets of sentences than for complex sentences, because SARI gives low scores to sentences involving structural changes, specifically sentence splitting, which are more predominant in the former sets. Moreover, PASS transforms compound sentences using more rules than complex sentences, so lower BLEU scores are expected for the former: the more the transformation, the lower the BLEU score. PASS splits both compound and complex sentences by removing words, thereby reducing sentence length; therefore the compound-complex and complex-compound algorithms are predicted to have lower FKGL scores than the complex and compound algorithms alone. The system is evaluated for each of the compound, complex, compound-complex, and complex-compound algorithms against each metric, as shown in Table 15, on the 2000-sentence training set of TurkCorpus. As expected, the predictions of lower SARI and BLEU scores for the compound, compound-complex, and complex-compound algorithms compared to complex sentences are borne out in Table 15, as is the prediction of lower FKGL scores. The prediction of lower BLEU scores for compound sentences is further confirmed by segregating only the compound sentences and evaluating each rule, as shown in Table 16.
In addition, PASS has been evaluated on the TurkCorpus test dataset. In the test dataset there are more compound-complex sentences than compound or complex sentences alone, so high SARI values are expected for the former set, as SARI yields higher values for multiple references. Similar to the training dataset, lower BLEU scores are predicted for the compound, compound-complex, and complex-compound algorithms, due to the sentence transformation applied to compound sentences compared to complex sentences in the test dataset. The FKGL score is expected to be lower for the compound-complex and complex-compound algorithms, as PASS reduces the length of these sentences more than for compound and complex sentences alone. The evaluation of PASS on the 359 sentences of the TurkCorpus test dataset with the automatic metrics is reflected in Table 17, which confirms the predictions discussed above. Similarly, the compound sentences have been evaluated based on the first rule as well as on all rules for the TurkCorpus test dataset; the automatic metrics for compound sentences are shown in Table 18, which confirms the prediction of lower BLEU scores for the approach.
PASS has also been compared with the benchmarking sentence simplification models mentioned in [3], which used the same corpus. As shown in Table 19, PASS performs better in terms of FKGL and achieves higher BLEU scores for complex sentences, with moderate SARI values compared to the existing models. This also confirms that the SARI and BLEU metrics perform poorly when sentence simplification involves sentence splitting, as in the case of compound sentences.
PASS has been human-evaluated on RandomCoCo (a private dataset), while the automatic evaluation has been done on the publicly available TurkCorpus. RandomCoCo is well suited to human evaluation because of its smaller size: the human experts can go through each sentence carefully and evaluate it based on their expertise. However, because the dataset is small, it cannot be used for automatic evaluation, as it would not give accurate results. Conversely, a large dataset like TurkCorpus provides accurate results through automatic evaluation, but due to its size it cannot practically be used for human evaluation, as it is not possible for a human expert to go through each sentence manually and check whether it has been correctly simplified. Therefore RandomCoCo has been used for human evaluation, while TurkCorpus has been used for automatic evaluation.

C. APPLICABILITY OF PASS
An advantage of PASS splitting long sentences into shorter ones is the possibility of aiding an online anaphora resolution tool: using the split sentences, a publicly available online anaphora resolution tool can potentially resolve longer sentences with better context, making the resolved sentences grammatically correct, as shown in the following example. The improvement is seen in sentences where an IC is connected by a conjunction to another object that further extends the first IC, as in: 'Manuscripts are placed together by rich people, monarchs, hermitages, and temples. They are placed in collections and stores.' Using an automatic anaphora resolution tool, the resolved output was: 'Manuscripts are placed together by rich people, monarchs, hermitages, and temples. Rich people, monarchs, hermitages, and temples are placed in collections and stores.' After using the PASS method, the original sentence is split as: 'Manuscripts are placed together by rich people, monarchs, monasteries. Manuscripts are placed together by rich people, monarchs, temples. They are placed in collections. They were placed in archives.' After anaphora resolution, the output is: 'Manuscripts were placed together by rich people, monarchs, hermitages. Manuscripts were placed in collections. Manuscripts were placed in stores.' From this example, the observation is that anaphora resolution over the simplified sentences generated by PASS can be resolved better, thereby improving the efficiency of anaphora resolution.
The possible improvement in the efficiency of an online anaphora resolution tool after simplification with PASS is potential future work. A second advantage of PASS is that, once the sentences are split into small sentences and their anaphora resolved, they can easily be used to create multiple-choice questions for assessing factual cognitive levels.

VII. CONCLUSION
Simple sentences are a prerequisite for extracting information in NLP applications. However, a given text document can include various types of compound and complex sentences that require simplification; hence the transformation of compound and complex sentences into simple sentences is essential for natural language tasks. Furthermore, such a system can be useful for people with reading disabilities and for language learners. In this direction, PASS, a syntactic sentence simplification system, has been proposed in this paper. PASS identifies the different types of compound and complex sentences through grammatical patterns and applies pattern-based rules to simplify them. The generated simple sentences can be utilized by other systems, automatic tools, or modules to assist other NLP tasks.
A limitation of PASS is that it cannot split compound sentences having semicolons along with a conjunction. In some cases, the meaning of the split sentences is not preserved because the structure of the original sentence requires human intervention. Furthermore, in the case of complex sentences, PASS cannot split a sentence containing more than one noun; currently, such sentences are transformed into simple sentences by the compound-complex algorithm, as the research provides a generalized approach to rule-based simplification. Since cases of repeated nouns are rare in complex sentences, handling them is a potential extension of the system in the future.
One major advantage of PASS over the increasingly popular neural network algorithms is that it allows granular tuning and modification as per the user's requirements. Neural network algorithms tend to be black-box models that do not let the user direct the simplification of the sentences. Syntax-based simplification, however, simplifies only those sentences that conform to the rule set, and allows further rules to be added and existing rules to be modified.
This paper makes a unique attempt to check the effect of PASS on anaphora resolution after simplification. In this direction, PASS has been evaluated by linguistic experts for its usefulness and its influence on anaphora resolution after simplification. Automatic evaluation of PASS with the automatic metrics has shown that the system gives promising results, and the comparison of PASS with existing approaches shows good improvement in terms of the metrics. One extension is to make the system handle compound sentences having multiple conjunctions and complex sentences with more than one noun, by exploring the grammatical patterns of such sentences in CFG. Another long-term objective is to examine the performance of PASS, since it depends on the spaCy and NLTK libraries of Python. The simplification results show potential benefit for anaphora resolution, but this needs further investigation, not only in anaphora resolution but also in other NLP tasks. Furthermore, future plans include checking the system's efficiency in aiding an anaphora resolution tool and generating multiple-choice questions from the simple sentences produced by PASS.