Semantic Sentence Matching Based on Multiple Parallelly Organized Interaction Layers at Various Granularity Combinations With Two-Stage Aggregation Strategy

Semantic sentence matching plays an essential role in resolving many problems in the natural language processing (NLP) field; it has gained increasing research attention and shown great improvements in recent years. However, most existing research targets English sentence matching, and research on Chinese semantic matching is comparatively scarce. Moreover, owing to the rather complicated contextual expressions and grammatical structures of the Chinese language, many existing models are still unable to effectively capture interaction information between sentences. In this work, we therefore propose a novel deep model to better address Chinese semantic sentence matching. Specifically, convolutional neural networks with various kernel sizes are first employed for multi-granular contextual encoding of sentences; combined with multiple cross-sentence alignment mechanisms, semantic interactions can then be performed more clearly and profoundly at various granularity combinations between sentences. Additionally, rather than serially stacking multiple interaction layers, we organize them in parallel, and by further introducing attention pooling, the semantically aligned sentence attentive vectors are adaptively aggregated from the perspectives of both alignment mechanisms and granularity combinations. More stable and effective sentence interactive features can thus be extracted, while the potential alignment-error propagation issue inherent in hierarchically stacked interaction structures is alleviated. Finally, extensive experiments are conducted to evaluate the performance of our model; the experimental results demonstrate that the proposed approach outperforms many state-of-the-art models on sentence matching and is capable of gaining a more accurate understanding of the semantic relationships between Chinese sentences.

Sentence matching, one of the key research problems in the NLP field, has gained increasing attention from research communities. It refers to the technique of distinguishing certain logical or semantic relationships between two sentences, and has been applied in a variety of NLP tasks (e.g., paraphrase identification, natural language inference, question answering, and information retrieval). (The associate editor coordinating the review of this manuscript and approving it for publication was Ali Shariq Imran.) In paraphrase identification, sentence matching is used to judge whether two sentences, which in some cases may differ greatly in word constitution and grammatical structure, express the same semantic meaning [1], [2]. In natural language inference, sentence matching is used to determine whether a hypothesis sentence can be reasonably inferred from a premise sentence [3], [4]. Question answering systems employ sentence matching to find, among large sets of candidate answers, the most suitable ones with the highest matching scores for a given question sentence [5], [6]. In information retrieval, sentence matching is leveraged to evaluate the relevance between a user's input and large collections of online documents, so that the most relevant documents can be retrieved based on ranking scores calculated to quantify the relationship between document pairs [7], [8]. With its wide application across NLP tasks, sentence matching has played an essential role in resolving numerous NLP problems in various areas and has achieved significant progress. However, most existing research achieving excellent performance targets English sentence matching; the number of available studies on Chinese sentence matching is much smaller. Additionally, compared with English, where spaces act as natural separators, word tokenization in Chinese is more complex: for each Chinese sentence there is often more than one way to conduct word tokenization, and the accuracy of word segmentation greatly affects model performance. Also, English has only 26 letters as its smallest language units, whereas Chinese has far more characters, and Chinese characters carry more information than English ones, so it is necessary to retain such character-level information when performing semantic matching between Chinese sentences. Moreover, the grammatical structure and contextual expression of Chinese sentences are more complicated than those of English sentences, making semantic matching between Chinese sentences generally more difficult, and many previous models with excellent performance on English tasks may not achieve equally good results when directly transferred to Chinese sentence matching. Most currently available Chinese sentence matching models are still unable to effectively capture the semantic associations between sentences and still face many challenging issues that need to be addressed in future studies.

B. RESEARCH CHALLENGES
Although sentence matching research has made significant strides in recent years and has been widely applied in a variety of NLP tasks, there are still some major issues that need to be resolved, summarized as follows.
First, it has been shown by many prior studies that the quality of the interactive features captured by a model has a significant influence on the final matching result [9], [10], [11], [12]. To effectively extract such cross-sentence features, words or phrases with closer relevance between sentences should generally be assigned higher relational weights than ordinary ones, which is usually implemented via a cross-attention alignment mechanism that explicitly distinguishes the importance levels of language units (e.g., words, phrases, or sentences) between sentences [13], [14]. However, most currently existing Chinese sentence matching models, despite being equipped with cross-attention mechanisms, are still unable to effectively extract such cross-sentence features for proper semantic relation inference, necessitating a more effective use of the attention mechanism.
Secondly, with the wide application of attention mechanisms in sentence matching models in recent years, it is not uncommon to find studies addressing this issue by stacking multiple cross-attention layers so that implicit interactive features can be extracted more profoundly [15], [16]. However, this interaction-layer-stacking strategy has some drawbacks. On one hand, the aligned information between sentences captured at lower layers may become polluted by potential errors while being propagated to higher layers; once the interactive features retrieved at a low layer contain major errors, the erroneous information may persist in the network and be propagated all the way up, causing the alignment information at the top layer to be less distinctive for final sentence matching. On the other hand, because the interactive representation of one sentence is obtained by dynamically attending to the information of the other, the interactive representations obtained at intermediate layers may vary considerably between layers, making it hard to capture stable and effective semantic alignment information between sentences [9], [15].
Thirdly, given the advantages of multi-grained semantic information, which has shown great potential for improving sentence matching accuracy, many studies utilize CNNs (Convolutional Neural Networks) to encode sentences at multiple granularities. However, upon acquiring such multi-granular word representations, many of them tend to fuse the representations, either explicitly or implicitly, in the relatively low layers of the model [16], [17], which makes it difficult to generate clear interactive signals between sentences at different granularities and to truly benefit from such multi-granular comparison information at the higher matching layers. For synonymous phrases of different word lengths, it would still be hard to make an appropriate matching between them.

C. MAIN CONTRIBUTIONS
In this work, to tackle the above challenges, we propose a novel deep model, aiming to provide a reference for related research on Chinese semantic sentence matching. The main contributions of this paper can be summarized as follows: • We propose a novel model, namely Multi-Granular Chinese Semantic Sentence Matching Based on Multiple Parallelly Organized Interaction Layers (MGCMPI), in an effort to better address Chinese semantic sentence matching, with particular emphasis on more sufficient sentence interaction by fully exploiting the characteristics of CNNs as well as multiple cross-attention alignment mechanisms.
• With the introduction of multi-granular sentence encoding combined with multiple cross-attention alignment mechanisms, our proposed model is able to generate clear and rather distinctive semantic interactive features between sentences at various granularity combinations, thus gaining a profound understanding of the semantic relations between Chinese sentences.
• Instead of serially stacking multiple interaction layers, we organize them in parallel, coupled with attention pooling to adaptively aggregate sentence attentive vectors from the aspects of both alignment mechanisms and granularity combinations, so as to obtain more stable and accurate sentence alignment information.
The remainder of this paper is organized as follows. We introduce related work on sentence matching in Sec. II. In Sec. III, we elaborate the details of the proposed model. Sec. IV presents experiments evaluating and comparing the performance of our model with some state-of-the-art models. Finally, we conclude this work and outline future directions in Sec. V.

II. RELATED WORK
Sentence matching plays an essential role in solving many NLP tasks, and significant progress has been achieved previously. In the early research stage, most methods focused on matching vocabulary, phrases, grammar, and syntax between sentences [18], [19], [20], [21]. However, most of these traditional methods rely heavily on specific hand-crafted features, which can lead to weak generalization ability and thus poor performance when applied to other tasks. Afterward, with the increasingly remarkable results achieved by deep learning in various NLP tasks, as well as the emergence of large-scale annotated sentence matching datasets such as ATEC, Quora, and WikiQA, many researchers have focused on applying deep learning to sentence matching and have obtained excellent results on a variety of tasks. Generally, according to their learning strategies, deep-learning-based sentence matching approaches can be categorized into two groups, namely sentence-encoding-based models and matching-aggregation-based ones, with the former emphasizing the distinction of semantic differences between sentences and the latter paying more attention to capturing semantic interactive information.

A. SENTENCE-ENCODING BASED METHODS
Sentence-encoding-based models usually adopt the form of a Siamese network and utilize either a CNN or an RNN (Recurrent Neural Network) as the basic underlying encoder to separately encode each sentence into a fixed-size vector representation in a parameter-shared manner; the semantic similarity can then be directly calculated from the two sentence vectors by a pre-defined measurement function [22], [23], [24], [25]. For example, Mueller et al. proposed a sentence matching model based on a Siamese LSTM (Long Short-Term Memory) network, which passes word-embedding vectors supplemented with synonym information through the LSTM network, encodes the underlying meaning of each sentence into a fixed-size vector, and finally calculates the sentence similarity score with a simple Manhattan metric [26]. Yu et al. put forward a method that models sentences in a way similar to image processing: it first constructs multi-dimensional feature maps for both sentences, then employs a CNN to capture the deeper semantic features contained in each sentence, and finally uses another CNN to capture the interactive matching features between sentences [27]. For models of this paradigm, despite their relatively small model size due to the parameter-shared architecture, one major weakness is the absence of interactive information, which many subsequent works have proven quite essential for improving overall matching performance. Such late-interaction models, which make cross-sentence comparisons at rather high layers close to the final prediction layer, are unable to effectively extract deeply implicit semantic relationships between sentences, resulting in relatively unsatisfactory results on many sentence matching tasks.

B. MATCHING-AGGREGATION BASED METHODS
Given the shortcomings of sentence-encoding-based models, matching-aggregation-based ones have been proposed in response. For this group of models, thanks to a matching procedure conducted from rather low layers, they are more capable of capturing word-level interactive information, which is further aggregated at higher layers for final sentence matching [25], [28], [29], [30]. For example, Hu et al. proposed the ARC-II model, which utilizes a CNN and max pooling to extract interactive features after obtaining the interaction matrix of two sentences [29]. Wan et al. put forward a model to match two sentences with multiple positional sentence representations; by aggregating the interactions between different positional sentence representations and employing k-max pooling, the semantic relationship between sentences can be well captured by its design [30]. For a more effective extraction of sentence interactive features, many recent studies have widely applied the cross-attention mechanism for more accurate sentence alignment, usually by stacking multiple such layers so that more in-depth interaction information can be exploited [15], [31], [32], [33]. For instance, Chen et al. utilized two Bi-LSTMs (Bi-directional Long Short-Term Memory networks) in their model, the first to encode sentence semantic information and the second to aggregate the semantic information and the sentence alignment information extracted by the attention mechanism, so that more effective sentence representation vectors can be obtained for final semantic matching [3]. Yu et al. tried to exploit in-depth interaction information by iterating the cross-sentence alignment procedure multiple times; with the application of multi-perspective pooling, the semantic relationship between sentences can be well determined [16]. Most models built in this paradigm have achieved significant performance improvements over sentence-encoding-based ones by considering not only the semantic information of each single sentence but also the rich patterns of interactive relatedness between sentences [29], [30], [34], [35].

C. SUMMARY
Although many prior studies have achieved good results on sentence matching, most of them only perform sentence interaction at an identical granularity for the two sentences, and no clear semantic matching signal is generated between sentences at different granularity combinations. Moreover, models built by stacking multiple interaction layers may suffer from potential alignment-error propagation among the interactive layers, leading to relatively unstable and low performance in some cases. Hence, in this work, we propose a novel deep model for Chinese semantic sentence matching by introducing multiple parallelly organized cross-attention layers, with each sentence contextually encoded at various granularities. By representing each sentence at multiple granularities without combining them in the lower layers, together with the introduction of different cross-sentence alignment mechanisms, semantic interactions can be performed more clearly and sufficiently between sentences. Furthermore, the parallel structure of the interaction layers is expected to alleviate the potential alignment-error propagation among multiple serially stacked interactive layers, which is verified in the experiment section.

III. METHOD
The architecture of our proposed MGCMPI model is shown in Fig. 1. As can be seen, MGCMPI consists of five components: the input embedding layer, the contextual encoding layer, the interaction & aggregation layer, the fusion layer, and the prediction layer.
Our approach processes sentence pairs symmetrically before the final prediction layer in a parameter-shared manner. Firstly, the input embedding layer represents the tokens in the two sentences as high-dimensional dense word vectors, which are then passed through the contextual encoding layer, where the embedding vectors are encoded by multiple convolution operations with various kernel sizes so that each sentence can be represented at multiple levels of granularity. Meanwhile, with the introduction of a distance-aware self-attention module above the CNN-based encoding layer, the originally acquired convolutional contextual representation can be further enhanced by incorporating distance-aware long-dependency information. Next, the enhanced contextual representations are fed into the interaction & aggregation layer for in-depth capture of sentence interactive features via multiple parallelly organized cross-sentence interaction layers, together with a two-stage aggregation strategy that aggregates the intermediate attentive representation vectors from the aspects of both alignment mechanisms and granularity combinations. Subsequently, in the fusion layer, we combine the newly obtained aggregated interactive representation with the originally encoded representation of each word, aiming to retain not only the interactive features between sentences but also the low-level contextual semantics of each single sentence for the ultimate sentence matching. Finally, in the prediction layer, the fused word representation vectors of the sentences, first pooled to a fixed length, are passed through an MLP (Multi-Layer Perceptron) to obtain the final matching result between the two sentences.
Suppose we have two Chinese sentences P and Q; our goal is to correctly predict the semantic relation between them, labeled as y ∈ {0, 1}, where 1 indicates they have similar semantic meanings and 0 means they differ. Specifically, our task is to learn P(y | P, Q), which can then be used to infer the semantic relation between sentence pairs via y* = argmax_{y ∈ {0,1}} P(y | P, Q).

A. INPUT SENTENCE EMBEDDING
Firstly, we need to map the words in sentences to high-dimensional dense vectors. Since our model is designed particularly for Chinese sentence matching, unlike most models put forward for English, we employ not only word-level but also character-level representations, in consideration of the richer semantic information implied in single Chinese characters. Specifically, the word-level embedding is obtained from pre-trained word embeddings such as GloVe or Word2Vec, trained in advance on a large-scale open corpus. The character-level embedding is obtained by a one-dimensional convolution over each Chinese word, with each single character initialized to a random numerical vector, followed by max pooling to embed each Chinese word at the character level. By considering both the word-level and character-level representations during embedding, our model can not only exploit the richer semantic information of Chinese sentences but also mitigate the out-of-vocabulary (OOV) problem. Additionally, we follow [16] in adding an exact-matching signal indicating whether the same word exists in the other sentence. Therefore, the final embedding vector of each word is obtained by concatenating the word-level, character-level, and exact-matching vectors, denoted as H_p = (h_p^1, h_p^2, ..., h_p^{l_p}) ∈ R^{l_p × d} and H_q = (h_q^1, h_q^2, ..., h_q^{l_q}) ∈ R^{l_q × d} for sentences P and Q, respectively, where l_p (or l_q) indicates the length of sentence P (or Q), and d is the dimension of the final word embedding, equal to the sum of the dimensions of the three types of vectors contained in each embedding vector.
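As an illustrative sketch of the embedding step described above (all sizes, parameter names, and the tanh activation are our own assumptions, not taken from the paper's released implementation), each word vector can be built as the concatenation of a pre-trained word vector, a character-level vector from a 1-D convolution plus max pooling, and a binary exact-match flag:

```python
import numpy as np

rng = np.random.default_rng(0)

def char_level_embedding(char_vecs, conv_w, conv_b, k=2):
    """1-D convolution (window size k) over character vectors, then max pooling."""
    n_chars, _ = char_vecs.shape
    windows = [char_vecs[i:i + k].ravel() for i in range(n_chars - k + 1)]
    conv_out = np.tanh(np.stack(windows) @ conv_w + conv_b)  # (n_windows, n_filters)
    return conv_out.max(axis=0)  # max pooling over positions -> (n_filters,)

def embed_word(word_vec, char_vecs, conv_w, conv_b, exact_match):
    """Concatenate word-level, character-level, and exact-match components."""
    char_vec = char_level_embedding(char_vecs, conv_w, conv_b)
    return np.concatenate([word_vec, char_vec, [float(exact_match)]])

# illustrative dimensions (assumptions): 6-d word vectors, 4-d char vectors,
# 5 convolutional filters, character window of 2
d_word, d_char, n_filters, k = 6, 4, 5, 2
conv_w = rng.normal(size=(k * d_char, n_filters))
conv_b = np.zeros(n_filters)

word_vec = rng.normal(size=d_word)        # stands in for a pre-trained embedding
char_vecs = rng.normal(size=(3, d_char))  # 3 randomly initialized character vectors
h = embed_word(word_vec, char_vecs, conv_w, conv_b, exact_match=True)
print(h.shape)  # (d_word + n_filters + 1,) = (12,)
```

The final dimension d is simply the sum of the three component dimensions, matching the concatenation described in the text.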

B. SENTENCE CONTEXTUAL ENCODING
After sentence embedding, in order to encode local contextual information into the representation of each word, we consider a CNN more suitable as the basic encoder, since multiple convolutional kernels with various filter sizes can be employed to represent each sentence at multiple levels of granularity. Taking the encoding of sentence P as an example, T_p^{i,k} denotes the convolutional contextual representation of the i-th word in sentence P with convolutional window size k, and Conv_k represents the convolution with filter size k, which can be flexibly set to different values so that each word is contextually encoded at multiple granularities. To ensure that all the convolutional representations of each word can be fed into the same subsequent layer, we apply the same number of filters to each convolutional network, so that all convolutional representations live in the same space with the same dimensionality. In addition, although the CNN encoding above captures contextual information to a certain extent, the features it captures are mostly local semantics within a relatively small word span; CNNs are still unable to capture complex long-distance dependency information. Hence, to capture such global contextual information, we follow the method in [15] and enhance the contextual representation of each word with a distance-aware self-attention mechanism. Different from the traditional self-attention mechanism, this approach further takes the distance difference into account when calculating the similarity between word pairs. By adding the distance-aware self-attention module above the convolutional layer, the contextual information of each word can be represented more accurately, considering not only the local contextual semantics but also the global semantic dependencies across different word distances. Similarly, the enhanced contextual representation of the j-th word in sentence Q with convolutional filter size k′ can be obtained and denoted as T_q^{j,k′}.
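The multi-granular convolutional encoding above can be sketched as follows (a minimal illustration under assumed sizes; padding, activation, and parameter shapes are our own choices, and the distance-aware self-attention enhancement of [15] is omitted): one "same"-padded 1-D convolution per kernel size, all sharing the same filter count so that every granularity lands in the same d-dimensional space.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d_same(H, W, b, k):
    """'Same'-padded 1-D convolution over a (length, d_in) sentence matrix."""
    l, d_in = H.shape
    pad = k // 2
    Hp = np.vstack([np.zeros((pad, d_in)), H, np.zeros((k - 1 - pad, d_in))])
    windows = np.stack([Hp[i:i + k].ravel() for i in range(l)])
    return np.tanh(windows @ W + b)  # (l, n_filters)

# illustrative dimensions (assumptions)
d_emb, n_filters, l_p = 8, 6, 5
H_p = rng.normal(size=(l_p, d_emb))  # embedded sentence P

# one convolution per granularity; an identical filter count gives every
# granularity the same output dimensionality, as the text requires
T_p = {k: conv1d_same(H_p, rng.normal(size=(k * d_emb, n_filters)),
                      np.zeros(n_filters), k)
       for k in (1, 2, 3)}

assert all(T.shape == (l_p, n_filters) for T in T_p.values())
```

Each entry T_p[k] plays the role of the granularity-k contextual representation T_p^{·,k} of sentence P.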

C. SENTENCE INTERACTION AND TWO-STAGE AGGREGATION
1) INTUITION FOR DESIGN OF SEMANTIC INTERACTION MECHANISM IN OUR MODEL
To exploit the semantic relatedness between sentences more thoroughly, multiple cross-attention layers built on various soft alignment functions are employed in our model. Rather than taking the common strategy of hierarchically stacking multiple cross-attention layers in a serial manner, our model organizes them in parallel, intending to alleviate the potentially larger sentence alignment errors caused by the gradual increase of interaction layers in many attention-layer-stacked models, an issue raised by much recent research on sentence matching. Still, such a strategy of parallelly organizing cross-attention layers would inevitably lose some important sentence alignment information that a serially organized interaction structure can capture. However, through extensive study and investigation, it can be found that what serially organized cross-attention layers capture is multi-grained comparison information obtained by the gradual increase of interaction depth, with higher layers capturing alignment information at larger granularities between sentences, and vice versa. Therefore, to compensate for the absence of such multi-grained semantic alignment information in the parallelly organized interaction structure of our model, we represent each word at multiple granularities in the encoding layer, so that sentence interaction can still be carried out at multiple levels of granularity. Moreover, by encoding each word at different granularities without mixing them, our model can explicitly generate rather clear interactive features between sentences at different granularity combinations. Finally, through the two-stage aggregation strategy that successively integrates the interactive representations of each word from the perspectives of alignment mechanisms and granularity combinations, more reasonable and distinctive interactive features can be obtained for final sentence matching.
As shown in Fig. 2, the contextual representation vectors, denoted as T_p^k and T_q^{k′} for sentences P and Q respectively, are passed through the interaction & aggregation layer, where the deep semantic alignment between sentences as well as the two-stage aggregation operation is performed to obtain the final interactive representation vectors, denoted as V_p and V_q respectively.

2) MULTI-GRANULAR INTERACTION BASED ON MULTIPLE ALIGNMENT MECHANISMS
Specifically, in order to interact the two sentences more fully and capture the rather complex semantic correlations between Chinese sentences, we employ multiple different alignment mechanisms to build this module, with each of them applied in one cross-attention layer. Referring to [36], which conducts a comprehensive study of the various alignment functions used in attention techniques, we choose the three most representative ones for our module, namely the Bilinear, Concat, and Minus attention functions, where v_a and v_b are two contextual vectors to be aligned and CrossAtt_f denotes the cross-attention function of type f, with f ∈ F = {b, c, m} representing Bilinear, Concat, and Minus attention, respectively. Note that, despite the hierarchically stacked appearance of the cross-attention layers in Fig. 2 (a consequence of page layout), they are actually organized in parallel without serial connections between layers, intended to alleviate the potential alignment-error propagation issue. To elaborate the interactive mechanism of this module, we take the calculation of the i-th word's interactive representation vector in sentence P as an example; in the reverse direction, the interactive representation of each word in sentence Q can be obtained in a similar way, which is not detailed here. Suppose we have the word contextual representation vectors T_p^k and T_q^{k′} of the two sentences; we first attend the contextual representation of this word in sentence P to those of the words at all positions in sentence Q using the cross-attention alignment functions given by formulas (5)-(7). The superscripts (k, k′) ∈ T = {(1, 1), (1, 2), ..., (t, t′)}, the set of all possible convolutional filter-size combinations between the sentences, with t and t′ indicating the maximum filter sizes employed for the contextual encoding of sentences P and Q, respectively. By dynamically updating the learnable parameters contained in the cross-attention functions, more reasonable relational weights can be acquired for the calculation of the attentive representation vectors. It is worth mentioning that, as can be seen from formulas (8)-(9), by specifying different values of f and (k, k′) from the sets F and T respectively, various attentive representation vectors for each word under different alignment mechanisms and granularity combinations can be obtained, which means that, by our model design, multi-granular semantic interaction is explicitly and comprehensively carried out between sentences encoded at different granularities.
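A hedged sketch of the three alignment functions follows; since the paper's exact formulas (5)-(9) are not reproduced here, the parameterizations below are common textbook forms of Bilinear, Concat (additive), and Minus attention and should be read as assumptions. Each function scores every (i, j) word pair, and a softmax over sentence Q yields the attentive vector of each word in P:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4  # illustrative contextual dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bilinear_scores(Tp, Tq, Wb):
    """Bilinear: score(v_a, v_b) = v_a^T W_b v_b."""
    return Tp @ Wb @ Tq.T                       # (l_p, l_q)

def concat_scores(Tp, Tq, Wc, v):
    """Concat (additive): score = v^T tanh(W_c [v_a; v_b])."""
    pairs = np.concatenate([np.repeat(Tp, len(Tq), 0),
                            np.tile(Tq, (len(Tp), 1))], axis=1)
    return (np.tanh(pairs @ Wc) @ v).reshape(len(Tp), len(Tq))

def minus_scores(Tp, Tq, Wm, v):
    """Minus: score = v^T tanh(W_m (v_a - v_b))."""
    diff = Tp[:, None, :] - Tq[None, :, :]      # (l_p, l_q, d)
    return np.tanh(diff @ Wm) @ v

def cross_attend(scores, Tq):
    """Softmax over Q positions, then weighted sum -> attentive vectors for P."""
    return softmax(scores, axis=1) @ Tq

Tp, Tq = rng.normal(size=(3, d)), rng.normal(size=(5, d))
Wb = rng.normal(size=(d, d))
Wc, Wm = rng.normal(size=(2 * d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

attended = {f: cross_attend(s, Tq) for f, s in [
    ("b", bilinear_scores(Tp, Tq, Wb)),
    ("c", concat_scores(Tp, Tq, Wc, v)),
    ("m", minus_scores(Tp, Tq, Wm, v))]}
assert all(a.shape == (3, d) for a in attended.values())
```

Running the three scorers at every granularity pair (k, k′) ∈ T reproduces the full grid of attentive vectors that the two-stage aggregation then consumes.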

3) TWO-STAGE AGGREGATIONS
Subsequently, based on the attentive representation vectors obtained, the two-stage aggregation successively aggregates them from the aspects of alignment mechanisms and granularity combinations by means of the attention pooling technique [37], which ensures that the semantic interactive information under all cross-attention functions at all granularity combinations is fully taken into account when generating the final interactive representation vector. Formulas (10)-(11) perform the stage-1 aggregation from the aspect of alignment mechanisms. Next, formulas (12)-(13) perform the subsequent stage-2 aggregation from the aspect of granularity combinations: by employing attention pooling again and assigning different attention weights to the stage-1 aggregated representation vectors, the semantic alignment information under the various granularity combinations can be aggregated for the sentence pair, so that a more reasonable and effective interactive representation vector, denoted as V_p^i, can be obtained. The attention scores acquired in formulas (10) and (12) are intermediate values calculated before the specific aggregation operations to indicate the different importance of each word representation vector, W_i (i = 1, 2, 3, 4) are learnable parameters, and Act is the activation function. Similarly, the interactive representation vector of the j-th word in sentence Q can be derived and denoted as V_q^j. The computation of the interactive representation vectors for a sentence pair is illustrated in Algorithm 1.
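The two-stage aggregation can be sketched as below (a minimal illustration: the attention-pooling scorer's form, the activation, and all sizes are assumptions standing in for formulas (10)-(13)). Stage 1 pools each word's attentive vectors over the three alignment mechanisms; stage 2 pools the stage-1 results over all granularity combinations:

```python
import numpy as np

rng = np.random.default_rng(3)
d, l_p = 4, 3                         # illustrative dimensions
F = ("b", "c", "m")                   # alignment mechanisms
T_set = [(1, 1), (1, 2), (2, 2)]      # granularity combinations (illustrative)

def attention_pool(vecs, W, w):
    """Weighted sum of a stack of vectors with learned attention weights."""
    scores = np.tanh(vecs @ W) @ w            # (n,) importance scores
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ vecs                       # (d,)

W1, w1 = rng.normal(size=(d, d)), rng.normal(size=d)  # stage-1 parameters
W2, w2 = rng.normal(size=(d, d)), rng.normal(size=d)  # stage-2 parameters

# attentive vectors: one (l_p, d) matrix per (granularity pair, mechanism)
A = {kk: {f: rng.normal(size=(l_p, d)) for f in F} for kk in T_set}

V_p = np.zeros((l_p, d))
for i in range(l_p):
    # stage 1: aggregate over alignment mechanisms for each granularity pair
    stage1 = np.stack([attention_pool(np.stack([A[kk][f][i] for f in F]), W1, w1)
                       for kk in T_set])
    # stage 2: aggregate the stage-1 vectors over granularity combinations
    V_p[i] = attention_pool(stage1, W2, w2)
assert V_p.shape == (l_p, d)
```

Because both stages use learned attention weights rather than fixed averaging, the model can adaptively emphasize whichever mechanism or granularity pair is most informative for each word.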

D. FUSION OF INTERACTIVE AND ORIGINAL REPRESENTATIONS
To retain the original semantic information of each single sentence, we fuse the original contextual representation vector and the interactive representation vector of each word. However, because each word has multiple contextual representations at various granularities, we first average all the contextual vectors of each word into a single vector representation T_p^i, which is then concatenated with the interactive representation vector to obtain a new fused representation vector C_p^i. Besides, to further synthesize the local contextual information and the interactive information at the global sentence level, in addition to the basic fusion operation above, we perform a self-alignment operation within each sentence by applying the traditional self-attention mechanism [38] over the newly obtained fused representation vectors.
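The fusion step can be sketched as follows (illustrative sizes; the self-attention here is the plain dot-product form, which we assume as a stand-in for the traditional mechanism of [38]): average the multi-granular contextual vectors of each word, concatenate with its interactive vector, then self-align within the sentence over the fused vectors.

```python
import numpy as np

rng = np.random.default_rng(5)
l_p, d, n_k = 4, 3, 3  # sentence length, dimension, number of granularities

T_p = rng.normal(size=(n_k, l_p, d))  # contextual vectors at n_k granularities
V_p = rng.normal(size=(l_p, d))       # interactive vectors from the aggregation

T_avg = T_p.mean(axis=0)                    # average over granularities -> (l_p, d)
C_p = np.concatenate([T_avg, V_p], axis=1)  # fused vectors -> (l_p, 2d)

# vanilla dot-product self-attention over the fused vectors (self-alignment)
scores = C_p @ C_p.T
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)
C_p_aligned = alpha @ C_p
assert C_p_aligned.shape == (l_p, 2 * d)
```

The concatenation keeps both the low-level contextual semantics (T_avg) and the cross-sentence interactive signal (V_p) available to the prediction layer.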
VOLUME 11, 2023

Algorithm 1 Semantic Interaction and Aggregation
Input: T_p^k and T_q^{k′} (contextual encodings of sentences P and Q)
Output: V_p and V_q (interactive representations of sentences P and Q)
1. for each granularity combination (k, k′) ∈ T do
2.   compute the attentive representation vectors under each alignment function f ∈ F
3. Aggregate the attentive representation vectors from the aspect of alignment mechanisms via attention pooling (stage 1)
4. Aggregate the newly obtained representation vectors in step 3 from the aspect of granularity combinations, and obtain the interactive representation vectors V_p and V_q
5. finish

E. SEMANTIC RELATIONSHIP PREDICTION
Given the outputs of the fusion layer, it is necessary to first convert them to a fixed length before feeding them into an MLP classifier for the final matching prediction. Specifically, we apply both average and max pooling to the fused representation vectors; all the pooled vectors are then concatenated to form the final input vector, which is passed through a feed-forward neural network for the overall matching prediction between the sentences. Formulas (17) and (18) perform the pooling over the representation vectors of sentences P and Q, respectively; G in formula (19) denotes the concatenation of all pooled vectors, while formula (20) details the feed-forward neural network, where W_5 (b_5) and W_6 (b_6) represent the weight (bias) of each layer and F is the final activation function used to estimate the matching probability.
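A sketch of the prediction layer, under assumed layer sizes and with tanh/softmax standing in for the unspecified activations of formulas (17)-(20): average- and max-pool the fused word vectors of each sentence, concatenate the four pooled vectors into G, and feed G through a small two-layer feed-forward network ending in a softmax.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6  # illustrative fused dimension

def pool(C):
    """Average and max pooling over word positions, concatenated."""
    return np.concatenate([C.mean(axis=0), C.max(axis=0)])   # (2d,)

def predict(C_p, C_q, W5, b5, W6, b6):
    G = np.concatenate([pool(C_p), pool(C_q)])               # fixed-length input (4d,)
    hidden = np.tanh(G @ W5 + b5)
    logits = hidden @ W6 + b6
    e = np.exp(logits - logits.max())
    return e / e.sum()                                       # estimate of P(y | P, Q)

# fused representations of two sentences of different lengths
C_p, C_q = rng.normal(size=(5, d)), rng.normal(size=(7, d))
W5, b5 = rng.normal(size=(4 * d, 8)), np.zeros(8)
W6, b6 = rng.normal(size=(8, 2)), np.zeros(2)

probs = predict(C_p, C_q, W5, b5, W6, b6)
assert probs.shape == (2,) and abs(probs.sum() - 1.0) < 1e-9
```

Pooling makes the input length-independent, so sentences of any length map to the same fixed-size vector G before classification.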

A. DATASET
The following experiments are carried out on two Chinese datasets: the ATEC and CCKS datasets. A brief introduction of each is as follows:
1) ATEC

B. IMPLEMENTATION DETAILS
In our model implementation, we employ the segmentation tool of Sun et al. [39] to split sentences into word units. The word-level embedding is then initialized with 300-dimensional Word2Vec vectors pre-trained on a large Chinese corpus, while the character-level embedding is initialized randomly for each character. To enable the subsequent convolutional operation over characters, the maximum character length of each word is set to 4, and the number of filters in each convolutional operation is set to 128, which equals the final dimension of the character-level embedding. And the filter number employed by the convolutional operation for

1) BASELINE MODELS
We compare our model with both sentence-encoding based and matching-aggregation based models; brief introductions are as follows.
(1) Sentence-encoding based models
Siamese-LSTM [40] obtains the forward and backward sentence vectors of two sentences by passing them through the same LSTM unit, and then feeds the two sentence representations to a softmax layer for classification.
Text-CNN [41] represents each sentence as an embedding matrix, from which local n-gram features are extracted by convolutional kernels with different filter sizes, based on which the final matching classification result is obtained.
L.D.C. [42] decomposes each word vector into a semantically similar part and a dissimilar part according to a previously calculated semantic matching vector; a two-channel CNN is then utilized to predict the similarity between sentences.
(2) Matching-aggregation based models
Match-Pyramid [35] first utilizes a similarity measurement function to calculate the interaction matrix between two texts, then employs CNNs and max-pooling to obtain the final sentence matching result.
ABCNN [43] integrates attention into CNNs through three designed mechanisms for sentence pair modeling; it combines the original sentence feature map with the newly obtained attention-based feature map to achieve good matching performance.
Decomposable attention [44] uses the attention mechanism to decompose the text matching problem into multiple subproblems that can be solved separately, and achieves good results without relying on any word-order information.
ESIM [3] is a typical model built on the matching-aggregation framework; it employs Bi-LSTMs combined with a cross-attention mechanism, and with additional aggregation in its higher layers, the final matching result is further improved.
DIIN [14] extracts semantic interactive features of sentence pairs hierarchically from the interaction space, and is an instance of a new type of neural network structure, the Interactive Inference Network (IIN).
RE2 [45] is a sentence matching model which identifies three key features for semantic alignment between sentences; with an enhanced residual connection, richer interactive features can be captured for the final matching.

2) PERFORMANCE COMPARISON AND RESULTS ANALYSIS
Table 3 shows the comparison with many existing sentence matching models, most of which have achieved excellent performance on English datasets. We reproduce these models from the relevant papers or open-source code, then apply them to our benchmark datasets and tune their parameters so that each achieves its best performance, ensuring a fair comparison with our model. Note that none of the baseline models utilizes any additional manual features or external knowledge, which is consistent with our model design.
It can be clearly seen that, as expected, most matching-aggregation based models outperform the representation-based ones on both datasets in terms of both accuracy and F1 score. Moreover, against both sentence-encoding based and matching-aggregation based baselines, our proposed MGCMPI model achieves the best performance on both the ATEC and CCKS datasets. Specifically, MGCMPI obtains an accuracy of 78.54% and an F1 score of 57.06% on the ATEC test set, a gain of 0.83% in accuracy and 1.74% in F1 over the best matching-aggregation baseline RE2, whose test accuracy and F1 score are 77.71% and 55.32%, respectively. It also outperforms several other popular matching-aggregation based models, including Match-Pyramid, ABCNN, Decomposable-Attention, ESIM and DIIN, among which ESIM and DIIN have previously achieved excellent results on many English sentence matching tasks. Furthermore, the improvement over the sentence-encoding based models is considerable: compared with the L.D.C. model, which obtains the highest matching result among those, our model achieves increases of 8.78% and 15.99% in test accuracy and F1 score, respectively, on the ATEC dataset. On the CCKS dataset, our proposed model also yields the best matching result among all baselines, with the test accuracy reaching 88.53%, an improvement of 0.85% and 1.82%, respectively, over the two best baseline results obtained by RE2 and DIIN. Thus, despite the excellent performance of ESIM, DIIN and RE2, our proposed model still outperforms them by a clear margin in terms of both accuracy and F1 score on both datasets.
To intuitively observe the training process of our proposed MGCMPI, we record the model performance on the test data of ATEC and CCKS. Fig. 3 shows the test accuracy and loss of MGCMPI. It can be observed that as the number of epochs increases, both the accuracy and the loss curves eventually converge well. Specifically, our model achieves its best matching performance on the ATEC dataset at the 30th training epoch, and on CCKS at the 33rd.
Confusion matrices of the sentence matching predictions are provided for a more detailed comparison. Fig. 4 depicts the confusion matrices of our proposed MGCMPI and the two best-performing matching-aggregation based models (i.e., RE2 and DIIN). It can be observed that MGCMPI outperforms the other two approaches in classifying both synonymous and non-synonymous sentence pairs. Specifically, MGCMPI achieves 95% classification and 5% misclassification accuracy for the synonymous class on the CCKS test set, and 82% classification and 18% misclassification accuracy for the non-synonymous class. The DIIN model behaves similarly to RE2, with 92% classification and 8% misclassification accuracy for the synonymous class, and 81% classification and 19% misclassification accuracy for the non-synonymous class. Compared with CCKS, the ATEC dataset has an uneven distribution of positive and negative samples, so the confusion matrices obtained on ATEC differ considerably from those on CCKS. As can be observed from Fig. 4(d)-(f), in all three approaches the classification accuracy on negative samples is higher than that on positive samples. Our proposed MGCMPI still outperforms the other two on the ATEC test set, with 65% classification and 35% misclassification accuracy for positive samples, and 81% classification and 19% misclassification accuracy for negative samples.

1) ABLATION STUDY
To investigate the effectiveness of each component of our model, we conduct ablation studies on both the ATEC and CCKS validation sets, adopting matching accuracy as the evaluation metric for the experiments in this part. As shown in Table 4, whichever module is removed or simplified from the base model MGCMPI, some degree of performance decline is always observed, which demonstrates the importance of each component of our model.
First, we compare with an ablation baseline whose multi-granular encoding is changed to the common single-granular type, i.e., a granularity size of one is applied to both sentences and the stage-2 aggregation is removed accordingly; the matching performance then drops from 87.62% and 95.78% to 86.97% and 95.36% on the ATEC and CCKS datasets, respectively. Moreover, the matching result degrades to 87.31% and 95.49% on the two datasets when the multiple alignment mechanisms are replaced with a single one, i.e., only the cross-attention layer built over the most representative Concat alignment mechanism is kept, the other two are removed, and the corresponding stage-1 aggregation procedure is removed as well. Additionally, the model suffers an even larger decline when both the multi-granular encoding and the multi-alignment mechanisms are reduced to their single-paradigm counterparts, with accuracy decreases of 0.89% and 0.70% on ATEC and CCKS, respectively. We also compare with another ablation baseline built on our original model but changed to encode each sentence at a single granularity size of one and to stack three interaction layers, each identically adopting the Concat alignment mechanism; our base model outperforms this baseline by 0.72% and 0.61% on the ATEC and CCKS datasets, respectively. Furthermore, replacing the attention pooling with average pooling in either aggregation stage causes some decline in the matching result: the accuracy decreases by 0.19% and 0.6% for the first and second aggregation stages on ATEC, and by 0.15% and 0.57% on CCKS. Another ablation that removes the fusion layer yields accuracy decreases of 0.77% and 0.50% on the two datasets. Finally, we also explore the potential impact of different embeddings on model performance. It can be observed that the model based on only character-level embedding has the largest performance decline, with accuracy dropping to 76.63% and 78.22% on ATEC and CCKS, respectively, while the drop for the model based on only word-level embedding is rather small, which indicates the great importance of the word-level embedding as well as the additional improvement brought by integrating Chinese character-level embedding into our proposed MGCMPI model.

2) ANALYSIS OF INTERACTION AND AGGREGATION MODULE
Since the Interaction & Aggregation module is the key component of our proposed MGCMPI model, responsible for the concrete cross-sentence interaction for semantic matching, in this section we study the influence of different settings of this module on overall model performance.
First, since the sentence interaction in this module is carried out on top of the underlying multi-granular contextual encoder, we explore how performance varies with the encoding granularity settings of both sentences. Theoretically, the larger the maximum granularity size set for the sentence pair, the greater the number of granularity combinations generated for matching, which would ideally enhance model performance continually as this parameter grows. However, as shown in Fig. 5, the model performance indeed improves steadily with increasing granularity size, but only while it is smaller than 3. Beyond 3, the improvement becomes quite limited, and worse still, the performance declines noticeably on both the ATEC and CCKS test sets when the granularity size reaches 5, which is why we set the maximum granularity size to 3 for both sentences in our final implementation. Specifically, as can be seen from Fig. 5, the matching accuracy peaks when the maximum granularity is set to 3, reaching 78.54% and 88.53% on the ATEC and CCKS test sets, respectively. Compared with the case where the granularity is set to 1 for both sentences, the accuracy gains are 0.56% and 0.60% for ATEC and CCKS, respectively. However, when the granularity size exceeds 3, particularly when it reaches 5, the model suffers a serious performance decline on both datasets, with accuracy decreases of 0.89% and 0.84% on ATEC and CCKS compared with the best accuracy achieved at the maximum granularity size of 3.
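The multi-granular contextual encoding underlying this experiment, one convolution bank per kernel size, could be sketched as follows; the same-padding scheme, the ReLU, and the function name are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def multi_granular_encode(X, filter_banks):
    """X: (L, d) embedded sentence; filter_banks: list of (k, d, h) filter tensors,
    one per granularity size k (e.g. k = 1, 2, 3).  Same-padding keeps each output
    feature map aligned with the L word positions."""
    maps = []
    for W in filter_banks:
        k = W.shape[0]
        P = np.pad(X, ((k // 2, (k - 1) // 2), (0, 0)))        # same padding
        out = np.stack([np.einsum('kd,kdh->h', P[i:i + k], W)  # conv at position i
                        for i in range(X.shape[0])])
        maps.append(np.maximum(out, 0.0))                      # ReLU
    return maps   # one (L, h) contextual encoding per granularity
```

Each kernel size k yields one contextual view of the sentence, so the set of granularity combinations between two sentences grows quadratically with the maximum k, consistent with the trade-off discussed above.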
Additionally, we investigate the performance variation under each single alignment mechanism as well as their combination. To ensure a fair comparison, we change only the alignment functions applied in this module and keep the other parameters unchanged; in particular, the maximum granularity is fixed at 3, which the previous experiment verified to be most effective in capturing multi-granular comparison information. As shown in Fig. 6, on both the ATEC and CCKS datasets, our proposed model, by simultaneously adopting multiple attention mechanisms (i.e., Concat, Bilinear and Minus), achieves the best matching result compared with the cases where only a single type of alignment mechanism is used. Another observation from this experiment is that among the three alignment functions, the Concat function appears more capable of capturing useful alignment features than the other two, with Bilinear next and Minus last. The slight advantage of the Concat function may be due to its more flexible alignment mechanism of concatenating two sentence vectors, which allows alignment features to be acquired from more general perspectives rather than a fixed aspect. Specifically, on both datasets our proposed model achieves the best matching result by combining the sentence alignment information from the three different cross-attention functions, with matching accuracy reaching 78.54% and 88.53% on ATEC and CCKS, respectively. Compared with the best result obtained under a single alignment function, 78.22% on ATEC and 88.19% on CCKS, both achieved by the Concat mechanism, the accuracy improvements gained through the combination of multiple alignments are 0.32% and 0.34% on ATEC and CCKS, respectively. It can also be seen that the Bilinear alignment function performs slightly better than the Minus function, with accuracy advantages of 0.03% and 0.02% on ATEC and CCKS, respectively. Thus, whichever single alignment function is adopted, the final model performance is still not as good as combining all three of them, which justifies the effectiveness of combining multiple alignment information between sentences for enhancing overall matching performance.
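The three alignment mechanisms compared above are commonly parameterized as below; the exact forms used in our model may differ slightly, so this NumPy sketch should be read as an illustration of Concat, Bilinear and Minus scoring rather than a definitive implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def concat_scores(P, Q, w):
    """Concat alignment: e_ij = w . tanh([p_i ; q_j]); P: (Lp, d), Q: (Lq, d), w: (2d,)."""
    pairs = np.concatenate([np.repeat(P[:, None], len(Q), 1),
                            np.repeat(Q[None, :], len(P), 0)], axis=-1)
    return np.einsum('ijd,d->ij', np.tanh(pairs), w)

def bilinear_scores(P, Q, M):
    """Bilinear alignment: e_ij = p_i M q_j^T; M: (d, d)."""
    return P @ M @ Q.T

def minus_scores(P, Q, w):
    """Minus alignment: e_ij = w . tanh(p_i - q_j); w: (d,)."""
    return np.einsum('ijd,d->ij', np.tanh(P[:, None, :] - Q[None, :, :]), w)

def align(P, Q, scores):
    """Attentive version of P: each p_i becomes a weighted sum over Q."""
    return softmax(scores, axis=-1) @ Q
```

Note how the Concat form imposes no fixed comparison pattern on the pair, whereas Bilinear restricts the interaction to a learned quadratic form and Minus to a difference, which may explain the ordering observed in Fig. 6.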

3) VISUALIZATION ANALYSIS
As is well known, the attention mechanism has achieved significant success in a variety of English sentence matching tasks; in our proposed Chinese sentence matching model MGCMPI, it likewise plays an essential role in improving overall performance, particularly through its application in the Interaction & Aggregation layer, where a higher-quality semantic alignment between sentences is expected. Thus, in this section, to gain a more intuitive understanding of how our model captures sentence alignment information through the attention mechanism, we visualize the word-word attention distributions as heatmaps. We first sample one sentence pair from the ATEC dataset: ''(Trans: Why can't we pay the deposit with Ant Credit Pay for bike sharing)'' and ''为什么现在蚂蚁花呗不支持交摩拜单车押金? (Trans: Why does Ant Credit Pay not support paying the Mobike deposit now)'', whose relation is labeled as synonymous in the dataset. As shown in Fig. 7, subgraphs (a)-(c) present the attention distribution between the sampled sentence pair under the three different alignment mechanisms. Note that due to page limitations it is not possible to visualize the attention matrices of this sentence pair at all granularity combinations, so we choose only the granularity combination (1, 1) for the visualization analysis in Fig. 7(a)-(c). It can be observed that sentence interaction based on a single alignment mechanism is generally carried out between low-level basic language units of the two sentences, and words that co-exist in both sentences, or that have quite similar meanings, are assigned relatively higher attention values than ordinary ones, such as ''不能 (Trans: cannot)'' in sentence P and ''不 (Trans: not)'' in sentence Q, ''支付 (Trans: pay)'' and ''交 (Trans: give to)'', as well as the overlapping word ''押金 (Trans: deposit)'' existing in both sentences, as can be seen from Fig. 7(a) and Fig. 7(c). Yet some major alignment errors can also be found in Fig. 7(a)-(c), such as ''支付 (Trans: pay)'' and ''支持 (Trans: support)'' in Fig. 7(b), and ''单车 (Trans: bicycle)'' and ''押金 (Trans: deposit)'' in Fig. 7(c).
Fig. 7(d) likewise visualizes the attention distribution between the two sentences, but differs from the former three in that the attention weights are calculated by further considering the stage-1 aggregation over the different alignment functions. Rather than visualizing the attentive relations under a single alignment mechanism, we weight-sum the attention distributions of the multiple alignment functions according to the attention weights calculated via attention pooling in the stage-1 aggregation. By integrating multiple different alignment mechanisms, the sentence alignment quality improves to a certain degree, as verified by the relatively larger number of distinctive and meaningful word alignments in Fig. 7(d) compared with the former three subgraphs.
Lastly, Fig. 7(e) visualizes the attention distribution after further taking into account the stage-2 aggregation, which is performed over all possible granularity combinations between the sentences. Specifically, based on the weight-summed attention distributions obtained in Fig. 7(d), we perform the weighted-sum operation again, this time over granularity combinations, according to the attention weights calculated by the attention pooling in the stage-2 aggregation. It can be observed from Fig. 7(e) that the attention distributions calculated by considering the adaptivity of attention pooling in the two-stage aggregation, from both the aspect of alignment mechanisms and that of granularity combinations, indeed yield rather distinctive and effective interaction signals, thus significantly improving the alignment quality between sentences. Words or phrases that play an essential role in determining the final matching result are strongly aligned to each other, such as ''不能 (Trans: cannot)'' and ''不支持 (Trans: not support)'', ''支付 (Trans: pay)'' and ''交 (Trans: give to)'', ''单车 (Trans: bicycle)'' and ''摩拜单车 (Trans: Mobike shared bicycle)'', as well as the overlapping word ''押金 (Trans: deposit)'' in the two sentences, which suggests that our proposed model is indeed able to correctly identify the semantic relationship of this sentence pair as synonymous, as labeled in the dataset.
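The weight-summed attention maps used for these visualizations amount to an adaptive convex combination of candidate attention matrices; a minimal sketch, assuming softmax-normalized pooling scores (the scoring itself is left abstract here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def combine_attention_maps(maps, pool_scores):
    """maps: (N, Lp, Lq) attention matrices, one per alignment mechanism (stage 1)
    or per granularity combination (stage 2); pool_scores: (N,) raw attention-pooling
    scores.  Returns the weight-summed (Lp, Lq) attention map for visualization."""
    weights = softmax(pool_scores)                 # adaptive weight per candidate map
    return np.einsum('n,npq->pq', weights, maps)
```

Applying this once over the alignment-mechanism axis and once over the granularity-combination axis reproduces the two weighting steps behind the (d) and (e) subgraphs.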
To demonstrate the superiority of our proposed model in aligning two Chinese sentences over the serially stacked interaction structure of many existing models, we take the same sampled sentence pair as above and visualize the attention distributions in another model that keeps all other parameters identical to ours, but encodes both sentences at a single granularity and stacks multiple cross-attention layers serially, with the Concat alignment mechanism applied in each layer. It can be observed from Fig. 8(a) that in the first interaction layer, the word-level alignment information can still be properly captured to some extent. However, as the number of interaction layers increases, the visualization shows that the aligned attentions tend to become unstable and ineffective in the higher interaction layers, which may be caused by the potential sentence alignment error propagation between multiple stacked interaction layers, as can be seen from Fig. 8(b) and Fig. 8(c).
Compared with the visualization result of our proposed model in the former experiment, it is clear that our model is more capable of gaining an accurate understanding of the semantic alignment information between sentences, thus yielding better results on Chinese sentence matching tasks.

4) DEFICIENCIES OF PROPOSED MODEL
Although our proposed MGCMPI is powerful, it still has some deficiencies. As shown in Table 5, two failure examples of the MGCMPI model are given. In the first case, although most of the word constitution and word order differ between the two sentences, they actually convey the same meaning; it may be the failure to capture the semantic alignment between the words ''线下门店 (Trans: offline stores)'' and ''实体超市 (Trans: physical supermarket)'' that leads MGCMPI to a wrong prediction in this case. It would be difficult for a machine learning model to effectively capture such implicit alignment information between sentences without incorporating extra knowledge into the model design. As for the second case, our analysis suggests that the length difference between sentences may also affect model performance: the large length difference possibly makes it hard to perform deep semantic interaction between the sentences, resulting in an incorrect prediction by MGCMPI. Despite some wrongly predicted samples, MGCMPI is able to make correct predictions for the majority of sentence pairs, as shown by the detailed experimental results above. In future work, we intend to employ external knowledge as well as other variants of the attention mechanism to enrich the contextual representation of each sentence and strengthen the semantic alignment features between them, in the hope of further enhancing model performance.

V. CONCLUSION
In this study, we propose a novel approach for Chinese semantic sentence matching that takes into account both the richer semantic representation of each single sentence and the profound capture of interactive features between sentences. Through the application of multiple convolutional kernels with various filter sizes, each sentence can be represented at multiple granularities, so that multi-granular interaction can be explicitly performed between sentences, with each sentence encoded at its own granularity size. Additionally, by introducing multiple alignment mechanisms aimed at a more comprehensive interaction between sentences, together with the two-stage aggregation strategy designed in our model, the final interactive sentence vector representations can be obtained more reasonably and adaptively by aggregating sentence representation vectors from both the aspect of alignment functions and that of granularity combinations. The performance of our proposed model has been evaluated by extensive experiments on two Chinese benchmark datasets, which demonstrate its effectiveness in better modeling sentence relationships for Chinese semantic sentence matching. In particular, the visualization analysis shows the superiority of our model in capturing cross-sentence alignment information over the approach with stacked interaction layers, which we expect to provide some intuitive reference for future model design.
Since the primary goal of MGCMPI is to calculate the semantic similarity of two sentences, it can in principle be extended to other natural language processing tasks, such as information retrieval, question answering and machine translation, for which semantic matching is a foundational task. Thus, in follow-up studies we intend to apply our proposed model to other NLP tasks to further verify its validity and generalization ability. We also plan to incorporate external knowledge such as grammatical and syntactic information, and to utilize large-scale pre-trained language models such as BERT [46] or ELMo [47] to enrich sentence semantic representations, so that further performance enhancements can be achieved.

FIGURE 2. An illustration of interaction & aggregation details.

FIGURE 3. Epoch curves for testing data on ATEC and CCKS, (a) testing accuracy under different epochs, (b) testing loss under different epochs.

FIGURE 5. Influence of the maximum granularity size for sentences on ATEC and CCKS datasets.

FIGURE 6. Influence of different alignment mechanisms on ATEC and CCKS datasets.

FIGURE 7. Visualization of attention distributions in our proposed MGCMPI model. Fig. 7(a)-(c) visualize the attention distributions of the sampled sentence pair at the granularity size of one under the Concat, Bilinear and Minus alignment mechanisms, respectively. Fig. 7(d) visualizes the attention distributions after further considering the influence of the stage-1 aggregation over the alignment mechanisms. Fig. 7(e) visualizes the attention distributions after considering the aggregations from both the perspective of alignment mechanisms and that of granularity combinations.

FIGURE 8. Visualization of attention distributions with hierarchically organized interaction layers. Fig. 8(a)-(c) visualize the attention distributions of the first, second and third interaction layers of this model, respectively.

The sample of ATEC dataset.

The sample of CCKS dataset.
multi-granular encoding is set to 256, and the output dimension of the enhanced self-attention layer is set to 128 in our experiments. Because trainable word-level embeddings can easily cause overfitting, we keep the word-level embedding fixed during model training. Besides, since both sentences for semantic matching are formed by the same grammar rules of the Chinese language, it is reasonable

The experimental results of different models on ATEC and CCKS datasets.

The ablation results on validation dataset.

Failure examples of MGCMPI model.