Deep Residual Convolutional Neural Network for Protein-Protein Interaction Extraction

Knowledge extracted from the protein–protein interaction (PPI) network can help researchers reveal the molecular mechanisms of biological processes. With the rapid growth in the volume of the biomedical literature, manually detecting and annotating PPIs from raw literature has become increasingly difficult. Hence, automatically extracting PPIs from raw literature by machine learning methods has gained significance in biomedical research. In this paper, we propose a novel PPI extraction method based on the residual convolutional neural network (CNN). This is the first time that the residual CNN is applied to the PPI extraction task. In addition, the previous state-of-the-art PPI extraction models rely heavily on parsing results from natural language processing tools, such as dependency parsers. Our model does not rely on any parsing tools. We evaluated our model on five benchmark PPI extraction corpora: AIMed, BioInfer, HPRD50, IEPA, and LLL. The experimental results showed that our model achieved the best results compared with the previous kernel-based and CNN-based PPI extraction models. Compared with the previous recurrent neural network-based PPI extraction models, our model achieved better or comparable performance.


I. INTRODUCTION
Protein-protein interaction (PPI) is a physical contact established between two or more protein molecules as a result of biochemical events. PPIs provide a useful proxy for cellular communication lattices and can be found in almost all cellular processes, such as metabolism, signaling, regulation, and proliferation [1], [2]. Compiling protein-protein interaction networks is meaningful to systems biology research [3]. One of the most important ways to construct a PPI network is to curate PPIs from the literature. For example, Kwon et al. [4] manually curated a Death Domain superfamily PPI database from 295 biomedical papers. The rapid growth of the biomedical literature makes the manual search of protein homologs and PPI annotation almost impossible [5]. As of May 2019, PubMed 1 comprised more than 29 million citations for biomedical literature. Designing and implementing effective methods that automatically extract PPIs from huge amounts of biomedical literature has become a popular topic in biomedical text mining research [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Nuno Garcia.
The task of PPI extraction is to identify whether two proteins in a given fragment of text interact. Although the relationship between proteins may be expressed across several sentences, most studies have concentrated on discovering PPIs within a sentence [6], [7]. In our paper, we only consider extracting PPIs within a sentence. The simplest PPI extraction methods are the co-occurrence methods, which assume that two proteins interact whenever they occur in the same sentence [8]. Despite their simplicity and high recall, the precision of the co-occurrence methods is extremely low. For example, in the AIMed corpus, no more than 17% of all sentence-level protein pairs describe protein-protein interactions [6]. Pattern matching methods are another type of PPI extraction method and trade recall for precision. Using hand-crafted patterns, Baumgartner et al. [9] achieved a precision of 38%, but a recall of only 6% on the PPI extraction task of BioCreative II. Yu et al. [10] built dependency graph patterns for PPI extraction that achieved higher precision but lower recall than other machine learning based methods.
Compared with the co-occurrence methods and pattern matching methods, machine learning based PPI extraction methods can simultaneously achieve higher precision and higher recall [6]. Traditional machine learning methods for PPI extraction include feature engineering based methods and kernel-based methods. Feature engineering based methods include two steps, building the features and training the classifier. For example, based on manually designed lexical, syntactic and dependency features, Saetre et al. [11] trained a support vector machine model for PPI extraction. However, manually building the features is a laborious process and these features may fail when the entity and relation annotation rules change [8].
Kernel-based methods use a kernel function to map features into a latent high-dimensional separable space. Kernel functions are in essence similarity functions that help the classifier fully exploit the structural similarity between instances. Most kernel functions for PPI extraction are based on syntactic parse trees [12]- [15], [20] or dependency parse trees [16]- [19] that capture sentence structure and semantic information. For example, Airola et al. [18] proposed the all-paths graph (APG) kernel that considers all dependency paths between two entity mentions because dependency paths are considered important indicators for the PPI extraction task. Murugesan et al. [20] proposed the Distributed Smoothed Tree kernel (DSTK), which exploits both syntactic and semantic space information. In addition to parse tree information, shallow linguistic information is also useful for PPI extraction. Giuliano et al. [21] proposed a kernel for PPI extraction leveraging shallow linguistic information such as chunking and parts-of-speech (POS). A single kernel cannot fully model the semantics of sentences. To retrieve the most extensive information from a given sentence, Miwa et al. [22] combined several PPI extraction kernels by multiple kernel learning to build a new kernel function that outperformed previous kernel functions. However, kernel-based PPI extraction methods rely heavily on natural language processing (NLP) tools, and the errors induced by these tools can degrade the model's performance. Additionally, when the training dataset is very large, maintaining a kernel matrix is difficult.
Deep learning has achieved remarkable success in the fields of natural language processing [23] and computer vision [24]. Compared with traditional machine learning methods, deep learning methods can automatically learn features from data. Many researchers have recently attempted to apply deep learning methods to improve the performance of PPI extraction. Some state-of-the-art PPI extraction models are based on the recurrent neural network (RNN) [25]- [27]. Hsieh et al. [25] proposed using the LSTM, a variant of the RNN, to extract PPIs from sentences. Yadav et al. [26] argued that the shortest dependency path (SDP) between two entity mentions is more useful for PPI extraction than the whole sentence. They proposed the Att-sdpLSTM model that combines the SDP and an attention based multi-layer LSTM to extract PPIs from literature. Instead of using the SDP, Ahmed et al. [27] used an LSTM to model the dependency tree structure of the input sentence. Although these RNN-based models achieved better performance, training an RNN is notoriously difficult because of vanishing and exploding gradients [28]. Due to its sequential dependence, an RNN is more difficult to parallelize than a convolutional neural network (CNN), and the prediction speed of an RNN is generally slower than that of a CNN of similar size. Other researchers developed CNN-based PPI extraction models. For example, Quan et al. [29] proposed a multichannel convolutional neural network that fuses multiple versions of word embeddings for the PPI extraction task. Hua and Quan [30] proposed the sdpCNN model for PPI extraction by combining the SDP and a CNN. In addition to sentence dependency information, Peng and Lu [31] proposed the McDepCNN model, which considers abundant lexical information such as parts-of-speech and chunking information. These CNN-based PPI extraction models are all single-layer models that in essence model only phrase-level information rather than sentence-level information. To improve the CNN model's performance, previous work introduced additional linguistic features such as the dependency structure. However, parsing sentences into the dependency structure introduces extra computation and time cost.
To improve a CNN model's feature expression ability without sacrificing too much speed, stacking several convolutional modules into one deep architecture is a feasible solution. However, when multiple convolutional modules are directly stacked, the performance of the model does not improve much because gradient vanishing occurs [32]. Johnson and Zhang [33] discovered that the residual connection between convolutional modules is very helpful for training a deep convolutional neural network for text categorization. Inspired by their model, we propose a deep residual convolutional neural network model for PPI extraction. We do not directly stack single convolutional modules but stack residual convolutional blocks that each contain several convolutional modules. There is one shortcut connection between the input and output of the residual convolutional block. Our model downsamples the sentence representation by half before feeding it into the residual convolutional block. The downsampling operation lets our model refine the sentence's representation gradually. Additionally, it keeps the model's computation bounded by a constant as the model deepens. The combination of residual connection and downsampling operation gives our model a large performance improvement over previous shallow CNN-based models. Most previous deep learning based models rely on other NLP tools. For example, Yadav et al.'s Att-sdpLSTM model relies on the Enju parser 2 that parses the input sentence into a predicate-argument dependency graph. The Enju parser parses only 3 sentences per second when we apply it to large corpora. Such time overhead is not tolerable when deploying the models. Our model does not rely on other NLP tools, which makes it more practical. We evaluate our model on five benchmark PPI corpora, AIMed [34], BioInfer [35], HPRD50 [36], IEPA [37] and LLL [38]. Our model achieves better or comparable performance relative to previous models [25]- [27], [29]- [31].
VOLUME 7, 2019
The remainder of this paper is organized as follows. Section II lists some related work. Section III describes our proposed PPI extraction model. Section IV describes the experimental details and presents experimental results. Section V discusses our model and gives error analysis. Section VI concludes our work and gives future research directions.

II. RELATED WORK

A. DEEP LEARNING BASED BIOMEDICAL RELATION EXTRACTION MODEL
Most of the current deep learning based PPI extraction models are based on the RNN [25]- [27] and CNN [29]- [31], [56]. Zhao et al. [39] combined a deep feed-forward neural network with manually selected features to extract PPIs from sentences, but the model did not show the advantage of deep learning compared with other RNN and CNN based models. The shortest dependency path (SDP) can be regarded as a simplified sentence, and some deep learning based biomedical relation extraction models rely on the SDP [26], [30], [40], [41]. However, the shortest dependency path can cause information loss for some sentences, especially when the candidate entity pair is located in a subordinate clause. The recursive neural network can make full use of the parse tree information and can also be used to extract biomedical relations [27], [42]. Compared with the RNN and CNN, the recursive neural network does not bring a large improvement in performance, and its computational complexity is very high. Combining the RNN and CNN is another way to improve the performance of biomedical relation extraction [41], [43]. Dewi et al. [44] proposed using a deep CNN to extract drug-drug interactions, but their model directly stacked several convolutional modules instead of using residual connections.
2 http://www.nactem.ac.uk/enju/

B. DEEP CNN FOR TEXT
Compared with a single-layer CNN, an RNN processes a sentence in token sequence order and is more suitable for building a text representation. RNNs can achieve satisfactory performance in complex NLP tasks such as machine translation [23] and image captioning [24]. However, parallel implementation of an RNN is troublesome due to its sequential dependence. Compared with the RNN, the CNN offers high computing efficiency and simple parallel implementation. An increasing number of researchers use CNNs to model these tasks. For example, Gehring et al. [46] designed a CNN-based sequence-to-sequence architecture for machine translation. The deep CNN was also applied to text classification [33], [47]. Compared with earlier CNN-based NLP models [48], [49], the common characteristics of these models are a deeper architecture and residual connections. Dilated convolution is another strategy to deepen the CNN. Strubell et al. [50] applied dilated convolution to the sequence tagging task. The deep CNN is widely applied in real-world systems. For example, the biomedical publication recommender system Pubmender uses a multi-layer CNN to help researchers choose a publication venue [51].

III. METHOD
A deeper architecture gives a CNN a larger receptive field, and residual connections stabilize gradient propagation. Combining the deep CNN and the residual strategy, we propose a deep residual convolutional network for PPI extraction. As shown in Figure 1a, our model includes three parts: the embedding layer, the convolutional layer and the classifier layer. Our model leverages a multi-layer CNN to extract meaningful features from a sentence with two marked protein entities. In the word embedding layer, our model transforms each word in the input sentence into a word embedding. These word embeddings are fed into the convolutional layer. In the convolutional layer, we stack several residual convolutional blocks instead of single convolutional modules. The residual convolutional block encapsulates several convolutional modules with a residual connection, as shown in Figure 1b. Before each residual convolutional block (except the first, see Section III-C), we apply a max-pooling layer with a stride of 2 that gradually concentrates context information and reduces the amount of calculation. After the convolutional layer, our model feeds the extracted sentence representation into the classifier layer, which gives the prediction results. We describe our model in detail in the following subsections.

A. EMBEDDING LAYER
The word embedding layer transforms each word in the input sentence into a real-valued vector called a word embedding. Word embeddings are low-dimensional dense vectors that capture words' syntactic and semantic information. For each sentence $S = (w_1, w_2, \ldots, w_n)$, where $n$ is the length of the sentence and $w_i$ is the $i$th word of $S$, the word embedding matrix $E$ of the sentence is constructed by looking up the word embedding table $W_{emb}$. $E$ is an $n \times d$ matrix, where $d$ is the embedding dimension, and each row of $E$ is the embedding of the corresponding word in the input sentence. The word embedding matrix $E$ is fed into the next layer. All the words occurring in the corpora are encoded into the word embedding table $W_{emb}$. Each column of the matrix $W_{emb}$ represents the embedding of a word in the vocabulary table. $W_{emb}$ is tuned when we train our models. In addition to the words occurring in the corpora, the vocabulary table also includes three special words: PROTEIN0, PROTEIN1 and PROTEIN2. Before a sentence marked with entities is fed into the word embedding layer, the two marked proteins are substituted with PROTEIN1 and PROTEIN2, respectively. Other protein entities are substituted with PROTEIN0. PROTEIN1 and PROTEIN2 indicate the candidate interaction's protein pair in the input sentence. If we did not substitute the two marked protein entities, our model could not determine which pair of proteins in the sentence is being classified. Substituting the protein mentions with special words also makes our model learn the general patterns of the corpora instead of paying excessive attention to the proteins' literal names. Because biomedical entity mentions are highly diverse, replacing them with special protein words also reduces the size of the vocabulary table.
Pretrained word embeddings contain rich semantic and syntactic information about words [57]. Collobert et al. [48] reported that initializing the word embedding layer with pretrained word embeddings is more beneficial for training the model than randomly initializing the embedding layer. We initialize the word embedding layer using pretrained word embeddings and tune the parameters of the word embedding layer when we train our model.
A convolutional module with a window size of 1 follows the word embedding layer. The aim of introducing this convolutional module is to make the dimension of the word embeddings match that of the following residual convolutional block.

B. RESIDUAL CONVOLUTIONAL BLOCK
In this section, we first introduce the convolutional module and then introduce how we stack several convolutional modules into a residual convolutional block.
The input of the convolutional module is $X = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, where $\oplus$ is the concatenation operator. The one-dimensional convolutional module is more widely used in natural language processing than in computer vision. The one-dimensional convolution operator is applied to each window to produce a new feature. Concretely, let $x_{i:i+h-1}$ be the concatenation of the $i$th through the $(i+h-1)$th features, where $h$ is the window size. We obtain new features as follows:
$$c_i = f(w \cdot x_{i:i+h-1} + b) \tag{1}$$
where $c_i$ is the new feature of the $i$th window, $w$ is the weight of the convolutional module, $b$ is the bias, and $f$ is the activation function. In this study, we adopt ReLU as our model's activation function. The feature map $c = c_1 \oplus c_2 \oplus \cdots \oplus c_{n-h+1}$ is produced by the one-dimensional convolutional module. The length of the output feature map, $n - h + 1$, is not equal to that of the input feature, $n$. To make the two lengths match, $\lfloor h/2 \rfloor$ zero vectors are added to each side of the input feature $X$ (for an odd window size $h$).
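As an illustrative sketch (not the implementation used in our experiments), the windowed convolution with zero padding described above can be written in plain Python. The function names `relu` and `conv1d_same` and the list-based data layout are assumptions made for this example, and the window size is assumed to be odd.

```python
def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def conv1d_same(x, filters, biases, h):
    """One-dimensional convolution that keeps the sequence length.

    x       : list of n feature vectors, each of length d
    filters : list of k filters, each a flat list of length h * d
    biases  : list of k floats
    h       : window size (assumed odd in this sketch)
    Returns a list of n output vectors, each of length k.
    """
    n, d = len(x), len(x[0])
    pad = h // 2                      # zero vectors added on each side
    zero = [0.0] * d
    padded = [zero] * pad + x + [zero] * pad
    out = []
    for i in range(n):
        # Concatenate the h vectors of the i-th window into one flat vector.
        window = [v for vec in padded[i:i + h] for v in vec]
        pre = [sum(w_j * v for w_j, v in zip(w, window)) + b
               for w, b in zip(filters, biases)]
        out.append(relu(pre))
    return out


# Toy input: 4 positions, 1 feature each; one filter that sums the window.
toy_x = [[1.0], [2.0], [3.0], [4.0]]
toy_out = conv1d_same(toy_x, filters=[[1.0, 1.0, 1.0]], biases=[0.0], h=3)
print(toy_out)  # [[3.0], [6.0], [9.0], [7.0]]
```

With padding, the output has the same number of positions as the input, which is what allows a shortcut connection to add the two element by element.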
Directly deepening a deep learning model may run into difficulties such as the vanishing or exploding gradient problem. Residual learning, which connects low-level features and high-level features, can tackle the gradient propagation problem of deep learning models [52]. The residual convolutional block is inspired by residual learning. As shown in Figure 1b, a sequence of convolutional modules is stacked, and these convolutional modules are used to learn the residual function. Assuming $x$ is the input of the residual convolutional block and $F$ denotes the stacked convolutional modules, the output of the block is
$$F(x) + x. \tag{2}$$
When several residual convolutional blocks are stacked, gradient propagation from the high-level block to the low-level block is easier because of the residual connection. Batch normalization can make the training process more stable and less sensitive to the learning rate [53]. We add batch normalization after each convolutional module. The number of convolutional modules, 'Conv_Num', in each residual convolutional block is a hyper-parameter.
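The shortcut connection can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the name `residual_block` is hypothetical, each module is any function that preserves the shape of its input, and batch normalization is omitted.

```python
def residual_block(x, modules):
    """Apply the stacked modules F to x and add the shortcut: F(x) + x.

    x       : list of feature vectors (the block input)
    modules : list of shape-preserving functions standing in for the
              convolutional modules (an assumption of this sketch)
    """
    fx = x
    for module in modules:
        fx = module(fx)
    return [[a + b for a, b in zip(row_x, row_f)]
            for row_x, row_f in zip(x, fx)]


def zero_module(m):
    """Toy module whose output is all zeros."""
    return [[0.0 for _ in row] for row in m]


def double_module(m):
    """Toy module that doubles every feature."""
    return [[2.0 * v for v in row] for row in m]


block_input = [[1.0, 2.0]]
# If the stacked modules output zero, the block reduces to the identity.
identity_out = residual_block(block_input, [zero_module])
# Two doubling modules give F(x) = 4x, so the block returns 5x.
stacked_out = residual_block(block_input, [double_module, double_module])
print(identity_out, stacked_out)  # [[1.0, 2.0]] [[5.0, 10.0]]
```

The identity case shows why the shortcut helps: even when the stacked modules contribute nothing, the input still passes through the block unchanged.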

C. DOWNSAMPLING BY MAX-POOLING
Our model performs a max-pooling operation with a window size of 3 and a stride of 2 before each residual convolutional block except the first one. The pooling layer produces new hidden representations by taking the component-wise maximum of 3 contiguous input vectors. The new hidden representations cover more context than the original input features. Thus, the subsequent residual convolutional block models the text with more abstract features. Because the pooling window moves two positions at a time, the sequence length, and hence the amount of calculation in the following block, is reduced by half. Even if more layers are stacked, the total amount of calculation of the model is therefore bounded by a constant. We perform a global max-pooling operation after the last residual convolutional block. Thus, we obtain a fixed-dimensional feature vector.
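The following sketch illustrates the downsampling step and the resulting bound on computation. The names `max_pool_stride2` and `relative_cost` are hypothetical. Two simplifications are assumed: the last pooling window is allowed to be shorter than 3 (the boundary handling is not specified above), and the cost of a block is taken to be proportional to its input length.

```python
def max_pool_stride2(x, window=3, stride=2):
    """Component-wise max over windows of `window` vectors, moving by `stride`.

    The final window may be shorter than `window` (an assumption here),
    so an input of length n yields ceil(n / 2) output vectors.
    """
    out = []
    for start in range(0, len(x), stride):
        chunk = x[start:start + window]
        out.append([max(column) for column in zip(*chunk)])
    return out


def relative_cost(num_blocks):
    """Total cost of all blocks relative to the first block,
    assuming each block costs half as much as the previous one."""
    return sum(0.5 ** k for k in range(num_blocks))


features = [[1, 5], [3, 2], [2, 9], [0, 0], [7, 1]]
pooled = max_pool_stride2(features)
print(pooled)             # [[3, 9], [7, 9], [7, 1]]
print(relative_cost(6))   # 1.96875
```

Under these assumptions the per-block costs form a geometric series, so the total stays below twice the cost of the first block no matter how many blocks are stacked.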

D. CLASSIFIER AND LOSS FUNCTION
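The output of global max-pooling is fed into the feed-forward classifier. Let $h$ be the input of the feed-forward layer. The hidden representation of the feed-forward classifier, denoted here by $h_f$, is calculated as
$$h_f = f(W_f h + b_f) \tag{3}$$
where $W_f \in \mathbb{R}^{d_h \times d_i}$ is the weight matrix, $b_f \in \mathbb{R}^{d_h}$ is the bias, $f$ is the activation function, $d_i$ is the dimension of the input, and $d_h$ is the dimension of the hidden representation. Then, the hidden representation is fed into a linear transformation to get each category's score as shown in (4).
$$s = W_s h_f + b_s \tag{4}$$
In (4), $W_s \in \mathbb{R}^{d_c \times d_h}$ is the weight matrix, $b_s \in \mathbb{R}^{d_c}$ is the bias, and $d_c$ is the number of categories. We use the softmax function to normalize the score $s$ into probabilities:
$$p(j \mid S) = \frac{\exp(s_j)}{\sum_{k=1}^{d_c} \exp(s_k)} \tag{5}$$
where $s_j$ is the score of the $j$th category for the input sentence $S$.
The loss function of our model is the cross entropy function. It is defined as
$$L = -\frac{1}{N} \sum_{i=1}^{N} \log p(\mathrm{class}_i \mid S_i) \tag{6}$$
where $S_i$ is the $i$th sentence of the corpora, $\mathrm{class}_i$ is the label of the $i$th sentence, $p(\mathrm{class}_i \mid S_i)$ is the softmax probability assigned to that label, and $N$ is the number of sentences in the corpora.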
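As a small numerical illustration, softmax normalization and the cross entropy loss can be computed as follows. This is a sketch rather than our training code: the function names are hypothetical, and averaging the loss over sentences (instead of summing) is a choice made for this example.

```python
import math


def softmax(scores):
    """Normalize a list of category scores into probabilities."""
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_scores, labels):
    """Mean negative log-probability of the gold labels.

    batch_scores : one list of category scores per sentence
    labels       : gold category index per sentence
    """
    losses = [-math.log(softmax(scores)[label])
              for scores, label in zip(batch_scores, labels)]
    return sum(losses) / len(losses)


uniform = softmax([0.0, 0.0])
loss = cross_entropy([[0.0, 0.0]], [1])
print(uniform)  # [0.5, 0.5]
print(loss)     # log(2), about 0.693
```

For a binary task, equal scores give a probability of 0.5 for each category, so the loss of an uninformed classifier is log 2.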
We minimize the loss function with the Adam optimizer [54]. Dropout [55] is applied after the word embedding layer and after the penultimate layer of our model, i.e., the feed-forward classifier's hidden layer. Additionally, a weight decay term is added to the loss function to prevent our model from overfitting.

IV. EXPERIMENTS

A. DATASETS
We evaluate our model using five benchmark PPI corpora, AIMed [34], BioInfer [35], HPRD50 [36], IEPA [37] and LLL [38]. Some differences exist in entity annotation and interaction annotation among these corpora. For example, the BioInfer corpus includes n-ary interactions that are not considered in the other corpora. The HPRD50 and LLL corpora define the types of interactions, but the other corpora ignore them. Pyysalo et al. [8] transformed the five PPI corpora's annotations to a shared level of information and stored them in a unified format. The converted corpora are available on the Internet 3 and have been adopted in previous PPI extraction research [6], [25], [27]. All the converted corpora only annotate protein-protein interactions within a sentence. Interactions across several sentences are ignored. Moreover, these converted corpora ignore the interaction types and only annotate whether two entities within a sentence interact. Hence, the PPI extraction task is cast as a binary classification task. We treat annotated interacting protein pairs as positive instances and all other protein pairs in the same sentence as negative instances. In a sentence with $n$ protein entities ($n \geq 2$), $\binom{n}{2}$ instances are generated. For example, in the sentence "We also found another armadillo-protein, p0071, interacted with PS1.", there are three protein entities: armadillo-protein, p0071 and PS1. As shown in Table 1, the sentence generates $\binom{3}{2} = 3$ candidate PPI instances. The third instance is annotated as a positive instance in the AIMed corpus, and the other two instances are negative instances. To ensure coverage of our model's vocabulary table, the two proteins of a candidate interaction are substituted by PROTEIN1 and PROTEIN2, and other proteins in the sentence are substituted by PROTEIN0.
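The instance generation described above can be sketched as follows. The function name `make_instances` is hypothetical, protein mentions are assumed to be single tokens for simplicity, and the positive pair is assumed to be p0071 and PS1, which is consistent with the example sentence.

```python
from itertools import combinations


def make_instances(tokens, protein_positions, positive_pairs):
    """Generate one blinded instance per unordered protein pair.

    tokens            : list of sentence tokens
    protein_positions : sorted token indices of (single-token) protein mentions
    positive_pairs    : set of (i, j) index pairs annotated as interacting
    Returns a list of (blinded sentence, label) tuples.
    """
    instances = []
    for first, second in combinations(protein_positions, 2):
        blinded = list(tokens)
        for pos in protein_positions:
            blinded[pos] = "PROTEIN0"      # proteins outside the candidate pair
        blinded[first] = "PROTEIN1"
        blinded[second] = "PROTEIN2"
        label = 1 if (first, second) in positive_pairs else 0
        instances.append((" ".join(blinded), label))
    return instances


sentence = "We also found another armadillo-protein , p0071 , interacted with PS1 ."
example = make_instances(sentence.split(), [4, 6, 10], {(6, 10)})
for text, label in example:
    print(label, text)
```

A sentence with three protein mentions yields three instances, and only the pair marked as interacting receives the positive label.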
Although Pyysalo et al. [8] provided unified corpora for researchers, differences in entity annotation among the corpora still exist. In the AIMed corpus, there are some nested protein entities. For example, "p75" and "p75 neurotrophin receptor" are both annotated as protein entity mentions in the sentence ". . . the p75 neurotrophin receptor and the p140trk (trkA) tyrosine kinase receptor. . . ". When generating candidate PPI instances, we ignore instances between nested entities because no interaction exists between them in most cases. If one of the nested entities occurs in a candidate instance, the others are not substituted by PROTEIN0. If the nested entities do not participate in the candidate interaction, we replace the longest of the nested entities with PROTEIN0 and ignore the other mentions. In the BioInfer corpus, there are some composite named entity mentions and discontinuous entity mentions. For example, in the sentence "Arp2/3 complex from Acanthamoeba binds profilin and cross-links actin filaments.", "Arp2/3" is annotated as two entity mentions, "Arp2" and "Arp3". We propose several rules to preprocess these composite and discontinuous entity mentions. These rules are listed as follows.
1) If composite entities have a common prefix, we concatenate the prefix and each component's suffix. For example, "Arp2/3" will be replaced by "Arp2 / Arp3".
2) If composite entities have a common suffix, we concatenate the suffix and each component's prefix. For example, "muscle and brain actin" will be replaced by "muscle actin and brain actin".
3) If an entity mention is discontinuous and does not intersect with other entity mentions, we make it continuous by including the other text within the annotated range. For example, in the phrase "Integrin (beta) chains", the parentheses are removed in the original annotation, and "Integrin beta chains" is annotated as an entity mention. In our experiments, we consider "Integrin (beta) chains" as the entity mention.
4) If composite entities have a common prefix or suffix and intersect with other entity mentions, we directly drop these intersecting entity mentions and process the composite entities as before.

We preprocess the corpora by the previous steps. The statistics of the preprocessed corpora are presented in Table 2.
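Rules 1 and 2 can be illustrated with a short sketch. The helper names `expand_shared_prefix` and `expand_shared_suffix` are hypothetical and only cover the two example patterns given in the rules. The sketch assumes that each later component is a suffix of the same length as the part it replaces (rule 1) and that each component prefix is a single token (rule 2).

```python
def expand_shared_prefix(mention, sep="/"):
    """Rule 1 sketch: 'Arp2/3' -> 'Arp2 / Arp3'.

    Assumes every component after the first is a suffix with the same
    length as the trailing part of the first component.
    """
    parts = mention.split(sep)
    head = parts[0]
    expanded = [head]
    for suffix in parts[1:]:
        prefix = head[:len(head) - len(suffix)]
        expanded.append(prefix + suffix)
    return f" {sep} ".join(expanded)


def expand_shared_suffix(mention, conj="and"):
    """Rule 2 sketch: 'muscle and brain actin' -> 'muscle actin and brain actin'.

    Assumes two components joined by `conj`, each with a one-token prefix.
    """
    left, right = mention.split(f" {conj} ", 1)
    suffix = " ".join(right.split()[1:])
    return f"{left} {suffix} {conj} {right}"


print(expand_shared_prefix("Arp2/3"))                    # Arp2 / Arp3
print(expand_shared_suffix("muscle and brain actin"))    # muscle actin and brain actin
```

Real corpus mentions are more varied than these two patterns, so a full preprocessing pipeline would need to work from the annotated character offsets rather than from string patterns alone.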

B. EVALUATION METRICS
The F1-score is the most common performance evaluation metric for the PPI extraction task [6]. The F1-score is calculated by (9).
$$P = \frac{TP}{TP + FP} \tag{7}$$
$$R = \frac{TP}{TP + FN} \tag{8}$$
$$F1 = \frac{2 \cdot P \cdot R}{P + R} \tag{9}$$
In (7) and (8), TP is the number of true positive instances, FP is the number of false positive instances and FN is the number of false negative instances. As shown in (9), the F1-score considers both the precision $P$ calculated by (7) and the recall $R$ calculated by (8). When precision is low and recall is high, or precision is high and recall is low, the F1-score is low. A high F1-score means that most of the positive instances have been discovered and most of the extracted interactions are correct. Because the original corpora are not divided into a training dataset and a test dataset, we perform 10-fold cross-validation on the corpora following previous studies [6], [18] and then use the average F1-score of the 10-fold cross-validation experiments to evaluate our model's performance.
The macro F1-score, i.e., the average of the F1-scores of the positive and the negative instances, has been adopted to evaluate performance in other previous studies [26], [27], [56]. In our view, the negative instances' F1-score carries little meaning when evaluating a PPI extraction model. Moreover, on unbalanced corpora such as the AIMed corpus, the dominance of negative instances can cause the model's macro F1-score to be higher than its normal F1-score. To directly compare our model with the models that adopt the macro F1-score as the evaluation metric, we also evaluate our model using the macro F1-score.
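The difference between the two metrics can be seen in a small worked example. The function names and the toy label lists below are made up for illustration; they are not results from our experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from confusion counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def confusion_counts(gold, pred, target):
    """TP, FP and FN when `target` is treated as the class of interest."""
    tp = sum(g == target and q == target for g, q in zip(gold, pred))
    fp = sum(g != target and q == target for g, q in zip(gold, pred))
    fn = sum(g == target and q != target for g, q in zip(gold, pred))
    return tp, fp, fn


def macro_f1(gold, pred):
    """Average of the positive-class and negative-class F1-scores."""
    f_pos = precision_recall_f1(*confusion_counts(gold, pred, 1))[2]
    f_neg = precision_recall_f1(*confusion_counts(gold, pred, 0))[2]
    return (f_pos + f_neg) / 2


# Toy unbalanced data: 2 positive and 8 negative instances.
gold = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
positive_f1 = precision_recall_f1(*confusion_counts(gold, pred, 1))[2]
print(positive_f1)            # 0.5
print(macro_f1(gold, pred))   # 0.6875
```

In this toy case the easy negative class scores 0.875, which lifts the macro F1-score above the positive-class F1-score even though only half of the positive instances are found.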

C. PARAMETER SETTINGS
Chiu et al. [59] gave some suggestions about how to train word embeddings for biomedical literature using word2vec and provided word embeddings 4 trained on the PMC corpus. We initialize our model's word embedding layer using these pretrained word embeddings. For out-of-vocabulary words, such as the special words PROTEIN0/1/2, we randomly initialize their word embeddings by sampling from the uniform distribution on [−0.001, 0.001]. The weights of the convolutional layers and linear layers are initialized by the Xavier uniform initializer [60], and their biases are initialized to zero. When the sentence length is less than 100, we pad the sentence with a special pad token.
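The embedding initialization and padding steps can be sketched as follows. The function names, the pad token string `<PAD>` and the dictionary-based embedding table are assumptions made for this example.

```python
import random


def init_embedding_table(vocab, pretrained, dim, seed=0):
    """Copy pretrained vectors where available; otherwise sample each
    component uniformly from [-0.001, 0.001]."""
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-0.001, 0.001) for _ in range(dim)]
    return table


def pad_tokens(tokens, max_len=100, pad_token="<PAD>"):
    """Right-pad a token list to max_len; longer inputs are returned unchanged."""
    return tokens + [pad_token] * (max_len - len(tokens))


# Toy 2-dimensional "pretrained" vectors (made-up values).
toy_pretrained = {"binds": [0.1, 0.2]}
table = init_embedding_table(["binds", "PROTEIN1"], toy_pretrained, dim=2)
padded = pad_tokens(["PROTEIN1", "binds", "PROTEIN2"], max_len=5)
print(padded)  # ['PROTEIN1', 'binds', 'PROTEIN2', '<PAD>', '<PAD>']
```

The small sampling range keeps the randomly initialized special tokens close to zero at the start of training, so they do not dominate the pretrained vectors.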
The hyper-parameter settings are shown in Table 3. The two hyper-parameters not listed in the table, namely the number of residual blocks and the number of convolutional modules per block, are tuned by 10-fold cross-validation. We select the number of residual blocks from the range [1,6] and the number of convolutional modules from the range [1,4]. For the AIMed and BioInfer corpora, the number of residual blocks is 6 and the number of convolutional modules is 3. For the HPRD50 corpus, the number of residual blocks is 4 and the number of convolutional modules is 2. For the IEPA corpus, the number of residual blocks is 5 and the number of convolutional modules is 2. For the LLL corpus, the number of residual blocks is 6 and the number of convolutional modules is 2.

D. RESULTS
We compared our model with several kernel-based PPI extraction methods and deep learning based methods. The compared kernel-based models are briefly introduced as follows.
1) Edit kernel. The edit kernel obtains the similarity between two annotated sentences by computing the edit distance between them [17].
2) APG kernel. The APG kernel counts weighted shared dependency paths of all possible lengths. The dependency path weight is higher when the dependency path between entities is shorter [18].
3) kBSPS kernel. The kBSPS kernel is based on the shortest path between two annotated entities, dependency graph nodes and k distance dependency information [16].
4) Hybrid kernel. Miwa et al. [22] constructed the Hybrid kernel by combining several parsers and kernels.
5) DSTK kernel. The DSTK kernel exploits structural syntactic and phrase semantic information, which can be regarded as a compositional distributional semantic model [20].
The compared deep learning based methods are briefly introduced as follows.
1) DNN. Zhao et al. [39] pretrained a feed-forward neural network on extracted features by autoencoders and then fine-tuned the pretrained feed-forward neural network on PPI corpora.
2) MCCNN. The MCCNN model is a convolutional neural network with a multichannel word embedding input layer [29].
3) sdpCNN. The sdpCNN model combines a convolutional neural network with the shortest dependency path between two marked protein entities [30].
4) McDepCNN. The McDepCNN is a convolutional neural network with abundant lexical and syntactic input features [31].
5) LSTM. Hsieh et al. [25] combined an LSTM and a multilayer fully connected neural network to extract PPIs from the literature.
6) DCNN. The DCNN model combines a convolutional neural network with feature embeddings such as WordNet embeddings to extract PPIs [56].
7) Att-sdpLSTM. The Att-sdpLSTM model is an attention based multilayer LSTM model whose input is the shortest dependency path parsed by the Enju parser [26].
8) tLSTM. The tLSTM model comprises a structured attention based architecture and a tree-LSTM [27]. The tree-LSTM is built on the dependency parse tree.
The experimental results are presented in [...] achieved similar or worse performance compared with kernel-based models. Compared with the LSTM model, our model achieved better performance on the AIMed corpus. Almost all deep learning based PPI extraction models do not present experimental results for the HPRD50, IEPA and LLL corpora because these three corpora are too small. For example, there are only 164 positive instances in the LLL corpus, as shown in Table 2. To comprehensively analyze our model's performance, we present experimental results on the three corpora. Our model still outperforms most kernel-based PPI extraction methods despite the small size of the corpora. The DSTK kernel achieved better performance on the three corpora. A large amount of external lexical and syntactic knowledge is introduced into the DSTK kernel. When the data are very scarce, this knowledge might help DSTK achieve better performance.
The DCNN model, the Att-sdpLSTM model and the tLSTM model are evaluated by the macro F1-score. We also evaluated our model using the macro F1-score to directly compare our model with the three models. The macro F1-score experimental results are presented in Table 5. Our model achieved better performance on the AIMed and BioInfer corpora than the DCNN model and the tLSTM model. Although the Att-sdpLSTM model achieves better performance on the AIMed corpus, our model outperforms it on almost all other corpora. Moreover, the Enju parser, on which the Att-sdpLSTM model relies, is very slow. The time-consuming parser makes the Att-sdpLSTM model less practical than our model. The tLSTM model achieved slightly better performance on the IEPA and LLL corpora, likely due to the introduction of the dependency tree structure.

V. DISCUSSIONS
A. ABLATION STUDY
To investigate the effectiveness of each component of our model, we perform an ablation study on the AIMed and BioInfer corpora, the results of which are shown in Table 6.
In the study, we analyze the effectiveness of the residual connection, 2-stride max-pooling and pretrained word embeddings, respectively. In each experiment, we remove one component of our model and keep the other settings unchanged. Next, we perform 10-fold cross-validation on the AIMed and BioInfer corpora. After we remove the residual connection from the residual convolutional blocks, our model is similar to an ordinary multilayer convolutional neural network. Compared with the other two modifications, removing the residual connection causes the largest performance decline, e.g. 4.1% on AIMed and 2.6% on BioInfer. This shows that the residual connection is an important component of our model. The 2-stride max-pooling operation before each residual convolutional block enlarges the coverage of context and brings a performance gain. Initializing the word embedding layer with pretrained word embeddings is an important step when training deep learning based NLP models. Our ablation experiments also confirm the value of this strategy. Additionally, the benefit of initializing word embeddings is more obvious for the AIMed corpus than for the BioInfer corpus. As shown in Table 2, there are more instances in the BioInfer corpus than in the AIMed corpus. We suppose that the effect of initializing word embeddings diminishes as the amount of training data increases.
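What the first two ablations switch off can be sketched with a single-channel toy block in plain Python. This is a simplified illustration under stated assumptions, not the model's implementation: the pooling window of 2, the ReLU-before-convolution ordering, the kernel size of 3 and the flags use_residual and use_pooling are all assumed for the example.

```python
def conv1d_same(x, kernel):
    """Single-channel 1D convolution with zero padding (output length == input length)."""
    half = len(kernel) // 2  # assumes an odd kernel length
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]


def relu(x):
    return [max(0.0, v) for v in x]


def max_pool_stride2(x, window=2):
    """Max-pooling with stride 2; roughly halves the sequence length."""
    return [max(x[i:i + window]) for i in range(0, len(x), 2)]


def conv_block(x, kernels, use_residual=True, use_pooling=True):
    """Pooling, then a stack of convolutions, then an optional shortcut."""
    if use_pooling:
        x = max_pool_stride2(x)
    out = x
    for kernel in kernels:
        out = conv1d_same(relu(out), kernel)
    if use_residual:
        # Identity shortcut: add the block input back onto the block output.
        out = [o + s for o, s in zip(out, x)]
    return out


x = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0, 0.0, 4.0]
zero_kernels = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
with_residual = conv_block(x, zero_kernels, use_residual=True)
without_residual = conv_block(x, zero_kernels, use_residual=False)
```

With all-zero kernels the residual variant still passes the pooled input through unchanged, whereas the plain variant outputs zeros. This identity path is what the first ablation removes.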

B. HYPER-PARAMETERS ANALYSIS
The number of residual blocks and the number of convolutional modules in each block are two hyper-parameters of our model. We determine their values in the hyper-parameter selection phase. To examine how the two hyper-parameters affect the performance of our model, we perform experiments on the AIMed and BioInfer corpora. The number of residual blocks is selected from the range [1, 6], and the number of convolutional modules is selected from the range [1, 4].
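A search over the two ranges can be sketched as follows. The sketch assumes an exhaustive grid over both hyper-parameters, whereas the experiments may instead vary one hyper-parameter at a time. The evaluate callback is hypothetical and stands in for training and cross-validating a model, and the placeholder scoring function carries no experimental meaning.

```python
from itertools import product


def grid_search(evaluate, block_range=range(1, 7), module_range=range(1, 5)):
    """Score every (num_blocks, num_modules) pair and return the best one.

    `evaluate` is a hypothetical callback returning a validation score,
    e.g. a cross-validated F1-score, for one configuration.
    """
    results = {(b, m): evaluate(b, m)
               for b, m in product(block_range, module_range)}
    best = max(results, key=results.get)
    return best, results


def placeholder_score(num_blocks, num_modules):
    # Placeholder only: an arbitrary function with a single peak,
    # NOT a result or trend reported in the paper.
    return -((num_blocks - 5) ** 2) - (num_modules - 2) ** 2


best_config, all_scores = grid_search(placeholder_score)
```

The two ranges give 6 x 4 = 24 configurations, each of which would require a full cross-validation run in practice.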
As shown in Figure 2, the F1-score increases with the number of blocks, illustrating the effect of deepening the model's architecture. There is a large F1-score improvement when we increase the number of blocks from 1 to 2. However, the improvement is not very obvious when the number of blocks is greater than 3. We suppose that the context information captured by our model is limited when the number of blocks is 1, and that the context is enlarged by stacking several residual blocks. Moreover, max-pooling with a stride of 2 also plays an important role in expanding the context range. However, the residual, which supplements missing information from the first few layers, gradually decreases in the last few layers. Hence, when prediction speed is an important factor of the PPI extraction system, 2 or 3 residual blocks are the best choices. When better accuracy is needed, 5 or 6 residual blocks are the best choices. When the number of convolutional modules is set to 1, our model achieves the worst performance. When the number of convolutional modules is set to 4, our model's performance may degrade. Thus, residual convolutional blocks with 2 or 3 convolutional modules are the best choices for our model.
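How stacking blocks and 2-stride max-pooling enlarge the context can be made concrete with the standard receptive-field recurrence. The sketch assumes that each block consists of one max-pooling step followed by stride-1 convolutions, with a kernel size of 3 and a pooling window of 2. These sizes are assumptions for illustration, and any layers placed before the first block are ignored.

```python
def receptive_field(num_blocks, convs_per_block, kernel_size=3,
                    pool_window=2, pool_stride=2):
    """Number of input positions seen by one output position.

    Uses the usual recurrence: each layer adds (window - 1) * jump
    positions, and each strided layer multiplies the jump by its stride.
    """
    rf, jump = 1, 1
    for _ in range(num_blocks):
        # Max-pooling placed before the convolutions of the block.
        rf += (pool_window - 1) * jump
        jump *= pool_stride
        # Stride-1 convolutions inside the block.
        for _ in range(convs_per_block):
            rf += (kernel_size - 1) * jump
    return rf


# Receptive field for 1 to 6 blocks with 2 convolutions per block.
fields = [receptive_field(b, 2) for b in range(1, 7)]
```

Under these assumed sizes, the receptive field grows from 10 positions with one block to 64 positions with three blocks. Since 64 positions is longer than many sentences, further blocks add little new context, which is consistent with, though not proof of, the diminishing gains beyond three blocks.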

C. ERROR ANALYSIS
In this section, we analyze our model's erroneous predictions on the AIMed, BioInfer, HPRD50, IEPA and LLL corpora and summarize them as follows.
1) Indication words such as bind and link between two entities can prevent our model from making a correct judgement. For example, no interaction relation between the marked entities exists in the sentence "This suggests that PROTEIN1 may link PROTEIN2 activation to molecules that regulate GTP binding proteins.", but our model predicts that the relation is positive. The word link located between PROTEIN1 and PROTEIN2 is confusing because the object of link is easily mistaken for the word PROTEIN2. However, the object of link is the word activation.
He is currently the Changjiang Scholar Young Professor and the Pearl River Scholar Young Professor with the School of Computer Science and Engineering, South China University of Technology, China. He has published over 100 research papers in international journals and conference proceedings. His current research interests include evolutionary computation algorithms, swarm intelligence algorithms, and their applications in real-world problems, and in environments of cloud computing and big data.
Dr. Zhan's doctoral dissertation was awarded the China Computer Federation (CCF) Outstanding Dissertation and the IEEE Computational Intelligence Society (CIS) Outstanding Dissertation. He received the Outstanding Youth Science Foundation from the National Natural Science Foundation of China (NSFC), in 2018, the Wu Wen Jun Artificial Intelligence Excellent Youth from the Chinese Association for Artificial Intelligence, in 2017, and the First Grade Award in Natural Science from Guangdong Province, in 2018. He is listed as one of the most cited Chinese researchers in computer science. He is also an Associate Editor of Neurocomputing.
LAN HUANG received the Ph.D. degree from the College of Computer Science and Technology, Jilin University, Changchun, China, in 2003, where she is currently a Professor and a Supervisor for Ph.D. candidates. She is mainly involved in business intelligence theory and application research. She was invited to Italy Trento University as a Senior Visitor, in 2010. As a PI and a Co-PI, she has undertaken or completed more than ten teaching and scientific research projects, granted by the National 863 Hi-tech Research and Development Program, the National Science Foundation China, provincial/ministerial foundations, and other sources. The software developed by her team has brought good economic benefit to the cooperative enterprises and application enterprises. She has published 64 academic papers and obtained 14 software copyrights. In recent years, her research interests have focused on business intelligence applications and social network mining algorithms. The works in which she participated as a Main Investigator were awarded the Jilin Province Scientific and Technological Progress Award