Sequence Generation Network Based on Hierarchical Attention for Multi-Charge Prediction

The application of multi-label text classification in charge prediction aims at forecasting all kinds of charges related to the content of judgment documents according to the actual situation, which plays a vital role in the judgment of criminal cases. Existing classification algorithms have high accuracy for the single-charge prediction, but their accuracy for the multi-charge prediction is low. To solve this problem, in this paper we introduce a novel hierarchical nested attention structure model with relevant law article information to predict the multi-charge classification of legal judgment documents. By considering the correlation between different charges, the accuracy of multi-charge prediction is greatly improved. Experimental results on real-world datasets demonstrate that our proposed model achieves significant and consistent improvements over other state-of-the-art baselines.


I. INTRODUCTION
In recent years, the task of charge prediction has attracted increasing attention. The purpose of this task is to predict the charges, law articles, terms of imprisonment, and other related information through given facts. Multi-charge prediction, as a representative sub-task of automatic charge prediction, plays an important role in the legal assistance system and can benefit many real-world applications. For example, it can provide legal experts with convenient reference information and thus improve their working efficiency. In addition, it can provide people who are unfamiliar with legal terminologies and complex procedures with legal consultation [1], [2].
The associate editor coordinating the review of this manuscript and approving it for publication was An-An Liu.
Existing algorithms regard charge prediction as a single-label classification problem, either adopting a K-nearest neighbor (KNN) classifier [1], [3] with shallow textual features or manually designing key factors for specific charges to help understand the text [4], which makes those works difficult to scale to multi-charge classification. Single-charge models predict well on the single-charge task, but cases of ''one person with multiple charges'' make it difficult to extract all the content features in the judgment documents. There are also works addressing a related task, finding the law articles that are involved in a given case. They often transform the multi-label classification problem into a multi-class classification task by only considering a fixed set of article combinations [5], which can only be applied to a small set of articles and does not fit real-world applications. More recent work proposes two improvements: a preliminary classification is performed first, and a re-ranking method that uses word-level and article-level features is then applied [5]. To some extent, these technologies have improved the experimental results, but they are heavily reliant on expert knowledge and extra textual analysis. Recent advances in neural networks have enabled us to jointly model charge prediction and relevant article extraction in a unified framework, where the latent correspondence from the fact description about a case to its related law articles and further to its charges can be explicitly addressed by a two-stack attention mechanism [6], [7]. However, these methods, which select charges by setting a threshold, mostly ignore the logical correlation between different charges.
Meanwhile, various parts of the text contribute differently to predicting different charges. Inspired by the tremendous success of the sequence-to-sequence model in machine translation, abstractive summarization, style transfer, and other domains, a sequence generation model consisting of an encoder-decoder with attention has been proposed to generate labels sequentially, predicting the next label based on the previously predicted labels [8]. In our proposed multi-charge prediction based on a sequence generation model, we employ the logical correlation between charges to capture the critical information relevant to some specific charges. By considering related factors, such as single-charge prediction, charge correlation, and relevant article extraction, multi-charge prediction could benefit from these related tasks on sequence generation models to achieve evident improvements.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The main problem of multi-charge classification is the explosive growth of the output space [1]. With 20 labels, for example, the output space already contains $2^{20}$ (more than one million) possible label sets. To deal with a label space of exponential size, it is necessary to mine the correlation between charges. For example, if a criminal commits the charges of ''smuggling'' and ''selling drugs'', the possibility of the offender committing the charge of ''detaining others to take drugs'' is also high, but the possibility of committing the charge of ''corruption'' or ''bribery'' is very low. Effective mining of the correlation between charges is the key to the success of multi-charge prediction.
In practice, there is a strong logical connection between the charges, such as ''theft'' and ''robbery'', or ''smuggling'', ''trafficking and transporting drugs'', and ''detaining other people to take drugs'', which have a high frequency of co-occurrence. In actual multi-charge prediction, the charge sequence is formed by sorting the charges, and the correlation information between the charges is integrated into the model to improve the prediction effect. In Table 1, we list the correlations between several charges [9]. In brief, our contributions are as follows. (1) We find that conventional multi-label classification algorithms are not suitable for the multi-charge classification, and we introduce a novel framework to consider the correlation between different charges to capture the critical information. (2) We propose a novel hierarchical nested attention structure model with relevant law article information to predict multi-charge classification of legal judgment documents. By considering the correlation between different charges, the accuracy of multi-charge prediction is greatly improved on the charge prediction datasets. (3) At the same time, our model can also improve the accuracy of the single-charge prediction.
The rest of this paper is organized as follows. In Section II, we review related work and derive the motivation of our model. Section III presents the structure of our model. In Section IV, the experimental settings and model performance evaluation are presented. Finally, Section V presents the conclusions and future research directions.

II. RELATED WORK
For a long time, experts in the field of law have been studying how to achieve automatic charge prediction. Raghav and Krishna have applied quantitative methods to predict judgments by calculating numerical values for factual elements [10]. Katz attempted to extract efficient features from case annotations [11]. Liu et al. introduced mathematical models for charge prediction, such as linear models and the scheme of nearest neighbors [4]. These methods are usually mathematical or quantitative, and they only work with small datasets with few charges.
Past works have treated multi-charge prediction as a special multi-class classification task that takes factual descriptions as inputs and outputs charge labels. Binary relevance (BR) transforms the multi-label classification task into multiple single-label classification problems by ignoring the correlations between labels [12]. Classifier chains (CC) transforms the multi-label classification task into a chain of binary classification problems and takes high-order label correlations into consideration [13]. Label powerset (LP) transforms the multi-label classification task into a multi-class problem with one multi-class classifier trained on all unique label combinations [14]. Based on the phrase classification method [10], KNN is used to classify criminal charges. However, the generalization ability of the KNN method is poor, and the word-level and phrase-level features extracted are too shallow to fully represent the text content of the charge fact description. They therefore do not provide a sufficient basis for distinguishing similar charges that differ only in nuances.
Lin et al. proposed a Chinese legal document labeling scheme by adding artificially labeled content as an aid to the machine learning model to improve the understanding of the case [5]. There are also scalability issues with this approach as the need to artificially design and annotate these determinants for each type of charge requires significant labor costs [6].
In the civil law system, some work has focused on determining the applicable legal provisions for a particular case. Phrase-based classification transforms this multi-charge problem into a multi-class classification problem by considering only a fixed set of articles [4], [15]. When considering a large set of legal articles, the number of possible combinations increases exponentially, so this method cannot be extended to massive sets of legal articles. The extreme multi-label text classification (XMTC) approach includes an extensible two-step classification method that first uses support vector machines (SVMs) to initially classify the articles and then uses the word-level features and co-occurrence trends between the articles to sort the results [16]-[18]. Luo et al. proposed an SVM model to extract the top k candidate articles and article-side attention to better understand the texts, but the method relies on a threshold to predict relevant charges and does not consider the logical correlation between different charges [19], [20]. Compared with a generation model, this threshold model cannot reflect the characteristics of the training data itself and has limited capacity, because it is uncertain in reading the prior structure. Furthermore, this model cannot capture the logical correlation between similar charges and would not perform well when the number of classes increases and more similar charges appear [21].
By considering the correlations between labels, SGM [8] views the multi-label classification task as a sequence generation problem. It proposes an encoder-decoder with an attention mechanism structure to predict different labels. Experimental results show that this model outperforms other methods by a large margin. However, it is likely that this model would make a succession of wrong label predictions in the following time steps if the prediction is wrong at time step t. Meanwhile, it is not suitable for the representation of document-level input. SGM uses the generation model to generate labels sequentially, but it does not consider the relevant law article information to enhance the effect of multi-charge prediction. We adopt a relevant law article extractor as an auxiliary means to improve the prediction effect of our model. Concurrently, we adopt a completely different hierarchical nested attention mechanism, which can better capture relevant semantic information [22].

A. OVERVIEW
The overall architecture of our proposed model is shown in Fig. 1. It consists of two parts. The encoder uses word-level attention to capture the key information in each sentence and then uses sentence-level attention to capture the key information about the facts of the charges. The decoder, with long short-term memory (LSTM) as the basic unit, decodes the output vectors of the encoder and the attention mechanism for charge prediction [23]. In Fig. 1, ''document'' refers to the document vector representation after processing, $T_j$ denotes the $j$th charge, and $S_j$ denotes the hidden state of the decoder at time step $j$.
From the perspective of criminal facts, the charge prediction task can be modeled as finding an optimal charge sequence $c^{*}$ that maximizes the conditional probability [24]:
$$c^{*} = \arg\max_{c} \, p(c \mid f) = \arg\max_{c} \prod_{i} p(c_i \mid c_1, \ldots, c_{i-1}, f),$$
where $f$ is the fact of the judgment document, $c$ denotes the charges contained in the judgment document, and $c_i$ is a single charge.
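As an illustration, and assuming the conditional probability factorizes over the charge sequence by the chain rule (the usual choice for sequence generation models, stated here as an assumption), the score of one candidate charge sequence can be sketched as follows. The callable `cond_prob` and the probability table are hypothetical and use made-up numbers.

```python
import math

def sequence_log_prob(charges, cond_prob):
    """Sum of log p(c_i | c_1..c_{i-1}, f) over a charge sequence.

    cond_prob(prefix, charge) is a hypothetical callable; the fact
    description f is assumed to be fixed inside it.
    """
    total = 0.0
    for i, charge in enumerate(charges):
        total += math.log(cond_prob(tuple(charges[:i]), charge))
    return total

# Toy conditional probabilities (illustrative values only).
TABLE = {
    ((), "theft"): 0.6,
    ((), "robbery"): 0.4,
    (("theft",), "robbery"): 0.5,
    (("theft",), "eos"): 0.5,
    (("theft", "robbery"), "eos"): 1.0,
}

def toy_cond_prob(prefix, charge):
    # Unseen continuations get a tiny probability instead of zero.
    return TABLE.get((prefix, charge), 1e-9)

score = sequence_log_prob(["theft", "robbery", "eos"], toy_cond_prob)
```

Working in log space avoids numerical underflow when the charge sequence is long.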
Considering that a sentence is a combination of a series of words, and an article is a combination of a series of sentences, the text embedding problem can be transformed into a combination of words and sentence embedding problems by using hierarchical structure. In the following, we will present how to build the sentence-level and word-level vectors progressively from words by using word encoder and sentence encoder to solve the text embedding problem.

B. ENCODER
The structure of the encoder is shown in Fig. 1. In the encoder, we adopt a hierarchical attention network. As the criminal facts are long texts, we first perform the word-level attention operation on each sentence to extract the features of each sentence. Then, on this basis, the sentence-level attention operation is performed to obtain the feature representation of the entire text. The key words and sentences in the criminal facts text can be obtained through the hierarchical attention operation [2], [7], [25].

1) WORD ATTENTION
Since different words in a sentence have different effects on the meaning of the entire sentence, it is helpful to introduce an attention mechanism to extract the words with important meanings and aggregate the representation of those informative words to form a sentence vector. We introduce the LSTM to represent the meaning of the sentence:
$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h}_{t+1}), \qquad h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t],$$
where $W$ and $V$ represent the weight matrices of the LSTM to be trained, $b$ represents the bias vectors, $x_t$ represents the input at time step $t$, and $h_t$, the concatenation of the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$, represents the hidden state at time step $t$.
The structure of word processing is shown in Fig. 2. First, we transform every word into a vector form by the embedding matrix. We then apply the attention mechanism to the word annotations to form a sentence vector [26]:
$$u_{it} = \tanh(W_w h_{it} + b_w), \qquad \alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)}, \qquad s_i = \sum_{t} \alpha_{it} h_{it},$$
where $W_w$ and $u_w$ represent the word-level attention weight matrix and context vector, $b_w$ represents the bias vector, and $h_{it}$ represents the hidden state at time step $t$ of the $i$th sentence. We first feed the word annotation $h_{it}$ through a one-layer MLP to obtain $u_{it}$ as a hidden representation of $h_{it}$, then we measure the importance of the word as the similarity of $u_{it}$ with a word-level context vector $u_w$ and obtain a normalized importance weight $\alpha_{it}$ through a Softmax function [21]. Next, we compute the sentence vector $s_i$ (we abuse the notation here) as a weighted sum of the word annotations based on the weights. The context vector $u_w$ can be seen as a high-level representation of a fixed query ''what is the informative word'' over the words, similar to that used in memory networks. The word context vector $u_w$ is randomly initialized and jointly learned during the training process [27]-[29].

2) SENTENCE ATTENTION
The structure of sentence processing is shown in Fig. 3. Given the sentence vectors $s_i$, we can obtain a document vector in a similar way. We use a bidirectional LSTM to encode the sentences:
$$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(s_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(s_i, \overleftarrow{h}_{i+1}), \qquad h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i],$$
where $W$ and $b$ represent the weight matrices and bias vectors of the LSTM, respectively; the sentence vector $s_i$ is the input at step $i$; and $h_i$, the concatenation of the forward hidden state $\overrightarrow{h}_i$ and the backward hidden state $\overleftarrow{h}_i$, represents the hidden state of the $i$th sentence. To reward sentences that are clues to correctly classifying a document, we again use an attention mechanism and introduce a sentence-level context vector $u_s$ and use the vector to measure the importance of the sentences [30]. This yields
$$u_i = \tanh(W_s h_i + b_s), \qquad \alpha_i = \frac{\exp(u_i^{\top} u_s)}{\sum_{i} \exp(u_i^{\top} u_s)}, \qquad v = \sum_{i} \alpha_i h_i,$$
where $W_s$ and $u_s$ represent the sentence-level attention weight matrix and context vector, $b_s$ represents the bias vector, $h_i$ represents the hidden state of the $i$th sentence, and $v$ is the document vector that summarizes all the information of sentences in a document. Similarly, the sentence-level context vector can be randomly initialized and jointly learned during the training process [31].
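The two attention levels described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bidirectional LSTM encoders are omitted (the hidden states are given directly), the attention form is the standard one-layer MLP followed by a Softmax over similarities with a context vector, and all parameter values and helper names (`attention_pool`, `document_vector`) are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable Softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden, W, b, context):
    """Return (weighted sum of hidden states, attention weights).

    u = tanh(W h + b); weight = softmax(u . context).
    """
    scores = []
    for h in hidden:
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    alpha = softmax(scores)
    dim = len(hidden[0])
    pooled = [sum(a * h[d] for a, h in zip(alpha, hidden)) for d in range(dim)]
    return pooled, alpha

def document_vector(doc, word_params, sent_params):
    # Word-level attention gives one vector per sentence;
    # sentence-level attention then gives the document vector.
    sentence_vecs = [attention_pool(words, *word_params)[0] for words in doc]
    return attention_pool(sentence_vecs, *sent_params)

# Toy parameters: identity weights, zero bias, context favors dimension 0.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
params = (IDENTITY, [0.0, 0.0], [1.0, 0.0])
doc = [[[0.1, 0.2], [0.9, 0.4]], [[0.5, 0.5]]]  # 2 sentences of word states
v, sent_alpha = document_vector(doc, params, params)
```

In a trained model the weight matrices and context vectors would be learned jointly with the rest of the network, as stated above.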

C. DECODER
The decoder structure of our model is shown in Fig. 4, where $T_j$ denotes the $j$th charge. In the decoder, we use the LSTM as the basic unit and draw on the attention mechanism in machine translation. On the one hand, it integrates the logical correlation between charges into the model; on the other hand, it strengthens the encoder-decoder information flow, thus completing the final multi-charge prediction [32]. The encoder reads each word vector sequentially and then copies the final hidden state to the decoder as the initial state. In the process of charge prediction, the decoder first reads the initial charge ''sos'' (start of the sentence), predicts the first charge described in the crime facts, then copies the charge to the second step as the input, and then predicts the second charge, continuing until the prediction result is the cut-off charge ''eos'' (end of the sentence) [33]. The hidden state $s_t$ of the decoder at time step $t$ is computed as follows:
$$s_t = \mathrm{LSTM}(s_{t-1}, [g(y_{t-1}); c_{t-1}]),$$
where $[g(y_{t-1}); c_{t-1}]$ represents the concatenation of the vectors $g(y_{t-1})$ and $c_{t-1}$, and $g(y_{t-1})$ is the embedding of the label that has the highest probability under the distribution $y_{t-1}$.
Here, $c_{t-1}$ denotes the context vector at time step $t-1$, and $y_{t-1}$ is the probability distribution over the label space $L$ at time step $t-1$. At a given time step $t$, the distribution is computed as follows:
$$O_t = W_o f(W_d s_t + V_d c_t), \qquad T_t = \mathrm{softmax}(O_t),$$
where $O_t$ is the output of the LSTM cell and $T_t$ is the probability of the predicted charge at time step $t$. Here, $W_o$, $W_d$, and $V_d$ are weight parameters, and $f$ is a nonlinear activation function. At the training stage, the loss function is the cross-entropy loss function. At the prediction stage, we employ the greedy search algorithm.
When the decoder predicts the end charge, the model stops predicting. The prediction paths ending with the ''eos'' are added to the candidate path set [34].
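The decoding procedure described above (start from ''sos'', feed each predicted charge back as the next input, stop at ''eos'') can be sketched with a greedy loop. The step function and its probability table are hypothetical stand-ins for the LSTM decoder with attention; only the control flow is meant to be illustrative.

```python
def greedy_decode(step_fn, init_state, max_len=10, sos="sos", eos="eos"):
    """Greedy search over charges.

    step_fn(state, prev_label) -> (new_state, {label: probability}) is a
    hypothetical interface standing in for one decoder time step.
    """
    state, prev, charges = init_state, sos, []
    for _ in range(max_len):
        state, dist = step_fn(state, prev)
        prev = max(dist, key=dist.get)  # most probable label at this step
        if prev == eos:
            break
        charges.append(prev)
    return charges

def toy_step(state, prev_label):
    # Made-up transition probabilities reflecting charge co-occurrence.
    table = {
        "sos": {"smuggling": 0.7, "theft": 0.2, "eos": 0.1},
        "smuggling": {"selling drugs": 0.6, "eos": 0.3, "theft": 0.1},
        "selling drugs": {"eos": 0.8, "theft": 0.2},
    }
    return state + 1, table.get(prev_label, {"eos": 1.0})

predicted = greedy_decode(toy_step, init_state=0)
```

The `max_len` cap is an added safeguard (an assumption, not part of the description above) so that decoding terminates even if ''eos'' is never predicted.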

D. USING LAW ARTICLES
In the process of multi-charge prediction, we add an article extractor as an auxiliary means to improve the prediction effect of the model, depending on the content of the dataset. The top k law articles are selected by a classifier, the feature vectors of these k articles are then obtained by a neural network to represent their semantic information, and the feature vectors are fed into the ''attention'' mechanism in Fig. 1. Whether the article extraction module is used depends on the dataset: as described in the experiments, the CJO dataset contains law article information, so the extraction module is added for CJO but not for the CAIL dataset. We use the legal information in the data as the auxiliary means to predict the related charges, and then we combine the logical connection between the charges of criminal law to further improve the effect of charge prediction.

E. THE OUTPUT
To predict the charges, we first concatenate the document embedding and the aggregated article embedding, then use a fully connected layer and a Softmax layer to predict the charges. As the number of charges for each instance varies, we do not normalize the prediction probability [35], [36]. The loss function is given as follows:
$$\mathrm{Loss} = -\sum_{i=1}^{N} \sum_{j=1}^{L} y_{ij} \log \hat{y}_{ij}, \tag{8}$$
where $N$ is the number of samples, $L$ is the number of labels, and $\hat{y}_{ij} \in [0, 1]$ and $y_{ij} \in [0, 1]$ are the prediction probability and true value, respectively, for the $i$th sample and the $j$th label.
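For illustration, a cross-entropy loss over $N$ samples and $L$ labels can be computed as in the sketch below. It assumes the plain summed form without averaging, which is an assumption about the normalization; the function name and the clipping constant `eps` are likewise introduced here for the example.

```python
import math

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Summed cross-entropy: -sum_i sum_j y_ij * log(yhat_ij).

    y_true and y_pred are N x L nested lists; predictions are clipped
    at eps to avoid log(0). The unaveraged sum is an assumption.
    """
    total = 0.0
    for truth_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(truth_row, pred_row):
            total -= y * math.log(max(p, eps))
    return total

# Two samples, two labels (toy values).
y_true = [[1, 0], [0, 1]]
y_pred = [[0.5, 0.5], [0.25, 0.75]]
loss = cross_entropy_loss(y_true, y_pred)
```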

IV. EXPERIMENTS
Legal judgment documents are usually long texts with large amounts of words and data. There are 469 types of charges in criminal law. If the multi-charge prediction task is transformed into a single-label classification problem over charge combinations, up to $2^{469}$ new labels will be generated, which will cause huge manual processing costs and significantly increase the complexity of the model. Using the sequence-based generation model, we add only a start label and an end label, so the label space is otherwise unchanged [37]. Considering the logical correlation between the charges, those charges that are less likely to occur at the same time are excluded. To verify the effectiveness of our model on charge prediction, we conducted experiments on two datasets and compared our model with several state-of-the-art baselines [2].
For charge prediction, we first sort the charge sequence of each sample according to the frequency of the charges in the training dataset, so that high-frequency charges are placed at the front. In addition, the ''sos'' and ''eos'' symbols are added to the head and tail of the charge sequence, respectively [5].

A. DATASETS CONSTRUCTION
We collect and construct two different legal judgment datasets: CJO and CAIL. CJO consists of criminal cases published by the Chinese government on China Judgement Online (footnote 1), and CAIL is another criminal case dataset, from the ''Chinese AI and Law Challenge'' (footnote 2). We select 148,841 pieces of data from the CJO dataset, where each sample is divided into three parts: (1) criminal facts; (2) list of charges; (3) articles of law. There are 131 charges in the CJO dataset. We select 36,012 pieces of data from the CAIL dataset, each composed of the description of a case and the facts in the legal document. There are 106 charges in the CAIL dataset. The CAIL dataset also includes the legal provisions involved in each case, the accused's conviction, and the length of the sentence.

B. BASELINES
We compare our proposed methods with the following baselines.

1) BINARY RELEVANCE
The basic idea of this algorithm is to decompose the multi-label classification tasks into Q independent binary classification problems, wherein each binary classification problem corresponds to a possible label in the label space. When the number of labels is large and the label density is low, class imbalance may occur in the binary classifier of each label [38].

2) CLASSIFIER CHAINS
The basic idea of this algorithm is to transform the multi-label learning problem into a chain of binary classification problems, where subsequent binary classifiers in the chain are built upon the predictions of preceding members. The disadvantage of this algorithm is that it cannot be parallelized, because the classifiers must be called one after another along the chain to predict charges [13].

3) LABEL POWERSET
The basic idea of this algorithm is to transform the multi-label classification problem into a multi-class classification problem by mapping the $2^Q$ possible label sets to $2^Q$ natural numbers. A characteristic of label powerset (LP) is that any label set it predicts must already exist in the training set; it cannot generalize to label sets that have never been seen before. In addition, the output space of this method is too large and the classification efficiency is low [14].

4) PREDICT CHARGES FOR CRIMINAL CASES WITH LEGAL BASIS
The method for predicting charges for criminal cases with legal basis (fact_law) jointly models the charge prediction task and the relevant article extraction task [19]. We experiment both with the relevant article information (fact_law) and without it (fact_wo).
Footnote 1: http://wenshu.court.gov.cn.
Footnote 2: http://cail.cipsc.org.cn/index.html.

5) HIERARCHICAL ATTENTION NETWORKS
Hierarchical attention networks (HAN) use two levels of attention mechanism, applied at the word level and the sentence level, enabling the model to attend differentially to more and less important content when constructing the document representation [22].
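The problem transformations behind the binary relevance and label powerset baselines can be sketched as follows. This shows only how the training targets are rebuilt, not the classifiers themselves, and the helper names are hypothetical.

```python
def binary_relevance_targets(label_sets, labels):
    # One independent 0/1 target vector per label.
    return {lab: [int(lab in s) for s in label_sets] for lab in labels}

def label_powerset_targets(label_sets):
    # Each distinct label combination seen in training becomes one class id.
    class_ids = {}
    targets = []
    for s in label_sets:
        key = frozenset(s)
        if key not in class_ids:
            class_ids[key] = len(class_ids)
        targets.append(class_ids[key])
    return targets, class_ids

samples = [{"theft"}, {"theft", "robbery"}, {"theft"}]
br = binary_relevance_targets(samples, ["theft", "robbery"])
lp_targets, lp_classes = label_powerset_targets(samples)

# Upper bound on label powerset classes for Q labels is 2**Q.
max_classes_for_20_labels = 2 ** 20
```

The label powerset mapping only contains combinations observed in training, which is why an unseen combination such as {"robbery"} alone could never be predicted in this toy example.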

C. EXPERIMENTAL SETTINGS
For the two datasets described previously, as the documents are well-structured and human-annotated, we can easily extract fact descriptions, applicable law articles, charges, and terms of penalty from each document using regular expressions.
For all models and baselines, we use Adam [39] as the optimizer and set the learning rate to 0.001, the dropout rate [40] to 0.5, and the batch size to 32. Since the case documents are written in Chinese with no spaces between words, we employ word segmentation. Afterward, we adopt the Skip-Gram model to pre-train the word embeddings on these case documents, with the embedding size set to 100 and the frequency threshold set to 25.
According to the statistics of the charges in the CAIL dataset, we require each charge to occur at least 50 times. In the CJO dataset, we set the minimum frequency to 80 and filter out the charges whose frequency is below 80. Then we build the charge name dictionary according to the frequency.
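The frequency filtering and the frequency-ordered charge sequences (with ''sos'' and ''eos'' added) can be sketched as below. The helper names are hypothetical, and dropping a filtered charge from a sample's sequence, rather than dropping the whole sample, is an assumption made for the example.

```python
from collections import Counter

def build_charge_vocab(train_label_sets, min_freq):
    # Count charge frequencies and keep those at or above the threshold;
    # the dictionary maps each charge to its frequency rank (0 = most frequent).
    counts = Counter(c for labels in train_label_sets for c in labels)
    kept = [c for c, n in counts.most_common() if n >= min_freq]
    return {c: rank for rank, c in enumerate(kept)}

def to_charge_sequence(labels, vocab, sos="sos", eos="eos"):
    # High-frequency charges first; charges outside the vocabulary are
    # dropped here (an assumption).
    kept = sorted((c for c in labels if c in vocab), key=vocab.get)
    return [sos] + kept + [eos]

train = [["theft"], ["theft", "robbery"], ["robbery", "theft", "fraud"]]
vocab = build_charge_vocab(train, min_freq=2)
sequence = to_charge_sequence(["fraud", "robbery", "theft"], vocab)
```

With the thresholds stated above, `min_freq` would be 50 for CAIL and 80 for CJO.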
The law articles need to be converted to text as input. Because the data available for certain charges are scarce, we set the fact text length to 300 and the law article text length to 500 for CJO; for CAIL, the fact text length is set to 400 and the law article text remains unchanged.

D. EVALUATION METRICS
Following previous work, we adopt Hamming loss and macro-F1 score as our main evaluation metrics for the performance comparison, because both are widely used evaluation methods for multi-label classification problems [19]. Hamming loss measures the fraction of labels that are predicted incorrectly:
$$\mathrm{HL} = \frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} \mathrm{XOR}(y_{ij}, \hat{y}_{ij}),$$
where $N$ is the number of samples, $L$ is the number of charges, $y_{ij}$ is the true value of the $j$th component in the $i$th prediction result, $\hat{y}_{ij}$ is the predicted value of the $j$th component in the $i$th prediction result, and XOR is the ''exclusive OR'' operation. We also employ accuracy, macro-precision, macro-recall, and macro-F1 as evaluation metrics [2], [8]. The per-charge precision and recall are defined as follows:
$$\mathrm{Precision}_j = \frac{TP_j}{TP_j + FP_j}, \qquad \mathrm{Recall}_j = \frac{TP_j}{TP_j + FN_j},$$
where $TP_j$, $TN_j$, $FP_j$, and $FN_j$ represent the number of true positive, true negative, false positive, and false negative test samples with respect to the $j$th charge, respectively. The macro metrics average the per-charge values over all charges.
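A minimal sketch of these metrics on binary label matrices follows. Two choices here are assumptions rather than statements from the text: macro-F1 is taken as the mean of the per-charge F1 scores, and a metric with a zero denominator is counted as 0.

```python
def hamming_loss(y_true, y_pred):
    # Fraction of (sample, label) positions where truth and prediction differ.
    n, num_labels = len(y_true), len(y_true[0])
    mismatches = sum(t != p
                     for true_row, pred_row in zip(y_true, y_pred)
                     for t, p in zip(true_row, pred_row))
    return mismatches / (n * num_labels)

def macro_scores(y_true, y_pred):
    """Return (macro-precision, macro-recall, macro-F1)."""
    num_labels = len(y_true[0])
    precisions, recalls, f1s = [], [], []
    for j in range(num_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        prec = tp / (tp + fp) if tp + fp else 0.0  # zero-division -> 0 (assumption)
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return (sum(precisions) / num_labels,
            sum(recalls) / num_labels,
            sum(f1s) / num_labels)

# Two samples, three charges (toy values).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
hl = hamming_loss(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
```

Because macro averaging weights every charge equally, rare charges that are never predicted correctly pull the score down as much as frequent ones.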

E. RESULTS AND ANALYSIS ON MULTI-CHARGE PREDICTION
The performance comparison with previous research work is presented in Tables 2, 3, and 4. Fig. 5 and Fig. 6 show the precision and loss of our model on the validation set during training; the precision reaches about 90%.
Meanwhile, Fig. 7 shows that the Hamming loss on the CAIL dataset remains very low. Almost all existing methods perform poorly under the macro-F1 metric, which shows that they do not effectively combine the law article information with the nested attention structure. Conversely, our model achieves promising improvements, demonstrating the robustness and effectiveness of our method. Adding a law extractor and a law text encoder introduces criminal law information and improves the prediction effect of the model. However, without introducing the correlation information between labels, the multi-charge prediction effect is not as good as that of single-charge prediction.

F. RESULTS AND ANALYSIS ON SINGLE-CHARGE PREDICTION
Taking the single-charge prediction as a special case of the multi-charge prediction, our model can also be used to predict single charges. In the process of single-charge prediction, the decoder first reads the initial charge ''sos,'' predicts the related charge described in the crime facts, then takes this charge as the input, and the prediction result is the cut-off charge ''eos.'' We can observe in Table 5 that our model also outperforms all the baselines. As shown in Fig. 8, the precision of our model in the training process is approximately 100%. Experimental results show that our model also has good robustness on single-charge prediction.

G. CASE STUDY
In this section, we choose a representative case, shown in Fig. 10, to show that the attention module helps improve the accuracy of multi-charge prediction. In this case, the defendant was convicted of ''obstruction of official duties'' and ''provocation.'' Because cases involve specific amounts of money and damaged goods, it is often difficult to decide whether a case should be judged as ''obstruction of official duties'' or ''provocation,'' because both are related to violence. An important feature of both is that the scenario in the case involves the public goods of the administrative department, which are given a higher weight in attention. Thus, we believe that this attribute is essential in the charge prediction of this case. In addition, we visualize the heat map of this case when predicting the attribute intentional injury. Words with a deeper background color have higher attention weights. In Fig. 10, we observe that the attention mechanism can capture key patterns and semantics relevant to the current attribute.

V. CONCLUSION
In this work, we have focused on the task of multi-charge prediction according to the fact descriptions of criminal cases. To address the problem of predicting numerous and easily confused charges, we have introduced a novel hierarchical nested attention structure to predict multiple charges of legal judgment documents. Specifically, our model learns the hierarchical nested attention structure and legal judgment fact representation jointly by utilizing an attribute-based attention mechanism and logical correlation. Experimental results on the CAIL and CJO datasets have shown that our model significantly outperforms all baselines and conventional multi-label classification models.
In the future, we will explore the following directions. (1) To verify the generalization ability of the model, we will apply our model to similar tasks in other languages. (2) We will explore more complicated legal judgment cases, such as those with multiple defendants and charges; handling this general form of charge prediction is challenging. (3) We will explore graph embedding with adversarial training methods to investigate the effectiveness of multi-charge prediction [41], [42]. (4) We will explore how to incorporate task-sensitive features to improve the performance of multi-charge prediction [43], [44].
In particular, if designed properly, transfer learning can provide encouraging results [45]-[48]. Meanwhile, we expect that all kinds of high-performance language models can be applied to multi-charge prediction in the future.