A Neural-Network-Based Model of Charge Prediction via the Judicial Interpretation of Crimes

Neural-network-based charge prediction, which predicts a defendant's charges from criminal case documents via neural networks, has become a promising task in artificial-intelligence (AI) based legal assistant systems and has achieved notable results. Neural networks play an important role in capturing deep information in current work. However, charge prediction suffers from serious data imbalance in real-world situations: only high-frequency charges are easy to predict, whereas plenty of low-frequency ones are hard to handle. Furthermore, the presence of confusing charges makes prediction even worse. Here, we propose a novel model of charge prediction via the judicial interpretation of crimes (CPJIC) to provide more accurate charge prediction. The concept of crime interpretation is introduced into CPJIC, which alleviates the problems resulting from data imbalance and confusing charges. With the technique of embedding, both fact descriptions and crime interpretations are embedded into a low-dimensional vector space by a neural network, yielding a computable implementation of charge prediction. The experimental results demonstrate that CPJIC identifies low-frequency and confusing charges better than previous work.


I. INTRODUCTION
Every year, a large number of legal and criminal cases occur in real life [1], [2]. Making a judgment in a criminal case is a significant task, requiring a fair attitude, rational judgment, rich experience and legal knowledge [3], [4]. Conviction by human judgment often suffers from a heavy workload and strict work requirements. Recently, with artificial intelligence (AI) being introduced into case decisions, many AI-based legal assistant systems have been explored to ease the pressure of manual judgment [5], [6]. Charge prediction, which predicts the defendants' charges by analyzing the factual description in a case document, has become an important task in legal assistant systems [7]-[9].
The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Asif.
Because case materials are mostly organized as unstructured texts, charge prediction can be regarded as a text classification problem, in which the similarity among multiple texts is analyzed [10], [11]. In particular, neural-network-based models, with their ability to capture richer context information, have been widely applied in charge prediction [12], [13], and the accuracy of charge prediction has been greatly improved thanks to the self-learning and self-adaptability of neural networks. Furthermore, the characteristics of legal provisions can also be brought into neural networks [14] to build more general rules for charge prediction. Although this also introduces much irrelevant information (such as the detailed rules of some specific penalties), the idea of introducing legal knowledge proposed in [14] does inspire our work.
In this background, we note that the data of crime facts are seriously imbalanced.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. The imbalance of the distribution in the charge dataset. The distribution of the charge data is drawn as both a frequency chart (a) and a violin-plot chart (b): the former directly shows the occurrence frequency of each charge, while the latter depicts both the groups of charges and the kernel density of the charge distribution. Only a few charges have high occurrence frequencies (the peak regions in (a)), and the whole dataset is seriously polarized (the violin-plot shape in (b), i.e. the extremely narrow high-frequency region and the broad low-frequency region). Note that some regions of the violin plot show negative frequency values; this only mirrors the fitted trend of the data in mathematical statistics and has no physical meaning [15].

As shown in Figure 1, 149 distinct
charges from the criminal cases published on China Judgments Online 1 are analyzed with frequency and violin-plot charts [15]. On China Judgments Online, the proportion of charges of theft and robbery is close to 50%; in contrast, the proportion of dozens of low-frequency charges, such as embezzlement of specific funds and property or incitement to violence, is less than 5%. The experimental data used in most current works have usually been filtered and processed, so that the low-frequency charges are ignored and only high-frequency charges remain. As a result, low-frequency charges are hard to predict, even though high-precision classification results are reported. In real-world scenarios, both high-frequency and low-frequency charges should be considered in the process of charge prediction. Besides, there are also some similar but distinct charges (called confusing charges in this paper) among the low-frequency charges in the datasets, which reduce the legibility of the data. Data imbalance and illegibility aggravate the difficulty of charge prediction (especially for low-frequency charges). To address this issue, some attributes of charges have been introduced as a mid-level mapping [16] to implement fine-grained identification. Nevertheless, there still exist some confusing charges (such as theft and fraud, or embezzlement of funds and embezzlement of specific property) that have exactly the same attributes. In this situation, it is still very hard to predict charges accurately.
In this study, we propose a model of charge prediction via the judicial interpretation of crimes (CPJIC) to solve the problems mentioned above. As in previous works, text vectorization and attribute mapping are still used as the underlying techniques in CPJIC. Furthermore, the judicial interpretation of crimes (abbr. crime interpretation) is introduced into CPJIC to counteract the imbalance and illegibility of the charge data. Compared with general legal terms, crime interpretations have stronger discrimination and less information redundancy because of their compact texts and unique concepts. This helps not only in tuning the imbalance of the data but also in distinguishing the confusing charges. CPJIC employs the technique of embedding from Natural Language Processing (NLP) [17], and uses a neural network to encode the crime facts and interpretations into a low-dimensional vector space, by which computable charge prediction is implemented.

1 http://wenshu.court.gov.cn
To sum up, the contributions of this paper are as follows:
(1) We focus on the problem of data imbalance and confusion in charge prediction, and introduce crime interpretation to assist the prediction of charges.
(2) We propose a new dynamic learning framework, which enhances the interpretability and capability of the model through a multi-step process of capturing and expanding text features.
(3) We validate the model on real-world datasets. The results show that the CPJIC model, having introduced crime interpretation into charge prediction, identifies charges better than previous work.

II. RELATED WORK
Charge prediction infers the categories of facts from text descriptions. Traditional machine learning methods utilize manual feature extraction and shallow classification models to obtain text features. Some statistics-based methods, such as Bayes [18], [19], Support Vector Machine (SVM) [20], [21], Term Frequency-Inverse Document Frequency (TF-IDF) [22] and k-Nearest Neighbor (KNN) [10], [11], have been employed and have made progress in document classification. With direct feature engineering involved, these methods are easy to interpret and suitable for small-scale datasets. However, because they fail to capture the latent deep features in texts, their scalability and classification ability are weak.
To make up for the deficiencies of traditional methods, classification models based on deep learning have attracted more and more attention recently. Through multi-layer neural networks, these models usually employ word embedding techniques [23], [24] (e.g. Word2Vec [25], GloVe [26]) from natural language processing (NLP) [27] to capture richer semantic information. Most of these models have complex neural network structures [28]. Some classic neural network models, such as the Convolutional Neural Network (CNN) [12], Long Short-Term Memory (LSTM) [29], the attention-based Bidirectional Recurrent Neural Network (BRNN) [30] and the Recurrent Convolutional Neural Network (RCNN) [31], can effectively capture local semantic information and global semantic representations in texts. In addition, Graph Convolutional Networks (GCN) [32], [33] convert the text classification problem into a node classification problem and have achieved success in many applications [34], [35]. BERT [36] overcomes the problem of representing word polysemy with static word vectors. Optimization algorithms based on metaheuristics [37], [38] improve the prediction effect of neural network models by adjusting the calculation strategy. In particular, aiming at the problem of data imbalance, Gao et al. [39] and Geng et al. [40] have proposed the Hybrid Attention-based Prototypical Network and the Induction Network, respectively, to study class-wise representations. These models produce high-quality representations of the complex patterns embedded in data, but they only focus on the information contained in the textual facts while ignoring diverse background knowledge. In fact, real-world convictions should follow certain legal provisions, so it is not comprehensive to predict charges only by analyzing the text. To address this deficiency, Luo et al.
[14] have proposed a hierarchical attention network to predict charges and extract the relevant articles synchronously, in which some information from legal provisions is used as external knowledge to capture the latent semantic information of texts. But this work paid little attention to the imbalance of criminal facts and the redundancy of legal texts. Besides, Hu et al. [16] have proposed an attention-based neural model that incorporates several discriminative legal attributes, among which some attributes are exactly the same. Chen et al. [41] have fused multiple kinds of charge information into a legal graph and analyzed the charge space for prediction, but have ignored the context information of the judicial text. This also affects the final prediction effect of the model.
The work of charge prediction is also related to legal assistant systems. In this field, answering legal questions and searching for relevant cases have been extensively studied [13], [42], [43]. Some attempts consider charge prediction as part of automatic judgment prediction [44]-[46]. These studies capture the complex semantic interaction among the fact description, the plaintiff's request, the legal articles, etc., which makes the prediction more reasonable in the legal scenario. To improve the interpretability of charge prediction, Chao et al. [47] extract short but charge-decisive text snippets from case descriptions, and utilize a dynamic attention mechanism to capture the semantic information in the snippets and better predict charges. Nevertheless, these methods only take high-frequency charges as the target of prediction, ignoring the diversity caused by low-frequency charges in reality.
Different from previous work, this study proposes a neural network prediction model that introduces crime interpretation to capture the latent differences among different charges.

III. MAIN IDEA
In this study, focusing on the classification of imbalanced datasets and confusing charges, we introduce crime interpretation as external knowledge into the classification to improve charge prediction. The difficulty lies in that crime behaviors are too varied to be represented exactly. Fortunately, crime behaviors can be mapped to some specific charges, and each of these charges, called a prejudged charge in this study, has a specific crime interpretation in the judicial field. We take the crime interpretations of the prejudged charges, although they may not be the final predicted charges, as the external knowledge of the fact descriptions. The overall architecture of the CPJIC model is shown in Fig. 2, where the fact information about criminal cases is extended by crime interpretations, and the charge prediction is thereby performed more accurately. The specific calculation of the prediction comes from the embedding techniques of NLP [23]-[26], by which the semantic information in both the behavior facts and the interpretations of a crime is embedded into a low-dimensional Euclidean space and thus represented as a series of low-dimensional vectors. In this process, the initial vectors are input into a neural network and the representation vectors are obtained by machine learning; the final prediction results are calculated on the representation vectors, as shown in Fig. 3.

IV. METHODOLOGY OF CPJIC
The detailed architecture of CPJIC is shown in Fig. 4. CPJIC follows the framework of neural-network-based text categorization [12], [13]. The input is the factual description of a case, and the output is a probability distribution over charges. The purpose is to find the correct classification result for the given labeled fact text.
As illustrated in Fig. 4, CPJIC includes three main function modules: 1) the Fact Encoder, which generates the vector representations of the fact descriptions (i.e. the fact vectors); 2) the Crime Interpretation Encoder, which deduces the interpretations of the crimes highly related to the factual description, and converts them into vector representations; 3) the Charge Predictor, which integrates the vector representation of the crime interpretation with the weighted fact vector to predict the final probability distribution.
Note that the calculation process of charge prediction is performed jointly by these three modules in the form of a neural network. In this study, we use the classical Long Short-Term Memory (LSTM) [29] neural network to implement the calculation. The LSTM network can deduce the meaning of the current word by learning from all previous words. Furthermore, it can also handle the problem of long-distance dependence effectively.
Remark 1 (LSTM): LSTM is a special recurrent neural network that can memorize vector sequences of indefinite length. Every cell in the network has some corresponding gates, which determine whether the input is essential to remember and can be utilized as output.

A. FACT ENCODER
The fact encoder is responsible for converting the text description of the crime fact into a vector, as illustrated in Fig. 4. We first transform every word of the text description into a word vector as input to the LSTM network, and then use the LSTM to embed the whole text description into a vector.
The word embedding technique of Word2Vec [25] is used to encode the discrete word sequence of a text into a group of word vectors. The textual description of a criminal fact is denoted by x = {x_1, x_2, ..., x_i, ..., x_n}, where x represents the sentence of the fact description, x_i denotes the i-th word of x, and n is the number of words in the sentence. By word embedding, every word x_i ∈ x is transformed into a d-dimensional vector x_i ∈ R^d. Correspondingly, the whole description text can be represented by a series of vectors, that is

X = Emb(x) = {x_1, x_2, ..., x_n}, x_i ∈ R^d. (1)

These vectors are successively input into the LSTM network. In each iterative step t ∈ [1, n] of the LSTM, the input word vector in X is denoted by x_t. Then, by the processing of the input gate i_t (which determines how much of the network input is saved to the cell state at the current time), the forget gate f_t (which determines how much of the cell state at the previous moment is retained at the current moment) and the output gate o_t (which determines how much of the cell state is emitted as the current output of the LSTM), the output hidden state h_t of each LSTM cell is obtained as follows:

i_t = σ(W_i · [h_{t-1}; x_t] + b_i)
f_t = σ(W_f · [h_{t-1}; x_t] + b_f)
o_t = σ(W_o · [h_{t-1}; x_t] + b_o)
Ĉ_t = tanh(W_c · [h_{t-1}; x_t] + b_c)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ Ĉ_t
h_t = o_t ⊙ tanh(C_t) (2)

where σ(·) is the sigmoid activation function; tanh(·) is the hyperbolic tangent function; W_{(i,f,o,c)} and b_{(i,f,o,c)} represent weight matrices and bias vectors, respectively; Ĉ denotes the abstract information of the current state, and C is the cell state, which combines the current state with the long-term state. By this process, the crime fact is transformed into a sequence of hidden states h_t of the LSTM neural network. Note that every h_t is used in the predictor.
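The gate computations described above can be sketched in plain NumPy. This is an illustrative sketch only: the stacked weight layout (one matrix holding all four gate pre-activations) and the toy dimensions are our own choices, not the paper's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: input/forget/output gates, candidate cell state, new cell and hidden states."""
    z = W @ np.concatenate([h_prev, x_t]) + b  # all four gate pre-activations at once
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    c_hat = np.tanh(z[3 * H:])     # candidate cell state
    c = f * c_prev + i * c_hat     # new cell state: keep old state via f, add new via i
    h = o * np.tanh(c)             # hidden state emitted at this step
    return h, c

def encode_facts(X, W, b, hidden):
    """Run the LSTM over word vectors X (n x d) and return every hidden state h_t."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    states = []
    for x_t in X:
        h, c = lstm_step(x_t, h, c, W, b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d, hidden, n = 4, 3, 5                        # toy sizes for illustration
W = rng.normal(size=(4 * hidden, hidden + d))
b = np.zeros(4 * hidden)
X = rng.normal(size=(n, d))                   # stands in for the embedded word vectors
states = encode_facts(X, W, b, hidden)        # one hidden state per word
```

Because the hidden state is an output-gated tanh of the cell state, every component of every h_t stays strictly inside (-1, 1).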

B. CRIME INTERPRETATION ENCODER
The crime interpretation encoder is responsible for converting the interpretations of a crime into vectors. We extract the key words from the crime fact descriptions, and then derive the prejudged charges from the constructed keyword-charge lexicon. Finally, via the prejudged charges, the crime interpretations are obtained from the off-the-shelf charge glossary and transformed into low-dimensional vectors.
According to this idea, the key words are first extracted via word frequency statistics based on term frequency-inverse document frequency (TF-IDF) [48], which evaluates the importance of words in the fact descriptions. The quantitative definition of TF-IDF is given in Formula (3):

tf_{i,j} = n_{i,j} / Σ_k n_{k,j},  idf_i = log(|D| / |{d : word i ∈ d}|),  tfidf_{i,j} = tf_{i,j} × idf_i (3)

where the TF value, denoted by tf_{i,j}, mirrors the frequency with which a given word i appears in a specific text j; IDF, denoted by idf_i, measures the universal importance of word i across all texts by a logarithmic function; n_{i,j} is the number of occurrences of word i in text j, and |D| is the total number of texts in the corpus. The words with high TF-IDF values are called the key words.
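The TF-IDF weighting above can be computed directly from token counts. A minimal sketch with the stdlib follows; it uses the plain (unsmoothed) IDF form, whereas production implementations often add smoothing, and the toy corpus is invented for illustration.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of tokenized texts. Returns one {word: tf-idf score} dict per text."""
    D = len(corpus)
    # document frequency: in how many texts each word appears
    df = Counter(w for text in corpus for w in set(text))
    scores = []
    for text in corpus:
        counts = Counter(text)
        total = len(text)
        scores.append({
            w: (c / total) * math.log(D / df[w])  # tf * idf
            for w, c in counts.items()
        })
    return scores

# Hypothetical three-text corpus of simplified fact descriptions.
docs = [["steal", "shop", "night"],
        ["steal", "car", "night"],
        ["forge", "contract", "sign"]]
s = tf_idf(docs)
```

In this toy corpus, "forge" (appearing in one text) scores higher than "steal" (appearing in two), matching the intuition that rarer words are more discriminative key words.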
Remark 2: In this study, to obtain the key words for every category of crime texts, the crime texts with the same charge label in the training corpus are first merged into one category, and then the key words of each category are computed by TF-IDF.
Next, based on the key words obtained, we construct a lexicon for every charge via the technique of part-of-speech (POS) tagging [49]. Firstly, the POS of each key word is extracted, and only the words tagged with POS ''v'' (meaning verb) are retained. In each category of crime texts, the first k key words 2 with the highest weights are selected as the relevant lexicon. In this paper, this process is called the tfidf-vw method. Then, as shown in Fig. 5, for each crime fact, we extract the prejudged charges highly correlated with the fact description. A higher correlation here means that more words in the charge lexicon are matched with the fact description, and such a charge is regarded as a prejudged charge. Based on the extracted prejudged charges, the corresponding crime interpretations are obtained by querying the off-the-shelf legal glossary.
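The lexicon-matching step can be sketched as below. The lexicons and charge names here are hypothetical placeholders; in the model they would come from the tfidf-vw method applied to the training corpus.

```python
# Hypothetical charge lexicons (top-k verb key words per charge), standing in for
# the lexicons built by the tfidf-vw method.
LEXICONS = {
    "theft":        {"steal", "take", "break"},
    "fraud":        {"deceive", "forge", "sign"},
    "embezzlement": {"divert", "occupy", "manage"},
}

def prejudged_charges(fact_tokens, lexicons, top=3):
    """Rank charges by how many of their lexicon words match the fact description."""
    fact = set(fact_tokens)
    ranked = sorted(lexicons, key=lambda c: len(lexicons[c] & fact), reverse=True)
    return ranked[:top]

charges = prejudged_charges(["he", "deceive", "victim", "sign", "contract"],
                            LEXICONS, top=2)
```

The charge whose lexicon overlaps most with the fact tokens ("fraud" here, with two matching verbs) ranks first, and its crime interpretation would then be looked up in the legal glossary.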
Referring to the method described by Formula (2) in Section IV-A, an LSTM is employed to obtain the hidden-state outputs ĥ_t of the text sequence of the crime interpretations:

Y = Emb(y) = {y_1, y_2, ..., y_l},  ĥ_t = LSTM(y_t, ĥ_{t-1}) (4)

where Emb(·) is the same embedding function as in Formula (1), and Y denotes the series of vectors representing the crime interpretations, similar to Formula (1).

C. CHARGE PREDICTOR
1) ATTENTION-BASED ATTRIBUTE REPRESENTATION
To tune the imbalance of charge data, the attention mechanism from NLP [50] is introduced to select the most important factors in the descriptions. The attention mechanism selectively learns from the input sequence by retaining the intermediate outputs of the LSTM encoder, and then associates the output sequence with the input when the model produces results. Following this mechanism, ten charge attributes [16] are introduced, and the weight of the k-th attribute at time step i, denoted by a_{k,i}, is computed as follows to mirror the importance of attribute k:

a_{k,i} = exp(tanh(W_a h_i)^T u_k) / Σ_j exp(tanh(W_a h_j)^T u_k) (5)

where the vector u_k is the intermediate result of the LSTM corresponding to the k-th attribute, which is used to calculate how informative an element is to attribute k, and W_a is a weight matrix shared by all attributes. The final attribute representation is then obtained by:

g_k = Σ_t a_{k,t} h_t, k = 1, ..., m (6)

where m is the number of charge attributes, and t ranges over the time steps of the current state layer.
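The attribute attention computation can be checked with a small NumPy sketch; the dimensions and random values are illustrative only, not the paper's settings.

```python
import numpy as np

def attribute_attention(H, u_k, W_a):
    """Attention weights of attribute k over hidden states H (n x h), plus the
    attribute-aware summary vector (the attention-weighted sum of the h_t)."""
    scores = np.tanh(H @ W_a.T) @ u_k     # one scalar score per time step
    e = np.exp(scores - scores.max())     # numerically stable softmax
    a = e / e.sum()                       # weights sum to 1 across time steps
    return a, a @ H                       # weights and weighted sum of hidden states

rng = np.random.default_rng(1)
n, h = 6, 4                               # toy sequence length and hidden size
H = rng.normal(size=(n, h))               # stands in for the LSTM hidden states
W_a = rng.normal(size=(h, h))             # shared weight matrix
u_k = rng.normal(size=h)                  # context vector for attribute k
a, g = attribute_attention(H, u_k, W_a)
```

Because the weights pass through a softmax, they are strictly positive and sum to one, so the summary vector g is a convex combination of the hidden states.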

2) COMBINATION REPRESENTATION OF CRIME FACT
By concatenating the hidden-state vectors of the fact description and the crime interpretation, a sequence of fused vectors is obtained:

H_t = [h_t; ĥ_t], H = {H_1, H_2, ..., H_n} (7)

Then, H is fed into the max-pooling layer of the LSTM network to produce the ultimate representation vector q, with

q^(i) = max(H_1^(i), ..., H_n^(i)) (8)

where q^(i) and H_{1,...,n}^(i) represent the i-th components of the vectors q and H_{1,...,n}, respectively.
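The fuse-then-pool step described above reduces to a concatenation followed by a component-wise maximum over time. A minimal sketch, assuming the two hidden-state sequences have the same length:

```python
import numpy as np

def fuse_and_pool(fact_states, interp_states):
    """Concatenate fact and interpretation hidden states per step, then
    max-pool each component over time to get the representation vector q."""
    H = np.concatenate([fact_states, interp_states], axis=1)  # (n, 2h) fused sequence
    return H.max(axis=0)                                      # (2h,) component-wise max

# Tiny hand-made example: two time steps, hidden size 2 per encoder.
f = np.array([[0.1, -0.2], [0.4, 0.3]])   # fact-encoder hidden states
g = np.array([[0.0, 0.5], [-0.1, 0.2]])   # interpretation-encoder hidden states
q = fuse_and_pool(f, g)
```

Each component of q keeps the strongest activation of that feature anywhere in the sequence, which is what makes max-pooling insensitive to where in the text a feature fires.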

3) OUTPUT OF CHARGE PREDICTION
To make the final charge prediction, CPJIC combines the fact representation and the attribute representation, and uses a softmax classifier to obtain the predicted charge distribution z:

v = [q; r], z = softmax(W v + b) (9)

where r is the average of the m attribute-aware representation vectors; v denotes the final fact representation vector; and W and b represent the weight matrix and the bias vector, respectively. Based on cross entropy, the training objective can be written as the following loss function:

L = -Σ_{i=1}^{N} o_i log(z_i) (10)

where N is the number of charges; o_i is the ground-truth label, and z_i is the prediction probability. The resulting z corresponds to the final charge prediction.
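The softmax prediction and cross-entropy objective described above can be verified numerically; the sizes (three charges, ten attributes) and random values below are illustrative only.

```python
import numpy as np

def predict_and_loss(q, attr_reps, W, b, onehot):
    """Softmax charge distribution from the fused fact vector q and the mean of
    the attribute-aware representations, plus cross-entropy against the label."""
    r = attr_reps.mean(axis=0)           # average of the m attribute representations
    v = np.concatenate([q, r])           # final fact representation vector
    logits = W @ v + b
    e = np.exp(logits - logits.max())    # numerically stable softmax
    z = e / e.sum()                      # predicted charge distribution
    loss = -np.sum(onehot * np.log(z))   # cross-entropy training objective
    return z, loss

rng = np.random.default_rng(2)
q = rng.normal(size=4)                   # fused fact representation
attr = rng.normal(size=(10, 4))          # ten attribute-aware representations
W = rng.normal(size=(3, 8))              # three toy charge classes
b = np.zeros(3)
y = np.array([0.0, 1.0, 0.0])            # one-hot ground-truth label
z, loss = predict_and_loss(q, attr, W, b, y)
```

With a one-hot label, the loss reduces to -log(z_true); it is zero only when the model assigns the true charge probability 1, so it is strictly positive here.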

V. EXPERIMENTS
To evaluate the proposed CPJIC model, we conduct a series of experiments on the public datasets and analyze the experimental results.

A. DATASETS
In this study, three public datasets of different scales [16], derived from the criminal cases published by the Chinese government on China Judgments Online, 3 are composed of many fact descriptions of charges. They are used as the basic datasets, as listed in Table 1. These datasets include imbalanced crime fact texts and a variety of charges (149 distinct charges in total), but do not contain any crime interpretations. Therefore, we use the crime interpretations collected from China Find Law Online 4 to extend the datasets. Table 2 lists an example of the crime interpretations of confusing charges. It is worth noting that the textual description of each crime interpretation is concise and unique.

B. EXPERIMENT SETUP
We use the open-source tool LTP 5 for word segmentation and POS tagging on the Chinese extended dataset. In the experiments, the pre-trained word embeddings are generated by the Skip-Gram model [25] on all datasets. The dimension of the word embeddings is set to 100, and the size of the hidden layer in the LSTM is also set to 100. In the CNN layer, we set the filter widths to 2, 3 and 4, with each filter size containing 128 feature maps. In the RNN+attention-based model, the sizes of the hidden layer and the attention layer are set to 128 and 100, respectively. For all experiments, the Adam algorithm [51] is used to train the model; the dropout rate [52], the learning rate and the batch size are set to 0.5, 0.001 and 64, respectively. The metrics of accuracy, precision, recall and F1 are employed to evaluate the experimental results.
In addition, our experiments are performed on a PC with a 1080Ti GPU, running a Linux operating system (OS) with TensorFlow and Python configured as the experimental environment. Because the CPJIC model needs to extract the prejudged charges from each text dynamically, the training time increases from 8/14/26 h to 9.5/17.5/32 h (corresponding to the three scales of datasets).

C. BASELINES
For comparison, some off-the-shelf models are used as the baselines: • CNN. CNN has been widely applied to text categorization [12]. We add regularization into CNN to filter the text, and implement the CNN-based classifier.
• RNN+attention. A bidirectional RNN [53] is implemented and trained, which includes a forward RNN cell and a backward RNN cell. For simplicity, we only consider sentence-level attention here.
• Fact-Law Attention Model. This is a hierarchical attention network proposed by Luo et al. [14], which predicts the charges and extracts the relevant articles synchronously.
• Few-Shot+Attributes Model. This is an attention-based neural model proposed by Hu et al. [16], which incorporates some discriminative legal attributes.

D. RESULT AND ANALYSIS 1) OVERALL EFFECT
As shown in Table 3, the CPJIC model significantly outperforms the baselines on all tested datasets, with better effectiveness and robustness than the baseline models. 6 The improvement over previous methods is especially pronounced on the Medium-Scale dataset.
It is worth noting that, compared with traditional neural-network classification models (e.g. CNN), the methods that introduce external knowledge perform better. For example, Fact-Law, Few-shot+attr and CPJIC support charge prediction with introduced external knowledge, while CNN and RNN+att only utilize the fact text to train classifiers; the former gain more positive information than the latter. Furthermore, each model has its own characteristics. Fact-Law introduces the laws related to charges, whose texts are long and redundant. Few-shot+attr introduces ten attributes of charges, among which the attributes of some charges are identical. Comparatively, CPJIC introduces the concise and accurate crime interpretation, which better reflects the differences between charges.
In addition, the Rep-emb model can also improve the classification effect, which shows that crime interpretation itself can expand the semantic features of word vectors.

2) IMBALANCED CHARGE EXTRACTION ANALYSIS
To further verify the effectiveness of the CPJIC model in identifying imbalanced charges, the validity of charge prediction at different frequency scales is tested. The results are listed in Fig. 6, where the charges are divided into three groups according to their frequencies: low-frequency (< 100 cases), medium-frequency (≥ 100 and < 1000 cases) and high-frequency (≥ 1000 cases). Compared with the Few-shot+attr model, the CPJIC model better predicts high-frequency charges, and also improves the prediction of low-frequency charges on all datasets. The CPJIC model takes the lead when predicting low-frequency charges on the small dataset, which verifies its effectiveness in handling imbalanced or small-scale data.
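The frequency grouping used in this analysis can be sketched as follows; the case counts in the example are invented for illustration.

```python
from collections import Counter

def bucket_charges(case_charges):
    """Split charges into low/medium/high frequency groups by case count,
    following the frequency thresholds of the analysis above."""
    freq = Counter(case_charges)
    buckets = {"low": [], "medium": [], "high": []}
    for charge, n in freq.items():
        if n < 100:
            buckets["low"].append(charge)       # low-frequency: < 100 cases
        elif n < 1000:
            buckets["medium"].append(charge)    # medium-frequency: 100-999 cases
        else:
            buckets["high"].append(charge)      # high-frequency: >= 1000 cases
    return buckets

# Hypothetical case list with one charge per frequency group.
cases = ["theft"] * 1500 + ["fraud"] * 300 + ["incitement"] * 20
b = bucket_charges(cases)
```

Per-group metrics (e.g. macro-F1 computed separately on each bucket) then expose whether a model's overall score is carried only by the high-frequency charges.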

3) CONFUSING CHARGE EXTRACTION ANALYSIS
The prediction validity of CPJIC on various confusing charges is also tested, and the results are listed in Table 4. We obtain the criminal offences with hierarchical structure published by the Chinese government on China Judgments Online, 7 and consider the charge attributes proposed by the Few-shot+attr model. Charges with identical attributes under the same hierarchical structure are considered to be confusing charges (in the experiments, 120 out of 149 charges meet these conditions). Compared with the Few-shot+attr model, the CPJIC model achieves more than 4% improvement when predicting confusing charges on all datasets, which verifies its effectiveness in dealing with noisy data.

4) PREJUDGED CHARGE EXTRACTION ANALYSIS
The validity of prejudged charge extraction is tested to verify the accuracy of CPJIC when introducing crime interpretation, and the results are listed in Table 5. We use the proposed tfidf-vw method to obtain three prejudged charges for each fact text. The proportion of cases whose prejudged charges include the final predicted charge is more than 92%. This shows that the prejudged charges are closely related to the final prediction results, which helps the model capture the latent behavior characteristics of the fact text.

E. CASE STUDY
To demonstrate the correlation between crime interpretation and factual text, we randomly select an example from the experiments for analysis. As shown in Fig. 7, the highlighted text directly reflects the corresponding crime behaviors.
The example shows that the crime interpretations of the prejudged charges are closely related to the case. Intuitively, we can make a preliminary analysis based on the crime interpretations of the Crime of negligent death, the Crime of traffic accident and the Crime of Major Liability Accident, and then infer the charges involved in the case.

VI. CONCLUSION
Aiming at the imbalance and confusion in charge datasets, we propose a neural-network-based charge prediction model, called the CPJIC model, to provide more accurate charge prediction. Crime interpretation is introduced into CPJIC to prevent the prediction from suffering from the imbalance and illegibility of charge data. Furthermore, CPJIC employs the technique of embedding to encode crime facts and interpretations into a low-dimensional vector space. The final prediction results are calculated on the vectors by an LSTM neural network. Experimental results demonstrate that our model performs better than the baseline models on real-world datasets of different scales. In addition, we also validate the effectiveness of the acquisition of prejudged charges. Besides, our method also reflects that a case may relate to more than one charge. In the future, we will pay more attention to multi-label charge prediction, expand to more application scenarios, and make our research more practical.
XIAOJUN KANG (Member, IEEE) received the B.Sc. degree in computer science and technology from the China University of Geosciences, Wuhan, China, in 2000, and the Ph.D. degree in resource and environment information engineering from Huazhong Agricultural University, Wuhan, in 2011.
He is currently an Associate Professor with the School of Computer Science, China University of Geosciences. His current research interests include artificial intelligence, natural language processing, and bioinformatics.
CHENWEI WANG received the B.Sc. degree in network engineering from the China University of Geosciences, Wuhan, China, in 2018, where he is currently pursuing the M.Sc. degree in computer science and technology.
His research interests include knowledge graph, charge prediction, and machine reading comprehension.
LIJUN DONG received the B.Sc. degree in mechatronic engineering from the Nanjing University of Science and Technology, China, in 1999, and the Ph.D. degree in computer science and technology from the Huazhong University of Science and Technology, China, in 2008.
He is currently an Associate Professor with the School of Computer Science, China University of Geosciences, China. His research interests include network sciences and applications, knowledge graph, and information system security.

He is currently a Professor with the School of Public Administration, China University of Geosciences. His research interests include public policy and social governance. He is a member of the Chinese Public Administration Society.