TenRR: An Approach Based on Innovative Tensor Decomposition and Optimized Ridge Regression for Judgment Prediction of Legal Cases

With the development of big data and artificial intelligence technologies, the use of computers to assist judgments in legal cases has become a popular topic. Traditional methods for judgment prediction mainly depend on feature models and classification algorithms. However, feature models require considerable expert knowledge and manual annotation work. They depend strongly on the vocabulary and grammar information in datasets, which hinders the universality and accuracy of subsequent prediction algorithms. Meanwhile, the outputs of classification algorithms are discrete prediction results with coarse granularity. This paper proposes a new algorithm based on innovative tensor decomposition and ridge regression for judgment prediction of legal cases, namely, TenRR. TenRR is mainly divided into three steps. First, we propose a tensor representation method, namely, RTenr. RTenr expresses legal cases as three-dimensional tensors. Second, we propose an innovative tensor decomposition algorithm, namely, ITend. ITend decomposes original tensors representing legal cases into core tensors. Lastly, we propose an optimized ridge regression algorithm, namely, ORidge, to construct a judgment prediction model for legal cases. We further propose an optimization algorithm through which ORidge guides ITend; thus, core tensors obtained using ITend carry the tensor elements and tensor structure information that are most beneficial to improving the accuracy of ORidge. Core tensors greatly reduce the dimension of original tensors. They eliminate the meaningless, redundant, and inaccurate information in original tensors. Experiments show that our method has higher accuracy than traditional methods for judgment prediction.


I. INTRODUCTION
With the increasing maturity of big data and artificial intelligence technologies, the use of computers to assist judgments in legal cases has become a prominent research area. Judgment prediction algorithms mainly have the following two functions: (1) predict judgment results, which can provide a reference for judges, and (2) prevent the occurrence of wrongful convictions. Judgment prediction algorithms serve as a warning when a judge's decision differs greatly from the result predicted using prediction algorithms. For example, in two burglary cases involving $30,000 each, the judgment prediction algorithm sentenced 2 years, whereas the judge sentenced 5 years.
Previous research on judgment prediction was mainly based on feature models and classification algorithms. The former is used to model legal cases. The latter predicts the scope of judgments. These methods have many shortcomings. From the perspective of feature models, (1) substantial legal expertise and manual annotation are required, and the models depend strongly on the vocabulary and grammar in datasets. (2) Dimensional explosion and data sparseness occur easily. (3) Cases cannot be described from multiple perspectives. (4) Considerable inaccurate, meaningless, and redundant information exists. These issues seriously affect the accuracy and stability of subsequent prediction algorithms. From the perspective of classification algorithms, (1) the granularity is coarse, so detailed prediction results cannot be provided. (2) These algorithms depend strongly on training data and cannot accurately extract useful information from datasets.
This article proposes a new algorithm for judgment prediction, namely, TenRR. TenRR combines innovative tensor decomposition with optimized ridge regression. As shown in Figure 1, TenRR is mainly composed of three parts, namely, RTenr, ITend, and ORidge. First, we use RTenr to represent legal cases as three-dimensional tensors. Second, we decompose original tensors into core tensors by using ITend. Core tensors greatly reduce the dimension of original tensors. Lastly, we use core tensors to train ORidge and obtain a prediction model for legal case judgment. TenRR addresses the shortcomings of traditional judgment prediction algorithms. RTenr automatically extracts case features without the need for considerable expert knowledge and manual labeling. It characterizes cases from multiple directions. ITend largely avoids data sparseness and dimensional explosion. It also removes substantial inaccurate, meaningless, and redundant information by using mapping matrices. ORidge is based on a regression model, which can provide fine-grained prediction results. ORidge optimizes the tensor decomposition process in ITend through mapping matrices. Therefore, the obtained core tensors carry the tensor elements and tensor structure that are most beneficial to improving the accuracy of prediction methods.
The main contributions of this article are as follows: • A method based on tensor models for representing legal cases, namely, RTenr, is proposed. RTenr represents cases as three-dimensional tensors. It automatically extracts features without a large amount of legal expertise and manual labeling. This characteristic avoids the occurrence of sparse data. RTenr has a weak dependence on lexical and grammatical information in datasets.
• An innovative tensor decomposition algorithm, namely, ITend, is introduced. ITend decomposes original tensors representing legal cases into core tensors. Core tensors greatly reduce the dimension of original tensors. ITend removes redundant, meaningless, and inaccurate information from original tensors. It improves the accuracy of subsequent prediction algorithms.
• An optimization algorithm through which ORidge guides ITend is presented. ORidge uses this algorithm to guide the tensor decomposition process in ITend; hence, core tensors obtained using ITend carry the tensor elements and structure information that is most conducive to improving the accuracy of ORidge.
In the remainder of this article, Section II presents research on judgment prediction of legal cases. Section III describes related calculations used in this article. Section IV details the principles of our approach. Section V provides experimental results and analysis.

II. PREVIOUS WORK
Research on judgment prediction of legal cases mainly focuses on modeling legal cases and constructing prediction methods. Previous studies mainly used feature models to describe cases and classification methods to predict judgments. Classification methods mainly include machine learning algorithms and neural networks. At present, few studies exist on judgment prediction. Considering that legal case documents are text data, this article divides previous methods into four categories on the basis of text analysis techniques, namely, (1) prediction methods based on feature models, (2) prediction methods based on matrix decomposition, (3) prediction methods based on tensor models, and (4) prediction methods based on unsupervised tensor decomposition.
Prediction methods based on feature models refer to the combination of feature models and prediction algorithms. Gruginskie [1] proposed a method based on feature models and machine learning algorithms. This method represents cases as matrices. It uses various classification algorithms, such as support vector machines and neural networks, to predict judgments. Manes and Downing [2] recommended a method based on feature models and rules. Unlike the previous method, this method uses rule reasoning to complete judgment prediction. Prediction methods based on feature models have many deficiencies, including the following: (1) considerable expert knowledge and manual annotation are required; (2) cases cannot be described from multiple levels; and (3) data sparseness and dimensional explosion are prone to occur, which affects the accuracy and stability of subsequent prediction algorithms.
Prediction methods based on matrix decomposition refer to the combination of matrix decomposition and prediction algorithms. Jing [3] proposed a classification algorithm based on singular value decomposition (SVD). This method decomposes original matrices derived from feature models by using SVD. It uses the obtained matrices to train neural networks. Similarly, Li [4] solved the problems of data sparseness and dimensional explosion in feature models via matrix decomposition, which enhanced the accuracy and stability of prediction algorithms. Prediction methods based on matrix decomposition have deficiencies, including (1) the natural drawbacks of feature models and (2) the unguided nature of matrix decomposition, which may lead to the loss of information useful for prediction methods.
Prediction methods based on tensor models refer to the combination of tensor models and prediction algorithms. Wimalawarne [5] showed that regression and classification algorithms based on tensor models have advantages over matrix models. These methods represent cases as three-dimensional tensors and then train prediction algorithms by using such tensors. Tensor models automatically extract case elements. However, prediction methods based on tensor models have the following deficiencies: (1) dimensional explosion easily occurs; and (2) considerable redundant, useless, and meaningless information exists, which affects the accuracy of subsequent prediction algorithms.
Prediction methods based on unsupervised tensor decomposition refer to the combination of unsupervised tensor decomposition and prediction algorithms. Taguchi [6] reduced the dimension of original tensors via unsupervised tensor decomposition and used the obtained results as the input of prediction algorithms. Similarly, Zheng [7] proposed a method based on Tucker tensor decomposition. Prediction methods based on unsupervised tensor decomposition have poor guidance. They may cause loss of information useful for judgment prediction.

III. PRELIMINARIES
In this section, we provide the tensor-related definitions and calculation rules used in this article, including operations between tensors and matrices or vectors. The identity matrix is represented by E.
Definition 1 (Trace Norm): For a square matrix or a cubical tensor, the trace norm is the sum of the elements on the main diagonal. That is, given a matrix $M \in \mathbb{R}^{I \times I}$, the trace norm of $M$ is $Trace(M) = \sum_{i=1}^{I} M_{ii}$. Given a tensor $\chi \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, where $I_1 = I_2 = \cdots = I_N = I$, the trace norm of $\chi$ is $Trace(\chi) = \sum_{i=1}^{I} \chi_{i,i,\ldots,i}$.
Definition 2 (Frobenius Norm): For a vector, matrix, or tensor, the value of the Frobenius norm is the square root of the sum of squares of all elements. That is, given a tensor $\chi \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, the square of the Frobenius norm of $\chi$ is $\|\chi\|_F^2 = \sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} \chi_{i_1 i_2 \cdots i_N}^2$.
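The two definitions above can be illustrated with a minimal NumPy sketch (the function names are ours, chosen for illustration): the trace of a cubical tensor sums its superdiagonal elements, and the Frobenius norm is the square root of the sum of squares of all elements.

```python
import numpy as np

def tensor_trace(x: np.ndarray) -> float:
    """Definition 1: sum of superdiagonal elements x[i, i, ..., i]."""
    n = x.shape[0]
    assert all(dim == n for dim in x.shape)  # cubical tensor required
    return float(sum(x[(i,) * x.ndim] for i in range(n)))

def frobenius_norm(x: np.ndarray) -> float:
    """Definition 2: square root of the sum of squares of all elements."""
    return float(np.sqrt(np.sum(x ** 2)))

chi = np.arange(8, dtype=float).reshape(2, 2, 2)
print(tensor_trace(chi))                # 7.0  (chi[0,0,0] + chi[1,1,1] = 0 + 7)
print(round(frobenius_norm(chi), 3))    # 11.832  (sqrt(0 + 1 + 4 + ... + 49) = sqrt(140))
```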

Definition 3 (Tensor Vectorization): Given a tensor $\chi \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, the tensor vectorization of $\chi$ is $\chi_{vec} \in \mathbb{R}^{I_1 I_2 \cdots I_N}$. That is, the vectorization of a tensor refers to the vector formed by expanding all the elements of the tensor.
Definition 5 (Hadamard Product): Given two matrices $A, B \in \mathbb{R}^{I \times J}$, the Hadamard product of $A$ and $B$ is the element-wise product $(A * B)_{ij} = A_{ij} B_{ij}$. That is, for two vectors, matrices, or tensors with the same dimensions, the Hadamard product has the same dimension.
Definition 6 (n-Mode Product): Given a tensor $\chi \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and a matrix $A \in \mathbb{R}^{J \times I_n}$, the n-mode product $\chi \times_n A \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$ is defined element-wise as $(\chi \times_n A)_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} \chi_{i_1 i_2 \cdots i_N} A_{j i_n}$.
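The vectorization, Hadamard product, and n-mode product can likewise be sketched in NumPy (the helper `n_mode_product` is our own illustrative wrapper around `tensordot`, not part of any library API):

```python
import numpy as np

def vectorize(x: np.ndarray) -> np.ndarray:
    """Definition 3: expand all tensor elements into a vector."""
    return x.reshape(-1)

def n_mode_product(x: np.ndarray, a: np.ndarray, n: int) -> np.ndarray:
    """Definition 6: contract mode n of tensor x (I1,...,IN) with matrix a (J, In).
    The result has shape (I1, ..., J, ..., IN) with J in position n."""
    # tensordot contracts a's columns with mode n; moveaxis restores mode order
    return np.moveaxis(np.tensordot(a, x, axes=(1, n)), 0, n)

x = np.random.rand(4, 5, 6)
a = np.random.rand(3, 5)              # maps mode 1 from size 5 to size 3
b = np.ones_like(x)
print(vectorize(x).shape)             # (120,)
print((x * b == x).all())             # True: Hadamard product with ones leaves x unchanged
print(n_mode_product(x, a, 1).shape)  # (4, 3, 6)
```

Multiplying along a mode with the identity matrix leaves the tensor unchanged, which is a quick sanity check on the index convention.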

IV. METHODOLOGY
This article proposes a method based on innovative tensor decomposition and optimized ridge regression for judgment prediction of legal cases, namely, TenRR. As shown in Figure 2, TenRR is mainly composed of three modules. (1) RTenr. RTenr represents each legal case as a three-dimensional original tensor. (2) ITend. ITend decomposes the original tensor representing a legal case into a core tensor by using a set of mapping matrices. (3) ORidge. ORidge constructs the judgment prediction model; this article proposes an optimization method for ORidge with respect to the set of mapping matrices. ORidge controls the tensor decomposition process in ITend by optimizing the mapping matrices. As a result, the obtained core tensors carry the tensor elements and tensor structure information that is most conducive to improving the accuracy of TenRR.

A. RTenr
The premise of predicting judgments of legal cases is to model the cases. Traditional case-modeling methods are based on feature models, which have the following disadvantages: (1) a large amount of legal expertise and manual labeling is required; (2) dimensional explosion and data sparseness easily arise; and (3) feature models depend strongly on the lexical and grammatical information in datasets. This situation greatly increases the computational complexity and volatility of subsequent prediction algorithms while reducing their accuracy and stability.
This article proposes a method based on tensor models for describing legal cases, namely, RTenr. RTenr represents legal cases as three-dimensional tensors. RTenr mainly includes the following steps: (1) division of case modules, (2) filtering of vocabularies in each module, (3) matrixization of modules, and (4) generation of original tensors.
Division of case modules refers to the division of legal cases into multiple modules. In accordance with previous research and expert consultation, we divide each legal case into five modules, namely, subject, object, behavior, reason, and result modules. The subject module refers to the victims in legal cases and their background information. The object module refers to the suspects in legal cases and their background information. The behavior module refers to the process of committing crimes. The reason module refers to the cause of cases and the subjective attitude of victims and suspects. The result module refers to the property loss and social effect caused by cases.
Filtering of vocabularies in each module refers to the cleaning of the vocabularies in each module. All tensors representing legal cases must have the same dimensions; consequently, the number of vocabularies in each module must be the same. Filtering of vocabularies is mainly divided into three steps.
• Vocabulary reduction: Dictionaries of stop words and legal terms are constructed. Meaningless and redundant vocabularies in case modules are removed. Legal-related terms are retained to avoid the loss of case elements caused by word segmentation errors.
• Vocabulary ranking: The frequency and TF-IDF (term frequency-inverse document frequency) value of each vocabulary in case modules are calculated. The vocabularies are sorted in descending order of these values to ensure that words that are crucial to judgment prediction of legal cases rank first.
• Cutting or padding: The standard length of case modules is set on the basis of the distribution of sample lengths in the dataset of legal cases. Modules with more vocabularies than the standard length are cut. Modules with fewer vocabularies than the standard length are padded.
Matrixization of modules refers to the representation of each case module as a matrix. On the basis of the filtered vocabularies, we use Google's word2vec tool and a large number of Chinese corpora to train the word vector model. Vocabularies in case modules are represented as low-dimensional dense vectors. A padded word can be represented as a zero, mean, or random vector.
Generation of original tensors refers to representing legal cases as three-dimensional tensors. We merge the matrices representing case modules in a three-dimensional space and then obtain an original tensor representing a legal case.
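The cut-or-pad, embed, and stack steps above can be sketched as follows. This is a simplified illustration with a hypothetical toy vocabulary of random vectors standing in for a trained word2vec model; the dimensions and module word lists are invented for the example.

```python
import numpy as np

EMB_DIM, STD_LEN = 8, 6                       # toy embedding size and standard module length
rng = np.random.default_rng(0)
# Hypothetical stand-in for a trained word2vec model
toy_vocab = {w: rng.normal(size=EMB_DIM) for w in
             ["victim", "suspect", "theft", "night", "intent", "loss"]}

def module_matrix(words, std_len=STD_LEN):
    """Cut or zero-pad a filtered module to the standard length, then embed each word."""
    words = words[:std_len]                                   # cutting
    rows = [toy_vocab.get(w, np.zeros(EMB_DIM)) for w in words]
    rows += [np.zeros(EMB_DIM)] * (std_len - len(rows))       # zero-vector padding
    return np.stack(rows)                                     # (STD_LEN, EMB_DIM)

modules = {
    "subject":  ["victim"],
    "object":   ["suspect"],
    "behavior": ["theft", "night"],
    "reason":   ["intent"],
    "result":   ["loss"],
}
# Merge the five module matrices in a third dimension: the original tensor
original_tensor = np.stack([module_matrix(ws) for ws in modules.values()])
print(original_tensor.shape)  # (5, 6, 8): modules x standard length x embedding dim
```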
Compared with traditional feature models, RTenr has two advantages. One is the automatic extraction of case features: RTenr requires neither a large amount of legal expert knowledge nor manual annotation, and it avoids data sparsity. The other is the description of cases from various levels: RTenr captures the potential correlations among case modules. Based on the abovementioned characteristics, RTenr greatly improves the accuracy and universality of subsequent algorithms.

B. ITend
Original tensors representing legal cases derived using RTenr cannot be directly used as inputs of subsequent judgment prediction algorithms, mainly due to the following two points.
(1) When the size of legal cases is large, original tensors obtained using RTenr have high dimensions, which easily cause dimensional explosion. This phenomenon seriously increases the computational complexity of subsequent judgment prediction algorithms. (2) Original tensors obtained using RTenr carry a large amount of redundant, meaningless, and inaccurate information, which seriously affects the accuracy of subsequent judgment prediction algorithms.
This article uses the tensor decomposition strategy to solve the aforementioned problems. Tensor decomposition methods decompose original tensors into core tensors and a series of factor matrices. Core tensors represent the main tensor elements and tensor structure information of original tensors. Tensor decomposition methods have the following two advantages. (1) They greatly reduce the dimension of original tensors, thereby decreasing the computational complexity of subsequent prediction algorithms. (2) They remove the meaningless and redundant information in original tensors and correct the inaccurate information. However, traditional tensor decomposition methods (such as CP or Tucker tensor decomposition algorithms) are unsupervised and offer poor interpretability and guidance.
This article proposes an innovative tensor decomposition algorithm, namely, ITend. Unlike traditional tensor decomposition algorithms, ITend enhances interpretability by setting a set of mapping matrices. As shown in Figure 3, ITend is mainly divided into two parts. (1) Calculation of the transitional tensor. The value of the transitional tensor is calculated using original tensors and the set of mapping matrices.
(2) Calculation of the core tensor. The value of the core tensor is calculated using the transitional tensor. ITend maps original tensors into a space represented by the set of mapping matrices to generate core tensors. Subsequent judgment prediction algorithms intervene in the process of tensor decomposition in ITend by optimizing the set of mapping matrices. Finally, core tensors obtained using ITend carry tensor elements and tensor structure information that is most conducive to improving the accuracy of subsequent judgment prediction algorithms.

1) CALCULATION OF THE TRANSITIONAL TENSOR
Subsequent judgment prediction algorithms intervene in the tensor decomposition process through the set of mapping matrices in ITend. The set of mapping matrices can be interpreted as the space that is most conducive to improving the accuracy of subsequent algorithms. ITend is mainly divided into two steps: (1) calculation of the transitional tensor and (2) calculation of the core tensor. Core tensors obtained using ITend represent the tensor elements and tensor structure information that is most conducive to improving the accuracy of subsequent prediction algorithms.
In this article, the transitional tensor is calculated from the original tensor and the set of mapping matrices. The transitional tensor represents the projection of the original tensor on a space represented by the set of mapping matrices. Problem 7 provides the formal definition of the problem to be solved in this section. Under the constraints of the set of mapping matrices, the transitional tensor contains the main tensor elements and tensor structure information in the original tensor. The transitional tensor is a bridge connecting the original tensor and the set of mapping matrices. It provides a strong support for the subsequent calculation of the core tensor.
Problem 7: Given the original tensor $\chi \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ derived by RTenr, where $\chi$ represents a legal case, and the set of mapping matrices $\{C_n\}$, $C_n \in \mathbb{R}^{J_n \times I_n}$, $n \in [1, N]$, calculate the value of the transitional tensor $\nu \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ that minimizes the following objective function:
$$F_{Pro1}(\nu) = \left\| \chi - \nu \times_1 C_1^T \times_2 C_2^T \cdots \times_N C_N^T \right\|_F^2, \quad (1)$$
where $\nu$ can be interpreted as the mapping of $\chi$ in the space represented by the set of mapping matrices $\{C_n\}$.
Lemmas 8-16 establish auxiliary properties of the trace, the Frobenius norm, tensor vectorization, and the n-mode product that are used in the sequel; for example, the square of the Frobenius norm of a matrix $A \in \mathbb{R}^{I \times J}$ can be computed from the singular value decomposition of $A$. These lemmas provide support for the proof of Lemma 17.
Lemma 17: The objective function $F_{Pro1}$ in Problem 7 can be transformed into the following form:
$$F_{Pro1}(\nu) = \left\| \nu - \chi \times_1 B_1 \times_2 B_2 \cdots \times_N B_N \right\|_F^2, \quad (2)$$
where $B_n$ satisfies the following condition:
$$B_n = U^{(n)} (\Sigma^{(n)})^{-1} V^{(n)T}, \quad (3)$$
and $U^{(n)}$, $\Sigma^{(n)}$, and $V^{(n)}$ can be obtained by performing singular value decomposition on matrix $C_n$: $C_n = U^{(n)} \Sigma^{(n)} V^{(n)T}$. $U^{(n)}$ and $V^{(n)}$ are orthogonal matrices, $\Sigma^{(n)}$ is a diagonal matrix, and $(\Sigma^{(n)})^{-1}$ is the inverse matrix of $\Sigma^{(n)}$. According to Lemma 17, we convert Problem 7 into the form of equations 2 and 3. Proofs 31-40 give the proof processes of Lemmas 8-17, respectively.
In this article, we use the least squares method to calculate the value of $\nu$ in equation 2. The key is to find the partial derivative of the objective function $F_{Pro1}$ with respect to $\nu$. Because the term $Trace(C^T C)$ is not a function of $\nu$, its partial derivative vanishes. By Lemmas 18, 19, and 20 (Proofs 41, 42, and 43 give their proof processes, respectively), we can obtain that
$$\frac{\partial F_{Pro1}}{\partial \nu} = \xi \left( \nu - \chi \times_1 B_1 \times_2 B_2 \cdots \times_N B_N \right),$$
where $\xi$ is a constant, $\xi = 2$. According to the least squares method, we set $\partial F_{Pro1} / \partial \nu$ to 0. Finally, we obtain the calculation of $\nu$:
$$\nu = \chi \times_1 B_1 \times_2 B_2 \cdots \times_N B_N. \quad (8)$$
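The transitional-tensor step can be sketched in NumPy under the assumption that the least-squares solution has the closed form $\nu = \chi \times_1 B_1 \cdots \times_N B_N$ with $B_n = (C_n C_n^T)^{-1} C_n$, which is computable from the economy SVD of each mapping matrix. The helper names below are ours, for illustration only.

```python
import numpy as np

def n_mode_product(x, a, n):
    """Mode-n product of tensor x with matrix a (rows of a index the new mode)."""
    return np.moveaxis(np.tensordot(a, x, axes=(1, n)), 0, n)

def b_matrix(c):
    """B = (C C^T)^{-1} C = U diag(1/s) V^T, from the economy SVD C = U diag(s) V^T."""
    u, s, vt = np.linalg.svd(c, full_matrices=False)
    return u @ np.diag(1.0 / s) @ vt

def transitional_tensor(chi, mapping_matrices):
    """Map the original tensor into the space of the mapping matrices, mode by mode."""
    nu = chi
    for n, c in enumerate(mapping_matrices):
        nu = n_mode_product(nu, b_matrix(c), n)
    return nu

rng = np.random.default_rng(1)
chi = rng.normal(size=(6, 7, 8))                            # original tensor (toy sizes)
cs = [rng.normal(size=(j, i)) for j, i in [(3, 6), (4, 7), (5, 8)]]  # mapping matrices
nu = transitional_tensor(chi, cs)
print(nu.shape)  # (3, 4, 5)
```

Since each $B_n$ shrinks mode $n$ from $I_n$ to $J_n$, the transitional tensor has the reduced shape $J_1 \times \cdots \times J_N$, as stated in Problem 7.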

2) CALCULATION OF THE CORE TENSOR
In Sub-subsection IV-B1, ITend provides the set of mapping matrices $\{C_n\}$, $n \in [1, N]$. $\{C_n\}$ can be interpreted as the mapping space that is most conducive to improving the accuracy of subsequent judgment prediction algorithms. We map the original tensor $\chi$, which represents the legal case, into the space represented by $\{C_n\}$ and then obtain the transitional tensor $\nu$. $\nu$ represents the main tensor elements and tensor structure information in $\chi$. We use the core tensor $\hat{\chi}$ to approximate $\nu$. Accordingly, $\hat{\chi}$ represents the main tensor information in $\chi$ that is most conducive to improving the accuracy of subsequent judgment prediction algorithms.
Problem 21: Given the transitional tensor $\nu \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$, calculate the value of the core tensor $\hat{\chi} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ that minimizes the objective function $F_{Pro2}$ (equation 13), where $\hat{\chi}$ can be interpreted as the main element information in the original tensor $\chi$ and the main structure information in the set of mapping matrices $\{C_n\}$.
Combining equation 1 in Problem 7 and equation 13 in Problem 21, we can conclude that, with $\nu$ as the bridge, $\hat{\chi}$ represents both the main tensor element information in $\chi$ and the main tensor structure information in $\{C_n\}$. Therefore, $\hat{\chi}$ is interpreted as the tensor element and tensor structure information in $\chi$ that is most conducive to improving the accuracy of subsequent judgment prediction algorithms. Lemmas 22, 23, and 24 establish the properties of the objective function $F_{Pro2}$ with respect to the transitional tensor $\nu \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ and the core tensor $\hat{\chi} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$; in Lemma 24, $\xi$ is a constant, $\xi = 2$.
We use the least squares method to find the value of $\hat{\chi}$ in equation 13. According to the calculation steps of the least squares method, we need to find the partial derivative of the objective function $F_{Pro2}$ with respect to $\hat{\chi}$. By Lemma 9, setting this partial derivative to 0 yields the value of the core tensor $\hat{\chi}$.

C. ORidge
In terms of judgment prediction of legal cases, the dimension of the core tensor $\hat{\chi}$ obtained using ITend is much smaller than the dimension of the original tensor $\chi$ obtained using RTenr. However, the elements of $\hat{\chi}$ may be multicollinear. Multicollinearity has a considerable effect on judgment prediction algorithms. It can cause the following problems: (1) parameter estimates are sensitive, unstable, and distorted; (2) the variance and covariance of parameter estimates are large; and (3) the analysis function of models is reduced.
To solve the abovementioned problems, this article proposes an optimization algorithm based on ridge regression and a set of mapping matrices, namely, ORidge. Based on the traditional ridge regression model, ORidge introduces a set of mapping matrices into its loss function. This article also proposes an optimization algorithm for the loss function in ORidge with respect to the set of mapping matrices. ORidge controls the tensor decomposition process in ITend by optimizing the set of mapping matrices. As a result, core tensors obtained using ITend carry the tensor elements and tensor structure information that are most conducive to improving the accuracy of ORidge. ORidge uses the L2 regular term to prevent overfitting and solve the problems caused by multicollinearity. Definition 25 presents the formal definition of the loss function in ORidge.
Definition 25: Given the core tensors $\{\hat{\chi}^{(m)}\}$, $m \in [1, M]$, where $\hat{\chi}^{(m)}$ represents a legal case, and the set of mapping matrices $\{C_n\}$, $C_n \in \mathbb{R}^{J_n \times I_n}$, $n \in [1, N]$, the loss function of ORidge is defined as
$$F_{Pro3} = \sum_{m=1}^{M} \left( \varphi(\hat{\chi}^{(m)}) - \omega_{vec}^{T} \hat{\chi}^{(m)}_{vec} - \mu \right)^2 + \lambda \|\omega\|_F^2 + \sum_{n=1}^{N} \psi(\alpha_n, C_n), \quad (15)$$
where $\varphi(\hat{\chi}^{(m)})$ represents the judgment result of the legal case represented by $\hat{\chi}^{(m)}$, including sentences and fines, $\omega$ is the regression coefficient tensor, $\mu$ is the bias, $\lambda \|\omega\|_F^2$ is the L2 regular term, and $\psi(\alpha_n, C_n) = \alpha_n Trace(C_n^T C_n)$ penalizes the mapping matrices.
Lemma 26: Given matrices $A \in \mathbb{R}^{P \times M}$, $B \in \mathbb{R}^{M \times N}$, and $C \in \mathbb{R}^{N \times P}$, then $Trace(ABC) = Trace(CAB)$.
Lemmas 27, 28, and 29 relate the vectorization of a core tensor $\hat{\chi}^{(m)}$, which represents a legal case, and its judgment result $\varphi(\hat{\chi}^{(m)})$, including sentences and fines, to the terms of the loss function.
Lemma 30: Given a matrix $C_n \in \mathbb{R}^{J_n \times I_n}$ and $F_C = Trace(C_n^T C_n)$, then $\partial F_C / \partial C_n = 2 C_n$.
We use mini-batch gradient descent (MBGD) to solve the values of the parameters in function 15. The key to the problem is to find the partial derivative of function 15 with respect to $C_n$. According to equations 8 and 14, $\partial \hat{\chi}^{(m)} / \partial B_n$ can be obtained, and $\partial B_n / \partial C_n$ can be obtained by the rule of inverse matrix derivation. By Lemma 30, we can get that $\partial \psi(\alpha_n, C_n) / \partial C_n = 2 \alpha_n C_n$. Combining these results yields the partial derivative of function 15 with respect to $C_n$. Proofs 47, 48, 49, 50, and 51 give the proof processes of Lemmas 26, 27, 28, 29, and 30, respectively.
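The core of the ORidge objective, squared prediction error on vectorized core tensors plus an L2 term on the coefficients, can be sketched as follows. This is a simplified illustration on toy data: only the regression weights are updated by mini-batch gradient descent, and the joint update of the mapping matrices $\{C_n\}$ described above is omitted. All names and sizes are ours.

```python
import numpy as np

def ridge_loss(w, x, y, lam):
    """Mean squared error of the linear predictions plus lam * ||w||^2."""
    r = x @ w - y
    return float(r @ r / len(y) + lam * w @ w)

def mbgd_ridge(x, y, lam=0.01, lr=0.05, epochs=200, batch=32, seed=0):
    """Minimize the ridge loss over the regression weights with MBGD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            b = idx[start:start + batch]
            # Gradient of the batch MSE plus the L2 term
            grad = 2 * x[b].T @ (x[b] @ w - y[b]) / len(b) + 2 * lam * w
            w -= lr * grad
    return w

rng = np.random.default_rng(2)
cores = rng.normal(size=(200, 3, 4, 5))   # toy core tensors, one per "legal case"
x = cores.reshape(200, -1)                # tensor vectorization
true_w = rng.normal(size=x.shape[1])
y = x @ true_w                            # toy "judgment" targets
w = mbgd_ridge(x, y)
print(ridge_loss(w, x, y, 0.0) < ridge_loss(np.zeros_like(w), x, y, 0.0))  # True
```

The small L2 coefficient trades a little bias for stability when the vectorized core tensors are correlated, which is exactly the multicollinearity issue the section opens with.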

V. RESULTS AND ANALYSIS A. DATA DESCRIPTION
The dataset used in this article consists of real legal cases obtained from the Chinese Referee Document Network. The dataset contains nearly 3 million legal cases involving 203 crimes. We divide each judgment of legal cases into two parts, namely, sentences and fines. On the basis of a large amount of legal expertise, in the early work of this article, we extracted features from the original dataset and then obtained the sentence and fine of each legal case. Such work provided effective data support for the training of judgment prediction models. The legal cases in the dataset involve sentences ranging from 0 to 300 months and fines ranging from 0 to 100,000 yuan. Figure 4 shows the sentences and fines in some legal cases of fixed-term imprisonment. We analyze the legal case data and find interesting phenomena. (1) The sentences in more than 80% of legal cases are concentrated within 3 years. (2) The fines in more than 80% of legal cases are concentrated within 6000 yuan. (3) The proportions of legal cases sentenced to death or life imprisonment are small: 0.159% and 0.328%, respectively. (4) In nearly 50% of legal cases, the fine is 0 yuan.

B. BASELINE APPROACHES
This article proposes a new method for the judgment prediction of legal cases, namely, TenRR. TenRR mainly consists of three modules. (1) RTenr. It represents legal cases as three-dimensional original tensors. (2) ITend. It decomposes original tensors into core tensors. (3) ORidge. It is trained using the obtained core tensors, and a judgment prediction model for legal cases is obtained. The set of mapping matrices {C n } in ITend is crucial. {C n } is a bridge connecting ITend and ORidge. It enables ORidge to control the tensor decomposition process in ITend. Consequently, the obtained core tensors contain the tensor elements and tensor structure information that is most conducive to improving the accuracy of TenRR.
Studies on judgment prediction are currently limited. Proposed methods are mainly based on feature models and machine learning algorithms. Considering that case documents are text data, we set the following four baselines on the basis of the latest research in the field of text analysis.
• Prediction methods based on feature models: These methods use feature models to represent legal cases. The obtained matrices are input into prediction algorithms.
• Prediction methods based on matrix decomposition: These methods use feature models to represent legal cases, decompose original matrices via matrix decomposition, and train prediction algorithms through the obtained matrices.
• Prediction methods based on tensor models: These methods represent legal cases as three-dimensional tensors and then input them into prediction algorithms.
• Prediction methods based on unsupervised tensor decomposition: These methods use tensor models to represent legal cases. They decompose original tensors into core tensors by using unsupervised tensor decomposition. Core tensors are used to train prediction algorithms.
The highlight of the method proposed in this paper is the setting of mapping matrices. Mapping matrices closely link the case-modeling process with subsequent prediction algorithms. The abovementioned four baselines can reflect the advantages of mapping matrices in TenRR. Prediction algorithms used in this article include commonly used neural networks and regression algorithms. Neural networks include TextCNN [8], TextRNN [9], TextCNN attention [10], TextRNN attention [11], LSTM [12], Bi-LSTM [13], GRU [14], and Bi-GRU [15]. Regression algorithms include linear regression [16], polynomial regression [17], ridge regression [18], Lasso regression [19], and ElasticNet regression [20].

C. PARAMETER ADJUSTMENT AND EXPERIMENTAL SETTINGS
This article proposes a new method based on the innovative tensor decomposition and optimized ridge regression, namely, TenRR. TenRR consists of three parts: RTenr, ITend, and ORidge. Unlike traditional ridge regression algorithms, in addition to the regression coefficient, the parameters involved in TenRR include a set of mapping matrices {C n }. In TenRR, ORidge controls the tensor decomposition process in ITend by optimizing the values of {C n }. Therefore, the selection of initial values of the set of mapping matrices {C n } is important. It affects the convergence speed and accuracy of judgment prediction algorithms.
Before conducting formal experiments, we perform numerous preliminary experiments on small-batch datasets. We set different initial values of the mapping matrices in accordance with different steps on each small-batch dataset. Then, we monitor the influence of the initial values of the mapping matrices on the convergence speed of the prediction algorithms. From these experiments, we conclude that when each mapping matrix has a simple linear relationship and the absolute values of its elements are small, the loss function reaches the optimal value rapidly; that is, the prediction algorithm has a fast convergence speed.
TenRR is a regression algorithm, and each predicted sentence or fine is a precise value. However, in practice, judges prefer to see a range of sentences or fines. Therefore, we set a fault tolerance window. When the difference between the predicted and real values is within the window, we consider the predicted value to be correct. In this paper, the fault tolerance window adopts a dynamic floating mechanism. For sentences, the fault tolerance window is 3 months. When the prison term is more than 10 years, the fault tolerance window can be extended up to 6 months. For fines, the fault tolerance window is 300 yuan. When the fine involved is more than 10,000 yuan, the fault tolerance window can be extended to 500 yuan. When the fine involved is more than 50,000 yuan, the fault tolerance window can be extended to 1,000 yuan.
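The dynamic fault-tolerance window just described can be written out directly (the function names are ours; the thresholds are the ones stated above):

```python
def sentence_window(actual_months: float) -> float:
    """Window for sentences: 3 months, extended to 6 months beyond 10 years (120 months)."""
    return 6.0 if actual_months > 120 else 3.0

def fine_window(actual_yuan: float) -> float:
    """Window for fines: 300 yuan, extended to 500 beyond 10,000 and 1,000 beyond 50,000."""
    if actual_yuan > 50_000:
        return 1_000.0
    if actual_yuan > 10_000:
        return 500.0
    return 300.0

def is_correct(predicted: float, actual: float, window_fn) -> bool:
    """A prediction counts as correct when it falls within the window around the actual value."""
    return abs(predicted - actual) <= window_fn(actual)

print(is_correct(34, 36, sentence_window))    # True: off by 2 months, window is 3
print(is_correct(140, 132, sentence_window))  # False: off by 8 months, window is 6
print(is_correct(5_250, 5_000, fine_window))  # True: off by 250 yuan, window is 300
```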
For the traditional neural network baselines, namely, TextCNN, TextRNN, TextCNN with attention, TextRNN with attention, LSTM, Bi-LSTM, GRU, and Bi-GRU, this article performs 10 iterations with a batch size of 128, a hidden layer size of 512, 3 hidden layers, and a learning rate of 0.001. We use TensorFlow as the development framework and a graphics processing unit to accelerate computation. For the traditional regression baselines, namely, linear regression, polynomial regression, Lasso, and ElasticNet, this article uses the optimal order and regression coefficients found within each specified interval.

D. EXPERIMENTAL RESULTS AND ANALYSIS
This subsection provides the experimental results and their analysis. Figures 5, 6, and 7 show the predicted and actual values of sentences in legal cases of fixed-term imprisonment, life imprisonment, and the death penalty, respectively. In death penalty cases, sentences refer to suspended sentences; in life imprisonment cases, sentences refer to commutation. Figure 8 shows the predicted and actual values of the fines in legal cases. Subfigure (a) depicts the experimental results of the judgment prediction method based on tensor models, an algorithm combining RTenr with traditional ridge regression. Subfigure (b) demonstrates the experimental results of TenRR. Each figure has two curves: the dark curve represents the distribution of actual sentences or fines, and the light curve represents the distribution of predicted sentences or fines for the corresponding legal cases.
Figures 5, 6, 7, and 8 show that the prediction results of TenRR on sentences and fines are more accurate than those of the traditional ridge regression algorithms. The main reason is the use of ITend and ORidge in TenRR. ITend decomposes the three-dimensional original tensor χ obtained using RTenr, which represents a legal case, into a core tensor G. First, ITend establishes a set of mapping matrices {C_n}; it maps χ into the feature space represented by {C_n} and thereby obtains the core tensor G. G greatly reduces the dimensions of χ while removing redundant, meaningless, and inaccurate information from it. Second, ORidge intervenes in the tensor decomposition process in ITend by optimizing the values of {C_n}; therefore, the obtained core tensor G carries the tensor elements and tensor structure information that is most beneficial to improving the prediction accuracy of TenRR.
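The role of the mapping matrices can be illustrated with a Tucker-style projection: the original tensor is multiplied along each mode by a mapping matrix to yield a much smaller core tensor. The following numpy sketch assumes a mode-n product that sums over the original mode index (an assumption about the paper's Definition 6), with illustrative shapes:

```python
import numpy as np

def mode_n_product(X, C, n):
    # (X x_n C)[..., k, ...] = sum_i X[..., i, ...] * C[i, k]
    Y = np.tensordot(X, C, axes=([n], [0]))  # new axis is appended last
    return np.moveaxis(Y, -1, n)

def core_tensor(X, mapping_matrices):
    """Project the original tensor into the feature space {C_n}."""
    G = X
    for n, C in enumerate(mapping_matrices):
        G = mode_n_product(G, C, n)
    return G

X = np.random.rand(50, 40, 30)  # original tensor for one legal case
Cs = [np.random.rand(50, 10), np.random.rand(40, 8), np.random.rand(30, 6)]
G = core_tensor(X, Cs)
print(X.size, "->", G.size)  # 60000 -> 480
```

The core tensor here holds 480 entries instead of 60,000, which illustrates the dimension reduction claimed above.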
The set {C_n} is the bridge connecting ITend and ORidge. In terms of sentence prediction, the number of legal cases of fixed-term imprisonment is larger than those of life imprisonment and the death penalty; hence, the prediction accuracy for fixed-term imprisonment is higher. Tables 1 to 4 present the judgment prediction methods based on feature models, matrix decomposition, tensor models, and unsupervised tensor decomposition, respectively. For convenience, the prediction accuracy of TenRR is also shown in Table 3.
Subtables (a) in Tables 1 to 4 demonstrate the accuracy of judgment prediction algorithms based on neural networks. In this article, every 3 months of sentence is treated as one category, and every 100 yuan of fine is likewise treated as one category. The prediction accuracy of RNNs is higher than that of CNNs: RNNs can capture the contextual information among words, whereas the convolution kernels of CNNs focus on capturing spatial correlations among words, so RNNs are better at processing sequential data. Nevertheless, LSTMs have higher prediction accuracy than plain RNNs. The main reason is that plain RNNs have difficulty handling long-distance dependencies, whereas LSTMs solve this problem through a long-term state and gating mechanisms. Compared with unidirectional LSTM, Bi-LSTM fully considers the context of each word and thus has higher prediction accuracy. GRUs and LSTMs have comparable accuracy, but GRUs converge faster: they pass outputs directly to the next cell without an output gate and have fewer parameters than LSTMs.
Subtables (b) in Tables 1 to 4 present the accuracy of different regression algorithms. Polynomial regression has higher prediction accuracy than linear regression because it can handle nonlinear relationships in the datasets. Compared with linear and polynomial regression, ridge regression, Lasso regression, ElasticNet, and TenRR have higher prediction accuracy: linear and polynomial regression have difficulty dealing with multicollinearity, whereas the latter methods address it through L1 or L2 regularization terms. Compared with ridge regression, Lasso regression has higher accuracy, because the L1 norm performs feature selection, which reduces the influence of unnecessary features on the accuracy of the prediction model.
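The multicollinearity point can be made concrete with a toy example: with two nearly identical columns, ordinary least squares produces wildly inflated coefficients, while the L2 penalty of ridge regression keeps them bounded. A minimal numpy sketch (synthetic data and an illustrative λ, not the paper's setup):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X[:, 4] = X[:, 3] + 1e-6 * rng.standard_normal(200)  # near-collinear columns
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(200)

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # unstable under collinearity
w_ridge = ridge_fit(X, y, lam=1.0)
# ridge coefficients stay bounded despite the collinear pair of columns
print(float(np.linalg.norm(w_ols)), float(np.linalg.norm(w_ridge)))
```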
Tables 1 and 2 indicate that prediction methods based on matrix decomposition models are more accurate than those based on feature models. Matrix decomposition methods (such as SVD) greatly reduce the dimensions of original matrices. They remove redundant information and predict missing data. These factors improve the accuracy of subsequent prediction algorithms. Similarly, as shown in Tables 3 and 4, the accuracy of prediction algorithms based on tensor decomposition models is higher than that of prediction algorithms based on tensor models.
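The dimension-reduction effect of SVD mentioned here can be sketched in a few lines of numpy: keeping only the top-k singular values gives the best rank-k approximation, which preserves the dominant structure while discarding the smallest (mostly noise) components. Shapes and k are illustrative:

```python
import numpy as np

def truncated_svd(M, k):
    """Best rank-k approximation of M (Eckart-Young theorem)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
# a low-rank "case-feature" matrix plus small noise
M = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 80))
M_noisy = M + 0.01 * rng.standard_normal(M.shape)
M_k = truncated_svd(M_noisy, k=4)

# the rank-4 reconstruction recovers the clean structure well
err = np.linalg.norm(M_k - M) / np.linalg.norm(M)
print(round(err, 4))
```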
Matrix and tensor decomposition algorithms reduce the dimensionality and sparseness of the original data, which helps improve the accuracy and stability of subsequent prediction algorithms. Comparing Tables 1 to 4 shows that prediction methods based on tensors perform better than those based on matrices. Tensor models can describe legal cases from multiple perspectives, and the data in tensor models are less sparse and more accurate than the data in matrix models. Tensor models thus provide strong data support for training subsequent prediction algorithms.
Tables 1 to 4 imply that compared with a series of neural networks and regression algorithms, TenRR has higher accuracy in judgment prediction of legal cases. The main reason is the use of mapping matrices. ITend sets mapping matrices and decomposes original tensors representing legal cases into core tensors under the guidance of mapping matrices. Core tensors greatly reduce the dimensions of original tensors while removing inaccurate, redundant, and meaningless information from them. ORidge intervenes in the tensor decomposition process in ITend by optimizing the value of mapping matrices. As a result, core tensors obtained using ITend represent the tensor elements and tensor structure information that is most conducive to improving the accuracy of TenRR.

VI. CONCLUSIONS
This article proposes a new method for judgment prediction of legal cases, namely, TenRR, which is based on innovative tensor decomposition and optimized ridge regression. TenRR is mainly divided into three steps. (1) A tensor-model-based representation method for legal cases, namely, RTenr, which represents legal cases as three-dimensional original tensors. (2) An innovative tensor decomposition method, namely, ITend, which decomposes the original tensors into core tensors. (3) An optimized ridge regression algorithm, namely, ORidge, which we train on the obtained core tensors. Finally, a judgment prediction model for legal cases is obtained.
Compared with judgment prediction models based on feature models and classification algorithms, TenRR has the following advantages. (1) RTenr does not require considerable expert knowledge or manual labeling, and it can fully describe legal cases. (2) ITend establishes a set of mapping matrices {C_n}; it maps the original tensors into the feature space represented by {C_n} and thereby obtains core tensors. Core tensors greatly reduce the dimensions of the original tensors while removing redundant, meaningless, and inaccurate information from them. ITend thus avoids dimensional explosion and sparse data. (3) ORidge intervenes in the tensor decomposition process in ITend by optimizing the values of the mapping matrices {C_n}; therefore, the core tensors derived using ITend carry the tensor elements and tensor structure information that is most conducive to improving the accuracy of TenRR. These advantages greatly improve the accuracy of TenRR. This article further proposes an optimization algorithm for ORidge with respect to the set of mapping matrices {C_n}. First, we calculate the partial derivative of the loss function with respect to the mapping matrices, ∂F_ORidge/∂C_n. Then, we complete the iteration of {C_n} using mini-batch gradient descent (MBGD).
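The iteration of {C_n} by MBGD can be sketched generically. The gradient function below is a hypothetical stand-in for the analytic partial derivative ∂F_ORidge/∂C_n derived in the paper; the toy loss and learning rate are illustrative only:

```python
import numpy as np

def mbgd_update(Cs, grad_fn, batches, lr=0.01):
    """One epoch of mini-batch gradient descent over the mapping matrices.

    Cs      : list of mapping matrices {C_n}
    grad_fn : grad_fn(Cs, n, batch) -> dF/dC_n, a stand-in for the
              paper's analytic partial derivative of the ORidge loss
    batches : iterable of mini-batches of training cases
    """
    for batch in batches:
        for n in range(len(Cs)):
            Cs[n] = Cs[n] - lr * grad_fn(Cs, n, batch)
    return Cs

# toy check: the quadratic loss F = sum_n ||C_n||_F^2 has gradient 2*C_n,
# so repeated updates shrink every mapping matrix toward zero
Cs = [np.ones((4, 2)), np.ones((3, 2))]
for _ in range(100):
    Cs = mbgd_update(Cs, lambda Cs, n, batch: 2 * Cs[n],
                     batches=[None], lr=0.05)
print(max(abs(C).max() for C in Cs))  # close to 0
```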
Proof 34: By performing singular value decomposition on A, we obtain A = UΣV^T, where U ∈ R^(I×I), V ∈ R^(J×J), and Σ ∈ R^(I×J). U and V are orthogonal matrices, and Σ is a diagonal matrix whose diagonal elements are not all 0. Since I ≥ J, there exists a matrix Σ^(-1) that satisfies Σ^(-1)Σ = E.
Let C = AB and ν = χ ×_n C; then C ∈ R^(I_n×K_n) and ν ∈ R^(I_1×···×I_(n-1)×K_n×I_(n+1)×···×I_N). Letting γ = χ ×_n A ×_n B, from Definition 6 we obtain
γ_(i_1···i_(n-1) k_n i_(n+1)···i_N) = Σ_(i=1)^(I_n) Σ_(j=1)^(J_n) χ_(i_1···i_(n-1) i i_(n+1)···i_N) A_(ij) B_(j k_n) = Σ_(i=1)^(I_n) χ_(i_1···i_(n-1) i i_(n+1)···i_N) (AB)_(i k_n) = ν_(i_1···i_(n-1) k_n i_(n+1)···i_N).
Therefore, γ = ν; that is, χ ×_n A ×_n B = χ ×_n (AB).
Proof 36: Assume that m < n, and let γ = χ ×_n A ×_m B and ω = χ ×_m B ×_n A. By the definition of the tensor product (Definition 6), each element of γ is obtained by first summing χ over mode n against A and then over mode m against B, and each element of ω is obtained by performing the same two summations in the opposite order. Because the two summations act on different modes, they can be exchanged. Based on the above analysis, we obtain γ = ω; that is, χ ×_n A ×_m B = χ ×_m B ×_n A.
Proof 37: From the properties of the Frobenius norm, ‖AU‖_F^2 = ‖(AU)^T‖_F^2 = ‖U^T A^T‖_F^2. By Lemma 9, ‖U^T A^T‖_F^2 = Trace((U^T A^T)^T U^T A^T) = Trace(A U U^T A^T). From the properties of orthogonal matrices, U U^T = E; therefore, by Lemma 9, ‖AU‖_F^2 = Trace(A A^T) = ‖A‖_F^2.
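The invariance used in Proof 37 is easy to check numerically: multiplying by an orthogonal matrix leaves the Frobenius norm unchanged. A quick numpy check, with Q taken from a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.standard_normal((6, 6))
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))  # Q is orthogonal: Q Q^T = I

lhs = np.linalg.norm(A @ Q, "fro")
rhs = np.linalg.norm(A, "fro")
print(np.isclose(lhs, rhs))  # True
```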
Proof 38: From Definition 2, together with Lemmas 15, 14, and 8, F_Pro1 can be expressed in terms of the Frobenius norms of the mapping matrices. According to Lemma 11, let B satisfy the condition C_p B = E with B ∈ R^(I_p×J_p); then B = V^(p) (Σ^(p))^(-1) U^(p)T, where U^(p), Σ^(p), and V^(p) are obtained from the singular value decomposition of C_p, C_p = U^(p) Σ^(p) V^(p)T. U^(p) and V^(p) are orthogonal matrices, Σ^(p) is a diagonal matrix, and (Σ^(p))^(-1) is the inverse of Σ^(p). From Lemma 16, when ‖AB‖_F^2 takes its minimum value, ‖A‖_F^2 takes its minimum value.
From the above analysis, we obtain the expression of F_Pro1; by the same principle, the analogous formulas follow for the remaining modes.
Each B_n satisfies B_n = V^(n) (Σ^(n))^(-1) U^(n)T, where U^(n), Σ^(n), and V^(n) are obtained by performing singular value decomposition on the matrix C_n, C_n = U^(n) Σ^(n) V^(n)T. U^(n) and V^(n) are orthogonal matrices, Σ^(n) is a diagonal matrix, and (Σ^(n))^(-1) is the inverse of Σ^(n).

SECOND APPENDIX
Proofs 41, 42, and 43 give the proofs of Lemmas 18, 19, and 20, respectively. Proof 41: From the matrix multiplication rule and Definition 1, the claim follows directly. Proof 42: From the matrix multiplication rule, we compute each element A_ij of the product; when k = i, the element reduces to A_jj, and the claim follows from the matrix multiplication rule.