Semi-Supervised Gaussian Processes Active Learning Model for Imbalanced Small Data Based on Tri-Training With Data Enhancement

To address imbalanced small-sample datasets that contain only a few labeled samples, a semi-supervised Gaussian processes active learning model based on improved tri-training with enhanced data is proposed. First, the labeled samples are balanced and enhanced, and a quantitative enhanced data evaluation criterion, based on the JS distance and on the similarity of information entropy between the enhanced data and the original data, is presented to select the best enhanced data. Second, an improved semi-supervised learning method based on tri-training is proposed to find the unlabeled samples that have high confidence, which increases the certainty of the labeled sample group. To ensure that the three classifiers of tri-training are both diverse and robust, random forest is introduced to divide the features of the dataset into three groups with equal contribution, and each classifier is trained on a different combination of two feature groups. Third, to query and classify the most informative unlabeled samples more precisely, an active learning module based on the Gaussian process and the JS distribution range is constructed. Because the unlabeled samples handled by active learning are highly uncertain, the similarity distribution range of the JS distance is introduced to compare unlabeled samples with labeled samples in the active learning classifier, so that the model can classify more diverse samples. Experimental results show that, compared with several traditional models, the proposed model performs better on artificial datasets and on imbalanced small-size UCI datasets.


I. INTRODUCTION
Traditional machine learning classification tasks are usually divided into two types: supervised learning, in which the model is trained on labeled samples, and unsupervised learning, in which the model clusters unlabeled samples. In practical applications, however, many datasets contain only a few labeled samples, and labeling the remaining samples requires a great deal of time and labor [1]. Moreover, some datasets are also small and imbalanced [2]. Therefore, how to effectively predict unlabeled samples by training on only a few labeled samples has become a key problem in current machine learning; this is called the semi-supervised learning problem.
The associate editor coordinating the review of this manuscript and approving it for publication was Dongxiao Yu.
At present, there are two solutions to this problem. One is semi-supervised learning (SSL), and the other is active learning (AL) [3]. SSL finds the unlabeled samples that are predicted with the highest confidence, labels them and adds them to the labeled dataset, and then continues to predict the remaining unlabeled samples with the new labeled dataset until all unlabeled samples are predicted. AL queries the most informative unlabeled samples and labels them to expand the labeled dataset, and repeats the procedure until all unlabeled samples are labeled.
The difference-based SSL model is the most mainstream SSL model and originates from co-training [4]. Co-training divides the data attributes into two groups that are conditionally independent of each other, and the two groups of data are used to train two classifiers. Dalva et al. summarized the co-training strategy into three types: agreement, disagreement and self-combined [5]. In the agreement strategy, both classifiers assign an unlabeled sample to the same class with a confidence score; the agreed samples are sorted by the sum of their confidence scores, and the samples whose confidence score exceeds a certain threshold are selected. The disagreement strategy aims to classify hard samples on which one classifier is decisive but the other is indecisive. In this strategy, unlabeled samples are sorted according to the absolute difference of their absolute confidence scores, and the samples whose confidence score for the corresponding hypothesis exceeds a certain threshold are selected. The self-combined strategy allows the two classifiers to select high-confidence samples independently and then selects the unlabeled samples that both classifiers assign to the same class. However, co-training has several shortcomings, such as neglecting the relevance of the learning model and the characteristics of the dataset [6]. Therefore, researchers have been improving co-training for many years, and the most famous variant of co-training is tri-training [7]. Tri-training, proposed by Zhou et al. in 2005, is a semi-supervised classification model built on the ensemble idea. This model can fully use the feature set of the data to improve the efficiency of semi-supervised learning. For a long time, tri-training has been considered an effective SSL model. However, like other SSL models, tri-training introduces noise through false predictions of unlabeled samples during iterative training when data are insufficient, which degrades classification performance [8].
AL actively finds the most informative unlabeled samples to increase the diversity of labeled training samples. In the active learning process, the learning machine actively searches the unlabeled samples with the most information through the query strategy, trains and classifies these unlabeled samples, and then adds these samples into the labeled training samples group. These new labeled samples can significantly reduce the wrong classification information, thereby improving the classification accuracy of the classifier [9]. However, the most informative unlabeled samples are also the most uncertain samples, which contain more noise. If these samples are not effectively classified, the generalization ability of the model will be significantly affected. Therefore, active learning sometimes has high uncertainty.
The full prediction ability of tri-training can effectively reduce the risk of wrong prediction in AL, and AL can select the best samples for the classifier to predict. Therefore, some researchers have tried to combine SSL with AL [2]. Xu et al. proposed an active SVM model based on QBC and tri-training [3]; this model uses an improved tri-training algorithm to label the unlabeled samples with the highest confidence, and then uses a QBC-based AL algorithm to select the newly labeled samples with the highest inconsistency to improve the generalization performance of the model. However, the threshold of this model is set manually, which affects the model's classification performance. Zhang et al. introduced the tri-training algorithm into the CEAL model to select the most confident unlabeled samples, and improved the AL strategy in CEAL based on voting entropy [10]. Nevertheless, this model cannot select pseudo-labeled samples with high precision. Although algorithms that combine SSL and AL have advantages over SSL or AL alone, they select too many redundant samples and need to search the entire sample space when querying samples, which increases their complexity and running time. In addition, most similar studies rarely consider the imbalanced small sample problem.
The imbalanced small sample problem is also a significant difficulty in semi-supervised learning. Because the training data are imbalanced and scarce, the classification results of machine learning models tend to favor the majority class and under-learn the minority class, which harms the model's generalization ability. Zhao et al. proposed a semi-supervised learning algorithm based on mixed sampling for imbalanced data classification [2]. This algorithm can effectively improve the classification ability of a semi-supervised model on small-size imbalanced samples. However, it does not pay enough attention to minority samples, and its effect on the binary classification task is not ideal.
To solve the above problems, and building on the model proposed in [2], we propose a semi-supervised model suitable for two-class imbalanced small sample datasets. In this model, we combine the robustness of tri-training and the diversity of AL. We redesign the feature allocation of tri-training and improve the query strategy and classifier in AL. The proposed model is called a semi-supervised Gaussian processes active learning model based on improved tri-training.
The main contributions of this study are summarized as follows:
1) A new semi-supervised learning model for imbalanced samples is proposed, which is suitable for binary classification. In this model, tri-training and AL are combined.
2) The classifier feature assignment mechanism of tri-training is improved. The features are divided into three groups with the same contribution value. The three groups of features are combined in pairs, so the three classifiers have sufficient prediction ability and a certain degree of difference.
3) The query strategy of AL is replaced with one based on Gaussian processes, and the similarity distribution range of the JS distance is introduced into the classifier of AL. The Gaussian process can measure the uncertainty of samples better than the traditional voting entropy and KL divergence, and the improved classifier, which introduces the distribution range of the JS distance, can effectively help judge the class of unlabeled samples.
4) A quantitative enhanced data evaluation criterion is proposed to measure the quality of enhanced data. The JS distance between the original samples and the enhanced samples is used to measure the quality of the enhanced data, and the JS distance between the information entropy of the original samples and that of the enhanced samples is used to measure the diversity of the enhanced data numerically.
5) When the three classifiers of tri-training give different predictions for all remaining samples, the training process cannot continue. To solve this problem, a total classifier is introduced to predict all the remaining samples.
Experiments on two artificial datasets and five UCI datasets prove the effectiveness of the proposed model.

II. RELATED WORK
A. TRI-TRAINING
Tri-training is an improved co-training algorithm that uses three classifiers to identify the label of each unlabeled sample, which gives it strong robustness. In the training process, the classifiers are trained cooperatively, and unlabeled samples with high confidence are selected for labeling. Although false predictions introduce noise into the labeled samples, Zhou et al. proved that when there are enough new data, the impact of the noise can be offset [7].
In recent years, many scholars have been improving and expanding the applications of tri-training. Inspired by the asymmetric tri-training framework for unsupervised domain adaptation, Saito et al. proposed a model-agnostic meta-learning method that is applied to recommender systems [11]. Mo et al. improved tri-training by using a ladder network [12], allocating different weights to the newly labeled data and expanding the training set. Liu et al. introduced the theory of the teacher-student model into tri-training [13]. Zhang et al. introduced a convex optimization method into tri-training to reduce label noise [14] and replaced the error rate with cross-entropy, proposing a Safe Tri-training Algorithm Based on Cross Entropy. Zhang et al. implemented the tri-training algorithm in cost-effective active learning to improve generalization performance on image classification problems [10]. Tseng et al. proposed a tri-training decision module based on the judgment of a probability threshold [15].

B. ACTIVE LEARNING
Different from semi-supervised learning, active learning actively searches for and labels the most informative unlabeled samples, that is, the most uncertain samples [16]. There are several query strategy frameworks for finding the most uncertain samples, such as uncertainty sampling, query-by-committee, expected model change and density-weighted methods. In essence, these strategies query the unlabeled samples that are most difficult to discriminate. In recent years, researchers have found that simple query strategies struggle to measure the uncertainty of samples, and many have tried to propose new query strategies or combine different ones [17]. Xu et al. selected the most inconsistent unlabeled samples when the vote entropy is higher than the threshold and the most consistent unlabeled samples when the vote entropy is lower than the threshold [3]. Gu et al. provided an active learning risk bound based on the informativeness and representativeness of unlabeled samples [17], and then proposed a novel batch mode active learning method combined with semi-supervised SVM based on this risk bound, improving the generalization ability of the model. Zhao et al. introduced mixtures of Gaussian processes into active learning [18] and designed three query strategies based on them. Compared with other deterministic or probabilistic models, this model uses Gaussian processes to select the most uncertain samples in a probabilistic way, thus providing a flexible framework for probabilistic regression and classification. This model is especially suitable for binary classification problems.
Dwarikanath et al. improved active learning in the medical image classification task [19]. To address the problem that active learning cannot be applied to multi-label samples, they proposed a new sample selection method based on graph analysis to identify informative samples in a multi-label environment. Lee et al. proposed a data acquisition framework based on active learning for the highly unbalanced distribution of properties in data-driven metamaterials design [20], aiming to guide the generation of diverse and task-aware data. Hossein et al. proposed Probabilistic Minimax Active Learning (PMAL) [21], which uses the variational method in the likelihood function of logistic regression to approximate the PMAL target, thus minimizing the upper risk limit of the classifier. Luciano et al. used uncertainty-based active learning to diagnose unknown industrial faults [22], which helps experts through intelligent fault diagnosis and by searching for potential samples of new types of faults.

III. SEMI-SUPERVISED GAUSSIAN PROCESSES ACTIVE LEARNING MODEL BASED ON IMPROVED TRI-TRAINING WITH ENHANCED DATA
Our model comprises a labeled data enhancement module, a high-confidence sample classification module, and a low-confidence sample classification module. A quantitative enhanced data evaluation criterion based on sample similarity and diversity similarity is proposed to evaluate the enhanced samples, and the best enhancement method is selected to improve the model's prediction ability for imbalanced and small-size samples. After data enhancement, all labeled samples are input into the improved tri-training as the training samples, and the improved tri-training predicts all unlabeled samples to find the samples with the highest confidence. To ensure the difference and robustness of the three classifiers of tri-training, each classifier is given feature groups with similar contribution values. When the prediction results of the three classifiers for an unlabeled sample are the same, the sample is added to the labeled sample set as a new training sample.
Unlabeled samples for which the prediction results of the three classifiers are inconsistent are input into the active learning module. First, the most uncertain unlabeled samples are selected by the Gaussian processes. Then, the class toward which the three classifiers incline is decided for each sample according to the voting entropy of its prediction results. After that, the JS distance between each unlabeled sample and the labeled samples of its inclined class is calculated. If this JS distance is within a certain threshold range, the sample can be considered a true positive sample or a true negative sample. The threshold range is determined by the distribution range of the JS distance between labeled samples. The remaining unlabeled samples are input into the model again until all unlabeled samples are labeled. The model's flow is shown in FIGURE 1.
The model is described in Algorithm 1.
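The routing between the two classification modules can be illustrated with a minimal sketch. It assumes each classifier returns a single hard label per sample; the function names `vote_entropy`, `inclined_class` and `route_sample` are hypothetical and are not taken from the paper or from Algorithm 1.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (in bits) of the class votes cast by the classifiers."""
    total = len(votes)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(votes).values())


def inclined_class(votes):
    """Class that receives the most votes, i.e. the class the committee leans toward."""
    return Counter(votes).most_common(1)[0][0]


def route_sample(votes):
    """Decide what happens to one unlabeled sample.

    Unanimous votes: the sample is labeled directly (high-confidence path).
    Otherwise: the sample goes to active learning with its inclined class.
    """
    if len(set(votes)) == 1:
        return ("label", votes[0])
    return ("active_learning", inclined_class(votes))


# Three binary classifiers voting on two samples.
print(route_sample([1, 1, 1]))   # unanimous
print(route_sample([1, 0, 1]))   # disagreement, leaning toward class 1
print(vote_entropy([1, 0, 1]))
```

With three binary classifiers a tie is impossible, so the inclined class is always well defined in this setting.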

A. QUANTITATIVE ENHANCED DATA EVALUATION CRITERIA
In the literature, KL divergence is commonly used to measure the similarity between the enhanced data and the original data. KL divergence measures the difference between two distributions P and Q, and its formula is:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \tag{1}$$

However, KL divergence is asymmetric, which makes it inflexible in practical applications. Therefore, the JS distance, which is derived from KL divergence, is introduced as the measurement standard of sample similarity. Compared with KL divergence, the JS distance can distinguish similarity more accurately and is symmetric, which makes it more flexible. The formula is:

$$JS(P \,\|\, Q) = \frac{1}{2} D_{KL}\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{KL}\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right) \tag{2}$$

Meanwhile, the traditional standard for evaluating enhanced data only compares the similarity between the enhanced data and the original data. Zang et al. introduced the diversity of the enhanced data into the measurement standard [23]. However, [23] relies only on the distribution map of the samples to measure diversity, which makes its diversity measure highly subjective. In addition, [23] holds that better enhanced samples should have greater diversity, but enhanced samples that are too diverse will deviate from the original samples. Consequently, the diversity of the enhanced samples should be close to that of the original samples. Therefore, information entropy is introduced as a numerical measure of data diversity, and the closeness between the diversity of the enhanced data and that of the original data is measured by the JS distance between the information entropy of the enhanced data and the information entropy of the original data. Information entropy was first proposed by Shannon to measure the uncertainty of a random variable from the probabilities of its outcomes. Its formula is:

$$H(X) = -\sum_{i} p(x_i) \log p(x_i) \tag{3}$$

where p(x_i) is the probability of the i-th outcome. According to the above evaluation criteria, our model selects the enhanced data that are most similar to the original data to balance and expand the labeled samples.
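The two scores described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: features are summarised as discrete histograms, the "JS distance" is taken to be the standard Jensen-Shannon divergence, and the diversity score compares normalised per-feature entropy profiles. The paper does not spell out these details, and the helper `evaluate_enhancement` is a hypothetical name.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def normalise(values):
    """Scale non-negative values so they sum to 1 (requires a positive total)."""
    total = sum(values)
    return [v / total for v in values]


def evaluate_enhancement(original_hists, enhanced_hists):
    """Return (similarity, diversity); lower values mean closer to the original.

    Each argument is a list of per-feature histograms, each summing to 1.
    """
    # Similarity: mean JS divergence between matching feature histograms.
    pairs = list(zip(original_hists, enhanced_hists))
    similarity = sum(js_divergence(p, q) for p, q in pairs) / len(pairs)
    # Diversity (assumption): JS divergence between entropy profiles.
    h_orig = normalise([entropy(p) for p in original_hists])
    h_enh = normalise([entropy(q) for q in enhanced_hists])
    return similarity, js_divergence(h_orig, h_enh)


original = [[0.5, 0.5], [0.8, 0.2]]
candidate_a = [[0.45, 0.55], [0.75, 0.25]]   # stays close to the original
candidate_b = [[0.9, 0.1], [0.5, 0.5]]       # drifts away from the original
print(evaluate_enhancement(original, candidate_a))
print(evaluate_enhancement(original, candidate_b))
```

Under this reading, the enhancement method with the smallest pair of scores would be preferred, which matches the selection rule stated in the text.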

B. TRI-TRAINING BASED ON RANDOM FOREST'S FEATURE ASSIGNMENT
Tri-training is a prediction method based on the difference between classifiers. However, if the classification ability of one classifier is too weak, the overall classification performance will decrease and noise will be introduced. Therefore, all three classifiers should have similar classification ability. Inspired by this idea, random forest is introduced to calculate the contribution value of each feature of the dataset. The features are then divided into three groups with equal total contribution value, so that the three classifiers have similar classification performance while still differing from one another. At the same time, to improve the classification ability of the three classifiers, the three groups of features are combined in pairs, which gives three different combinations of two feature groups. Each classifier is then trained with one combination, so that tri-training has both difference and better classification ability. The structure of tri-training based on random forest feature assignment is shown in FIGURE 2.
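The grouping step can be sketched as follows. The paper does not state how the three equal-contribution groups are formed, so this example assumes a simple greedy partition over precomputed importance scores (for instance, the impurity-based importances reported by a random forest). The names `split_into_three` and `classifier_feature_sets` are hypothetical.

```python
from itertools import combinations


def split_into_three(importances):
    """Greedy partition of feature indices into three groups of similar total importance.

    Features are visited from most to least important, and each one is placed
    in the group whose running total is currently the smallest.
    """
    groups = [[], [], []]
    totals = [0.0, 0.0, 0.0]
    order = sorted(range(len(importances)),
                   key=lambda i: importances[i], reverse=True)
    for i in order:
        g = totals.index(min(totals))
        groups[g].append(i)
        totals[g] += importances[i]
    return groups, totals


def classifier_feature_sets(groups):
    """Pair the three groups: each classifier sees two of the three groups."""
    return [sorted(groups[a] + groups[b])
            for a, b in combinations(range(3), 2)]


# Hypothetical importance scores for six features (they sum to 1).
importances = [0.30, 0.25, 0.20, 0.10, 0.10, 0.05]
groups, totals = split_into_three(importances)
print(groups, totals)
print(classifier_feature_sets(groups))
```

Because every feature group appears in exactly two of the three pairings, each feature is seen by two classifiers, while no two classifiers share the same feature set.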

C. GAUSSIAN PROCESS ACTIVE LEARNING WITH DISTRIBUTION RANGE OF JS DISTANCE
The active learning module is composed of a Gaussian process and a classifier. A Gaussian process is a random process whose observed values appear in a continuous domain [24]. In a Gaussian process, every point in the continuous input space is associated with a normally distributed random variable, and each finite set of these random variables has a multidimensional normal distribution. That is to say, the distribution of a Gaussian process is the joint distribution of all these random variables.
Suppose there are N training points $x_1, \dots, x_N$. If the corresponding function values $f(x_1), \dots, f(x_N)$ obey a multivariate Gaussian distribution for every such finite set, then $f$ is said to be a Gaussian process, and the formula is:

$$[f(x_1), \dots, f(x_N)]^{T} \sim \mathcal{N}(\mu, K) \tag{4}$$

where $\mu$ is the mean vector and $K$ is the covariance matrix of the function values. The Gaussian process is usually used as a regression method, but its principle can also be applied to classification problems. Gaussian process regression can be used for binary classification by taking the positive and negative labels as outputs. The classification is performed by examining the mean of the predictive distribution: if the mean exceeds a certain threshold, the test point is classified as positive; otherwise, it is classified as negative. The predictive distribution of the Gaussian process is given by (5) and (6), where m(x) is the predicted mean and σ(x) is the predicted variance. For the predicted mean, the mean value of the labels of the two classes of samples is the decision boundary, so the sample nearest to the decision boundary can be found by calculating the difference between the predicted mean of each sample and the decision boundary. For the predicted variance, the sample with the highest degree of deviation, that is, the sample with the lowest confidence, can be found by comparing the predicted variances of the samples. According to the above derivation, three query strategies based on Gaussian processes are used to select the most informative samples [18].
(1) Select the sample closest to the classification boundary according to the mean value of the prediction probability, as given in (7).
(2) Select the sample with the lowest confidence according to the predicted probability variance, as given in (8).
(3) Select the sample whose category is most difficult to determine according to the variance and mean value of the prediction probability, using the cumulative distribution function of a standard Gaussian distribution N(0, 1), as given in (9).
In formulas (7)-(9), the candidate set is all unlabeled data, m(x) is the predicted mean and σ(x) is the predicted variance over all unlabeled data, and the probability term is the prediction probability of each unlabeled sample.
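The three strategies can be sketched over precomputed predictive means and variances. This is an illustration, not the formulas (7)-(9) of the paper: it assumes labels of +1 and -1 (so the decision boundary is 0), and the third score uses a probit-style probability Φ(m / sqrt(1 + σ²)), which is one common choice and is an assumption here. All function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard Gaussian N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def closest_to_boundary(means, boundary=0.0):
    """Strategy 1: index of the sample whose predicted mean is nearest the boundary."""
    return min(range(len(means)), key=lambda i: abs(means[i] - boundary))


def highest_variance(variances):
    """Strategy 2: index of the sample with the largest predicted variance."""
    return max(range(len(variances)), key=lambda i: variances[i])


def most_ambiguous(means, variances):
    """Strategy 3 (assumed form): class probability closest to 0.5."""
    def prob(i):
        return normal_cdf(means[i] / math.sqrt(1.0 + variances[i]))
    return min(range(len(means)), key=lambda i: abs(prob(i) - 0.5))


# Hypothetical predictive means and variances for four unlabeled samples.
means = [0.9, -0.1, 0.4, -0.8]
variances = [0.05, 0.10, 0.60, 0.02]
print(closest_to_boundary(means))
print(highest_variance(variances))
print(most_ambiguous(means, variances))
```

The third strategy differs from the first when a sample has a slightly larger mean but a much larger variance: the variance pulls its probability toward 0.5, so it can be ranked as more ambiguous.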
For classification tasks, traditional active learning uses QBC (query by committee). The principle of QBC is similar to that of tri-training: it uses two classifiers for prediction, and if the classification results are consistent, the predicted label is accepted. In this model, however, the function of QBC would overlap with tri-training and increase the model's redundancy. Therefore, we reuse the original classification results of tri-training and determine the class of a sample by combining the voting entropy of its tri-training classification results with its similarity to each class of labeled samples. For example, suppose the tri-training classification result of an unlabeled sample is biased toward the positive class and its similarity to the positive labeled samples is higher than a certain threshold. In that case, the unlabeled sample can be considered a positive sample.
Therefore, how to set the similarity threshold becomes a key point. Because all samples differ from one another, and even samples of the same class have different similarities, the similarity between samples with the same label should lie in a certain range. The JS distance between every pair of samples with the same label is calculated, and the similarity distribution range of each class is obtained from the maximum, minimum, and average of these JS distances. The average of the population maximum and population average is defined as the upper bound, and the average of the population minimum and population average is defined as the lower bound. The range between the lower bound and the upper bound is the similarity range. If the average JS distance between an unlabeled sample and the labeled samples of a class is within the similarity distribution range of that class, the unlabeled sample can be considered to belong to that class. FIGURE 3 shows the classification process of active learning.
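The bounds described above reduce to a few arithmetic steps. The sketch below assumes the pairwise JS distances within one class have already been computed; `similarity_range` and `within_range` are hypothetical helper names.

```python
def similarity_range(distances):
    """Lower and upper bound of the similarity range for one class.

    distances: pairwise JS distances between labeled samples of that class.
    lower = (minimum + average) / 2, upper = (maximum + average) / 2.
    """
    d_avg = sum(distances) / len(distances)
    lower = (min(distances) + d_avg) / 2
    upper = (max(distances) + d_avg) / 2
    return lower, upper


def within_range(avg_distance_to_class, distances):
    """True if an unlabeled sample's average JS distance falls inside the range."""
    lower, upper = similarity_range(distances)
    return lower <= avg_distance_to_class <= upper


# Hypothetical pairwise JS distances inside the positive class.
positive_distances = [0.1, 0.2, 0.3, 0.4]
print(similarity_range(positive_distances))
print(within_range(0.20, positive_distances))
print(within_range(0.35, positive_distances))
```

Note that the rule as stated also rejects samples whose average distance falls below the lower bound, not only those above the upper bound.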

D. THE TERMINATION STRATEGIES OF MODEL
One problem of tri-training is that when the three classifiers give inconsistent predictions for all remaining samples, the model cannot continue to train. Although the introduction of active learning helps tri-training complete the training, there will still be samples that cannot be queried by active learning. Therefore, a classifier trained on data with all features is introduced to predict the remaining unlabeled samples. Since most of the unlabeled samples have already been labeled in the previous tri-training and active learning processes, the prediction at this point becomes a traditional supervised classification problem.

IV. EXPERIMENTS AND RESULTS ANALYSIS
In this section, two groups of experiments are conducted to demonstrate the validity of the proposed quantitative enhanced data evaluation criteria and of the proposed model. First, the best enhanced data are selected according to the proposed evaluation criteria, and the prediction performance of tri-training trained on different enhanced data is compared to verify the validity of the selected enhanced data. After that, the proposed model is compared with other semi-supervised models on different datasets to verify whether it has better prediction performance on imbalanced small datasets.

A. DATASETS
So far, there is no publicly available and generally agreed benchmark dataset for semi-supervised classifiers, so researchers often use other common datasets for semi-supervised classification experiments. However, for some datasets whose labeled samples are not representative, their unlabeled samples are also unavailable [25]. In order to obtain a comprehensive statistical analysis and fairly compare the performance of the proposed model and the other listed models, we constructed two artificial datasets by using make_classification in sklearn(V.0.0). These two artificial datasets reflect problems encountered in actual prediction: the number of samples in each dataset is not more than 1000, and the imbalanced rate is 20%. Considering the noise common in datasets, redundant features are set in the artificial datasets.
To further explore the performance of the proposed model and its ability to solve practical problems, five commonly used UCI (University of California, Irvine) datasets were used in experiments [3]. These five datasets are CMC, Vehicle, WDBC, Diabetes and Heart Disease datasets. Note that, in view of the fact that there will be a large number of features in real problems, the Vehicle dataset with 18 features was used in the experiment.
The information of artificial datasets is shown in

B. EVALUATING INDICATOR
In traditional machine learning classification experiments, accuracy is usually used to evaluate the classification performance of the classifier. However, for imbalanced datasets, minority samples make up only a small proportion of the overall sample, so accuracy cannot adequately reflect the classification performance of the classifier on minority samples.
For imbalanced data, four descriptions, TP, TN, FP and FN are usually used for evaluation [2]. The meaning of these four descriptions is shown below.
•TP: True Positive. A positive sample is classified as a positive sample.
•TN: True Negative. A negative sample is classified as a negative sample.
•FP: False Positive. A negative sample is classified as a positive sample.
•FN: False Negative. A positive sample is classified as a negative sample.
Based on these four descriptions, the true positive rate (abbreviated to TPR) and the false positive rate (abbreviated to FPR) can be calculated as shown below.

$$TPR = \frac{TP}{TP + FN} \tag{10}$$

$$FPR = \frac{FP}{FP + TN} \tag{11}$$

where TPR is the proportion of actual positive samples that are predicted as positive, and FPR is the proportion of actual negative samples that are predicted as positive. The samples are sorted according to the prediction scores of the model and are predicted as positive one by one in that order. The FPR at each step is taken as the abscissa and the TPR as the ordinate for plotting, which gives the ROC curve. The AUC (Area Under Curve) is obtained by calculating the area under the ROC curve. AUC effectively measures the classification results of both positive and negative samples.
In addition, the F-measure is introduced to evaluate the model's generalization ability. The F-measure is calculated as shown below.

$$F = \frac{(1 + \beta^2) \cdot \frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}{\beta^2 \cdot \frac{TP}{TP + FP} + \frac{TP}{TP + FN}} \tag{12}$$

where TP/(TP + FP) is the precision rate, TP/(TP + FN) is the recall rate, and β is a parameter of the F-measure; in this paper, β is set to 1.
It can be seen from (12) that, with β = 1, the F-measure is the harmonic mean of the precision rate and recall rate.
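As a worked illustration of these indicators, the following sketch computes them from confusion-matrix counts and computes AUC with the pairwise-ranking formulation, which equals the area under the ROC curve. The counts and scores are made-up numbers for illustration only, and the function names are hypothetical.

```python
def confusion_rates(tp, tn, fp, fn):
    """TPR, FPR, precision and accuracy from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def f_measure(precision, recall, beta=1.0):
    """F-measure; with beta = 1 it is the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up imbalanced example: 10 positives and 90 negatives.
r = confusion_rates(tp=8, tn=85, fp=5, fn=2)
print(r)
print(f_measure(r["precision"], r["tpr"]))
print(auc([0.9, 0.8, 0.7, 0.3], [1, 0, 1, 0]))
```

In this example the accuracy is 0.93 while the F-measure is about 0.70, which illustrates why accuracy alone can overstate performance on the minority class.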

C. EFFECTIVENESS OF THE PROPOSED QUANTITATIVE ENHANCED DATA EVALUATION CRITERIA
Four of the most commonly used oversampling methods are selected for comparison according to the enhanced data evaluation criteria proposed in Section III-A. These four methods are SMOTE [26], Borderline-SMOTE [27], ADASYN [28] and SMOTETomek [29]. According to (2), the JS distance between the data generated by the four methods and the original data is shown in TABLE 3.
According to (2) and (3), the JS distance between the diversity of the data generated by the four methods and the diversity of the original data is shown in the TABLE 4.
It can be seen from TABLE 3 and TABLE 4 that compared with other enhanced data, the enhanced data of SMOTETomek and Borderline-SMOTE have the same similarity to the original data. However, when comparing similarity with the diversity of the original sample, the information entropy value of enhanced data of SMOTETomek is the most similar to that of the original data. Based on the above results, the enhanced data of SMOTETomek can be considered the best enhanced data.
In order to verify the effectiveness of the proposed quantitative enhanced data evaluation criteria, each of the above oversampling methods is used to enhance the training data, and the enhanced training data are used to train the tri-training. The experimental datasets are the WDBC and Vehicle datasets. In order to evaluate the effect of each kind of enhanced data on the prediction performance of tri-training under different proportions of labeled samples, the labeled samples of the experimental datasets are set to different proportions. The experimental results are shown in FIGURE 4 to FIGURE 6.
From FIGURE 4 to FIGURE 6, it can be seen that compared with other listed enhanced data, SMOTETomek's enhanced data can better improve the prediction performance of tri-training. This result proves the effectiveness of the proposed quantitative enhanced data evaluation criteria.
Combining the results of the above experiments, we believe that SMOTETomek is more suitable for data enhancement.

D. PERFORMANCE COMPARISONS BETWEEN PROPOSED MODEL AND OTHER SEMI-SUPERVISED METHODS
In order to evaluate the prediction performance of the proposed model, it is compared with the original XGBoost model [30], the tri-training model [7], the tri-training model with local convex optimization (abbreviated to TRLOC) [13], the tri-training model with Gaussian process (abbreviated to TRGP) [18], and the semi-supervised Gaussian processes active learning model based on tri-training and improved QBC without data enhancement (abbreviated to TRGPQ) [3]. In order to obtain a comprehensive comparison of the listed semi-supervised models, different proportions of labeled samples in the whole experimental sample are set, so that the predictive ability of the listed models can be explored under different ratios of labeled to unlabeled samples.
Aiming to compare the prediction ability and the ability to deal with practical problems of listed models more clearly, we conducted experiments on the artificial datasets and UCI datasets respectively. In order to obtain an overall evaluation effect and analyze model performance statistically, we calculated the average of the prediction results of each model after dealing with each dataset.

1) CROSS VALIDATION
Considering the small number of samples, aiming to avoid experimental errors, cross validation technology is applied in our semi-supervised model comparison experiment [32]. When setting different proportions of labeled samples and unlabeled samples for the experiment, each dataset is randomly assigned to labeled samples and unlabeled samples five times. We take the average of the five prediction results of the model as the overall prediction result of a proportion.
As an example, TABLE 5 shows the cross validation experimental prediction results and their average result of the proposed model on the heart disease dataset.
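The repeated random assignment described above can be sketched as follows. This is a minimal illustration that assumes the labeled proportion is taken over the whole dataset and that an `evaluate` callback (a hypothetical placeholder for training and scoring a model) returns one score per split. The function names and the fixed seed are illustrative and are not part of the original experiment.

```python
import random

def random_label_splits(n_samples, labeled_ratio, n_repeats=5, seed=0):
    """Return n_repeats random (labeled_idx, unlabeled_idx) partitions."""
    rng = random.Random(seed)
    n_labeled = max(1, round(n_samples * labeled_ratio))
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        labeled = sorted(indices[:n_labeled])
        unlabeled = sorted(indices[n_labeled:])
        splits.append((labeled, unlabeled))
    return splits

def mean_score(evaluate, n_samples, labeled_ratio, n_repeats=5, seed=0):
    """Average the score returned by evaluate(labeled, unlabeled) over all splits."""
    splits = random_label_splits(n_samples, labeled_ratio, n_repeats, seed)
    scores = [evaluate(labeled, unlabeled) for labeled, unlabeled in splits]
    return sum(scores) / len(scores)

# Labeled proportions from 10% to 50% in steps of 5%.
ratios = [r / 100 for r in range(10, 55, 5)]
```

Averaging over several random partitions reduces the influence of any single favorable or unfavorable split, which matters most when the labeled set is very small.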

2) EXPERIMENTAL RESULTS OF ARTIFICIAL DATASETS
We set the percentage of labeled samples to unlabeled samples from 10% to 50%, with each percentage increasing by 5%. The accuracy comparison of listed models on two artificial datasets is shown in TABLE 6.
The AUC comparison of listed models on the two artificial datasets is shown in TABLE 7.
The F-measure comparison of listed models on the two artificial datasets is shown in TABLE 8.
As shown in TABLE 6 to TABLE 8, when the number of labeled samples in the two artificial datasets is too small, the original model cannot obtain a good prediction effect, whereas the semi-supervised models significantly improve the prediction effect on such datasets, which supports the effectiveness of semi-supervised learning. Among them, the proposed model has the best average prediction effect on the listed datasets, which shows that its prediction ability is generally better than that of the other listed models. The advantage of the proposed model is most obvious on AUC, showing that the semi-supervised classifier trained on enhanced data predicts majority samples and minority samples equally well.
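For reference, the three reported metrics can be computed for a binary problem as in the sketch below. It assumes labels in {0, 1}, real-valued scores where larger means more likely positive, and a decision threshold of 0.5. The function names and the toy values are illustrative and are not data from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as one half); assumes both classes are present."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with made-up values.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
```

Because AUC is defined over positive-negative pairs, it does not depend on the class ratio, which is why it is informative for imbalanced data.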

3) EXPERIMENTAL RESULTS OF UCI DATASETS
We set the percentage of labeled samples to unlabeled samples from 10% to 50%, with each percentage increasing by 5%. The accuracy comparison of listed models on the five UCI datasets is shown in TABLE 9.
The AUC comparison of listed models on the five UCI datasets is shown in TABLE 10.
The F-measure comparison of listed models on the five UCI datasets is shown in TABLE 11.
In order to verify the significance of the experimental results in TABLE 12, we introduce the Friedman chi-square test for statistical analysis. In machine learning, the Friedman chi-square test is often used to examine whether different models perform the same. In this test, models with the same performance are considered to have the same rank value [32]. We assume that $k$ models are compared on $N$ datasets, and $r_i$ represents the average rank value of the $i$-th model. Assuming that the rank value of each model follows a normal distribution, the corresponding chi-square statistic is:

$$\tau_{\chi^2} = \frac{12N}{k(k+1)}\left(\sum_{i=1}^{k} r_i^2 - \frac{k(k+1)^2}{4}\right)$$

where $\tau_{\chi^2}$ obeys the chi-square distribution with $k-1$ degrees of freedom. Models whose statistic exceeds the critical value of the chi-square distribution and whose P-value is less than 0.05 can be considered to have significant differences. According to the chi-square distribution [...] other listed models in most cases, which shows that the proposed model can be effectively applied to practical problems.
Specifically, by comparing the average predicted result of each model, it can be seen from TABLE 12 that the proposed model obtains higher average accuracy and average AUC than the other listed models. This result is attributed to the strong classification ability of the proposed model. The proposed model performs much better on AUC in particular, which means that it can effectively predict both the majority samples and the minority samples. In addition, TABLE 12 shows that the proposed model also performs much better than the other listed models on F-measure. The highest F-measure of the proposed model can be attributed to data enhancement by SMOTETomek. The F-measure result shows that the proposed model has better generalization ability than the other listed models when dealing with practical problems.
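The Friedman chi-square statistic used in the significance analysis can be sketched as follows. The sketch assumes the standard textbook form of the statistic, $\frac{12N}{k(k+1)}\left(\sum_i r_i^2 - \frac{k(k+1)^2}{4}\right)$, with average ranks where rank 1 is the best model and tied models share the mean rank. The score matrix is made up for illustration, and no tie correction is applied.

```python
def average_ranks(scores):
    """scores[d][m] is the score of model m on dataset d (higher is better).
    Returns the average rank of each model over all datasets."""
    n_datasets = len(scores)
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda m: -row[m])
        pos = 0
        while pos < k:
            end = pos
            # Extend the block while the next model has the same score (a tie).
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            mean_rank = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
            for j in range(pos, end + 1):
                totals[order[j]] += mean_rank
            pos = end + 1
    return [t / n_datasets for t in totals]

def friedman_statistic(scores):
    """Friedman chi-square statistic (no tie correction)."""
    n = len(scores)
    k = len(scores[0])
    r = average_ranks(scores)
    return 12.0 * n / (k * (k + 1)) * (sum(x * x for x in r) - k * (k + 1) ** 2 / 4.0)

# Made-up scores: 4 datasets (rows) x 3 models (columns).
toy_scores = [
    [0.90, 0.80, 0.70],
    [0.95, 0.85, 0.75],
    [0.88, 0.80, 0.70],
    [0.92, 0.81, 0.79],
]
stat = friedman_statistic(toy_scores)
```

The resulting statistic is compared with the critical value of the chi-square distribution with $k-1$ degrees of freedom. In practice, SciPy's `scipy.stats.friedmanchisquare` can be used instead, which also returns a p-value.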
We think the proposed model performs better in the experiments than the other listed models because it combines the advantages of two main semi-supervised methods and strengthens the prediction ability of each module. During training, the enhancement of the training data increases the generalization ability of the model, so that the model has similar prediction ability for positive and negative samples; therefore, the proposed model exhibits the best AUC and F-measure in the experiments. In the iterative process of the proposed model, semi-supervised learning and active learning are combined to make the model both robust and diverse. At the same time, the feature assignment mechanism is used to strengthen the prediction ability of semi-supervised learning, and the JS range is used to avoid the impact of the uncertainty of active learning on prediction. These methods, which enhance the prediction ability of each module, make the proposed model perform better in the experiments than the other listed models.
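The feature assignment mechanism mentioned above can be illustrated with a small sketch. It assumes that per-feature importance scores (for example, from a random forest) are already available, and it uses a simple greedy rule to form three groups with roughly equal total importance. The greedy rule, the function names, and the importance values are assumptions made for illustration and are not necessarily the procedure used in the paper.

```python
from itertools import combinations

def balanced_feature_groups(importances, n_groups=3):
    """Greedily assign features (largest importance first) to the group
    whose total importance is currently smallest."""
    groups = [[] for _ in range(n_groups)]
    totals = [0.0] * n_groups
    order = sorted(range(len(importances)), key=lambda i: -importances[i])
    for i in order:
        g = totals.index(min(totals))
        groups[g].append(i)
        totals[g] += importances[i]
    return groups

def classifier_views(groups):
    """Each classifier sees the union of a different pair of feature groups."""
    return [sorted(groups[a] + groups[b]) for a, b in combinations(range(len(groups)), 2)]

# Hypothetical importance scores for six features.
importances = [0.30, 0.25, 0.20, 0.10, 0.10, 0.05]
groups = balanced_feature_groups(importances)
views = classifier_views(groups)
```

With three groups there are exactly three distinct pairs, so each of the three tri-training classifiers receives a different feature view while any two views still share one group.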
From the above experiments and analysis, the proposed model is the most effective semi-supervised learning method among the original model and the other listed semi-supervised models.

V. CONCLUSION
Aiming at the semi-supervised binary classification problem on imbalanced small datasets, a semi-supervised Gaussian processes active learning model based on improved tri-training with data enhancement is proposed. This model selects the best enhancement samples according to an original quantitative enhanced data evaluation criterion and enhances the training data with these samples. It then applies an improved tri-training based on the random forest's feature assignment to increase the robustness of the model. After that, Gaussian processes are introduced in active learning to select the most informative unlabeled samples and thereby increase the diversity of the model, and the distribution range of JS distance is used in active learning to help predict the most informative unlabeled samples. The model combines the advantages of tri-training and active learning, so it has stronger prediction ability than tri-training and its variants. The experimental results show that the proposed model is the most effective among the several traditional semi-supervised models compared. However, since the model is composed of several classification modules and each sample is calculated in detail, the computational complexity is slightly higher. In future work, we will try to reduce the computational complexity of the model.

CHENXIAO ZHOU received the bachelor's degree in automation from Wuchang Shouyi University, China, in 2020. He is currently pursuing the master's degree with the School of Electrical and Information Engineering, Wuhan Institute of Technology. His research interests include artificial intelligence, machine learning, few-shot learning, semi-supervised learning, and active learning.

LIANYING ZOU received the bachelor's degree in communication engineering and the master's and Ph.D. degrees in microelectronics and solid state electronics from the Huazhong University of Science and Technology, China, in 1998, 2003, and 2006, respectively. She is currently an Associate Professor with the School of Electrical and Information Engineering, Wuhan Institute of Technology. Her research interests include embedded system design, FPGA system design, and VLSI integrated circuit design. She has contributed to over 20 peer-reviewed publications in journals, such as Journal of Huazhong University of Science and Technology (Natural Science Edition) and the Journal of China Universities of Posts and Telecommunications.