An Efficient Approach to Select Instances in Self-Training and Co-Training Semi-supervised Methods

Semi-supervised learning is a machine learning approach that integrates supervised and unsupervised learning mechanisms. In this setting, most labels in the training set are unknown, while a small portion of the data has known labels. Semi-supervised learning is attractive due to its potential to use labeled and unlabeled data to perform better than supervised learning. This paper presents a study in the field of semi-supervised learning and implements changes to two well-known semi-supervised learning algorithms: self-training and co-training. In the literature, it is common for studies to change the structure of these algorithms; however, none of them proposes automating the labeling process of unlabeled instances, which is the main purpose of this work. In order to achieve this goal, three methods are proposed: FlexCon-G, FlexCon and FlexCon-C. The main difference among these methods is the way in which the confidence rate is calculated and the strategy used to select a label in each iteration. In order to evaluate the proposed methods' performance, an empirical analysis was conducted, in which these methods were evaluated on 30 datasets with different characteristics. The obtained results indicate that all three proposed methods perform better than the original self-training and co-training methods in most analysed cases.


I. INTRODUCTION
The technological progress of recent years has greatly promoted the availability of large amounts of data. Storage and communication resources have expanded exponentially, increasing the demand for more autonomous computational tools. These tools should automatically process a considerable amount of this data, thus reducing human intervention and dependence on experts [1].
Considering this scenario, machine learning techniques have gained considerable relevance, given that, based on past experiences, they are, by themselves, capable of creating a hypothesis or function capable of solving the problem to be addressed [2].
Depending on the degree of supervision used during the training phase, Machine Learning (ML) techniques can be divided into three categories: supervised, unsupervised and semi-supervised. In these three types, the ML algorithms learn from past experience and from the implicit knowledge present in existing datasets; however, what distinguishes them is the fact that the data which such algorithms use have information that may or may not be labeled. In supervised learning, for instance, during training, the classification algorithms receive, as input, instances that contain the desired output information (labels), representing the class to which each instance belongs. On the other hand, in unsupervised learning, the classes to which the training instances belong are not previously known [3]. Finally, semi-supervised learning makes it possible to train classifiers with a small amount of labeled data and a large amount of unlabeled data [4]. This last mentioned approach for learning has become widely used in recent years [5]- [11].
Semi-supervised learning uses the previously labeled instances to build its initial hypotheses and then combines the information obtained from those instances to label the unlabeled ones. The newly labeled instances are then inserted into the labeled dataset, and these instances will serve to classify the remaining instances of the unlabeled dataset. Among the several semi-supervised learning algorithms found in the literature, the two most popular are self-training [12] and co-training [13], which will be used as the basis for the development of the proposal in this work. The main difference between these two algorithms is that self-training uses the dataset with all attributes as input, while co-training divides the input dataset into two subsets with different views.
In real-world problems, datasets usually have few labeled instances and many unlabeled instances; dealing with datasets that have these characteristics is thus a central semi-supervised challenge. There are several ways to address this problem, including oversampling [14], [15] and semi-supervised algorithms [9], [11], [16]. In this paper, we adopt semi-supervised learning.
One of the problems related to semi-supervised learning is automated label assignment. Considering that this process is not an easy task, especially regarding the choice of the unlabeled instances to be labeled at each stage, the authors of [17] proposed the inclusion of a confidence parameter to guide the labeling process of the self-training algorithm. According to these authors, the general idea of using confidence in the automatic labeling process is to minimize the inclusion of noisy data, improving the overall classification accuracy.
By analyzing the above cited work, it is possible to observe a problem: the use of this confidence parameter in a static way may not exploit the full potential of a semi-supervised technique and may even increase the computational cost of the labeling process. In this way, instances with incorrect labels can be selected if the confidence parameter has a low value or instances with correct labels can be discarded if this parameter has a very high value.
As a consequence of the aforementioned problem, it can be observed that the use of a confidence parameter implies that it needs to be tuned, which is not always a trivial task. In addition, in order to obtain an effective labeling process, the confidence parameter value must be able to change during the labeling process (adaptive confidence value), given that the difficulty in selecting instances to label can change during this process and a static confidence value does not capture the different levels of difficulty. Thus, the problem mentioned above was the main motivation for the development of this paper, which presents three different ways to automatically adjust the confidence parameter. The proposed approaches are intended to reach a more efficient performance in wrapper semi-supervised learning algorithms. In this paper, the proposal is applied to two well-known semi-supervised methods, self-training and co-training.
This paper presents an extension of our previous works [8], [11]. In [8], an initial attempt to improve the efficiency of the selection process was presented. In that study, we presented three different ways to automatically adjust the confidence parameter in the self-training method, in order to improve its efficiency and to avoid a parameter tuning process. Our objective was to define ways of selecting a confidence value that was the most suitable for a specific dataset and, at the same time, adjustable during the labeling process. Later, in [11], we proposed an extension of the co-training algorithm, named Co-Training with Fixed Threshold (CTFT), which uses a confidence threshold as the selection criterion for unlabeled instances.
In this paper, we perform an exploratory investigation of our previously proposed methods (FlexCon, FlexCon-G and FlexCon-C) [8]. The main difference between this paper and the previous one is that we extended the proposed methodology to the co-training algorithm and increased to 30 the number of datasets used to investigate the performance of these methods in various contexts. The main contributions of this paper can be summarized as follows: 1) as mentioned earlier, this study performs a more robust analysis of our previously proposed methods, adding another algorithm and additional datasets; 2) our methods define ways to select a confidence value that is the most suitable for a specific dataset and, at the same time, adjustable during the labeling process, which is possible because the wrapper algorithm uses a dynamic confidence threshold to include new instances in the training set at each iteration; 3) we extend the instance selection strategies to the co-training algorithm, which can effectively improve semi-supervised learning performance. This paper is organized as follows. Sections II and III describe the fundamental concepts and some research related to the subject of this paper, while a detailed description of the proposed methods is given in Section IV. Section V describes the methodology used in the empirical analysis. Sections VI and VII present the experiments, the results provided by the empirical analysis and some discussions about them. Finally, Section VIII presents some conclusions and future work.

II. THEORETICAL REFERENCE
This section presents the main concepts related to this paper, semi-supervised learning and classification algorithms.

A. SEMI-SUPERVISED LEARNING
The semi-supervised learning mechanism handles partially labeled data in order to achieve higher classification levels [18]. Semi-supervised learning considers the set of patterns D to be divided into two subsets: 1) labeled data D_L = {(x_i, y_i) | i = 1, ..., l}, where x_i is a pattern, y_i is the known label for x_i and l is the number of labeled instances; and 2) unlabeled data D_U = {x_j | j = l + 1, ..., l + u}, where x_j is a pattern and u is the number of unlabeled instances. Usually, |D_U| ≫ |D_L|. One of the advantages of semi-supervised learning is the potential to reduce the need for large amounts of labeled data, particularly in domains where only a small set of labeled patterns is available. In specific domains, where there are no previously labeled datasets, it is common for a specialist to label the data manually. Therefore, another advantage of this type of learning can be seen when an expert only has knowledge about some patterns of a given dataset, having, as a consequence, great difficulty labeling instances in order to increase the training set [3].
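The D_L/D_U split described above can be illustrated with a short Python sketch. This is only an illustrative helper, not code from the paper; the function name, the 10% default and the seeding are our assumptions.

```python
import random

def split_semi_supervised(X, y, labeled_fraction=0.1, seed=42):
    """Split a dataset into a small labeled set D_L and a large
    unlabeled set D_U, as assumed in semi-supervised learning.

    Returns (D_L, D_U), where D_L is a list of (pattern, label)
    pairs and D_U is a list of patterns whose labels are hidden."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    n_labeled = max(1, int(labeled_fraction * len(X)))
    D_L = [(X[i], y[i]) for i in indices[:n_labeled]]
    D_U = [X[i] for i in indices[n_labeled:]]
    return D_L, D_U
```

With `labeled_fraction=0.1`, the resulting sets satisfy |D_U| ≫ |D_L|, matching the usual semi-supervised scenario.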
In the literature, it is possible to find several algorithms that handle semi-supervised datasets, including: self-training, mixture models, co-training, graph-based methods and semi-supervised support vector machines [4]. However, for the development of this work, we selected both self-training [12], considering that the work which served as the basis for this research used it, and co-training [13], because of the similarity of its classification process to self-training, differing in the use of multi-view learning.

1) Self-training Algorithm
Self-training is probably the oldest approach to classifying unlabeled data from a smaller subset of already labeled data. As defined in the feature selection area, a wrapper algorithm uses a supervised method to guide its choices [19]. Accordingly, self-training is a wrapper algorithm that repeatedly uses a supervised learning method, starting by training only on the labeled data. At each step, some portion of the unlabeled instances is labeled according to the current decision function. Then, the supervised method is retrained using its own predictions as additional labeled instances [3].
In the self-training procedure, initially, a classifier is trained with a small portion of labeled data. Next, this classifier is used to classify the unlabeled data. The instances labeled with the highest confidence, along with their predicted labels, are added to the training set. The classifier is then retrained and the procedure is repeated until the unlabeled dataset is empty.
Algorithm 1 presents the sequence of steps performed to carry out this process. From this algorithm, it should be noted that the classifier uses its own predictions to learn (hence the name self-training) [4].

Algorithm 1 Self-training
Input: labeled data {D_L} and unlabeled data {D_U}
1: repeat
2:   Train classifier C based on {D_L};
3:   Apply C on the instances in {D_U};
4:   Remove a subset S = {s_1, s_2, ..., s_n} from {D_U}, containing the instances with the greatest confidence value;
5:   Add the subset S, with its predicted labels, to {D_L};
6: until {D_U} = ∅
Output: labeled data
In [17], an extension of the self-training algorithm was implemented with the addition of a confidence parameter to be used as a threshold for including new instances in the labeled dataset. Therefore, any instance whose confidence of the prediction is greater than or equal to the minimum confidence rate for new instances to be included (threshold) will be added to the labeled dataset.
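The threshold-based self-training loop can be sketched in Python. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the base learner is a toy 1-D nearest-centroid classifier (the helpers `centroid_fit` and `centroid_predict` are ours), confidence is a normalised inverse distance, and the fixed threshold plays the role of the minimum confidence rate from [17].

```python
def centroid_fit(labeled):
    """Toy base learner: one mean per class over 1-D patterns."""
    sums, counts = {}, {}
    for x, label in labeled:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def centroid_predict(centroids, x):
    """Return (label, confidence), where confidence is the
    normalised inverse distance to the winning centroid."""
    inv = {c: 1.0 / (abs(x - m) + 1e-9) for c, m in centroids.items()}
    label = max(inv, key=inv.get)
    return label, inv[label] / sum(inv.values())

def self_training(labeled, unlabeled, threshold=0.9, max_iter=20):
    """Self-training with a fixed confidence threshold: an unlabeled
    instance joins the labeled set only when its prediction
    confidence reaches the threshold."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_iter):
        if not unlabeled:
            break
        model = centroid_fit(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            label, conf = centroid_predict(model, x)
            (confident if conf >= threshold else remaining).append((x, label))
        if not confident:
            break  # nothing reaches the threshold; stop instead of looping
        labeled.extend(confident)      # retrain on own predictions next pass
        unlabeled = [x for x, _ in remaining]
    return labeled, unlabeled
```

Note how instances near a class centroid are absorbed in the first iteration, while more ambiguous ones only pass the threshold after the centroids have been refined by earlier additions.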

2) Co-training Algorithm
The co-training algorithm, initially proposed by [13], is similar to self-training, given that it increases the set of labeled data by iteratively classifying the set of unlabeled data and moving the most reliable prediction instances to the set of labeled data. However, differently from self-training, two complementary classifiers are simultaneously generated, fed with two different views of the set of attributes A = {a_1, ..., a_n}, where the subsets A^(1) and A^(2) represent View 1 and View 2, respectively. In addition, A^(1) ∪ A^(2) = A and A^(1) ∩ A^(2) = ∅. In other words, when an attribute is included in one view, it cannot be included in the other, because both views are mutually exclusive.

Figure 1 shows an example of two possible views generated by co-training, in which X = {x_1, x_2, ..., x_6} represents the instances of a dataset, x_ij corresponds to the j-th attribute of the i-th instance, a_j^T represents a column vector with the j-th attribute of all instances, and y_i is the label of the i-th instance. Notice that View 1 comprises the attributes a_1 and a_4 and View 2 is composed of the attributes a_2 and a_3. Besides, Views 1 and 2 each contain two subsets, namely labeled and unlabeled data. In this approach, the two complementary classifiers proceed as follows: after generating the two views of the data, the prediction of the first classifier is used to increase the labeled dataset available for the second classifier and vice versa [20]. In other words, the prediction of one classifier is presented to the other (and vice versa) and their outputs are combined.

At the beginning, two views are created; then two supervised classifiers (C_1 and C_2) are generated based on the labeled data of each of the two views. The next step is to classify the unlabeled data of each view, using classifier C_1 to label View 1 and C_2 to label View 2. Afterwards, the instances with the highest confidence in the prediction of classifier C_2 will be added to the labeled dataset of View 1. In the same way, View 2 will receive the instances with the highest confidence in the prediction of classifier C_1. This process is repeated until the set of unlabeled data is empty. Algorithm 2 corresponds to the series of steps performed to carry out this process.

Algorithm 2 Co-training
Input: labeled data D_L = [D_L^(1), D_L^(2)] and unlabeled data D_U = [D_U^(1), D_U^(2)], built from the two views A^(1) and A^(2) of the attribute set A = {a_1, a_2, ..., a_n}, with A^(1) ∩ A^(2) = ∅
1: repeat
2:   Generate classifiers C^(1) and C^(2) based on training data D_L^(1) and D_L^(2), respectively.
3:   Classify unlabeled data D_U^(1) and D_U^(2) using classifiers C^(1) and C^(2), respectively.
4:   Add to the set D_L^(2) the instances classified by C^(1) with a prediction confidence equal to the maximum confidence value achieved.
5:   Add to the set D_L^(1) the instances classified by C^(2) with a prediction confidence equal to the maximum confidence value achieved.
6:   Remove these instances from the unlabeled dataset.
7: until {D_U} = ∅
Output: labeled data
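The view-swapping loop can be sketched as follows. This is a minimal illustration under strong simplifying assumptions, not the authors' code: each instance is a pair of 1-D views, the per-view base learner is a toy class-mean model, and each iteration each classifier passes only its single most confident prediction to the other view's labeled set.

```python
def fit_centroids(labeled):
    """Toy per-view base learner: one class mean per label (1-D)."""
    sums, counts = {}, {}
    for x, label in labeled:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def predict(centroids, x):
    """Return (label, confidence) via normalised inverse distance."""
    inv = {c: 1.0 / (abs(x - m) + 1e-9) for c, m in centroids.items()}
    label = max(inv, key=inv.get)
    return label, inv[label] / sum(inv.values())

def co_training(L1, L2, U, max_iter=50):
    """Co-training sketch: U holds (view1, view2) pairs. Per iteration,
    each view's classifier labels its most confident instance and that
    instance, with its predicted label, is added to the *other* view's
    labeled set, mirroring steps 4-5 of Algorithm 2."""
    L1, L2, U = list(L1), list(L2), list(U)
    for _ in range(max_iter):
        if not U:
            break
        C1, C2 = fit_centroids(L1), fit_centroids(L2)
        scored = []
        for v1, v2 in U:
            lab1, conf1 = predict(C1, v1)
            lab2, conf2 = predict(C2, v2)
            scored.append(((v1, v2), lab1, conf1, lab2, conf2))
        best1 = max(scored, key=lambda s: s[2])  # C1's best feeds L2
        best2 = max(scored, key=lambda s: s[4])  # C2's best feeds L1
        L2.append((best1[0][1], best1[1]))
        L1.append((best2[0][0], best2[3]))
        U = [p for p in U if p not in {best1[0], best2[0]}]
    return L1, L2
```

In a realistic setting the base learners would be full classifiers over disjoint attribute subsets, but the exchange of confident predictions between views is the same.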

B. CLASSIFICATION STRUCTURES
As it can be observed in the previous section, the semi-supervised methods apply a classification algorithm in order to label the unlabeled instances. Therefore, in this section, a brief description of classification structures is presented.
In the context of machine learning, a classifier is a technique that formulates a hypothesis based on a sample of data [21]. In a credit analysis scenario, for example, a classification system creates a model from existing customer data, uses it to calculate the risk of a new application, and then decides whether or not to approve the requested credit [22].
Several algorithms can be used in classification systems, the main differences refer to the learning strategy, representation language, and the amount of knowledge previously used [23]. The classifiers used in this work are namely: Naive Bayes [24], decision tree [25], Ripper [26] and k-NN [27].
The increased complexity and wide applicability of classification systems has led to exploring many approaches and methodologies. Nevertheless, there is a perception that no classifier is considered completely satisfactory for a particular task; therefore, the idea of combining different methods to improve performance has emerged as a very promising possibility [28]. This combination is called classifier ensembles, also known as multi-classifier systems.
In classification tasks, an ensemble includes several submodels called base classifiers, which are usually obtained by training a basic learning algorithm (decision tree, neural network, k nearest neighbors, among others). The ensembles can be built based on the same learning algorithm, producing homogeneous ensembles, or using different algorithms and generating heterogeneous ensembles [29].
The purpose of classifier ensembles is to create and combine several inductive models for the same domain, obtaining better prediction quality [30]. After generating the set of base classifiers, the next step is choosing the method to combine their outputs. There is a vast number of methods for combining classifiers in the literature.
In this work, two well-known combination methods will be used: sum and voting. These methods were chosen because they use the information from all classifiers. Sum is one of the simplest and most widely used combination methods. In this method, once the base classifiers have generated their outputs (degrees of membership to each of the classes) for an instance, the outputs of all classifiers are added up for each class, and the winning class is the one with the highest total value. The voting method, also often used to combine classifiers, performs the combination by voting on the results of each classifier when a new instance is presented.
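The two combination rules can be sketched in a few lines. This is an illustrative sketch, not the paper's code; it assumes each base classifier outputs a dict of class membership degrees, and the function names are ours.

```python
from collections import Counter

def sum_rule(outputs):
    """Sum combination: `outputs` is a list of per-classifier dicts
    mapping class -> membership degree. The winning class is the one
    with the highest summed degree across all classifiers."""
    totals = Counter()
    for out in outputs:
        totals.update(out)  # Counter.update adds the degrees per class
    return max(totals, key=totals.get)

def majority_vote(outputs):
    """Voting combination: each classifier votes for its top class;
    the most voted class wins."""
    votes = Counter(max(out, key=out.get) for out in outputs)
    return max(votes, key=votes.get)
```

Note that the two rules can disagree: a classifier that is very confident in one class can tip the sum rule even when it is outvoted, which is exactly why both are worth comparing.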

III. RELATED WORK
In the literature, there are several works in the field of semi-supervised learning [31]-[33]. Some studies address the self-training and co-training algorithms, seeking to define ways to evaluate them [34] or to solve problems [16], [35]-[38]; others change the structure of these algorithms, creating extensions or new algorithms [6], [7], [9], [17], [39]-[41]; still others use them in different applications [42]-[44]. The research developed in this article fits into the group of studies which change the structure of these algorithms.
In [31], a novel neural embedding matching (NEM) method was proposed to tackle domain adaptation by enforcing consistent class-wise cross-domain instance distributions in the embedding space. According to the authors, the proposed method is a progressive learning strategy, which can improve semi-supervised learning effectively. In [32], the authors proposed a multiple kernel active learning algorithm, which incorporates distribution matching with multiple kernel learning as a group lasso into uncertainty. In [33], the authors proposed two novel semi-supervised frameworks which combine a secondary screening algorithm and semi-supervised learning to guarantee the diversity of samples. They are named the syncretic one-fold secondary screening algorithm and semi-supervised learning framework (OFSS-SL) and the syncretic multiple secondary screening algorithms and multiple-verification semi-supervised learning framework (MSS-MVSL).
In [34], the objective is to use a decision tree as a base learner for the self-training algorithm. The research clarified that, by improving the probability estimation of the decision tree, self-training improves its performance. In [35], a semi-supervised approach was proposed which adapts active learning to the co-training algorithm as a way to classify hyperspectral images, automatically selecting new training samples from unlabeled pixels. The effectiveness of the proposed approach is validated by using a probabilistic support vector machine classifier.
Among the studies that solve problems in semi-supervised methods, in [36], the objective is to design a semi-supervised learning model for the NER (Indonesian Named Entity Recognition) system. NER aims to identify and classify an entity based on its context; however, few instances have a label. Therefore, co-training, as a semi-supervised learning model, was used to handle unlabeled data in the NER learning process and to produce new labeled data that can be applied to improve a new NER classification system. Additionally, in [37], a new approach is proposed for classifying students in an academic credit system, combining transfer learning and co-training. The resulting model can effectively predict the study status of a student enrolled in a given educational program, which is done by means of a classification model enhanced by transfer learning and co-training techniques operating on educational data from another program.
In [38], the authors proposed a semi-supervised framework that uses an ensemble to identify wrongly classified instances. This approach uses these instances, along with self-training and F-measure estimations, to improve the ensemble. In addition, this framework is based on error detection and unsupervised accuracy estimation, which are combined to build an ensemble to classify instances. The obtained results show very accurate estimations for error and F-measure. In [16], a semi-supervised approach was proposed for recommendation attack detection, based on the Co-forest algorithm, that presents two main aspects: i) a feature extraction method; and ii) a detection method. The feature extraction method uses a window to extract the features used to train the detection method. The detection step indicates whether a profile is genuine or an attack. In order to determine the type of profile, an ensemble must be fitted using the Co-forest approach.
As previously mentioned, it is possible to find in the literature different studies which make changes in the structure of the self-training and co-training algorithms in order to create new algorithms or extensions [6], [7], [9], [17], [39]-[41]. In [6], for instance, the authors developed a new self-training style algorithm which explores multiple hypotheses to optimize the self-labeling process. This process uses a graph-based transductive method to generate reliable predictions. In contrast to standard self-training, the proposed algorithm uses labeled and unlabeled data as a whole to label and select unlabeled instances to increase the training set. According to the authors, the proposed algorithm has several interesting properties, among them: it can generate more reliable labels for unlabeled data and has a strong tolerance to noise in the training set.
Following the previous context, in [7], the authors proposed a framework for semi-supervised self-training classification in which the structure of the data space distribution (spherical or non-spherical) is integrated into the self-training process. This framework consists of two main parts: one is discovering the real structure of the entire data space by searching for and locating peaks in data density; the other is the integration of the actual structure of the entire data space into the self-training process to iteratively train a classifier.
Continuing with some more examples of studies that change the structure of the self-training algorithm, in [17], the authors proposed four semi-supervised methods, based on self-training, which can be applied to multi-label classification problems. The main idea of these methods is to minimize the randomness with which the instances are chosen in the labeling process. Thus, the main objective of the work is to use a confidence parameter in the automatic data labeling process, aiming to minimize the inclusion of noise and thereby improve the overall classification accuracy. Therefore, only instances whose output labels have confidence values above a confidence threshold are taken into account. In this sense, this confidence threshold (a value between 0 and 1) was used to control the automatic label assignment in the semi-supervised learning process.
Among the studies that create new algorithms or extensions for co-training, in [40], the Deep Co-training method was created, which is based on deep learning and inspired by the structure of co-training. Deep Co-training trains several deep neural networks to be used as the different views required for co-training to work and explores adversarial examples to encourage differences between the views, in order to prevent the networks from collapsing into each other. In [9], a method called multi-co-training was proposed, whose objective is to improve the performance of document classification. Documents are transformed using three methods of document representation to increase the variety of attribute sets for classification.
In [41], a semi-supervised learning algorithm is introduced, combining co-training with a Support Vector Machine (SVM) classification algorithm. By using an interactive learning procedure, the new final labeled dataset can be determined based on unlabeled datasets, training two SVM classifiers.
Finally, some studies use the self-training method in different application domains [42], [44]. For example, in [44], the authors proposed a semi-supervised learning model for the sentiment analysis problem called interpolative self-training. As the name indicates, this model is an extension of the self-training algorithm, and its main difference is the concatenation of training and test data. In [42], the authors presented a new version of self-training used in the recognition of manuscripts, which is based on neural networks.

IV. THE PROPOSED METHODS
As discussed earlier, the main objective of this work is to extend the semi-supervised algorithms self-training and co-training, aiming to improve their efficiency. Therefore, the proposed approach consists of including a confidence parameter, adjustable at each iteration, which can be used as a threshold for the inclusion of new instances in the labeled dataset.

A. CONFIDENCE ADJUSTMENT METHODS
As mentioned earlier, in traditional semi-supervised learning algorithms, a single classifier is iteratively trained with a growing set of labeled data. The process starts with a small portion of labeled instances, adding new instances at each iteration. However, the automatic label assignment process is a difficult task and the main issue is related to the choice of the unlabeled instances to be labeled.
In [17] a confidence parameter was included in the labeling process. In the extension of the self-training presented in the cited work, unlabeled instances whose confidence rate of the prediction is greater than the confidence threshold are added to the training set, along with their predicted labels. However, in the above cited algorithm, the authors used a static value for the minimum confidence rate, which may not take advantage of the full potential of a semi-supervised method. Hence, if the parameter has a low value, it is possible for instances with incorrect labels to be selected. If the parameter is too high, on the other hand, it is possible that instances with correct labels are discarded.
In this context, the objective of this work is to provide flexibility in the confidence threshold to include new instances in the labeled dataset, allowing the labeling process to be carried out according to a real practical scenario. In this work, several methods will be proposed to calculate, at each iteration, the confidence threshold for the inclusion of new unlabeled instances, among them: FlexCon-G, FlexCon and FlexCon-C. These three methods will be described in the next subsections.

1) FlexCon-G Method
The FlexCon-G method (Flexible Confidence with Graduation) adjusts the confidence value gradually, with the user initially setting a high confidence threshold (value close to 100%) to be used in the first iteration. Then, a fixed rate (d) is defined and, at each iteration, the confidence threshold is decreased by this rate d. The confidence threshold of the current iteration, conf(t_{i+1}), is described in Equation (1):

conf(t_{i+1}) = conf(t_i) - d    (1)

where conf(t_i) is the confidence value of the previous iteration and d is the rate at which the confidence threshold decreases. The purpose of this method is to start with a high threshold (confidence value) and to decrease it gradually throughout the labeling process. In the initial phase, the classification algorithms have a small set of labeled data (D_L) and a restrictive threshold is defined. As the labeling process progresses, the labeled dataset (D_L) increases and a lower confidence limit is then used.
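The resulting schedule of thresholds can be sketched in a few lines of Python. The function name and the clipping at zero are our assumptions; the decrement by a fixed rate d per iteration follows the description above.

```python
def flexcon_g_schedule(initial_conf=0.95, d=0.05, n_iterations=5):
    """FlexCon-G threshold schedule: start high and decrease the
    confidence threshold by a fixed rate d at each iteration,
    conf(t_{i+1}) = conf(t_i) - d, never dropping below zero."""
    conf = initial_conf
    schedule = [conf]
    for _ in range(n_iterations - 1):
        conf = max(0.0, conf - d)  # clip so the threshold stays valid
        schedule.append(conf)
    return schedule
```

For example, starting at 0.95 with d = 0.05 yields the thresholds 0.95, 0.90, 0.85, 0.80, 0.75 over five iterations.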
In addition, each instance included in the labeled dataset has its label defined by the output of the used classifier, in the same way as the semi-supervised learning methods in their original form do.

2) FlexCon Method
In the method for adjusting confidence in a flexible way, called FlexCon (Flexible Confidence), the equation that calculates the new confidence rate is based on three different aspects: 1) the confidence rate of the previous iteration; 2) the accuracy of a classifier which uses the instances labeled in the previous iteration as a training set (with the labels predicted by the classifier) and the initially labeled dataset as a test set; 3) the percentage of instances labeled in the previous iteration. Based on these aspects, Equation (2) calculates the arithmetic mean of the three parameters mentioned, being then defined as:

conf(t_{i+1}) = ( conf(t_i) + (1/|L_t|) Σ_{j=1..|L_t|} prec(s_j) + |L_t|/|D_U| ) / 3    (2)

where conf(t_{i+1}) is the confidence value of the current iteration, conf(t_i) is the confidence value of the previous iteration, |L_t| is the number of instances labeled in the previous iteration t, s_j is each of the patterns in the set L_t, prec(s_j) is the precision of the s_j pattern at time t_i, and |D_U| is the number of instances in the unlabeled dataset.
This method is aimed at adjusting the confidence threshold based on its value in the previous iteration, as well as on the precision (prec) and coverage (|L_t|/|D_U|) of the classifiers generated from the information of the previous iteration. Figure 2 presents an example of calculating the confidence threshold using FlexCon. Assuming that the second FlexCon iteration will be performed, information about the first iteration is needed (box on the left side of the figure). The box on the left displays three pieces of information: 1) the value of the confidence threshold in the first iteration (conf(1)); 2) the labeled dataset, with two columns: instance (x_1, x_2, ..., x_5) and label (A or B); 3) the classifier prediction for the still unlabeled dataset, with the following columns: instance (x_6, x_7, ..., x_11), predicted label (A or B) and the confidence of the classifier's prediction, which lies in the [0, 1] interval and represents the certainty of the classifier that the instance belongs to the predicted class. Still in the left box, the instances of the unlabeled dataset that are shaded (x_6, x_7 and x_9) are those whose prediction confidence values are greater than or equal to the confidence threshold (conf(1)), in other words, the instances that will be included in the labeled dataset. The boxes on the right show the calculation of the new confidence threshold value that will be used in the second iteration, according to Equation (2).
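The FlexCon update can be sketched as the arithmetic mean of the three components listed above. The exact algebraic form of Equation (2) is our reconstruction from the surrounding description, and the function name and argument layout are assumptions.

```python
def flexcon_confidence(prev_conf, precisions, n_unlabeled):
    """FlexCon threshold update (our reconstruction): the arithmetic
    mean of (1) the previous threshold, (2) the mean precision of the
    instances labeled in the previous iteration, and (3) the fraction
    of the unlabeled set that was labeled in that iteration.

    `precisions` holds prec(s_j) for each pattern s_j in L_t, so
    len(precisions) plays the role of |L_t|."""
    n_labeled = len(precisions)
    mean_precision = sum(precisions) / n_labeled
    coverage = n_labeled / n_unlabeled  # |L_t| / |D_U|
    return (prev_conf + mean_precision + coverage) / 3.0
```

For example, with a previous threshold of 0.9, per-pattern precisions of 1.0, 0.5 and 0.75, and 6 unlabeled instances, the new threshold is (0.9 + 0.75 + 0.5) / 3 ≈ 0.72.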
As can be seen, the labeling process has two steps: instance selection (from all classes) and label assignment. The confidence rate is used in the selection of the instances to be labeled, while the labeling itself can be done by a classification method. In this work, we use two classification structures: individual classifiers and classifier ensembles. When using ensembles, one way to further improve the performance of the labeling process is to use an ensemble composed of the classifiers built in all iterations carried out up to the current one. For example, in the fifth iteration, the classifier ensemble is composed of four individual classifiers, built in the previous four iterations. These individual classifiers are combined in two ways, leading to two different versions of this method: (1) simple sum (FlexCon(s) version), in which the confidence values provided by the individual classifiers are aggregated per class and the label with the highest aggregated value is selected; and (2) majority vote (FlexCon(v) version), electing the most voted label among those predicted by the individual classifiers. The majority vote and sum methods have been selected in this work since they are robust and widely used in the literature to combine classifier results. However, there are several other, potentially more efficient, combination approaches, such as alpha integration for an optimal fusion of individual classifiers [45] and stacking [46], among others. The use of more efficient combination methods is the subject of an ongoing investigation.
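The two combination rules can be sketched as follows. This is an illustration under our own naming, assuming each classifier reports a per-class confidence dictionary:

```python
from collections import Counter

def combine_sum(prob_dicts):
    """Sum rule (FlexCon(s)): add the per-class confidence values
    reported by each classifier and pick the class with the highest
    aggregated value."""
    totals = Counter()
    for probs in prob_dicts:
        totals.update(probs)          # Counter adds values per key
    return max(totals, key=totals.get)

def combine_vote(predictions):
    """Majority vote (FlexCon(v)): pick the most frequent label
    among the individual classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# three classifiers (from previous iterations) score one instance
probs = [{"A": 0.6, "B": 0.4}, {"A": 0.3, "B": 0.7}, {"A": 0.8, "B": 0.2}]
combine_sum(probs)             # sums: A = 1.7, B = 1.3 -> "A"
combine_vote(["A", "B", "A"])  # two votes for "A" -> "A"
```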

3) FlexCon-C Method
In the method for adjusting confidence in a flexible way using classifiers, FlexCon-C (Flexible Confidence Classifier), the confidence threshold is adjusted by increasing or decreasing its value based on a change rate (cr). The confidence threshold is decreased by the change rate (cr) when the classifier accuracy (acc) is greater than the minimum acceptable precision (mp), and increased by cr when the classifier accuracy is less than the minimum acceptable precision. When the classifier accuracy (acc) lies within an acceptable range around the minimum precision (mp), the confidence threshold is not changed and the confidence rate for the current iteration remains the same as in the previous iteration. In all cases, it is necessary to consider, as a safety margin, an acceptable variation (e) of the minimum precision.
Equation (3) presents the calculation of the confidence in the current iteration, conf(t_{i+1}), using FlexCon-C:

conf(t_{i+1}) = conf(t_i) - cr,  if acc > mp + e
conf(t_{i+1}) = conf(t_i) + cr,  if acc < mp - e
conf(t_{i+1}) = conf(t_i),       if mp - e <= acc <= mp + e   (3)

where conf(t_{i+1}) is the confidence value in the current iteration; mp is the minimum acceptable precision; cr is the change rate; acc is the classifier accuracy obtained in the previous iteration; and e represents an acceptable variation in precision. Figure 3 shows an example of calculating the confidence threshold using FlexCon-C. For this purpose, the minimum precision value is defined as the accuracy obtained on the initially labeled dataset, which is used for both training and testing (box on the upper left side of Figure 3). Assuming that the second iteration of FlexCon-C will be performed, the information related to the first iteration is required, which is shown in the box on the lower left side: 1) The confidence threshold value in the first iteration (conf(1)); 2) The labeled dataset, with two columns: instance (x_1, x_2, ..., x_5) and label (A or B); 3) The classifier prediction for the still unlabeled dataset, with the following columns: instance (x_6, x_7, ..., x_11), predicted label (A or B) and the confidence of the classifier's prediction, which must lie in the [0, 1] interval and represents the certainty with which the classifier assigns the instance to the predicted class. The calculation of the confidence threshold for the second iteration starts by computing the accuracy achieved by the classifier in the previous iteration (acc), which is obtained using the dataset labeled in the first iteration as a training set and the data initially labeled as a test set (upper left box). Assuming that the user defined the values of e and cr as 0.01 and 0.05, respectively, acc is less than mp - e and, therefore, the confidence rate will be increased for the second iteration (bottom right of Figure 3).
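The update rule of Equation (3) can be sketched as follows (an illustration under our own naming; the default values of cr and e follow the example above):

```python
def flexconc_confidence(prev_conf, acc, mp, cr=0.05, e=0.01):
    """Sketch of Equation (3): decrease the threshold when accuracy is
    comfortably above the minimum acceptable precision, increase it
    when clearly below, and keep it unchanged inside the tolerance
    band mp - e <= acc <= mp + e."""
    if acc > mp + e:
        return prev_conf - cr
    if acc < mp - e:
        return prev_conf + cr
    return prev_conf

# accuracy well below mp: the threshold is raised by cr
flexconc_confidence(0.95, acc=0.80, mp=0.90)   # -> 1.00
```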
Now, concerning the definition of labels for unlabeled instances (labeling process), this method was divided into two sub-methods: FlexCon-C1 and FlexCon-C2. FlexCon-C2 uses the label provided by the classifier generated in the first iteration. This choice stems from the fact that the training set of that classifier is formed by the data initially labeled; in other words, the labels predicted by a classifier whose training set has all the correct labels are expected to be more reliable.
FlexCon-C1 uses classifier ensembles to define the label for each pattern. Such ensembles are composed of the classifiers built in all iterations carried out up to the current one; for example, in the fifth iteration, the ensemble is formed from the outputs of the classifiers of the previous 4 iterations. In order to evaluate the ensemble in different ways, these individual classifiers are combined by two combination methods, sum and majority vote, leading to two versions of this method, FlexCon-C1(s) and FlexCon-C1(v), respectively.

V. EXPERIMENTAL METHODOLOGY
The experimental methodology used to apply the methods proposed in this work is similar to the labeling process used by the self-training and co-training algorithms. Figure 4 represents how the data are trained and labeled according to the self-training process, using the methods explained above, while Figure 5 corresponds to the co-training process. The dashed blocks represent the main difference between these processes and the original self-training and co-training algorithms.

A. EXPERIMENTAL METHODOLOGY USING SELF-TRAINING
In the methodology based on self-training (Figure 4), a supervised classifier is initially generated from the set of labeled data and used to classify the unlabeled data. Then, the new confidence threshold value used in the selection of new instances to be labeled is calculated. In the next step, the instances whose prediction confidence is greater than or equal to the confidence threshold are selected and labeled according to different strategies. Finally, the process is restarted using the new labeled dataset until the unlabeled dataset is empty. The key modified steps of Algorithm 3 are:

5: Remove a subset S = {s_1, s_2, ..., s_n} from D_U, so that the confidence rate in C(x) is greater than or equal to the minimum confidence rate for new instances to be included;
6: Use different strategies to choose the label for every instance in subset S.

The self-training presented in [17] differs from the proposal of this work in the following ways:
• The extension of self-training proposed in [17] does not change the confidence threshold to include new instances, whereas this proposal allows that value to vary at each iteration (see Algorithm 3, Line 4);
• In the extension of self-training proposed in [17], the label provided by the classifier is directly assigned to an unlabeled instance when it is moved to the labeled dataset. In this work, different strategies are proposed to define the correct label, among them classifier ensembles (see Algorithm 3, Line 6).
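The self-training loop described above can be sketched as follows. This is an illustrative implementation, not the authors' code: scikit-learn's GaussianNB stands in for the Naive Bayes classifier, `update_conf` stands in for any of the FlexCon-style threshold update rules, and the labeling strategy is the plain classifier prediction:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def flexible_self_training(X_l, y_l, X_u, conf=0.95, update_conf=None):
    """Illustrative self-training loop with a flexible confidence
    threshold (names are ours, not the paper's)."""
    X_l, y_l, X_u = np.asarray(X_l), np.asarray(y_l), np.asarray(X_u)
    while len(X_u) > 0:
        clf = GaussianNB().fit(X_l, y_l)
        probs = clf.predict_proba(X_u)
        chosen = probs.max(axis=1) >= conf          # confident instances
        if chosen.any():
            X_l = np.vstack([X_l, X_u[chosen]])
            y_l = np.concatenate(
                [y_l, clf.classes_[probs[chosen].argmax(axis=1)]])
            X_u = X_u[~chosen]
        # recalculate the threshold for the next iteration; by default
        # simply relax it so the loop eventually labels everything
        conf = update_conf(conf) if update_conf else conf - 0.05
    return GaussianNB().fit(X_l, y_l)
```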

B. EXPERIMENTAL METHODOLOGY USING CO-TRAINING
The methodology proposed in this experiment, based on co-training (Figure 5), starts by creating two different views (1 and 2) of the dataset. Then, two supervised classifiers are generated based on the labeled datasets (C_1: View 1 and C_2: View 2), which will be used to classify the unlabeled data. Right after this, the new confidence threshold value is calculated for each of the classifiers separately, which are used in the selection of new instances to be labeled.
In the next step, the instances whose prediction confidence is greater than or equal to the confidence threshold defined for the classifier by which the instance was classified will be selected. These instances are then labeled using different strategies. However, the most reliable instances predicted by classifiers C_2 and C_1 will be added to the labeled datasets of Views 1 and 2, respectively. Finally, the process is restarted using the new labeled datasets until both unlabeled datasets are empty.

In this work, an extension of the co-training algorithm was developed following the same methodology as [17], i.e., using a static confidence threshold for selecting instances in the labeling process. The implementation of this new extension was necessary to compare the performance of co-training using a fixed confidence threshold against the proposed methods, whose threshold is flexible. This extension of the co-training algorithm is presented in Algorithm 4, which uses a fixed threshold to select new training instances. As a co-training algorithm, it uses two views with distinct attributes. The main difference between this method and the original semi-supervised co-training is how the confidence parameter (threshold) is measured (Lines 4-5 of Algorithm 4). The semi-supervised co-training algorithm proposed by [13] selects the highest confidence value and uses it as the threshold to select the instances with that value, whereas our proposal uses a prefixed confidence parameter to select all instances whose prediction confidence rate is greater than or equal to this threshold. The lines in black represent the differences between this algorithm and the original co-training (Algorithm 2).
Algorithm 4: Algorithm for Co-training Using Static Threshold
Input: Let A = {a_1, a_2, ..., a_n} be the set of attributes, where A^(1) and A^(2) represent Views 1 and 2, respectively, with A^(1) ∩ A^(2) = ∅. Initially, the training instances are given as the labeled sets D_L^(1) and D_L^(2) and the unlabeled set D_U.
1: repeat
2: Generate classifiers C^(1) and C^(2) based on training data D_L^(1) and D_L^(2);
3: Classify the instances in D_U using classifiers C^(1) and C^(2), respectively;
4: Add to set D_L^(2) the instances classified by C^(1), whose confidence rate of the prediction is greater than or equal to the minimum confidence rate for new instances to be included;
5: Add to set D_L^(1) the instances classified by C^(2), whose confidence rate of the prediction is greater than or equal to the minimum confidence rate for new instances to be included;
6: Remove these instances from the unlabeled dataset;
7: until D_U = ∅
Output: labeled data

The methods proposed in this work, FlexCon-G, FlexCon and FlexCon-C, were also applied to the extension of co-training which uses a static threshold, presented above in Algorithm 4. Algorithm 5 shows the series of steps of the new version of co-training, with confidence adjustment, implemented in this work; the lines marked in black indicate the main differences from Algorithm 4.
The co-training algorithm using a static confidence threshold (described in Algorithm 4 in this section) differs from the proposal of this work in the following aspects:
• The extension of co-training using a fixed threshold (Algorithm 4) does not change the confidence threshold to include new instances, whereas this proposal allows this value to vary at each iteration (see Algorithm 5, Line 10);
• In the extension of co-training using a static threshold (Algorithm 4), the label provided by the classifier is directly assigned to an unlabeled instance when it is moved to the labeled dataset. In this work, different strategies are proposed to define the correct label, among them classifier ensembles (see Algorithm 5, Line 13).
Algorithm 5: Algorithm for Co-training with Confidence Adjustment
Input: Let A = {a_1, a_2, ..., a_n} be the set of attributes, where A^(1) and A^(2) represent Views 1 and 2, respectively, with A^(1) ∩ A^(2) = ∅. Initially, the training instances are given as the labeled sets D_L^(1) and D_L^(2) and the unlabeled set D_U.
1: repeat
2: Generate classifiers C^(1) and C^(2) based on training data D_L^(1) and D_L^(2);
3: Classify the instances in D_U using classifiers C^(1) and C^(2), respectively;
4: Calculate the new value for the confidence threshold;
5: Add to set D_L^(2) the instances classified by C^(1), whose confidence rate of the prediction is greater than or equal to the minimum confidence rate for new instances to be included;
6: Add to set D_L^(1) the instances classified by C^(2), whose confidence rate of the prediction is greater than or equal to the minimum confidence rate for new instances to be included;
7: Use different strategies to choose the label for every new instance included in the labeled datasets D_L^(1) and D_L^(2);
8: Remove these instances from the unlabeled dataset;
9: until D_U = ∅
Output: labeled data
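The selection step shared by Algorithms 4 and 5 (the lines that add confident instances to the opposite view's labeled set) can be sketched as follows. This is an illustration under our own naming, with scikit-learn classifiers standing in for C^(1) and C^(2); view construction and the threshold update are assumed to happen elsewhere:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def confident_cross_labels(c1, c2, X1_u, X2_u, conf):
    """Instances that C(1) labels with confidence >= conf are proposed
    for View 2's labeled set, and vice versa.  Returns, per receiving
    view, the selected indices and the predicted labels."""
    p1 = c1.predict_proba(X1_u)
    p2 = c2.predict_proba(X2_u)
    to_view2 = np.where(p1.max(axis=1) >= conf)[0]   # confident under C(1)
    to_view1 = np.where(p2.max(axis=1) >= conf)[0]   # confident under C(2)
    labels_for2 = c1.classes_[p1[to_view2].argmax(axis=1)]
    labels_for1 = c2.classes_[p2[to_view1].argmax(axis=1)]
    return (to_view1, labels_for1), (to_view2, labels_for2)

# toy example: each view is a single, well-separated feature
c1 = GaussianNB().fit([[0.0], [10.0]], ["A", "B"])
c2 = GaussianNB().fit([[0.0], [100.0]], ["A", "B"])
view1_adds, view2_adds = confident_cross_labels(
    c1, c2, np.array([[1.0], [9.0]]), np.array([[2.0], [98.0]]), 0.9)
```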

A. DATASETS
To validate the feasibility of the proposed confidence adjustment methods, an empirical analysis was performed using 30 different classification datasets. The datasets used in these experiments were obtained from several repositories: UCI Machine Learning (UCI) [47], Knowledge Extraction based on Evolutionary Learning (KEEL) [48], Kaggle Datasets [49] and GitHub [50]. Table 1 briefly describes the datasets used, in terms of the number of instances (#Inst), attributes (#Att) and classes (#Class) in each dataset. In addition, it indicates the data type (Type), either integer (I) and/or categorical (C) and/or real (R). Figure 6 shows, as an example, how the datasets were organized before training. It is common to have one sample of data for training and another independent sample, with different data, for testing. As long as both samples are representative, the error rate on the test set will give a good indication of performance [51]. Thus, in the experiments of this work, each dataset was divided into two sets: 1) A training set, with 90% of the instances; 2) A test set, with the remaining 10% of the instances. From each dataset, 10 repetitions were obtained with different training/test splits (cross-validation). Given that all datasets were originally labeled, it was possible to perform, in a stratified manner, 5 different configurations using 5%, 10%, 15%, 20% and 25% of the data initially labeled (similarly to [34]). In other words, from the 90% of the instances selected for training, the process started with 5%, 10%, 15%, 20% or 25% of the labeled data. In this way, it is possible to analyze the performance of the methods as the percentage of instances initially labeled increases. The choice of the labeled dataset is made randomly, but in a stratified manner, respecting the same class proportions as the complete dataset.
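The splitting scheme described above can be sketched with scikit-learn (an illustration under our assumptions, not the authors' code, which used R): 10-fold cross-validation gives the 90%/10% train/test split, and within each training fold a stratified subset is drawn as the initially labeled portion.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def make_splits(X, y, labeled_frac=0.05, seed=42):
    """Yield (labeled_idx, unlabeled_idx, test_idx) triples: 10-fold
    CV (90% train / 10% test), then a stratified choice of the
    initially labeled fraction (5%-25% in the paper) inside the fold."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        lab_idx, unlab_idx = train_test_split(
            train_idx, train_size=labeled_frac,
            stratify=y[train_idx], random_state=seed)
        yield lab_idx, unlab_idx, test_idx

# toy balanced dataset: 100 instances, 2 classes
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)
splits = list(make_splits(X, y, labeled_frac=0.2))
```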

C. CONFIGURATIONS
After pre-processing the data, the training/test procedure starts. In this analysis, four classification algorithms well known in the literature were applied: Naive Bayes (NB), decision tree (DT), a rule-based classification algorithm (Ripper) and k-nearest neighbors (k-NN). As explained earlier, these algorithms were chosen because of their popularity and use in machine learning studies. For all algorithms, the Weka implementations available in the R language were used. Details on the operation of the classifiers and their implementation in the R language are described in [30] and [52]. Additionally, it is worth mentioning that, at the beginning of the labeling process for the self-training and co-training algorithms, the set of labeled data has a small number of instances. Therefore, it is necessary to assign labels more cautiously, which is precisely why a high threshold is used. However, there is no way to guarantee that the same value can be considered high for all datasets, since each dataset has an intrinsic degree of difficulty. In view of the above, the initial value of the confidence threshold is difficult to estimate and requires further investigation. However, investigating the most suitable value for this threshold is not the objective of this work and, for this reason, a sample test was carried out using only the 90% and 95% thresholds. Considering the results obtained with these two threshold values, the decision was to set it to 95% (0.95) in all cases.
For the FlexCon-C method, the minimum acceptable precision (mp) in Equation (3) was defined as the classifier accuracy obtained using the same dataset for training and testing. To better understand, consider the following example: at the beginning of the labeling process there is a subset of initially labeled data, which is used both to train and to test a particular classifier. Therefore, the accuracy of this classifier is maximized and used as the minimum acceptable precision throughout the labeling process. In other words, this is an optimistic estimate, and the accuracy during the labeling process should be as good as this optimistic estimate.
In order to validate the performance of the proposed methods in a statistically significant way, the Friedman and post-hoc Nemenyi tests were applied. Since such tests are non-parametric, they are suitable for comparing the performance of different learning algorithms applied to separate datasets (for a complete discussion of the Friedman test, see [53]). The Friedman test and its post-hoc test were used to compare the performance of all proposed methods with the results achieved by the original self-training and co-training algorithms and by the proposition in [17], as well as with the co-training version which implements the same fixed threshold idea proposed by [17].
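As an illustration, the Friedman test is available in SciPy (the Nemenyi post-hoc is not in SciPy itself; libraries such as scikit-posthocs provide it). The accuracy values below are made up for the example, not taken from the paper:

```python
from scipy.stats import friedmanchisquare

# hypothetical accuracies of three methods over five datasets
# (one list per method, one position per dataset)
original_st = [0.81, 0.74, 0.90, 0.66, 0.78]
stft        = [0.83, 0.75, 0.91, 0.68, 0.80]
flexcon_c2  = [0.86, 0.79, 0.93, 0.71, 0.83]

# Friedman ranks the methods within each dataset and tests whether
# the mean ranks differ; a small p-value rejects the hypothesis that
# all methods perform equally, after which a post-hoc Nemenyi test
# locates the pairwise differences
stat, p = friedmanchisquare(original_st, stft, flexcon_c2)
```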
The results collected in the experiments described above are discussed in the next section.

VII. RESULTS AND DISCUSSION
This section presents and discusses the results of experiments which evaluate the performance of FlexCon-G, FlexCon and FlexCon-C applied to self-training and co-training, both semi-supervised learning algorithms. The results achieved with these methods were compared with those obtained with the original self-training and co-training, as well as with the proposal in [17], which uses a static confidence threshold. The section is divided into two parts: the first analyzes the performance of the proposed methods and the second evaluates them from a statistical perspective.

A. PERFORMANCE ANALYSIS
This section presents the analysis of the results of the experiments which evaluate the performance of the methods proposed in this work. In order to facilitate the discussion, the results are evaluated taking into account the average accuracy of the algorithms over all datasets, as well as the standard deviation of these measures. The tables showing the results obtained are organized as follows: the first column indicates the name of the method; columns 2 to 6 indicate the accuracy and standard deviation obtained by the semi-supervised methods (rows) according to the percentage of instances initially labeled, namely 5%, 10%, 15%, 20% and 25%, respectively.
The original self-training and co-training methods are called Original ST and Original CT, respectively. The methods which do not vary the confidence value for the inclusion of new instances throughout the process, when applied to self-training and co-training, are called ST Fixed Threshold (STFT) and CT Fixed Threshold (CTFT), respectively. For the methods proposed in this work, the same nomenclature is used for both self-training and co-training: FlexCon-G, FlexCon(s), FlexCon(v), FlexCon-C1(s), FlexCon-C1(v) and FlexCon-C2. In addition, results whose classification accuracy is superior to the original self-training and co-training are highlighted in bold, while the cells shaded in yellow represent the proposed methods whose accuracy was better than the method using a fixed threshold.
The following sections analyze the performance of each method when using the self-training and co-training semi-supervised learning algorithms, respectively. Table 2 shows the arithmetic mean of the accuracy and standard deviation of each method, using Naive Bayes, decision tree, Ripper and k-NN as classification algorithms. According to the data marked in bold in Table 2, it is possible to conclude that, when using the Naive Bayes and k-NN classifiers, all methods obtained better accuracy than the Original ST, for all analyzed percentages of instances initially labeled. When using decision tree and Ripper, 71.42% (5 out of 7) of the methods achieved better results than the Original ST when 25% and 15% of the initially labeled data were used, respectively. Evaluating the results obtained with the self-training labeling process overall, it is possible to conclude that the proposed methods achieved better accuracy than the Original ST and the ST Fixed Threshold in, respectively, 85 and 82 of the 120 cases, which is equivalent to approximately 70% of the cases.

1) Performance Analysis with Self-training
It is important to emphasize that, when using Naive Bayes, all the proposed methods (FlexCon-G, FlexCon and FlexCon-C) obtained better accuracy than the Original ST and the ST Fixed Threshold when the percentage of instances initially labeled was 5%. This suggests that these methods may adapt well to real-world datasets which have a small number of labeled instances. Furthermore, this same result can be observed for the FlexCon and FlexCon-C methods when using 20% and 25% of the data initially labeled. Unlike with the Naive Bayes classifier, when using a decision tree, the methods performed better than the Original ST and the ST Fixed Threshold when using 25% of the instances labeled at the beginning of the process.
Still based on the data presented in Table 2, by analyzing the areas shaded in yellow, it is possible to verify that, using Naive Bayes, decision tree, Ripper and k-NN, the proposed methods are better than the ST Fixed Threshold method in 70% (21 of 30), 76.66% (23 of 30), 83.33% (25 of 30) and 43.33% (13 of 30) of the cases, respectively. In addition, it is observed that the proposed methods obtained better performance than the ST Fixed Threshold using the Naive Bayes, decision tree and Ripper classifiers with 5%, 10%, 20% and 25% of the data initially labeled. These results demonstrate that the methods, when using these classifiers, adapt well to both the highest and the lowest percentages of instances initially labeled. Table 3 identifies the method which achieved the best performance in relation to the others for each percentage of instances initially labeled.
Given the above considerations, it is clear that, for the Naive Bayes and Ripper classifiers, the best method is always one of those proposed in this article, while for decision tree and k-NN the proposed methods stood out as the best in 4 of the 5 percentages of instances initially labeled. Table 2 also presents the standard deviation, which was calculated considering the average results for each dataset used in the experiments; after this, we calculated the average accuracy and standard deviation over all datasets. Thus, the high standard deviation values can be explained by the fact that the datasets have different characteristics and, consequently, divergent accuracies. In addition, it can be noticed that the lowest standard deviation values occur at the highest percentages of instances initially labeled, while the highest standard deviations occur at the lowest percentages, except for the Naive Bayes algorithm, whose values are similar for all percentages of instances initially labeled. Figure 7 presents a graph which indicates the number of times that each method achieved the best performance in relation to all other methods, according to each percentage of instances initially labeled and each classifier. As an example, FlexCon-C1(s) obtained greater accuracy than the other methods in two cases: 1) with the Naive Bayes classifier, using 5% of the data initially labeled; 2) with k-NN as classifier, using 10% of the data initially labeled; therefore, this method appears in the graph with its bar at number two on the y axis. Given the above, it is possible to observe that FlexCon-C2 stands out with superior accuracy in 5 out of 20 cases, 4 with DT and Ripper (using 10% and 25%) and 1 with NB (using 15%). Then, both FlexCon(s) and FlexCon(v) perform best in 4 of the 20 cases, and FlexCon-C1(v) in 3 cases.
After evaluating the performance of the methods separated by classifier, the results of each method are analyzed separately. As a way to explore all the results of each of the methods described in this work, Figures 8 and 9 present boxplot-type graphs, produced from the 300 accuracy measures (10 repetitions of each of the 30 datasets) achieved by each method, using self-training. These graphs were organized by method, so each graph has the accuracy of a single method, separated by classifiers and percentages of instances labeled at the beginning of the process. In these graphs, the x axis contains the percentage of instances labeled at the beginning of the process (5%, 10%, 15%, 20%, 25%), while the accuracy values are on the y axis. In addition, as each graph has the performance of the four classifiers, they were separated by colors: red, blue, yellow and green, which represent, respectively, Naive Bayes, decision tree, Ripper and k-NN.
By analyzing these graphs, considering the percentages of instances initially labeled, it seems clear that, for all methods, the accuracy values rise as the percentage of labeled instances increases. In other words, the higher the percentage of instances initially labeled, the better the performance of each method for all classifiers. Given the above, it is possible to observe that the greatest accuracy is obtained using 25% of the data initially labeled, for all methods (100% of the cases). Additionally, it is noticeable that the classifiers with the greatest difference in accuracy between the smallest and the highest percentage of instances initially labeled are decision tree and Ripper, in most cases. Naive Bayes and k-NN have similar performance for all percentages of instances labeled at the beginning of the process. In addition, it becomes evident that Naive Bayes was the classifier with the highest number of outliers in all methods. Table 4 shows a ranking of the datasets that achieved the best and the worst performance, in terms of average accuracy, during the experiments using the original self-training and the proposed algorithms. For simplification purposes, we present only the three best-ranked and the three worst-ranked datasets. The data are split horizontally by classifier and vertically by each approach analyzed in this paper (ST-Original, STFT and FlexCon-Family). In this case, the results of the six proposed method variations were aggregated (using majority vote) and grouped into a single column named FlexCon-Family.
In general, although the FlexCon-Family algorithms achieved better performance than the traditional approaches, we observed a similar tendency in the ranking of datasets, independently of the evaluated method. In other words, the datasets that achieved the best results using the FlexCon-Family algorithms are practically the same as those that achieved the best results using the traditional approaches, with minor variations.
On the other hand, we observed that the position of each dataset in the presented ranking is more related to the classifier used than to the approach used. The choice of classifier has a direct influence on the accuracy of the process. With the Ripper and k-NN classifiers, according to Table 4, we observed that the best and the worst datasets are the same, independently of the approach used.

2) Performance Analysis with Co-training
The results related to the performance of each method using the co-training algorithm with the classifiers Naive Bayes, decision tree, Ripper and k-NN are presented in Table 5. According to the data presented, it is noticeable that all methods obtained greater accuracy than the Original CT in 100% of the cases (values in bold).
Comparing the accuracy of the proposed methods with that of the CT Fixed Threshold (cells shaded in yellow in Table 5), Naive Bayes achieved better results in 3 out of 30 cases, while for the other classifiers none of the proposed methods exceeded the accuracy of this method. This result can be explained by the number of instances labeled during the process: the proposed methods classify all of the unlabeled instances, while the CT Fixed Threshold does not classify instances whose confidence rate is lower than the initially defined threshold. This behavior of the proposed methods implies the possibility of including instances with low confidence, which can negatively influence the prediction of the classifiers. In contrast, the CT Fixed Threshold procedure may result in the inclusion of fewer instances in the training set; in this way, the training set may contain only those instances whose prediction is reliable, positively affecting the prediction of the classifiers.
To support the justification above, two graphs were created, shown in Figure 10, which present the average percentage of instances labeled by the fixed threshold methods and by the proposed ones. In these graphs, the x axis represents the percentage of instances initially labeled and the y axis represents the average percentage of instances included in the labeled dataset during the labeling process. The red bars indicate the 100% of instances labeled by the proposed methods, while the colors green, blue, yellow and brown represent the fixed threshold method using the classifiers Naive Bayes, decision tree, Ripper and k-NN, respectively. By analyzing Figure 10, it is possible to verify that ST Fixed Threshold (STFT, left graph) labels approximately 80% of the instances using the fixed threshold of 95%. Using this same threshold value, CT Fixed Threshold (CTFT, right graph) labels, in the worst and best cases, respectively, 30% and 70% of the instances of the unlabeled dataset. On the other hand, the proposed methods, which use a flexible threshold, label the entire set of unlabeled data: they begin their labeling process with a threshold value of 95%, but need to decrease this value to include the remaining instances. Given the above, it is possible to reaffirm that the small number of instances labeled by CT Fixed Threshold may have positively influenced the prediction of the classifiers due to the construction of a training set formed only by instances whose prediction is reliable.
Additionally, a sample experiment was carried out using the FlexCon(s) method with a stop criterion (FlexCon(s)-SC), in which the labeling process is interrupted at the moment when there are no new instances to be labeled, instead of labeling all instances of the unlabeled dataset. The accuracy of that experiment is shown in Table 6 in the row called FlexCon(s)-SC. The other rows of that table were replicated from Table 5 for comparison of results. The values in bold represent the cases in which FlexCon(s)-SC achieved greater accuracy than the Original CT and FlexCon(s), while the cells shaded in yellow reflect the situations in which FlexCon(s)-SC performed better than the CT Fixed Threshold. Analyzing the data in Table 6, it can be seen that the FlexCon(s)-SC method achieved greater accuracy than the Original CT and FlexCon(s) methods in all cases (values marked in bold). In addition, FlexCon(s)-SC performed better than the CT Fixed Threshold in 3 and 5 of the 5 percentages of instances initially labeled using Naive Bayes and Ripper, respectively (yellow shaded cells). In all other cases, FlexCon(s)-SC showed a performance similar to that of CT Fixed Threshold. Considering the above, it is possible to conclude that the proposed methods may be labeling instances whose prediction confidence is very low and, therefore, their performance is not superior to that of CT Fixed Threshold. However, there is an evident trade-off: the proposed methods label all instances of the unlabeled dataset, although their performance relative to CT Fixed Threshold decreases.
Analyzing the results of Table 5, comparing only the accuracy of the proposed methods, without considering CT Original and CT Fixed Threshold, it is possible to observe that FlexCon-C2 using Naive Bayes, FlexCon(v) using decision tree and Ripper, and FlexCon-C1(v) using k-NN were the ones that showed the best performance. In other words, these methods were more accurate than the others in most of the 5 percentages of instances initially labeled. Following the same dynamics as self-training, the standard deviation shown in Table 5 was calculated over the average accuracy of the 30 datasets used in the experiments of this work; therefore, the high values of this measure can be explained by the datasets having different characteristics and, consequently, different performances. In addition, it is noticeable that, with the Naive Bayes and Ripper algorithms, the standard deviation is similar for all percentages of instances initially labeled, whereas for decision tree and k-NN the values are equivalent when the percentages are greater than 5%. Figure 11 presents a graph which indicates how many times each method achieved the best performance, according to each percentage of instances initially labeled and each classifier. Analyzing the data in general, it is evident that FlexCon(v) stands out with the best accuracy in 9 out of 20 cases: 3 with NB, DT and Ripper (using 5%), 2 with DT and k-NN (using 10%), 2 with DT and 2 with Ripper (both using 15% and 20%). Next comes the FlexCon-C2 method, with the best performance in 5 out of 20 cases, followed by FlexCon-C1(v) in 4 cases, FlexCon(s) in 2 cases and FlexCon-C1(s) in 1 case.
Following the same approach as with self-training, boxplot graphs were also generated for co-training, produced from the 300 accuracy values (10 repetitions of each of the 30 datasets) achieved by each method. These graphs are organized by method and are shown in Figures 12 and 13. Accordingly, each graph presents the accuracy of a single method, separated by classifier and by percentage of instances labeled at the beginning of the process. In these graphs, the x axis contains the percentage of instances initially labeled (5%, 10%, 15%, 20%, 25%), and the y axis represents the accuracy values. In addition, each graph shows the performance of the four classifiers, separated by colors: red, blue, yellow and green represent, respectively, Naive Bayes, decision tree, Ripper and k-NN.
By analyzing the graphs presented above, considering the percentages of instances initially labeled, it becomes evident that the results are similar to the ones obtained with self-training: the higher the percentage, the higher the accuracy values. Accordingly, it is possible to notice that the greatest accuracy is obtained using 25% of the data initially labeled, for all methods (100% of cases).
Additionally, it is noticeable that the classifiers with the greatest difference in accuracy between the smallest and the highest percentage of instances initially labeled are decision tree and Ripper, in most cases. Naive Bayes and k-NN have similar performance for all percentages of instances labeled at the beginning of the process. Considering the divergent values, it becomes apparent that Naive Bayes produces a large amount of them in all methods and percentages of initially labeled data, except in CT Original and FlexCon-G.
Similarly to the previous section, Table 7 shows a ranking of the datasets that achieved the best and the worst performance, based on average accuracy, during the experiments using the original co-training and the proposed algorithms. For simplification purposes, we present only the three best and the three worst datasets in the ranking. The data was split horizontally by classifier and vertically by each approach analyzed in this paper (CT Original, CTFT and FlexCon-Family). In this case, the results of the six variations of the proposed methods were aggregated (using majority vote), grouped into a single column, and named FlexCon-Family.
As with self-training, although the FlexCon-Family algorithms achieved better performance than the traditional approaches, we observed a similar tendency in the ranking of datasets, independently of the evaluated method. In other words, the datasets that achieved the best results using the FlexCon-Family algorithms are practically the same as those that achieved the best results using the traditional approaches, with minor variations.
From these results, analyzing the ranking of each dataset with respect to the classifiers, it is possible to observe that the best and the worst datasets are the same for all classifiers, except for some variations using k-NN and Ripper. Using Naive Bayes and decision tree, we observed in Table 7 that both the first three and the last three datasets of the ranking are the same, regardless of the approach used, apart from small changes in ranking position.

B. STATISTICAL ANALYSIS
After evaluating the performance of each method using Naive Bayes, decision tree, Ripper and k-NN as classification algorithms in the labeling procedure, a statistical analysis of the results was performed. As explained earlier, the Friedman and post-hoc Nemenyi tests were used to compare the performance of the different methods applied to the different datasets.
The statistical test was applied separately for the self-training and co-training algorithms. However, each percentage of instances initially labeled and the four classifiers were considered together, in order to facilitate the visualization of the results. First, the Friedman test was performed, which showed that the performances of the different methods are statistically different. A significant difference was detected by the Friedman test, with p-value < 0.001, for all proportions of data initially labeled.
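For reference, the quantities behind this analysis can be computed from the accuracy table alone. The sketch below implements the standard Friedman statistic and the Nemenyi critical difference (CD) in plain Python; the `q_alpha` values come from the usual alpha = 0.05 table (e.g., 2.728 for comparing 5 methods), and the function names are our own, not the paper's.

```python
import math

def friedman_statistic(scores):
    """Friedman chi-square over a results table.
    scores[i][j] = accuracy of method j on dataset i.
    Returns (chi-square statistic, average rank of each method),
    where rank 1 is the best (highest) accuracy and ties share
    the average of the ranks they span."""
    n = len(scores)          # number of datasets
    k = len(scores[0])       # number of methods
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1       # extend over a run of tied accuracies
            avg = (i + j) / 2 + 1
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for m in range(k):
            rank_sums[m] += ranks[m]
    avg_ranks = [s / n for s in rank_sums]
    chi2 = (12 * n) / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    return chi2, avg_ranks

def nemenyi_cd(k, n, q_alpha=2.728):
    """Nemenyi critical difference: two methods differ significantly
    when their average ranks differ by more than this value.
    q_alpha depends on k; 2.728 is the alpha = 0.05 value for k = 5."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))
```

In a critical difference diagram, each method is placed at its average rank and a horizontal bar of length `nemenyi_cd(k, n)` connects the groups of methods whose ranks are within the CD of one another.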
Given the statistical difference indicated by the Friedman test, the paired post-hoc Nemenyi test was then applied to compare the different methods, two by two, in each percentage of instances initially labeled. The result of this test is detailed in the next sections using the critical difference diagrams presented in Figures 14 and 15. The methods located on the left are considered better than those on the right, from a statistical point of view. The methods connected by a horizontal bar are those with similar performance and therefore have no statistical difference. Otherwise, the methods that are not linked by a horizontal bar are statistically different, with the method on the left being superior to the one on the right.
1) Statistical Analysis with Self-training
Figure 14 presents the critical difference diagrams obtained from the post-hoc Nemenyi statistical test for the self-training algorithm. This figure contains the diagrams separated by the percentage of instances initially labeled. The first observation that can be made regarding these diagrams is that the FlexCon and FlexCon-C methods obtained the best rankings, i.e., they appear further to the left in most cases. FlexCon-C2 stands out as superior from a statistical point of view, as it is always on the left of the diagram and has a critical difference in relation to at least one of the methods in 4 of the 5 percentages of instances initially labeled. In addition, this method achieved the best ranking when using the lowest percentages of instances initially labeled (5%, 10% and 15%).
Analyzing the diagrams in Figure 14, it is clear that the differences between all methods and the original are statistically significant in all percentages of instances initially labeled, except for FlexCon-G with 5%, 15% and 25%. Considering the statistical performance of the proposed methods in relation to the fixed threshold method, it is noticeable that the proposed methods are always positioned further to the left. This means that the proposed methods have a better ranking than the fixed threshold method, although they are statistically similar.
2) Statistical Analysis with Co-training
Figure 15 presents the critical difference diagrams obtained from the post-hoc Nemenyi statistical test for the co-training algorithm. As with self-training, the diagrams are separated by the percentage of instances initially labeled. Analyzing the diagrams of that figure, it is clear that the differences between all methods and the original are statistically significant, except for FlexCon-G with 5% and 20%. By analyzing the diagram that uses 5% of instances initially labeled, it is apparent that the FlexCon(s and v) method is statistically similar to the fixed threshold method. This can be considered a good result because, in addition to this similarity, FlexCon has adapted well, from a statistical point of view, using few instances initially labeled. As justified previously, the method that uses a fixed threshold achieved better performance than the others due to the fact that it labeled fewer instances. Therefore, its training set may be formed only by instances with high reliability, which also justifies its good position in the ranking of the statistical test.
Corroborating the performance analysis performed previously, despite the fixed threshold obtaining better statistical performance, it is possible to notice that methods FlexCon and FlexCon-C, in all diagrams, obtained good positions in the ranking, i.e., they were positioned on the left of the diagram.
The results (accuracy and standard deviation) from the application of self-training showed that, with the Naive Bayes and k-NN classifiers, all methods achieved better performance than ST Original. In addition, most methods (68.33%, i.e., 82 of 120 cases) obtained greater accuracy than ST Fixed Threshold. Similarly, using co-training, all four classifiers performed better than CT Original.
The evaluation from the statistical point of view was performed using the Friedman and post-hoc Nemenyi tests, which compared the proposed methods to the original ones, separating them by percentage of instances initially labeled. From this investigation, it was observed that the proposed methods are better, from a statistical point of view, in most cases. In summary, the results presented in this study suggest that the proposed methods have, on average, superior performance to the original methods, both in terms of accuracy and standard deviation and in the statistical analysis.

VIII. CONCLUSION
The present work is included in the machine learning field, more specifically, semi-supervised learning. One of the main limitations of semi-supervised learning algorithms is related to the selection of new instances to be included in the labeled dataset. In this context, several studies have been carried out to try to solve this problem. However, none of them employed the approach presented in this work, which uses a dynamic confidence threshold, to include new instances in the training set at each iteration.
This work proposed three methods, namely FlexCon-G, FlexCon and FlexCon-C, which include a dynamic confidence rate calculation for choosing the labels used in the semi-supervised labeling process with self-training and co-training. In this way, while the original self-training and co-training (as well as their derivations) use a static procedure to include new instances in the labeled dataset, the FlexCon-G, FlexCon and FlexCon-C methods aim to make the labeling procedure more flexible. Such a strategy allows the proposed methods to explore more deeply the full potential of a semi-supervised technique.
To assess the feasibility of this proposal, experiments were carried out using 30 classification datasets, organized in 5 different scenarios with regard to the proportion of instances initially labeled (5%, 10%, 15%, 20% and 25%). In addition, four different classification algorithms were used in the self-training procedure: Naive Bayes, decision tree, Ripper and k-NN.
The results of the experiments were evaluated from two perspectives: 1) performance of the methods regarding accuracy and standard deviation; and 2) statistical analysis. Based on this analysis, it was possible to conclude that the proposed methods, whose objective is to make the confidence rate dynamic, are very promising, as in most cases they improved the classification accuracy when compared to the methods in their original form and in the fixed threshold form.
Finally, by exploring the accuracy of the proposed methods per classifier, it was possible to conclude that, when using the Naive Bayes and k-NN classifiers, the proposed methods stand out from the others, as they obtained greater accuracy than the original methods, both for self-training and for co-training. By investigating the performance of the methods according to the percentage of instances initially labeled, it is apparent that the greatest accuracy is achieved when using 25%, in all cases except self-training with the Naive Bayes classifier.
The following are some works that can be developed in future research involving the proposed methods:
• The methods were applied to the self-training and co-training algorithms; however, there is the possibility of using other semi-supervised learning algorithms, as well as other classifiers;
• Investigation of new strategies to address or avoid the inclusion of unreliable instances, i.e., those whose prediction confidence is very low;
• Development of a data stratification process to include new instances in the labeled dataset, to be applied using co-training;
• In this work, five percentages of instances initially labeled were used; therefore, other percentages can be used and compared with this research;
• It is possible to conduct an investigation of the best value for the initial confidence threshold;
• The confidence parameter can be replaced by confidence intervals instead of a single threshold value, allowing for greater flexibility in the algorithm.
VOLUME 1, 2021