Multi-class Classification with Fuzzy-feature Observations: Theory and Algorithms

The theoretical analysis of multi-class classification has proved that existing multi-class classification methods can train a classifier with high classification accuracy on the test set, provided that the training and test instances are precise, drawn from the same distribution, and sufficiently many instances can be collected for training. However, one limitation of multi-class classification remains unsolved: how to improve classification accuracy when only imprecise observations are available. Hence, in this paper, we propose a novel framework to address a new, realistic problem called multi-class classification with imprecise observations (MCIMO), where we need to train a classifier with fuzzy-feature observations. First, we give a theoretical analysis of the MCIMO problem based on fuzzy Rademacher complexity. Then, two practical algorithms based on support vector machines and neural networks are constructed to solve the proposed problem. Experiments on both synthetic and real-world datasets verify the rationality of our theoretical analysis and the efficacy of the proposed algorithms.

Rademacher complexity is a crucial tool for deriving generalization bounds; it measures how well a given hypothesis set can fit random noise. A Rademacher-complexity-based bound was first proposed by Koltchinskii and Panchenko [8]. Subsequently, this bound was improved in [7]. Then, Maximov, Amini and Harchaoui [9] presented a new estimation error bound using Rademacher complexity for multi-class classification. In addition, to ensure multi-class PAC learnability, a series of estimation error bounds based on the VC dimension and the Natarajan dimension were proposed in [10], [11]. Because of their dependence on these dimensions, such bounds rarely apply to large-scale problems.
To conduct theoretical analyses of neural networks for multi-class classification problems, Hardt et al. [12] and McAllester [13] introduced new bounds based on stability and PAC-Bayesian arguments. Further, tighter and sharper bounds were proposed in [14], [15] using local Rademacher complexity. These theoretical analyses show that we can always learn a good classifier for multi-class classification problems when the training and test instances are precise, drawn from the same distribution, and sufficiently many instances can be collected for training.
However, one limitation of multi-class classification remains: the existing methods cannot handle the scenario in which only imprecise observations are available. For example, the readings on many measuring devices are not exact numbers but intervals, because only a limited number of decimals is available on most such devices. This scenario has inspired us to consider a more realistic problem called multi-class classification with imprecise observations (MCIMO). In the MCIMO problem, we aim to train a classifier with high classification performance when the features of all instances in both the training and test sets are imprecise (e.g., fuzzy-valued or interval-valued features).
The main challenge in solving the MCIMO problem is how to handle observations with fuzzy-valued or interval-valued features. Existing well-known machine learning methods cannot be used directly to address it. Recently, combining fuzzy techniques with machine learning methods (especially transfer learning methods [16]-[20]) has drawn increasing attention. In the literature review section, we give a brief review of machine learning methods that incorporate fuzzy techniques [21]-[26]. These fuzzy-based methods demonstrate that fuzzy techniques are powerful tools for analyzing imprecise observations and provide better interpretability when handling various kinds of uncertainty. Therefore, we use fuzzy techniques to address the MCIMO problem, because they can represent the imprecise features of the instances in both the training and test sets and can handle different types of uncertainty.
In this paper, we use fuzzy random variables, proposed in [27], [28], to represent the imprecise features of the instances. We then give a theoretical analysis and obtain estimation error bounds for the MCIMO problem. These bounds are important because they ensure that we can always train a fuzzy classifier with high classification accuracy when the instances are drawn from the same fuzzy distribution and enough fuzzy-feature instances can be collected.
Subsequently, we construct two fuzzy-technique-based algorithms, which combine fuzzy techniques with SVM and neural networks to analyze fuzzy data. The proposed algorithms contain two main parts. The first part extracts the most significant crisp-valued information from the imprecise observations, which is the main difficulty of the proposed algorithms; we compare the performance of different defuzzification methods on synthetic datasets to find the optimal defuzzification function. The second part classifies the extracted crisp-valued information using two well-known machine learning methods: SVM and neural networks. In addition, since interval-valued data is another common type of imprecise data in real-world scenarios, we give an approach for applying the proposed methods to interval-valued data. Finally, experimental results on both synthetic and real-world datasets reveal the superiority of the proposed algorithms and demonstrate, through comparisons with seven baselines, that the proposed fuzzy-based methods analyze fuzzy or interval-valued data better than non-fuzzy methods.

The main contributions of this paper are as follows: (1) we formalize the MCIMO problem and derive estimation error bounds for it based on fuzzy Rademacher complexity; (2) we construct two practical algorithms, based on SVM and neural networks, to solve the MCIMO problem; and (3) we verify the rationality of the theoretical analysis and the efficacy of the proposed algorithms on both synthetic and real-world datasets.

The remainder of this paper is structured as follows. Section II presents a brief review of methods that combine fuzzy techniques with machine learning. Section III introduces the related definitions. Section IV introduces and formally defines the MCIMO problem. Section V gives the theoretical analysis of the MCIMO problem. Section VI proposes a novel framework to address the MCIMO problem and constructs two algorithms based on this framework to analyze fuzzy-feature observations. Sections VII and VIII report experiments on synthetic and real-world datasets that show the superiority of the proposed algorithms. Section IX concludes the paper and outlines future work.

II. LITERATURE REVIEW
In this section, we briefly review methods that combine fuzzy techniques with machine learning.
On the one hand, for classification tasks, Colubi et al. [21] integrated fuzzy $L_2$ metrics [29] with the discriminant analysis approach to analyze fuzzy data. Yang et al. [30] proposed a novel fuzzy SVM algorithm based on a kernel fuzzy c-means clustering method to handle classification problems with outliers or noise. Rong et al. [31] introduced a new classification method that applies the defuzzified Choquet integral to heterogeneous fuzzy data classification. Wang et al. [22] presented a novel deep-ensemble-level-based Takagi-Sugeno-Kang (TSK) fuzzy classifier for imbalanced data classification, which achieves both promising classification performance and the high interpretability of zero-order TSK fuzzy classifiers. Liu et al. [32] used fuzzy vectors to model imprecise observations of distributions, helping to address the two-sample testing problem, a core problem in machine learning [33]-[35].
In addition, in the area of transfer learning, Behbood et al. [36], [37] proposed a series of novel fuzzy-based transfer learning methods for long-term bank failure prediction, which use fuzzy sets and the concepts of similarity and dissimilarity to modify the labels of the target instances. Deng et al. [38]-[41] proposed several new approaches that integrate the TSK fuzzy system (TSK-FS) with transfer learning to recognize epileptic electroencephalogram signals. To solve heterogeneous unsupervised domain adaptation (HeUDA) problems for classification, Liu et al. [42] introduced a novel HeUDA approach utilizing shared fuzzy equivalence relations via fuzzy geometry, which can measure the similarity between the features of instances in the source and target domains. Further, [23] enhanced this method, called the shared-fuzzy-equivalence-relations neural network, to tackle another challenging problem: multi-source heterogeneous unsupervised domain adaptation.
In contrast, for regression tasks, Deng et al. [43], [44] proposed several novel transfer learning approaches utilizing Mamdani-Larsen fuzzy systems and TSK-FS. Further, the authors [45] improved the above model to construct a new transfer learning model that uses two knowledge-leverage strategies, learned from the TSK-FS model, to enhance the two types of parameters for the target domain. In addition, Zuo et al. [46] applied granular computing techniques to transfer learning and proposed a comprehensive domain adaptation framework based on the T-S fuzzy model. Subsequently, [24] presented a novel fuzzy rule-based transfer learning model that integrates an infinite Gaussian mixture model with active learning. Applying these two techniques, researchers can identify the data structure, select an appropriate source domain when multiple source domains are available, and efficiently choose labeled data for the target model when the target domain contains insufficient data. Lu et al. [25] presented a novel fuzzy rule-based transfer learning approach that merges fuzzy rules from multi-source domains in both homogeneous and heterogeneous scenarios. Besides, new fuzzy-based clustering methods were presented in [47], [48] to analyze fuzzy data.
In our previous work [26], we proposed an algorithm for a novel classification problem in which the instances in the training and test sets are all imprecise, and we gave a theoretical analysis of this problem. However, that work has two drawbacks. First, it did not explore the properties of different defuzzification methods. Second, it verified the performance of the proposed algorithm only on a synthetic dataset, whereas performance on real-world datasets is indispensable. In this paper, we address both drawbacks.

III. PRELIMINARY
In this section, some related definitions are introduced, including the definitions of the fuzzy probability density function and the fuzzy probability distribution.
Definition 3.1 ([28]): Let $\mathbb{R}$ be the universal set and $\tilde{X}$ a fuzzy random variable. Suppose $f_{\tilde{X}_\alpha}(x)$ is the probability density function of $\tilde{X}^L_\alpha$ and $\tilde{X}^U_\alpha$, where $[\tilde{X}^L_\alpha, \tilde{X}^U_\alpha]$ is the $\alpha$-cut of $\tilde{X}$. We define $\tilde{f}(\tilde{x})$ as the fuzzy probability density function of $\tilde{X}$; its membership function is constructed from the densities $f_{\tilde{X}_\alpha}(x)$ over all $\alpha$-levels (the explicit formula is given in [28]).

Definition 3.2 ([26]): We denote by $\tilde{D}$ the fuzzy probability distribution of $\tilde{X} \in \mathcal{F}_{\mathbb{R}}$ (written $\tilde{X} \sim \tilde{D}$), which comprises the value range and the fuzzy probability density function of $\tilde{X}$, where the value range of $\tilde{D}$ is the range of the real-valued variable $x$ that induces all fuzzy real numbers in $\tilde{D}$.
Let $\tilde{X} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_p) \in \mathcal{F}^p_{\mathbb{R}^p}$ be a $p$-dimensional fuzzy random vector, where $\tilde{x}_1, \ldots, \tilde{x}_p \in \mathcal{F}_{\mathbb{R}}$ are i.i.d. fuzzy random variables. Suppose the probability density function of $\tilde{x}_j$ is $\tilde{f}_j(\tilde{x})$, $j = 1, \ldots, p$. Since the components are independent, the joint probability density function of $\tilde{X}$ is induced by $\tilde{f}_1, \ldots, \tilde{f}_p$, and its membership function is defined analogously to Definition 3.1. Then, we denote by $\tilde{D}$ the fuzzy distribution over $\mathcal{X} \subset \mathcal{F}^p_{\mathbb{R}^p}$, where $\tilde{D}$ contains the value range and the joint probability density function of every fuzzy vector in $\mathcal{X}$.

IV. MULTI-CLASS CLASSIFICATION WITH IMPRECISE OBSERVATIONS
In this section, we introduce the MCIMO problem. Let $\mathcal{X} \subset \mathcal{F}^p_{\mathbb{R}^p}$ be the input space and $\mathcal{Y} = \{1, \ldots, K\}$ be the output space, and let $\tilde{D}$ be an unknown fuzzy distribution over $\mathcal{X}$. Let $S = \{(\tilde{X}_i, y_i)\}_{i=1}^m$ be a sample drawn from $\mathcal{X} \times \mathcal{Y}$, where the labels are given by an underlying labeling function $f$: if $\tilde{X}_i \in \mathcal{X}$ belongs to the $k$-th class, then $f(\tilde{X}_i) = k$. Let $H \subset \{h : \mathcal{X} \to \mathbb{R}^K\}$ be the hypothesis set of the MCIMO problem, where for each $h \in H$, the component $h_k(\tilde{X}_i)$, $k = 1, \ldots, K$, represents the probability that instance $\tilde{X}_i$ belongs to the $k$-th category. Each hypothesis $h$ is evaluated by a loss function $l(h(\tilde{X}), y)$, and we let $L_H = \{l(h(\tilde{X}), y) \mid \tilde{X} \in \mathcal{X}, h \in H, y \in \mathcal{Y}\}$ be the class of loss functions associated with $H$.
Traditional multi-class classification aims to use the sample $S$ to find a hypothesis $h \in H$ whose risk $R(h)$ with respect to $f$ is as small as possible; the purpose of the MCIMO problem is similar. The risk with respect to $h$ is defined as
$$R_{\tilde{D}}(h) = \mathbb{E}_{\tilde{X} \sim \tilde{D}}\big[l(h(\tilde{X}), y)\big],$$
where the notion of $\mathbb{E}_{\tilde{X} \sim \tilde{D}}[l(h(\tilde{X}), y)]$ can be found in [26]. Thus, to address the MCIMO problem, we seek the optimal hypothesis $h^*$ that minimizes the risk, i.e., $h^* = \arg\min_{h \in H} R_{\tilde{D}}(h)$.

V. THEORETICAL ANALYSIS OF THE MCIMO PROBLEM
In this section, the theoretical analysis of the MCIMO problem is presented. First, the notion of fuzzy Rademacher complexity is introduced. Then, we derive estimation error bounds for the MCIMO problem, which guarantee that we can always obtain a fuzzy classifier with high classification accuracy when sufficiently many fuzzy-feature instances are available.
Definition 5.1 ([26]): Let $L_H$ be a family of loss functions and $S = \{(\tilde{X}_i, y_i)\}_{i=1}^m$ a sample drawn from $\mathcal{F}^p_{\mathbb{R}^p} \times \mathcal{Y}$. The empirical fuzzy Rademacher complexities of $L_H$ and $H$ with respect to $S$ and $S_X = \{\tilde{X}_i\}_{i=1}^m$ are defined as
$$\hat{\mathfrak{R}}_S(L_H) = \mathbb{E}_{\sigma}\Big[\sup_{l \in L_H} \frac{1}{m}\sum_{i=1}^m \sigma_i\, l(h(\tilde{X}_i), y_i)\Big], \qquad \hat{\mathfrak{R}}_{S_X}(H) = \mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i\, h(\tilde{X}_i)\Big],$$
where the $\sigma_i$ are independent random variables drawn from the Rademacher distribution, i.e., $\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = 1/2$, and $\tilde{D}$ denotes the fuzzy distribution from which $S$ and $S_X$ are drawn. The fuzzy Rademacher complexities of $L_H$ and $H$ are then defined as
$$\mathfrak{R}_m(L_H) = \mathbb{E}_{S \sim \tilde{D}^m}\big[\hat{\mathfrak{R}}_S(L_H)\big], \qquad \mathfrak{R}_m(H) = \mathbb{E}_{S_X \sim \tilde{D}^m}\big[\hat{\mathfrak{R}}_{S_X}(H)\big].$$
Using related lemmas and theorems (shown in [26]) and the theoretical analysis of traditional multi-class classification algorithms (shown in [7]-[10], [15]), the estimation error bounds for hypotheses in $H$ are given in the following theorem.
Theorem 5.1 ([26]): Let $S = \{(\tilde{X}_i, y_i)\}_{i=1}^m$, and suppose there are $C_l, C_h > 0$ such that $\sup_{h \in H}\|h\|_\infty \le C_h$ and $\sup_{\|h\|_\infty \le C_h} \max_y l(h, y) \le C_l$, and every $l \in L_H$ is an $L_l$-Lipschitz function. For any $\delta > 0$, with fuzzy probability at least $1 - \delta$, each of the following holds for all $l \in L_H$:
$$R_{\tilde{D}}(h) \le \hat{R}_S(h) + 2\,\mathfrak{R}_m(L_H) + C_l\sqrt{\frac{\ln(1/\delta)}{2m}}, \qquad R_{\tilde{D}}(h) \le \hat{R}_S(h) + 2\,\hat{\mathfrak{R}}_S(L_H) + 3\,C_l\sqrt{\frac{\ln(2/\delta)}{2m}}.$$
Because every $l \in L_H$ is $L_l$-Lipschitz, the complexity of the loss class can be bounded in terms of that of the hypothesis class, yielding corresponding bounds involving $\mathfrak{R}_m(H)$. The detailed proof of Theorem 5.1 can be found in [26].

In Section VI, we decompose the hypothesis function into a defuzzification function and an optimization function: $h = g \circ M$, where $M$ maps a fuzzy input to a crisp vector and $g$ belongs to a class $G$ of optimization functions associated with the class $\mathcal{M}$ of defuzzification functions. We let $L_G = \{l(g(M(\tilde{X}_i)), y) \mid M \in \mathcal{M}, g \in G, y \in \mathcal{Y}\}$ be the class of loss functions associated with $G$. Then, we can obtain the following theorem using Theorem 5.1.
Theorem 5.2: Let $S_X = \{\tilde{X}_i\}_{i=1}^m$, and suppose there are $C, C_l > 0$ such that $\sup_{g \in G}\|g\|_\infty \le C$ and $\sup_{\|g\|_\infty \le C} \max_y l(g, y) \le C_l$, and every $l \in L_G$ is an $L_l$-Lipschitz function. For any $\delta > 0$, with fuzzy probability at least $1 - \delta$, each of the following holds for all $l \in L_G$:
$$R_{\tilde{D}}(h) \le \hat{R}_S(h) + 2\,\mathfrak{R}_m(L_G) + C_l\sqrt{\frac{\ln(1/\delta)}{2m}}, \qquad R_{\tilde{D}}(h) \le \hat{R}_S(h) + 2\,\hat{\mathfrak{R}}_S(L_G) + 3\,C_l\sqrt{\frac{\ln(2/\delta)}{2m}}.$$
The proof of Theorem 5.2 is similar to that of Theorem 5.1.

Next, we consider the estimation error bounds for kernel-based optimization functions such as the support vector machine (SVM). Let $K : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ be a PDS kernel function, $\Phi : \mathbb{R}^p \to \mathbb{H}$ a feature mapping associated with $K$, and $w_1, \ldots, w_K \in \mathbb{H}$ weight vectors. For any $p \ge 1$, the family of kernel-based hypotheses is denoted as
$$G_{K,p} = \big\{\tilde{X} \mapsto \big(w_1 \cdot \Phi(M(\tilde{X})), \ldots, w_K \cdot \Phi(M(\tilde{X}))\big) : \|w_k\|_{\mathbb{H}} \le \Lambda,\ k = 1, \ldots, K\big\}.$$
The fuzzy Rademacher complexity of $G_{K,p}$ can be bounded as follows.
Lemma 5.1: Let $K : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ be a PDS kernel function and $\Phi : \mathbb{R}^p \to \mathbb{H}$ a feature mapping associated with $K$. Assume that there exists $r > 0$ such that $K(M(\tilde{X}), M(\tilde{X})) \le r^2$ for all $\tilde{X} \in \mathcal{X}$. Then the fuzzy Rademacher complexity of the hypothesis set $G_{K,p}$ can be bounded as follows:
$$\hat{\mathfrak{R}}_{S_X}(G_{K,p}) = \frac{1}{m}\,\mathbb{E}_{\sigma}\Big[\sup_{\|w\|_{\mathbb{H}} \le \Lambda} w \cdot \sum_{i=1}^m \sigma_i \Phi(M(\tilde{X}_i))\Big] \le \frac{\Lambda}{m}\,\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^m \sigma_i \Phi(M(\tilde{X}_i))\Big\| \quad \text{(Cauchy-Schwarz inequality)}$$
$$\le \frac{\Lambda}{m}\sqrt{\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^m \sigma_i \Phi(M(\tilde{X}_i))\Big\|^2} \quad \text{(Jensen's inequality)} \quad = \frac{\Lambda}{m}\sqrt{\sum_{i=1}^m K\big(M(\tilde{X}_i), M(\tilde{X}_i)\big)} \le \sqrt{\frac{r^2 \Lambda^2}{m}},$$
which yields the result. Combining Theorem 5.2 and Lemma 5.1 directly yields the following generalization bound.
Theorem 5.3: Let $K : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ be a PDS kernel function and $\Phi : \mathbb{R}^p \to \mathbb{H}$ a feature mapping associated with $K$. Assume that there exists $r > 0$ such that $K(M(\tilde{X}), M(\tilde{X})) \le r^2$ for all $\tilde{X} \in \mathcal{X}$. Let $S_X = \{\tilde{X}_i\}_{i=1}^m$ with $\tilde{X}_i \sim \tilde{D}$, and suppose there are $C, C_l > 0$ such that $\sup_{g \in G_{K,p}}\|g\|_\infty \le C$ and $\sup_{\|g\|_\infty \le C}\max_y l(g, y) \le C_l$, and every $l \in L_{G_{K,p}}$ is an $L_l$-Lipschitz function. For any $\delta > 0$, with fuzzy probability at least $1 - \delta$, the following holds for all $h \in G_{K,p}$:
$$R_{\tilde{D}}(h) \le \hat{R}_S(h) + O\Big(\sqrt{\frac{r^2\Lambda^2}{m}}\Big) + C_l\sqrt{\frac{\ln(1/\delta)}{2m}}.$$
From the above bounds, we see that with the constants fixed, $\hat{R}_S(h) \to R_{\tilde{D}}(h)$ as $m \to \infty$. Therefore, these bounds demonstrate that we can always obtain a fuzzy classifier with high classification accuracy when enough fuzzy-feature instances can be collected. These theoretical analyses reveal that fuzzy classifiers can be constructed to handle the MCIMO problem effectively and accurately.
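Before moving on, the following minimal sketch illustrates Definition 5.1 by estimating the empirical fuzzy Rademacher complexity via Monte Carlo for a finite pool of candidate hypotheses. It is illustrative only: the loss matrix (losses of each candidate hypothesis on each already-defuzzified instance) is assumed to be supplied by the user, and the function name is our own, not from the paper.

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical (fuzzy) Rademacher complexity.

    loss_matrix: array of shape (n_hypotheses, m); entry [j, i] holds
    l(h_j(X_i), y_i), the loss of candidate hypothesis j on instance i
    (fuzzy inputs are assumed to be already defuzzified).
    """
    rng = np.random.default_rng(seed)
    n_h, m = loss_matrix.shape
    estimates = np.empty(n_draws)
    for t in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)         # Rademacher variables
        estimates[t] = np.max(loss_matrix @ sigma) / m  # sup over the finite pool
    return estimates.mean()
```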

VI. CONSTRUCTING FUZZY CLASSIFIERS FOR THE MCIMO PROBLEM
In this section, two fuzzy classifiers are constructed to handle the MCIMO problem. The framework of the proposed algorithms is shown in Figure 1. In the MCIMO problem, we aim to train a fuzzy classifier for fuzzy-feature input prediction, where the fuzzy features are represented by trapezoidal fuzzy numbers. A trapezoidal fuzzy number $\tilde{x}$ can be characterized by $(a_1, b_1, b_2, a_2)$, and its membership function is
$$\mu_{\tilde{x}}(x) = \begin{cases} \dfrac{x - a_1}{b_1 - a_1}, & a_1 \le x < b_1,\\[2pt] 1, & b_1 \le x \le b_2,\\[2pt] \dfrac{a_2 - x}{a_2 - b_2}, & b_2 < x \le a_2,\\[2pt] 0, & \text{otherwise.} \end{cases}$$
Finally, when $b_1 = b_2$, a trapezoidal fuzzy number becomes a triangular fuzzy number; thus, a triangular fuzzy number $\tilde{x}$ can be characterized by $(a_1, b_1, a_2)$.
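As a concrete reference for this notation, the following minimal Python sketch represents a trapezoidal fuzzy number $(a_1, b_1, b_2, a_2)$ with its membership function and $\alpha$-cuts; the class and field names are our own, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class TrapezoidalFuzzyNumber:
    a1: float  # left support endpoint
    b1: float  # left core endpoint
    b2: float  # right core endpoint (b1 == b2 gives a triangular fuzzy number)
    a2: float  # right support endpoint

    def membership(self, x: float) -> float:
        if self.a1 <= x < self.b1:
            return (x - self.a1) / (self.b1 - self.a1)
        if self.b1 <= x <= self.b2:
            return 1.0
        if self.b2 < x <= self.a2:
            return (self.a2 - x) / (self.a2 - self.b2)
        return 0.0

    def alpha_cut(self, alpha: float) -> tuple[float, float]:
        # [x^L_alpha, x^U_alpha]; both endpoints are linear in alpha
        return (self.a1 + alpha * (self.b1 - self.a1),
                self.a2 - alpha * (self.a2 - self.b2))
```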
To address the MCIMO problem, we need to construct a hypothesis function $h \in H$ mapping the input space $\mathcal{X} \subset \mathcal{F}^p_{\mathbb{R}^p}$ into $\mathbb{R}^K$. A hypothesis function $h$ can be decomposed into a composition of two functions. The first function $M$, called the defuzzification function, maps each fuzzy feature to a crisp value. Four defuzzification methods are considered (a sketch of all four appears after this list):

1) The first method, the Mean/Middle of Maxima (MOM) [49], is widely used due to its simplicity. MOM returns the mean of the points at which the membership function attains its maximum; for a trapezoidal fuzzy number, $M_{\mathrm{MOM}}(\tilde{x}) = (b_1 + b_2)/2$.

2) The Centre of Gravity (COG) [50] is another widely used defuzzification method. For the discrete and continuous cases, respectively,
$$M_{\mathrm{COG}}(\tilde{x}) = \frac{\sum_i x_i\, \mu_{\tilde{x}}(x_i)}{\sum_i \mu_{\tilde{x}}(x_i)}, \qquad M_{\mathrm{COG}}(\tilde{x}) = \frac{\int x\, \mu_{\tilde{x}}(x)\, dx}{\int \mu_{\tilde{x}}(x)\, dx}.$$

3) The third approach, averaging level cuts (ALC) [51], is the flat average of the midpoints of all $\alpha$-cuts:
$$M_{\mathrm{ALC}}(\tilde{x}) = \int_0^1 \frac{\tilde{x}^L_\alpha + \tilde{x}^U_\alpha}{2}\, d\alpha.$$

4) The final method, the value of a fuzzy number (VAL) [52], uses the $\alpha$-levels as weighting factors when averaging the $\alpha$-cut midpoints:
$$M_{\mathrm{VAL}}(\tilde{x}) = \int_0^1 \alpha\,\big(\tilde{x}^L_\alpha + \tilde{x}^U_\alpha\big)\, d\alpha.$$

In Section VII, we compare the performance of these defuzzification methods on synthetic datasets; the experimental results show that VAL outperforms the other three. Therefore, VAL is used as the defuzzification function in all subsequent experiments.
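A minimal sketch of the four defuzzifiers, reusing `TrapezoidalFuzzyNumber` from the sketch above. COG, ALC and VAL are computed by numerical integration for clarity (MOM uses its closed form); the grid size is an arbitrary choice, not a value from the paper.

```python
import numpy as np

def mom(t):
    # Mean of Maxima: the core [b1, b2] is where membership is maximal
    return 0.5 * (t.b1 + t.b2)

def cog(t, n=1001):
    # Centre of Gravity: integral of x*mu(x) divided by integral of mu(x)
    xs = np.linspace(t.a1, t.a2, n)
    mu = np.array([t.membership(x) for x in xs])
    return np.trapz(xs * mu, xs) / np.trapz(mu, xs)

def alc(t, n=1001):
    # Averaging Level Cuts: flat average of alpha-cut midpoints
    alphas = np.linspace(0.0, 1.0, n)
    mids = np.array([sum(t.alpha_cut(a)) / 2.0 for a in alphas])
    return np.trapz(mids, alphas)

def val(t, n=1001):
    # Value of a fuzzy number: alpha-weighted average of alpha-cut endpoints
    alphas = np.linspace(0.0, 1.0, n)
    sums = np.array([sum(t.alpha_cut(a)) for a in alphas])
    return np.trapz(alphas * sums, alphas)
```

As a quick sanity check, for the symmetric trapezoid $(0, 1, 1, 2)$ all four methods return 1.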
After this first step, the initial problem becomes a traditional multi-class classification problem with crisp data. The second function, called the optimization function, is therefore a hypothesis function that maps $\mathbb{R}^p$ into $\mathbb{R}^K$ to solve this traditional multi-class classification problem. Since support vector machines and neural networks have achieved great success on multi-class classification problems, we apply both as optimization methods and introduce them next.

A. Defuzzified support vector machine
First, a support vector machine (one-vs-rest SVM [53]) with a PDS kernel function is used as the optimization function to solve the MCIMO problem. Suppose $D_{tr} = \{(\tilde{X}_i, y_i)\}_{i=1}^m$ is the training set. In the one-vs-rest scheme, for each category $l$ the sample is relabeled: $+l$ indicates that $\tilde{X}_i$ belongs to category $l$, and $-l$ indicates that it does not. In the first step, the VAL defuzzification function transforms each fuzzy input $\tilde{X}_i$ into a crisp vector $M(\tilde{X}_i)$. We then solve $K$ optimization problems separately; the $l$-th problem is the standard soft-margin SVM
$$\min_{w_l, b_l, \xi} \; \frac{1}{2}\|w_l\|^2 + C\sum_{i=1}^m \xi_i \quad \text{s.t.} \quad y_i^{(l)}\big(w_l \cdot \Phi(M(\tilde{X}_i)) + b_l\big) \ge 1 - \xi_i, \; \xi_i \ge 0,$$
where $y_i^{(l)} = +1$ if $y_i = l$ and $-1$ otherwise. The decision function is
$$h(\tilde{X}) = \arg\max_{l \in \{1,\ldots,K\}} \big(w_l \cdot \Phi(M(\tilde{X})) + b_l\big).$$
The resulting algorithm, called the defuzzified support vector machine (DF-SVM), is summarized in Algorithm 1: (1) defuzzify the training data with VAL; (2) train the $K$ one-vs-rest SVMs; (3) classify new fuzzy inputs with the decision function above. A minimal sketch follows.
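A minimal DF-SVM sketch using scikit-learn: defuzzify every fuzzy feature with VAL, then fit a one-vs-rest SVM with an RBF (PDS) kernel. The helper names (`val`, `TrapezoidalFuzzyNumber`) come from the sketches above; the hyperparameter values are illustrative placeholders, not the ones tuned in the paper.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def defuzzify_dataset(fuzzy_X, defuzzify=val):
    # fuzzy_X: list of instances, each a list of TrapezoidalFuzzyNumber
    return np.array([[defuzzify(f) for f in inst] for inst in fuzzy_X])

def df_svm_fit(fuzzy_X, y, C=1.0, gamma="scale"):
    # Train K one-vs-rest binary SVMs on the defuzzified crisp matrix
    X = defuzzify_dataset(fuzzy_X)
    clf = OneVsRestClassifier(SVC(kernel="rbf", C=C, gamma=gamma))
    return clf.fit(X, y)

def df_svm_predict(clf, fuzzy_X):
    # argmax over the K per-class decision values is handled by the wrapper
    return clf.predict(defuzzify_dataset(fuzzy_X))
```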
Algorithm 2: DF-MLP [26]
1: Input: training data $D_{tr}$, learning rate $\eta$, fixed number of epochs $T_{max}$, loss function (cross-entropy is selected) and optimization algorithm (Adam [54] is selected);
2: Defuzzify each input with VAL and compute the loss $l(g(M(\tilde{X}_i)), y_i)$, where $g$ is the multilayer perceptron defined in Section VI-B;
3: Update the network parameters with the optimization algorithm, repeating until $T_{max}$ epochs are reached;
4: Output: the trained DF-MLP classifier and its decision function.

B. Defuzzified multilayer perceptron
Second, a multilayer perceptron model with two hidden layers and a softmax output layer is used as the optimization function to complete the second step. Denote the parameters of the two hidden layers by $W_1, b_1$ and $W_2, b_2$, the parameters of the output layer by $W_0, b_0$, and the activation function by $\phi$. Given a fuzzy-feature input $\tilde{X}$, the output of the multilayer perceptron is
$$g(M(\tilde{X})) = \mathrm{softmax}\big(W_0\, \phi\big(W_2\, \phi\big(W_1 M(\tilde{X}) + b_1\big) + b_2\big) + b_0\big),$$
where $M$ is the VAL defuzzification function. The resulting algorithm, called the defuzzified multilayer perceptron (DF-MLP), is shown in Algorithm 2.
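A minimal DF-MLP sketch in PyTorch matching the architecture above (two hidden layers; the softmax is applied implicitly by the cross-entropy loss; Adam is the optimizer, as in Algorithm 2). The hidden width and learning rate are illustrative defaults, not tuned values from the paper.

```python
import torch
import torch.nn as nn

class DFMLP(nn.Module):
    """Two hidden layers and an output layer; CrossEntropyLoss applies
    the softmax internally, so the network returns raw logits."""
    def __init__(self, p: int, K: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(p, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, K),
        )

    def forward(self, x):
        return self.net(x)

def train_df_mlp(X_crisp, y, K, lr=1e-3, epochs=100, weight_decay=1e-4):
    # X_crisp: (m, p) array of VAL-defuzzified features; y: (m,) integer labels
    X = torch.as_tensor(X_crisp, dtype=torch.float32)
    t = torch.as_tensor(y, dtype=torch.long)
    model = DFMLP(X.shape[1], K)
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), t).backward()
        opt.step()
    return model
```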

VII. EXPERIMENTS ON SYNTHETIC DATASETS
In this section, we first compare the performance of different defuzzification methods on synthetic datasets to select the optimal defuzzification function for the proposed algorithms. Then, we verify the efficacy of the proposed algorithms for solving the MCIMO problem by comparing them with seven baselines in terms of classification accuracy on synthetic datasets.

A. Dataset generation
In this section, we describe how to construct the synthetic dataset (balanced data), which contains $N$ fuzzy instances distributed over five categories, each with 20 fuzzy features. First, we generate real-valued vectors for the five categories with a random number generator; these serve as the true values of the instances. Then, we use the generated real-valued vectors to construct the fuzzy observation datasets.
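The paper does not spell out the exact construction here, so the following is a hypothetical sketch under our own assumptions: class means are drawn at random, each true value is perturbed, and a trapezoid is wrapped around the noisy value. All spreads and scales are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_trapezoid(v, core=0.2, spread=0.5):
    # Hypothetical fuzzification: wrap a noisy centre in a trapezoid (a1, b1, b2, a2)
    c = v + rng.normal(0.0, 0.1)
    return (c - spread, c - core, c + core, c + spread)

N, p, K = 2000, 20, 5
means = rng.normal(scale=3.0, size=(K, p))            # one mean vector per category
labels = rng.integers(0, K, size=N)
true_vectors = means[labels] + rng.normal(size=(N, p))
fuzzy_data = [[to_trapezoid(v) for v in row] for row in true_vectors]
```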

B. Experimental setup
In this section, the baselines and the experimental details of all methods (the seven baselines, DF-SVM and DF-MLP) are introduced.
1) Baselines: The first five baselines are called Meanlogistic, MeanSVM, MeanDecisiontree, MeanRandomForest and MeanMLP. For fuzzy-feature datasets, a fuzzy feature is denoted $\tilde{x} = (\inf P_0, \sup P_0, \inf P_1, \sup P_1)$, and $M_1(\tilde{x}) = (\inf P_0 + \sup P_0 + \inf P_1 + \sup P_1)/4$ is used to transform fuzzy features into crisp features. For interval-valued datasets, an interval-valued feature is denoted $x = [A, B]$, and similarly $M_2(x) = (A + B)/2$ is used to transform interval-valued features into crisp features. These baselines then apply five well-known machine learning methods (logistic regression, SVM, decision trees, random forests and neural networks) to classify the resulting crisp-valued data. The last two baselines, DCCF and BCCF, are presented in [21]. Minimal sketches of $M_1$ and $M_2$ follow.
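The two baseline mappings transcribed directly into Python (function names are ours):

```python
def m1(fuzzy_feature):
    """M1: mean of the trapezoid parameters (inf P0, sup P0, inf P1, sup P1)."""
    inf_p0, sup_p0, inf_p1, sup_p1 = fuzzy_feature
    return (inf_p0 + sup_p0 + inf_p1 + sup_p1) / 4.0

def m2(a, b):
    """M2: midpoint of the interval-valued feature [A, B]."""
    return (a + b) / 2.0
```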
2) Experimental details: For DF-MLP, we set momentum $= 0.9$ and weight decay $= 0.0001$. For the DCCF and BCCF algorithms, $\varphi$ is chosen to be the Lebesgue measure on $[0, 1]$, $\theta = 1/3$, and $K(u) = \frac{15}{8}(1 - u^2)^2\, \mathbb{1}_{\{u \in [0,1]\}}$ is used as the kernel function; according to [21], these settings give DCCF and BCCF their best performance. However, DCCF and BCCF can only process fuzzy data with a single fuzzy feature, whereas the generated synthetic datasets contain multiple fuzzy features. We therefore use the average distance between corresponding fuzzy features to represent the distance between fuzzy feature vectors in DCCF and BCCF.
For each algorithm on each dataset, we randomly divide the dataset into a training set, a validation set and a test set containing 60%, 20% and 20% of the data, respectively. First, we select the hyperparameters that achieve the highest average classification accuracy on the validation set, where the average is taken over 10 repeated experiments; the hyperparameters to be selected are shown in Table I. The selected optimal hyperparameters are then used to evaluate each algorithm on the test set. We repeat the entire experimental process 20 times, so the final results are reported as "mean ± standard deviation." To avoid random errors, we randomly shuffle the data before each experiment. Classification accuracy is used to evaluate the performance of the proposed model:
$$\mathrm{Accuracy} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big\{h(\tilde{X}_i) = f(\tilde{X}_i)\big\},$$
where $f(\tilde{X})$ is the ground-truth label of $\tilde{X}$ and $h(\tilde{X})$ is the label predicted by the presented algorithms and the baselines. A sketch of this protocol follows.
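A sketch of the 60/20/20 evaluation protocol; `evaluate` is a hypothetical callback that trains on the training split, tunes on the validation split, and returns test accuracy. It stands in for any of the methods above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def run_once(X, y, evaluate, seed):
    # 60% train, 20% validation, 20% test, with shuffling to avoid random errors
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, test_size=0.4, shuffle=True, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed)
    return evaluate(X_tr, y_tr, X_val, y_val, X_te, y_te)

def repeated_protocol(X, y, evaluate, repeats=20):
    accs = [run_once(X, y, evaluate, seed=s) for s in range(repeats)]
    return np.mean(accs), np.std(accs)  # reported as "mean ± standard deviation"
```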
In the first experiment, we compare the performance of the two proposed algorithms with different defuzzification functions on the test set as the number of synthetic instances increases; $N$ is selected from $\{200, 400, \ldots, 3000, 3500, 4000\}$. In the second experiment, we generate 2000 synthetic instances and analyze them with the proposed methods and the baselines. In addition, we report the Wilcoxon rank-sum test results comparing the best-performing method with the other methods.

C. Experimental results analysis
The results of the first experiment are shown in Figure 2. From Figures 2(a) and 2(b), we find that COG and VAL outperform the other two methods in terms of convergence speed and classification error, and that VAL is more stable than the other three methods. VAL achieves better performance because it uses all the information in the fuzzy sets, so no key information is discarded; in addition, it gives less weight to the lower levels of the fuzzy sets, which is reasonable from the perspective of the membership function. Therefore, we use VAL as the defuzzification method in the following experiments. Moreover, Figure 2(c) shows that the convergence rate of the two proposed algorithms with the VAL defuzzification method is $O(1/\sqrt{m})$. This confirms the theoretical results of Section V: we can always obtain a fuzzy classifier with high classification accuracy when sufficient fuzzy-feature observations are available.
The results of the second experiment are shown in Table II, and Figure 3 shows the classification accuracy curve of Algorithm 2 on the synthetic datasets versus the number of epochs. From the results, DF-SVM and DF-MLP perform better than most of the other baselines on the synthetic dataset. Further, the statistical test shows that DF-SVM significantly outperforms the other methods at the 0.05 significance level, which demonstrates the superiority of the proposed algorithms. In addition, we report the running times of the proposed algorithms and all baselines.

VIII. EXPERIMENTS ON REAL-WORLD DATASETS
In this section, five real-world datasets are used to verify the efficacy of the proposed algorithms for solving the MCIMO problem, comparing them with seven baselines in terms of classification accuracy. We also show how to apply the proposed algorithms to interval-valued datasets.

A. Real-world datasets
In this section, we briefly introduce the five real-world datasets used in the experiments.

1) Perceptions experiment dataset: The first dataset, the perceptions experiment dataset, contains 551 observations with one fuzzy feature. The fuzzy feature is a trapezoidal fuzzy number characterized by $(\inf P_0, \sup P_0, \inf P_1, \sup P_1)$. Each observation is the result of the perceptions experiment for one person. A description of the perceptions experiment can be found at the following URL: http://bellman.ciencias.uniovi.es/SMIRE/Perceptions.html.
In the perceptions experiment, the black line shown to participants is displayed in Figure 4. After seeing the black line, each participant gives a trapezoidal fuzzy number characterized by $(\inf P_0, \sup P_0, \inf P_1, \sup P_1)$ to describe it. For this dataset, we use the fuzzy feature (i.e., the trapezoidal fuzzy number) to predict the category (very small, small, medium, large or very large) selected by the participant according to their perception of the black line.
2) Mushroom dataset: The second dataset is the California mushroom dataset, which contains 245 instances across 17 fungal species categories. There are five interval-valued variables: the pileus cap width ($X_1$), the stipe length ($X_2$), the stipe thickness ($X_3$), the spores' major axis length ($X_4$) and the spores' minor axis length ($X_5$). Some instances of the mushroom dataset are shown in Table III. The goal of our experiment on this dataset is to predict the species category of a California mushroom from the five interval-valued features.
3) Letter recognition dataset: The third dataset is the letter recognition dataset from the UCI Machine Learning Repository (https://archive-beta.ics.uci.edu/), which contains 20000 instances in 26 categories and 16 integer features extracted from raster-scan images of letters. We use the method described in Section VII to transform the integer features into fuzzy features, obtaining a real-world dataset with fuzzy-valued features. The goal of our experiment on this dataset is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters of the English alphabet.
4) London weather dataset: The fourth dataset is the meteorological data of London (from March 1, 2016 to December 31, 2021) provided by the 'Reliable Prognosis' site (https://rp5.ru/), which contains 2131 instances. Each instance is the meteorological data of one day in London, described by five interval-valued variables (air temperature $T$, atmospheric pressure at weather-station level $P_0$, atmospheric pressure reduced to mean sea level $P$, humidity $U$ and dew-point temperature $T_d$) and one categorical variable (precipitation or not: 0 ≡ no precipitation, 1 ≡ precipitation). Some instances of this dataset are shown in Table IV. We aim to use the five interval-valued features for precipitation prediction.
5) Washington weather dataset: The fifth dataset is the meteorological data of Washington (from January 1, 2016 to December 31, 2021), also from the 'Reliable Prognosis' site, which contains 2191 instances. Each instance is the meteorological data of one day in Washington, described by the same five interval-valued variables and the same categorical variable as the fourth dataset. We aim to use the five interval-valued features for precipitation prediction.

B. Preprocessing of interval-valued data
We note that the features of the second, fourth and fifth datasets are interval-valued. Therefore, in this section, we present an approach to transform interval-valued features into fuzzy-valued features. Let $[A, B]$ denote a feature of an interval-valued instance. We map $[A, B]$ to a triangular fuzzy number $\tilde{x}$ characterized by $(A, \beta A + (1 - \beta)B, B)$, where $\beta \in [0, 1]$ is a hyperparameter controlling the shape of the membership function of $\tilde{x}$.
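A one-line sketch of this mapping (the function name is ours):

```python
def interval_to_triangular(a, b, beta):
    """Map [A, B] to the triangular fuzzy number (A, beta*A + (1-beta)*B, B).

    beta in [0, 1] places the peak of the membership function inside the
    interval: beta = 0 puts it at B, beta = 1 at A, beta = 0.5 at the midpoint.
    """
    return (a, beta * a + (1.0 - beta) * b, b)
```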
Through the above preprocessing, the DF-SVM and DF-MLP algorithms can be used to classify datasets with interval-valued instances. In addition, the second dataset is imbalanced, meaning that the categories contain different numbers of instances. Therefore, a random oversampling technique (KMeansSMOTE [55]) is used to improve the performance of the proposed algorithms; after oversampling, each category in the second dataset is expanded to 30 instances.
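A minimal oversampling sketch with imbalanced-learn's KMeansSMOTE; the variable names and the assumption that the 17 species are encoded as labels 0-16 are ours.

```python
from imblearn.over_sampling import KMeansSMOTE

# Expand every species category of the defuzzified mushroom features to 30 instances.
sampler = KMeansSMOTE(sampling_strategy={k: 30 for k in range(17)}, random_state=0)
X_balanced, y_balanced = sampler.fit_resample(X_crisp, y)
```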

C. Experimental setup
We use the same baselines as in Section VII, and the experimental details of all methods are essentially the same as in Section VII. The only difference is that one additional hyperparameter, $\beta$, must be selected when analyzing the interval-valued datasets; we select the shape parameter $\beta$ from $\{0, 0.05, 0.1, \ldots, 1\}$. Further, we conduct Wilcoxon rank-sum tests between the best-performing method and the other methods on the real-world datasets. Since DCCF and BCCF cannot handle datasets with a large number of instances well, we compare the proposed algorithms only with the first five baselines on the last three datasets.
In addition, since the second dataset is imbalanced, we use balanced accuracy [56] and AUC instead of classification accuracy to compare model performance on it. Balanced accuracy is defined as
$$\text{Balanced Accuracy} = \frac{1}{K}\sum_{k=1}^{K} \text{Recall}_k, \qquad \text{Recall}_k = \frac{TP_k}{TP_k + FN_k},$$
where TP denotes true positives, TN true negatives, FP false positives and FN false negatives. AUC is the area under the receiver operating characteristic curve. A sketch of computing both metrics follows.
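Both metrics are available in scikit-learn; in this sketch, `y_true`, `y_pred` and `y_score` are placeholders for the test labels, the predicted labels and the predicted class-probability matrix.

```python
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

bal_acc = balanced_accuracy_score(y_true, y_pred)        # mean per-class recall
auc = roc_auc_score(y_true, y_score, multi_class="ovr")  # one-vs-rest AUC
```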

D. Experimental results analysis
All experimental results on the five real-world datasets are shown in Tables V-X, and Figure 5 shows how the evaluation metrics vary with the number of epochs for Algorithm 2. From these results, the two proposed algorithms achieve better performance than the baselines on all five real-world datasets, which illustrates their efficacy on real-world datasets with fuzzy-valued or interval-valued features. Moreover, the statistical tests show that the proposed algorithms significantly outperform most other methods at the 0.05 significance level, demonstrating their superiority. Further, DF-MLP obtains the highest average test performance on the first, second and fifth datasets, while DF-SVM is preferable on the letter recognition and London weather datasets; this indicates that the proposed algorithms are applicable to different types of datasets.

E. Parameters sensitivity analysis
In this section, we analyze whether the value of the shape parameter $\beta$ in DF-SVM and DF-MLP affects the balanced accuracy and AUC on the mushroom dataset.
We apply the same preprocessing to the mushroom dataset and select the shape parameter $\beta$ from $\{0, 0.05, 0.1, \ldots, 1\}$. For each value of $\beta$, the results are obtained using the same experimental procedure as in Section VII. Figures 6(a) and 6(b) show the mean and standard deviation of the balanced accuracy and AUC on the mushroom test sets as the shape parameter $\beta$ of both algorithms varies from 0 to 1. The figures show that the value of $\beta$ affects classification performance, since $\beta$ determines the shape of the triangular fuzzy number: a value of $\beta$ that achieves high performance means that the proposed algorithms with that value extract more significant information from datasets with fuzzy-valued or interval-valued features. Therefore, we can improve the performance of DF-SVM and DF-MLP by finding a suitable value of $\beta$; in our experiments, the optimal value is chosen on the validation set. A sketch of this sweep follows.
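A sketch of the $\beta$ sweep, reusing the hypothetical `interval_to_triangular`, `val` and `TrapezoidalFuzzyNumber` helpers from earlier (a triangular fuzzy number is the trapezoid with $b_1 = b_2$); `interval_data`, `y` and `evaluate_balanced_accuracy` are placeholders standing in for the Section VII protocol.

```python
import numpy as np

def beta_feature(a, b, beta):
    # Fuzzify the interval, then defuzzify with VAL for the classifiers
    _, m, _ = interval_to_triangular(a, b, beta)
    return val(TrapezoidalFuzzyNumber(a, m, m, b))

betas = np.arange(0.0, 1.0001, 0.05)
scores = []
for beta in betas:
    X_crisp = np.array([[beta_feature(a, b, beta) for (a, b) in inst]
                        for inst in interval_data])
    scores.append(evaluate_balanced_accuracy(X_crisp, y))
best_beta = betas[int(np.argmax(scores))]  # chosen on the validation set
```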

IX. CONCLUSION AND FUTURE WORK
In this paper, we identify a new problem called multi-class classification with imprecise observations (MCIMO).In the MCIMO problem, we need to train a fuzzy classifier when only fuzzy-feature observations are available.
First, we formally define the MCIMO problem in Section IV. Since no existing work provides a theoretical analysis of fuzzy classifiers, we derive estimation error bounds for the MCIMO problem in this paper. These bounds show that we can always train a fuzzy classifier with high classification accuracy for the MCIMO problem as long as sufficient fuzzy-feature instances can be collected.
Then, two algorithms are constructed to handle the MCIMO problem. The optimal defuzzification function for the proposed fuzzy-technique-based algorithms is found by comparing the performance of different defuzzification methods on synthetic datasets. Finally, experimental results on synthetic datasets and five real-world datasets show the superiority of the proposed algorithms. Moreover, comparisons with several non-fuzzy baselines demonstrate that the proposed fuzzy-based methods analyze fuzzy or interval-valued data better than non-fuzzy methods, since they use fuzzy vectors to express the distribution of imprecise data and apply defuzzification to extract crisp-valued information from imprecise observations.
In future research, we plan to study more complicated problems, such as covariate shift and domain adaptation with imprecise observations, whose theoretical analyses and solutions can build on the analysis and algorithms introduced in this paper. In addition, since the proposed algorithms perform well on interval-valued data, we will further study the analysis of interval-valued data based on them in future work.
Fig. 3. Accuracy curve on the synthetic datasets vs. the number of epochs.

Fig. 4. Software to evaluate the visual perception of a line segment.

Fig. 5. Evaluation metrics vs. the number of epochs (panels include DF-MLP on the London weather dataset and DF-MLP on the Washington weather dataset).

Fig. 6. Evaluation metrics on the test sets vs. the value of the shape parameter β.

TABLE IV. Some instances of the London weather data.

TABLE V. Experimental results on the perceptions experiment dataset.

TABLE VII. The p-values of the statistical test on the mushroom dataset.

TABLE VIII. Experimental results on the letter recognition dataset. The bold value represents the highest accuracy in each column; p denotes the p-value of the Wilcoxon rank-sum test between the performance of DF-SVM and each other algorithm.