LR-BCA: Label Ranking for Bridge Condition Assessment

Bridge condition assessment (BCA) plays an important role in modern bridge management. Existing assessment methods are time-consuming, labor-intensive, and error-prone. Machine learning can effectively address these problems, but the large amount of label noise in bridge datasets severely degrades the performance of BCA models. In this paper, we present an effective label ranking approach for BCA (LR-BCA). The proposed LR-BCA method takes into account the natural order relationship between bridge condition ratings. Moreover, a heuristic data cleaning (HDC) approach is proposed for cleaning bridge condition datasets. The HDC method first identifies all label-conflict examples and then iteratively filters out the noise. Experimental results on a real-world dataset confirm the effectiveness of the HDC method and demonstrate that the proposed LR-BCA method achieves 99% Top-2 accuracy, which is highly competitive with baseline methods.


I. INTRODUCTION
Bridges play an important role in society, as they support the daily movement of people and goods. However, bridge collapses occasionally occur because bridge components deteriorate over time. Owing to the crucial importance of bridges, some countries use bridge management systems (BMS) and intelligent transportation systems (ITS) to manage and utilize bridge condition information for traffic management [1]. Bridge management aims to determine the optimal strategy for operating bridges under adequate safety, thereby minimizing operational costs over the bridge's life cycle. BCA is therefore an important part of BMS and ITS, and it can provide decision support for bridge management. Table 1 illustrates the attributes of bridge condition data. Generally, a bridge is subdivided into several components in the inspection phase. Bridge inspectors record component condition data on site and assess component condition by visual inspection. However, the accuracy of the recorded data is limited because it depends on the inspectors' experience and involves a large amount of fuzzy bridge condition information [2]. Inaccurately recorded data may make it difficult to evaluate bridge condition correctly. BCA is a data-driven operation that needs high-quality data. On the one hand, a wrong assessment can lead to unnecessary maintenance and reinforcement. On the other hand, any delayed or erroneous assessment of bridge condition may lead to higher maintenance costs and significant safety hazards in the future.
The associate editor coordinating the review of this manuscript and approving it for publication was Li He.
We observe that there is a considerable amount of noise in the bridge dataset. For the reasons above, it is vital to eliminate ''incorrectly labelled'' data (or noise) in the data cleaning process after bridge condition data are acquired. This paper constructs conflict-pairs to identify noise, and then iteratively filters the noise out based on a ranking of the number of conflicts per example. The cleaning method works well on real-world datasets and yields significant improvements in model performance. Besides, this paper models the BCA problem as a label ranking problem, which makes it possible to evaluate bridge condition effectively. The main contributions of this paper are summarized as follows.
• We propose an effective label ranking approach for bridge condition assessment (LR-BCA), which models the bridge rating problem as a label ranking task. It can effectively evaluate bridge condition and improves the error tolerance of the model. To handle conflict examples in the BCA dataset, a heuristic data cleaning (HDC) approach is also developed.
• Based on a real-world bridge condition dataset, we experimentally evaluate our LR-BCA and HDC approaches. Extensive experimental results show that LR-BCA is highly competitive with baseline methods. In addition, the proposed HDC approach effectively reduces conflict examples in the BCA dataset, thus improving assessment performance.
The rest of the paper is organized as follows. After a brief literature review on existing BCA methods and cleaning methods in Section II, we present a heuristic data cleaning method based on conflict-pairs in Section III. Section IV introduces a label ranking method for bridge condition assessment. Section V provides a set of experiments, as well as the experimental results and analysis. Finally, concluding remarks and potential research directions are given in Section VI.

II. RELATED WORK
A. DATA CLEANING METHODS
Noise in this paper refers to data with incorrect class labels, i.e., label noise. Data cleaning techniques aim to detect and repair the noise in datasets. Existing methods can be divided into three categories, which we refer to as label-noise-robust, dataset-cleaning, and threshold-based filtering approaches, respectively.

1) LABEL-NOISE-ROBUST APPROACHES
The goal of these methods is to modify existing algorithms so that they are less sensitive to noise. An example is the C4.5 algorithm [3], which reduces the complexity of the decision tree through pruning strategies and thereby lowers the probability of over-fitting. However, C4.5 performs poorly on datasets with large numbers of attributes. Fürnkranz and Widmer [4] proposed the incremental reduced error pruning (IREP) algorithm in 1994, which could effectively deal with large noisy datasets, but its classification accuracy was lower than that of C4.5. Subsequently, Cohen [5] proposed the repeated incremental pruning to produce error reduction (RIPPER) algorithm, an improved version of IREP. Compared with C4.5, RIPPER was not only competitive in classification accuracy but also more robust to noise. Nettleton et al. [6] experimentally compared the robustness of several algorithms in handling label noise. Their results showed that naive Bayes was the most robust algorithm, whereas the support vector machine (SVM) [7] was easily affected by noise.

2) DATASET-CLEANING APPROACHES
These methods aim to eliminate noise from the training set and train classifiers with ''clean'' data. These approaches, also known as noise filters [8], are based on the idea that learning algorithms benefit from separating noise detection from model learning. A standard approach is to use one classification algorithm as the 'filter' in noise detection and another algorithm as the 'learner' in the learning phase. Because k-nearest neighbors (KNN) is highly sensitive to label noise, many KNN-based filters use KNN to detect noise [9]. Ensemble filters detect noise through the voting result of several different classifiers. Brodley and Friedl [10] showed that detecting noise by the voting results of different classifiers is more effective than relying on the noise detection result of a single classifier. Consequently, they proposed an ensemble filter based on C4.5, 1-NN, and a linear machine. Similarly, some researchers [11], [12] proposed iterative partition filters that iteratively remove detected noise. In these methods, N filters are learned on N groups of N − 1 partitions and vote to detect noise on the entire dataset. García-Gil et al. [13] proposed a homogeneous ensemble method (HME) and a heterogeneous ensemble method (HTE) to deal with noise in big data. HME is based on a partition scheme and uses random forest as its single classifier. HTE, inspired by [10], uses random forest, a linear model, and KNN as base classifiers. The Consensus Voting Scheme (CVS) and the Majority Voting Scheme (MVS) are two classical ensemble settings [14]. The CVS setting is known to achieve high precision, but it tends to miss many noisy instances. In contrast, the MVS setting often yields high recall, but typically misclassifies numerous clean instances as noise. Samami et al. [15] proposed a mixed scheme, namely High Agreement Voting Filtering, to compensate for the drawbacks of the two settings. They removed the detected strong and semi-strong noise, and relabeled the weak noise instead of deleting it.

3) THRESHOLD-BASED FILTERING APPROACHES
This category of methods tries to measure the estimated noise level of examples. For example, Sluban et al. [16] used the random forest algorithm to identify noise. The agreement levels of the decision trees in the forest were used for noise detection: the more trees give the same prediction for an example, the more likely it is a clean example. Garcia et al. [17] decomposed multi-class problems into binary sub-problems, and a binary filter was used for each sub-problem to calculate the confidence of examples. The confidence levels predicted by the binary filters were combined and used to remove the examples whose confidence levels were higher than the threshold. Threshold-based filters are believed to allow the practitioner to control the removal level. Inspired by the above methods, Sáez et al. [8] proposed an iterative noise filter (INFFC) based on the combination of classifiers and noise sensitivity control. The method mainly exploits k-nearest-neighbor-based metric approaches with a variety of thresholds. Zerhari et al. [18] modified INFFC by using partitioning strategies and removing a portion of the good examples in each iteration. Sabzevari et al. [19] proposed a two-stage noise detection method based on the sampling rate of bootstrapping and the degree of disagreement between the individual ensemble members. After determining the optimal sampling rate and the optimal disagreement threshold, an ensemble was constructed afresh. Nematzadeh et al. [20] proposed an ensemble filter for detecting noisy instances. They computed the average Euclidean distance between the suspected noise detected by the ensemble filter and its K-nearest neighbors in the clean set, and then removed or relabeled the suspected noise based on the distance scores. Guan et al. [21] introduced feature selection to improve noise filtering performance. They adopted MVS to obtain a noise score for each instance, and then generated noise candidates based on the noise scores. The noise scores and noise candidates were used as the target values and the input instances, respectively, in the feature selection phase.
VOLUME 9, 2021
The first type of approach depends on particular modifications of the learning algorithm, and the modification varies from one algorithm to another. The second type often consists of sub-classifiers, which can be either trained on the same dataset using different algorithms or trained on different subsets of the training set using the same algorithm. As a result, the samples that the filter cannot distinguish are labeled as noise and removed from the dataset. In other words, the filter only removes the samples that do not match the classification patterns of the sub-classifiers. Problems may arise when samples are mislabeled as noise merely because they are hard for the filter to distinguish. As for the third type, a key feature is the removal level, which is set subjectively to select noisy samples. This type of method is similar to those using noise filters; therefore, some researchers also refer to it as a threshold-based filter.
To improve the performance of classifiers, different types of methods can often be used together in practical applications. For example, one may use a filter for cleaning in the preprocessing stage, and then use an adapted algorithm in the learning phase. Although noise filters can achieve good performance on some datasets, they also have deficiencies. In real-world scenarios, cleaning through filters may remove a large amount of normal data, resulting in severe information loss when building analytical models. For the same reason, filter-related methods do not perform well on bridge condition datasets.

B. BRIDGE CONDITION ASSESSMENT METHODS
Bridge condition assessment (BCA) has attracted considerable attention from both academia and industry. Various methods have been proposed in the past few decades, and they can be roughly divided into two categories.
Domain knowledge based approaches. This type of method evaluates bridge condition using criteria and indicators based on expert knowledge. For example, Mitsuru and Sinha [22] used a Delphi study to develop unified guidelines of bridge condition assessment for inspectors. However, their discussion was limited to the bridge deck, i.e., a single component of the bridge. Kushida and Miyamoto [23] proposed an effective method for correlating empirical knowledge with membership functions based on fuzzy theory to express expert knowledge of bridge condition assessment. Wang and Elhag [24] proposed an evidential reasoning (ER) method for bridge condition assessment to model uncertainties in human subjective assessment. Later, Dabous and Al-Khayyat [25] introduced a method combining Monte Carlo simulation with ER, which used pairwise comparison to determine the weights of bridge component condition ratings in the overall bridge condition. Alsharqawi et al. [26] introduced the extended quality function deployment theory into bridge condition assessment based on visual inspection and ground penetrating radar data. Björnsson et al. [27] considered Bayesian decision theory for bridge condition assessment decisions based on three kinds of domain knowledge, namely modelling sophistication, consideration of uncertainty, and knowledge content. In addition, the analytic hierarchy process [28] has also been widely used in bridge condition assessment. In this type of method, a bridge is divided into multiple components to evaluate its overall condition.
Soft computing based approaches. Soft computing technology mainly includes fuzzy logic, genetic algorithms, and artificial neural networks. These methods are able to process the input data directly without the need to build logical structures. Some researchers [29], [30] introduced artificial neural networks for bridge condition assessment using datasets selected from the National Bridge Inventory database. Liu and Zhang [29] considered three primary components as input features of a Convolutional Neural Network, while Nguyen and Dinh [30] used a Forward Neural Network with eight input features. Flintsch and Chen [31] summarized the applications of the three most common soft computing techniques in infrastructure management, namely artificial neural networks, simulation systems, and genetic algorithms. Liu et al. [32] used the fuzzy c-means clustering algorithm to evaluate the condition of bridge superstructures, with particle swarm optimization used to optimize the algorithm. Yusuf and Hamid [33] used an artificial neural network and multiple regression analysis, respectively, to model limited data. By comparing and analyzing the performance of the two models under various conditions, they concluded that the artificial neural network was more suitable for bridge condition assessment. Lu et al. [34] proposed an ordinal logistic regression algorithm for bridge condition assessment, which has the advantage of handling the ordinal nature of bridge condition ratings. Li and Burgueño [35] developed four models (i.e., multi-layer perceptron, support vector machine, supervised self-organizing machine, and radial basis function network) for assessing damage in bridge abutments. Martinez et al. [36] studied the performance of five types of predictive models in bridge condition assessment, and concluded that the decision tree outperformed the other models, including the linear regression model, KNN, and neural networks.
Compared to soft computing based approaches, domain knowledge based approaches generally require more expert knowledge, which is usually very expensive. Soft computing based approaches require enough bridge condition data to learn an adequate model. Although a large amount of bridge condition data has been accumulated over time, the existing datasets contain plenty of incorrect data owing to subjective factors and observation errors, which severely affects the performance of bridge condition assessment models.

III. A HEURISTIC APPROACH FOR DATA CLEANING
Generally, the structure of a bridge is described by its components, and the main structure includes a series of basic components, such as the main deck, hanging beam, and girder. Owing to defects and damage such as rust, cracks, and deformation, the condition of the components deteriorates over time. The rating of bridge condition can be determined from the severity of defects and damage. In this section, we introduce the features of bridge condition data, and then illustrate the general form of the heuristic approach proposed in this paper.

A. THE ELEMENT OF BRIDGE CONDITION DATA
An example used for bridge condition assessment has a number of attributes and a discrete label. The attributes include basic bridge information (e.g., year of completion, traffic per day, and whether the bridge is prestressed or not) and defects (e.g., rust, crack, and deformation). The label is the rating of bridge condition. For the k-th sample in the dataset, the relation between attributes and label can be formulated as equation (1):

$$R^k = F\left(X^k_1, \ldots, X^k_{M_1}, X^k_{M_1+1}, \ldots, X^k_{M_1+M_2}\right) \tag{1}$$

where $X^k_1, \ldots, X^k_{M_1}$ are defects, $X^k_{M_1+1}, \ldots, X^k_{M_1+M_2}$ are basic bridge information, and $R^k$ is the condition rating.
Symbolically, the bridge condition dataset can be organized as in Table 2. Without the basic bridge information attributes, the bridge condition data can be regarded as monotonic ordinal data. In conventional assessment, human inspectors divide the entire bridge into several main components, and combine the assessment results of the components to compute the rating. A main component of a bridge contains a number of minor structures, and human inspectors assess the main component based on all of its minor structures. The condition of each main component is categorized into five levels, each of which indicates a distinct level of defects and damage. However, discrete condition levels tend to cause wrong assessments. For instance, transverse cracks in the main beam that one inspector judges to be slight may not be judged slight by another.
It is difficult to acquire a clean test set because of the complexity of bridge condition assessment. Moreover, a major problem with state-of-the-art cleaning methods is that they always require a clean test set to evaluate their cleaning effectiveness. This makes them inapplicable to bridge condition assessment, since a clean test set cannot be acquired easily. If we applied noise filters to the test set, we would risk removing clean data just because it cannot be classified correctly by the classifiers. To clean label noise in the dataset, we propose a heuristic data cleaning (HDC) approach, which is model-agnostic.

B. HEURISTIC DATA CLEANING
HDC includes three main steps, which are detailed in subsections III-C, III-D, and III-E. Fig. 1 shows the scheme of our approach to filtering noisy data. First, the approach performs a pairwise comparison over the entire dataset to form a conflict-pairs group. Second, the approach counts the number of times each example appears in the conflict-pairs group and then eliminates noise iteratively based on these conflict frequencies. We stop the iteration when the cleaning ratio is at least 0.3 or the number of conflict examples is equal to zero. Finally, this work uses LR algorithms to evaluate HDC by comparing the performance of the model trained on the cleaned dataset with that of the model trained on the noisy data.
In the rest of the section, $D_1$ represents the initial dataset, $D_2$ represents the dataset at the beginning of the cleaning, and $D_3$ represents the new dataset after cleaning noisy data. The three main steps are summarized as follows.
• Constructing conflict-pairs. The dataset $D_1$ is preprocessed to obtain $D_2$ by deleting useless attributes. The examples in $D_2$ are compared pairwise, and two examples form a conflict-pair when they satisfy a given condition. All conflict-pairs form a conflict-pairs group.
• Filtering out noise examples. The number of times each example appears in the conflict-pairs group is counted, and noise is eliminated iteratively based on these conflict frequencies to obtain $D_3$.
• Training models. In this step, LR algorithms are used to generate an ordered set of bridge labels to strengthen the models' error tolerance. The LR models $M_1$ and $M_2$ are trained on the dataset $D_2$ and the new dataset $D_3$, respectively. The effectiveness of the cleaning method is evaluated by comparing the two models.
Note that the examples filtered out by HDC are merely suspected label noise. In other words, the identified noise may contain clean examples. In this paper, HDC is based on the following judgements and observations:
1) There are pairwise conflicts between bridge examples. According to simple domain knowledge, the more serious the defect level is, the higher the bridge condition rating is. However, some examples in the dataset have higher defect levels but lower bridge condition ratings. Such conflicts between examples are very common. As a matter of fact, there are 183,166 conflict-pairs in our dataset, which account for 0.59% of the total example-pairs.
2) The attributes of bridge defects are ordinal attributes. Specifically, a larger defect attribute value indicates a more serious bridge condition, all other attributes being equal.
3) The more frequently a bridge example conflicts with others, the more likely it is label noise. For instance, given a bridge example A that conflicts with n examples, the number of conflicts is recorded as n. The larger n is, the more likely A is label noise.
4) Label noise has a negative influence on the performance of models. Apart from decreasing the test accuracy of algorithms, label noise also prolongs the training time and increases model complexity. A model performs better when trained on a clean set.

C. CONSTRUCTING CONFLICT-PAIRS
We construct the conflict-pairs group G using the initial dataset $D_1$. First, preprocessing is conducted on $D_1$, including recovering missing values and removing useless columns. The missing values are mostly from basic bridge information attributes. The hot deck method is used to fill the missing basic information attributes with the corresponding values of the nearest bridge sample. The process is formulated as Formula (2), where $e_{i,j}$ is the j-th variable value of the i-th example, $e_{q,j}$ is the j-th variable value of the example with missing values, $i_0$ is the serial number of the nearest example, $M_1$ is the number of defect attributes, and $M_2$ is the number of basic bridge information attributes. As for the defect attributes, we simply fill the missing values with zeros so that the imputation does not propagate noise. Then, categorical attributes are deleted temporarily to obtain $D_2$. To traverse the entire dataset, HDC requires $N^2$ comparisons, where N is the size of the dataset. The concept of a conflict-pair is given in Definition 1: two examples A and B form a conflict-pair (A, B) if f(A, B) is true.
To better explain conflict-pairs, a positive example and a counterexample are given in Tables 3 and 4. To detect all conflict-pairs according to Definition 1, we traverse all example-pairs of the dataset. Note that the construction of the conflict-pairs group G ensures that no conflict-pair appears twice.
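The pairwise traversal can be sketched as follows. This is a minimal illustration, not the authors' implementation: the predicate `is_conflict` is an assumed reading of f(A, B) based on observation 1) (A is at least as bad as B on every defect attribute, yet A has a strictly lower rating), and the function names and toy data are hypothetical.

```python
from itertools import combinations


def is_conflict(a, b):
    """Assumed predicate f(A, B): A dominates B on defects but is rated lower.

    Each example is a (defects, rating) tuple, where defects holds ordinal
    defect levels and a higher rating means a more serious condition.
    """
    defects_a, rating_a = a
    defects_b, rating_b = b
    a_at_least_b = all(x >= y for x, y in zip(defects_a, defects_b))
    return a_at_least_b and rating_a < rating_b


def build_conflict_pairs(examples):
    """Return the group G as a set of index pairs (i, j) with i < j."""
    group = set()
    for i, j in combinations(range(len(examples)), 2):
        # Check both directions; storing (i, j) with i < j avoids duplicates.
        if is_conflict(examples[i], examples[j]) or is_conflict(examples[j], examples[i]):
            group.add((i, j))
    return group


# Hypothetical toy data: (defect levels, condition rating).
examples = [
    ((1, 0, 0), 1),
    ((2, 1, 1), 2),
    ((3, 2, 2), 1),  # worse defects than examples 1 and 3, but a lower rating
    ((2, 2, 1), 3),
]
conflict_pairs = build_conflict_pairs(examples)
```

Storing each pair once with the smaller index first is one simple way to guarantee that G contains no identical conflict-pair.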

D. FILTERING OUT NOISE EXAMPLES
In a conflict-pair (A, B), at least one of the examples has a wrong label according to Definition 1. Suppose that $N_k$ conflict-pairs contain the k-th example. The probability $P_k$ that the label of the k-th example is correct is formulated in terms of $p_{k,i}$, the probability that the label of the k-th example is correct when it belongs to the i-th conflict-pair. Note that $P_k$ decreases as $N_k$ increases. This illustrates that the more times an example appears in conflict-pairs, the more likely it is label noise.
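One form consistent with this description is a product over the conflict-pairs that contain the k-th example. The expression below is a hedged sketch under the assumption that the conflict-pairs are treated as independent; it is not necessarily the exact formula used by the authors.

```latex
P_k = \prod_{i=1}^{N_k} p_{k,i}, \qquad 0 \le p_{k,i} \le 1
```

Under this assumption, each additional conflict-pair multiplies $P_k$ by a factor of at most one, so $P_k$ cannot increase with $N_k$. For instance, if every $p_{k,i} = 1/2$, then $P_k = 2^{-N_k}$.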
The bottom of Fig. 1 illustrates the process of iteratively filtering out label noise. In the i-th iteration, we count the occurrence frequency $f_k$ of the k-th example in G and construct the dictionary $Dict = \{k : f_k\}$, $k = 1, 2, \ldots, N$. The examples in Dict are sorted in descending order of frequency $f_k$. If the stop condition is not satisfied, the top t% of examples are recorded as label noise and then eliminated from $D_2$. At the same time, the conflict-pairs in G that contain these examples are removed to update G. Iterations continue until the stop condition is triggered. The cleaned dataset $D_3$ is obtained after the iterations.
Note that the stop condition used in this paper is not unique. In the process of filtering noisy data, users can flexibly set stop conditions according to actual needs.
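The iterative procedure can be sketched as below. The function name, the parameter names `top_fraction` (standing in for t%) and `max_clean_ratio`, the choice to rank only examples that still appear in G, and the tie-breaking by index are all assumptions made for illustration.

```python
import math
from collections import Counter


def iterative_filter(n_examples, conflict_pairs, top_fraction=0.1, max_clean_ratio=0.3):
    """Iteratively flag the most conflicted examples as suspected label noise.

    Stops when no conflict-pair remains or when the share of removed
    examples reaches max_clean_ratio (checked before each iteration).
    """
    pairs = set(conflict_pairs)
    removed = set()
    while pairs and len(removed) / n_examples < max_clean_ratio:
        # Dict = {k: f_k}: how often each example occurs in the current G.
        freq = Counter(k for pair in pairs for k in pair)
        # Sort by descending frequency; ties are broken by index (an assumption).
        ranked = sorted(freq, key=lambda k: (-freq[k], k))
        n_drop = max(1, math.ceil(top_fraction * len(ranked)))
        batch = set(ranked[:n_drop])
        removed |= batch
        # Update G by discarding every pair that involves a removed example.
        pairs = {p for p in pairs if p[0] not in batch and p[1] not in batch}
    return sorted(removed)


# Hypothetical conflict-pairs group over 10 examples.
toy_pairs = {(1, 2), (2, 3), (2, 4), (5, 6)}
suspected_noise = iterative_filter(10, toy_pairs, top_fraction=0.1)
```

In this toy run, example 2 is removed first because it appears in three conflict-pairs; removing it also resolves the conflicts of examples 1, 3, and 4, which is why recounting frequencies in every iteration matters.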

E. MODEL TRAINING
Since this paper models bridge rating as an LR problem, this work mainly applies LR algorithms. LR algorithms can reveal the tendency of examples toward particular bridge condition ratings. They also increase the error tolerance of models. We discuss the effect of label ranking in detail in Section IV.
To illustrate the effectiveness of HDC, we use different algorithms and train them on $D_2$ and $D_3$, respectively. Stacking, an ensemble learning strategy, is used to learn the model. The stacking algorithm learns the data distribution from different data spaces through various base models, and then integrates the predictions of the models to obtain the final results. The first layer used in this paper adopts random forest (RF), k-nearest neighbors (KNN), and support vector machine (SVM). The second-layer meta-model adopts RF. We use the stacking algorithm to show that both single learning algorithms and ensemble learning algorithms can benefit from HDC. Furthermore, stacking is a kind of noise-robust algorithm. As discussed in Section II-A, different types of noise-handling methods can often be used together.
Because a clean test set is difficult to obtain, we cannot deliberately inject label noise into the data to evaluate the proposed cleaning method, as other papers [37] often do. Those works compare the detected suspicious noise with a standard set to evaluate the cleaning accuracy. Therefore, in this paper, we use the noisy dataset $D_2$ and the cleaned dataset $D_3$ to train models $M_1$ and $M_2$, respectively. The performances of $M_1$ and $M_2$ are compared to evaluate the effectiveness of HDC.

IV. LABEL RANKING FOR BRIDGE CONDITION ASSESSMENT
The bridge condition assessment (BCA) problem is generally regarded as a classification problem, where a rating is treated as a bridge label. However, bridge condition ratings are ordinal: the higher the rating of bridge condition is, the more serious the bridge defects are. The label ranking problem extends conventional classification and multi-label classification in the sense that it needs to predict a ranking of all class labels instead of only one or several class labels [38]. A ranking contains more label information than a single label or several unordered labels.

A. BRIDGE CONDITION ASSESSMENT MODEL
In real-world scenarios, predicting a single bridge condition rating is not enough to describe the bridge condition. Bridges with similar conditions may be assessed as different ratings because of a slight difference in attribute values. Meanwhile, noisy data can easily lead to erroneous predictions. The LR method predicts a set of ordered labels of a bridge example so that the above problem can be effectively solved.
For a sample whose bridge condition is close to a rating boundary, as long as the real label appears among the first N labels in the ordered label set (even if the first-ranked label is wrong), the prediction still benefits decision-making after the bridge condition rating process. The probability that the real label ranks in the top N labels is computed from the ranked predictions, where $M(X)_i$ is the i-th ranked label predicted by the ranker model M for the input instance X. Fig. 2 shows the architecture of label ranking for bridge condition assessment (LR-BCA). Specifically, it applies HDC to clean the bridge condition data, and then uses LR as the learning algorithm. To increase robustness to residual noisy data, an ensemble model is constructed. Given a single bridge example, an ordered set of labels is predicted.
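The top-N criterion described above can be estimated empirically as the fraction of examples whose real label appears among the first N ranked labels. The sketch below assumes each prediction is a list of labels ordered from most to least preferred; the function name and the toy rankings are hypothetical.

```python
def top_n_accuracy(rankings, true_labels, n):
    """Fraction of examples whose true label is among the first n ranked labels."""
    hits = sum(
        1 for ranking, label in zip(rankings, true_labels) if label in ranking[:n]
    )
    return hits / len(true_labels)


# Hypothetical predicted rankings over three ratings, and the real ratings.
rankings = [
    [2, 1, 3],
    [1, 2, 3],
    [3, 2, 1],
    [2, 3, 1],
]
true_labels = [1, 1, 3, 1]

top1 = top_n_accuracy(rankings, true_labels, 1)
top2 = top_n_accuracy(rankings, true_labels, 2)
```

The first example illustrates the point made in the text: its top-ranked label is wrong, but the real rating still appears in second place and therefore counts toward Top-2 accuracy.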

B. LABEL RANKING
Label ranking (LR) is a key task of preference learning, and its goal is to map instances to an ordered set of finite labels [38]. LR aims to learn a ''label ranker'' in the form of a mapping from the instance space χ to the label space. It can be seen as an extension of the traditional classification task. Unlike traditional classification, which maps instances in the instance space χ to one or more labels in the label space, LR maps instances to an ordered set of all class labels.
In LR, $\lambda_i \succ_x \lambda_j$ denotes that instance x prefers label $\lambda_i$ to $\lambda_j$, i.e., $\lambda_i$ precedes $\lambda_j$ in the ordering for instance x. This preference relationship is transitive and asymmetric. Therefore, a ranking can be considered a special preference relationship. Fig. 3 shows the differences between three types of tasks, where the instance space χ is $\{x_1, x_2, \ldots, x_n\}$ and the label space is $\{\lambda_1, \lambda_2, \ldots, \lambda_c\}$. For example, assume that all bridges can be described in an instance space with three attributes: completion time, daily traffic volume, and maximum pressure. Similarly, bridge condition ratings can be described in a label space with four labels $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$. For a bridge instance x = (2006, 76321, 15) in the instance space (years, vehicles, tons), its preference over the ratings may be $\lambda_2 \succ_x \lambda_1 \succ_x \lambda_3 \succ_x \lambda_4$. Similarly, the other bridge instances have their own preferences (i.e., a set of well-ordered labels). By training the model with different attribute values and preferences, a mapping from instances to bridge labels can be obtained.

C. SOLVE BRIDGE CONDITION ASSESSMENT AS A LABEL RANKING TASK
Given the ranking ($\lambda_2 \succ_x \lambda_1 \succ_x \lambda_3 \succ_x \lambda_4$), it is easy to see that the condition rating of the bridge instance x lies between $\lambda_1$ and $\lambda_2$ and is closer to $\lambda_2$. For example, assume that the scores of $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are 1, 2, 3, and 4, respectively, and that the real score of the bridge condition is 1.7. If the bridge condition rating is only predicted as $\lambda_2$, the information that the bridge score is close to 1.7 cannot be obtained. Meanwhile, if the true rating of the bridge condition is $\lambda_1$, then even if the optimal label is predicted to be $\lambda_2$, the sub-optimal label is still the correct rating $\lambda_1$.
There is an order relation $\lambda_3 > \lambda_2 > \lambda_1$ among the three bridge ratings. Table 5 shows the transformation from a single label to a set of ordered labels. A bridge instance $x_i$ with rating $\lambda_2$ prefers $\lambda_2$ to both $\lambda_1$ and $\lambda_3$, but its preference between $\lambda_1$ and $\lambda_3$ cannot be determined. Owing to the natural order relation between bridge ratings, the preferences of instances whose labels are $\lambda_1$ or $\lambda_3$ are fully determined, as shown in Table 5. Label ranking has been widely studied in the literature [39]. Existing LR algorithms can be classified into three categories: reduction approaches, probabilistic approaches, and tree-based approaches [40]. The reduction approaches convert the LR problem into multiple simple binary classification problems. The probabilistic approaches calculate the probability of a label belonging to the instance to determine an order of all class labels. The tree-based approaches use a decision tree as the base algorithm. Considering the characteristics of our bridge condition assessment task and the Ockham's razor principle, we use a reduction approach in this work.
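The transformation from a single ordinal rating to pairwise preferences can be sketched as follows. The distance-based rule (the label closer to the observed rating is preferred, and equidistant labels are left undetermined) is an assumption chosen because it reproduces the three-rating behaviour described above; the function name is hypothetical and ratings are encoded as integers 1, 2, 3.

```python
from itertools import combinations


def preferences_from_rating(rating, labels):
    """Return the set of (preferred, other) label pairs implied by an ordinal rating.

    Pairs whose two labels are equally far from the rating are omitted,
    because the preference between them cannot be determined.
    """
    prefs = set()
    for a, b in combinations(labels, 2):
        dist_a, dist_b = abs(a - rating), abs(b - rating)
        if dist_a < dist_b:
            prefs.add((a, b))
        elif dist_b < dist_a:
            prefs.add((b, a))
    return prefs


labels = (1, 2, 3)
prefs_low = preferences_from_rating(1, labels)   # rating lambda_1
prefs_mid = preferences_from_rating(2, labels)   # rating lambda_2
prefs_high = preferences_from_rating(3, labels)  # rating lambda_3
```

For the middle rating only two preference pairs are produced, which mirrors the statement that the preference between the lowest and highest ratings is undetermined.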
We model the BCA problem as a label ranking task and solve it with a well-known pairwise comparison ranking algorithm. Table 6 shows the preferences of each instance, where each preference is split into three pairs of compared labels; $x_5$ is the instance to be predicted. Fig. 4 shows the pairwise comparison ranking algorithm applied to the bridge condition assessment task. First, three classifiers are trained separately, each using the examples that carry the corresponding preference. Second, the prediction results of the three classifiers are combined to form an ordered label set. The first label of each ordered label set is taken as the rating of the bridge instance.
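The two steps above can be illustrated with a minimal, self-contained sketch of ranking by pairwise comparison. Several parts are assumptions made for illustration: the nearest-centroid learner is a stand-in for the scikit-learn classifiers used in the experiments, votes are combined by simple counting with ties broken by label index, and the one-dimensional toy data and all function names are hypothetical.

```python
from itertools import combinations


def pairwise_preferences(rating, labels):
    """(preferred, other) pairs implied by one ordinal rating; ties give no pair."""
    prefs = set()
    for a, b in combinations(labels, 2):
        if abs(a - rating) < abs(b - rating):
            prefs.add((a, b))
        elif abs(b - rating) < abs(a - rating):
            prefs.add((b, a))
    return prefs


class NearestCentroid:
    """Tiny stand-in binary classifier (not the learner used in the paper)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def sq_dist(c):
            return sum((p - q) ** 2 for p, q in zip(x, c))
        return min(self.centroids, key=lambda lab: sq_dist(self.centroids[lab]))


def fit_pairwise_ranker(X, ratings, labels):
    """Step 1: train one binary model per label pair on the examples
    that express a preference for that pair."""
    models = {}
    for a, b in combinations(labels, 2):
        Xs, ys = [], []
        for x, rating in zip(X, ratings):
            prefs = pairwise_preferences(rating, labels)
            if (a, b) in prefs:
                Xs.append(x)
                ys.append(a)
            elif (b, a) in prefs:
                Xs.append(x)
                ys.append(b)
        if len(set(ys)) == 2:  # need both outcomes to train a binary model
            models[(a, b)] = NearestCentroid().fit(Xs, ys)
    return models


def rank_labels(models, x, labels):
    """Step 2: combine the pairwise predictions by vote counting."""
    votes = {label: 0 for label in labels}
    for model in models.values():
        votes[model.predict_one(x)] += 1
    return sorted(labels, key=lambda label: (-votes[label], label))


# Toy data: one feature, three ratings (hypothetical values).
labels = [1, 2, 3]
X = [[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]]
ratings = [1, 1, 2, 2, 3, 3]
models = fit_pairwise_ranker(X, ratings, labels)
print(rank_labels(models, [0.9], labels))
```

The first element of the returned list plays the role of the predicted rating, and the remaining elements give the fallback order.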

V. EXPERIMENTAL STUDIES
This section details the experimental studies carried out to investigate the performance of HDC and LR-BCA.

A. DATASET AND EXPERIMENTAL SETTINGS
The BCA data is provided by several highway bridge inspection agencies in China (e.g., Jiangsu Huatong Engineering Testing Co., Ltd. and Jiangsu Modern Road and Bridge Co., Ltd.). It is composed of condition data from sixteen components, such as upperparts, supports, piers, abutments, expansion joints, and drainage systems. The whole bridge condition dataset consists of 7,870 examples with 422 attributes and 3 labels.
Our proposed methods are implemented in Python 3.6.5 (32-bit) running on a server with an Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz. We use scikit-learn 0.20 in our experiments. The algorithms and main parameter settings are shown in Table 7; for the other scikit-learn parameters, default values are used. In the following experiments, all results are obtained from five runs of six-fold cross-validation on the dataset. For simplicity, we report the average of the results. The widely used $F_1$-score is used to evaluate the performance of the algorithms. Both the source code and the dataset will be made available on our website.
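The evaluation protocol can be sketched in plain Python as follows. This is an illustrative sketch rather than the experimental code: it assumes unstratified random folds (whether the experiments stratify by label is not stated), and the helper names `repeated_kfold` and `f1_score` are hypothetical stand-ins for the corresponding scikit-learn utilities.

```python
import random


def repeated_kfold(n_examples, n_splits=6, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) for n_repeats runs of n_splits-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_examples))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_splits folds of near-equal size.
        folds = [indices[k::n_splits] for k in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


def f1_score(y_true, y_pred, positive):
    """F1-score of one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


splits = list(repeated_kfold(12))
print(len(splits))  # 5 runs x 6 folds
print(f1_score([1, 1, 2, 2], [1, 2, 2, 2], positive=1))
```

Averaging the per-split scores over all 30 splits gives the single figure reported per configuration.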

B. COMPUTATIONAL RESULTS ON HDC
To evaluate the performance of our HDC method, we have performed two groups of experiments. The goal of these experiments is to answer the following questions: (1) How does the $F_1$-score vary as the cleaning ratio increases? (2) Does the HDC method still provide an advantage when it is applied only to the training set or only to the test set?

1) INFLUENCE OF CLEANING RATIO ON HDC
The cleaning ratio cr is an important parameter of HDC that indicates the proportion of removed examples. This experiment investigates the $F_1$-score of the models at various levels of cr in the cleaning phase. We evaluate the performance of HDC with four algorithms: RF, SVM, KNN, and Stacking. To demonstrate that the increase in $F_1$-score is not caused by over-fitting to the majority class, we report the $F_1$-score of each label. Since only a few examples carry the label $\lambda_3$, we only consider the cases of distinguishing examples with label $\lambda_1$ and with label $\lambda_2$ from the rest. Fig. 5 shows the computational results (in terms of $F_1$-score) obtained with the four classification algorithms. For every algorithm, HDC improves the performance at every cr level relative to no data cleaning. Furthermore, as can be observed from Fig. 5, increasing the cr level raises the $F_1$-scores of both labels.

FIGURE 5.
Comparative performance of classification algorithms with various cleaning ratios. Sub-figures (a) and (b) present the experimental results for distinguishing label $\lambda_1$ from the other labels and label $\lambda_2$ from the other labels, respectively.

This indicates that HDC does not over-fit to any label. We conclude from these results that HDC improves the performance of the models compared with no data cleaning.

2) EFFECTIVENESS OF HDC
The purpose of this group of experiments is to check whether HDC remains effective when it is applied only to the training set or only to the testing set. We conduct two experiments (denoted exp1 and exp2) to evaluate the effectiveness of HDC. In exp1, we apply HDC only to the test set; in exp2, we apply it only to the training set. Since both the training set and the testing set are noisy, leaving the testing set uncleaned means that incorrectly labelled examples influence the evaluation results on the test set. In a real-world scenario, HDC would be applied to labelled data to generate a clean training set, and the ratings of unlabelled data would be predicted by the trained models.
As illustrated in Fig. 6 (a), compared with the testing accuracy obtained without applying HDC to any dataset, the testing accuracy in exp1 is significantly improved, by up to 5.77%, 3.99%, 8.29%, and 6.15% for the four algorithms, respectively. SVM is a noise-sensitive algorithm [6], and its predictive accuracy shows the largest improvement, 8.29%. The improvement in testing accuracy in exp1 implies that, in spite of the noisy training dataset, the learning algorithms can learn the spatial distribution characteristics of clean data to a certain extent because of their noise tolerance. However, compared with cleaning the entire dataset, cleaning only the test set has a limited impact on the test accuracy. Cleaning only the training set harms the testing accuracy of the learning algorithms, as shown in Fig. 6 (b); it appears to be counter-productive to eliminate noise from the training set alone. Zhu and Wu [42] conducted experiments on sixteen benchmark datasets and reported that testing accuracy decreased on four of them when noise was eliminated only from the training set. The evidence from Fig. 6 (b) may suggest that the classifiers learn only the spatial distribution of the (clean) training data while the testing set remains noisy. The classifiers can identify the clean data in the test set accurately, but the noisy data in the test set reduce their measured performance. Moreover, HDC inevitably eliminates a small amount of clean data from the training set, which indirectly affects the performance of the models.
In brief, HDC is effective when applied only to the test set, whereas applying it only to the training set decreases the testing accuracy.

C. COMPUTATIONAL RESULTS ON LR-BCA
This section presents two groups of experiments to evaluate the effectiveness of the HDC method in the LR task and the benefit of the LR method for BCA.

1) COMPARISONS BETWEEN OUR LR METHODS UNDER VARIOUS BASELINE CLASSIFIERS
To evaluate the effectiveness of HDC in the LR task, we compare the performance under three base classifiers (RF, KNN, and SVM) and their Stacking ensemble. Table 8 reports the testing accuracy of the algorithms (Stacking, RF, KNN, SVM) in distinguishing the preference of examples between $\lambda_i$ and $\lambda_j$ at different cr levels (0%, 5%, 10%, 15%, 20%, 25%, and 30%) in the context of the LR task. It is worth noting that only a small number of examples carry the label $\lambda_3$, so the results related to $\lambda_3$ may not be conclusive. As illustrated in Table 8, the performance of the classifiers increases as the cr level increases. Among the base classifiers, RF achieves the best performance at all cleaning ratios, and Stacking, which combines the advantages of the three base classifiers, achieves the best performance overall. With the application of the HDC method, all binary sub-classifiers improve, and the final ranking accuracy increases as well.

2) BENEFIT OF LR METHOD
To demonstrate the benefit of LR, we present the Top-N testing accuracy of LR at various cr levels. Table 9 shows the Top-N testing accuracy of the label ranker under different levels of cr; the Top-2 testing accuracy is as high as 99%. For a bridge example that is hard to assign to one of two labels, LR indicates the rank of its preference. Given the existence of noisy data and the fuzzy boundary between two ratings, the first-ranked label of an example may be wrongly predicted, but the second-ranked label is then likely to be the true condition assessment label. With LR, the error tolerance of the bridge condition assessment model is therefore strengthened.
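Top-N accuracy counts a prediction as correct when the true rating appears among the first N labels of the predicted ranking. A minimal sketch follows; the function name `top_n_accuracy` and the toy rankings are hypothetical and are not taken from Table 9.

```python
def top_n_accuracy(rankings, true_labels, n):
    """Fraction of examples whose true label is within the first n ranked labels."""
    hits = sum(true in ranking[:n] for ranking, true in zip(rankings, true_labels))
    return hits / len(true_labels)


# Toy predicted rankings over three ratings, with the true rating of each example.
rankings = [[2, 1, 3], [1, 2, 3], [3, 2, 1], [2, 3, 1]]
true_labels = [1, 1, 3, 1]

for n in (1, 2, 3):
    print(n, top_n_accuracy(rankings, true_labels, n))
```

In the first toy example the top-ranked label is wrong but the second-ranked label is correct, which is the situation in which Top-2 accuracy exceeds Top-1 accuracy.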

VI. CONCLUSION AND FUTURE WORK
Bridge condition assessment (BCA) is an important part of bridge management systems and intelligent transportation systems. In this paper, we propose an effective label ranking approach for bridge condition assessment (LR-BCA). In addition, a heuristic data cleaning (HDC) approach is developed for filtering BCA data. Extensive experimental results on a real-world dataset confirm the effectiveness of the proposed LR-BCA and HDC methods. Since HDC does not use any classification algorithm as a filter, it can be combined with existing noise filters and noise-tolerant algorithms. By modeling the bridge condition assessment problem as a label ranking task, the bridge condition ratings can be predicted effectively, and the Top-2 accuracy reaches 99%. For future work, a potential research direction is to verify the effectiveness of HDC in cleaning noisy datasets in other domains.
KAI WANG was born in Huangshi, Hubei, China, in 1997. He received the B.S. degree in mathematics and applied mathematics from the East China University of Science and Technology, Shanghai, China, in 2019, where he is currently pursuing the master's degree with the School of Information Science and Engineering.
His main research interests include machine learning and data mining. He is a Student Member of China Computer Federation (CCF).
TONG RUAN was born in 1973. She received the B.S. and master's degrees from the East China University of Science and Technology, Shanghai, China, and the Ph.D. degree from the Institute of Software, Chinese Academy of Sciences.
She is currently a Professor and a Ph.D. Supervisor with the East China University of Science and Technology. Her main research interests include knowledge graph, data mining, and data quality assessment. She is a member of China Computer Federation (CCF).

He is currently an Associate Professor with the College of Civil and Transportation Engineering, Hohai University. His main research interests include bridge engineering and concrete bridge structure.