Unsupervised Discrepancy-Based Domain Adaptation Network to Detect Rail Joint Condition

Damage to maglev rail joints, which connect adjacent rail segments, threatens the safety and comfort of railway systems. Machine learning methods have been used in combination with online monitoring data to assess the health conditions of maglev rail joints. However, most of the existing methods rely on the data collected in controlled scenarios, such as those involving constant train operation speeds. Given the diversity of operational conditions, a model learned from one known case (source domain) cannot be directly applied to the case of interest (target domain). Therefore, this article proposes a domain adaptation (DA) approach to diagnose the health conditions of maglev rail joints in complex operational conditions. The DA is unsupervised because the source and target domains are characterized by labeled and unlabeled samples, respectively. DA is implemented by integrating the sample moments with different orders into the transfer loss of a neural network. By minimizing the transfer loss, the domain shift caused by the difference in the operational conditions can be reduced, and the knowledge of features learned from the neural network is transferred from the source domain to the target domain. The proposed approach is validated over a dataset of time–frequency spectrograms (TFSs) derived from the experimental acceleration data of maglev rail joints in two operation modes: stable passing and braking. The proposed approach can successfully identify the conditions of the maglev rail joints, i.e., bolt-looseness-caused rail step, misalignment-caused lateral dislocation, and normal condition, even when the operation mode of the maglev train changes.


I. INTRODUCTION
MAGLEV is a noncontact transportation system with the advantages of low noise and low friction. In such systems, suspension and guidance are realized through the electromagnetic force provided by U-shaped magnets and F-type rails [1]. The fluctuation between the electromagnet and the F-type rails should be confined within 2-3 mm to ensure the stability of maglev trains, and such a small tolerance imposes strict requirements on the condition of the F-type rail. However, the F-type rail is prone to deformation due to temperature changes, foundation settlements, and force actions. Temperature changes play the most influential role among those causes. A growing temperature difference in the maglev guideway leads to a large temperature gradient and causes a significant increase in deformation. Hence, a seam, known as the rail joint, is included in the design to allow for the deformation. Maglev rail joints are typically used to connect F-type rails to satisfy the control requirement of the suspension gap between the electromagnet and rail and to enable slight movement between two adjacent F-type rails due to temperature-induced expansion and contraction [2], [3]. Notably, maglev rail joints are prone to structural damage because of environmental changes, train excitation, and installation errors [4]. This structural damage typically manifests as bolt-looseness-caused rail step and misalignment-caused dislocation, which may lead to rail irregularity and decreased electromagnetic force, respectively [2], [5]. According to the experimental results on several current maglev lines, the rail step and lateral dislocation often occur at maglev rail joints in practice. Such damage scenarios may lead to rough suspension gap fluctuations, suspension control failure, and even a sudden clash between the electromagnet and the rail. Large impacts on the rail are generated by repeated suspension gap fluctuations [6], causing maglev rail joints to become the weakest part of the maglev rail and reducing the ride comfort of maglev trains [7]. Moreover, the large impacts and dynamic suspension forces acting on the maglev rail joints aggravate structural deterioration [6], [8]; thus, bolts get loose, rail ends become battered, and cracks develop in the F-type rail.

The authors are with the Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, Hong Kong, China, and also with the National Rail Transit Electrification and Automation Engineering Technology Research Center (Hong Kong Branch), Hong Kong, China (e-mail: gao-feng.jiang@connect.polyu.hk; may.sm.wang@polyu.edu.hk; ceyqni@polyu.edu.hk; wqiang.liu@polyu.edu.hk). Digital Object Identifier 10.1109/TIM.2023.3316221
To avoid such scenarios, the condition of maglev rail joints is typically inspected visually. Such manual observations may be unreliable, intrusive, and unsafe [8]. Therefore, to maintain the safe operation of maglev systems, intelligent techniques for rail joint monitoring must be developed. Intelligent rail joint monitoring has already been applied successfully in practice. For example, the axle box acceleration data measured from the rail vehicle are used to monitor the conditions of rail joints [8], [9]. In [8] and [9], the wavelet transform algorithm is employed to extract the characteristics of rail joint damage for rail joint monitoring. Chang et al. [10] measured the dynamic response of rail to identify the misalignment-caused damage at rail joints. The abovementioned studies have proposed effective methods to investigate the damage mechanisms and recognize rail joint damage. However, the damage detection methods in [8], [9], and [10] are realized through a track inspection wagon, which still relies on labor to collect massive data. Besides, these methods require manual feature extraction from data for damage detection. Thus, there is still room for developing convenient and cost-effective rail joint damage detection and classification methods.
Deep learning (DL) algorithms can be applied for the damage detection of rail joints as they can automatically extract discriminative features from a massive amount of data. Among DL algorithms, convolutional neural networks (CNNs) are an effective feature extractor. Surface defect detection based on CNNs is a popular topic. Different surface defects can be classified through camera images [11], [12]. However, the damage to a rail joint is not always visible. Vibration-based CNNs become an alternative by extracting features from numerous raw vibration signals [13]. In the field of structural health monitoring, CNNs and vibration signals have been used to inspect structural damage [14], [15], [16], [17], [18]. For example, a CNN model trained with time-series data of bridge acceleration responses from a set of shake table tests was used to identify and quantify four types of concrete bridge damage [16]. In addition to directly using the data in the time domain, the implicit information in the frequency domain can be used for structural damage detection. Duan et al. [17] trained a CNN model by using the Fourier amplitude spectra of acceleration responses to detect the damage of a tied-arch bridge. Using a CNN model built with the time-frequency spectrogram (TFS) of acceleration responses, Wang et al. [18] detected multiple damages to maglev rail joints. Notably, in these studies, the training and testing data are independent and identically distributed (i.i.d.). In other words, the existing approaches ignore discrepancies in the distribution and have been validated for only a given data distribution. In the real world, the training data are often acquired from specific cases, whereas the testing data might be collected under various operational and environmental conditions. Consequently, the i.i.d. hypothesis fails, creating a domain shift between the training and testing data [19]. This problem can be overcome by collecting new labeled data and building an updated model. However, these processes are time-consuming and impractical for most industrial scenarios [20].
Recently, transfer learning (TL) has emerged as a promising approach to solving the domain shift issue. TL can help enhance the model performance by allowing the model to learn the knowledge from previous tasks and apply this knowledge to new and similar tasks. As a type of TL, domain adaptation (DA) realizes knowledge transfer by reweighting the samples in model training or identifying a shared space to match the inconsistent data distribution [21]. In recent years, DA has been widely applied for structural damage detection, such as the damage of multistory buildings, with the source and target domains corresponding to numerical and experimental data, respectively [20], [22]. In addition, many researchers have used DA for damage detection considering changes in the operational state of structural components, such as in the fault diagnosis of rolling bearings [23] and power plant thermal systems [24]. In railway engineering, DA has been used for the damage detection of rail vehicles. Yu et al. [25] used conditional adversarial DA to predict the faults of a gearbox and shaft at different running speeds. Qin et al. [26] developed a stepwise adaptive CNN to classify the faults of a high-speed train bogie with a continuously varying vehicle speed. Chen et al. [27] established a semisupervised adversarial DA to assess the condition of high-speed train wheels under different surrounding environments.
In practice, maglev lines also operate under different working conditions. Variations in the maglev train speeds and running status (e.g., stable passing, suspension, and braking) affect the structural response of maglev rail joints. Consequently, a large domain shift [29] exists in the vibration signals collected from one maglev rail joint under different operation modes. The performance of a model trained on data from one operation mode may deteriorate when it is applied to another mode because the external excitation differs between modes. This motivates the use of DA for maglev rail joint damage detection. However, regarding rail joint damage detection, only a few studies have used DL algorithms [18], [28], and none has used TL algorithms such as DA. This may be because the common research objects using DA for damage detection are structural components such as bearings and gears. The cross-domain features of bearing and gear damage usually appear continuously at certain frequencies, whereas the cross-domain features of rail joint damage are hard to capture since they appear only briefly and usually at nonfixed and high frequencies. In this study, an unsupervised discrepancy-based DA network (UDDAN) is proposed to detect the maglev rail joint damage condition considering the actual operation modes. The data, i.e., vibration signals from maglev rail joints, are often collected at a high sampling frequency, which makes the data voluminous. Meanwhile, the data for damaged cases are much fewer than for normal cases. In addition, labeling is time-consuming and requires appropriate observations, especially for the damaged data. Unlike supervised algorithms, unsupervised algorithms eliminate the need for labeling. Therefore, the UDDAN is proposed to classify the maglev rail joint condition even if the target labels are unavailable.
The UDDAN implements the following steps. First, the acceleration responses of two types of damaged maglev rail joints and an undamaged maglev rail joint are collected from a monitoring system installed on the maglev test line in Shanghai, China. The acceleration responses are processed into samples reflecting the time-frequency features of maglev rail joints in two operation modes (stable passing and braking). Subsequently, a series of samples are input to the UDDAN model, which derives domain-invariant time-frequency features of maglev rail joints in the two operation modes. The adaptation layer is placed at the top of the model to make the data distribution of one mode (source domain) resemble that of the other mode (target domain), thereby minimizing the domain discrepancy caused by the operation mode. The classification layer allows the model to detect the condition of maglev rail joints in both modes after sufficient alignment of the data distribution. Existing DA algorithms have been developed and verified through benchmark datasets covering a wide range of categories, e.g., Office-31 and Office-Home, whereas UDDAN is trained on a dataset of TFSs. This dataset is much smaller in variety and has less cross-category difference than the benchmark datasets. Therefore, the existing algorithms may fail to extract discriminative features between TFSs. In contrast, UDDAN performs feature extraction in a different way from those existing algorithms. Hence, UDDAN is tailored for the maglev rail damage detection problem.
The remainder of this article is organized as follows. Section II presents an overview of DA and describes the discrepancy-based DA framework and different types of data distribution. Section III introduces the UDDAN architecture. Section IV describes the experiment conducted on maglev rail joints to obtain the dataset. Section V describes the different methods with different discrepancies and data distributions for the UDDAN and discusses the results. Section VI presents the conclusion.

A. Domain Adaptation
In machine learning methods, a domain ($\mathcal{D}$) consists of a feature space $\mathcal{X}$ with a marginal distribution $P(X)$. Samples in the feature space satisfy $X = \{x_1, x_2, \ldots, x_n\} \in \mathcal{X}$. Each domain has one task ($\mathcal{T}$) aimed at learning the conditional distribution $P(Y|X)$ (also known as the predictive function). $\mathcal{Y}$ is a label space in which the samples satisfy $Y = \{y_1, y_2, \ldots, y_n\} \in \mathcal{Y}$. Two domains are considered in machine learning: the source domain ($\mathcal{D}_s = \{\mathcal{X}_s, P(X_s)\}$) with task ($\mathcal{T}_s$) and the target domain ($\mathcal{D}_t = \{\mathcal{X}_t, P(X_t)\}$) with task ($\mathcal{T}_t$). Conventional machine learning methods learn the same task ($\mathcal{T}_s = \mathcal{T}_t$) over identical domains ($\mathcal{D}_s = \mathcal{D}_t$) using labeled data $(x_i, y_i)$. Therefore, the performance of conventional machine learning methods deteriorates when task variation ($\mathcal{T}_s \neq \mathcal{T}_t$) and/or domain shift ($\mathcal{D}_s \neq \mathcal{D}_t$) occurs. This problem can be solved by applying DA and using the knowledge in $\mathcal{D}_s$ and $\mathcal{D}_t$ [30].
DA assumes that the task in the source domain is the same as that in the target domain ($\mathcal{T}_s = \mathcal{T}_t$), but the two domains are different ($\mathcal{D}_s \neq \mathcal{D}_t$). DA can be divided into two types depending on the domain divergence [21]: homogeneous DA has an identical feature space ($\mathcal{X}_s = \mathcal{X}_t$) but different data distributions ($P(X_s) \neq P(X_t)$), and heterogeneous DA has nonequivalent feature spaces ($\mathcal{X}_s \neq \mathcal{X}_t$). In addition, DA can be categorized as supervised, semisupervised, or unsupervised DA based on whether the data in task $\mathcal{T}_t$ are fully labeled, partially labeled, or unlabeled, respectively. Fig. 1 shows how DA addresses domain shift. The feature space consists of the data marked as points with various shapes according to different categories. If the model is trained using the data in the source domain, misclassification may occur in the target domain. To resolve the problem, DA aims to map the features from the source and target domains to a shared feature space and build a model with a low generalization error in the target domain.

B. Discrepancy Alignment of Data Distribution
To accomplish DA, domain-invariant feature representations must be learned to improve the task performance. Discrepancy-based DA learns such representations by minimizing the discrepancy between data distributions in the source and target domains. Generally, in a neural network considering DA, the discrepancy between data distributions is used as the transfer loss [31]. The discrepancy can be measured through the maximum mean discrepancy (MMD) [32] or correlation alignment (CORAL) [33].
To determine the discrepancy of data distributions in the source and target domains, the MMD maps the data in the two domains into a reproducing kernel Hilbert space (RKHS). According to the Riesz representation theorem and the unit ball property of the RKHS, the empirical estimate of the MMD is

$$d_{\mathrm{MMD}}(X_s, X_t) = \left\| \frac{1}{N_s} \sum_{i=1}^{N_s} \varphi(x_i^s) - \frac{1}{N_t} \sum_{i=1}^{N_t} \varphi(x_i^t) \right\|_F^2$$

where $x_i^s$ and $x_i^t$ represent the $i$th samples in the source and target domains, respectively; $N_s$ and $N_t$ are the numbers of samples in the source and target domains, respectively; $\varphi$ is the kernel-induced function that maps data into the Hilbert space; and $\|\cdot\|_F^2$ is the squared Frobenius norm, which measures the squared Euclidean distance between the mapped sample means of the two domains.
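As a minimal sketch of the mean-matching idea behind the MMD, the following Python example computes the squared distance between the source and target sample means. It assumes an identity feature map (a linear kernel), which is a simplification made here for illustration; the names `mean_vector` and `linear_mmd` and the toy data are hypothetical and not taken from the article.

```python
def mean_vector(samples):
    """Return the per-dimension mean of a list of equal-length feature vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def linear_mmd(source, target):
    """Squared Euclidean distance between source and target sample means.

    Assumption: the mapping into the Hilbert space is the identity, so this
    is only the linear-kernel special case of the MMD.
    """
    mu_s = mean_vector(source)
    mu_t = mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Toy two-dimensional features for each domain (illustrative values only).
source = [[0.0, 1.0], [2.0, 3.0]]
target = [[1.0, 1.0], [3.0, 5.0]]
print(linear_mmd(source, target))
```

With a nonlinear kernel, the same quantity would be evaluated through kernel evaluations between sample pairs rather than through explicit means.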
Unlike the MMD, which aligns the discrepancy only with sample means, CORAL can exploit richer statistical information and align the discrepancy with the sample mean and covariance values. The covariance measures the joint variability of samples. Assume that there exist $N$ samples in observation $X = (X_1, X_2, \ldots, X_N)^T$, and each sample is represented as a $K$-dimensional vector $X_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,K}\}$. The sample covariance between the $j$th and $k$th variables is

$$\mathrm{cov}_{j,k} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{i,j} - \bar{x}_j)(x_{i,k} - \bar{x}_k)$$

where $\bar{x}_j$ and $\bar{x}_k$ are the sample means of the $j$th and $k$th variables, respectively. Note that $\mathrm{cov}_{j,k}$ is equal to the sample variance when $j = k$. $\mathrm{COV} = [\mathrm{cov}_{j,k}]$ is the sample covariance matrix of size $K \times K$. Hence, the transfer loss of CORAL is defined by two covariance matrices derived from samples in the source and target domains [33]

$$d_{\mathrm{CORAL}} = \frac{1}{4K^2} \left\| \mathrm{COV}_s - \mathrm{COV}_t \right\|_F^2$$

where $\mathrm{COV}_s$ and $\mathrm{COV}_t$ are the covariance matrices of samples in the source and target domains, respectively. The sample mean and covariance values are two types of sample moments for identifying the data distribution.
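The covariance-based discrepancy can be illustrated with a short Python sketch. It computes the sample covariance matrix of each domain and the squared Frobenius distance between the two matrices. The $1/(4K^2)$ scaling follows the common Deep CORAL convention and is an assumption here, as are the function names and the toy data.

```python
def covariance_matrix(samples):
    """Sample covariance matrix (K x K) of N samples with K variables each."""
    n = len(samples)
    k = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(k)]
    return [
        [
            sum((row[j] - means[j]) * (row[m] - means[m]) for row in samples) / (n - 1)
            for m in range(k)
        ]
        for j in range(k)
    ]


def coral_loss(source, target):
    """Squared Frobenius distance between covariance matrices, scaled by 1/(4K^2).

    Assumption: the 1/(4K^2) factor is the usual Deep CORAL normalization.
    """
    k = len(source[0])
    cov_s = covariance_matrix(source)
    cov_t = covariance_matrix(target)
    squared = sum(
        (cov_s[j][m] - cov_t[j][m]) ** 2 for j in range(k) for m in range(k)
    )
    return squared / (4 * k * k)


# Toy data: the two variables are positively correlated in the source
# and negatively correlated in the target.
source = [[0.0, 0.0], [2.0, 2.0]]
target = [[0.0, 0.0], [2.0, -2.0]]
print(coral_loss(source, target))
```

In this toy case the variances match and only the off-diagonal covariance differs, so the loss is driven entirely by the second-order structure that a mean-only measure would not see.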
Specifically, the sample mean and covariance values reflect the first- and second-order sample moments, respectively. Notably, the distribution of real-world data may be too complicated to be completely described using the first- or second-order sample moments [34]. In such cases, high-order statistics (HOS), i.e., the third-order or higher order sample moments that contain more discriminative information, can be used to estimate the data distribution [35]. Chen et al. [36] presented a universal representation of HOS as

$$d_p = \frac{1}{L^p} \left\| \frac{1}{N_s} \sum_{i=1}^{N_s} \theta(x_i^s)^{\otimes p} - \frac{1}{N_t} \sum_{i=1}^{N_t} \theta(x_i^t)^{\otimes p} \right\|_F^2$$

where $\theta(x_i) = \{\theta(x_{i,1}), \theta(x_{i,2}), \ldots, \theta(x_{i,L})\}$ represents one $L$-dimensional feature from the $i$th sample, the superscript $\otimes p$ denotes the $p$th power tensor product, and $\frac{1}{N} \sum_{i=1}^{N} \theta(x_i)^{\otimes p}$ is the $p$th-order moment calculated by $N$ samples.
The MMD and CORAL can be considered as special cases of the HOS formulation with $p = 1$ and $p = 2$, respectively. By setting $p = 3$ in the HOS formula, Cheng et al. [37] proposed the tricovariance (TriCOV) measure to align the discrepancy between third-order sample moments. Notably, with increasing power in the sample moment, the computing complexity increases exponentially, leading to inaccuracies in the estimated data distribution unless the sample scale is large [38]. Therefore, in this study, only cases with $p \leq 3$ are considered due to the limited computing resources and small scale of the samples.
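To make the growth in cost concrete, the sketch below counts the entries of a $p$th-order moment tensor built from an $L$-dimensional feature, which is $L^p$. The value $L = 64$ is used only as an example, matching the 64-D bottleneck feature mentioned later in the article; the function name is hypothetical.

```python
def moment_entries(feature_dim, order):
    """Number of entries in the order-p moment tensor of an L-dimensional feature."""
    return feature_dim ** order


# Example with a 64-dimensional feature (illustrative choice).
for p in (1, 2, 3, 4):
    print(f"p = {p}: {moment_entries(64, p)} entries")
```

Going from $p = 3$ to $p = 4$ multiplies the tensor size by another factor of $L$, which is consistent with restricting the study to $p \leq 3$ when samples and computing resources are limited.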

C. Data Distribution in DA
In most DA methods, discrepancy alignment is based on the marginal distribution between the source and target domains, and the conditional distribution is assumed to be constant. However, discrepancies may occur in the conditional distribution and in the joint distribution, which is a combination of the marginal and conditional distributions. As shown in Fig. 2, the shape of the feature space influences the performance of discrepancy alignment. Different results are expected to be obtained depending on the type of distribution used in DA. However, the type of data distribution is difficult to determine because the feature space cannot be characterized directly. An appropriate data distribution must be assumed to narrow the discrepancy between the source and target domains. In this study, three types of data distribution for discrepancy alignment are considered: marginal distribution alignment (MDA), conditional distribution alignment (CDA), and joint distribution alignment (JDA). MDA is aimed at narrowing the discrepancy in the data distributions in two domains ($P(X_s)$ and $P(X_t)$). The optimization objective of MDA is

$$\min \; d\big(P(X_s), P(X_t)\big)$$

where $d(\cdot)$ is an arbitrary domain discrepancy.
CDA is aimed at decreasing the discrepancy between the distributions of same-category data in two domains ($P(Y_s|X_s)$ and $P(Y_t|X_t)$). The goal of CDA can be formulated as follows [39]:

$$\min \; d\big(P(X_s|Y_s), P(X_t|Y_t)\big)$$

where $P(X_s|Y_s)$ and $P(X_t|Y_t)$ are the sufficient statistics of distributions $P(Y_s|X_s)$ and $P(Y_t|X_t)$, respectively [39]. As the target label $Y_t$ is unknown, a pseudo label [39], obtained by testing the target data in a classifier trained using labeled source data, substitutes the target label. Based on the CDA concept, MMD can be modified to measure the discrepancy between the category-conditional distributions

$$d_{\mathrm{CMMD}} = \sum_{c=1}^{C} \left\| \frac{1}{N_{s(c)}} \sum_{i:\, y_i^s = c} \varphi(x_i^s) - \frac{1}{N_{t(c)}} \sum_{i:\, \hat{y}_i^t = c} \varphi(x_i^t) \right\|_F^2$$

where each category $c \in \{1, \ldots, C\}$; $N_{s(c)}$ and $N_{t(c)}$ are the numbers of samples with the same label category $c$ in the source and target domains, respectively; $y_i^s$ is the ground-truth label of the $i$th sample in the source domain; and $\hat{y}_i^t$ is the pseudo label of the $i$th sample in the target domain.
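The category-conditional discrepancy can be sketched in Python as a sum of per-category mean differences, using ground-truth labels for the source and pseudo labels for the target. The example assumes an identity feature map and skips categories that are absent from either domain; both choices, along with the function names and toy data, are assumptions made for illustration.

```python
def mean_vector(samples):
    """Return the per-dimension mean of a list of equal-length feature vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def conditional_mmd(source, source_labels, target, pseudo_labels, categories):
    """Sum over categories of the squared distance between per-category means.

    Assumptions: identity feature map (linear kernel); a category missing
    from either domain contributes nothing to the sum.
    """
    total = 0.0
    for c in categories:
        src_c = [x for x, y in zip(source, source_labels) if y == c]
        tgt_c = [x for x, y in zip(target, pseudo_labels) if y == c]
        if not src_c or not tgt_c:
            continue  # cannot compare an empty category
        mu_s = mean_vector(src_c)
        mu_t = mean_vector(tgt_c)
        total += sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))
    return total


# Toy one-dimensional features; target labels are pseudo labels
# that would come from a source-trained classifier.
source = [[0.0], [2.0], [10.0]]
source_labels = [0, 0, 1]
target = [[1.0], [3.0], [12.0]]
pseudo_labels = [0, 0, 1]
print(conditional_mmd(source, source_labels, target, pseudo_labels, [0, 1]))
```

Because the target side depends on pseudo labels, the value of this discrepancy changes whenever the classifier's predictions on the target data change during training.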
JDA integrates the advantages of MDA and CDA by simultaneously minimizing the domain discrepancy in both marginal and conditional distributions [39]. To apply the discrepancy as the transfer loss in a neural network, a deep transfer network aligning the discrepancy between joint distributions has been developed based on the MMD criterion [40]

$$d_{\mathrm{JDA}} = \xi_1 \, d_{\mathrm{MMD}}(X_s, X_t) + \xi_2 \sum_{c=1}^{C} d_{\mathrm{MMD}}\big(X_s^{(c)}, X_t^{(c)}\big)$$

where $X_s^{(c)}$ and $X_t^{(c)}$ denote the category-$c$ samples in the source and target domains, and $\xi_1$ and $\xi_2$ are adjustable terms for the marginal and conditional distributions. This concept has been used to develop several deep transfer networks with JDA [41], [42].

A. Architecture of UDDAN
This article proposes the UDDAN model for maglev rail joint condition detection. As shown in Fig. 3, the UDDAN consists of several backbone layers for extracting the discriminative features of every maglev rail joint condition, an adaptation layer for learning the cross-domain invariant features, and a classification layer for evaluating the maglev rail joint conditions.
The architecture of the backbone layers is based on ResNet18 [43], consisting of five convolution layers labeled Conv1 to Conv5. Conv1 contains one convolution calculation. The other convolution layers each contain two residual blocks with two convolution calculations. The skip connection is set in the residual block to avoid the decrease in accuracy as the network deepens. Batch normalization and a rectified linear unit are added in each convolution calculation to promote convergence in model training. In the Conv1 layer, the max-pooling layer, and the first convolution calculations of Conv3, Conv4, and Conv5, the stride is set as 2 to decrease the width and height of the feature maps by half. Thus, at the end of the convolution layers, the width and height are 1/32 of the original dimension. A 7 × 7 filter is used in the Conv1 layer, and 3 × 3 filters are used in the max-pooling layer and other convolution layers. The number of filters increases gradually with the deepening of the feature map. Therefore, 64, 128, 256, and 512 filters are applied in the Conv2, Conv3, Conv4, and Conv5 layers, respectively. Consequently, a 512-D feature map is condensed into a 64-D vector in global average pooling, which is named the bottleneck feature as it is located at the bottleneck position in the model [44].
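The 1/32 reduction can be checked with a short Python sketch that traces the spatial size through the five stride-2 stages. The 224-pixel input and the padding values (3 for the 7 × 7 filter, 1 for the 3 × 3 filters) are assumptions taken from the standard ResNet18 configuration, since the article does not state them here.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling stage."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) for each stride-2 stage.
# Paddings are assumed values from the standard ResNet18 design.
STAGES = [
    ("Conv1", 7, 2, 3),
    ("MaxPool", 3, 2, 1),
    ("Conv3", 3, 2, 1),
    ("Conv4", 3, 2, 1),
    ("Conv5", 3, 2, 1),
]


def trace_sizes(input_size):
    """Return the feature-map size after each stride-2 stage."""
    sizes = []
    size = input_size
    for name, kernel, stride, padding in STAGES:
        size = conv_output_size(size, kernel, stride, padding)
        sizes.append((name, size))
    return sizes


# Hypothetical 224 x 224 input spectrogram image.
for name, size in trace_sizes(224):
    print(f"{name}: {size} x {size}")
```

Each of the five stages halves the width and height, so the final map is $1/2^5 = 1/32$ of the input size in each dimension.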
After the backbone layer calculations, the bottleneck features are exported to the classification and adaptation layers. As a fully connected (FC) layer, the classification layer nonlinearly maps the bottleneck feature to the predicted probability of each maglev rail joint condition. The length of the FC layer is equal to the number of considered maglev rail joint conditions. In the classification layer, the damage classification loss ($L_D$) is calculated to compare the prediction with the ground truth. The adaptation layer stores the bottleneck features from various domains to calculate the discrepancies between domains, which are measured by sample moments of various orders under different data distribution assumptions. In the adaptation layer, the discrepancy value is equal to the transfer loss ($L_T$).

B. Model Training
The model is trained based on alternating forward propagation of feature generation and backpropagation of loss calculation for updating the model parameters. In this study, the loss used in model training is $L_{\mathrm{TOTAL}}$, which is a combination of the damage classification loss ($L_D$) and transfer loss ($L_T$) [45]

$$L_{\mathrm{TOTAL}} = L_D + \lambda L_T$$

where $\lambda$ is the tradeoff term. The damage classification loss aims to match the labels between the prediction and the ground truth

$$L_D = \frac{1}{N_s} \sum_{i=1}^{N_s} J\big(f(x_i^s), y_i^s\big)$$

For the $i$th sample in the source domain, $y_i^s$ and $f(x_i^s)$ denote the probability of each category obtained from the ground truth and prediction, respectively. $f(\cdot)$ is the predictive function learned from the backbone layers. $J(\cdot, \cdot)$ is the cross-entropy loss function that is used to measure the difference between the true and predicted labels.
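A minimal Python sketch of the damage classification loss is given below, assuming one-hot ground-truth vectors and predicted probability vectors over the three joint conditions. Averaging over the source samples and clipping probabilities with a small `eps` are assumptions made for this illustration, and the function names are hypothetical.

```python
import math


def cross_entropy(true_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a ground-truth and a predicted probability vector."""
    return -sum(
        t * math.log(max(p, eps)) for t, p in zip(true_probs, predicted_probs)
    )


def classification_loss(labels, predictions):
    """Mean cross-entropy over labeled source samples (averaging is assumed)."""
    losses = [cross_entropy(y, f) for y, f in zip(labels, predictions)]
    return sum(losses) / len(losses)


# Two source samples, three categories (e.g., rail step, lateral dislocation, normal).
labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
predictions = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
print(classification_loss(labels, predictions))
```

Only labeled source samples enter this loss; the unlabeled target samples influence training through the transfer loss instead.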
The transfer loss aims to realize DA from the source to the target domain. To consider different data distributions, a universal paradigm of the transfer loss is defined as

$$L_T = \xi_1 \, d_M\big(P(X_s), P(X_t)\big) + \xi_2 \, d_C\big(P(X_s|Y_s), P(X_t|\hat{Y}_t)\big)$$

where $d_M(\cdot, \cdot)$ and $d_C(\cdot, \cdot)$ are the discrepancies between two marginal data distributions and conditional distributions, respectively, and $\xi_1$ and $\xi_2$ are the adjustable terms for the considered data distributions. The cases with only $\xi_2 = 0$ and with only $\xi_1 = 0$ correspond to MDA and CDA, respectively. The case pertains to JDA if $\xi_1 \neq 0$ and $\xi_2 \neq 0$. In this study, the loss backpropagation is based on the stochastic gradient descent optimizer, whose minibatch updates help the training procedure avoid becoming trapped at saddle points. Backpropagation is aimed at optimizing the three model parameters ($\theta_b$, $\theta_c$, and $\theta_a$) obtained from the backbone, classification, and adaptation layers

$$\theta_k \leftarrow \theta_k - \eta \frac{\partial L_{\mathrm{TOTAL}}}{\partial \theta_k}, \quad k \in \{b, c, a\}$$

where $\eta$ is the learning rate. Fig. 4 shows the training procedure. First, the model parameters, source label predictions, and bottleneck features are obtained. If CDA or JDA is applied, pseudo target labels are required. Both the damage classification loss and the transfer loss are optimized to update the model parameters and pseudo labels. After sufficiently optimizing the loss, a satisfactory model is obtained for directly testing the data from the target domain. Algorithm 1 summarizes the training steps.
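The role of the adjustable terms can be sketched in a few lines of Python: the weighted sum selects MDA, CDA, or JDA depending on which of $\xi_1$ and $\xi_2$ is nonzero. Combining the result with the classification loss as $L_D + \lambda L_T$ is an assumed form of the total loss, and all function names and numeric values are hypothetical.

```python
def transfer_loss(d_marginal, d_conditional, xi1, xi2):
    """Weighted sum of marginal and conditional discrepancies."""
    return xi1 * d_marginal + xi2 * d_conditional


def alignment_mode(xi1, xi2):
    """Name the alignment type implied by the two adjustable terms."""
    if xi1 != 0 and xi2 != 0:
        return "JDA"
    if xi1 != 0:
        return "MDA"
    if xi2 != 0:
        return "CDA"
    return "none"


def total_loss(classification, transfer, tradeoff):
    """Assumed combination: classification loss plus weighted transfer loss."""
    return classification + tradeoff * transfer


# Illustrative discrepancy values, not measured results.
l_t = transfer_loss(d_marginal=2.0, d_conditional=5.0, xi1=0.5, xi2=0.5)
print(alignment_mode(0.5, 0.5), l_t, total_loss(1.0, l_t, 0.1))
```

Setting one term to zero removes the corresponding discrepancy from the gradient entirely, so the same training loop can serve all three alignment types.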
To accelerate training, the learning rate of the model is typically set to a high value. However, the use of a high learning rate throughout the training process may lead to loss oscillation. Hence, in this study, the model learning rate is initialized with a high value and then gradually decreases as the number of training epochs increases. The learning rate $\eta_i$ at the $i$th epoch is determined by the epoch index $i$, the number of epochs $\varepsilon$, and the learning rate decay $\delta$.
The tradeoff term $\lambda_i$ in (18) at the $i$th epoch is determined by the weight term $w$. The magnitude of $L_D$ is stable, but that of $L_T$ changes sharply because different sample moments are used as discrepancies. To ensure the reliability of model

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

To accelerate convergence and reduce the loss oscillations in model learning, the momentum term $\mu$ is used [47]. The momentum is iterated as follows:
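As an illustration of how a momentum term smooths gradient descent, the Python sketch below uses the classical (heavy-ball) update, in which a velocity accumulates past gradients. This specific form, the function name, and the toy objective $f(\theta) = \theta^2$ are assumptions for illustration and may differ from the update used in the article.

```python
def momentum_step(params, grads, velocity, lr, mu):
    """One classical momentum update (assumed form).

    velocity <- mu * velocity - lr * grad
    params   <- params + velocity
    """
    new_velocity = [mu * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, velocity = [1.0], [0.0]
for _ in range(200):
    grads = [2.0 * t for t in theta]
    theta, velocity = momentum_step(theta, grads, velocity, lr=0.1, mu=0.9)
print(theta[0])
```

The velocity carries information from earlier steps, which damps step-to-step oscillation of the loss while still moving quickly along consistent gradient directions.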

IV. EXPERIMENTAL STUDY AND DATA COLLECTION

A. Condition Monitoring System
As shown in Fig. 5, the maglev guideway consists of a steel sleeper, an F-type rail, and a viaduct, and the maglev rail joints are located between adjacent F-type rails. Fig. 6 shows the JI-type maglev rail joint, which is the most commonly used type of maglev rail joint. This maglev rail joint has three main geometrical parameters: height of the rail step along the z-axis, length of the lateral dislocation along the x-axis, and width of the longitudinal gap along the y-axis. According to observations on commercial and testing maglev lines, JI-type maglev rail joints can be damaged because of bolt looseness and installation errors, which can lead to deviations in the height of the rail step or the length of the lateral dislocation, respectively. Consequently, the condition of JI-type maglev rail joints must be accurately monitored.
To verify the effectiveness of the proposed UDDAN for maglev rail joint condition detection, an experimental study is performed using the data collected from the condition monitoring system installed on the Shanghai Lin-Gang maglev test line. This line includes a straight segment, a curve segment, a slope segment, and a turnout, and the total length is 1.7 km. This study focuses on the damage detection of maglev rail joints on the straight segment. The straight segment is a multispan simply supported guideway that consists of a viaduct and several F-type rails, and the JI-type maglev rail joint is used to connect two F-type rails, as shown in Fig. 7.
A customized online monitoring system (see Fig. 8) is used to monitor the condition of maglev rail joints. This system consists of a set of piezoelectric (PZT) accelerometers with anti-electromagnetic interference (EMI) capability, a multiple-channel data acquisition unit (16-channel DEWESOFT-SIRIUS) for data collection, a portable computer for data storage, and a high-performance server for data processing. Ten PZT accelerometers are applied to monitor five maglev rail joints labeled J1-J5, covering a monitoring range of approximately 80 m. For each maglev rail joint, two accelerometers are mounted on the cantilevered side of the adjacent ends of two F-type rail sections to measure the vertical accelerations. All the maglev rail joints are considered to operate in the same weather condition. To avoid the EMI generated by the maglev system, the deployed sensors, signal cables, and data acquisition unit are insulated. Data are sampled at a frequency of 5000 Hz to ensure sufficient signal acquisition resolution to capture the high-frequency components resulting from the damage. The DEWESOFT-SIRIUS instrument can be triggered automatically to acquire and store data during the maglev train passage. A high-performance server with eight cores, 16 threads, and 64-GB memory is used to facilitate the multiple damage detection at various maglev rail joints. The maglev trains run at speeds of 20-60 and 10-20 km/h in the stable passing and braking modes on the test line, respectively.

B. Dataset
Within the experimental period from December 2020 to March 2021, two types of damage are observed, as shown in Fig. 9(a) and (b): a lateral dislocation of approximately 2 mm caused by the installation misalignment at J2, and a large rail step caused by the bolt looseness at J1 and J4. Joints J3 and J5 operate in damage-free conditions, as shown in Fig. 9(c). In other words, the data recorded from this experimental period cover three states of maglev rail joints. The maglev train with a length of 16 m runs on the rail line with two operation modes. To adequately record the rail response for each trial run, the data acquisition unit collects a recording every 10 s, and each recording is treated as one sample to be used in the training or testing of the UDDAN.
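The recording parameters can be put in perspective with a small arithmetic sketch in Python. It uses the 5000-Hz sampling frequency, 10-s recordings, 16-m train length, and 10-60 km/h speed range stated in this section. Treating the train as moving at constant speed past a single joint is a simplifying assumption, and the function names are hypothetical.

```python
SAMPLING_RATE_HZ = 5000   # sampling frequency of the accelerometers
RECORDING_SECONDS = 10    # duration of one recording (one sample)
TRAIN_LENGTH_M = 16       # length of the maglev train


def samples_per_recording(rate_hz, seconds):
    """Number of acceleration values in one recording."""
    return rate_hz * seconds


def passage_time_s(train_length_m, speed_kmh):
    """Time for the full train to pass one joint, assuming constant speed."""
    return train_length_m / (speed_kmh / 3.6)


print(samples_per_recording(SAMPLING_RATE_HZ, RECORDING_SECONDS))
print(SAMPLING_RATE_HZ / 2)  # highest resolvable frequency (Nyquist), in Hz
for speed in (10, 20, 60):
    print(speed, "km/h:", round(passage_time_s(TRAIN_LENGTH_M, speed), 2), "s")
```

Under this assumption, the train passage occupies only part of each 10-s recording even at the lowest speed, so the damage-related response appears as a short event within the sample.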
The extracted samples are preprocessed through signal analysis in both the time and frequency domains by using the time series and power spectral density (PSD), respectively. Figs. 10 and 11 show the results of time- and frequency-domain analyses for three maglev rail joint conditions in the two operation modes, respectively. Fig. 10 shows that, in both modes, the peak acceleration in the rail step is at least double that in the lateral dislocation and even 20 times that in the normal condition (no more than 10 m/s$^2$). Fig. 11 shows that the damaged maglev rail joints have higher PSD values in the two modes than the normal maglev rail joint. Overall, the vibration magnitude in the case of damage is larger than that in the normal condition. Moreover, the vibration is more severe in the case of rail step damage than in lateral dislocation. However, the maglev rail joint conditions cannot always be evaluated using the vibration magnitude from only the time- or frequency-domain analysis. Specifically, although the joint conditions may be clearly detected in the stable passing mode, it might be difficult to detect damage in the braking mode, due to the small difference in the PSD values between the normal and damage conditions, as shown in Fig. 11(b). In addition, the varying running speeds and train weights may affect the peak accelerations and vibration frequencies at a maglev rail joint. Hence, the instantaneous features of a maglev rail joint must be simultaneously extracted in the time and frequency domains. As the TFSs consist of feature information in both time and frequency domains, TFSs are derived in this study to effectively extract the time-frequency features from the collected data [46]. Wang et al. [18] highlighted that the discriminative features among the three conditions of maglev rail joints can be explicitly derived using the TFSs.
Fig. 12 shows the TFSs obtained from the data associated with the maglev rail joints in the three conditions and two modes. The discriminative features in the three conditions shift when the train moves from the stable passing mode to the braking mode. Moreover, the discriminative features in the braking mode are not as explicit as those in the stable passing mode. For example, TFSs for the normal condition [see Fig. 12(d)] and lateral dislocation [see Fig. 12(e)] are similar, and the classifier may not be able to easily extract and distinguish the discriminative features between these conditions. In other words, a classifier trained over the stable passing (braking) mode may fail when applied to data from the braking (stable passing) mode.
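As an illustration of how a time-frequency spectrogram can be derived from one acceleration recording, the following minimal sketch computes a Hann-windowed short-time Fourier transform with NumPy. It is only an assumption that an STFT is used here; this section does not state which transform produces the TFSs in [46]. The sampling rate, frame length, hop size, and the synthetic signal are all hypothetical stand-ins for the measured data.

```python
import numpy as np

def spectrogram(signal, fs, frame_len=256, hop=128):
    """Return (freqs, times, power) of a Hann-windowed short-time Fourier transform."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the record into overlapping frames and taper each one.
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_freqs)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + frame_len / 2) / fs  # frame centres in s
    return freqs, times, power.T                           # rows: frequency, columns: time

# Synthetic stand-in for a 10 s acceleration record (hypothetical values):
# a steady 50 Hz component plus a short 200 Hz burst around t = 5 s.
fs = 1000.0
t = np.arange(10000) / fs
burst = 3.0 * np.exp(-((t - 5.0) / 0.05) ** 2) * np.sin(2 * np.pi * 200 * t)
accel = np.sin(2 * np.pi * 50 * t) + burst
freqs, times, tfs = spectrogram(accel, fs)
```

In this toy record the burst is invisible in a long-term PSD average but appears as a localized high-frequency patch in `tfs`, which is the kind of instantaneous feature the paragraph above argues for.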
Assuming that each domain represents an operation mode, the UDDAN can be used to extract the seemingly similar discriminative features between the two modes. The considered problem is a typical homogeneous DA problem, given that: 1) the source and target tasks are the same, i.e., to identify the three conditions of maglev rail joints; 2) the feature space in the source and target domains is the same, i.e., both domains contain the TFSs extracted from maglev rail joints; and 3) the data, or samples, are distributed inconsistently in the source and target domains, i.e., the TFSs are collected from different operation modes.
Tables I-III present the numbers of samples (TFSs) input to the UDDAN. The two datasets consisting of the data collected from the stable passing mode are labeled $S_1$ and $S_2$. One dataset consisting of the data collected from the braking mode is labeled $B_1$. Three tasks are designed: task A, $S_1 \rightarrow B_1$; task B, $B_1 \rightarrow S_1$; and task C, $S_2 \rightarrow B_1$, where $S \rightarrow B$ denotes the transfer from stable passing to braking mode and $B \rightarrow S$ denotes the transfer from braking to stable passing mode. The designed tasks cover two potential scenarios typically encountered in real applications.
Scenario 1: The feasibility of using the different operation modes as the source domain is discussed because labeled maglev rail joint data may be available for only a certain operation mode. Among the two considered operation modes, one is set as the source domain and the other is set as the target domain. Therefore, this scenario involves tasks A (stable passing to braking) and B (braking to stable passing).
Scenario 2: The feasibility of setting different numbers of samples between the source and target domains is discussed because the data scale of the maglev rail joint may differ across different operation modes. To vary the number of samples, a portion of data is randomly extracted from dataset $S_1$ to form dataset $S_2$, while dataset $B_1$ remains unchanged. This scenario involves the comparison of tasks A and C, which have a large and small number of samples, respectively.
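The task design in the two scenarios can be sketched in a few lines. The snippet below is only illustrative: the sample identifiers, dataset sizes, sampling fraction, and random seed are hypothetical placeholders, since the actual counts are given in Tables I-III and the extraction ratio is not stated in this section.

```python
import random

def subsample(samples, fraction, seed=0):
    """Randomly draw a fixed fraction of the samples without replacement."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    k = max(1, int(len(samples) * fraction))
    return rng.sample(samples, k)

# Hypothetical sample identifiers; the real datasets contain TFS images.
S1 = [f"S1_tfs_{i:03d}" for i in range(200)]   # stable passing mode
B1 = [f"B1_tfs_{i:03d}" for i in range(150)]   # braking mode
S2 = subsample(S1, fraction=0.5)               # smaller stable-passing dataset

# Each task is a (source, target) pair; target labels are not used in training.
tasks = {
    "A": (S1, B1),   # Scenario 1: stable passing -> braking
    "B": (B1, S1),   # Scenario 1: braking -> stable passing
    "C": (S2, B1),   # Scenario 2: reduced source set -> braking
}
```

Keeping $B_1$ identical in tasks A and C means that any accuracy difference between them can be attributed to the source sample count alone.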

C. Procedure of Maglev Rail Joint Detection
Fig. 13 presents a flowchart of the procedure for maglev rail joint detection with the proposed UDDAN. First, the condition monitoring system (described in Section IV-A) is installed to record the acceleration data at the location of maglev rail joints when the maglev train operates under the stable passing mode and braking mode. Then, the collected data are transmitted to the portable computer and divided into different segments according to the given operation mode. The flowchart takes as an example the case in which the stable passing mode is the source domain and the braking mode is the target domain. Using the raw acceleration data, the TFSs from both the source and target domains are generated (as shown in Section IV-B). The dataset of TFSs is used to establish a UDDAN model. The model performance is evaluated by inputting target data and comparing the predicted labels with the true labels. Finally, the well-trained model is saved and employed for the classification of three categories of maglev rail joints.

A. Comparison Methods
To illustrate the superiority of using discrepancy-based DA and evaluate the effect of combining different data distribution assumptions and domain discrepancies on the model performance of maglev rail joint damage detection, the following seven methods are used for comparison. In the domain-adaptation-free (DAF) method, no information across domains is provided, and thus, the model is trained only from the source domain and then tested directly over the data from the target domain. In contrast, the other six methods adopt DA and obtain the cross-domain information by calculating the domain discrepancies as the transfer loss. The two data distribution assumptions are MDA and JDA. The source and target domains are aligned by three types of discrepancies: MMD, CORAL, and TriCOV.
As the most important criterion in evaluating the model performance, the classification accuracy associated with the classification layer is determined as the percentage of the number of correctly predicted samples over the total samples in the target domain
$$\text{Accuracy} = \frac{1}{N_t}\sum_{i=1}^{N_t} \operatorname{sign}\left(f(x_i^t) = y_i^t\right) \times 100\%$$
where $N_t$ is the number of samples in the target domain; sign(·) is the indicator function, which is equal to 1 and 0 if the condition is true and false, respectively; and $f(x_i^t)$ and $y_i^t$ are the predicted and true labels for the $i$th sample in the target domain, respectively. The classification results are obtained using the model corresponding to the epoch with the smallest training loss.
To verify the performance of the DA method, the domain discrepancy in the adaptation layer is quantified. Ben-David et al. [48] proposed the proxy-A-distance (PAD) to measure the similarity between the feature representations of samples from the source and target domains in DA problems. The samples obtained from the bottleneck features are used to calculate the PAD, as these features are typically used for discrepancy alignment in the adaptation layer. The PAD is calculated as
$$\text{PAD} = 2(1 - 2\epsilon)$$
where $\epsilon$ is the generalization error of the classifier tested on the merged samples from the source and target domains. A smaller PAD corresponds to a larger generalization error, which is attributable to less distinguishable samples from the source and target domains. Hence, a smaller PAD indicates more similar feature representations between the source and target domains, that is, higher domain proximity. The PAD is calculated using a binary classifier that is based on a linear support vector machine.
Table V and Fig. 14 present the classification accuracies for the three conditions of the maglev rail joint. The classification accuracies of the seven methods for the three tasks are all higher than 83%, which shows that the proposed model can effectively classify the different conditions of maglev rail joints. Moreover, the accuracies of the six methods based on DA are higher than those of the DAF model, which shows that the incorporation of DA enhances the model performance.
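The two evaluation measures used in this section, the target-domain classification accuracy and the PAD, reduce to simple arithmetic once the predictions and the domain-classifier error are available. The sketch below assumes the standard PAD form of Ben-David et al., $2(1 - 2\epsilon)$, and takes the generalization error $\epsilon$ as an input rather than training the linear support vector machine; the function names are hypothetical.

```python
def accuracy_percent(predicted, true):
    """Percentage of target-domain samples whose predicted label equals the true label."""
    if len(predicted) != len(true) or len(true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, y in zip(predicted, true) if p == y)
    return 100.0 * correct / len(true)

def proxy_a_distance(generalization_error):
    """PAD = 2 * (1 - 2 * eps), where eps is the error of a source-vs-target classifier."""
    return 2.0 * (1.0 - 2.0 * generalization_error)

# A domain classifier at chance level (eps = 0.5) cannot tell the domains apart,
# whereas a perfect one (eps = 0) means the domains are fully separable.
pad_indistinguishable = proxy_a_distance(0.5)
pad_separable = proxy_a_distance(0.0)
```

The two limiting cases make the interpretation concrete: the PAD shrinks toward 0 as the bottleneck features of the two domains become harder to separate.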
Notably, the accuracies of M-MMD in the three tasks are similar to those of DAF, which indicates that MMD in MDA does not significantly enhance the model performance. However, the accuracies of J-MMD are considerably higher than those of the DAF. These results show that the conditional distribution must be considered when using MMD as the discrepancy.
The accuracies of M-CORAL and J-CORAL are the highest among the six methods (higher than 94% for tasks A and B and approximately 90% for task C), which indicates that the consideration of CORAL in MDA and JDA can increase the classification accuracy. The accuracies of M-TriCOV and J-TriCOV for the three tasks are lower than those of M-CORAL and J-CORAL. In other words, the use of TriCOV may decrease the accuracy. These results demonstrate that second-order sample moments may be optimal for setting the discrepancy.
A further increase in the order can degrade the model performance, potentially because the alignment of high-order sample moments involves a complicated calculation, which may lead to overfitting in model training and decrease the classification accuracies. Moreover, the accuracies are nearly identical for M-CORAL and J-CORAL, and for M-TriCOV and J-TriCOV. This similarity indicates that the consideration of conditional distribution does not affect the classification accuracy when a high-order sample moment is used.
The classification accuracies of DAF, M-MMD, M-TriCOV, and J-TriCOV in task A are higher than those in task B, while the classification accuracies of J-MMD, M-CORAL, and J-CORAL in task A are close to those in task B. In other words, the domain shift affects the classification accuracies, but this influence can be eliminated using appropriate DA methods. The accuracies of all seven methods for task C are lower than those in task A, which indicates that the model performance deteriorates when the scale of the samples decreases.
The PADs of the seven methods for the three tasks are presented in Table VI and Fig. 15. The PADs associated with the DA methods are smaller than that from the DAF for all three tasks, except the PAD of M-MMD for task C. This finding demonstrates that discrepancy alignment reduces the difference between the source and target domains. In addition, the PAD decreases with the increase in the order of sample moments (MMD > CORAL > TriCOV), with an especially significant decrease from the first-order sample moment (MMD) to the second-order sample moment (CORAL). In other words, the consideration of higher order sample moments enables the closer alignment of the source and target domains. Therefore, for each task, the smallest PAD is obtained by the methods using the third-order sample moment (TriCOV) as the domain discrepancy.
For a given order of sample moments, the PADs for the three tasks are similar in MDA and JDA when using CORAL and TriCOV as the domain discrepancy. However, the PADs for tasks B and C are smaller in JDA than MDA when using MMD as the domain discrepancy. In other words, the distribution assumption affects the extent of domain proximity only when using a low-order sample moment. Hence, the use of high-order sample moments can compensate for the lack of alignment between the source and target domains caused by the distribution assumption.
Overall, J-MMD, M-CORAL, and J-CORAL exhibit the highest classification accuracies, and methods considering CORAL and TriCOV can increase the domain proximity (i.e., yield smaller PADs). If MMD is used as the discrepancy, the joint distribution must be assumed to avoid the decreased domain proximity. If TriCOV is used as the discrepancy, the classification accuracies are not satisfactory. Therefore, CORAL is preferred as the discrepancy in further analysis.
The convergence during model training is evaluated considering the numerical variations in the classification accuracy and transfer loss with training epochs. Fig. 16 shows the change in classification accuracy of the DAF and two recommended DA methods (M-CORAL and J-CORAL), and Fig. 17 shows the change in transfer loss.
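The classification accuracy (see Fig. 16) and transfer loss (see Fig. 17) converge after training for 50 epochs. In other words, the convergence of model training can be obtained within 50 epochs. In the first five epochs, the classification accuracy in all three tasks increases to approximately 80% for the three methods, and the transfer loss for all DA methods also increases. In other words, the model training focuses on classification instead of adaptation in the first five epochs. As training continues, the classification accuracy does not increase significantly for task A and fluctuates for tasks B and C when the DAF is used. In comparison, the classification accuracy significantly and continuously increases when the DA methods are used. In addition, the transfer loss no longer increases after the first five epochs. In other words, the application of DA gradually enhances the model performance as the number of training epochs increases.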

C. Visualization of Model Performance: Sample Clustering Based on Bottleneck Features
To intuitively observe the model performance, the TFS samples are clustered using the features contained in the samples. However, these features are too high-dimensional to be observed directly. Using a nonlinear dimensionality reduction technique, t-distributed stochastic neighbor embedding (t-SNE) [49], the high-dimensional features can be visualized in a low-dimensional space. As shown in Fig. 18, the data distributions of the two domains have relatively clear boundaries between the three categories, and their feature space is similar but is not strictly the same. This implies the necessity of considering not only the marginal distribution but also the joint distribution. The features are distributed in a disorderly manner and cannot be clustered in either the stable passing mode [see Fig. 18(a)] or the braking mode [see Fig. 18(b)]. In contrast, as shown in Figs. 19-21, the distribution of features is relatively orderly, and the samples can be effectively classified. In other words, the discriminative features in different categories can be learned through the training of backbone layers.

D. Evidence of Discrepancy Alignment: Data Distribution Comparison Between Source and Target Domain
To demonstrate the distribution shift between the source and target domains when using DA methods, the data distributions of the bottleneck features from the source and target domains are drawn. Because the bottleneck feature in the proposed model is a vector with 64 elements, it is difficult to draw the data distribution of all elements. Therefore, three elements, (1)-(3), are randomly selected as representative examples. The bottleneck features are extracted through the models trained with DAF and with the two recommended DA methods, i.e., M-CORAL and J-CORAL.
Histograms, as approximate representations of the data distribution, are derived for the bottleneck features in tasks A-C, as shown in Figs. 22-24, respectively. The x-axis shows the value of the element, and the y-axis shows the probability density. The element values are normalized for probability estimation. The x-axis ranges from 0 to 1 after normalization, and the histogram contains ten equal bins. Therefore, the width of each bin is 0.1. The area of each bin in the histogram reflects the probability of occurrence for elements in a specific interval, with the width of the interval being equivalent to the bin width.
The data distributions of the source and target domains exhibit several similarities. Except for the DAF in task B [see Fig. 23(a)], the data in the source and target domains exhibit positively skewed distributions. The data distributions obtained from the DAF and two DA methods are different. In general, a larger intersection area between the two histograms indicates higher data distribution consistency and lower discrepancy between the source and target domains. As shown in Fig. 22, the elements obtained from the DAF have the smallest intersection area among the three methods in task A. In comparison, the histograms for the two DA methods largely overlap. Similar phenomena are observed in the results for tasks B (see Fig. 23) and C (see Fig. 24). These results demonstrate that DA methods can promote the discrepancy alignment of data distributions between the source and target domains.
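The histogram comparison described above can be made quantitative with a short NumPy sketch: values of one bottleneck element are min-max normalized to [0, 1], binned into ten equal bins as probability densities, and the intersection area of the source and target histograms is summed. Normalizing both domains with a shared minimum and maximum is an assumption of this sketch, as is the function naming; the section states only that the element values are normalized.

```python
import numpy as np

def density_histogram(values, lo, hi, bins=10):
    """Min-max normalise to [0, 1] and return the density per bin and the bin edges."""
    scaled = (np.asarray(values, dtype=float) - lo) / (hi - lo)
    density, edges = np.histogram(scaled, bins=bins, range=(0.0, 1.0), density=True)
    return density, edges

def overlap_area(source_values, target_values, bins=10):
    """Intersection area of the two density histograms (1 = identical, 0 = disjoint)."""
    both = np.concatenate([source_values, target_values])
    lo, hi = both.min(), both.max()        # shared range (assumption of this sketch)
    if hi == lo:
        raise ValueError("all values are identical; the range is degenerate")
    ps, edges = density_histogram(source_values, lo, hi, bins)
    pt, _ = density_histogram(target_values, lo, hi, bins)
    width = edges[1] - edges[0]            # 0.1 for ten bins on [0, 1]
    return float(np.sum(np.minimum(ps, pt)) * width)
```

Because each density histogram integrates to 1, the overlap lies between 0 and 1, so a value close to 1 corresponds to the well-aligned case described for the two DA methods.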
For task A, the data distributions between the source and target domains are similar when M-CORAL [see Fig. 22(b)] and J-CORAL [see Fig. 22(c)] are used. In other words, the discrepancy between the source and target domains is similar in the cases of marginal and joint distribution assumptions. This finding also extends to tasks B (see Fig. 23) and C (see Fig. 24). Hence, the discrepancy alignment of the data distribution may not be related to the assumption of data distribution.
The necessity of selecting an appropriate sample moment can be examined through DAF [see Figs. 22-24(a)], which shows the unaligned data distribution. For DAF, the histogram is flat, and the element values cover a broad range, which means that the effect of covariance cannot be ignored in the distribution alignment. Compared to MMD, which considers only the means of the data, CORAL considers both the means and covariances of the data, and thus, CORAL performs better than MMD. The data distributions after using CORAL show that CORAL has aligned the data almost completely, which means that the alignment with a higher order moment, e.g., TriCOV, may not bring a remarkable improvement. Overall, this verifies that CORAL is the best choice for this study.
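The difference between a first-order and a second-order discrepancy can be illustrated numerically. The sketch below uses two commonly cited forms, a mean-difference (linear-kernel MMD) term and the Deep CORAL covariance term scaled by $1/(4d^2)$; these are assumptions for illustration and are not necessarily the exact expressions of the transfer loss defined earlier in the article. TriCOV is omitted because its definition does not appear in this section, and the feature arrays are synthetic.

```python
import numpy as np

def mean_discrepancy(xs, xt):
    """Squared Euclidean distance between the feature means (first-order moment)."""
    delta = xs.mean(axis=0) - xt.mean(axis=0)
    return float(delta @ delta)

def coral_discrepancy(xs, xt):
    """Squared Frobenius distance between feature covariances, scaled by 1/(4 d^2)."""
    d = xs.shape[1]
    cs = np.cov(xs, rowvar=False)   # np.cov centres the data, so means are removed
    ct = np.cov(xt, rowvar=False)
    return float(np.sum((cs - ct) ** 2) / (4.0 * d * d))

# Synthetic 64-element bottleneck features (hypothetical data).
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(500, 64))
centre = source.mean(axis=0)
target = (source - centre) * 3.0 + centre   # same mean, three times the spread

first_order = mean_discrepancy(source, target)
second_order = coral_discrepancy(source, target)
```

Here the two domains share the same mean, so the first-order term is essentially zero and gives no training signal, while the covariance term remains clearly positive. This mirrors the flat, broad DAF histograms for which the spread, not the mean, carries the mismatch.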

E. Comparative Study With Other Deep DA Neural Networks
Currently, several neural networks have been developed that perform well in solving the domain shift issue. To verify the accuracy of the proposed UDDAN, a comprehensive comparison between the UDDAN and other state-of-the-art deep neural networks is conducted. Five kinds of deep neural networks are adopted for discussion, including the deep adaptation network (DAN) [50], DeepCoral [33], the domain adversarial neural network (DANN) [29], the dynamic adversarial adaptation network (DAAN) [51], and the UDDAN.
ResNet18 serves as the most lightweight network in the series of ResNet backbones. However, for the problem considered in this article, its effectiveness needs to be further verified against other ResNet backbones. As shown in Fig. 26, a comparison is conducted between ResNet18, ResNet50, and ResNet152. A unified method of J-CORAL under the UDDAN architecture is used in training. The results show that ResNet18 requires the lowest computing time among the three networks. This is because ResNet18 has the fewest parameters to be learned. Therefore, the model adopting the ResNet18 network converges more easily and can be used for real-time classification. As for classification accuracy, ResNet18 is the highest after learning 50 epochs in all tasks, especially in tasks A and C. Overall, the computing efficiency and classification accuracy of the model using ResNet18 are better than those of the other two networks. Thus, we adopt ResNet18 as the backbone of the network in this study.

VI. CONCLUSION
This article proposes a discrepancy-based DA network to overcome the domain shift issue in the structural assessment of maglev rail joint conditions across various operation modes of maglev trains. An unsupervised algorithm is used to ensure the transferability of the network in real applications. Using the data from the source and target domains, the network learns the domain-invariant time-frequency discriminative features from the backbone layers and domain-variant time-frequency discriminative features from the adaptation layer. The trained model can detect the condition of maglev rail joints across different operation modes.
The applicability of the UDDAN is validated over a dataset acquired from an in situ maglev monitoring system. The results demonstrate the potential of using discrepancy-based DA in maglev rail joint damage detection. The DA is associated with higher average classification accuracies, smaller domain distances, better clustering of samples, and more consistent data distributions of bottleneck features than the DAF. Among the six DA methods, the second-order sample moment (CORAL) is found to represent the best discrepancy for the distribution alignment, regardless of whether the marginal or joint distribution is used. In addition, the model performance is verified in three tasks. The findings highlight that the proposed discrepancy-based DA network is robust against the operational conditions in the cross-domain maglev rail joint condition assessment.
Future work includes two aspects. First, there are three types (named J-I, J-II, and J-III) of maglev rail joints in a maglev line, and each type of maglev rail joint can be regarded as a domain. In this study, only the J-I type maglev rail joint is selected as the research object. The model learned from the J-I type maglev rail joint may fail to predict the J-II and J-III type maglev rail joints due to the domain shift. To extend the feasibility of the established model, the model can be verified by using data from the J-II and J-III type maglev rail joints. Second, more domain shift scenarios in maglev transport operations will be considered. For example, the vehicle loadings are not always constant. Different weights of the vehicle may cause different acceleration responses of maglev rail joints. As a result, the TFS demonstrates different discriminative features, and the effectiveness of the established model with the influence of vehicle loading needs to be further studied.

Manuscript received 18 July 2023; accepted 23 August 2023. Date of publication 18 September 2023; date of current version 9 October 2023. This work was supported in part by the National Natural Science Foundation of China under Grant U1934209; in part by the Wuyi University's Hong Kong and Macao Joint Research and Development Fund under Grant 2019WGALH15, Grant 2019WGALH17, and Grant 2021WGALH15; in part by the Innovation and Technology Commission of Hong Kong SAR Government, China, under Grant K-BBY1; and in part by the Hong Kong Polytechnic University (PolyU) Startup Fund for Research Assistant Professors (RAPs) through the Strategic Hiring Scheme under Grant P0039260. The Associate Editor coordinating the review process was Dr. Ke Feng. (Corresponding author: Su-Mei Wang.)

Fig. 7. Maglev rail joint on the straight segment of Shanghai Lin-Gang maglev test line.

Fig. 13. Flowchart for the procedure of maglev rail joint detection.

Fig. 16. Classification accuracy of DAF and two DA methods for (a) task A, (b) task B, and (c) task C with the change in the number of training epochs.

Fig. 17. Transfer loss with the training epochs for tasks A, B, and C.

Fig. 18. Feature clustering from raw TFSs with samples extracted in (a) stable passing mode of dataset $S_1$ and (b) braking mode of dataset $B_1$.
Figs. 18-21 show the results of t-SNE-based feature visualization of TFS samples. The features are mapped into a 2-D scatter diagram. Each point in the scatter diagram represents a sample.
Among them, DAN, DeepCoral, and the UDDAN are discrepancy-based DA networks, while DANN and DAAN are adversarial-based DA networks. Two transfer losses of M-MMD and M-CORAL are considered in the UDDAN, but only MDA-based discrepancies are used for comparison as DAN and DeepCoral are developed only under the assumption of marginal data distribution. DAN and M-MMD use MMD as a discrepancy, and DeepCoral and M-CORAL use CORAL as a discrepancy. Only M-MMD and M-CORAL adopt the tradeoff term in the transfer loss. The classification accuracies of the five methods together with the UDDAN methods considering the two transfer losses of M-MMD and M-CORAL are shown in Fig. 25. As can be seen, M-CORAL, which is the recommended method in this study, has the highest classification accuracy for all three tasks. It is also found that the classification accuracy of M-MMD is higher than that of DAN and the classification accuracy of M-CORAL is higher than that of DeepCoral. The results indicate that the tradeoff term designed in the UDDAN contributes to the accuracy improvement. Though adversarial-based DA networks (DANN and DAAN) are superior to the first-order discrepancy-based DA networks (DAN and M-MMD), they are inferior to the
Algorithm 1 Training Procedure of UDDAN
Output: the predicted label for data from the target domain
1) Train the data from the source and target domains to obtain the bottleneck features, source label predictions $\{f(x_i^s)\}_{i=1}^{N_s}$, and model parameters $\theta_b$, $\theta_c$, $\theta_a$.
2) Predict the pseudo-labels $\{\hat{y}_i^t\}$
Calculate the loss functions $L_D$ by (10), $L_T$ by (12), and $L_{TOTAL}$ by (13)
6) Update the model parameters by solving (14)-(16)
7) Update the pseudo-labels if using JDA
8) until the current epoch reaches the maximum value
9) Evaluate the model performance over data from the target domain
Here, $\theta_b$, $\theta_c$, and $\theta_a$ are the parameters of the backbone layers, classification layer, and adaptation layer, respectively.

TABLE I NUMBERS OF TFS SAMPLES FOR TASK A ($S_1 \rightarrow B_1$)

TABLE II NUMBERS OF TFS SAMPLES FOR TASK B ($B_1 \rightarrow S_1$)

TABLE III NUMBERS OF TFS SAMPLES FOR TASK C ($S_2 \rightarrow B_1$)

TABLE V CLASSIFICATION ACCURACY (%) OF DIFFERENT METHODS

Fig. 14. Classification accuracy (%) of different methods.

TABLE VI PAD BETWEEN BOTTLENECK FEATURES FROM DIFFERENT METHODS