Cross-Domain Fault Diagnosis of Rotating Machinery Using Discriminative Feature Attention Network

In recent industrial applications, machine learning has proven useful for preventing equipment failures through early fault diagnosis. In particular, we show that different domains can be linked through adversarial learning using data available under different working conditions, facilitating model training when acquiring data for all conditions is impractical in real-world applications. Nevertheless, incipient faults are difficult to diagnose because they do not differ significantly from normal data across conditions. Moreover, if only the domain discriminator's judgment is used when adapting the domain, misclassification occurs easily, so the reliability of the detection result must be improved. In this study, we propose a new learning method that improves classification performance by sharing the task-specific classification characteristics of the classifier with the target-domain feature generator. The proposed mechanism uses spatial attention to extract the focused partial information of the feature generator and discriminator and further enhances task-specific features using an attention mechanism between the two extracted information types. This addresses the challenge of achieving both domain adaptation and classification. Extensive experiments demonstrate efficiency and improved classification performance on benchmark and real-world application datasets. For real machine cases, the classification accuracy improves by almost 4%. In addition, the negative impact of false alarms is lowered by increasing the classification accuracy of minimal faults. We convincingly demonstrate the model's effectiveness through an empirical analysis of the method via ablation studies and visualization.


I. INTRODUCTION
With the success of many machine learning tasks, machine learning has become widely used in fault diagnosis for real applications [1], [2]. A key factor in achieving this development is the availability of vast amounts of labeled data in the workspace of interest. However, in real-world applications, collecting training instances is challenging given the enormous domain or labeling costs. Domain adaptation uses available labeled source data to address the lack of labeled target-domain data [3]. The data-collection range for fault diagnosis becomes wider with dynamic machine operation (vehicles, heavy equipment, etc.). Even if the model is designed to share the same label space for the source and target operations, domain adaptation still suffers from data distribution changes. Moreover, maintaining classifier performance while satisfying domain adaptation is a challenging problem [4]. Thus, the main goal of domain adaptation is to extract domain-invariant features so that task classifiers can learn from the source data and be readily applied to the target domain.
(The associate editor coordinating the review of this manuscript and approving it for publication was Dazhong Ma.)
Most recent domain adaptation methods are based on modern deep architectures. They rely on the superior model capacity of the networks to learn hierarchical features that have been empirically proven to transfer between domains better than previous methods [5], [6]. However, two shortcomings remain. First, the domain discriminator distinguishes between the source and target domains without considering the task-specific decision boundaries between classes. In particular, for fault diagnosis using a vibration signal, the problem of the original signal's characteristics being altered by strong ambient noise is more pronounced. Thus, the trained feature generator produces ambiguous features near class boundaries. Moreover, because only the loss value for domain classification is backpropagated to the feature generator, the model does not know which information to focus on to determine the difference between the two domains. Consequently, classification performance is not guaranteed even if domain adaptation succeeds. Second, data affected by the mode collapse problem (i.e., the joint distribution of features and categories) are not properly aligned in the source and target domains [13], [14]. A first step toward solving these shortcomings was co-parameterizing the task and domain classifiers into an integrated classifier. However, due to the noise added to the vibration signal, the model easily ends up extracting features by focusing only on regions that are easy to adapt to the source domain.
Thus, we propose a new learning method that ensures categorical prediction performance for all input instances through fault-relevant feature extraction, as presented in Fig. 1. Intuitively, training the proposed model is primarily performed by sharing the attention score between the encoding neurons (input) and real category neurons (output). Unlike the somewhat implicit work of [15], the feature generator and discriminator share key information about the hidden layer of the classifier through the spatial attention mechanism, inducing features focused on fault classification to be judged as source domain data. We introduce a category-level transferable feature sharing module to learn an adaptive classifier boundary for the target domain, which can further improve performance. The classification performance for the target domain can be guaranteed through this method.
Thus, our main contributions are summarized as follows: 1) We propose a model that focuses on the salient spatial information related to fault characteristics, together with an information-sharing mechanism that maintains classification performance between classes. The model balances categorization and domain adaptation to reduce misclassification in target domains. 2) Misclassification due to information unrelated to classification (e.g., noise and peripheral disturbances) is minimized through a combined model using feature extraction and an attention mechanism for the classification model.
3) The proposed method was quantitatively verified using three datasets: two rolling bearing benchmark datasets and the gearbox of a real machine. Promising results were obtained in comparisons with existing methods and related studies. 4) An additional experiment validates the robustness of the classifier in noisy environments, aiding a better understanding of the proposed method. The paper is organized as follows. Related work is described in Section II. Section III combines the overall ideas and proposes a new diagnostic method that shares classification information using spatial attention for discriminative domain adaptation. Section IV describes the performance verification results on the public datasets and the actual equipment dataset and analyzes the effectiveness of the proposed model. Finally, Section V presents the conclusion.

II. RELATED WORKS
In [16] and [17], shallow invariant feature extraction methods were proposed to extract domain-invariant features. The input signal was preprocessed through a spectral envelope extraction method, and the time-domain synchronous averaging principle with transfer component analysis (TCA) was used for cross-domain feature extraction [16]. The domain-adaptive neural network (DaNN) is a pioneering algorithm that uses shallow neural networks to calculate the distance between the source and target domains [17]. However, according to recent research, deep learning architectures for domain adaptation are more promising and can learn more transferable features [18], [19].
A domain adaptation method using a deep learning framework was applied to extract domain-invariant features [18]. In addition, a model that generalizes and optimizes the between-class and within-class distances by applying deep distance metric learning was proposed for diagnosing rolling bearing defects [19]. A deep convolutional transfer learning network with adversarial training was proposed to learn domain-invariant features across domains for fault diagnosis [20]. A multi-sensor fusion method using a deep belief network was proposed for ball screw fault diagnosis [21]. An image-based convolutional neural network diagnostic architecture was proposed for ball screw spindle failure [22]. A method to minimize the maximum mean discrepancy (MMD) between the source and target domain features based on a sparse autoencoder structure was also proposed [7].

A. DOMAIN ADVERSARIAL NETWORK
A domain adversarial network [10] is an adversarial training approach for minimizing domain discrepancy. Inspired by the generative adversarial network [23], the approach seeks the Nash equilibrium at which the domain discriminator cannot distinguish the domains through adversarial learning between the feature generator and domain discriminator. The feature generator learns to fool the discriminator, and the discriminator struggles to avoid being deceived. The adversarial network consists of a feature generator G(·), a discriminator D(·), and a classifier C(·).
Formally, this method is a minimax game, and the overall objective is defined as follows:

$$\min_{\theta_G,\theta_C}\max_{\theta_D}\; \mathcal{L}(\theta_G,\theta_C,\theta_D)=\frac{1}{n_s}\sum_{x_i\in D_S} L_C\big(C(G(x_i)),y_i\big)-\beta\,\frac{1}{n}\sum_{x_i\in D_S\cup D_T} L_D\big(D(G(x_i)),d_i\big) \tag{1}$$

where θ_G denotes the parameters of G_S and G_T, θ_C and θ_D are the parameters of C and D, respectively, and d_i denotes the domain label of x_i. In addition, L_C(·) is the label prediction loss, and L_D(·) is the domain classification loss. Further, β is a hyperparameter that trades off the two objectives L_C and L_D. Through minimax adversarial optimization training, the parameters θ̂_G, θ̂_C, and θ̂_D deliver a saddle point of (1):

$$(\hat{\theta}_G,\hat{\theta}_C)=\arg\min_{\theta_G,\theta_C}\mathcal{L}(\theta_G,\theta_C,\hat{\theta}_D),\qquad \hat{\theta}_D=\arg\max_{\theta_D}\mathcal{L}(\hat{\theta}_G,\hat{\theta}_C,\theta_D) \tag{2}$$
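As an illustration, the discriminator loss and the sign-flipped feature gradient that realize the minimax game in (1) can be sketched in NumPy; this is a minimal sketch with a logistic discriminator, and all function and variable names are illustrative, not from the paper.

```python
import numpy as np

def domain_discriminator_loss(features, domains, w):
    """Binary cross-entropy of a logistic domain discriminator D(g) = sigmoid(w.g).
    features: (n, d) encoded samples; domains: (n,) labels (1 = source, 0 = target);
    w: (d,) discriminator weights."""
    p = 1.0 / (1.0 + np.exp(-(features @ w)))
    eps = 1e-9
    loss = -np.mean(domains * np.log(p + eps) + (1 - domains) * np.log(1 - p + eps))
    # Gradient of the loss w.r.t. the features; the generator receives it
    # scaled by -beta, which implements the gradient-reversal minimax in (1).
    grad_features = ((p - domains)[:, None] * w) / len(domains)
    return loss, grad_features
```

In a full training loop, the discriminator would descend this loss while the generator ascends it (scaled by β), driving the two domains toward indistinguishable feature distributions.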

B. ATTENTION MECHANISM
Traditional research methods primarily start with a single label for the entire fault signal. Therefore, training in-depth models generally requires a wealth of domain knowledge and a vast volume of experimental data to increase the accuracy of fault signal classification. However, collecting such data for practical applications is difficult. Unlike traditional domain-invariant feature extraction over the entire input signal, the attention layer focuses on the signal directly related to the discriminative failure signature, allowing the model to adapt easily even if the domain changes. Inspired by the field of image processing [24], we propose a model allowing users to focus on the fault area. The attention mechanism was initially proposed for learning the alignment between source and target tokens to improve the performance of neural machine translation [25], [26]. It was primarily used in the language and image fields for object detection. Several studies have examined basic and modified attention mechanisms for time series data. A long short-term memory fully convolutional neural network with an attention mechanism was proposed for multivariate time series data [27]. The transition functions of the attention mechanism are described in (3)-(5), where H = [h_1, h_2, ..., h_n] is the feature matrix extracted by the predictive model, e_n ∈ R^n is a vector of ones, and v_a denotes the embedded aspect of the attention mechanism. In addition, α is a vector of attention weights for the features H, and r is the attentive representation given by the weighted sum of the features H:

$$M=\tanh\big([\,W_h H;\; W_v v_a\otimes e_n\,]\big) \tag{3}$$
$$\alpha=\mathrm{softmax}(w^{\top}M) \tag{4}$$
$$r=H\alpha^{\top} \tag{5}$$
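The transition functions (3)-(5) can be sketched in NumPy as follows; the shapes and weight names (W_h, W_v, w) are illustrative assumptions, as the paper does not specify them.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(H, v_a, W_h, W_v, w):
    """Attention over n time steps following (3)-(5).
    H: (d, n) features; v_a: (da,) embedded aspect; W_h: (k1, d);
    W_v: (k2, da); w: (k1 + k2,) trainable scoring vector."""
    n = H.shape[1]
    Va = np.tile((W_v @ v_a)[:, None], (1, n))   # W_v v_a (x) e_n: repeat aspect n times
    M = np.tanh(np.vstack([W_h @ H, Va]))        # (3) nonlinear projection
    alpha = softmax(w @ M)                       # (4) attention weights, sum to 1
    r = H @ alpha                                # (5) weighted sum of features
    return alpha, r
```

The weights α concentrate on the time steps most relevant to the classification target, and r is the resulting fixed-length representation.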

III. METHODS
Subsections A to E provide the problem formulation, overall idea, adversarial learning for domain adaptation, spatial attention with the classifier, and training strategy, respectively.

A. PROBLEM FORMULATION
We have a labeled source dataset {X_S, Y_S} and unlabeled target domain data X_T. Fig. 2 shows the obvious differences in the four fault types between the source and target domains, in both the raw signal and the frequency spectrum of the actual machine data. As shown in Fig. 2(a), the 0.5 mm crack case, which can be regarded as an initial fault, has a shape and amplitude similar to the normal data of the other domain. This increases the misclassification rate under domain adaptation. Also, in Fig. 2(b), the spectrum of the 0.5 mm fault is similar to that of the 1.0 mm fault, a sensitive condition that is easily misclassified. Fig. 3 illustrates the misclassification that can occur when domain adaptation focuses on the regions to which the existing adversarial domain adaptation model adapts most easily, and presents how the proposed model minimizes the misclassified part by focusing on discriminative feature extraction. The goal is to maintain fault classification performance through the trained model even in the unlabeled target domain. In this paper, the method was verified on datasets from three cases. We propose an improved unsupervised domain adaptation method that attends to category-level transferable features.
In Fig. 4, the distributions of the source and target domains of each dataset are different. The prefixes S_ and N_ indicate the source and target, respectively. For the actual machine data collected under real driving conditions, the alignment of the source and target data distribution is not uniform due to the surrounding noise and environmental factors compared to the data collected in the experimental environment. This factor makes domain adaptation more difficult.

B. OVERALL IDEA
Inspired by the generative adversarial network of the image processing domain, we introduced the attention module to the domain adaptation framework to learn a discriminative feature. We propose a model that can maintain fault classification performance even in the target domain by combining the adversarial domain adaptation model and attention mechanism to share the discriminative feature region directly related to the fault of the input signal. This model focuses on the tooth signal where the actual failures occurred, focusing on the failure signal even at different rotational speeds. Therefore, the characteristics of the target domain data are extracted with emphasis on key information for failure classification. The feature generator of the target domain creates a feature combined with the fault classification feature concentrated in the classifier and learns the feature so that it cannot be distinguished from the source domain data through a domain discriminator.
Through adversarial training between the generator (G) and domain discriminator (D), the source and target domain instances are reflected in the same learned feature subspace. In addition, through a new algorithm sharing the information's discriminatory characteristics of the source data with the feature extraction space of the target domain data, the defect diagnostic knowledge is shared, obtaining better classification performance in the target domain.
The attention layer is inserted into the corresponding layer between the feature generator and classifier, as depicted in Fig. 5, to create an attention model that learns the correlations needed to discriminate the fault signal. Spatial attention focuses on the informative part for classification or prediction, representing an attention score on the feature map or a single cross-sectional slice of the tensor. By refining the feature maps using spatial attention, we sharpen the focus passed to the subsequent convolutional layers, improving the model's fault classification performance.

C. ADVERSARIAL LEARNING FOR DOMAIN ADAPTATION
The learning model consists of two steps. First, we trained the entire system using the source domain data and labels. Next, by combining the feature generator with the extracted attention map, the target domain data were learned, and the loss of the domain discriminator was maximized so that the distribution of the target domain data is matched to the source domain. Its main purpose is to focus only on the characteristics of the error signal obtained from the original signal so that time-varying regions of different distributions can be excluded as much as possible.
Overall, the proposed method consists of a feature generator G = {G_S, G_T} with parameters θ_G = {θ_G_S, θ_G_T}.
The front layers of each G are shared between the source and target domains. Furthermore, the method consists of the task-specific classifier C with the parameter θ c . In addition, C classifies the features into K classes, outputting a K -dimensional vector of logits. The method also consists of a domain discriminator D with θ d . The proposed method aims to learn a domain-invariant discriminative feature that is accurately classified in the unlabeled target domain.
Moreover, the G-D pair extracts features for optimal classification from the classifier through each hidden layer of the feature generator, whereas the G-C pair plays the role of a classifier for the instances. Therefore, our goal is to extract features in a form that supports both accurate classification and domain alignment of the same signal. We share the focused part of each hidden layer by applying a spatial attention mechanism between the hidden layers of the feature generator and classifier. This shared attention score is fed back to the feature generator, which extracts features while minimizing the classification loss and maximizing the domain discrimination loss.

D. SPATIAL ATTENTION WITH CLASSIFIER
The classifier assigns the input to a class. We propose a spatial attention method to extract the fault identification areas on which the classifier concentrates for each classification in the source domain. These fault identification areas reveal where the classifier focuses to classify the input properly. Given an input feature f, a spatial attention map M(f) equal in size to f is computed by the classifier.
Moreover, M(f) represents the concentration of the hidden units needed to classify the input accurately. The classifier's attention map correlates with discrete, distinguishable defect areas and is shared with the domain feature generator for focusing. In the feature generator and classifier, average pooling, max pooling, and a 1 × 1 convolution are applied sequentially to compute the spatial feature representation. Inspired by the convolutional block attention module [28], we applied average pooling, max pooling, and a 1 × 1 convolution along the channel dimension to represent the features effectively. Then, from the convolutional layer, a spatial attention map M(f) ∈ R^{1×H×W}, focused on the salient region, is extracted. The above process applies to both the feature generator and the classifier. Finally, each attention map is combined by an element-wise dot product in the feature generator of the target domain. The spatial attention map is generated by applying average pooling, max pooling, a 1 × 1 convolution, and a 7 × 7 convolution layer. The operations are performed sequentially, and the outputs of the layers are denoted f_avg, f_max, f_1×1, f_7×7 ∈ R^{1×H×W}, respectively. The extracted spatial attention maps of the classifier and feature generator are denoted by M_C^S(f_S) and M_G(f), respectively. The merged spatial attention map M(f) is calculated by a non-local neural network [25] to apply the attention computation. The feature maps f_S and f_T branch into three copies corresponding to the key, value, and query.
Additionally, M_C^S(f_S) is the query, and M_G(f) is the key and value. Thus, dot-product attention is applied to output the attentive feature map:

$$A=\sigma\big(M_C^S(f_S)\,M_G(f)^{\top}\big) \tag{8}$$

where σ is the softmax activation. The final output is as follows:

$$M(f)=A\,M_G(f),\qquad f_T'=f_T\odot M(f) \tag{9}$$

As presented in Fig. 5, the attention maps for the target domain combine spatial attentive information from the target-domain feature generator and the classifier. The attention vector contains spatial information that relates the feature generator's low-level feature responses to the classification [29].
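A minimal NumPy sketch of the two stages, CBAM-style spatial map extraction and non-local query/key/value fusion, might look as follows; the function names, tensor shapes, and the 1 × 1 convolution weights are assumptions for illustration.

```python
import numpy as np

def spatial_map(f, w):
    """CBAM-style spatial attention map for a (C, H, W) feature tensor:
    channel-wise average and max pooling, then a 1x1 convolution
    (weights w, shape (2,)) collapses the pooled maps to one (H, W) map."""
    pooled = np.stack([f.mean(axis=0), f.max(axis=0)])  # (2, H, W)
    return np.tensordot(w, pooled, axes=1)              # (H, W)

def fuse_maps(m_cls, m_gen):
    """Non-local fusion of the classifier map (query) with the generator map
    (key and value): softmax(Q K^T) V over flattened spatial positions."""
    q, k, v = m_cls.ravel(), m_gen.ravel(), m_gen.ravel()
    scores = np.outer(q, k)
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)             # each row sums to 1
    return (attn @ v).reshape(m_cls.shape)
```

The fused map would then be applied to the target features element-wise, following f_T' = f_T ⊙ M(f).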

E. TRAINING STRATEGY
The flowchart of the proposed fault diagnostic method is illustrated in Fig. 6. First, the labeled source domain data and unlabeled target domain data are collected as raw vibration signals from the working equipment. The target domain data may be exposed to additional environmental noise relative to the source domain and may be collected under different working conditions than the training data. Second, feature extraction and domain adaptation training are performed on the source and target domain data. The fault classifier learns only from the source domain data and shares the source domain classification feature when extracting the target domain feature. Finally, we used the trained model to perform final defect classification on the newly collected data.
The classifier was trained using the standard supervised loss with source domain data (x_s, y_s):

$$L_C=-\,\mathbb{E}_{(x_s,y_s)\sim D_S}\sum_{k=1}^{K}\mathbb{1}[y_s=k]\,\log C\big(G_S(x_s)\big)_k \tag{10}$$

Then, using the source and target domain data, L_D was optimized:

$$L_D=-\,\mathbb{E}_{x_s\sim D_S}\log D\big(G_S(x_s)\big)-\mathbb{E}_{x_t\sim D_T}\log\Big(1-D\big(G_T(x_t)\big)\Big) \tag{11}$$

where G(·) encodes the input x, D performs domain discrimination, and C classifies. The objective can be expressed as follows:

$$\min_{G,C}\max_{D}\;\lambda_C L_C+\lambda_D L_D \tag{12}$$

where L_D and L_C are the losses for the discriminator network D and the classifier network C, respectively. Furthermore, λ_C and λ_D are hyperparameters used to balance the relative importance of the terms and are obtained experimentally during training. To meet the same conditions as the previous methods, we set λ_C = 0.01 and λ_D = 0.01. The calculation of the loss function is explained in Algorithm 1.
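The classification loss L_C, the domain loss L_D, and their weighted combination can be computed as in the following NumPy sketch; this is a simplified empirical version, and the function names are illustrative.

```python
import numpy as np

def classification_loss(logits, labels):
    """L_C: supervised cross-entropy on labeled source samples.
    logits: (n, K) classifier outputs; labels: (n,) integer classes."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(logp[np.arange(len(labels)), labels])

def domain_loss(d_src, d_tgt):
    """L_D: the discriminator pushes source outputs toward 1 and target
    outputs toward 0; the generator is trained to maximize this loss."""
    eps = 1e-9
    return -np.mean(np.log(d_src + eps)) - np.mean(np.log(1.0 - d_tgt + eps))

def total_objective(l_c, l_d, lam_c=0.01, lam_d=0.01):
    """Weighted sum using the trade-off hyperparameters reported in the paper."""
    return lam_c * l_c + lam_d * l_d
```

In the adversarial loop, D descends domain_loss while G ascends it, and C (with G_S) descends classification_loss.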
Finally, the total objective function incorporating the adversarial loss and modified classification loss with the attentive feature is expressed as follows:

IV. EXPERIMENTS
The proposed model was validated using rolling element bearing datasets (two benchmark datasets) and a dataset from heavy equipment that works dynamically under various loads. Moreover, the proposed model was compared with state-of-the-art domain adaptation architectures, including TCA [16], DAFD [15], and discriminative adversarial domain adaptation (DADA) [30]. In addition, a model with the attention mechanism removed was included to verify the effectiveness of the proposed attention architecture. The experiment for the second case was set to actual operating conditions, including noise and other environmental factors, to verify the feasibility and practicality of heavy-equipment failure diagnosis. The DAFD architecture introduced domain adaptation for fault diagnosis of motor applications. The DADA architecture used a domain adaptation method to maintain discriminative performance in the image field. Adversarial multiple-target domain adaptation (AMDA) was proposed for multiple-target domain feature learning [31].

Algorithm 1 Learning Algorithm
Input: Input data X = {x_1, x_2, ..., x_n}
Output: Trained models D, C, G
1: begin
2: Initialize parameters for D, C, G
3: for i = 1 → n do
4:   # extract latent vector and classification with domain discrimination at each time step
5:   for t = 0 → T do
6:     Compute M(f) according to (9)
⋮
13:     Compute L_D with x̂_i and x_i
14:     Compute L_C with C and x̂_i
15:     Update G with L_G formulated in (10), (11)
16:     Update D with L_D formulated in (12)
17:     Update C with L_C formulated in (10)

A. IMPLEMENTATION DETAILS
The comparison methods were verified using the implementation strategies proposed in their original papers. The data for each driving condition were randomly divided into training and testing data. The training and testing data were used for model training and validation with a specific domain as the source domain. The remaining domains were defined as target domains, and their training data were used as the target domain data during training. The testing data of the target domain were used for the final model verification. The detailed parameters for each experiment are listed in Table 1, and Table 2 presents the details of the network. Multiple overlapping samples were extracted using a sliding window approach; each sample is 1,500 data points long. The network consists of convolutional, attention, normalization, and fully connected layers. A leaky rectified linear unit (LReLU) was used as the activation function, making it easier for backpropagation to adjust the weights of the shallow layers.
Considering that the vibration signal repeatedly takes positive and negative values, the LReLU function was adopted to overcome the limitation of the rectified linear unit (ReLU) activation function, which stops learning at negative values. The LReLU is an improved variant of the ReLU that is also suitable for feature extraction from negative input values by providing a small positive slope α when the input is negative, allowing the neurons to continue working [32]:

$$\mathrm{LReLU}(x)=\begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases} \tag{15}$$

Network training and inference were performed on a workstation with a single AMD 7742 CPU, a V100 GPU, and an Ubuntu 18.04 operating system. The learning model was implemented in the Keras library using Python 3.7. Each sample was divided based on its length and used as input. We adopted a cross-entropy loss function during training with a learning rate of 0.0001 and a batch size of 300, using the Adam optimization algorithm.
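A minimal implementation of the LReLU in (15) follows; the slope value α = 0.01 is an assumed default, as the paper does not report it.

```python
import numpy as np

def lrelu(x, alpha=0.01):
    """Leaky ReLU per (15): identity for non-negative inputs, a small
    slope alpha for negative inputs so neurons keep receiving gradient."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * x)
```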

B. CASE STUDY 1: CWRU DATASET
We validated the model using the 12-kHz drive-end bearing fault data in the Case Western Reserve University (CWRU) bearing dataset [33]. The four bearing conditions are normal, ball fault, inner race fault, and outer race fault. Each fault type has three sizes: 0.007, 0.014, and 0.021 inches; therefore, 10 label types exist. Each label contains four load types (0, 1, 2, and 3 HP) measured at motor speeds of 1,797, 1,772, 1,750, and 1,730 rpm, respectively. Each sample was extracted at each sampling point, as depicted in Fig. 7. Datasets A, B, and C are defined with different rotational speed conditions, each containing 12,300 training samples and 6,059 testing samples, as listed in Table 3. Each measurement dataset is divided into 70% training and 30% testing sets. One dataset is defined as the source, and another as the target domain (e.g., A: source domain; C: target domain). The proposed model performed similarly to or better than the state-of-the-art methods in most cases, as revealed in Table 4. Interestingly, when learning from data obtained from a device rotating at high speed, the model evaluated the low-speed data quite well. However, when a model trained on low-speed data evaluated high-speed data, the reliability was slightly lower.
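The overlapping sample extraction used to build these training and testing sets can be sketched as follows; the 1,500-point window length follows the implementation details, while the 50% overlap (step of 750 points) is an assumption, as the exact overlap is not reported.

```python
import numpy as np

def sliding_windows(signal, length=1500, step=750):
    """Extract overlapping fixed-length samples from a 1-D vibration record."""
    n = (len(signal) - length) // step + 1
    return np.stack([signal[i * step : i * step + length] for i in range(n)])
```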
The failure impact energy increases with the rotational speed. However, the background noise is the same regardless of speed, so the signal-to-noise ratio (SNR) decreases as the rotational speed decreases. Therefore, training on low-speed data acted as a factor of performance degradation when evaluating high-speed rotational data. Hence, learning is more effective using high-speed rather than low-speed rotational data [34].

C. CASE STUDY 2: XJTU-SY DATASET
Unlike the CWRU dataset, which contains data with apparent fault characteristics, the XJTU-SY bearing dataset contains run-to-failure bearing data with different failure modes, such as inner race, outer race, and mixed faults [35]. The testbed is illustrated in Fig. 8. In particular, dataset C not only contains data with obvious defect characteristics, similar to datasets A and B, but also includes early-lifecycle data with weak defect characteristics. The vibration signal was sampled at 25.6 kHz, and the data length was matched to condition A, which has the fewest samples. The XJTU-SY dataset details are summarized in Table 5. The data were divided into sets A, B, and C, and experiments were performed by defining each as a source or target domain. The results of the comparative experiments, shown in Table 6, prove that the proposed method is more effective than the others. In particular, C→A shows much higher results than the other methods. The proposed method improves fault classification performance by sharing classification-oriented domain information.

D. CASE STUDY 3: REAL MACHINE DATASET
Vibrations occurring in the gearbox of actual heavy equipment are very complex because heavy equipment operates dynamically at various speeds and loads. The sensor is mounted on the outside of the gearbox and collects the gearbox vibration along with vibration (noise) from other equipment components, as shown in Fig. 9. There are three failure classes, determined by crack size: 0.5, 1.0, and 2.5 mm. Smaller crack sizes make the failure more difficult to classify. Unlike the CWRU dataset, the data were collected from actual heavy equipment operating under various conditions (speed, load, etc.) and exposed to different noise environments. The data did not contain accurate speed information (no tachometer), and the tests were conducted at about 100%, 75%, 50%, and 25% of the maximum rotation speed (about 9.1 rpm), based on user manipulation. In addition, data were collected only for some classes. Table 7 lists the operating conditions, number of fault classes, and counts of training and testing datasets. In this experiment, each training dataset is also used as the target domain data.
Because only Datasets A and C contain all task data, the experiment was conducted using Dataset A or C for training and the other datasets for verification. The confusion matrix of the classification results of the real machine using the proposed model is displayed in Fig. 10. The confusion matrix displays the predicted results of the samples on the columns and the actual labels of the samples on the rows. Approximately 88% to 90% accuracy was achieved for the new domain data, and the performance was significantly reduced when the attention layer was not applied. Moreover, we confirmed that the recognition of the minimal failure (abnormal, 0.5 mm) improves and that the false alarm rate is significantly reduced. We compared the performance of the proposed model with several state-of-the-art deep learning methods under the conditions in Table 8, reporting accuracy and standard deviation values. In all cases, the proposed model outperformed the other models.
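Reading the confusion matrix as described (actual labels on rows, predictions on columns), overall accuracy and per-class recall can be computed as in this sketch; it is a generic helper, not code from the paper.

```python
import numpy as np

def confusion_metrics(cm):
    """Overall accuracy and per-class recall from a confusion matrix
    with actual labels on the rows and predicted labels on the columns."""
    cm = np.asarray(cm, dtype=float)
    accuracy = np.trace(cm) / cm.sum()
    recall = np.diag(cm) / cm.sum(axis=1)   # row-normalized diagonal
    return accuracy, recall
```

Per-class recall is the quantity relevant to the minimal-failure (0.5 mm) class, where false alarms are most sensitive.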

E. INFERENCE TIME COSTS
A key consideration for a fault diagnostic method is the time required to generate classification results. Inference time is an important factor in the practical implementation of an intelligent fault diagnostic system because model training can be done offline, whereas inference on incoming sequences must be performed onboard in real time. The inference time cost comparison was added to each experiment's result table (Tables 4, 6, and 8). No increase in inference time was found for the proposed method because it requires no additional preprocessing or feature extraction module. Therefore, the proposed method improves discrimination reliability while keeping the discrimination time comparable, making it suitable for use in the industrial field.

F. EFFECTIVENESS OF ATTENTIONAL GUIDANCE
In this section, we investigate the robustness of the proposed model to a noisy environment. The CWRU data were collected under limited conditions in a laboratory environment; thus, the data did not reflect the noise conditions for an actual industrial site. Considering that the target equipment is used in a harsh environment, the model performance was verified under noisy conditions.
We verified the performance of the trained model under noisy conditions by adding Gaussian white noise (−3 dB or −6 dB). The SNR is defined as follows:

$$\mathrm{SNR_{dB}}=10\log_{10}\!\left(\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\right)$$

where P_signal and P_noise are the power of the signal and the power of the noise, respectively. The comparison results are presented in Table 9.
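Noise injection at a target SNR, as used in this experiment, can be sketched as follows; the seeding and shape handling are illustrative choices.

```python
import numpy as np

def add_white_noise(signal, snr_db, seed=0):
    """Corrupt a 1-D signal with Gaussian white noise at a target SNR (dB),
    using SNR = 10 * log10(P_signal / P_noise)."""
    rng = np.random.default_rng(seed)
    p_signal = np.mean(np.square(signal))
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=len(signal))
```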
The accuracy of the proposed model is better than that of the conventional models in all SNR scenarios. At an SNR of −6 dB, the proposed model still reaches a diagnostic accuracy of 94.89%, an improvement of almost 8% over DADA, the best of the comparison models. This result reveals that the proposed model extracts features focused on fault classification characteristics without additional noise-removal preprocessing, indicating that the proposed model is robust over a wide range of operating conditions in real applications.

G. ABLATION STUDY
The ablation study was performed to investigate the effect of sharing classification information through the attention mechanism. We visualized the spatial attention scores and maps in each attention layer to further understand the feature learning mechanism in the generator, as illustrated in Fig. 11. All attention scores have the same length as the input signals; colors represent relative attention weights, with yellow indicating high levels.

FIGURE 13. T-SNE visualization of feature alignment between models with and without attention networks on the CWRU and real machine datasets. In each panel (a)-(f), the left side is without the guide attention network, and the right side presents the proposed model.
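For intuition, a CBAM-style spatial attention score over a 1-D feature map can be sketched in NumPy as below; the kernel size, channel count, and signal length are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(features, kernel):
    """CBAM-style spatial attention over a (channels, length) feature map.

    Pools across channels, convolves the pooled pair with `kernel`, and
    squashes to (0, 1), yielding one attention weight per time step.
    """
    avg_pool = features.mean(axis=0)              # (length,)
    max_pool = features.max(axis=0)               # (length,)
    stacked = np.stack([avg_pool, max_pool])      # (2, length)
    # 'same'-padded 1-D convolution of the pooled maps with a (2, k) kernel.
    k = kernel.shape[1]
    padded = np.pad(stacked, ((0, 0), (k // 2, k // 2)))
    scores = np.array([
        np.sum(padded[:, i:i + k] * kernel)
        for i in range(features.shape[1])
    ])
    return sigmoid(scores)

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 128))            # 16 channels, 128 time steps
attn = spatial_attention(feats, kernel=rng.standard_normal((2, 7)))
weighted = feats * attn                           # re-weight generator features
```

The attention vector `attn` has one weight per input time step, which is exactly what the visualizations in Fig. 11 display over the vibration signal.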
In general, vibrational peaks can indicate defects, and variations and subtle differences in the peaks can indicate the failure type. Comparing the attention before (the attention layer of the generator in Fig. 11) and after (the attention map in Fig. 11) sharing the classification properties, we confirmed that the attended regions become more concentrated and shift according to the patterns captured by the classifier. Fig. 12 shows the probability density in the feature space before and after domain adaptation for various domain adaptation learning methods. After domain adaptation, features are extracted with similar distributions in the two domains; among the methods, the proposed one aligns the distributions better than the comparison methods.
The attention mechanism focuses on specific information in the input signal, which indicates that it can learn discriminative features that are beneficial for diagnosing various bearing conditions. As depicted in Fig. 13, if the classifier's attentive information is not shared through the attention mechanism, the amount of misclassified data increases despite the well-adapted domain (the left figure in each of (a)-(d)). By sharing informative classification information through the attention mechanism, the network can selectively learn the more important features (the right figure in each of (a)-(d)). As a result, the efficiency and capability of feature learning are improved, which increases the fault-classification accuracy in the target domain and lets the model focus on the informative fault signal segments.
For the misclassified data, classification accuracy was lowered because features were extracted from regions or noise relevant only to domain classification rather than from features indicating the actual failure; the area attended to for domain classification may be concentrated on a region unrelated to the task classification. However, Fig. 14 confirms that when re-weighting is applied to the failure-classification feature regions on which the classifier concentrates, feature extraction relevant to failure classification is induced.
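The re-weighting step described above can be sketched as follows. This is a hypothetical simplification of the guidance mechanism, not the paper's exact formulation: the classifier's per-step attention map is broadcast over the generator's feature map, with a residual term so that no information is entirely suppressed.

```python
import numpy as np

def guide_generator_features(gen_feats, clf_attn):
    """Re-weight (channels, length) generator features with the classifier's
    per-step attention map (values in [0, 1]).

    Hypothetical sketch: regions the task classifier relies on are amplified,
    while the residual '1 +' term keeps a path for unattended regions.
    """
    return gen_feats * (1.0 + clf_attn)

rng = np.random.default_rng(1)
gen_feats = rng.standard_normal((16, 128))
clf_attn = rng.uniform(0.0, 1.0, size=128)   # classifier attention weights
guided = guide_generator_features(gen_feats, clf_attn)
```

Under this form, every feature magnitude is preserved or amplified, so the guidance biases the generator toward task-relevant regions without discarding domain-invariant information.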

V. CONCLUSION
This paper presents a domain adaptation approach that maintains classification performance when diagnosing rotating machine failures under various operating conditions. The proposed algorithm addresses the difficulty of obtaining sufficient samples for various operating conditions, one of the main challenges in domain adaptation for fault diagnosis. By applying an attention mechanism, the feature extraction method shares spatial information between the discriminator and the hidden layers of the feature generator, extracting the characteristics of the attended fault signal.
In addition, through a domain discriminator, the approach learns to minimize the discrepancy between the source and target domain distributions. The proposed model was validated through extensive experimentation against the latest deep learning models on three datasets: CWRU, XJTU-SY, and a real machine dataset. The proposed model maintains a classification accuracy of over 89% in the target domain. The ablation study results indicate that our model effectively diagnoses faults while maintaining the desired classification performance under new operating conditions.
The proposed model offers a new feature extraction method for domain adaptation on the same task. In future work, we will further refine the model to handle diagnostic tasks in which the source and target domains have different label categories.