Sound Source Localization Inside a Structure Under Semi-Supervised Conditions

We propose a method for applying a sound source localization (SSL) model trained on simulated data in a real-world environment, with a domain transfer (DT) model for the SSL inside a structure. The DT model transfers real data into pseudo-simulation data. The SSL model trained on the simulation data is then adapted to the real data using the DT model. Our method consists of an SSL model and a DT model. The SSL model predicts the position of a sound source inside the structure, whereas the DT model transforms the data. Because our simulation is not perfect, real data are extrapolated for use with the SSL model. However, the data transformed by the DT model are interpolated within the feature space. The outcome is that the performance of the SSL model in the real world is improved. In our study, the frequency spectra of accelerometers observed on the outer surface of the structure are the model input. The goal is to predict the position of the sound source. The SSL model is built using deep and convolutional neural networks, and the DT model is built using either an autoencoder, a deep convolutional autoencoder, or pix2pix. The two-dimensional distributions of the t-distributed Stochastic Neighbor Embedding indicate that using pix2pix as the DT model shows the best performance. Furthermore, our method's performance for SSL is improved by 57% for the classification problem and by 27% for the regression problem when compared to the case where no transformation is applied.


I. INTRODUCTION
Sound source localization (SSL) is important for reducing the noise of machines and electrical appliances. Several SSL methods that use the correlation of time-frequency signals observed by multiple microphones have been proposed. These methods are based on the time difference of arrival (TDOA) of acoustic signals [1], [2]. Many studies have reported improvements for TDOA problems such as noise, reverberation, and simultaneously emitting sound sources. In recent years, several methods have been proposed that incorporate deep learning and handle scenarios that are challenging for conventional methods [3], [4]. However, the applications of these methods are limited to circumstances where acoustic signals can be directly observed. These methods are not applicable for estimating, from outside, the position of a sound source inside a structure because the acoustic signals are observed as indirect sounds. The SSL inside a structure is an important problem because it leads to essential solutions for product noise reduction. For example, if noise is generated owing to damage to a component inside the structure, disassembling the structure or placing measurement equipment inside the structure is not an option because it would change the structure's response system. In other words, the resonant frequency changes because the disassembly of the structure or placement of the measurement device causes a change in the volume of the acoustic space. Therefore, the SSL has to be conducted outside the structure under normal operating conditions. Specifically, it can detect the position of noise owing to component defects, deterioration, and interference that occur in mass-produced home appliances, mechanical products, and prototypes. SSL can also be applied in other situations where the source cannot be observed directly; for example, it can be used to estimate the position of noise generated in gas or water pipes.
This research deals with the problem of estimating the location or position of sources that cannot be directly observed.
Methods based on deep neural networks (DNN) and computer-aided engineering (CAE) have been proposed for estimating the sound source inside a structure [5]. Our method successfully estimated the position of the sound source inside the structure from the signals observed by accelerometers installed on the outer surface of the structure, in both the simulation and real domains. However, our method still faces the challenge of applying a DNN trained in the simulation domain to the real domain. The main problem is that there are differences between the simulation data and actual experimental data. These differences occur because the simulation data poorly simulate the actual experimental conditions of a structure's geometry, material parameters, and nonlinearity. For both the indirect and direct sound, it is still difficult to apply the trained model built on simulation data to real data because the simulation is not perfect [6].
To solve the SSL problem, our study focuses on a method to apply models built with simulated data to real data. Adapting a trained model to another task or data is called "transfer learning (TL)," which has been studied in the fields of visual categorization and natural language processing by a large number of researchers [7], [8], [9], [10], [11]. With TL, the goal is to reduce the performance degradation caused by different distributions (called domain shifts) of the data used to train the model (source domain) and test data (target domain). Although there are a few studies on TL for SSL, there are no methods for estimating the position of the sound source inside the structure. Previous studies on SSL have focused on classification problems and assumed weakly supervised labels and TL in visual categorization [12], [13], [14], [15]. We could not use these methods because our study includes a regression problem for predicting the coordinates of a sound source. In addition, weakly supervised learning situations with missing or noisy labels are not targeted.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
We focus on "transductive transfer learning" or "domain adaptation (DA)" because the task is the same and only the domain differs [16]. Within that learning condition, we use a "feature-representation-transfer approach" because the equivalency of conditional distributions in the source and target domains is not guaranteed [17]. The SSL in this study can use real data; that is, it is a semi-supervised condition. In the case of labeled and unlabeled data available in the target domain, there are discrepancy, adversarial, and reconstructive approaches for solving the DA with deep learning [18]. These methods mainly use invariant representations of the source and target data or assign pseudo labels to the target data.
We propose a DA method for SSL inside a structure under semi-supervised conditions. The datasets are labeled according to the area or position of the sound sources and not based on the data. Therefore, the model built in the simulation domain is used directly because the conditional distributions differ between the simulation and real domains. The method consists of a domain transfer (DT) model and an SSL model. The DT model transforms real data into pseudo-simulation data, and the SSL model predicts the position of the sound source from the transformed data.
The remainder of this paper is organized as follows. In Section II, the method used in our previous study on SSL inside a structure is summarized. In Section III, the proposed method for applying the models built in the simulation to real environments is described. Section IV presents the datasets of the simulation and actual experimental data. In Sections V and VI, the SSL and DT models are described. The results are described and discussed in Section VII, and Section VIII presents our conclusions.

II. FORMULATION AND FRAMEWORK FOR SSL INSIDE A STRUCTURE
The finite discretization equation for the forced vibration of the acoustic-structural coupled problem is as follows:
$$
\begin{bmatrix} M_S & 0 \\ \rho_0 R^{\mathsf{T}} & M_F \end{bmatrix}
\begin{Bmatrix} \ddot{u} \\ \ddot{p} \end{Bmatrix}
+ \begin{bmatrix} C_S & 0 \\ 0 & C_F \end{bmatrix}
\begin{Bmatrix} \dot{u} \\ \dot{p} \end{Bmatrix}
+ \begin{bmatrix} K_S & -R \\ 0 & K_F \end{bmatrix}
\begin{Bmatrix} u \\ p \end{Bmatrix}
= \begin{Bmatrix} F_S \\ F_F \end{Bmatrix}
\tag{1}
$$
where $u$, $p$, $M$, $C$, and $K$ represent the displacement, sound pressure, mass matrix, damping matrix, and stiffness matrix, respectively. Suffixes $S$ and $F$ denote the structural and acoustic terms, respectively. Here, $\rho_0$ denotes the mass density constant of the acoustic fluid, and matrix $R$ denotes the coupled term. In this study, the external forces on the structure are not assumed ($F_S = 0$). SSL inside the structure is the inverse problem of determining the input term. There are three physical limitations to identifying the source inside the structure. First, the input-term estimation problem at resonance is ill-posed because the uniqueness of the solution is not guaranteed. Second, because the sound radiating from the structure is an indirect sound, the phase information of the internal sound source is lost. In the resonance state that causes noise, the resonance characteristics derived from the acoustic space are mixed with those derived from the structure. Third, because noise characteristics are determined by the resonance characteristics of the structure and its acoustic space, the structure cannot be disassembled. Therefore, in the framework of SSL inside the structure, the location or position of the sound source is stochastically estimated from observation data outside the structure, using machine learning techniques and simulations. We propose a new method using machine learning, which is required for SSL inside structures. As shown in Fig. 1, a framework for SSL inside the structure is implemented in the following three steps [5].
a) Data generation by simulation: A coupled acoustic-structure analysis is used to generate datasets that consist of data observed outside a structure and the position of a sound source. For example, the finite element method (FEM) is used to generate analytical data, such as acceleration signals on the exterior surface of the structure and acoustic signals around the structure corresponding to the acoustic excitation of the sound source position.
b) Training of the SSL model: The analytical data obtained from the coupled acoustic-structure analysis are defined as input data for the DNN, and the positions of the sound sources paired with the input data are defined as the labels for the DNN. In other words, a combination $(X, T)$ of the data observed outside the structure (input data $X$) and the location or position of the sound source (label data $T$) is treated as a dataset. The matrix $X$ of $D$-dimensional input vectors $x_i$ and the matrix $T$ of $K$-dimensional label vectors $t_i$ are given by
$$
X = (x_1, x_2, \ldots, x_N)^{\mathsf{T}}, \qquad T = (t_1, t_2, \ldots, t_N)^{\mathsf{T}}
\tag{2}
$$
where the subscript $N$ indicates the number of samples. In this step, the input-output relationships are learned by the DNN.
c) Prediction of the sound source positions: The trained DNN constructed with the simulation data is used in the real world for SSL.
Because this method is based on sampling data from a simulation, it applies to objects of various sizes, and the resolution of the SSL can be set as needed. Narrowing the sampling interval of the simulation improves the resolution near the decision boundary for the classification problem and reduces the variance of the RMSE for the regression problem owing to the increase in the number of data points. For example, if a regression problem requires an SSL resolution of 20 mm or less, it can be handled by setting the sampling interval for the simulation to 20 mm or less, thereby allowing flexibility in the setup. In the product development design step, the clearance of the components placed inside the product is designed by considering the intersection of each element. The noise source parts can be identified by setting the sampling interval below the clearance in the simulation.
Furthermore, because this method uses multiple sensors, using the transfer functions between the sensors as input data for the model enables SSL without depending on the characteristics of the sound source. Specifically, $Y_1 = G_1 S_0$ holds if the characteristic of the sound source is $S_0$, the signal observed by sensor "1" is $Y_1$, and the transfer function due to the path from the sound source to the observation point is $G_1$. Similarly, $Y_2 = G_2 S_0$ and $Y_3 = G_3 S_0$ for the other sensors. If sensor "1" is the reference sensor, the ratios of the observed signals, for example $Y_2 / Y_1 = G_2 / G_1$ and $Y_3 / Y_1 = G_3 / G_1$, are independent of $S_0$. By defining the three ratios as $F_1$, $F_2$, and $F_3$ and by using them as training data, SSL can be applied independently of the sound source characteristics. This methodology may be affected by noise in the accelerometer observations and is not applied in this study. We plan to treat the noise as an optimization problem for the location and number of sensors. This is because the amplitude of the frequency spectrum measured at each observation point is different; therefore, the effect of noise is likely to be different for each point. In view of this, the present study focuses on the domain transfer problem without applying the division method because we consider that the domain transfer can be properly evaluated without canceling the characteristics of the sound sources.
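The source-independence argument above can be checked numerically. The following is a minimal sketch at a single frequency bin; the transfer-function values, the two source characteristics, and the choice of dividing every sensor by sensor "1" are illustrative assumptions, not values or definitions taken from this study.

```python
import cmath

def observe(transfer_functions, source):
    """Return Y_i = G_i * S_0 for each sensor at one frequency bin."""
    return [g * source for g in transfer_functions]

def ratios_to_reference(observations, ref=0):
    """Divide every observation by the reference sensor's observation."""
    return [y / observations[ref] for y in observations]

# Hypothetical complex transfer functions G_1, G_2, G_3 (magnitude, phase).
G = [cmath.rect(0.8, 0.3), cmath.rect(1.5, -1.1), cmath.rect(0.4, 2.0)]

# Two different source characteristics S_0 (amplitude and phase both differ).
S_a = cmath.rect(1.0, 0.0)
S_b = cmath.rect(3.7, 1.2)

F_a = ratios_to_reference(observe(G, S_a))
F_b = ratios_to_reference(observe(G, S_b))

# The ratios agree for both sources because S_0 cancels in Y_i / Y_1.
same = all(abs(fa - fb) < 1e-12 for fa, fb in zip(F_a, F_b))
print(same)
```

In this noise-free setting the ratios depend only on the paths; with measurement noise added to each observation the cancellation is no longer exact, which is the concern raised in the text.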

III. PROPOSED METHOD: UNDER SEMI-SUPERVISED CONDITIONS FOR SSL INSIDE STRUCTURE
In general, the generalization performance is significantly lower when training and test data are sampled from populations with different distributions [7], [8], [19]. This is important when using machine-learning models in real-world scenarios. For the same reason, when the model trained in the simulation is applied to a real environment, the generalization performance of the trained model is low because the simulation does not perfectly reproduce the real environment. Therefore, the difference between the simulation and real data significantly decreases the performance of the SSL model trained on the simulation data.
In this study, the DT model is applied to reduce distributional discrepancies under semi-supervised conditions. DT models are incorporated into the framework of the SSL inside the structure. The DT model transfers real data into pseudo-simulation data so that the SSL model constructed in the simulation domain can handle real data. In machine learning, the simulation domain corresponds to the source and the real domain corresponds to the target [20]. Fig. 2 shows a schematic of data transfer and decision boundary adaptation in a situation where the simulation and real domain distributions and their respective decision boundaries are different. In general DA techniques, the direction of the transformation is from the source domain to the target domain. However, our method predicts the position of the sound source using the SSL model built in the simulation domain; therefore, the direction of the transformation is the opposite. In other words, the target data are transferred to the source domain. This reverse transformation strategy is being studied further in a recently developed field called DA for semantic segmentation [21].
The contribution of this study is to show a domain transformation method to adapt the SSL model built on simulation to the real environment for the SSL inside structures, which has rarely been studied. Typical "source to target" domain transformation methods have been applied to photographic images and text data that are data-rich in both the source and target domains and have not been applied to SSLs inside structures. Our study requires the SSL in the target domain under the condition that a large amount of simulation data is available but real data is limited. Therefore, it is essentially impossible in this research to use "source to target." Because of this limitation, we use a "target to source" transformation direction. Furthermore, although our previous work [5] could not directly build SSL models with small amounts of real data, this inverse transformation contributes to the leveraging of SSL models trained on large amounts of simulated data.
Our method has the potential to use the numerous discriminative, regressive, and generative models that have been proposed. A flowchart of our method is shown in Fig. 3. The blue box represents the training and transfer phases of the DT model, and the green box represents the training and prediction phases of the SSL model. The subscripts $S$ and $R$ denote the simulation and real domains, respectively, and $X_S$ and $X_R$ denote the input data. In addition, the paired sound source position labels are denoted by $T_S$ and $T_R$. The goal is to estimate $T_R$ from $X_R$ by using the SSL model built in the simulation domain. In most cases, $X_S$ does not equal $X_R$, and so the SSL models $f_{SS}$ and $f_{RR}$, built in the simulation and real domains, respectively, are different. Consequently, SSL in the real domain using the SSL model trained in the simulation domain fails.
Therefore, the DT model $h_{RS}$ is adopted to reduce the distributional discrepancies between the simulation and real domains. The model uses $N_{R(T)}$ pairs of simulation data $X_{S(T)}$ and real data $X_{R(T)}$ as the training data (Fig. 3(a)). The subscripts $S(T)$ and $R(T)$ represent the simulation and real training data, respectively. The DT model transforms the test data $X_{R(V)}$ from the real data into pseudo-simulation data $X_{RS(V)}$ (Fig. 3(b)). The subscripts $R(V)$ and $RS(V)$ denote the real and pseudo-simulation test data, respectively. The SSL model $f_{SS}$ is built using the training dataset $(X_{S(T)}, T_{S(T)})$ with $N_{S(T)}$ samples (Fig. 3(c)). By providing pseudo-simulation test data $X_{RS(V)}$ as input data to the trained SSL model, the sound source positions in the real data are predicted (Fig. 3(d)). The training and test data meet the following criteria:
$$
N_R = N_{R(T)} + N_{R(V)}
\tag{4}
$$
where $N_R$ is the total number of real data and $N_{R(V)}$ is the number of real test data, and
$$
N_{S(T)} = N_S + N_{RS(T)}
\tag{5}
$$
where $N_S$ is the total number of simulation data and $N_{RS(T)}$ is the number of pseudo-simulation training data. Pseudo-simulation data are not guaranteed to be transformed according to the decision boundaries of the SSL model. Therefore, the SSL model is trained on both the simulation and pseudo-simulation data to adapt to discriminative boundaries.
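The flow of Fig. 3(a)-(d) can be illustrated with a toy example. The sketch below is not the implementation used in this study: a per-feature affine fit stands in for the DT model, a nearest-neighbor rule stands in for the SSL model, and the two-dimensional features and the domain shift `to_real` are hypothetical. For brevity, the stand-in SSL model uses only the simulation data and omits the additional training on pseudo-simulation data described above.

```python
import random

def fit_affine_dt(x_real, x_sim):
    """Stand-in DT model h_RS: per-feature least-squares fit x_sim ~ a * x_real + b."""
    n, d = len(x_real), len(x_real[0])
    params = []
    for j in range(d):
        r = [row[j] for row in x_real]
        s = [row[j] for row in x_sim]
        mr, ms = sum(r) / n, sum(s) / n
        var = sum((v - mr) ** 2 for v in r)
        cov = sum((v - mr) * (w - ms) for v, w in zip(r, s))
        a = cov / var if var > 0 else 1.0
        params.append((a, ms - a * mr))
    return params

def apply_dt(params, x):
    """Transform real data into pseudo-simulation data."""
    return [[a * v + b for (a, b), v in zip(params, row)] for row in x]

def predict_1nn(x_train, t_train, x):
    """Stand-in SSL model f_SS: label of the nearest training sample."""
    preds = []
    for row in x:
        d2 = [sum((p - q) ** 2 for p, q in zip(row, ref)) for ref in x_train]
        preds.append(t_train[d2.index(min(d2))])
    return preds

def accuracy(pred, true):
    return sum(p == t for p, t in zip(pred, true)) / len(true)

rng = random.Random(0)

def sim_sample(k):
    # Toy "simulation" feature vector for source class k.
    return [k + rng.gauss(0, 0.05), 2 * k + rng.gauss(0, 0.05)]

def to_real(x):
    # Hypothetical domain shift between simulation and measurement.
    return [3.0 - x[0], 6.0 - x[1]]

# Simulation domain: (X_S, T_S).
T_S = [k for k in range(4) for _ in range(20)]
X_S = [sim_sample(k) for k in T_S]

# Real domain with paired simulation data, split so that N_R = N_R(T) + N_R(V).
T_R = [k for k in range(4) for _ in range(4)]
X_S_pair = [sim_sample(k) for k in T_R]
X_R = [to_real(x) for x in X_S_pair]
idx = list(range(len(X_R)))
rng.shuffle(idx)
train_idx, test_idx = idx[:8], idx[8:]

# (a) Train h_RS on paired data; (b) transform the real test data.
h_RS = fit_affine_dt([X_R[i] for i in train_idx], [X_S_pair[i] for i in train_idx])
X_R_V = [X_R[i] for i in test_idx]
T_R_V = [T_R[i] for i in test_idx]
X_RS_V = apply_dt(h_RS, X_R_V)

# (c)-(d) The SSL model built on simulation data predicts from the transformed data.
acc_without_dt = accuracy(predict_1nn(X_S, T_S, X_R_V), T_R_V)
acc_with_dt = accuracy(predict_1nn(X_S, T_S, X_RS_V), T_R_V)
print(acc_without_dt, acc_with_dt)
```

Because the toy shift is exactly affine, the stand-in DT model recovers it from the paired training data; the real FSA data do not have this property, which is why nonlinear DT models are considered in this study.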

IV. DATASETS OF SIMULATION AND ACTUAL EXPERIMENTAL DATA
A situation is assumed in which the acoustic excitation from "a single sound source" within the structure is measured using three accelerometers mounted on the outer surface of the structure. The frequency spectra of the accelerometers (FSA) are used as the observation data. The subject is an acrylic box as shown in Fig. 4. Fig. 4(a) shows the simulation domain and Fig. 4(b) shows the real domain. The datasets are FSAs observed by three accelerometers on the outer surface of the structure paired with the sound source position labels. These datasets are collected from both domains at the same sensor positions (Fig. 4(c)). The acoustic volume is 400 × 400 × 400 mm³ and the thickness of the acrylic box is 3 mm.
The simulation conditions are listed in Table I. The simulation data are generated from a coupled acoustic-structure analysis using FEM. The FEM solver is a full-harmonic analysis in ANSYS Mechanical [22]. Equation (1) is solved using the FEM solver. The sound source positions in the simulation are sampled at intervals of 50 mm, giving 512 sound source points. The frequency range is 0.01-1.5 kHz, and the frequency increment of the data is 10 Hz.
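As a quick consistency check of these conditions, the sketch below counts the frequency bins and sound source points. The placement of the grid at cell centers (25, 75, ..., 375 mm) is an assumption made only for illustration; the actual coordinates are those given in Table I.

```python
def frequency_bins(f_min_hz=10, f_max_hz=1500, step_hz=10):
    """Frequencies from 0.01 kHz to 1.5 kHz in 10 Hz increments."""
    return list(range(f_min_hz, f_max_hz + 1, step_hz))

def source_grid(side_mm=400, interval_mm=50):
    """Cubic grid of source points (assumed to sit at cell centers)."""
    n = side_mm // interval_mm
    axis = [interval_mm / 2 + i * interval_mm for i in range(n)]
    return [(x, y, z) for x in axis for y in axis for z in axis]

freqs = frequency_bins()
grid = source_grid()
print(len(freqs), len(grid))  # 150 512
```

The 150 frequency bins and three accelerometers are consistent with the (150, 3, 1) input size quoted for the DCAE in Section VI, and 8 points per axis at 50 mm spacing give the 512 source points.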
The experimental conditions are shown in Fig. 5. In the actual experiment, one loudspeaker (Visaton FRS 7) is placed inside the acrylic box as the sound source. The acoustic excitation of the structure is measured using three acceleration sensors (Ono Sokki Co. Ltd. NP-3211) installed on the outer surface of the structure. The sound waves of the sweep signal are generated by the loudspeaker via a sound card (Fireface UCX) and a loudspeaker amplifier (LP-2024-A +). The bottom of the structure and the loudspeaker are covered with anti-vibration sheets to reduce structure-borne sound. The experimental conditions are listed in Table II. Defining an array of data in this manner results in two-dimensional (2-D) data. The input and label data for the DT model are FSAs. The input data for the SSL model are the FSA, and the label data change depending on the problem [5]. In the case of the classification problem, the task is to estimate which of the eight regions of the acoustic space contains the sound source. In the case of the regression problem, the task is to predict the X, Y, and Z coordinates.

V. EXPERIMENTAL SETUP USING SSL MODELS
SSL performance is tested by feeding the transformed (pseudo-simulation) data into the SSL model. The SSL model conditions are listed in Table III. A DNN is used when the input data are vectors, and a CNN is used when the input data are images. The optimization, preprocessing, and metrics are the same as those of the DT model. In the case of the classification problem, the total acoustic space is divided into eight acoustic subvolumes, and each subvolume is labeled according to the 1-of-K coding scheme. In the case of the regression problem, the X-, Y-, and Z-coordinates are directly defined as label data. The accuracy (Acc.) shown in (6) is used to evaluate the classification problem, and the RMSE shown in (7) is used to evaluate the regression problem.

$$
\mathrm{Acc.} = \frac{\text{The number of correct answers}}{N_{RS(V)}}
\tag{6}
$$
where $N_{RS(V)}$ denotes the number of transformed test data points. The label data consider the X-, Y-, and Z-coordinates; hence, the RMSE is expressed by (7). The percentage of the actual experimental data used as the training data for the DT model varies from 20 to 80% of the conditions, and the SSL performance is measured in each case. The SSL performance for real data is tested by applying the SSL model to pseudo-simulation data, which the DT model produces by transforming the real data.

Fig. 6. DT model using the AE. The AE uses vector data as input and output data. Input data is the real data and output is the pseudo-simulation data.
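The labeling and the two metrics of this section can be sketched as follows. The split of the 400 mm cube at its mid-planes, the numbering of the eight subvolumes, and the RMSE taken over the Euclidean position error are assumptions made for illustration; they are one plausible reading of the setup, not necessarily the definitions behind (6) and (7).

```python
import math

def octant_label(pos_mm, side_mm=400.0):
    """Index 0-7 of the subvolume containing pos (assumed split at the mid-planes)."""
    half = side_mm / 2
    x, y, z = pos_mm
    return int(x >= half) + 2 * int(y >= half) + 4 * int(z >= half)

def one_of_k(label, k=8):
    """1-of-K (one-hot) coding of a class index."""
    return [1 if i == label else 0 for i in range(k)]

def accuracy(pred, true):
    """Fraction of correctly classified test samples."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def rmse_xyz(pred, true):
    """Root mean square of the Euclidean position error over all test samples."""
    sq = [sum((p - t) ** 2 for p, t in zip(pp, tt)) for pp, tt in zip(pred, true)]
    return math.sqrt(sum(sq) / len(sq))

print(octant_label((25, 25, 25)), octant_label((375, 375, 375)))
print(one_of_k(3))
print(accuracy([0, 1, 2, 3], [0, 1, 2, 0]))
print(rmse_xyz([(0.0, 0.0, 0.0)], [(30.0, 40.0, 0.0)]))
```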

VI. EXPERIMENTAL SETUP USING DT MODELS
The effectiveness of the proposed method is tested by evaluating the transformation performance of the DT model. In this study, the DT model is selected from two learning approaches: an encoder-decoder model and a generative model. The encoder-decoder model is built as an autoencoder (AE) [23] or a deep convolutional autoencoder (DCAE) [24]. The generative model is built using pix2pix [25] based on conditional generative adversarial nets (cGAN) [26]. cGAN is a model that allows generative adversarial nets (GAN) [27] to use conditional probabilities. The transfer performance of these models is evaluated using the root mean square error (RMSE) and t-distributed Stochastic Neighbor Embedding (t-SNE) [28] distributions.

A. Encoder-Decoder Models
The input and output data for the AE and DCAE are vector and image data, respectively, and both DT models convert real data into pseudo-simulation data. Figs. 6 and 7 show an overview of the AE and DCAE. Both models are given FSAs of the real domain as the input and FSAs of the simulation domain as the label. The difference between these models is whether a fully connected layer or a convolutional layer is used. The fully connected layer performs its task by extracting features through linear summation over the input data and mapping them with nonlinear activation functions. Therefore, it is not guaranteed to be equivariant or invariant [29], and it is not robust to either frequency peaks [30]. In addition, subsampling makes the convolutional layer robust against feature misalignments in an image.
The conditions for the AE and DCAE are listed in Table IV. "F" represents fully connected layers, "C" represents convolutional layers, and "CT" represents convolutional transpose layers. When both the input and label data are vector data, the AE is adopted as the DT model. When both the input and label data are image data, the DCAE is adopted. Batch normalization [31] is applied between the layers of each model. In the DCAE encoder, the 2-D convolution layers are set to a (2, 3) kernel size, (1, 1) stride, and the same padding. With these hyperparameters, the first convolutional layer transforms data of size (150, 3, 1) into (150, 3, 400). In the DCAE decoder, the 2-D convolution transpose layers are set to a (2, 3) kernel size, (2, 1) stride, and the same padding. In the last decoder layer, 2-D cropping is applied because the reconstruction size differs from the desired size owing to the effect of the input data size. In [32], [33], the frequency-response data are normalized from zero to $2^{16} - 1$ after a hyperbolic tangent transformation. In this study, because the measured data are between 0 and 1, the only preprocessing step performed is min-max normalization for each dataset. Masking data augmentation is applied to build each model [34], [35].
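The preprocessing and augmentation steps can be sketched as follows. The min-max normalization follows the description above; the masking function is only one possible form of masking augmentation (a single random band of frequency bins set to a fill value) and is an assumption, since the exact scheme of [34], [35] is not restated here. The function names and the toy 150 × 3 array are hypothetical.

```python
import random

def min_max_normalize(data):
    """Scale a 2-D array (frequency bins x sensors) to [0, 1] over the whole dataset."""
    flat = [v for row in data for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in data]

def mask_frequency_band(fsa, max_width, rng, fill=0.0):
    """Return a copy with one random contiguous band of frequency bins set to `fill`."""
    n = len(fsa)
    width = rng.randint(0, max_width)
    start = rng.randint(0, n - width)
    return [[fill] * len(row) if start <= i < start + width else list(row)
            for i, row in enumerate(fsa)]

# Toy FSA with 150 frequency bins and 3 sensors (strictly positive values).
fsa = [[float(i + 1), float(i + 2), float(i + 3)] for i in range(150)]
normalized = min_max_normalize(fsa)
masked = mask_frequency_band(fsa, max_width=10, rng=random.Random(1))
n_masked_rows = sum(m != o for m, o in zip(masked, fsa))
print(len(masked), n_masked_rows)
```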

B. Generative Models
An overview of pix2pix is presented in Fig. 8. Pix2pix is a type of cGAN, which is the conditional model of GAN. Its structure comprises a generator composed of a U-Net [36] and a discriminator composed of a PatchGAN. Both the generator and discriminator use Convolution-BatchNorm-ReLU modules. The goal of this model, similar to the encoder-decoder model, is to generate pseudo-simulation data that are similar to the simulated data. In pix2pix, two images are paired and trained using adversarial training. Through adversarial learning, a generator can generate fake images that appear as true images. That is, the generator is responsible for transforming the real data into fake simulation data that are as close as possible to the true simulation data. In contrast, the discriminator receives either the concatenated true simulation data and real data or the concatenated fake simulation data produced by the generator and real data.
The pix2pix generator generates an image $G(x, z)$ from image $x$ and noise $z$ such that the discriminator cannot distinguish between true and fake images. The pix2pix discriminator takes a pair of images, $x$ and the conditional $y$ (or $G(x, z)$), as input and determines whether the image is fake or authentic. The loss of binary classification is given by
where "G" represents the generator and "D" represents the discriminator. "G" attempts to maximize this loss while "D" minimizes it in the adversarial training manner. The L1 distance is adopted for the generator loss.
Consequently, the final loss function is.
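For orientation, the two loss terms described above can be sketched numerically. The sketch follows a common way of implementing the pix2pix objective of [25], in which the discriminator minimizes a binary cross-entropy and the generator minimizes a cross-entropy toward the "true" label plus a weighted L1 distance; this sign convention, the weight `lam = 100`, and the scalar toy inputs are assumptions and may differ from the formulation used in this study.

```python
import math

def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    return -(target * math.log(prob + eps) + (1 - target) * math.log(1 - prob + eps))

def discriminator_loss(d_true_pairs, d_fake_pairs):
    """Mean BCE with true pairs labeled 1 and generated pairs labeled 0."""
    true_term = sum(bce(p, 1.0) for p in d_true_pairs) / len(d_true_pairs)
    fake_term = sum(bce(p, 0.0) for p in d_fake_pairs) / len(d_fake_pairs)
    return true_term + fake_term

def generator_loss(d_fake_pairs, generated, target, lam=100.0):
    """Adversarial term (make D output 1 on generated pairs) plus weighted L1 distance."""
    adv = sum(bce(p, 1.0) for p in d_fake_pairs) / len(d_fake_pairs)
    l1 = sum(abs(g - t) for g, t in zip(generated, target)) / len(target)
    return adv + lam * l1

# An undecided discriminator (output 0.5 everywhere) and a perfect reconstruction.
d_loss = discriminator_loss([0.5, 0.5], [0.5, 0.5])
g_loss = generator_loss([0.5, 0.5], generated=[0.2, 0.8], target=[0.2, 0.8])
print(d_loss, g_loss)
```

The L1 term penalizes pointwise deviations in the same way as the reconstruction losses of the AE and DCAE, whereas the adversarial term depends only on the discriminator's true-or-fake decision.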
Note that the loss function of the discriminator uses the binary classification loss. In other words, the loss function is adapted to determine only whether the image generated by the generator is fake or authentic, and it is not a human-designed loss function such as those used for the AE or DCAE. Generally, it is difficult to design a loss function that best represents a dataset. For example, the MSE cannot be used to evaluate the resonant frequency deviations, as is clear from its definition.
The conditions for pix2pix are listed in Table V. The pix2pix discriminator takes the form of a PatchGAN. The receptive field of the input data for one pixel of the output is a 6×3 patch. The pix2pix generator is in the form of a U-Net, and its structure is the same as that in [25]. As with the AE and DCAE, min-max normalization is applied for preprocessing, followed by the pix2pix-specific normalization to [-1, 1]. This process is based on the tanh activation function of the generator. In the discriminator and generator, the 2-D convolution layers are set to a (2, 3) kernel size, (1, 1) stride, and the same padding. Furthermore, the noise z is provided in the form of dropout. The AE, DCAE, and pix2pix transfer performances are not only evaluated by the RMSE but also visualized by t-SNE.
Note that the paired data are determined based on the X-, Y-, and Z-coordinates. As shown in Tables I and II, the number of sound source positions is 512 in the simulation and 64 in the real environment. Each real sample is paired with the simulation sample whose X-, Y-, and Z-coordinates are the same as or closest to its own.
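The pairing rule can be sketched as a nearest-neighbor search over the source coordinates. The grid layouts below (8 × 8 × 8 simulation points and 4 × 4 × 4 real points placed at cell centers of the 400 mm cube) are hypothetical; the actual positions are those of Tables I and II.

```python
import math

def nearest_simulation_index(real_pos, sim_positions):
    """Index of the simulation source position closest to a real source position."""
    d2 = [sum((r - s) ** 2 for r, s in zip(real_pos, p)) for p in sim_positions]
    return d2.index(min(d2))  # ties are resolved in favor of the first candidate

def axis(n, side_mm=400.0):
    """n equally spaced coordinates at cell centers along one side (assumed layout)."""
    step = side_mm / n
    return [step / 2 + i * step for i in range(n)]

sim_positions = [(x, y, z) for x in axis(8) for y in axis(8) for z in axis(8)]
real_positions = [(x, y, z) for x in axis(4) for y in axis(4) for z in axis(4)]

pairs = [(i, nearest_simulation_index(p, sim_positions))
         for i, p in enumerate(real_positions)]
max_gap_mm = max(math.dist(real_positions[i], sim_positions[j]) for i, j in pairs)
print(len(sim_positions), len(pairs), round(max_gap_mm, 1))
```

With these assumed layouts, every real position has a simulation partner within one simulation sampling interval, so each real FSA can be given a simulation FSA as its label for training the DT model.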

VII. RESULTS AND DISCUSSION
First, the transformation performance of the DT model is described. AE, DCAE, and pix2pix are selected as DT models, and their transformation performance is visualized in two dimensions with t-SNE and quantified using the RMSE. Then, we describe the SSL performance. Here, we show the relationship between the amount of semi-supervised data and the SSL performance for each of the classification and regression problems. Finally, the SSL performance of the proposed method is compared with that of the conventional method and the non-adaptation case (from our previous work).
Fig. 9 shows an example of data transformation by AE. The red, blue, and green solid lines represent real, simulated, and transformed data, respectively. Fig. 9(a) and (b) show the training and testing data, respectively. This figure shows that the resonance frequencies of the transformed data are shifted so that the transformed data become closer to the simulation data. To evaluate the transformation performance of each model quantitatively, the RMSE values are shown in Fig. 10. Each solid line in the figure represents the RMSE for AE, DCAE, and pix2pix for each of the training and test datasets. This result indicates that pix2pix's transformation performance is worse than DCAE's for the training data. Furthermore, the RMSE of the test data is the largest for pix2pix, and the RMSE does not seem to decrease as the amount of semi-supervised data increases. However, the visualization of t-SNE leads to different conclusions regarding the transformation performance. Fig. 11(a) shows the visualization of t-SNE for the transformation performance of the AE. The red, blue, and green plots represent real, simulated, and transformed data, respectively. The numbers in the plots represent the classes in the classification problem. The number of classes in the classification problem is eight; thus, the classes are numbered 0-7. The subscripts "T" and "V" represent training and test data, respectively.
Clearly, visualization by t-SNE shows that domain matching by AE is not possible. This can be understood from the fact that AE cannot learn the local features of the data, and the transformed data have negative values. Fig. 11(b) shows the t-SNE visualization of the data transformed using DCAE. This distribution shows that the transformation by DCAE enables domain matching in most of the data (most of the plots overlap where the domains are well matched; thus, a zoomed-in view is shown in the figure). Furthermore, the transformation by pix2pix appears to match better (Fig. 11(c)). Although evaluation using the RMSE is useful because of its quantitative aspect, the RMSE is not necessarily an appropriate measure of transformation performance. As the RMSE cannot evaluate the amount of frequency misalignment, it is inappropriate as an indicator for evaluating the characteristics of the FSA. Similarly, the MSE set as the loss function for DCAE is also considered inappropriate. Measures of the similarity between two resonance frequencies are summarized in [37]. A method that uses these metrics as loss functions should also be considered.
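The limitation of the RMSE noted above can be illustrated with a toy spectrum. In the sketch below, a single triangular resonance peak (a hypothetical shape, not measured data) is shifted by 5 bins and by 50 bins; once the peaks no longer overlap, the RMSE is identical for both shifts, and a spectrum with no peak at all scores better than either.

```python
import math

def peak_spectrum(n_bins, peak_bin, width=2):
    """Toy spectrum: a single triangular resonance peak of unit height."""
    return [max(0.0, 1.0 - abs(i - peak_bin) / width) for i in range(n_bins)]

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

reference = peak_spectrum(150, 50)
shift_small = peak_spectrum(150, 55)    # resonance misaligned by 5 bins
shift_large = peak_spectrum(150, 100)   # resonance misaligned by 50 bins
flat = [0.0] * 150                      # no resonance at all

rmse_small = rmse(reference, shift_small)
rmse_large = rmse(reference, shift_large)
rmse_flat = rmse(reference, flat)
print(rmse_small == rmse_large, rmse_flat < rmse_small)
```

A pointwise error therefore gives no indication of how far a resonance has moved, which is consistent with the disagreement between the RMSE ranking and the t-SNE distributions.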

B. SSL Performance With SSL Model
The performance of the SSL model in terms of classification and regression is shown in Fig. 12. The training data given to the model are all of the simulation data and the semi-supervised real data; the validation data are the semi-supervised real data, and the test data are the real data held out for testing. Although the validation data should be unknown, in this case, we use real training data to focus on learning domain-specific data in more detail. We use validation data to check for model overfitting on the simulation data because the simulation data are always larger than the real data. For the classification problem (Fig. 12(a)), all models are nearly 100% accurate for the training data but differ in accuracy for the validation and test data. The accuracies of CNN-DCAE and CNN-pix2pix for the validation data are 100% consistent. However, the accuracy of DNN-AE for the validation data is unstable. The SSL performance for the test data is similar for CNN-pix2pix and CNN-DCAE. In this dataset, CNN-pix2pix achieves the highest accuracy, and DNN-AE achieves the lowest, which is below the baseline. The baseline is the probability that a subvolume is selected for a random selection of eight classes. In this case, it is 20%. In the CNN-pix2pix and CNN-DCAE cases, the accuracy improves as the amount of semi-supervised data increases and exceeds the baseline in all conditions. In particular, pix2pix shows an accuracy of approximately 70% for a semi-supervised data rate of 80%. The accuracy without the DT model is 12% [5]; therefore, the improvement in accuracy owing to the DT model is approximately 58 percentage points.
In the regression problem (Fig. 12(b)), the RMSE of DNN-AE for the training data is smaller than the spatial sampling of the sound sources in the simulation domain, which indicates high performance, but the RMSE is poor for the validation and test data. This result is similar to that obtained for the classification problem. The RMSEs of CNN-DCAE and CNN-pix2pix for the training and validation data are approximately equal to the spatial sampling of the sound source in the simulation domain. The RMSEs for the test data are almost identical for CNN-DCAE and CNN-pix2pix. In this dataset, CNN-pix2pix has the lowest value. In addition, these values are close to the spatial sampling of the real domain. The RMSE without the DT model is 142 mm [5], whereas the performance improves to 100 mm when pix2pix is applied. These results indicate that the regression model is still underperforming and could require tuning of the structure and hyperparameters of the CNN. However, tuning the model is inefficient, and it is considered more important for the DT model to directly learn the labels of the SSL model.
Table VI compares the SSL performance of the proposed method, the conventional method, and the non-adaptation case. CC in the table denotes the conventional method (cross-correlation method), and non-adaptation is the result reported in our previous paper. The results of the proposed method are for a semi-supervised data ratio of 80%. The adaptation models are effective in both tasks: CNN-pix2pix for the classification problem and CNN-DCAE for the regression problem. In our previous work, the SSL model could not be trained successfully in an experiment that used a small amount of real-world data without data augmentation, resulting in poor performance. However, the use of the DT model improves the SSL performance without applying data augmentation to the real-environment data.
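For reference, an RMSE in millimetres for the regression problem can be computed as in the following minimal sketch. It assumes that the error of each sample is the Euclidean distance between the predicted and true source positions; the exact definition used for Fig. 12(b) and Table VI is not restated here, and the function name and coordinates are hypothetical.

```python
import math

def localization_rmse(predicted_mm, true_mm):
    """RMSE of the Euclidean position error over all samples, in millimetres.

    Assumption: the per-sample error is the Euclidean distance between
    the predicted and true (x, y, z) positions.
    """
    if len(predicted_mm) != len(true_mm) or not true_mm:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        sum((p - t) ** 2 for p, t in zip(pred, true))
        for pred, true in zip(predicted_mm, true_mm)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical positions in millimetres (not measured data).
true_positions = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
predictions = [(30.0, 40.0, 0.0), (100.0, 0.0, 50.0)]
error_mm = localization_rmse(predictions, true_positions)
print(f"RMSE = {error_mm:.1f} mm")
```

Comparing such a value with the spatial sampling of the sound sources gives a rough indication of whether the regression resolves positions more finely than the sampling grid.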

VIII. CONCLUSION
The proposed method of transforming real data into pseudo-simulation data using the DT model improves the performance of the SSL inside the structure. The 2-D distribution of t-SNE indicates that DCAE and pix2pix exhibit better transfer performance than AE for the FSA of the exterior of the structure. Both models incorporate convolution-based layers, which are superior to fully connected layers because they enable the learning of local features. This can be understood from the stability of the learning curves of both models. For the SSL classification problem, the accuracy is below the baseline when AE is used as the DT model and above the baseline when DCAE or pix2pix is used. In the test on the dataset used in this study, the overall accuracy is higher when pix2pix is used than when DCAE is used. This seems to be because pix2pix uses binary cross-entropy as the loss function, whereas DCAE uses the MSE. The MSE cannot evaluate frequency response deviations, whereas binary cross-entropy is a simple metric that discriminates between true and fake values. In particular, pix2pix shows an accuracy of approximately 70% for a semi-supervised data rate of 80%. The accuracy without the DT model is 12%; therefore, the improvement in accuracy with the DT model is approximately 58 percentage points. Similarly, for the regression problem, the RMSE is lower when DCAE or pix2pix is used than when AE is used. The RMSE without the DT model is 142 mm, whereas the performance improves to 100 mm when the DT model is applied. However, the RMSE of the SSL model for the training data with both transformations is approximately equal to the spatial sampling of the sound source in the simulation domain and requires further improvement. This indicates that the CNN structure and hyperparameters must be tuned.
Going forward, we aim to build a model that combines the SSL and DT models. The proposed method separates the SSL model from the DT model, and the DT model cannot directly learn the discriminative bounds needed to solve the SSL problem. In other words, the DT model does not directly learn the domain transformations that are important for the SSL. Therefore, we aim to construct an SSL method that utilizes the discriminator of the GAN. Furthermore, we plan to develop a method for SSL under unsupervised conditions.

APPENDIX
STABILITY OF LEARNING CURVE FOR EACH MODEL
Fig. 13 shows the learning curve of AE for each semi-supervised condition. Fig. 13(a) and (b) show the learning curves up to 100 and 500 epochs, respectively. This figure shows that the loss for the training data tends to converge faster as the amount of semi-supervised data increases. However, the loss for the test data oscillates significantly. Even after 500 epochs, the loss does not converge, and the amplitude of the oscillations is larger. This indicates that the transformation by AE fails to learn the features. In contrast, the learning curve of DCAE is stable and, unlike that of AE, shows no oscillation of the loss (Fig. 14). Similar to AE, the number of epochs required for convergence tends to decrease as the amount of training data increases. These results demonstrate that DCAE using convolutional layers is effective for domain transformations. Fig. 15 shows the progress of the respective losses of the generator and discriminator in adversarial training. The solid blue line is the loss of the generator, and "G" represents the generator. The solid red line represents the loss of the discriminator, and its value is expressed as the average over the true and fake data. "D" represents the discriminator. During training for up to 500 epochs, both losses oscillate and converge. After 500 epochs, the losses of both models are close to each other, indicating that these models are in equilibrium (Fig. 15(a)).
The loss of the test data is lower in the discriminator than in the generator, and the convergence is approximately 600 epochs. No mode collapse is observed in the output image of the generator.
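The discriminator loss expressed as an average over the true and fake data, together with the corresponding adversarial loss of the generator, can be sketched as follows. This is a minimal illustration that assumes a standard binary cross-entropy formulation; the function names are hypothetical, and any additional reconstruction term in the generator loss of pix2pix is deliberately omitted.

```python
import math

def binary_cross_entropy(probabilities, target):
    """Mean binary cross-entropy of discriminator outputs against one target label."""
    eps = 1e-12  # clipping avoids log(0)
    total = 0.0
    for p in probabilities:
        p = min(max(p, eps), 1.0 - eps)
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total / len(probabilities)

def discriminator_loss(d_on_true, d_on_fake):
    """Average of the losses on true data (label 1) and fake data (label 0)."""
    return 0.5 * (binary_cross_entropy(d_on_true, 1.0)
                  + binary_cross_entropy(d_on_fake, 0.0))

def generator_adversarial_loss(d_on_fake):
    """Adversarial part only: the generator wants fake data to be labeled as true."""
    return binary_cross_entropy(d_on_fake, 1.0)

# Hypothetical case in which the discriminator cannot tell true from fake data.
undecided = [0.5, 0.5, 0.5]
d_loss = discriminator_loss(undecided, undecided)
g_loss = generator_adversarial_loss(undecided)
print(f"D loss = {d_loss:.4f}, G loss = {g_loss:.4f}")
```

When the discriminator outputs 0.5 for every sample, both losses in this sketch equal ln 2, which is one way in which the two loss curves can approach each other near equilibrium.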