DNN-Based Indoor Localization Under Limited Dataset Using GANs and Semi-Supervised Learning

Indoor localization techniques based on supervised learning deliver high accuracy while maintaining low online complexity. However, such systems require massive amounts of data for offline training, which necessitates costly measurements. This paper provides solutions for two cases of missing data: unlabeled data that are available, and unlabeled data that are missing. In both cases, we rely on a small set of labeled data, which is costly to collect yet insufficient to achieve a high localization accuracy. To address the case of available unlabeled data, we propose a weighted semi-supervised DNN-based indoor localization approach that uses pseudo-labeling to combine real labeled samples with inexpensive pseudo-labeled samples, boosting localization accuracy without the high cost of collecting additional labeled data. For the extreme case of unavailable unlabeled data, we propose an alternative localization system, named 'Weighted GAN based indoor localization', that generates fake fingerprints using generative adversarial networks (GANs). A deep neural network is then trained on a mixed dataset containing both real collected and fake generated data samples, using a similar weighting technique in order to improve location prediction performance and avoid overfitting. In terms of localization accuracy, our proposed approaches outperform conventional supervised localization schemes using the same collection of real labeled samples. We have tested our proposed methods on both simulated data and experimental data from the publicly available UJIIndoorLoc database, which is built to test indoor positioning systems relying on Wi-Fi fingerprints.
On experimental data, the weighted semi-supervised and weighted-GAN approaches increase the localization accuracy by $10.11~\%$ and $8.53~\%$, respectively, compared to the classical supervised learning method using the same set of labeled collected data.


I. INTRODUCTION
Several localization-based applications have been proposed in mobile communication systems [1], [2], such as enhanced emergency localization, personal navigation, and social networking. Different localization methods have been developed for 5G and, more specifically, for internet of things (IoT) applications [3], where it is imperative for users to receive autonomous and accurate navigation services in challenging surroundings. In the near future, larger bandwidths and higher frequencies will be provided by beyond 5G systems, offering enhanced opportunities for achieving accurate location information [4], [5]. Traditional indoor localization systems [6] are mainly based on geometric and fingerprinting-based methods. Using such methods, the localization accuracy is heavily affected by geometric constraints introduced by multipath propagation, and localization can be costly in terms of time and energy. As an alternative to traditional schemes, machine learning (ML)-based techniques shift the online complexity to an offline phase [7]-[10]. The ML-based localization model trained offline is stored and used online to predict the location information with low computational complexity.
The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues.
In particular, deep learning (DL) methods already provide a variety of advanced localization systems with high accuracy [11], [12]. However, they are data hungry, requiring large labeled training databases. To overcome this problem, recent approaches have been developed based on semi-supervised learning, which combines a small number of expensive labeled data with a large amount of inexpensive unlabeled data to improve and refine a fully supervised solution without the cost of collecting additional labeled data [13], [14]. Pseudo-labels are predicted and associated to unlabeled data in order to provide additional training information and enlarge the training dataset. However, unlabeled data may not always be available. For this scenario, where only a small amount of labeled data is available, data augmentation can be used to extend the training database. Accordingly, new fake data are generated to supplement real collected data in order to enhance the model training and boost the localization accuracy while reducing measurement time and human effort.

A. RELATED WORK
A generic class of fingerprinting methods [15], [16] relies on the spatial-temporal characteristics of the various fingerprints, thus exploiting delay and angular profiles of multipaths at given positions. Although such an approach maximizes discrimination between different positions, it requires proper antenna array calibration, as well as accurate timing synchronization, in order to build a reliable database for online use. The works in [17], [18] propose RSSI fingerprints as data to train an ML model. Even though these works rely on RSSI fingerprints, they are fully supervised (e.g., [17] uses kNN and Random Forests). In addition, the main essence of [17] and [18] is the extraction of important features from RSSI data via Principal Component Analysis and Kernel Direct Discriminant Analysis, respectively. Furthermore, the work in [19] focuses on room-level localization instead of accurate coordinate-level localization, where 6 classifiers were used to locate the user among 4 different rooms. The cited methods did not address the problem of insufficient data available for localization. The work in [20] relies on a matrix completion method to construct complete training maps from available RSSI fingerprints. Although [20] fills in missing data, the solution cannot leverage unlabeled data that are easily available. Moreover, at runtime, the method in [20] must solve a convex optimization problem, which is deemed heavy for online applications.
As already mentioned, semi-supervised learning has been used to overcome the problem of the large amount of training data needed in classical supervised learning. To adapt the semi-supervised context to the localization field, manifold-based models, such as manifold learning [21] and manifold alignment [22], are combined with graph-based methods [23]. Classical semi-supervised methods use a supervised model trained on a small amount of labeled data in order to predict unknown labels, referred to as pseudo-labels. The resulting pseudo-labeled samples are subsequently used to enlarge the labeled dataset, i.e., to provide additional training information, in order to build a more general model. Pseudo-labels can be determined by solving optimization problems as described in [24]-[27] or by applying a supervised DL model trained on the labeled data as described in [28]. However, pseudo-labeled data can penalize the performance if they are not introduced to the training in an appropriate way. Also, the predicted labels may be noisy or may not really reflect the ground truth. Therefore, it is desirable to limit the reliance on pseudo-labeled data. Such a principle has been applied to the classification problem of handwritten digit recognition in [29]. Collecting unlabeled data is less expensive than collecting labeled data, but unlabeled data are not always available. In such a case, data augmentation can be used to extend the training database.
Generative models have shown a good ability to generate additional realistic samples. In particular, generative adversarial networks (GANs) [30] aim at expanding and improving the diversity of a training database in different research fields including indoor localization. In [31], GANs use both labeled and unlabeled data when the former is insufficient in order to share weights with a localization classifier to benefit from useful information contained in the latter. In [32], [33], GANs are used to improve the diversity of the collected database generating fake received signal strength indicators (RSSIs) at known positions already used for data collection. Regarding the difficulty in collecting signal measurements under indoor space constraints, the authors in [34] propose to generate artificial data for the constrained space based on measured data collected in the free space. Thus, GANs aim to enhance the richness of a collected database to cover some regions where data collection is strenuous.

B. CONTRIBUTIONS
In this paper, we propose two solutions to deal with the insufficiency of collected labeled data for optimally training a localization model. The first solution is based on weighted semi-supervised learning combining labeled and unlabeled data, and the second one explores data augmentation using GANs based on the collected labeled data only. In Table 1, we compare existing works that address limited data for indoor localization with our proposed schemes. Our contributions are summarized as follows:
• We propose a localization method called 'weighted semi-supervised' that combines labeled and unlabeled collected data in order to boost the localization performance. First, deep neural network (DNN)-based pseudo-labeling is used to generate pseudo-labels for unlabeled data using a supervised scenario with labeled training data. Then, real labeled and pseudo-labeled data are mixed together to train a generic localization model, using a coefficient weight that up-weights the most confident samples to increase their impact on the training of the generic model.
• We propose a localization system 'W-GAN' generating fake supplementary data based on labeled data only. Unlike previous works, we do not assume the use of unlabeled data to enhance the training of the model as in [31], [35], and we do not assume having sufficient collected data in some regions as in [34]. Moreover, we do not use GANs to further diversify signal measurements at known positions, as considered in [32], [33]. Instead, our approach generates RSSI measurements and their corresponding new positions to cover new areas. Furthermore, we do not use GANs to generate fake RSSI vectors to be pseudo-labeled later, as done in our previous work [36]. Such a labeling process increases the computational complexity of the whole system, and the error in pseudo-label prediction can lead to a loss of localization accuracy. In this paper, a GAN is used to produce artificial labeled RSSI vectors, i.e., both RSSI vectors and their corresponding coordinates. Then, the real collected and fake generated data samples are mixed to train a localization DNN employing coefficient weights to limit the impact of the least confident data samples, which are the GAN-generated samples, especially in the early stages of the training process.
The rest of this paper is organized as follows: The studied problem is defined and described in Section II. The proposed weighted semi-supervised based localization system is detailed in Section III. The weighted GAN-based localization system is then provided in Section IV. The obtained results are presented in Section V and Section VI based on both simulated and experimental data from the UJIIndoorLoc database [37]. We discuss the obtained performance in Section VII. Finally, conclusions are drawn in Section VIII.
VOLUME 10, 2022

II. PROBLEM DESCRIPTION
In this section, we briefly detail the classical indoor localization system based on RSSI fingerprints. This system consists of two main parts: a central unit (CU) connected to mobile users (MUs) through a network. The mobile users are equipped with different mobile devices to ensure the diversity of the database, given that device heterogeneity can cause signal diversity. They perform a site survey task to collect RSSI data from different access points (APs) in the indoor environment. Then, RSSI fingerprints, which are composed of collected data associated with the coordinates of the corresponding mobile user, are transferred to the CU for the construction and storage of the RSSI-fingerprint database. The CU also determines the location of a user node (UN) given a received RSSI vector, and sends back the estimated coordinates to the user. The localization can be performed by solving a linear equation or an optimization problem [38], or using DL techniques as considered in this work.
A classical supervised DNN system is composed of an offline phase and an online phase. In the offline phase, a trained model is constructed and validated based on an exhaustive set of data. In our case, the measured data are divided into training data and validation data to train and validate the DNN model. The collected RSSI vectors are the inputs of our DNN, and the associated user location information (a room ID, a floor ID, a zone identifier, 2-D / 3-D coordinates, etc.) is used as the output for training. Thus, the collected data is used by the CU for localization without any required pre-processing task. Once the DNN model is trained and well optimized, the system is able to localize a given UN, taking an RSSI vector as input and giving the estimated coordinates as output. In order to achieve a good localization accuracy, a large amount of labeled data samples is required to construct an efficiently trained localization model. However, the acquisition of labeled RSSI vectors is a repetitive task that is costly in time and effort. To solve this issue, a weighted semi-supervised indoor localization framework is first proposed in this work, which involves location estimation based on labeled and unlabeled data. This system treats mixed data to reduce the reliance on labeled data, which, unlike unlabeled data, are often costly to collect.
Although unlabeled data are cheaper to acquire than labeled data, their collection can still be expensive, and unlabeled data may not always be available. To address this issue, data augmentation based on GANs is proposed as a second approach in this work, in order to generate fake data which complement the real labeled collected data. In our previous work [36], a system combining selective GANs and semi-supervised learning is proposed to perform location prediction based on real collected data and selected fake pseudo-labeled data. That system generates RSSI vectors to be pseudo-labeled, from which we select the most realistic fake pseudo-labeled positions. In this paper, the second proposed localization system is based on a weighted augmentation process with GANs. It takes advantage of generating both RSSI vectors and their corresponding coordinates to be mixed with real collected measurements for localization. During this combination, a coefficient weight is associated to each measurement in order to reduce the reliance on the least realistic samples and increase the effect of the most realistic ones. This procedure is simpler than our previously proposed algorithm [36], as depicted in Figure 1, and it avoids the errors introduced by the pseudo-labeling process.
In this paper, we consider M APs placed in an indoor environment and mobile sensor nodes placed at known training positions. These nodes collect T RSSI measurements with respect to each reachable AP to alleviate time-varying RSSI fluctuations. Collected fingerprints, composed of RSSI vectors associated with the corresponding coordinates, are sent to a CU to be stored and used for localization. A DNN model, trained on the training database, is applied online in order to predict the user coordinates.

III. PROPOSED WEIGHTED SEMI-SUPERVISED DNN-BASED LOCALIZATION
As mentioned above, we propose in this paper two localization systems to address the insufficiency of collected labeled data needed to optimally train a localization model. In this section, we describe the first proposed system which exploits labeled and unlabeled collected data samples as depicted in Figure 2. The classical semi-supervised indoor localization [28] is first presented, then, a detailed description of the proposed weighted semi-supervised system is provided.

A. CLASSICAL SEMI-SUPERVISED INDOOR LOCALIZATION SYSTEM
In this part, we introduce semi-supervised learning to deal with collected unlabeled RSSI vectors so as to improve the localization accuracy.

1) DATA COLLECTION
RSSI measurements are collected during the offline phase. Collected RSSI vectors are considered as labeled fingerprints when associated to the corresponding location identifier. In this paper, we consider the user coordinates as the location identifier, called the label. To reduce the effort of data acquisition and environment labeling, a massive collection of unlabeled data can be performed by acquiring RSSI measurements while mobiles are moving in the covered indoor environment. Mobiles collect labeled data when placed at known positions and, when moving from one position to another, they keep collecting data, which are unlabeled. Thus, we consider an MU collecting RSSI data received from M APs at reference positions known by their labels, and collecting data while moving to construct the unlabeled database.
Let $P^t = [p_1^t, \ldots, p_m^t, \ldots, p_M^t]^T$ be the RSSI vector received by an MU at position $t$, where $p_m^t$ is the RSSI measurement received from the $m$-th AP with $m \in \{1, 2, \ldots, M\}$. The coordinates or labels associated to the vector $P^t$ are $C_r^t = [x_r^t, y_r^t]^T$ when collecting real labeled data, whereas no label is assigned to the RSSIs collected on-the-fly from massive measurements. For system evaluation purposes, we consider

2) PSEUDO-LABELING FOR INDOOR LOCALIZATION
Localization based on semi-supervised learning aims to address the costs and complexity of labeled data measurements as well as to avoid the over-fitting problem for DL-based localization, which can be caused by a limited number of labeled training data. The pseudo-labeling technique is a semi-supervised learning method widely used in the fields of text and image classification and recognition. Such techniques exploit both labeled and unlabeled data to improve supervised learning performance. The pseudo-labeling steps are summarized as follows:
• Step 1: Train the model on labeled RSSI fingerprints only, in a supervised way during $E_1$ epochs. For each input vector $P^t \in \mathbb{R}^{M \times 1}$, the associated output vector is $C_r^t \in \mathbb{R}^{2 \times 1}$.
• Step 2: Using the trained model, predict labels, i.e., pseudo-labels, for the unlabeled RSSI vectors.
• Step 3: A general model is then trained mixing labeled and pseudo-labeled vectors. This model is different from the model trained in step 1 and used in step 2.
To train such a general model, we use all collected measurements as inputs, and the outputs are the labels (real labels and predicted pseudo-labels). During the training process, mixing both labeled and pseudo-labeled data is crucial in order to achieve a good localization accuracy. The pseudo-labeling technique is simple to implement and gives promising results. However, it may result in gradual drifts and perform poorly if the prediction accuracy of pseudo-labels is low. In fact, we cannot give the same level of trust to pseudo-labeled data as we do to labeled ones. Therefore, it is worth limiting the reliance on pseudo-labeled data. Thus, we propose a process which automatically weights labels in order to down-weight the less promising ones and up-weight the more promising ones.
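The three pseudo-labeling steps can be sketched in a few lines. This is only an illustration under stated assumptions: a k-nearest-neighbor regressor is swapped in for the DNN so that the example stays self-contained, and the names `knn_predict` and `pseudo_label` are hypothetical, not taken from the paper.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Predict 2-D coordinates as the mean label of the k nearest RSSI vectors.

    Stand-in for the supervised DNN of step 1 (an assumption of this sketch).
    """
    ranked = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = ranked[:k]
    return [sum(train_y[i][d] for i in nearest) / len(nearest) for d in range(2)]

def pseudo_label(labeled_x, labeled_y, unlabeled_x, k=3):
    # Step 1: "train" on labeled fingerprints only (kNN simply stores them).
    # Step 2: predict a pseudo-label for every unlabeled RSSI vector.
    pseudo_y = [knn_predict(labeled_x, labeled_y, p, k) for p in unlabeled_x]
    # Step 3: build the mixed set on which a separate, general model is trained.
    mixed_x = labeled_x + unlabeled_x
    mixed_y = labeled_y + pseudo_y
    is_real = [True] * len(labeled_x) + [False] * len(unlabeled_x)
    return mixed_x, mixed_y, is_real

# Toy data: two labeled fingerprints (M = 2 APs) and one unlabeled vector.
labeled_x = [[-40.0, -70.0], [-70.0, -40.0]]
labeled_y = [[0.0, 0.0], [10.0, 10.0]]
mixed_x, mixed_y, is_real = pseudo_label(labeled_x, labeled_y, [[-41.0, -69.0]], k=1)
```

The `is_real` flags are kept so that a later training stage can give real and pseudo-labeled samples different weights.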

B. PROPOSED WEIGHTED SEMI-SUPERVISED SCHEMES
To limit the reliance on artificial pseudo-labels, two types of weights have been used to train a mixed model: a static weight and a variable weight.

1) STATIC WEIGHT
We assume that all pseudo-labeled data have the same fixed weight during the whole procedure of model fitting. This weight should not exceed the labeled weight since real labels are more reliable. Thus, we propose to update the losses calculated at different training steps using refined weights. For epoch $e$, the loss function $L_{sw}(e)$, which is a variant of a refined root mean square error (RMSE) [39], can be expressed as in (1), where $C_r^t$ and $C_p^t$ are the real and pseudo coordinates, and $\hat{C}_r^t$ and $\hat{C}_p^t$ are the corresponding estimated coordinates. Note that $T_r$ is the number of labeled samples and $T_p$ is the number of unlabeled samples. The total number of available samples is $T_1 = T_r + T_p$. Moreover, $\omega_r(e)$ and $\omega_p(e)$ stand for the static weights associated with the labeled and unlabeled samples, respectively. The weights are tuned through exhaustive experiments.
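Since the expression of (1) is not reproduced in this text, the following is only a minimal sketch of one plausible weighted RMSE. It assumes that the squared Euclidean errors of the real and pseudo-labeled samples are scaled by $\omega_r$ and $\omega_p$ and averaged over $T_1$ before taking the square root; the function name and argument layout are hypothetical.

```python
import math

def weighted_rmse(real_true, real_pred, pseudo_true, pseudo_pred, w_r=1.0, w_p=0.5):
    """Assumed form: sqrt((w_r * sum_real + w_p * sum_pseudo) / T_1)."""
    def sq_err(a, b):
        # Squared Euclidean distance between two coordinate vectors.
        return sum((u - v) ** 2 for u, v in zip(a, b))

    real_term = sum(sq_err(c, c_hat) for c, c_hat in zip(real_true, real_pred))
    pseudo_term = sum(sq_err(c, c_hat) for c, c_hat in zip(pseudo_true, pseudo_pred))
    t1 = len(real_true) + len(pseudo_true)  # T_1 = T_r + T_p
    return math.sqrt((w_r * real_term + w_p * pseudo_term) / t1)

# One real and one pseudo-labeled sample, each with a 5 m prediction error.
full_trust = weighted_rmse([[0, 0]], [[3, 4]], [[0, 0]], [[3, 4]], w_r=1.0, w_p=1.0)
half_trust = weighted_rmse([[0, 0]], [[3, 4]], [[0, 0]], [[3, 4]], w_r=1.0, w_p=0.5)
```

With equal weights this reduces to the ordinary RMSE over the mixed set, while a smaller $\omega_p$ shrinks the contribution of errors measured against pseudo-labels.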

2) VARIABLE WEIGHT
Here, the loss function $L_{vw}$ has the same form as $L_{sw}$ described in (1). However, we consider that the labeled weight $\omega_r(e)$ is normalized to 1, and we use a variable weight $\omega_p(e)$ associated with the pseudo-labels for an epoch $e$.
A proper calibration of $\omega_p(e)$ is required to benefit from unlabeled data without disturbing the training on labeled data, so as to ensure a reliable training performance. Furthermore, $\omega_p(e)$, which depends on the epoch $e$, is slowly increased from 0 to a final value $\omega$, as expressed in (2), where $E_2$ refers to a specific number of epochs. $\omega$ and $E_2$ are hyperparameters to be tuned experimentally. In this way, we introduce the pseudo-labels in the training and gradually increase their weights through the epochs. The value of $E_2$ can be greater than the total number of training epochs, provided that $\omega_p(e)$ does not exceed $\omega_r(e)$, which is equal to 1.
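A short sketch of such a schedule is given below. It assumes a linear ramp from 0 to $\omega$ over $E_2$ epochs followed by a constant value, which matches the description of a weight that grows slowly with the epoch; the function name is hypothetical.

```python
def pseudo_label_weight(epoch, omega, e2):
    """Assumed schedule: linear ramp from 0 to omega over e2 epochs, then constant."""
    if epoch >= e2:
        return omega
    return omega * epoch / e2

# Weight reached after 150 training epochs with omega = 3 and E_2 = 550.
final_weight = pseudo_label_weight(150, 3, 550)
```

With $\omega = 3$ and $E_2 = 550$, the values quoted in the simulation section, 150 training epochs give a final weight of about 0.818, in line with the reported growth from 0 to 0.81. It also illustrates why $E_2$ may exceed the number of training epochs: training stops before the weight passes 1.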

IV. PROPOSED WEIGHTED GAN BASED INDOOR LOCALIZATION
Since collecting a large amount of real location samples is costly, GANs are used to enhance the richness of the collected labeled training data by generating fake RSSI vectors and their corresponding labels (i.e., 2D coordinates).
Thus, based on a small amount of real labeled data samples, the size and the diversity of the training dataset are increased by generating supplementary fake samples. Real collected labeled data and fake generated data are mixed in order to train a DNN model used for localization. However, generated data can penalize the accuracy if it is not properly considered. Thus, we propose to limit the dependency on generated data by associating a coefficient weight to fake samples during the training phase, as discussed in the next section.

A. INTRODUCTION TO DNN ARCHITECTURE
Let $i^{(0)} \in \mathbb{R}^{N_0 \times 1}$ be the input vector to the DNN model and $o \in \mathbb{R}^{N_{H+1} \times 1}$ its associated output [40]. Let $H$ be the number of hidden layers, where $0 \le h \le H + 1$ and $N_h$ is the number of neurons in the $h$-th layer. $b^{(h)} \in \mathbb{R}^{N_h \times 1}$ and $W^{(h)} \in \mathbb{R}^{N_h \times N_{h-1}}$ denote the bias vectors and the weight matrices, respectively. To obtain the output vector of the $h$-th layer, the vector $i^{(h)}$ undergoes a linear transformation represented by $W^{(h)}$ and a bias vector $b^{(h)}$, and then a nonlinear activation function $g^{(h)}$ is applied. During DNN training, the loss function is calculated in order to iteratively update the parameters $\theta = (W, b)$.
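The layer-by-layer computation can be illustrated with a minimal forward pass. This is a sketch under the usual feed-forward convention, assumed here, that each layer applies its activation to a weight matrix times the previous layer's output plus a bias; the helper names and toy parameters are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def identity(v):
    return list(v)

def dense(weights, bias, inputs, activation):
    """One layer: activation(W @ inputs + b), with W given as a list of rows."""
    pre = [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]
    return activation(pre)

def forward(layers, rssi):
    """Propagate an RSSI vector through a list of (W, b, activation) layers."""
    out = rssi
    for weights, bias, activation in layers:
        out = dense(weights, bias, out, activation)
    return out

# Toy network: 2 inputs -> 1 hidden ReLU neuron -> 2 linear outputs (x, y).
layers = [
    ([[1.0, -1.0]], [0.0], relu),
    ([[2.0], [3.0]], [1.0, -1.0], identity),
]
coords = forward(layers, [3.0, 1.0])
```

Each row of a weight matrix holds the incoming weights of one neuron, so the matrix for layer $h$ has $N_h$ rows and $N_{h-1}$ columns, as in the dimensions given above.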

B. TRAINING GANs FOR DATA AUGMENTATION
GANs have achieved promising performance across a multitude of fields. In this paper, GANs are used for data augmentation to increase the training dataset size and diversity. Such models are composed of two DNNs: the generator $G$ and the discriminator $D$ [41], [42]. The generator model $G$ learns how to produce a realistic representation similar to the real data, and the discriminator model $D$ learns how to distinguish between fake and real samples, as shown in Fig. 3. These DNN models are trained together until $G$ is able to generate fake samples that $D$ accepts as real. Let $T_f$ be the number of generated fake samples, and let $z^{(i)}$ denote the $i$-th input of the generator. Then, $G(z^{(i)})$ is passed to the discriminator, which predicts the reliability of $G(z^{(i)})$. The generator loss is computed from $D(G(z^{(i)}))$, the probability that the discriminator outputs for a fake example. Once the loss function is minimized, the generator parameters $\theta_g = (W_g, b_g)$ are saved. During the training, the discriminator computes the probability $D(G(z^{(i)}))$, following (3), given both the fake generated samples $G(Z)$ and the real dataset $F \in \mathbb{R}^{(M+2) \times T_r}$. Each real vector $f^{(i)}$ consists of an RSSI vector together with its corresponding 2D coordinates. This probability is given to the generator to improve its performance as expressed in (4). The discriminator is trained by minimizing a loss function between real and fake data, where $\theta_d = (W_d, b_d)$ are the parameters of the discriminator model. Maximizing $\log(D(f^{(i)}))$ means that the discriminator correctly classifies the real examples, while maximizing $\log(1 - D(G(z^{(i)})))$ helps the discriminator to correctly classify the fake samples that come from the generator.
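The two losses can be sketched as follows. The expressions of the paper are not reproduced in this text, so the code assumes the standard minimax GAN objective, in which $D$ outputs the probability that its input is real; the function names are hypothetical.

```python
import math

def generator_loss(d_fake):
    """Mean of log(1 - D(G(z))) over the generated batch; G minimizes this.

    d_fake holds discriminator outputs in the open interval (0, 1).
    """
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

def discriminator_loss(d_real, d_fake):
    """Negative mean of log D(f) plus log(1 - D(G(z))); D minimizes this."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

# A confident, correct discriminator versus an undecided one.
sharp = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
undecided = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

Minimizing the negative sum is equivalent to maximizing the two log terms discussed above, and the generator loss decreases as the discriminator is fooled into assigning high probabilities to generated samples.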

C. TRAINING A MIXED WEIGHTED DNN FOR LOCALIZATION
Localization is conducted by applying a mixed DNN model that combines collected and generated data samples. The losses are updated at different training steps using refined variable coefficient weights to limit the reliance on fake generated positions. For epoch $e$, the loss function $L_{vw}(e)$ used for model fitting takes the same form as (1), where $C_r^t$ and $C_f^t$ are the collected and fake coordinates, respectively, while $\hat{C}_r^t$ and $\hat{C}_f^t$ refer to the predicted ones. Note that $T_r$ is the number of real collected samples and $T_f$ is the number of generated fake samples. The total number of available samples is $T_2 = T_r + T_f$. Moreover, $\omega_r(e)$ and $\omega_f(e)$ stand for the variable weights associated with the collected and generated samples, respectively. We consider that the real weight $\omega_r$ is normalized to 1 for all epochs, while the weighting function $\omega_f(e)$ is a piece-wise linear function of the epoch $e$ and takes values from 0 to $\omega$, as expressed in (2).

V. SIMULATION RESULTS
In this section, we provide an evaluation of the two proposed localization schemes by specifying a common simulation environment as well as different used DNN architectures and adjusted parameters followed by the corresponding localization accuracy.

A. ENVIRONMENTAL SETUP
We consider a noisy indoor environment covering 400 m$^2$ with an existing WiFi infrastructure and $M = 10$ APs. We use a propagation model with realistic parameters based on real measurements conducted in an indoor environment. In this model, we consider the degradation of signals, combining path loss and shadowing effects, and signal blockage. Let $p_m^t$ be the RSSI measured by an MU at position $t$ of the signal transmitted by the $m$-th AP. In its expression, $p_e$ is the transmitted power, which is considered constant, $B_\sigma^{mt}$ is a Gaussian random variable representing the shadowing effects, and $p_L^{mt}$ is the path loss. In the path loss expression, $p_{L_0}$ denotes the pathloss value at a reference distance $d_0$, $f$ is the frequency, $\mu$ is the pathloss exponent and
For each scenario (i.e., $S_1$, $S_2$ and $S_3$), different DNN architectures and parameters have been tested to find a good model for both classical and weighted semi-supervised learning methods. We achieved a good performance using a simple DNN architecture with two hidden layers having 30 neurons and 15 neurons, respectively. Also, different parameters have been tested to select the appropriate static weight: we have tried $\omega_r \in \{1, 2, 3, 4\}$, and $\omega_p$ has been varied from 0.1 to 1. The different optimal parameters are summarized in Table 2, where $E$ is the number of epochs and $B$ is the mini-batch size considered for the DNN training.
For the weighted method using variable weights, we have conducted intensive experiments to select the hyperparameters. For example, we set $\omega = 3$ and $E_2 = 550$ for 1000 labeled and 3000 unlabeled samples, as mentioned in Table 2. As depicted in Figure 4, these variable weights increase gradually from 0 to 0.81 over the 150 epochs used for system training. Thus, labeled data are combined with unlabeled data whose weight grows with the epoch until it reaches 0.81. Beyond this weight value, the localization accuracy starts to decrease; it is therefore defined as the maximum value to reach.

2) LOCALIZATION ACCURACY
To evaluate our proposed weighted semi-supervised methods (denoted as the fixed-weight semi-supervised method and the variable-weight semi-supervised method), we compare them with the state-of-the-art supervised DNN method (denoted as the supervised method) and the classical semi-supervised DNN method (denoted as the classical semi-supervised method). Thus, we compare the mean error in the user coordinates estimation of the following methods:
• Supervised based localization method considering only labeled data.
• Classical semi-supervised using a pseudo-labeling process to determine pseudo-labels and combine them with real labels to construct a generic localization method.
• Fixed-weight semi-supervised using a static weight.
• Variable-weight semi-supervised by integrating a variable weighting process to the classical method.
Table 3 and Figure 5 present the localization performance obtained by the cited methods considering 1000 labeled samples and 5000 unlabeled samples. We notice that using the 1000 labeled samples in a supervised way gives the worst results. Combining the 1000 labeled samples with 5000 unlabeled samples improves the localization accuracy by 37 cm, reducing the localization error by 26.42%. Adding a weight coefficient to the classical semi-supervised method is always beneficial, improving the localization accuracy by 3.88% and 12.62% for the static and variable weight, respectively. Thus, our two proposed methods improve on the localization performance of the state-of-the-art methods. We note that reaching the accuracy of 0.9 m, obtained with the variable weight for 1000 labeled and 5000 unlabeled samples, requires 2000 labeled data samples with the supervised model, as mentioned in Table 4. This shows that we can reduce the cost of collecting labeled data by half while achieving the same accuracy. From Table 4, we can observe that the use of unlabeled data boosts the localization accuracy and minimizes the cost involved in collecting labeled data. In particular, mixing 3000 labeled data samples and 3000 unlabeled data samples in the variable-weight semi-supervised framework provides the same localization accuracy as 6000 labeled samples in a classical supervised system.
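The percentages quoted above can be cross-checked with a few lines of arithmetic. The mean errors computed here are inferred from the quoted figures rather than read from Table 3, and the sketch assumes that the 3.88% and 12.62% gains are measured relative to the classical semi-supervised method.

```python
# Figures quoted in the text for 1000 labeled and 5000 unlabeled samples.
gain_m = 0.37      # absolute gain of classical semi-supervised over supervised (m)
gain_pct = 26.42   # the same gain expressed as a relative error reduction (%)

def reduce_error(error_m, reduction_pct):
    """Apply a relative error reduction given in percent."""
    return error_m * (1.0 - reduction_pct / 100.0)

# Inferred (not quoted) mean errors, in meters.
supervised = gain_m / (gain_pct / 100.0)
classical = supervised - gain_m
# Assumption: both weighted gains are relative to the classical method.
static_weight = reduce_error(classical, 3.88)
variable_weight = reduce_error(classical, 12.62)

print(f"{supervised:.2f} {classical:.2f} {static_weight:.2f} {variable_weight:.2f}")
```

Under these assumptions the script prints errors of roughly 1.40 m, 1.03 m, 0.99 m and 0.90 m, and the last value agrees with the 0.9 m accuracy reported for the variable-weight method.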

C. SIMULATION RESULTS FOR THE WEIGHTED GAN
In this part, we consider a test database $(T_t, M_t)$ containing $T_t$ test RSSI vectors taken at $M_t$ test positions and a training database $(T_r, M_r)$ where $M_r$ real training positions have been used to collect $T_r$ RSSI vectors for training. In Table 5, we present the number of different types of simulation data.

1) DNN ARCHITECTURES AND PARAMETERS USED FOR DATA AUGMENTATION AND LOCALIZATION
GANs are used to generate $T_f$ fake RSSI data samples along with their $M_f$ fake coordinates. In Figure 6, we consider $T_f = 2000$ and $T_r = 1000$. The GANs introduced in this part are based on a DNN model optimized with ADAM using a learning rate of 0.01 during 200 epochs. The generator $G$ uses one hidden layer of 10 neurons with the rectified linear unit (ReLU) activation function [44]. The discriminator $D$ also uses one hidden layer of 10 neurons with the ReLU activation function, while its last layer uses the sigmoid function. For localization, the DNN models are trained on real labeled data samples and weighted fake labeled data samples. These models take the RSSI vectors as inputs and give the corresponding coordinates as outputs. We use the ADAM optimization algorithm and a learning rate equal to 0.01. The DNN architectures and parameters are summarized in Table 6, where $N_i(\cdot)$ refers to the number of neurons in the $i$-th hidden layer. Note that our model converges before reaching $E_2$, with a final weight value between 0.6 and 0.96.

2) LOCALIZATION ACCURACY
In this section, we compare the following algorithms for different parameter values:
• Supervised($T_r$, $M_r$) is a supervised localization method based on $T_r$ real data samples collected at $M_r$ different positions.
• W-GAN($T_r$, $M_r$, $T_f$) is the localization method in which we combine $T_r$ real samples collected at $M_r$ different positions with $T_f$ weighted fake samples.
In Table 6, we present the localization accuracy of W-GAN($T_r$, $M_r$, $T_f$) trained over $T_r = 1000$ real data samples plus different numbers of fake data samples vs. a supervised learning model trained only on the $T_r = 1000$ real samples. The number of generated fake samples ranges from 100 to 5000. For all augmented training datasets, the localization accuracy is improved compared to a dataset limited to real data, with an improvement between 17.92% and 23.58%. The best localization accuracy is obtained with $T_f = 2000$ weighted generated fake data samples, where we achieve a 23.58% localization accuracy increase vs. the conventional supervised algorithm without any additional data collection cost. This improvement is explained by the fact that the DNN has been trained over a larger dataset, which contains new positions that are not included in the limited dataset constructed from measured data only. Starting from 3000 weighted generated samples, the performance saturates and no improvement can be achieved by generating additional fake samples. This can be explained by the fact that 1000 real vectors collected at 100 positions cannot provide more measurement diversity to the GAN.
We can easily notice that the supervised indoor localization system based on 1000 real samples collected at 100 known positions yields the worst localization accuracy. For a fair comparison, we take these 1000 real measurements collected at 100 positions as the initial dataset, to which we add (i) 1000 real measurements collected at 1000 different real positions placed randomly in the considered area, i.e., Supervised(2000,1100), (ii) 2000 real measurements collected at 200 different positions placed randomly in the considered area, i.e., Supervised(3000,2100), or (iii) 2000 fake positions generated from these data samples, i.e., W-GAN(1000,100,2000). Instead of collecting 1000 extra real measurements at 1000 positions, i.e., Supervised(2000,1100), we can achieve the same performance (0.81 m) by artificially generating 2000 fake data samples, i.e., W-GAN(1000,100,2000), from the initial real data samples. Thus, the proposed data generation process improves the localization accuracy without additional data collection cost. Even with 3000 real data samples collected uniformly at 300 positions, the accuracy improves on the proposed localization scheme by only 4 cm, while the amount of collected data is multiplied by 3.
To get more insight into the presented results, we show in Figure 7a and Figure 7b the training and validation accuracy of the DNN model for Supervised(1000,100) and W-GAN(1000,100,2000). As we can see, relying on a small set of data leads to overfitting, i.e., 95% training accuracy vs. 78% validation accuracy, while this issue is eliminated when using additional fake data, i.e., 98% training accuracy vs. 94% validation accuracy. Figure 7b shows that starting from E = 100, the epoch at which we start progressively introducing the fake data, the training accuracy improves, which means that the model learns better, while the validation accuracy improves rapidly, which means that the model generalizes better and no longer overfits.
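The progressive introduction of fake data can be pictured as a weight that stays at zero during the first epochs and then grows. The sketch below uses a linear ramp purely for illustration: only the starting epoch E = 100 comes from the text, while the ramp shape, the end epoch `e2 = 300`, and the maximum weight `w_max = 1.0` are hypothetical and may differ from the weighting rule actually used in this paper.

```python
def fake_weight(epoch, e1=100, e2=300, w_max=1.0):
    """Hypothetical ramp for the weight of fake samples.

    0 before e1, linear between e1 and e2, w_max from e2 onwards.
    Only e1 = 100 is taken from the text; the rest is assumed.
    """
    if epoch < e1:
        return 0.0
    if epoch >= e2:
        return w_max
    return w_max * (epoch - e1) / (e2 - e1)

# Weight applied to fake samples at a few training epochs.
for epoch in (0, 100, 150, 200, 250, 300):
    print(epoch, fake_weight(epoch))
```

With such a ramp, a model that converges before `e2` ends training with a weight below `w_max`, which is in line with the final weight values below 1 reported for the simulations.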

VI. PERFORMANCE EVALUATION OF THE PROPOSED SYSTEMS USING EXPERIMENTAL DATA
In order to support the simulation results, we evaluate the performance of the proposed systems on experimental data from the UJIIndoorLoc database, using the measurements collected on Building1-Floor2. We have 1395 training fingerprints taken at 80 training positions and 40 validation fingerprints, collected four months after the training ones and taken at different validation positions. Each fingerprint contains the signals received from the 520 deployed APs.

A. UJIIndoorLoc DATABASE DESCRIPTION
UJIIndoorLoc is a publicly available WiFi fingerprinting database. It was created at Universitat Jaume I, Spain, in 2013. It covers three multifloor buildings (4/5 floors per building) with a total area of 110 000 m², and is composed of 19937 training fingerprints and 1111 validation fingerprints. Each fingerprint contains 520 RSSI values, one per AP, received at a reference position given by its longitude and latitude. In this paper, we ignore the floor/building related information, since we only estimate the location (latitude and longitude), regardless of the building and floor.

B. OBTAINED LOCALIZATION PERFORMANCE
Limited by the number of collected measurements corresponding to Building1-Floor2, we consider $T_r = 500$ labeled data samples, $T_p = 500$ unlabeled data samples and $T_t = 435$ data samples for the test. We randomly select labeled, unlabeled and test samples from the whole database. The presented results correspond to the average of several random draws, where the number of labeled positions and the number of test positions change. Consequently, in this part, we do not report the number of labeled positions $M_r$ and unlabeled positions $M_p$. Note that the number of fingerprints is not the same as the number of positions, since users take one or more fingerprint measurements at each position. First, to construct the database, we keep only the APs detected at least once, i.e., 190 of the 520 deployed APs. Then, after data pre-processing, we are able to apply our algorithms.
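The data preparation described above (selecting one floor, discarding APs that are never detected, and drawing a random labeled/unlabeled/test split) can be sketched as follows. The column names (`WAP001` to `WAP520`, `LONGITUDE`, `LATITUDE`, `FLOOR`, `BUILDINGID`) and the value 100 marking an undetected AP reflect our understanding of the public CSV release and should be treated as assumptions, as should the helper names. A tiny synthetic table with three APs stands in for the real file.

```python
import csv
import io
import random

NOT_DETECTED = 100  # assumed sentinel for "AP not detected" in the CSV files

def load_floor(csv_file, building, floor):
    """Return (rssi_vectors, positions) for one building/floor."""
    reader = csv.DictReader(csv_file)
    wap_cols = [c for c in reader.fieldnames if c.startswith("WAP")]
    rssi, pos = [], []
    for row in reader:
        if int(row["BUILDINGID"]) == building and int(row["FLOOR"]) == floor:
            rssi.append([int(row[c]) for c in wap_cols])
            pos.append((float(row["LONGITUDE"]), float(row["LATITUDE"])))
    return rssi, pos

def drop_silent_aps(rssi):
    """Keep only the APs detected at least once in the selected fingerprints."""
    n_aps = len(rssi[0])
    keep = [j for j in range(n_aps)
            if any(vec[j] != NOT_DETECTED for vec in rssi)]
    return [[vec[j] for j in keep] for vec in rssi], keep

def random_split(n_samples, n_labeled, n_unlabeled, seed=0):
    """Shuffle sample indices and split them into labeled/unlabeled/test."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return (idx[:n_labeled],
            idx[n_labeled:n_labeled + n_unlabeled],
            idx[n_labeled + n_unlabeled:])

# Synthetic stand-in for the real file (3 APs instead of 520).
demo = io.StringIO(
    "WAP001,WAP002,WAP003,LONGITUDE,LATITUDE,FLOOR,BUILDINGID\n"
    "-60,100,100,-7500.0,4864900.0,2,1\n"
    "-72,100,-88,-7510.0,4864910.0,2,1\n"
    "-55,-70,100,-7400.0,4864800.0,0,0\n"
)
rssi, pos = load_floor(demo, building=1, floor=2)
rssi, kept = drop_silent_aps(rssi)

# 500 labeled + 500 unlabeled + 435 test samples, as in the text.
labeled, unlabeled, test = random_split(500 + 500 + 435, 500, 500)
print(kept, len(labeled), len(unlabeled), len(test))
```

Changing `seed` gives a different random draw, which is how results averaged over several draws could be produced.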
• Weighted semi-supervised localization system: Table 7 lists the parameters of each DNN used when applying the weighted semi-supervised localization system. The DNNs used in this part are trained for 200/250 epochs with a mini-batch size of 50/100 and a learning rate of 0.01. $N_i(\cdot)$ refers to the number of neurons in the $i$-th hidden layer. When considering the semi-supervised algorithm based on a fixed weight, we use $w_r = 1$ and $w_p = 0.3$. When using a dynamic weight, we have $w = 3$ and $E_2 = 570$.
• Weighted GAN-based localization system: The GAN is trained for 200 epochs using a learning rate of 0.01. The DNNs used for localization are composed of three hidden layers with 200, 150 and 100 neurons, respectively. For the weighting process, we consider $\omega = 3$ and $E_2 = 500$.
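As an illustration of the fixed-weight setting ($w_r = 1$, $w_p = 0.3$), the sketch below computes a training loss in which pseudo-labeled samples count less than real labeled ones. The squared-error criterion and the normalization by the total weight are assumptions made for this example; the function names are hypothetical.

```python
def squared_error(pred, target):
    """Squared Euclidean distance between predicted and target coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def fixed_weight_loss(real_pairs, pseudo_pairs, w_r=1.0, w_p=0.3):
    """Weighted mean of squared position errors.

    real_pairs / pseudo_pairs: lists of (predicted_xy, target_xy), where the
    target of a pseudo-labeled sample is its estimated (pseudo) label.
    """
    real = sum(squared_error(p, t) for p, t in real_pairs)
    pseudo = sum(squared_error(p, t) for p, t in pseudo_pairs)
    norm = w_r * len(real_pairs) + w_p * len(pseudo_pairs)  # assumed normalization
    return (w_r * real + w_p * pseudo) / norm

real_pairs = [((0.0, 0.0), (1.0, 0.0))]      # squared error 1
pseudo_pairs = [((0.0, 0.0), (0.0, 2.0))]    # squared error 4
loss = fixed_weight_loss(real_pairs, pseudo_pairs)
loss_real_only = fixed_weight_loss(real_pairs, pseudo_pairs, w_p=0.0)
print(loss, loss_real_only)
```

Setting `w_p = 0` recovers the purely supervised loss, while `w_p = 1` corresponds to the classical semi-supervised case where pseudo-labels are trusted as much as real labels.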
DNN architectures and parameters are chosen through an exhaustive empirical search for the best achievable ones. We compare the localization accuracy of a supervised algorithm, three variants of semi-supervised algorithms (without weight, with a fixed weight and with a variable weight) using 500 labeled and 500 unlabeled data samples, and the weighted GAN localization system using 500 labeled data samples and $T_f = 2000$ generated samples. We tested values of $T_f$ from 250 to 4000 based on the 500 labeled samples, and the best results correspond to $T_f = 2000$. Table 8 gives the mean localization error and the localization improvement compared with the supervised algorithm using the same set of data samples. The obtained results confirm the simulation results: the worst localization accuracy is obtained by the supervised algorithm, and the use of a weight enhances the performance of the classical semi-supervised algorithm by 3.44% and 8.04% with a fixed and a variable weight, respectively, which justifies the combination of labeled data samples and weighted pseudo-labeled data samples. When only labeled data are available, the localization accuracy improvement obtained by data generation is less promising than that of the variable-weight semi-supervised approach, since the generated data are not good enough given the small amount of labeled data.
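For reference, a mean localization error of the kind reported in Table 8 can be computed as the average Euclidean distance between predicted and true positions. The sketch below assumes 2-D coordinates expressed in meters; the function name and the toy values are illustrative only.

```python
import math

def mean_localization_error(predicted, actual):
    """Mean Euclidean distance between predicted and true 2-D positions."""
    distances = [math.hypot(p[0] - a[0], p[1] - a[1])
                 for p, a in zip(predicted, actual)]
    return sum(distances) / len(distances)

# Toy example: one exact prediction and one that is 5 m off.
predicted = [(0.0, 0.0), (3.0, 4.0)]
actual = [(0.0, 0.0), (0.0, 0.0)]
error = mean_localization_error(predicted, actual)
print(error)
```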
To generate more realistic data samples, we randomly choose 1000 fingerprints from the training set. We test the localization accuracy obtained by combining real and generated data, i.e., W-GAN($T_r$, $T_f$). Table 9 presents the localization error, which decreases with W-GAN compared to the conventional supervised system using only labeled data, i.e., Supervised($T_r = 1000$). We generate between 500 and 2000 additional fake positions; the best localization accuracy is obtained with 1000 generated positions, achieving a 17.31% localization improvement vs. the conventional supervised algorithm. The performance then saturates: generating 1500 positions or more does not provide better accuracy. As a result, even in a realistic environment with highly dynamic conditions and heterogeneous devices, our proposed system achieves good localization accuracy. Table 9 also reports the performance of our previous work [36], the Selective SS-GAN(1000,1000) algorithm, which uses 1000 collected real data samples and 1000 selected generated samples. This algorithm is based on (i) fake data generation, (ii) fake data pseudo-labeling and (iii) fake data selection. We succeed in reducing the computational complexity without sacrificing the localization accuracy. We even achieve an almost 2% localization accuracy improvement by avoiding the pseudo-label estimation error.

VII. SYSTEM PERFORMANCE DISCUSSION
As shown above, on both simulated and real data, the two proposed methods succeed in improving the localization accuracy compared to supervised and classical semi-supervised learning. Integrating such a data augmentation process does increase the computational complexity, but this concerns only the offline phase, when the localization system is trained and prepared. Consequently, once the localization model is trained, it is applied directly without any increase in online complexity.
Fixed-weight and variable-weight semi-supervised system: This system guarantees a localization accuracy improvement compared to the conventional semi-supervised and supervised schemes, especially when using a variable weight, which introduces the pseudo-labels gradually into the training of the localization system so as not to disturb the training process and to ensure an optimized model fit. The limitation of this method is that unlabeled data are not always available; its applicability therefore depends on whether we have access to unlabeled data.
W-GAN: This scheme can be applied as soon as a small set of labeled data is available, without the need for extra unlabeled data. The number of real labeled samples used for data generation directly influences the obtained localization performance. As shown in Section VI, the proposed W-GAN system based on 500 labeled samples improves the localization accuracy by 8.53% compared to the supervised scheme. However, with 1000 labeled samples and the same number of generated fake samples (2000), the localization improvement is about 17.31%. Thus, although data generation is a promising method for improving localization performance, it is essential to have enough real data to enable effective data generation. As mentioned in Section VI, compared to a previous work which generates RSSI vectors to be pseudo-labeled, we attain the same localization accuracy with lower computational complexity. We even achieve an almost 2% localization accuracy improvement by avoiding the pseudo-label estimation error.
However, even when extra fake data are generated to improve the training of the localization model, the model should be retrained periodically on newly collected data in order to cope with variations in the indoor propagation conditions. This operation requires additional human effort and heavy computational resources. Two approaches have recently been explored to deal with this problem: (1) federated learning, which distributes the training process and keeps the activity on some units only, and (2) transfer learning, which transfers a model from a known environment to a different one. In future work, our research will be oriented towards a deeper study of these challenges.

VIII. CONCLUSION
Machine learning-based indoor localization systems provide good localization accuracy with low online complexity. However, properly training a deep neural network (DNN) based localization model requires a large amount of collected labeled data, which makes data collection an expensive task. To address this problem, we propose two localization schemes in this paper. The first scheme, a semi-supervised DNN-based system for indoor localization, exploits both real labeled data and pseudo-labeled data in order to boost localization accuracy. This solution has been validated, showing that the integration of a fixed or variable weight is beneficial in terms of localization performance compared to the supervised scheme and the classical semi-supervised scheme. For the scenario where unlabeled data are not available and only a small set of real labeled data samples is collected, we propose a second localization scheme, in which we generate fake fingerprints using generative adversarial networks (GANs). RSSI samples and their labels are both directly provided by the GAN, so that the pseudo-labeling error is minimized. In order to enhance location prediction performance and avoid overfitting, a DNN model is trained on a mixed dataset comprising both real collected and fake generated data samples. During the training stage of the DNN-based localization model, a variable weighting coefficient is associated with the generated data samples to limit the model's reliance on fake data, especially in the early training epochs. The proposed weighted data augmentation process leads to a localization improvement of 17.31% on the UJIIndoorLoc database. In future work, we will explore the transfer learning technique to overcome the challenge of collecting costly measurements [14], transferring a model obtained in a data-rich environment to a data-poor environment with limited measurements.
WAFA NJIMA (Member, IEEE) received the Engineering degree from the Institut National des Sciences Appliquées et de Technologies de Tunis, in 2015, and the Ph.D. degree in the field of radio-communications from the Conservatoire National des Arts et Métiers in Paris, in collaboration with the Ecole Supérieure des Communications de Tunis, in 2019. She joined the ETIS Laboratory at ENSEA Cergy as a Temporary Assistant and Researcher, and also as a Postdoctoral Researcher. She is currently an Associate Professor at ISEP, Paris. Her publications span several research areas, and her research interests include signal processing, wireless communications, sparse data and data augmentation, indoor localization, the IoT, and machine learning for communications. She served as a TPC member and a reviewer for many leading international conferences and journals.
AHMAD BAZZI was born in Abu Dhabi, United Arab Emirates. He received the M.Sc. degree (summa cum laude) in wireless communication systems (SAR) from Centrale Supélec, in 2014, and the Ph.D. degree in electrical engineering from EURECOM, France, in 2017. He is currently a Research Associate with NYU Abu Dhabi, working on 6G and joint sensing and communications. Prior to that, he was the Algorithm and Signal Processing Team Leader at CEVA-DSP, Sophia Antipolis, leading the work on Wi-Fi (802.11ax) and Bluetooth (5.x BR/BLE/BTDM/LR) high-performance (HP) PHY modems, OFDMA MAC schedulers, and RF-related issues. Since 2018, he has been publishing YouTube lectures; his channel covers mathematical, algorithmic, and programming topics, with more than 120K subscribers and more than 10M views, as of April 2022. He was awarded a CIFRE Scholarship from ANRT France, in 2014. He was nominated for the Best Student Paper Award at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), in 2016. He received a Silver Plate Creator Award from YouTube, in 2022. He is a co-inventor of several patents involving intellectual property of Wi-Fi and Bluetooth products, all of which have been implemented and sold to key clients. He has served as a TPC member for some IEEE conferences and as a reviewer for several top IEEE conferences and well-known IEEE journals. His research interests include signal processing, wireless communications, statistics, and optimization.