CSI2Image: Image Reconstruction from Channel State Information Using Generative Adversarial Networks

This study aims to find the upper limit of the wireless sensing capability of acquiring physical space information. This is a challenging objective because, at present, wireless sensing studies continue to succeed in acquiring novel phenomena. Thus, although a complete answer cannot be obtained yet, a step is taken towards it here. To this end, CSI2Image, a novel channel-state-information (CSI)-to-image conversion method based on generative adversarial networks (GANs), is proposed. The type of physical space information acquired using wireless sensing can be estimated by checking whether the reconstructed image captures the desired physical space information. Three types of learning methods are demonstrated: generator-only learning, GAN-only learning, and hybrid learning. Evaluating the performance of CSI2Image is difficult, because both the clarity of the image and the presence of the desired physical space information must be evaluated. To solve this problem, a quantitative evaluation methodology using an object detection library is also proposed. CSI2Image was implemented using IEEE 802.11ac compressed CSI, and the evaluation results show that images were successfully reconstructed. The results demonstrate that generator-only learning is sufficient for simple wireless sensing problems, but for complex wireless sensing problems, GANs are important for reconstructing generalized images with more accurate physical space information.


INTRODUCTION
This study considers the upper limit of the wireless sensing capability of acquiring physical space information. Wireless sensing enables us to obtain a variety of data in physical space by only deploying access points (APs). Several studies have already shown the possibility of extracting physical space information from radio waves. In particular, channel state information (CSI)-based methods are improving the practical feasibility of wireless sensing. This is because CSI, which is used for multiple-input multiple-output (MIMO) communication, is easily acquired from commercial Wi-Fi devices. Using Wi-Fi CSI, state-of-the-art studies have already achieved remarkable results. In the future, Wi-Fi may become a sensing platform; the IEEE 802.11 wireless LAN working group has established a study group for WLAN sensing. The details of wireless sensing are discussed in Section 2.1.
To understand the upper limit of the wireless sensing capability of acquiring physical space information, this study attempts to reconstruct images from CSI obtained from off-the-shelf Wi-Fi devices. If the conversion from CSI to images corresponding to the physical space can be realized, the possibility of extracting physical space information using CSI can be approximately estimated. In addition, because the eye is the highest-resolution sensor in the human body, the images serve as human-understandable information. Furthermore, object detection technology, which has developed in conjunction with the emergence of deep learning and next-generation applications such as automated driving, can be used to automatically build training data without manual labeling.
Figure 1 shows an application example of CSI-to-image conversion: automatic wireless sensing model generation. The generation consists of two phases: the learning phase and the recognition phase. In the learning phase shown in Figure 1(a), the system simultaneously captures the CSI and images of the target space, following which the system trains a deep neural network (DNN) with the captured CSI and the images. Finally, the system extracts the physical space information from the image reconstructed from the captured CSI using the trained DNN, as shown in Figure 1(b). Figure 1(c) shows a practical example of automatic wireless sensing model generation; this is demonstrated as an evaluation in Section 4.
Considering this, this paper proposes CSI2Image, a novel wireless sensing method that converts radio information into images corresponding to the target space using a DNN. To the best of our knowledge, this is the first time CSI-to-image conversion has been achieved using GANs. A few related studies on CSI-to-image conversion without GANs have been conducted [1,2]. Wi2Vi [1] recovers video from CSI when a security camera is disabled due to power failure, malfunction, or attack. Under normal conditions, Wi2Vi extracts the background image from the camera image, detects a person using the difference between the background image and the current image, and learns by associating the detection with the CSI. Under critical conditions, Wi2Vi generates an image by superimposing the detected user onto the background image. The authors of [2] successfully generated human-pose images from CSI by learning the relationship between skeleton models of human posture and CSI. Both [1] and [2] are application-specific approaches that use application-specific information such as background images and skeleton models. In contrast, the present study focuses on a general-purpose CSI-to-image conversion method using GANs.
The main contributions of this paper are as follows:
• The use of GANs for CSI-to-image conversion is proposed, implemented, and evaluated. In particular, because simply introducing GANs is insufficient, this paper presents three methods of learning the conversion model: generator-only learning, GAN-only learning, and hybrid learning.
• A novel position-detection-based quantitative evaluation methodology for CSI-to-image conversion is demonstrated. Specifically, Section 4 quantitatively shows that the use of GANs enables the reconstruction of more generalized images from CSI compared with generator-only learning.
• An empirical evaluation using off-the-shelf devices is performed using compressed CSI, which can be acquired from IEEE 802.11ac devices. The obtained results can be easily reproduced using an off-the-shelf USB camera, a Raspberry Pi, and a packet capture tool.
The remainder of this paper is organized as follows. Section 2 describes related works on wireless sensing and GANs. Section 3 proposes CSI2Image with three generator learning structures: generator-only learning, GAN-only learning, and hybrid learning. Section 4 presents the qualitative and quantitative evaluation of the three learning structures proposed in Section 3; a quantitative evaluation methodology for CSI-to-image conversion is also proposed. Finally, conclusions are presented in Section 5.

RELATED WORKS
The present work explores the areas of wireless sensing and GANs.

Generative adversarial networks
GANs enable the generation of new data with the same statistics as the training data using a generative model [72], and they have been used in several applications [73,74]. The generative model is constructed by alternately training a generator and a discriminator, with the generator learning to trick the discriminator. This section introduces the deep convolutional GAN (DCGAN) [75] and the super-resolution GAN (SRGAN) [76], both of which are highly relevant to this study. DCGAN constructs a generative model that generates realistic fake images from random noise [75]. Figure 2 shows the model structure of DCGAN. DCGAN trains the discriminator to identify an image as real when the image is from the training dataset, and as fake when it is generated from random noise by the generator. At the same time, DCGAN trains the generator to generate images (from random noise) that the discriminator identifies as real. The generator is implemented using deep convolutional neural networks [77]. As the generator and the discriminator learn by competing with each other, the generator becomes able to generate high-quality fake images.
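The adversarial objective can be made concrete with a deliberately tiny sketch: a one-dimensional "generator" and a logistic "discriminator" trained against each other with the non-saturating GAN loss. All shapes, data, and hyperparameters below are illustrative stand-ins, not taken from DCGAN.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Discriminator: logistic regression d(x) = sigmoid(a*x + c).
a, c = 0.1, 0.0
# Generator: affine map from noise, g(z) = w*z + b.
w, b = 1.0, 0.0

lr = 0.05
for step in range(2000):
    real = rng.normal(4.0, 1.0, size=64)   # "training data"
    z = rng.normal(size=64)
    fake = w * z + b                       # generated samples

    # Discriminator step: push d(real) -> 1 and d(fake) -> 0.
    dr, df = sigmoid(a * real + c), sigmoid(a * fake + c)
    grad_a = np.mean((dr - 1.0) * real) + np.mean(df * fake)
    grad_c = np.mean(dr - 1.0) + np.mean(df)
    a -= lr * grad_a
    c -= lr * grad_c

    # Generator step: push d(fake) -> 1 (non-saturating loss -log d(fake)).
    df = sigmoid(a * fake + c)
    g_signal = (df - 1.0) * a              # gradient of the loss wrt each fake sample
    w -= lr * np.mean(g_signal * z)
    b -= lr * np.mean(g_signal)
# After training, the generator's output distribution has drifted toward
# the real data distribution (mean 4.0 here).
```

This drift of the generated distribution toward the real one is the dynamic that DCGAN scales up with convolutional networks.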
SRGAN generates high-resolution images from the corresponding low-resolution images [76]. Figure 3 shows the model structure of SRGAN. SRGAN trains the discriminator to identify an image as real when it is from the training dataset, and as fake when it is generated from a low-resolution image by the generator. At the same time, SRGAN trains the generator to generate images (from low-resolution images) that the discriminator identifies as real. From DCGAN and SRGAN, it can be said that GANs can be used to create fake data that appears real or to recover real data from small amounts of data.

CSI2IMAGE
Figure 4 shows the entire system of the proposed CSI2Image. CSI2Image is composed of training data, a generator, and a discriminator. Section 3.1 describes the training data, Section 3.2 the model structure of the generator, and Section 3.3 the model structure of the discriminator. Note that this paper proposes three types of generator learning methods: generator-only learning, GAN-only learning, and hybrid learning. Generator-only learning does not use a discriminator, as described in Section 3.4.

Training data
The training data of CSI2Image consist of simultaneously captured images and CSIs. Full-color 64 × 64 pixel images and 312-dimensional compressed CSI, acquired with 2 TX antennas, 3 RX antennas, and 52 subcarriers, are used. Compressed CSI is used by off-the-shelf APs, smartphones, and PCs for their wireless communications, and its feedback format is specified in the IEEE 802.11ac standard [70,71]. CSI2Image recovers the right singular matrix V from the compressed CSI and uses the first column of V as input data. Note that the singular value decomposition of the CSI is expressed as follows.

CSI = U S V^H
where U is a left singular matrix, S is a diagonal matrix with singular values of CSI, and V is a right singular matrix.
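As a concrete illustration, the snippet below performs this decomposition with numpy and extracts the first column of V. The values are synthetic, and the 2 × 3 matrix shape is an assumption chosen so that V has M = 3 rows, matching the convention used in the following equation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic single-subcarrier channel matrix (shape is an illustrative
# assumption: 2 TX rows x 3 RX columns, so that V has 3 rows).
csi = rng.normal(size=(2, 3)) + 1j * rng.normal(size=(2, 3))

# CSI = U S V^H ; numpy's svd returns V^H directly.
u, s, vh = np.linalg.svd(csi, full_matrices=False)
v = vh.conj().T            # right singular matrix V (3 x 2 here)
feature = v[:, 0]          # first column of V, used as the input feature

# Sanity check: the factors reproduce the channel matrix.
assert np.allclose(u @ np.diag(s) @ vh, csi)
```

Stacking the real and imaginary parts of this 3-element column over 52 subcarriers would yield 52 × 6 = 312 values, consistent with the 312-dimensional input stated above; the exact packing, however, is not specified in the text.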
The compressed CSI in IEEE 802.11ac includes the angle information ϕ and ψ. V is calculated from ϕ and ψ by the following Equation (1):

V = [ ∏_{k=1}^{min(N, M−1)} D_k ( ∏_{l=k+1}^{M} G_{lk}(ψ_{lk})^T ) ] I_{M×N}    (1)

where M is the number of RX antennas, N is the number of TX antennas, I_{M×N} is the identity matrix in which zeros are inserted in the missing elements if N < M, and G_{lk}(ψ_{lk}) is the Givens rotation matrix operating on rows k and l. D_k is a diagonal matrix, expressed as follows:

D_k = diag(1, …, 1, e^{jϕ_{k,k}}, …, e^{jϕ_{M−1,k}}, 1)

Generator
Table 1 shows the network structure of the generator:

Table 1: Network structure of the generator
Dense(65536)
Activation("ReLU")
Reshape((8, 8, 1024))
UpSampling((2, 2))
Convolution(filters=512, kernel_size=(3, 3), padding="same")
BatchNormalization(momentum=0.8)
Activation("ReLU")
UpSampling((2, 2))
Convolution(filters=256, kernel_size=(3, 3), padding="same")
BatchNormalization(momentum=0.8)
Activation("ReLU")
UpSampling((2, 2))
Convolution(filters=128, kernel_size=(3, 3), padding="same")
BatchNormalization(momentum=0.8)
Activation("ReLU")
Convolution(filters=3, kernel_size=(3, 3), padding="same")
Activation("tanh")

The 312-dimensional compressed CSI is input to a dense layer of 65,536 neurons with a Rectified Linear Unit (ReLU) activation, and the neurons are reshaped into an 8 × 8 × 1024 tensor. The tensor then passes three times through an upsampling layer, a convolution layer with a 3 × 3 kernel, a batch normalization layer, and a ReLU layer. Finally, it is input to a convolution layer with a 3 × 3 kernel and a tanh activation function to obtain an output tensor of 64 × 64 × 3. Adam is used as the optimizer of the generator network, with a learning rate of 0.0002 and a momentum term of 0.5. The loss function of the generator network is the mean squared error (MSE) [78].

Discriminator
Table 2 shows the network structure of the discriminator. The input is a full-color image of 64 × 64 pixels. The image is fed into four sets of a convolution layer with a 3 × 3 kernel and stride 2, batch normalization, a LeakyReLU function (α = 0.2), and dropout of 0.25. The output is then flattened and activated by a sigmoid function; the output value lies in the range of 0 to 1. The discriminator network uses the Adam optimizer with the same initial settings as the generator network, and binary cross-entropy as its loss function.
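Equation (1) can be made concrete with a short numpy sketch. The Givens-rotation form below follows the standard IEEE 802.11ac decomposition for the 3 RX × 2 TX case used in this paper; the angle values are synthetic and the helper names are illustrative.

```python
import numpy as np

# Rebuild V from the compressed-CSI angles (phi, psi) for M = 3 RX and
# N = 2 TX antennas, following the Givens-rotation form of Equation (1).
M, N = 3, 2

def givens(l, i, psi, size):
    """Givens rotation G_{li}(psi): identity with a 2-D rotation in rows i and l."""
    g = np.eye(size, dtype=complex)
    g[i, i] = np.cos(psi)
    g[i, l] = np.sin(psi)
    g[l, i] = -np.sin(psi)
    g[l, l] = np.cos(psi)
    return g

def reconstruct_v(phi, psi):
    v = np.eye(M, dtype=complex)
    for k in range(1, min(N, M - 1) + 1):
        # D_k = diag(1, ..., e^{j phi_{k,k}}, ..., e^{j phi_{M-1,k}}, 1)
        d = np.ones(M, dtype=complex)
        for m in range(k, M):
            d[m - 1] = np.exp(1j * phi[(m, k)])
        v = v @ np.diag(d)
        for l in range(k + 1, M + 1):
            v = v @ givens(l - 1, k - 1, psi[(l, k)], M).T
    return v @ np.eye(M, N)   # I_{MxN}: keep the first N columns

# Six synthetic example angles (the per-subcarrier angle count: 52 x 6 = 312).
phi = {(1, 1): 0.3, (2, 1): 1.1, (2, 2): 0.7}
psi = {(2, 1): 0.4, (3, 1): 0.2, (3, 2): 0.9}
V = reconstruct_v(phi, psi)

# V has orthonormal columns: V^H V = I.
assert np.allclose(V.conj().T @ V, np.eye(N))
```

Because every factor in the product is unitary, the reconstructed V always has orthonormal columns, which the final assertion checks.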

Learning phase
In this work, three methods are proposed for the learning phase: generator-only learning, GAN-only learning, and hybrid learning. Generator-only learning learns the correlation between compressed CSIs and images. GAN-only learning uses both a generator and a discriminator. Hybrid learning combines the generator-only and GAN-only learning.
Generator-only learning
Figure 5 and Algorithm 1 depict the model structure and pseudocode, respectively, of generator-only learning. The convolutional-neural-network-based generator is trained with the measured CSIs and the simultaneously captured images. As generator-only learning learns the relations between the CSIs and images given as the training data, the generator may not accurately generate images from unknown CSIs.
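The idea of generator-only learning, fitting a mapping from CSI features to images by minimizing the MSE between generated and captured images, can be sketched with a toy linear "generator". The real model is the convolutional network of Table 1; all sizes and data below are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy paired dataset: CSI feature vectors and (flattened) target images.
n_samples, csi_dim, img_dim = 64, 8, 16
csi = rng.normal(size=(n_samples, csi_dim))
true_map = rng.normal(size=(csi_dim, img_dim))
images = csi @ true_map                      # synthetic "camera" images

# Linear stand-in generator trained by gradient descent on the MSE loss.
W = np.zeros((csi_dim, img_dim))
lr = 0.02
losses = []
for epoch in range(500):
    pred = csi @ W                           # generated images
    err = pred - images
    losses.append(np.mean(err ** 2))         # MSE between generated and real
    W -= lr * (csi.T @ err) / n_samples      # gradient step
```

The loss decreases steadily because the generator only memorizes the CSI-to-image relations present in the training pairs, which is exactly why it may generalize poorly to unknown CSIs.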

GAN-only learning
Figure 6 and Algorithm 2 show the model structure and pseudocode, respectively, of GAN-only learning. Because the discriminator learns from the converted image while judging whether or not it is a real image, this method is more likely to reconstruct a clear image than generator-only learning. However, the discriminator only judges the legitimacy of the converted image, and the generator may not produce an image corresponding to the measured compressed CSI.
In particular, the discriminator may not learn the detailed parts of the image.

Hybrid learning
Figure 7 and Algorithm 3 show the model structure and pseudocode, respectively, of hybrid learning. Lines 7 to 8 of Algorithm 3 represent discriminator learning: at line 7, the discriminator is trained to assess the training image as real, and at line 8, it is trained to assess the generated image (obtained from the random noise) as fake, as shown in Figure 8. Lines 9 to 11 of Algorithm 3 represent generality learning: every K epochs, the generator is trained by feeding it the compressed CSI so that the discriminator judges the generated image to be real, as shown in Figure 9. When the value of K is large, the generalization performance increases while CSI information is lost; when the value of K is small, the image quality decreases owing to the loss of generality.
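The hybrid schedule can be sketched as follows. This skeleton assumes that the discriminator and the MSE-based generator update run every epoch while the adversarial generator update runs every K epochs; the three train_* functions are placeholders for the real network updates of Algorithm 3, which is not reproduced here.

```python
# Scheduling skeleton of hybrid learning (placeholder updates only).
K = 8            # adversarial-update interval, as used in the evaluation
EPOCHS = 32000   # number of epochs used in the experiments

counters = {"discriminator": 0, "mse": 0, "adversarial": 0}

def train_discriminator_step():
    counters["discriminator"] += 1   # training image -> real, generated -> fake

def train_generator_mse_step():
    counters["mse"] += 1             # generator <- MSE(generated image, camera image)

def train_generator_adversarial_step():
    counters["adversarial"] += 1     # CSI in; generated image judged "real"

for epoch in range(1, EPOCHS + 1):
    train_discriminator_step()
    train_generator_mse_step()
    if epoch % K == 0:               # generality learning every K epochs
        train_generator_adversarial_step()
```

With these settings the adversarial update runs 32000 / 8 = 4000 times, so K directly trades off how strongly the adversarial objective shapes the generator relative to the MSE objective.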

Image generation phase
In the image generation phase, the compressed CSI measured by wireless devices is fed into the pre-trained generator, and the generator converts the CSI into full-color 64 × 64 pixel images, as shown in Figure 10.

EVALUATION
To clarify the effectiveness of the proposed CSI2Image, qualitative and quantitative evaluations were conducted. Because the conversion of CSI to images is a new research area, no quantitative evaluation method has been established yet. Therefore, a quantitative evaluation method using object detection and position detection is proposed for the conversion of CSI to images. The image converted from the CSI is applied to object detection, and the possibility of extracting the same detection results as from the training image is evaluated. Figure 11 shows the configuration of each piece of equipment in the experimental environment, and Figure 12 shows a snapshot of the environment. This experiment utilized an AP, a camera, a computer, and a capture device serving as a compressed-CSI sniffer. The AP was a Panasonic EA-7HW04AP1ES, the camera was a Panasonic CF-SX1GEPDR with a resolution of 1280 × 720 pixels, the computer was a MacBook Pro (13-inch, 2017), and the capture device was a Panasonic CF-B11QWHBR running CentOS 7.7-1908. The proposed CSI2Image model was developed using a Dell Alienware 13 R3 computer equipped with an Intel Core i7-7700HQ central processing unit, 16 GB of DDR-SDRAM, a GeForce GTX 1060 graphics processing unit, and a solid-state drive for storage. The object detection library used was You Only Look Once (YOLO) v3 [79,80], and the model data were trained with [81] on the COCO dataset [82]. The detection threshold of YOLO was set to 0.3.
In the quantitative evaluation, the following four metrics were used: (1) Object detection success rate: a high score indicates that the quality of the generated images is sufficient for object detection. (2) Average confidence score when object detection is successful: the confidence score is the confidence level of the object recognition algorithm in outputting the recognition result. (3) Structural similarity (SSIM) [83]: SSIM is a standard measure of image quality used to evaluate the performance of image processing algorithms such as image compression. (4) Position detection accuracy rate: the percentage of correct locations detected via object recognition. The SSIM index is expressed by Equation (2):

SSIM(x, x̂) = ((2 µ_x µ_x̂ + C_1)(2 σ_{x x̂} + C_2)) / ((µ_x² + µ_x̂² + C_1)(σ_x² + σ_x̂² + C_2))    (2)

where x and x̂ are vectors whose elements are the pixels of an original image and a reconstructed image, respectively. Let µ_x and µ_x̂ denote the average pixel values of images x and x̂, σ_x and σ_x̂ the standard deviations of images x and x̂, and σ_{x x̂} the covariance of images x and x̂. C_1 and C_2 are constants defined as C_1 = (255 K_1)² and C_2 = (255 K_2)², where K_1 = 0.01 and K_2 = 0.03 are the same values as in [83]. The SSIM index takes a value from 0 to 1, where 1 represents an exact image match.
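A minimal implementation of the SSIM index over whole images is shown below. Library SSIM implementations typically average the index over local windows; the global single-window form here suffices to illustrate the metric and its constants.

```python
import numpy as np

def ssim(x, y, k1=0.01, k2=0.03, dynamic_range=255):
    """Global (single-window) SSIM between two images of equal shape."""
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    c1 = (dynamic_range * k1) ** 2
    c2 = (dynamic_range * k2) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den
```

For identical images the covariance equals the variance and the means coincide, so the numerator and denominator match and the index is exactly 1; a misplaced user lowers the covariance term, which is why GAN-only learning can score high on detection yet low on SSIM.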

Experiment 1: Single-user position detection
To clarify the baseline performance of the proposed CSI2Image, single-user location detection was evaluated. The experiment was performed with only one person at positions 1 to 3 in Figure 11. Three types of image patterns were possible, in which the person would be at position 1, 2, or 3. The evaluation used 180 images as training data and 184 images as test data. The number of epochs was 32,000, and the batch size was 32. In hybrid learning, K was eight. Figure 13 shows an example of successful position detection with one user. The red square in each figure represents the object detection result obtained using YOLO. If a person is detected on the right of the image, as shown in Figure 13(a), the position detection is considered accurate. Figure 14 shows an example of failed position detection with one user. If a person is detected on the right of the image, as shown in Figure 14(a), the position detection is accurate. As can be seen from Figures 14(b) to 14(d), pale ghost-like shadows appear at the middle and the right of the images. In contrast, GAN-only learning in Figure 14(c) produces a clean image compared to generator-only learning and hybrid learning, although the position is inaccurate. Figure 15(a) shows the success rate of human detection. The black and white bars represent the results using the training and test data, respectively. The confidence threshold of YOLO is 0.3. In terms of the detection success rate, GAN-only learning achieved the highest score: in the test data, the detection success rates of generator-only learning, GAN-only learning, and hybrid learning were approximately 92.7 %, 93.5 %, and 92.3 %, respectively. Figure 15(b) shows the average confidence score when the object detection is successful.
As with the detection success rate described above, GAN-only learning achieved the highest score: in the test data, the average confidence scores of generator-only learning, GAN-only learning, and hybrid learning were approximately 91.1 %, 94.5 %, and 90.8 %, respectively. To understand the reason for the low SSIM performance of GAN-only learning, the position detection accuracy was evaluated. The results show that GAN-only learning had the worst performance compared to generator-only learning and hybrid learning. With the test data, the accuracies were found to be approximately 89.9 % for generator-only learning, 21.2 % for GAN-only learning, and 90.5 % for hybrid learning.

Quantitative evaluation
Thus, although the detection success rate and the average confidence score are the highest for GAN-only learning, the SSIM index is low owing to misplaced-user images. In particular, GAN-only learning has low position detection accuracy even with the training data. This is because GAN-only learning learns only the legitimacy of the generated image using the discriminator.

Experiment 2: Position detection with one or two users
For more complex situations, position detection was evaluated for the case of one or two users. Specifically, six types of classification problems were evaluated when one person or two people were at positions 1 to 3, as shown in Figure 11: "one person at 1," "one person at 2," "one person at 3," "two people at 1 and 2," "two people at 1 and 3," and "two people at 2 and 3." We used 720 images as training data and 330 images as test data for the evaluation. The other conditions were identical to those presented in Section 4.2. Figure 16 shows an example of successful position detection with one or two users. If a person can be detected at the center of the image, as shown in Figure 16(a), the position detection is accurate. The positions obtained via generator-only learning and hybrid learning in Figures 16(b) and 16(d) are accurate, and the human shape is clearly displayed. However, as shown in Figure 16(c), GAN-only learning accurately detects the position, but a shadow is also output on the left side, at an incorrect position. Figure 17 shows an example of failed position detection with one or two users. If a person is detected on the right side of the image, as shown in Figure 17(a), the position detection is considered accurate. Figure 18(a) shows the detection success rates of each comparison method. It can be observed that hybrid learning achieves the highest detection rate: using the test data, the detection success rates are approximately 79.6 % for generator-only learning, 54.0 % for GAN-only learning, and 85.4 % for hybrid learning.
The detection success rate of GAN-only learning is low even when using the training data. Figure 18(b) shows the average confidence score of each comparison method. Similar to the detection success rate above, the confidence score of hybrid learning is the highest on the test data. Figure 18(c) shows the SSIM index of each comparison method. The results are the same as in the single-user evaluation: GAN-only learning shows the worst performance. Using the test data, the SSIM indexes are 0.803 for generator-only learning, 0.656 for GAN-only learning, and 0.803 for hybrid learning. Figure 18(d) shows the position detection accuracy of each comparison method. The results show that, using the test data, hybrid learning achieved the highest accuracy, while GAN-only learning had the lowest: the values were 79.3 % for generator-only learning, 13.1 % for GAN-only learning, and 83.8 % for hybrid learning.

Experiment 3: Continuous position estimation for a single user
To evaluate a more complex situation than that in Section 4.3, experiments were conducted in which one person walked around an oval connecting positions 1 to 3, as shown in Figure 11. The evaluation used 515 images as training data and 498 images as test data.
The other settings were identical to those in Experiments 1 and 2. As the results of the qualitative evaluation did not differ from those of the position detection problem in Section 4.2, only the quantitative evaluations are presented in this section. Figure 19(a) shows the detection success rates of each comparison method. The detection success rates are relatively low compared to those obtained in the evaluations in Section 4.2 and Section 4.3, despite the fact that the results when using training data showed high detection success rates. In the training data, the values were 95.7 % for generator-only learning, 39.0 % for GAN-only learning, and 96.3 % for hybrid learning, whereas in the test data, they were 29.6 % for generator-only learning, 35.2 % for GAN-only learning, and 27.8 % for hybrid learning. We believe that the amount of training data is small relative to the complexity of the problem. Figure 19(b) shows the distance (in pixels) between the left coordinates of the detected box in the training data and that in the generated data. A lower distance indicates that CSI2Image precisely tracks the position of a user. This evaluation only used the generated images in which a user was successfully detected. The lower limit of the error bar is the minimum value, and the upper limit is the maximum value. The evaluation results show that GAN-only learning cannot be used for single-user continuous position detection. While generator-only learning and hybrid learning are superior to GAN-only learning, they require performance improvements, because the maximum distance is too high even though the results are calculated only from successfully detected images. Using the test data, the maximum differences of generator-only learning, GAN-only learning, and hybrid learning are 46 px, 54 px, and 49 px, respectively.

CONCLUSION
This paper proposed CSI2Image, a GAN-based CSI-to-image conversion method. Specifically, three learning methods were explored: generator-only learning, GAN-only learning, and hybrid learning. An evaluation method using an image recognition algorithm was also proposed as a quantitative evaluation method for CSI-to-image conversion. For simple problems such as classifying the location of a person, it was found that the simplest generator-only learning model can be used. In addition, it was observed that simple use of GANs, as in GAN-only learning, resulted in the successful generation of high-quality images, but the images lacked physical space information. Furthermore, hybrid learning, which is a combination of generator-only learning and GAN-only learning, was found to achieve superior performance under slightly more complex conditions, such as classifying the location of one or two people. However, none of the three methods performed well in the more complex single-user continuous position detection problem. It is concluded that further improvements can be made by redesigning the network structure, allowing the input of time-series CSI, inputting CSI from multiple devices, and converting CSI to higher-order information, such as angle-of-arrival, before inputting it into the DNN.