Applying Deep-Learning-Based Computer Vision to Wireless Communications: Methodologies, Opportunities, and Challenges

Deep learning (DL) has achieved great success in the computer vision (CV) field, and the related techniques have been widely used in security, healthcare, remote sensing, etc. Meanwhile, visual data is ubiquitous in daily life and is easily generated by prevalent, low-cost cameras. Therefore, DL-based CV can be exploited to obtain and forecast useful information about objects, e.g., their number, locations, distribution, and motion. Intuitively, DL-based CV can facilitate and improve the design of wireless communications, especially in dynamic network scenarios. However, such works are so far rare in the existing literature. The primary purpose of this article is therefore to introduce ideas for applying DL-based CV in wireless communications, bringing novel degrees of freedom to both theoretical research and engineering applications. To illustrate how DL-based CV can be applied in wireless communications, an example of applying DL-based CV to a millimeter wave (mmWave) system is given to realize optimal mmWave multiple-input and multiple-output (MIMO) beamforming in mobile scenarios. In this example, we propose a framework to predict future beam indices from previously observed beam indices and images of street views by using ResNet, 3-dimensional ResNext, and a long short-term memory network. Experimental results show that our frameworks can achieve much higher accuracy than the baseline method, and that visual data can significantly improve the performance of a MIMO beamforming system. Finally, we discuss the opportunities and challenges of applying DL-based CV in wireless communications.


I. INTRODUCTION
Recently, deep learning (DL) has achieved great success in the computer vision (CV) field. It comprises networks such as the deep neural network, deep belief network, recurrent neural network (RNN), and convolutional neural network (CNN). A considerable number of DL networks have emerged with the availability of large image or video datasets and high-speed graphics processing units (GPUs) [1]. The reason why DL networks can achieve success in CV is that they can discover and integrate low-/middle-/high-level features in images and then leverage them to accomplish specific tasks [2]. DL can fulfill CV applications such as semantic segmentation, image classification, and object detection/recognition with very high performance [1]. Thus, DL-based CV has been widely utilized in public security, healthcare, and remote sensing, as a large amount of visual data is generated in these fields [3]. However, it is rarely seen in wireless communications, in which 1-dimensional temporal wireless data prevails.
Nowadays, high-definition cameras are installed almost everywhere because of their low cost and small size. In some public areas, cameras have already been deployed for monitoring. Therefore, visual data can be easily obtained in real-life wireless communication systems [4]. Useful information about the static system topology (including the number and positions of the terminals, the distances among them, etc.) and about the system dynamics (including the moving speed and direction, and the change in the number of terminals) can be recognized, estimated, and extracted from these multimedia data via DL-based CV techniques. New potential benefits can thus be exploited to aid the design and optimization of wireless communication systems, such as resource scheduling and allocation, algorithm design, etc. In Fig. 1, we present the framework of applying DL-based CV to wireless communications, the core idea of which is to exploit the useful information obtained/forecasted by DL-based CV techniques to facilitate the design of wireless communications via DL-based/traditional optimization methods. In the following, we will introduce some applications of DL-based CV in wireless systems from three aspects: the physical layer, the medium access control (MAC) layer, and the network layer.
1) In the physical layer, wireless communication systems can leverage object detection and segmentation techniques in CV to obtain the locations, number, and environmental information of users from the visual data. With the aid of the obtained information, specific modulation, source encoding, channel encoding, and power control strategies can be selected to realize the optimal utilization of system resources (e.g., bandwidth and energy budgets). In this way, dynamic modulation, encoding, and power control can be easily formulated and implemented. For example, in multiple-input and multiple-output (MIMO) beamforming communication systems, the direction and power of beams can be scheduled from the knowledge of users' locations and blocking cases in the visual data.
2) In the MAC layer, by making use of the density or distribution of users obtained from the visual data in the serving area of the base station (BS), channel resources (including frequency bands, time slots, etc.) can be efficiently reserved and allocated to achieve the optimal overall performance. For example, in smart home scenarios, there are various kinds of terminals, such as smartphones, televisions, laptops, and other intelligent home appliances. As such, channel resources can be dynamically scheduled by considering the information obtained from the visual data, like the number and locations of the users. As another example, unlike traditional handover algorithms, which use the measured fluctuation of the received signal power to estimate the distance between the terminal and the BS, the motion information, including the moving velocity and its variation, can be estimated from visual data to accurately facilitate channel resource allocation in the handover process. This can be quite useful in fifth-generation wireless networks due to the shrinking size of the serving zones.
3) In the network layer, in multi-hop transmission scenarios, novel routing algorithms can be designed to improve the transmission performance, e.g., the end-to-end delivery delay, packet loss rate, jam rate, and system throughput, by exploiting the system topology information obtained from the visual data. For instance, in wireless sensor networks, numerous sensors can be deployed in a target area to monitor, gather, and transmit information about their surrounding environment. The system topology information from visual data can then be used to design multi-hop transmissions, which are required due to the inherent resource limitations and hardware constraints of the sensors.
In this context, this article aims at introducing the methodologies, opportunities, and challenges of applying DL-based CV in wireless communications so as to offer an essential reference/guide for theoretical research and engineering applications.
The rest of this article is organized as follows. Section II gives an overview of related work from two aspects: datasets and applications. In Section III, an example of applying DL-based CV to mmWave MIMO beamforming is presented. The problem definition, framework architecture, pipeline, and results of this example are also elaborated in this section. In Section IV, some challenges and open problems of applying DL-based CV to wireless communications are introduced and discussed. Finally, this article is concluded in Section V.

II. AN OVERVIEW OF RELATED WORK
In applying DL-based CV to wireless communications, there are two essential things: datasets and applications. In the following, we will give a brief overview of recent work according to these two aspects.
a) Datasets: DL is data-hungry, and building datasets is an essential step. In [5], the authors proposed a parametric, systematic, and scalable dataset framework called Vision-Wireless (ViWi). They utilized this framework to build the first-version dataset with four scenarios with different camera distributions (co-located and distributed) and views (blocked and direct). These scenarios are based on the millimeter wave (mmWave) MIMO wireless communication system. Each scenario contains a set of images captured by the cameras and raw wireless data (signal departure/arrival angles, path gains, and channel impulse responses). Using the provided MATLAB script, users can get the user's location and channel information in each image from the raw wireless data. Later, the same authors built the second-version dataset called ViWi Vision-Aided Millimeter-Wave Beam Tracking (ViWi-BT) [6] and posted it for the machine learning competition at the IEEE International Conference on Communications 2020. This dataset contains images captured by the co-located cameras and mmWave MIMO beam indices under a predefined codebook. The details of this dataset are illustrated in Section III-D1.
b) Applications: There are some interesting applications focusing on tackling beamforming problems. A framework to implement beam selection in mmWave communication systems by leveraging environmental information was presented in [4]. The authors used images with different perspectives captured by one camera to construct the 3-dimensional (3D) scene and generate the corresponding point cloud data. They built a model based on a 3D CNN to learn the wireless channel from the point cloud and predict the optimal beam. Based on the first-version ViWi dataset, [7] proposed a modified ResNet18 model to conduct beam and blockage prediction from the images and channel information. Based on the second-version ViWi-BT dataset, the authors of [6] provided a baseline method that uses only the beam indices and not the images. They expected that better performance could be obtained by leveraging both kinds of data.

III. AN EXAMPLE OF APPLYING DL-BASED CV TO MMWAVE MIMO BEAMFORMING

A. Problem Definition
MmWave communication is a promising technique in the 5th generation communication system, thanks to its very large available bandwidth and ultra-high data transmission rate [5]-[7]. Beamforming should be implemented over a large antenna array to achieve the required high power gain and directivity. The classic MIMO beamforming algorithms suffer from a common disadvantage: their complexity increases dramatically with the number of antennas, resulting in substantial computational overhead. DL-based CV is a promising candidate to address this overhead issue.
In this section, we will give an example of applying DL-based CV to mmWave MIMO beamforming. A scenario with a BS forming a MIMO beam to serve a target user moving along a street is considered, as shown in Fig. 2. Therefore, the beam direction must be dynamically adjusted to track the target mobile user. The target user may be blocked at some moments, like t_8 in Fig. 2. The beam then cannot directly reach the target user, and a proper reflection from other objects, e.g., buildings or vehicles, needs to be designed. Meanwhile, there are three cameras installed at the BS to capture RGB (red, green, blue) images of the street view to assist the beamforming process. So the problem here is how to utilize the 8 previously observed consecutive beams and their corresponding 8 images to predict the future 1, 3, and 5 beams. Notably, these beams are represented as beam indices under the same predefined codebook.

B. Framework Architecture and Methods
We propose a DL network framework shown in Fig. 3. It is composed of ResNet [2], 3D ResNext [8], a feature fusion module (FFM) [9], and a predictive network, which are elaborated below.
1) ResNet, ResNext and 3D ResNext: ResNet consists of several residual blocks, as presented in Fig. 4. The block contains two or more convolutional layers and adds its input to its output through an identity mapping. It can efficiently address the vanishing gradient issue caused by the increasing number of convolutional layers. If a specific number of such blocks are concatenated, as depicted in Fig. 4, ResNet can reach a depth of up to 152 layers.
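The identity mapping of a residual block can be illustrated with a minimal sketch. For brevity, it assumes fully connected layers in place of the convolutional layers, and the helper names (relu, linear, residual_block) are hypothetical; it is not the ResNet implementation used in our framework.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute ReLU(F(x) + x), where F is two stacked layers (assumed dense here)."""
    hidden = relu(linear(w1, x))
    residual = linear(w2, hidden)
    # The shortcut adds the unchanged input to the output of the stacked layers.
    return relu([r + xi for r, xi in zip(residual, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

print(residual_block(x, zeros, zeros))        # F(x) = 0, so the input passes through
print(residual_block(x, identity, identity))  # F(x) = x, so the output is 2x
```

When the stacked layers contribute nothing, the block reduces to the identity, which is why adding more such blocks does not have to degrade the signal passed to deeper layers.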
The structure of the ResNext block [10] is presented in Fig. 4. It is an improved version of the residual block, in which a 'next' dimension, also called 'cardinality', is added. It sums the outputs of K parallel convolutional layer paths that share the same topology. It also inherits the residual structure after the combination. As K-fold diversity is achieved by the K paths, they can focus on more than one specific feature representation.
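The aggregation over K parallel paths can be sketched as follows. As an assumption made for brevity, each path is a small two-layer dense bottleneck rather than a stack of convolutions, and the function names are hypothetical.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def resnext_block(x, paths):
    """Sum K same-topology paths, then add the shortcut.

    paths is a list of K (w_in, w_out) pairs that all share the same shapes.
    """
    aggregated = [0.0] * len(x)
    for w_in, w_out in paths:
        branch = linear(w_out, relu(linear(w_in, x)))
        aggregated = [a + b for a, b in zip(aggregated, branch)]
    # Residual structure is kept after the K paths are combined.
    return relu([a + xi for a, xi in zip(aggregated, x)])


x = [1.0, 2.0]
# Two paths (K = 2), each with a bottleneck of width 1.
path_a = ([[1.0, 0.0]], [[1.0], [0.0]])  # attends to the first feature
path_b = ([[0.0, 1.0]], [[0.0], [1.0]])  # attends to the second feature
print(resnext_block(x, [path_a, path_b]))
```

In this toy case each path responds to a different input feature, which mirrors the idea that the K paths can specialize in different representations.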
3D ResNext has a structure similar to that of ResNext, except that 3D convolutional layers are adopted instead of 2-dimensional (2D) ones. The 3D convolutional layer is designed to capture spatio-temporal 3D features from raw videos.
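To make the spatio-temporal nature of a 3D convolution concrete, the following sketch applies a single-channel 3D kernel to a short clip stored as nested lists indexed by (time, row, column). It assumes no padding and unit stride, and conv3d_valid is a hypothetical name.

```python
def conv3d_valid(video, kernel):
    """Single-channel 3D cross-correlation with no padding and stride 1."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                acc = 0.0
                # The kernel slides over time as well as over space.
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            acc += video[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Three 2x2 frames whose brightness changes over time.
clip = [
    [[0, 0], [0, 0]],
    [[1, 1], [1, 1]],
    [[3, 3], [3, 3]],
]
# A 2x1x1 kernel that subtracts the earlier frame from the later one.
temporal_difference = [[[-1.0]], [[1.0]]]
print(conv3d_valid(clip, temporal_difference))
```

A kernel that spans two frames responds to the change between them, which a 2D convolution applied to each frame separately cannot capture.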
ResNet and 3D ResNext have been widely used as feature extractors for their powerful feature representation ability. If they are trained directly as part of a DL network, the training time becomes extremely long, and substantial computational resources are occupied due to the large number of layers. So, commonly, researchers apply ResNet pre-trained on the ImageNet dataset to extract visual features from images, and 3D ResNext pre-trained on the Kinetics dataset to extract spatio-temporal features from videos [11]. These features are then fed to the DL network as part of its inputs.
2) Long Short Term Memory (LSTM) Network: The LSTM network [12] is designed for tasks that involve time series data, e.g., prediction, speech recognition, and text generation. Hence, it is a very suitable candidate for our predictive network. The network is composed of several LSTM cells, as depicted in Fig. 5. The event (current input), the previous long-term memory (cell state), and the previous short-term memory (hidden state) are the inputs of an LSTM cell, in which learn, forget, remember, and output gates are employed to exploit the information from the inputs. The cell outputs a new long-term memory and a new short-term memory. The latter is also regarded as a prediction.
Fig. 5. Structure of LSTM cell, method with 1D LSTM network, and method with 2D LSTM network.
When the LSTM cell is recursively utilized in a 1-dimensional (1D) array form, a 1D LSTM network is obtained, as presented in Fig. 5. At each moment, the cell and hidden states of the previous moment are used to generate the outputs of the current moment. By using this network, one can get as many predictions as the number of recursions.
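The recursion of a 1D LSTM network can be sketched with a scalar LSTM cell. The sketch assumes the standard LSTM formulation with forget, input, candidate, and output terms, which may be named differently from the gates in Fig. 5, and the parameter layout is hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function used by the gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell(x, h_prev, c_prev, params):
    """One step of a standard LSTM cell with scalar input and scalar states.

    params maps each gate name to a (w_x, w_h, b) triple.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to store
    g = math.tanh(pre("candidate"))   # new candidate content
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # new cell state
    h = o * math.tanh(c)              # new hidden state, also the output
    return h, c


def run_lstm(xs, params):
    """Apply the same cell recursively along a sequence (1D LSTM network)."""
    h, c = 0.0, 0.0
    outputs = []
    for x in xs:
        h, c = lstm_cell(x, h, c, params)
        outputs.append(h)
    return outputs


# Gates forced fully open by large biases: the cell state grows by 1 per step.
open_gates = {name: (0.0, 0.0, 100.0)
              for name in ("forget", "input", "candidate", "output")}
print(run_lstm([0.0, 0.0, 0.0], open_gates))
```

The list returned by run_lstm has one entry per recursion, in line with the statement that the number of predictions equals the number of recursions.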
As given in Fig. 5, a 2D LSTM network can be realized when the LSTM cell is recursively utilized in a 2D mesh form [13]. Each LSTM cell utilizes the hidden and cell states from its two neighboring cells in the left and below positions in the mesh. Moreover, its states are delivered to its neighboring cells in the right and top positions. Accordingly, the number of predictions is equal to the number of rows.
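The wiring of the 2D mesh can be sketched independently of the cell internals. In the sketch below, the cell is passed in as a function, every row is assumed to receive the same column inputs, and each row's prediction is assumed to be read from its last cell; these choices and the name run_mesh are assumptions for illustration only.

```python
def run_mesh(inputs, rows, cell, init_state=0.0):
    """Evaluate a 2D mesh of cells, with row 0 at the bottom.

    inputs holds one value per column. cell(x, left, below) returns the new
    state of a cell given its input and the states of its left and lower
    neighbors. The state of the last cell in each row is returned.
    """
    cols = len(inputs)
    state = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = state[r][c - 1] if c > 0 else init_state
            below = state[r - 1][c] if r > 0 else init_state
            state[r][c] = cell(inputs[c], left, below)
    return [state[r][cols - 1] for r in range(rows)]


def toy_cell(x, left, below):
    """Stand-in for an LSTM cell: simply accumulates its three inputs."""
    return x + left + below


print(run_mesh([1, 1], rows=2, cell=toy_cell))
```

The traversal order guarantees that the left and lower neighbors of a cell are evaluated before the cell itself, and one output is produced per row.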

3) Feature Fusion Module:
The structure of the feature fusion module is shown in Fig. 3. It comprises two LSTM networks and a cross gating block. The former are used to aggregate the visual and motion features, and high-level features can be obtained through these two networks. The latter makes full use of the related semantic information between these two kinds of features by multiplication and summation operations. The merged features are then obtained through a linear transformation.
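One common form of cross gating is sketched below under stated assumptions: each feature vector is scaled element-wise by a sigmoid gate computed from the other one, the two gated vectors are summed, and a linear layer produces the merged feature. The exact block in [9] may differ, both vectors are assumed to have the same length, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function used as the gate."""
    return 1.0 / (1.0 + math.exp(-z))


def cross_gate(visual, motion):
    """Gate each feature by the other one (multiplication), then sum them."""
    gated_visual = [v * sigmoid(m) for v, m in zip(visual, motion)]
    gated_motion = [m * sigmoid(v) for v, m in zip(visual, motion)]
    return [a + b for a, b in zip(gated_visual, gated_motion)]


def fuse(visual, motion, weight, bias):
    """Apply a linear transformation to the cross-gated sum."""
    summed = cross_gate(visual, motion)
    return [sum(w * s for w, s in zip(row, summed)) + b
            for row, b in zip(weight, bias)]


visual = [0.0, 0.0]
motion = [2.0, 4.0]
print(fuse(visual, motion, weight=[[1.0, 1.0]], bias=[0.5]))
```

Because each gate depends on the other modality, a feature is emphasized or suppressed according to what the other feature indicates, which is the intended use of the related semantic information.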

C. Pipeline of our Framework
In the pipeline of the considered DL network, 8 consecutive images are taken as inputs. As the 8 consecutive images are equivalent to a video clip, they contain motion information, which is helpful for the beam prediction. Combined with the visual information from each image, location, motion, and blockage information can be revealed from these RGB images.
The pre-trained 3D ResNext with 101 layers (3D ResNext101) is adopted to extract motion features, and the pre-trained ResNet with 152 layers (ResNet152) is adopted to extract visual features. These features are then merged through the FFM and sent to the predictive network. Based on the LSTM network forms introduced in Section III-B2, three methods of designing the DL network are proposed and explained below.
1) Method with 1D LSTM Network: When the predictive network in Fig. 3 is a 1D LSTM network, the first method is obtained, as presented in Fig. 5. The LSTM cell is applied recursively 12 times. The cell at the kth moment is denoted as the 'kth LSTM cell'. The input and output of each LSTM cell are an embedded vector and an output vector, respectively. Embedding maps a discrete value (here, a beam index) to a vector and can well represent the relations between such values.
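An embedding can be viewed as a lookup table with one vector per beam index. The sketch below assumes 129 possible index values, chosen only to match the size of the output vector reported in the implementation details, and a small embedding dimension; the table would be learned during training, whereas here it is randomly initialized.

```python
import random


def make_embedding(num_indices, dim, seed=0):
    """Create a lookup table with one vector per beam index (random init)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_indices)]


def embed(table, beam_indices):
    """Map a sequence of beam indices to their embedded vectors."""
    return [table[i] for i in beam_indices]


table = make_embedding(num_indices=129, dim=4)  # sizes are assumptions
vectors = embed(table, [3, 3, 7])
print(len(vectors), len(vectors[0]))
```

Identical indices always map to the same vector, while training is free to place related indices close to each other in the embedding space.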
During the training process, the pipeline of our first method is as follows:
Step 1: 8 consecutive images are fed to the pre-trained ResNet152 and 3D ResNext101, and then visual features and motion features are obtained;
Step 2: The features from Step 1 are merged through the FFM;
Step 3: The output from the FFM is fed to each LSTM cell as part of the inputs;
Step 4: The embedded vectors of the first 12 beam indices go through the 1st to the last LSTM cells to update the hidden states and generate 12 output vectors;
Step 5: The 12 output vectors are used to calculate the training loss with the ground truth and train the network.
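During the testing process, as only the first 8 beam indices and images are available, Step 4 above is not applicable. It is separated into two substeps:
Substep 4.1: The embedded vectors of the first 8 beam indices go through the 1st to 8th LSTM cells and update the hidden states;
Substep 4.2: The 8th to 12th LSTM cells are used to predict the future beam indices, which are obtained by taking the index of the maximum element in each of these output vectors. Each of these cells is fed with the hidden state and the embedded beam index from the prediction of the previous LSTM cell.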
The fifth step is skipped during testing.
2) Method with Modified 1D LSTM Network: In our first method, the training and testing procedures are different. In fact, the first method essentially aims to predict only the next beam index, as all of the first 12 beam indices are utilized as inputs during training. During the testing process, for the predictions made by the 8th to 12th LSTM cells, the correctness of each prediction is very important for the next one. To make the training and testing processes consistent, we design a modified version of the first method. In the modified version, the output vector of each of the last five LSTM cells goes through a linear transformation module and is fed to the next cell as the embedded input. In this way, only the first 8 beam indices are used as inputs, and the training and testing procedures are the same.
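The testing procedure of the first method, in which the observed beam indices first update the state and each later cell is fed with the previous prediction, can be sketched as follows. The step function stands in for one LSTM cell followed by its output layer; toy_step and predict_future are hypothetical names, and the toy model simply predicts the next index in the codebook.

```python
def argmax(values):
    """Index of the maximum element of an output vector."""
    return max(range(len(values)), key=lambda k: values[k])


def predict_future(observed, num_future, step, init_state):
    """Free-running prediction of future beam indices.

    step(index, state) returns (output_vector, new_state) for one cell.
    """
    state = init_state
    scores = None
    for index in observed:              # observed indices update the state
        scores, state = step(index, state)
    predictions = []
    for k in range(num_future):         # predictions are fed back as inputs
        predicted = argmax(scores)
        predictions.append(predicted)
        if k < num_future - 1:
            scores, state = step(predicted, state)
    return predictions, state


def toy_step(index, state, num_beams=16):
    """Toy cell: puts all the score on the next index and counts its calls."""
    scores = [0.0] * num_beams
    scores[(index + 1) % num_beams] = 1.0
    return scores, state + 1


predictions, calls = predict_future(list(range(8)), 5, toy_step, init_state=0)
print(predictions, calls)
```

With 8 observed indices and 5 future beams, the cell is evaluated 12 times in total, which matches the number of recursions of the LSTM cell in the first method.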
3) Method with 2D LSTM Network: When we apply the 2D LSTM network as the predictive network, the third method can be realized, as shown in Fig. 5. In this method, we input the embedded vectors of the first 8 beam indices into the LSTM network and get 5 output vectors directly. The training process is the same as the testing one.

D. Experiment
In this section, we evaluate our three proposed methods on the ViWi-BT dataset.
1) Dataset: The ViWi-BT dataset contains a training set with 281100 samples, a validation set with 120468 samples, and a test set with 10000 samples. There are 13 pairs of consecutive beam indices and the corresponding images of street views in each sample of the training set and validation set. The first 8 pairs are the observed beams for the target user and the sequence of images in which the target user appears, and the last 5 pairs are ground truth label pairs, i.e., they contain the future beams of the same user and the corresponding images. In this experiment, the first 8 pairs serve as the inputs of the designed DL network to generate the predicted future 5 beam indices, which are compared with the last 5 given beam indices.
2) Implementation Details: We first use the pre-trained ResNet152 and 3D ResNext101 to extract 2048-dimensional visual and 8192-dimensional motion features from the first 8 images of each sample. The merged features are embedded as a 463-dimensional vector and fed to the predictive LSTM network. Each LSTM cell has a 512-dimensional hidden state and a 129-dimensional output vector. Then the training pipeline mentioned in Section III-C is implemented to train the proposed network.
During the training, the designed DL network is optimized by the Adam optimizer. The learning rate is initially set to 4 × 10^-4 and is halved every 8 epochs. The batch size is set to 256. The cross-entropy loss is used as the loss function.
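The learning rate schedule and the loss can be written down directly. The sketch below assumes that epochs are counted from 0 and that the loss is computed from unnormalized scores for a single sample; the function names are hypothetical.

```python
import math


def learning_rate(epoch, base_lr=4e-4, step=8, factor=0.5):
    """Step decay: start at base_lr and halve it every `step` epochs."""
    return base_lr * factor ** (epoch // step)


def cross_entropy(scores, target):
    """Cross-entropy loss of one sample from unnormalized scores."""
    largest = max(scores)  # subtracted for numerical stability
    log_sum = largest + math.log(sum(math.exp(s - largest) for s in scores))
    return log_sum - scores[target]


for epoch in (0, 7, 8, 16):
    print(epoch, learning_rate(epoch))
print(cross_entropy([0.0, 0.0], target=0))  # two equally likely classes
```

Under this schedule the learning rate stays at 4 × 10^-4 for epochs 0 to 7, drops to 2 × 10^-4 at epoch 8, and to 1 × 10^-4 at epoch 16.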
3) Performance: Following the evaluation in [6], the performance of our proposed methods is evaluated on the validation set with the same metrics, namely the exponential decay score and the top-1 accuracy. Their detailed expressions are given by Equations 6 and 7 in [6]. Our results are listed in Table I, in which the baseline method in [6] is included for comparison purposes. In the baseline method, the authors leveraged only the beam index data and ignored the image data.
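The two metrics can be illustrated with a short sketch. The exact definitions are those of Equations 6 and 7 in [6]; the forms below are assumptions for illustration only: a sample counts as correct for top-1 accuracy only if all of its predicted indices match, and the score decays exponentially with the L1 distance between the predicted and true index sequences, normalized by the number of future beams n and a penalization parameter sigma (assumed to be 0.5 here).

```python
import math


def top1_accuracy(predicted, truth):
    """Fraction of samples whose predicted index sequence matches exactly."""
    hits = sum(1 for p, t in zip(predicted, truth) if list(p) == list(t))
    return hits / len(truth)


def exp_decay_score(predicted, truth, sigma=0.5):
    """Average of exp(-L1 error / (sigma * n)) over all samples (assumed form)."""
    total = 0.0
    for p, t in zip(predicted, truth):
        n = len(t)
        l1_error = sum(abs(a - b) for a, b in zip(p, t))
        total += math.exp(-l1_error / (sigma * n))
    return total / len(truth)


predicted = [[1, 2, 3], [4, 5, 6]]
truth = [[1, 2, 3], [4, 5, 7]]
print(top1_accuracy(predicted, truth))
print(exp_decay_score(predicted, truth))
```

Unlike the top-1 accuracy, a score of this kind still gives partial credit to a prediction that misses the true beam index by a small amount.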
From the exponential decay scores, we can see that our proposed methods with the 1D LSTM network and the modified 1D LSTM network clearly outperform the baseline method.
The method with the 2D LSTM network is better than the baseline on '1 future beam' and '3 future beams', while it is a little worse on '5 future beams'.
For the top-1 accuracy, the designed method with the 1D LSTM network also outperforms the baseline method. The method with the modified 1D LSTM network is better than the baseline method on '1 future beam' and '3 future beams'. The method with the 2D LSTM network only performs better than the baseline method on '1 future beam'.
In summary, among the three proposed methods, the method with 1D LSTM network shows the best beam prediction for the target mobile scenarios.

IV. CHALLENGES AND OPEN PROBLEMS
Although an example of leveraging CV to tackle the mmWave beamforming problem has been elaborated in the previous sections, there remain some challenges and open problems in applying DL-based CV technologies to wireless communications, as discussed below.

A. The Building of Datasets
DL is extremely data-hungry. A large dataset is a prerequisite for the successful application of DL-based CV techniques to wireless communications. A qualified dataset in CV usually includes more than ten thousand samples. For example, there are more than 14 million images in ImageNet, 60,000 images in CIFAR-10, and roughly 650,000 video clips in Kinetics. It takes much time, money, and labor to generate such a huge amount of visual data. Moreover, a qualified dataset should be comprehensive and exhibit a balanced diversity of data, and building one still has a long way to go.
The data should represent all possibilities in the corresponding problem, and the amounts of the different kinds of data should not differ too much. Usually, a dataset consists of a training set, a validation set, and a test set. These three sets should be homogeneous and non-overlapping. So it is better to randomly sample from a shuffled data pool to obtain the three sets. The data should also be well organized and easy to manipulate. Hence, building a satisfactory dataset is normally the hardest work in DL.
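Sampling the three sets from a shuffled pool can be sketched as follows. The fractions, the fixed seed, and the name split_dataset are assumptions for illustration.

```python
import random


def split_dataset(samples, val_fraction, test_fraction, seed=0):
    """Shuffle the pool once, then cut it into disjoint train/val/test sets."""
    pool = list(samples)
    random.Random(seed).shuffle(pool)   # fixed seed for a reproducible split
    n_test = int(len(pool) * test_fraction)
    n_val = int(len(pool) * val_fraction)
    test = pool[:n_test]
    val = pool[n_test:n_test + n_val]
    train = pool[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100), val_fraction=0.25, test_fraction=0.25)
print(len(train), len(val), len(test))
```

Because the three sets are cut from one shuffled list, they cannot overlap, and each of them is drawn from the same distribution as the pool.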

B. The Selection of CV Techniques
There are many state-of-the-art DL techniques that have been proved efficient and powerful in CV, like reinforcement learning, the encoder-decoder architecture, the generative adversarial network (GAN), the Transformer, the graph convolutional network (GCN), etc. Reinforcement learning can be utilized to tackle optimization problems [14]. GCN can be leveraged to address network-related issues [15]. The encoder-decoder architecture is widely used in semantic segmentation and sequence-to-sequence tasks. The GAN is a powerful generative framework, in which a generator and a discriminator are trained against each other to learn the statistics of the training data, and it has been widely used to improve the performance of other DL networks in CV [1]. The Transformer is an attention-based sequence model that, unlike a recurrent neural network, does not process its input step by step. It can be used to replace the 2D LSTM network. Many CV studies have shown that if these techniques are jointly applied to make full use of the visual data, better results can be obtained [9], [11]. So, a single proper CV technique or an adequate combination of several CV techniques is required to deal with a specific problem in wireless systems. In the example given in Section III-B, we combine ResNet, 3D ResNext, and the LSTM network to achieve the required performance. How to find proper and efficient CV techniques thus remains an open problem.

C. Open Problems on Vision-aided Wireless Communications
In the previous sections, the problem of beam and blockage prediction in mmWave MIMO communication systems has been discussed. As there are many kinds of cameras and LiDAR sensors operating in real life, a huge amount of visual data can be obtained through them. More accurate motion and position information of the terminals can then be recognized, analyzed, and extracted from these multimedia data, which can be exploited to facilitate the design and optimization of wireless communications. Thus, some open problems in wireless communication scenarios are introduced and discussed as follows:
(1) Cellular network: Visual data obtained at the BS in the cellular network may contain the location, number, and motion information of the terminals in the open area. This information can be used by the BS to adjust its transmit power and beam direction to save power and reduce interference. For example, the motion information of the users at the edge of the coverage area can be utilized to forecast and judge whether/when a terminal will leave or enter the serving area, and then accurate channel resource allocation can be set up for the handover process to improve the utilization efficiency of the system resources.
(2) Vehicle-to-everything communications: Visual data captured by one vehicle can reveal its surrounding environment, such as traffic conditions. Thus, it can be used to set up the links with the neighboring terminals, access points, and vehicles. Therefore, traffic scheduling and jam/accident alarms can be conducted for improved road safety, traffic efficiency, and energy savings.
(3) UAV-ground communications: When a UAV serves as an aerial BS, visual data captured by the UAV can be used to identify the locations and distribution of ground terminals, which can be utilized in power allocation, route/trajectory planning, etc. Moreover, when a ground BS communicates with several UAVs, visual data captured by the ground terminal can be used to define the serving range, allocate the channels/power, etc.
(4) Smart cities: Visual data captured by satellites or airborne craft can be applied to recognize and analyze the users' distribution and to schedule the power budget/serving range to achieve optimal energy efficiency.
(5) Intelligent reflecting surface (IRS): Usually, it is impossible to implement channel estimation and acquire network state information at the IRS, because the IRS has neither sufficient computational capacity nor RF signal transmitting and receiving capabilities. Fortunately, DL-based CV is capable of offering such useful information to make up for this gap. Thus, a proper control matrix can be optimally designed to accurately reflect the incident signals to the target destination by utilizing the visual data captured by a camera installed on the IRS, which includes the locations and the number of terminals.

V. CONCLUSION
This article mainly presented the methodologies, opportunities, and challenges of applying DL-based CV to wireless communications. First, we discussed the feasibility of applying DL-based CV in the physical, MAC, and network layers of wireless communication systems. Second, an overview of related datasets and work was presented. Third, we gave an example of applying DL-based CV to an mmWave MIMO beamforming system. In this example, previously observed images and beam indices were leveraged to predict future beam indices by using ResNet, 3D ResNext, and the LSTM network. Experimental results showed that visual data can significantly improve the accuracy of beam prediction. Moreover, the challenges and possible research directions were discussed and elaborated. We hope our work will stimulate more research innovations and fruitful results in the future.
Manuscript received **, 2020; revised **, 2020; accepted **, 2020. The associate editor coordinating the review of this paper and approving it for publication was ***. (Corresponding author: Gaofeng Pan.) The authors are with the Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.


TABLE I. PERFORMANCE OF EXPONENTIAL DECAY SCORES AND TOP-1 ACCURACY