Learn to Adapt to New Environments From Past Experience and Few Pilot Blocks

In recent years, deep learning has been widely applied in communications and has achieved remarkable performance improvements. Most existing works are based on data-driven deep learning, which requires a significant amount of training data for the communication model to adapt to new environments and consumes huge computing resources for collecting data and retraining the model. In this paper, we significantly reduce the amount of training data required for new environments by leveraging the learning experience from known environments. To this end, we introduce few-shot learning to enable the communication model to generalize to new environments, which is realized by an attention-based method. With the attention network embedded into the deep learning-based communication model, environments with different power delay profiles can be learnt together in the training process, which is called the learning experience. By exploiting the learning experience, the communication model requires only a few pilot blocks to perform well in a new environment. Through an example of deep learning-based channel estimation, we demonstrate that this novel design method achieves better performance than the existing data-driven approach designed for few-shot learning.


I. INTRODUCTION
Deep learning (DL) can address the intricate correlation among variables, especially those that are difficult to describe accurately with mathematical models [1], which allows us to design wireless communication systems without requiring expert knowledge. Therefore, it has been used to develop communication systems and has received widespread attention for its effectiveness [2]. For the data-driven method in [3], a deep neural network (DNN) is adopted to replace the channel estimator and the signal detector in the orthogonal frequency division multiplexing (OFDM) receiver. The end-to-end designs in [2] and [4] use two DNNs representing the transmitter and receiver, respectively. Recently, the spirit of the end-to-end model has been extended to semantic communication [5]. These DL-based wireless systems have demonstrated impressive performance in the additive white Gaussian noise (AWGN) channel [3] and in frequency-selective channels [2]. However, only a single communication environment is considered in [2] and [3]. It remains a challenge for DL-based communication systems to adapt to new environments with different power delay profiles (PDPs) or distortions.
DL has been highly successful in data-intensive applications but is often hampered by a small available training dataset [6]. DL-based channel estimation (CE) is often purely data-driven. For any particular channel propagation environment, a large number of pilot blocks or labelled data are required for training, which is usually performed offline.
Using a small dataset to train a DNN with a large number of parameters can easily lead to over-fitting [1]. Few-shot learning (FSL) has been proposed to tackle this issue. Using prior knowledge, it is possible to quickly generalize a model to a new task with only a few available samples [6]. Meta-learning has been the most common framework for FSL in recent years. In [7], common initialization parameters that enable fast training on any channel have been found using the meta-learning approach. From [7], a significant training speed improvement and an efficient communication model can be obtained through only one iteration of gradient descent. Transfer learning (TL) can also be a solution for FSL tasks. Existing applications in wireless communications include deep TL-based signal detection for backscatter communication networks [8] and a TL-enabled convolutional neural network (CNN) for 5G industrial edge networks [9]. Accretionary learning [10] is designed to accumulate learned knowledge and acquire new knowledge. The idea of accretionary learning can be applied to achieve the goal of FSL, where knowledge is learned and accumulated independently during offline training. In accretionary learning, a few new parameters are trained and combined with learned knowledge to acquire new knowledge online. In [11], an online training system, called SwitchNet, can capture the features of a new propagation environment. In SwitchNet, multiple DNNs are pre-trained, each for a specific propagation environment. Only a small dataset is required to linearly combine the outputs of those DNNs in online adaption. However, the propagation environments tested for this method are limited in [11]. In addition to meta-learning and accretionary learning, which can potentially be used to realize FSL, model-driven methods also require less data for training due to fewer trainable parameters.
In [12], the adaptivity of the channel estimator is enhanced by designing a hypernet to generate parameters for a model-driven wideband mmWave system. However, for data-driven CE, the parameter set is enormous, which makes it challenging for the hypernet to generate all parameters.
In this paper, we focus on a data-driven CE system that can be quickly adapted to a new environment. In Section II, we introduce the existing DL-based approach for CE and briefly review the attention mechanism. Section III formulates the problem and introduces the mechanism for attention generators. In Section IV, we present the FSL method. In Section V, we compare our FSL method with other related ones. Section VI concludes the paper and discusses the potential of developing FSL with some other techniques. The main contributions of this paper are as follows:
• We propose to use the attention mechanism for the CE model to realize FSL, where attention networks generate weights for each feature vector under multiple rules for dynamic adjustment. To the best of our knowledge, this is the first work to introduce the attention mechanism in wireless communications to realize FSL.
• We design a task-attention model to enhance the generalization ability for variously distributed training data and improve testing performance in the new environment. By using a few pilot blocks in the new environment, the task-attention network adds the knowledge of the new environment to feature maps by generating attention vectors in the channel domain.
• We introduce the cross-attention mechanism to find the correlation between the support and the query blocks in the spatial domain, which leads to higher estimation accuracy in initialization blocks. This is the first work applying the cross-attention mechanism to wireless communication problems.
• We propose a cross-attention-model-embedded initialization network to produce the input of the CE backbone. The query block is first initialized according to the efficient feature embedding from the support blocks and then sent into the CE backbone. We develop a method that takes the most advantage of the support blocks to improve CE for the query block.

II. RELATED WORKS
In this section, we briefly review recent advances in DL-based CE and introduce the working mechanism of SwitchNet, which is closely related to our work. Furthermore, the attention mechanism and its applications in wireless communication are described briefly.

A. Deep Learning in CE
DL has emerged as an effective tool for CE in wireless systems. A new DL-based receiver design, ComNet, has been proposed for an OFDM system to deal with frequency-selective fading channels [13], where two cascaded DNNs are utilized for CE and signal detection (SD), respectively. For ComNet, the estimated channel is used to recover the transmitted data. With expert knowledge embedded, ComNet is more predictable and explainable than the fully-connected DNN-based receiver in [3], which treats the joint channel estimator and signal detector as a black box. Due to the strong fitting ability of the DNN and the end-to-end training mechanism, the SD network can still recover the transmitted signal even if the output of the CE network is far away from the real channel. Therefore, the output of CE can be regarded as a feature representation rather than accurate instantaneous channel coefficients.
A DNN-based approach for channel sensing and downlink hybrid beamforming is proposed in [14], which is generalized for any number of users. The multi-user cascaded CE is formulated as a denoising process in [15]. CNN with the deep residual framework estimates channel coefficients from the noisy pilot-based observations.
Recently, SwitchNet has been proposed to provide more accurate CE and enable the system to adapt quickly to a new environment. The architecture of SwitchNet is shown in Figure 1. It consists of CE and SD parts and is similar to the conventional receiver. In [11], the structure of the CE network is designed for online adaption. The CE network consists of least-square (LS) CE and five CE SubNets, which are all implemented by neural networks. The CE SubNet outputs are linearly combined with parameters α. CE SubNet 0 performs the basic CE. The other CE SubNets are used as compensating networks for CE SubNet 0 to adapt to the propagation environments in the training set. Each compensation network aims at a specific propagation environment. During online adaption, each compensation network is governed by a trainable parameter α_i that controls its contribution to fitting the new environment. Since there are only a few trainable variables during testing for the new environment, only a small batch of samples is required for the adaption. Denote W_i and θ_i as the multiplicative parameter weights and biases of CE SubNet i, and let f_i(·; W_i, θ_i) denote its network function. The estimated channel can then be expressed as

h_est = f_0(h_ls; W_0, θ_0) + Σ_{i=1}^{5} α_i f_i(h_ls; W_i, θ_i),

where h_ls denotes the channel coefficients calculated through LS. After the more accurate channel coefficients h_est are estimated, the SD network is employed to recover the transmitted data. In order to compare the CE performance with our proposed method, only the CE part of SwitchNet will sometimes be used. The above describes the offline process. The online adaption aims at learning a combination of the CE SubNets using the support blocks, where true channel coefficients are known for the SwitchNet support blocks. The online fine-tuning process learns a set of α to minimize the mean-squared error between the true channel coefficients and the output of the CE network, h_est.
Such an online adaption mechanism allows the model to be adapted to different propagation environments and makes the system more robust than conventional DL-based communication systems.
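To make the adaptation step concrete, the following sketch fits only the combination weights α on a few support blocks while the pre-trained SubNets stay frozen. It is a simplified toy version, not the networks from [11]: the SubNets are stand-in functions, the weights are fitted by plain gradient descent on the mean-squared error, and (unlike the formulation above) every SubNet, including the basic one, receives a weight.

```python
import numpy as np

def adapt_alphas(h_ls_batch, h_true_batch, subnets, lr=0.01, steps=300):
    """Fit only the combination weights alpha on a few support blocks,
    keeping the pre-trained CE SubNets (subnets) frozen."""
    K = len(subnets)
    alphas = np.zeros(K)
    for _ in range(steps):
        grad = np.zeros(K)
        for h_ls, h_true in zip(h_ls_batch, h_true_batch):
            outs = np.stack([f(h_ls) for f in subnets])   # (K, N) SubNet outputs
            err = alphas @ outs - h_true                  # combination error
            grad += 2.0 * outs @ err / len(h_ls_batch)    # d(MSE)/d(alpha)
        alphas -= lr * grad
    return alphas
```

Because only K scalars are trained, a single small batch of support blocks suffices, which is exactly what makes the online stage cheap.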

B. Attention Mechanism
Convolutional neural networks (CNNs) perform classification or regression tasks by generating feature maps for samples. Traditional CNNs often treat the extracted features equally, limiting the generalization of a model in some cases. When the testing samples come from highly separate distributions, like instantaneous channel coefficients from different environments, the importance of these extracted features differs. For example, features for coping with channel distributions within a specific angular region might have limited or even negative impact on estimating channels from another region or environment. Therefore, the attention mechanism is employed to generate weights for each extracted feature vector so that more critical features make more significant contributions [16].
The theory of attention in DL was first proposed in [17], which adopts a recurrent neural network (RNN) and reinforcement learning (RL) to obtain attention in the spatial domain. The attention mechanism then became popular with the introduction of the squeeze-and-excitation network (SENet) [18] and multi-head attention [19]. SENet is designed to adaptively recalibrate channel-wise features using attention weights and has advanced computer vision. Our proposed attention approach is developed based on this theory. We use the term self-attention to refer to the working mechanism of SENet subsequently.
The attention mechanism has been employed to help DL-based communication systems, such as in channel state information (CSI) compression [20], channel compression [21], and CE [16], [22]. In [16], it enhances the estimation accuracy for channels with highly separate distributions. Therefore, it has the potential to improve the learning capacity of the DL model by accumulating more experience from learning on datasets with different distributions.

III. PROBLEM FORMULATION
In this section, we introduce our FSL problem scenario and provide the necessary background on cross attention and task attention in FSL applications.

A. Impact of PDPs
The mobile communication channel in our problem scenario is time-varying multipath propagation, which leads to severe dispersion of the transmitted signal. As a multipath propagation channel, the channel impulse response (CIR) at time t can be written as

h(t, τ) = Σ_{l=1}^{L} A_l(t) e^{jφ_l(t)} δ(τ − τ_l),

where L is the number of resolvable paths, and A_l, φ_l, and τ_l represent the attenuation, phase shift, and delay of the l-th path. The PDP indicates the distribution of transmitted power over the various paths in propagation [23]. The channel PDP is calculated from the spatial average of |h(t, τ)|² over a local area and represents small-scale multipath channel statistics. The mean power of each multipath component depends on the propagation delay τ_l and is defined by a PDP P(τ_l) [24]. Furthermore, we employ α_l(t) to describe the fading characteristic of the channel. Therefore, we can use the PDP P(τ_l) and the fading function α_l(t) to express the path attenuation. The received signal from the multipath channel can be written as

y(t) = Σ_{l=1}^{L} √P(τ_l) α_l(t) x(t − τ_l) + n(t),

where x(t) refers to the transmitted signal, n(t) represents AWGN, and α_l(t), with E[|α_l(t)|²] = 1, models the time-variant fluctuation of the path attenuation. Channels with the same PDP share the same P(τ_l) in the path attenuation.
This paper defines a propagation environment as an area that shares the same PDP. We consider high Doppler spread in our simulation, where the coherence time is short and instantaneous channel coefficients vary rapidly. However, instantaneous channels from the same propagation environment share the characteristics of the PDP.
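To make the role of the PDP concrete, the toy sketch below draws instantaneous channel taps whose mean powers follow P(τ_l) while the unit-power fading coefficients α_l vary per realization. The PDP values and the Rayleigh-fading assumption are illustrative, not taken from the WINNER model.

```python
import numpy as np

def sample_channel(pdp, rng):
    """Draw one instantaneous multipath realization for a given PDP.

    Each tap is sqrt(P(tau_l)) * alpha_l, where alpha_l is a unit-power
    complex Gaussian fading coefficient, so E[|alpha_l|^2] = 1 and the
    mean tap power follows the environment's PDP.
    """
    pdp = np.asarray(pdp, dtype=float)
    alpha = (rng.standard_normal(len(pdp))
             + 1j * rng.standard_normal(len(pdp))) / np.sqrt(2)
    return np.sqrt(pdp) * alpha

# Two toy environments: identical path count, different power profiles.
rng = np.random.default_rng(0)
pdp_env_a = np.array([0.6, 0.25, 0.1, 0.05])   # dominant first path
pdp_env_b = np.array([0.4, 0.3, 0.2, 0.1])     # flatter profile
h = sample_channel(pdp_env_a, rng)
```

Averaging |h_l|² over many realizations recovers the PDP, which is exactly the sense in which channels from one environment "share" P(τ_l) while their instantaneous coefficients differ.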

B. Problem Scenario
We assume that the channel estimator is designed and trained offline for several propagation environments. Then we investigate how to use the previous experience together with a few pilot blocks to estimate the channel in a new environment. In the beginning, all symbols in the blocks transmitted in the new environment are pilots; these blocks are referred to as support blocks T_support. We can learn the features of the new environment well from these pilot blocks, which can be done when mobile devices are on standby. Then the pilot portion in each block is decreased so that transmission becomes more efficient. Such blocks with fewer pilots are called query blocks S_query. Due to the high Doppler spread, T_support and S_query may differ more significantly than in slow time-varying channel cases.
As shown in Figure 2, each support block in T_support includes all w pilots of an OFDM block, while each query block in S_query only contains w/4 pilots. The communication system learns environmental features offline from T_support. In the online adaption, these features are employed to enhance the CE from S_query. The communication system can estimate the channel H^i_est from the i-th query block and then recover the transmitted data x^i_est in the i-th data block. The accuracy of CE plays a decisive role in transmitted data detection. Therefore, we develop an FSL approach to quickly adapt the communication system to the new environment. With this FSL approach, we can still achieve good CE results for query blocks in the new environment with the guidance of a few support blocks. Support blocks and query blocks in our problem are different from those in conventional FSL problems. In the conventional setting, the support dataset comes with true labels or values, while the query dataset is unlabelled. In our case, the support blocks, T_support, have no corresponding true values since it is hard to obtain accurate channel coefficients in real communication systems due to various inherent uncertainties in wireless channels. We only use channel coefficients estimated by LS from the pilot blocks. Support blocks in T_support have more pilots than the query blocks, which means the channel estimated from T_support has higher accuracy. Therefore, T_support contains more channel information, and more accurate environment features can be extracted from it than from S_query. Since instantaneous channels from the same propagation environment share many features, the environment features extracted from T_support can be employed to enhance the CE for S_query.
The symbol-spaced multipath channel is described by the complex random variables {h(n)}_{n=0}^{N−1}, while x(n), y(n), and ω(n) represent the discrete transmitted pilot signal, received pilot signal, and AWGN, respectively. N refers to the length of the pilot blocks, which equals w in Figure 2. After performing the discrete Fourier transform (DFT), the received pilot signal in the frequency domain can be written as

Y(k) = X(k) H(k) + Ω(k),

where Y(k), X(k), H(k), and Ω(k) are the DFTs of y(n), x(n), h(n), and ω(n), respectively. The support and query blocks are estimated through LS from the pilot blocks as

H_LS(k) = Y(k) / X(k).
Since S_query only contains w/4 pilots, interpolation is then performed to obtain the estimated channel response at all OFDM symbols. The two estimated channel coefficients from adjacent pilot symbols are employed for CE on the data between them.
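The LS step followed by interpolation can be sketched as below; the pilot positions and array shapes are illustrative assumptions, and real and imaginary parts are interpolated separately since np.interp works on real values.

```python
import numpy as np

def ls_and_interpolate(X, Y, pilot_idx, w):
    """LS channel estimate on pilot subcarriers, H_LS = Y / X,
    then linear interpolation over all w subcarrier positions."""
    h_pilot = Y[pilot_idx] / X[pilot_idx]
    k = np.arange(w)
    # Interpolate real and imaginary parts separately.
    h_real = np.interp(k, pilot_idx, h_pilot.real)
    h_imag = np.interp(k, pilot_idx, h_pilot.imag)
    return h_real + 1j * h_imag
```

For positions between two pilots this reproduces the "two adjacent pilot estimates" rule described above; beyond the last pilot, np.interp holds the edge value.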
We demonstrate the process of estimating the channel H_est from the query block in Figure 2. First, we use pilots to estimate H_LS through LS. Then we obtain the real-valued vector S_query that consists of the real and imaginary parts of H_LS. S_query is sent to the CE network to generate the accurately estimated channel H_est. The process of CE is as follows: the task-attention model learns the environment features from T_support offline and generates attention vectors to provide dynamic adjustment during the estimation process by the CE backbone. The cross-attention-based initialization model employs T_support to initialize S_query through feature embedding, and the initialized S_query is the input of the CE backbone. The detailed working mechanism will be introduced subsequently.
The essence of our FSL problem is to use a small number of pilot blocks with more information (T_support) to guide blocks with less channel information (S_query) for efficient CE in new environments. With only a small number of support blocks, the estimation accuracy of the query blocks can be close to the testing accuracy boundary, which is obtained by testing the query blocks with a model trained with sufficient pilot blocks of the same environment. The approach to achieve the goal of FSL is to find useful features from a small number of support blocks and then use these features to improve the accuracy of the channel estimated from the query block.
The attention mechanism is employed to realize this approach. Task attention uses global environment features extracted from all support blocks to provide dynamic adjustment in the process of estimating channels from S_query. In comparison, cross attention uses the local feature correlation between each support and query block to enhance channel feature embedding. For each S_query, we select different channel features from T_support and embed them in S_query to improve CE. This process is called CE guidance. Such guidance depends on the individual pilot block, while the adjustment from task attention is only related to the environment. Both attention approaches have meta-learners to guarantee effectiveness when deployed to new environments. These two attention mechanisms will be introduced subsequently.

C. Cross-Attention
Conventional attention models (e.g., SENet) are not effective for FSL since they usually find out important features of the test samples only based on the prior of the training dataset and result in poor generalization performance to unseen samples. Furthermore, a large number of samples are required to learn the attention mechanism for any specific class. Instead of using each sample's own feature map to draw attention, the cross-attention model (CAM), proposed in [25], employs semantic relevance between support and query set features to generate attention maps in pairs. The query set contains the unlabeled samples from unseen classes, while the support set has few labelled samples from the corresponding classes. With such an approach, only few labelled samples are required to highlight the important regions so that more discriminative features can be extracted.
The CAM is illustrated in Figure 3 (a) for our problem scenario. The class feature map, P ∈ R^(n×w×c), is extracted from the support blocks, and the testing feature map, Q ∈ R^(w×c), is extracted from the query block, where c and w refer to the number of channels and the width of the feature map generated from one block, and n represents the number of support blocks. For each testing feature map, there are n class feature maps to help extract attention vectors. The CAM produces cross-attention weights A^p (A^q) in the spatial domain for the feature map P (Q). The testing feature map computes the cross-attention with each class feature map, so a total of n pairs of attention maps are generated. The testing feature map, Q, is duplicated n times in order to be weighted with the different attention maps.
First, the correlation map between the feature maps P and Q is calculated for the generation of the cross-attention map. The correlation map R represents the semantic relevance, measured by cosine distance, between p_i and q_j, where p_i, q_j ∈ R^c are the feature vectors at the i-th and j-th spatial positions of P and Q, respectively. The correlation layer computes

R_ij = (p_i / ||p_i||_2)^T (q_j / ||q_j||_2), R ∈ R^(nw×w).

Two correlation maps, R^p and R^q, are derived from R, where r^p_i ∈ R^w denotes the correlation between the local support-block feature vector p_i and all query-block feature vectors {q_j}_{j=1}^{w}, and r^q_j ∈ R^(nw) is the correlation between the query-block feature vector q_j and all support-block feature vectors {p_i}_{i=1}^{nw}. Then, a fusion layer is employed to generate attention maps based on R^p and R^q. The fusion layer applies global average pooling (GAP) to the correlation map and sends the result into a meta-learner to generate a kernel w_kernel in the spatial domain. A convolutional operation between the correlation map and the kernel, followed by a softmax, yields the attention map:

w_kernel = W_2 σ(W_1 GAP(R^p)),
A^p_i = exp(w_kernel^T r^p_i / τ) / Σ_j exp(w_kernel^T r^p_j / τ),

where A^p_i represents the attention value at the i-th position, τ is the temperature hyperparameter in the softmax function, W_1 and W_2 are the parameters of the meta-learner, and σ represents the ReLU function. The meta-learner generates the kernel w_kernel, which aggregates the correlations between the two feature maps P and Q in order to draw attention to the target objects. The testing attention map, A^q, can be obtained in a similar way. Finally, we weight the initial feature maps, P and Q, element-wise by 1 + A^p and 1 + A^q to form more discriminative feature maps.
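A minimal numerical sketch of this fusion step is given below. For simplicity it assumes a single support block with the same width w as the query and randomly initialized meta-learner weights W1 and W2 (so one shared kernel serves both attention maps); it is an illustration of the mechanism, not the trained CAM of [25].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_attention(P, Q, W1, W2, tau=0.1):
    """Cross-attention between support map P and query map Q, each (w, c).

    Cosine correlation map -> GAP -> two-layer meta-learner kernel ->
    temperature softmax -> residual (1 + A) re-weighting of both maps.
    """
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    R = Pn @ Qn.T                                        # (w, w) cosine correlations
    kernel = W2 @ np.maximum(W1 @ R.mean(axis=0), 0.0)   # meta-learner kernel, (w,)
    A_p = softmax(R @ kernel / tau)                      # support attention, (w,)
    A_q = softmax(R.T @ kernel / tau)                    # query attention, (w,)
    return P * (1.0 + A_p)[:, None], Q * (1.0 + A_q)[:, None]
```

Note the residual form 1 + A: every spatial position keeps its original features, and attention only amplifies the positions the kernel deems relevant.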

D. Task-Attention
The idea of task attention is borrowed from the task-aware feature embedding network (TAFE-Net) in [26], which is proposed to handle FSL in image classification. TAFE-Net consists of a meta-learner and a prediction network backbone that classifies unlabelled images. The meta-learner module focuses on extracting feature embeddings from a few labelled images for the particular task and generating task-specific weights for each layer in the prediction network. TAFE-Net has shown promising results in data efficiency improvement for FSL. In our experimental settings, we estimate the channel from the query block, S_query, in the CE backbone, which acts as the prediction model in TAFE-Net. By following the working mechanism of the meta-learner in TAFE-Net, we develop a task-attention model (TAM) to improve the estimation accuracy.
In Figure 3 (b), the TAM employs an extra CNN to extract features of the support blocks. Then the extracted features are sent to a single-layer perceptron to generate the attention vectors a_task for the CE backbone. 2D convolutional layers are employed for the CNN since the input shape of the TAM is n × w × k, where k equals two because each complex channel coefficient is split into two real values. The TAM in Figure 3 (b) can be formulated as

a^i_task = Sigmoid(W^i_FC f_CNN(T_support)),

where f_CNN represents the multi-convolutional-layer network function and W^i_FC denotes the weights of the parameter generator with a single fully-connected (FC) layer. The sigmoid function is utilized to limit the range of the parameters. The output attention vector for the i-th backbone layer output is represented as a^i_task, as shown in Figure 3 (b).
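The formulation above can be sketched as follows, with a global average pool standing in for f_CNN (an assumption for brevity, not the paper's actual CNN) and one FC weight matrix per backbone layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def task_attention(t_support, W_fc_list):
    """Toy TAM: pool support-block features into one task embedding,
    then a single FC layer + sigmoid per backbone layer.

    t_support : (n, w, k) stacked support blocks.
    W_fc_list : one (c_i, k) weight matrix per backbone layer i.
    Returns one attention vector a_task^i in (0, 1)^{c_i} per layer.
    """
    embedding = t_support.mean(axis=(0, 1))        # (k,) stand-in for f_CNN
    return [sigmoid(W @ embedding) for W in W_fc_list]
```

Because the embedding is pooled over all support blocks, the resulting attention vectors depend only on the environment, not on any individual query block.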

IV. FSL FOR CE
Instead of using a large amount of data, we propose an FSL approach to allow the channel estimator to adapt to the new environment with only few pilot blocks. In this section, we are going to present the attention-based CE method in detail, where the overview of the algorithm and the working mechanism of each attention model will be demonstrated.

A. Structure of FSL-Based CE
As shown in Figure 4, the FSL-based channel estimator consists of three parts: the CE backbone, the CAM-based initialization network, and the TAM. The CE backbone estimates channel coefficients from the query blocks. CNNs are known for being good at exploiting correlation in the input data and are widely employed in CE applications in the frequency domain [27] and the angular domain [16]. A multi-layer one-dimensional (1D) CNN is used for the CE network due to the shape of the input query blocks, as shown in Figure 4. We design the initialization network to initialize the query blocks before sending them to the CE backbone. The initialized query blocks, s_initial, are used as the input of the CE backbone. The initialization network allows support blocks to provide guidance for the query block, and the CAM is embedded to enhance such guidance. The TAM helps the backbone to improve robustness. It selects important features in the channel domain, while the CAM helps select features for the initialization model in the spatial domain. With the cooperation of the attention mechanism and the initialization model, the CE backbone can be quickly adapted to the new environment.
We should emphasize that the above CE backbone is trained offline for certain channel environments, such as indoor or outdoor channels. When the CE backbone is deployed online, its parameters, θ_CE, are fixed. If it is used in a new environment, the mismatch will cause significant performance degradation, which we address through FSL.
The detailed algorithm is presented in Algorithm 1, where f_CAM, f_CIN, f_TAM, and f_CE refer to the functions of the CAM, the convolutional initialization network (CIN), the TAM, and the CE backbone, respectively. From Algorithm 1, the CE backbone has N+2 layers. Except for the last layer, which directly generates the CE result, the output of every other layer is multiplied by the weights, a_task, generated by the TAM. The loss function considers two aspects, the output of the CE backbone and the output of the CAM-based initialization network, both of which measure the error with respect to the true channel coefficients s_true.
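The data flow of Algorithm 1 can be condensed into the following sketch. The component functions are placeholders for the trained networks, not their actual implementations; the point is the order of operations and the per-layer TAM re-weighting.

```python
import numpy as np

def ce_forward(s_query, t_support, f_cam, f_cin, f_tam, backbone_layers):
    """One forward pass: CAM-based initialization, then the CE backbone
    with TAM re-weighting applied to every layer output but the last."""
    a_task = f_tam(t_support)                      # one attention vector per layer
    s_initial = f_cin(f_cam(t_support, s_query))   # initialized query block
    x = s_initial
    for i, layer in enumerate(backbone_layers[:-1]):
        x = layer(x) * a_task[i]                   # channel-wise re-weighting
    return backbone_layers[-1](x)                  # final layer emits h_est
```

Since a_task is computed once from T_support, adapting to a new environment only requires re-running f_tam on a few new support blocks; the backbone itself stays untouched.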

B. TAM-Embedded CE
The support blocks from the same propagation environment are employed as the input of the TAM. The generated parameter vector, a_task, selects important feature vectors in the channel domain. When encountering a new environment, the TAM attempts to predict the PDP based on T_support, which is the most representative feature of each environment. The attention vector, a_task, is generated based on the predicted PDP. By performing channel-wise multiplication with the original feature map, the backbone can achieve a better generalization ability to the new environment.
The TAM plays another essential role in multi-environment adaption. Similar to the self-attention mechanism, the TAM-based method makes the entire deep learning model robust to data with different distributions. The difference is that self-attention generates weights based on local features (e.g., features of the sample itself), while the TAM uses global features (e.g., the PDP of the environment) to select important features. The original CE backbone cannot converge when the training dataset contains channel coefficients from different environments. However, with the TAM embedded, the system can converge to a good optimum with data from all environments.
It is challenging to analyze in detail how the TAM learns the weight vectors, a task , to improve the robustness. But we can try to explain the role of the attention mechanism in an implicit way. For the conventional neural network, all data with different distributions are processed by the same weights of the DNN and there is no dynamic adjustment to adapt to a specific distribution. Such a philosophy of DNN design limits the diversity of trainable data distributions. The operation of TAM is similar to the concept of "divide and conquer", where generating weight vectors for feature maps can be treated as a dynamic adjustment. With such an adjustment, the model can be adapted to the training data with various distributions.
Furthermore, the mechanism of this dynamic adjustment can be learnt by the attention network in the training process of adapting to various environments. Compared with TAM, self-attention has poor performance for blocks in the new environment. Because attention is generated based on the feature map of the block itself, self-attention can only find the basic features with the prior of previous training classes and lacks generalization ability to new environments. In contrast, we use TAM to enhance the generalization ability to the new environment. TAM performs as a meta-learner, which aims to learn to make the backbone generalized to various tasks. After being trained by a large number of pilot blocks from different environments, TAM can find out the environment features based on the experience when it encounters pilot blocks of the new environment. The environment-specific attention helps the backbone to be generalized to the new environment.

C. CAM-Based Initialization Network
One limitation of the TAM-embedded CE backbone is that the attention generated by the TAM is independent of the query blocks S_query. In other words, for the same propagation environment, different query instantaneous channels with various feature maps will have the same weights if the support blocks T_support are the same, which makes the attention maps less efficient. Furthermore, the TAM cannot introduce extra channel features from the support blocks to the query blocks since the TAM only re-weights existing features of the query blocks. Due to the larger number of pilots in each support block, we intend to develop a model that allows S_query to learn additional features, which are contained in those positions without pilots. This learning efficiency is further improved by exploiting the correlation between the support blocks T_support and the query blocks S_query.
As shown in Figure 4, we use a CAM-based initialization network to realize that goal. Both T_support and S_query are sent into a pre-trained feature extractor. Then the feature maps are used as input to the CAM to generate weighted feature maps in the spatial domain. Each pair of feature maps, P_i and Q_i ∈ R^(w×c), is concatenated to form a high-dimensional feature map F_initial. There are n pairs of P_i and Q_i in total, where n refers to the number of support blocks, as mentioned before. The new feature map, F_initial ∈ R^(2n×w×c), is processed by the CNN. With the convolution operation, the channel features at the positions without any pilots in the query block can be learned with the help of the support blocks in T_support. The output of the CNN is treated as the initialized query block, s_initial, and sent into the CE backbone for further processing.
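Concretely, the pairing-and-stacking step might look like the following; the shapes follow the notation above, while the interleaved pairing order is an assumption for illustration.

```python
import numpy as np

def build_initial_feature(P_list, Q_weighted_list):
    """Concatenate the n weighted (P_i, Q_i) pairs into F_initial.

    P_list, Q_weighted_list : n feature maps, each of shape (w, c).
    Returns F_initial of shape (2n, w, c), the input to the CNN.
    """
    pairs = []
    for P_i, Q_i in zip(P_list, Q_weighted_list):
        pairs.extend([P_i, Q_i])                   # keep each pair adjacent
    return np.stack(pairs)
```

Keeping each (P_i, Q_i) pair adjacent lets the subsequent convolution mix support and query features locally, which is how the pilot-free positions of the query block pick up information from the support blocks.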
The function of the CAM-based initialization network is to use T_support to guide CE for S_query. CAM focuses on the correlation established by cross-attention so that query blocks with less channel information can learn the rich channel information contained in the support blocks more efficiently. The meta-learner in CAM contributes to new environment adaptation, and its working mechanism is the same as that of the TAM described above. Through dynamic adjustment in the channel domain based on the global features and information enhancement in the spatial domain based on each pilot block, better estimation accuracy can be achieved through the cooperation of the CAM-based initialization network and the TAM.

V. EXPERIMENTS
In this section, the proposed attention-based CE system is trained and tested under different propagation scenarios. We first introduce the WINNER channel model employed for simulating the propagation effect in different environments. Then we describe the details of the experiment settings. We implement SwitchNet [11] and test it on the same training and testing sets to compare it with our proposed attention-based method. We show that the attention-based CE system outperforms SwitchNet. Furthermore, we explore how the CAM-based initialization network and the cross-attention mechanism help improve the testing performance. The testing accuracy boundary for the w/4-pilot case is also considered, where each testing environment has sufficient training data for the CE backbone to learn from, instead of only a few shots. With the help of the w-pilot blocks in T_support, our proposed method can approach the testing accuracy boundary or even exceed it.

A. WINNER Channel Model
All instantaneous channel coefficients in this experiment are generated by the WINNER channel model (WCM) [28], which has been adapted to various mobile communication scenarios ranging from local areas to wide areas. WCM uses spatial and temporal parameters obtained from the measured CIR to characterize different environments. The measured CIR for each propagation environment is analyzed and processed to obtain the environment-specific parameters [28], which can be used to simulate the propagation effect for that specific environment. There are twelve different propagation scenarios that WCM can emulate; we choose five of them as the training set, while another two are used for testing. In the simulation setting, the carrier frequency is 5.25 GHz, and both line-of-sight (LOS) and non-line-of-sight (NLOS) conditions are considered. The PDP varies across environments, which leads to different lengths of instantaneous channel coefficients. In order to facilitate the subsequent training of the DL-based CE model, we unify the number of channel coefficients to 72 and use zero padding for channels shorter than 72. Since channel coefficients are complex numbers, we split each one into two real numbers. Therefore, the number of real coefficients for each channel is 72 × 2.
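The padding-and-splitting step described above can be sketched as below, assuming the CIR arrives as a 1-D complex vector; the function name is ours, not the paper's.

```python
import numpy as np

MAX_TAPS = 72  # unified channel length across environments

def preprocess_cir(h):
    """Zero-pad a complex CIR to MAX_TAPS taps and split each complex
    coefficient into two real numbers, giving a (72, 2) array."""
    h = np.asarray(h, dtype=complex)
    padded = np.zeros(MAX_TAPS, dtype=complex)
    padded[: len(h)] = h
    # Real and imaginary parts become the two real coefficients per tap.
    return np.stack([padded.real, padded.imag], axis=-1)
```

For example, a 2-tap CIR `[1+2j, 3]` becomes a 72 × 2 array whose first rows hold (1, 2) and (3, 0), with zeros elsewhere.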

B. Experiment Settings
The propagation system layout is configured as a 300-by-300 (meters) map. We consider single-input-single-output (SISO) transmission in the system. The five propagation scenarios in the training set, with their maximum delays, are: indoor office with a maximum delay of 175 ns, indoor-to-outdoor with a maximum delay of 305 ns, indoor hotspot with a maximum delay of 405 ns, outdoor-to-indoor (urban) macro-cell with a maximum delay of 535 ns, and urban macro-cell with a maximum delay of 615 ns. It is evident that the average power of each channel tap varies across environments, and even within the same environment for different transmitter and receiver positions. We generate 500 PDPs for each propagation scenario for the training dataset by changing the propagation conditions and user positions. Each PDP contains 500 different instantaneous channel coefficient realizations. A significant number of PDPs are required for training the meta-learner in TAM and CAM to learn as many environmental features as possible. When samples from a new environment appear, the meta-learner can learn new features by combining the learned features, which guarantees the new-task adaptation of the backbone.
The testing set includes channels in a rural macro-cell with a maximum delay of 420 ns and moving networks with a maximum delay of 210 ns. Compared with the propagation scenarios in the training set, the moving network scenario involves a higher Doppler spread, and the rural macro-cell has a much lower building density, so the LOS condition is more likely to exist in the coverage area. Each propagation scenario has five PDPs. The channel output can be formulated as y = h ∗ x + n, where n and x are the channel noise and channel input, respectively, ∗ refers to linear convolution, and h denotes the CIR. The input of the attention-based CE system is the query blocks with w/4 pilots, while the support blocks contain w pilots. The number of support blocks required for the new environment will be explored in the following part. All experimental results are obtained by testing on the same 500 samples of each environment and averaging.
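The channel model y = h ∗ x + n above can be sketched directly with a linear convolution and complex AWGN at a chosen SNR; the function name and the SNR parameterization are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def channel_output(x, h, snr_db=20.0, rng=None):
    """y = h * x + n, where * is the linear convolution and n is
    complex AWGN scaled to the requested SNR."""
    if rng is None:
        rng = np.random.default_rng(0)
    y_clean = np.convolve(h, x)               # linear convolution h * x
    sig_pow = np.mean(np.abs(y_clean) ** 2)
    noise_pow = sig_pow / (10 ** (snr_db / 10))
    # Complex Gaussian noise: half the power in each of I and Q.
    noise = np.sqrt(noise_pow / 2) * (
        rng.standard_normal(y_clean.shape)
        + 1j * rng.standard_normal(y_clean.shape)
    )
    return y_clean + noise
```

Note that the linear convolution of an L-tap CIR with an N-sample input yields N + L − 1 output samples.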

C. Baseline
The proposed attention-based CE system is compared with two baselines: SwitchNet and the testing accuracy boundary. For SwitchNet, w/4 pilots are employed as the input of the LS CE block, which is the same number of pilots as used by the CE backbone. The testing environments described in [11] are limited, with part of the online testing performed under similar propagation conditions that differ only in their maximum delays; here we explore the ability of SwitchNet to adapt to highly separated channels. Furthermore, we measure the distance between the testing performance of the proposed attention-based system and the testing accuracy boundary, and examine whether our proposed method can approach or even exceed that boundary. We also test the effectiveness of each part of the attention-based system. First, we explore how the CE backbone performs with only the TAM embedded, and examine whether, compared with SwitchNet, the TAM helps the CE backbone perform better in the new environment through the dynamic adjustment from the meta-learner. Then the CAM-based initialization network is added to check whether the system can further improve its generalization to the new environment.

D. End-to-End Model Training
The number of training samples for each environment is 1.25 × 10^6 and the training batch size is 128. The whole attention-based CE system is trained in an end-to-end manner. Each training sample includes two categories of data: one frame of a w/4-pilot block and multiple frames of w-pilot blocks. The multiple frames of w-pilot blocks form the support set T_support in FSL sampling, which is randomly sampled from the training set and belongs to the same average PDP. The loss function and the optimizer used for training are binary cross-entropy and Adam, respectively. The model is trained at SNR = 20 dB, while the SNR varies from 5 dB to 25 dB during testing.
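The FSL sampling described above, one w/4-pilot query block plus several w-pilot support blocks drawn from the same average PDP, might look like the following sketch; the dict-of-arrays data layout and the function name are hypothetical.

```python
import numpy as np

def sample_training_example(blocks_by_pdp, n_support=16, rng=None):
    """Draw one FSL training sample: a query block plus n_support
    support blocks that share the same average PDP.

    blocks_by_pdp: dict mapping a PDP id to an array of pilot blocks,
    an assumed data layout for illustration.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    pdp_id = rng.choice(list(blocks_by_pdp))      # pick one environment/PDP
    blocks = blocks_by_pdp[pdp_id]
    # Sample n_support + 1 distinct blocks from that PDP.
    idx = rng.choice(len(blocks), size=n_support + 1, replace=False)
    return blocks[idx[0]], blocks[idx[1:]]        # query, support set
```

In the actual system the query block would carry w/4 pilots and the support blocks w pilots; here all blocks come from the same array for simplicity.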
The detailed network layout for our attention-based CE is introduced in Table I. TAM_Main processes the support blocks and the attention vectors are generated by TAM_Meta. CAM_Main is a deep residual network that introduces extra channel features from the support blocks into the query blocks, and the cross-attention vectors are generated by CAM_Meta. Table II lists the training parameters for the simulation. Figure 5, which plots the MSE between the true and estimated channel for known environments, demonstrates how the attention mechanism helps the model improve the CE accuracy in joint training over multiple environments. As mentioned in the previous section, the TAM-based method makes the DL model robust to data with different distributions. We select two environments from the training set and test the generalization ability with and without the TAM applied for CE. Figure 5 shows that the TAM can enhance the adaptation to multiple environments, especially in high SNR scenarios. Figure 7 shows the testing accuracy for the rural macro-cell channel versus the number of support blocks. From Figure 7, when there is only one block in T_support, the performance is limited, since a single block can hardly characterize the average power in the specific environment. As the number of support blocks increases, the performance improves since more implicit features are learnt. The elbow point appears at 16 blocks; beyond 16, the accuracy improvement from adding more blocks is very limited. Therefore, all subsequent experimental results are obtained with 16 blocks in T_support.

E. Experimental Results
During training, TAM allows the CE backbone to adapt to data from various distributions. Without the TAM, the model generalizes poorly when trained with samples from different environments. Figure 6 compares the mean-squared error (MSE) of the different design methods. From Figure 6, the model's generalization ability to newly distributed data is enhanced by the task-attention mechanism. The TAM helps the CE backbone achieve a lower MSE for the new environments than SwitchNet. The degree of freedom in the dynamic adjustment by TAM is much higher than in SwitchNet: the attention network generates multiple parameter vectors, while SwitchNet can only rely on five parameters for new environment adaptation. Furthermore, SwitchNet is trained separately on datasets with different distributions, while the attention-based approach trains on these datasets jointly, which allows the model to learn additional information from the different environments, such as high-level features for the meta-learner training. Therefore, the attention-based mechanism outperforms SwitchNet in generalization to the new environment. In addition, SwitchNet requires online training steps to adapt to the new environment, whereas our proposed attention-based method generalizes directly to the new environment without fine-tuning steps for online adaptation.
From Figure 6, the CAM-based initialization network positively affects the TAM-embedded model's generalization to the new environment. Additional channel information in the support blocks can be effectively learned by the query block in the initialization network. Furthermore, we demonstrate that CAM can enhance the system's testing performance. In Figure 6, the FSL model without CAM embedded is also considered, and it performs worse than the CAM-embedded case. The CAM provides attention in a different dimension from TAM, and it is shown in [29] that combining channel and spatial attention leads to consistent improvements for CNN-based models. Figure 6 shows that with the initialization network embedded, the CE system becomes more robust at low SNR. For the high SNR scenario, the initialization network improves the testing accuracy, which is close to the boundary in most cases. With the help of CAM, our proposed model can even exceed the w/4-pilot testing boundary. We should emphasize that such MSE performance with support blocks is not due to a high similarity between the testing environment and the environments in the training set, since we carefully set up the experiments to avoid such circumstances. We employed a CNN to classify all training and testing channel coefficients, with the environment to which the channel belongs as the label. The classification accuracy is over 95%, which indicates that these environments are highly separated and have features that distinguish them from one another. In addition, we consider the case where the query blocks S_query and the support blocks T_support belong to different environments, which is called a 'mismatch'. Two mismatch cases are considered: in one, the two environments come from different propagation scenarios; in the other, they come from the same propagation scenario but with different transceiver positions.
Both cases lead to different PDPs for S_query and T_support and result in significant degradation of the testing accuracy. Only the features of the same PDP can give the most accurate CE in an environment, indicating that the feature similarity between different PDPs is not high. Figure 8 demonstrates that mismatch leads to a significant degradation of the testing accuracy. Within the same propagation scenario, PDPs at different positions share more similar features than PDPs from different scenarios; therefore, the position mismatch performs better than the scenario mismatch. However, since most channel features differ, such as the delay profiles and the number of taps, the gap between the mismatch and match performance remains large.

VI. CONCLUSION AND FURTHER DIRECTIONS
We have realized FSL for new, unseen propagation environments in the DL-based data-driven CE model by exploiting an attention-based mechanism. With a few pilot blocks sampled from the new environment, the global features of the new environment and the correlations between the support and query blocks can be quickly extracted. Environment-specific and block-specific attention is generated to allow the model to be quickly adapted to new environments. The proposed mechanism outperforms the existing FSL method in the data-driven scenario.
In this article, we have used CE as an example to realize our novel idea: learning to adapt to new environments using past experience. The same spirit can be applied to a large group of communication systems and networks, such as end-to-end systems, signal detection, localization, and resource allocation.
For future research, it is desirable to investigate model-driven DL-based wireless communication models for FSL. Model-driven DL-based approaches combine communication domain knowledge with DL models [30]. Compared with the data-driven method, the model-driven method contains only a small number of trainable parameters, so the demand for training data is much lower. The challenge is to find the most critical parameters affecting performance in the specific environment and to figure out which environment features are the most closely associated with these parameters. Therefore, the model-driven approach has great potential for FSL problems.
Another interesting research topic is to apply graph neural networks (GNNs) to FSL problems in wireless communications. A GNN can be designed to capture the dependencies in a graph through information passing between its nodes, and its working mechanism is equivalent to that of distributed optimization algorithms. GNNs have the potential to perform FSL owing to their small number of parameters and the high computational efficiency afforded by their distributed structure. Furthermore, GNNs can be employed to enhance model-driven methods, such as the belief propagation (BP) algorithm: a GNN can be used to construct the factor graph and extract features from each parameter-updating iteration of the algorithm so that optimum parameters can be learned.