Wi-Fi Based User Identification Using In-Air Handwritten Signature

This paper conducts a feasibility study on the use of Wi-Fi channel state information for user recognition based on in-air handwritten signatures. A novel identity recognition system is proposed that observes the distinctive signal distortions which different users cause along the propagation path. The system capitalizes on the wide availability of Wi-Fi signals for signal analysis without needing additional hardware infrastructure. Since the patterns of the raw Wi-Fi signals are sensitive to the signer’s location, transfer learning has been adopted to cope with the positional variation. Specifically, features trained at one position are transferred to classify signals collected at another position via a single shot retraining, for which a kernel and range space projection has been adopted. Our experiments show encouraging results for the proposed system.


I. INTRODUCTION
According to [1]-[3], biometric traits have played a central role in user identity authentication in the networked society. Generally, biometric based systems can be divided into two categories depending on the adopted information: 1. those based on the user's physiological characteristics such as face, fingerprint, palmprint, and iris, and 2. those that utilize the user's behavioral traits such as signature, gait, and keystroke. Compared with the physiological characteristics based systems, the behavioral traits based systems are more robust to spoofing attacks because physiological characteristics can be easily cloned or imitated due to their static nature [4], [5].
Among those behavioral traits, the handwritten signature has been widely used as a means for large scale person identification [6], [7]. The handwritten signature uses personalized patterns while hand gestures include more general patterns such as waving [8]. Hence, the handwritten signature can be considered as one particular type of hand gesture.
The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.
The handwritten signature has several advantages over other behavioral biometrics. These advantages include: 1. the acquisition procedure is non-invasive, and 2. signatures are replaceable even if the biometric information is stolen or compromised. For example, it is very challenging to replace a gait with a new walking pattern upon compromise since the walking pattern has become a habit throughout one's life. In contrast, the handwritten signature can overcome this issue by creating a new signature pattern to replace the compromised signature. Systems which adopt handwritten signatures can be classified into two categories according to the signature acquisition method, namely those which utilize offline signatures and those which use online signatures [4], [6]. The offline signature can be acquired by having a signing instrument directly in contact with a scanning device. This method has several problems compared with the online signature. Firstly, it is vulnerable to imitation because it leaves a traceable print on the surface [4]. Secondly, it contains fewer features than online approaches. For example, the online signature contains dynamic information such as signing velocity, degree of pressure, and degree of tilt while the offline signature only contains the shape information [4].
Finally, it could disrupt the user's activities in practice. A user has to stop what he/she was doing and shift focus to the system in order to write his/her signature on the restricted surface where the sensing module is placed. Moreover, setting up contact-based sensing modules takes up space and could be costly in practice.
In view of the above advantages of the online signature systems, online in-air signature recognition systems have been gaining attention recently [4], [9]-[13]. These systems were designed to recognize a user's signature written in the air. In [9], the in-air hand signature was recognized from the trajectory image of the hand captured through a visual sensor. In [10], a movement sensor in a mobile phone was utilized to capture the in-air signature. Although these systems could avoid the costly setup issue by utilizing built-in devices, they require the users either to hold a certain type of device (e.g., a mobile phone) or to face a specific direction where the sensing module is placed during the authentication process.
Recent studies present a variety of potential applications of Wi-Fi signal based human activity recognition systems. These include human fall detection [14], [15], keystroke recognition [16], respiration monitoring [17], etc. The fundamental mechanism that forms the basis of these systems is the identifiable interference caused by human activities along the Wi-Fi signal propagation path. Utilizing the raw Wi-Fi signals for activity recognition has several advantages. Firstly, the system is able to utilize the widely available Wi-Fi devices as its platform. This alleviates the difficulty of deploying the system on new devices, since the widely available Wi-Fi devices can be used immediately. Secondly, the system does not require the user to hold or wear any additional device or sensor. Thirdly, the system is able to identify human activities from various locations and directions without needing to position the sensors at specific locations. Finally, the Wi-Fi based system is robust to illumination conditions compared with vision based systems [17].
In this work, we propose to utilize the widely available Wi-Fi signals for user identification based on the in-air handwritten signature. As user identification/recognition consists of N one-to-one matchings, the verification accuracy is also included to supplement the recognition performance. Here we note that the adoption of Wi-Fi signals for human activity recognition poses new challenges. For example, the Wi-Fi signals can be disturbed by distractors such as the presence of movements other than the desired signature patterns. Moreover, apart from contamination by noise, the Wi-Fi signals can be attenuated by physical distance and barriers. Lastly, different from the consistency of offline signature images acquired from contact based devices [9], the Wi-Fi based system is observed to be sensitive to the position of objects that reflect or interfere with the Wi-Fi Channel State Information (CSI) signals. When the same person draws an in-air signature at different locations, the captured Wi-Fi signal shows different patterns (see Figure 1). Our task in this work is to establish the feasibility and then design a system for user recognition.
Our proposed system addresses the last challenge by adopting transfer learning [18]. Instead of training from scratch at each position, we firstly train a Convolutional Neural Network (CNN) [19] using the Wi-Fi signature signals collected from one position. Subsequently, the trained feature extractor is transferred to recognize the Wi-Fi signature signals collected at another position. In view of the time consuming retraining process by the gradient descent (GD) algorithm [20], [21], we adopt a fast single shot algorithm for fine tuning the transferred features. The single shot training algorithm, called Kernel and the Range (KAR) space projection learning [22]- [25], shows an impressive learning speed for optimizing relatively shallow networks. Our experiments on the proposed system show up to 99.875% identification accuracy. Apart from the pioneering effort in addressing the issue of positional sensitiveness of Wi-Fi based in-air signature for user identification, the proposed system delivers a reasonable prediction output based on a relatively small training dataset. This is an important advantage over many deep learning methods which require a large pool of data for effective learning.
The main distinctive attributes of our contribution are as follows:
- Upon establishing the feasibility, we propose a system for user identification based on the Wi-Fi handwritten signature signals. The system can leverage an existing Wi-Fi system for signal acquisition.
- We address the positional sensitivity of Wi-Fi based in-air handwritten signature signals by adopting transfer learning. A single shot learning algorithm based on kernel and range space projection has been adopted for fine tuning the transferred features for computational efficiency. A score-level fusion is subsequently employed for accuracy enhancement.
- In view of the lack of public datasets, we have constructed a dataset of 100 subjects for experimentation.
The paper is organized as follows: Section II provides a brief review of related works. Section III introduces our proposed system for user identity recognition. In Section IV we present our experimental settings and evaluation of the system. Concluding remarks are presented in Section V.

II. RELATED WORKS
A. IN-AIR HUMAN SIGNATURE RECOGNITION
According to [2], the problem of identity recognition using biometrics (e.g., signature) can be divided into two types based on their operation. The first type is known as identification (also known as recognition) and the second type is known as verification (also known as authentication). Identification is a 1-to-n matching problem, which means that the given individual biometric template is compared with all others in the database to locate the identity. In contrast, verification is a 1-to-1 matching problem, which means that the given biometric template is matched against a specific biometric template to see whether it is a match or a non-match. In other words, the available answer to verification is either "match" or "non-match" (or am I who I claim I am?). For identification, the answer to the 1-to-n matching is "who" is the owner of the biometric template (or who am I?). Table 1 provides an overview of existing literature for in-air signature recognition systems where their modes of operation (verification or identification) can be found in Table 2. Although these systems have different measurements in their respective setups, their performances can nevertheless be quantified by a common measure of either the accuracy of identification or the Equal Error Rate (EER) of verification [8], [32].
As seen from Table 1, based on the signal capturing method, these works on handwritten signatures can be categorized into two groups, namely (a) vision/handheld-device based and (b) Wi-Fi based systems. The vision/handheld-device based systems utilized either imaging sensors such as RGB, NIR, and ToF sensors or a handheld device for signature motion capture or imaging. The Wi-Fi based systems utilized either the CSI or the RSSI signals for capturing signature patterns. Based on the captured signature type, the related works can be further classified into two categories, namely 1) 2D signature and 2) 3D signature. In terms of the adopted methodology, we further divide the related works into i) handcrafted feature based systems and ii) deep learning based systems. Based on this categorization, we observed that most of the earlier works utilized handcrafted features for in-air signatures while only a few recent works used deep learning techniques where the feature extraction was performed in an end-to-end manner. Our proposed system belongs to the category of 2D handwritten signature recognition utilizing the Wi-Fi CSI signals. Table 2 provides greater details of existing literature in terms of adopted modality, data type, methodology and the mode of operation. The following subsections provide a brief account of these existing technologies.
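The two modes of operation and the EER measure can be illustrated with a minimal sketch. The function names, the score convention (higher means more similar) and the threshold scan are assumptions made for illustration only; they are not taken from any of the reviewed systems.

```python
def identify(scores):
    """1-to-n matching: return the index of the enrolled identity with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def verify(score, threshold):
    """1-to-1 matching: accept the claimed identity if the score reaches the threshold."""
    return score >= threshold


def equal_error_rate(genuine, impostor):
    """Approximate the EER by scanning every observed score as a candidate threshold."""
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)      # false rejection rate
        far = sum(s >= t for s in impostor) / len(impostor)   # false acceptance rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical match scores for one claimed identity.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.5]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

Identification returns an identity, whereas verification returns only an accept or reject decision, which is why the former is reported as accuracy and the latter as an error rate at the operating point where false acceptances and false rejections balance.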

1) VISION/HANDHELD-DEVICE BASED SYSTEMS
In [26], Katagiri et al. conducted a preliminary study of in-air handwritten signature for person verification. In their system, the user was prompted to draw his/her signature in the air using a pen that came with a light-emitting diode (LED). The highlighted trajectory of the in-air signature was subsequently captured by a video camera. In [10], Bailador et al. made use of the movement sensor in the mobile device to capture in-air signatures. They also conducted experiments under spoofing attack scenarios. Similarly, in [29] and [30], an accelerometer embedded in the mobile device was utilized to perform the verification task. Casanova et al. achieved an EER of 2.5% by fusing the features of gesture accelerations at decision level. In [30], Diep et al. achieved an EER of 0.8% by using the Support Vector Machine (SVM) to differentiate between the genuine users and the impostors.
In [9], Jeon et al. proposed a vision based in-air signature verification system. The system utilized a depth camera (Kinect) to track the finger together with the palm motion of the user. Unlike [10], [26], this system did not require a person to hold any additional devices. In [27], Sajid et al. proposed a vision based in-air signature verification system. In this work, the Google Glass was used to record in-air handwritten signatures. The extracted 2D coordinates were matched and authenticated by Dynamic Time Warping (DTW). In [31], Fang et al. proposed a video-based in-air signature verification system. The time and space information of the in-air signature were fused by DTW and Fast Fourier Transform (FFT), and this fusion resulted in a robust and accurate verification system. In [13], Malik et al. utilized a depth sensor to capture the 3D hand joint positions from in-air handwritten signatures. They employed a multi-dimensional DTW algorithm to match the preprocessed test signature against the corresponding feature. Behera et al. [12] proposed a deep learning based matching system. They employed a CNN based sequential classifier to perform the identification task on the 3D in-air signature dataset captured by a Leap Motion sensor. Recently, Khoh et al. [4] proposed an in-air signature recognition system using the Kinect camera to capture not only the RGB images but also the depth information. They utilized several distance measures to compute the dissimilarities among the extracted features. Both the identification mode and the verification mode were evaluated to show the potential of the system.

2) WI-FI BASED SYSTEMS
Compared with the vision/handheld-device based systems mentioned above, the Wi-Fi based in-air signature recognition system has fewer spatial constraints due to the wide accessibility of the Wi-Fi signal via commercial devices. Recently, Moon et al. [11] constructed a preliminary in-air signature recognition system for user verification. The system was able to authenticate in-air signatures which were captured through the Wi-Fi CSI signals.
The above review shows that none of the previous systems are designed to address the location sensitivity issue of Wi-Fi based systems except the gesture recognition work of Ohara et al. [33] where the Doppler shift effect was adopted to extract position-independent features. However, only features parallel to the direction of the Wi-Fi signals can account for the position change. The current paper reports an extension of the preliminary study in [28] on an extended dataset from [11], [28] using 100 subjects and incorporates score-level fusion of multi-samples (i.e., 1-dimensional, 2-dimensional Wi-Fi signals) for performance enhancement.

B. TRANSFER LEARNING
When there is insufficient data for training a deep network, the technique of transfer learning [18], [34] can be exploited to adapt the features of a pretrained network to another task in a similar domain. In [35], Oquab et al. proposed a CNN which was pretrained using available labeled source data and then used as a fixed feature extractor. To address the distribution difference between the source used for pretraining and the novel target domain, an adaptation layer was utilized to fine tune the network to the novel target. In [36], Shin et al. transferred an ImageNet [37] pretrained CNN and then retrained the network using a sub-domain of medical images to perform skin lesion detection. They showed that this approach can effectively address the problem of insufficient medical images and achieved state-of-the-art performance. Wang et al. [38] proposed an online training method to improve the visual tracking performance. They transferred deep features that were trained on a large visual dataset, which was shown to remarkably improve the performance on the small dataset. Our proposed system has been inspired by these works. Since our in-air Wi-Fi handwritten signatures captured at different positions and orientations belong to a common domain, we can utilize transfer learning to address the limited number of samples acquired from a single orientation at each position.

C. MOORE-PENROSE INVERSE BASED NETWORK LEARNING
The backpropagation algorithm using gradient descent [20], [21] has achieved great success in deep learning. However, gradient descent based learning comes at several costs such as the selection of hyper-parameters, the risk of local minima, time consuming iterations, and vanishing gradients. The Moore-Penrose inverse based learning for training shallow (fewer than three layers) networks has been recently reviewed in [39]. This approach aims to find the global minima of the reformulated system in a single learning shot without training iterations. Moreover, it does not require hyper-parameters such as the learning rate and the momentum setting. However, the learning has been limited to shallow networks. Recently, a novel method to learn neural networks with more than three layers was presented in [22]-[25]. This learning approach aims to learn all the parameters of the deep neural network in a single operating pass by manipulating the KAR space of the reformulated system. We shall utilize this learning method in the stage of retraining to avoid iterative search. The details shall be provided in Section III-B.

FIGURE 2. Overview of the proposed system. (a) Capture of in-air handwritten signature gesture using Wi-Fi CSI signals (b) Data preprocessing steps (c) User identification process using transfer learning based on the KAR space projection learning (d) Score-fusion in the testing stage for accuracy enhancement. The symbol "F" refers to fusion. Each of these processing steps is described in greater detail in Section III.

III. PROPOSED SYSTEM
In this section, we introduce our system for user identification where its flow diagram is shown in Figure 2. Essentially, the processing steps in the system can be divided into three stages namely, i) data acquisition and preprocessing (see labels (a) and (b) in the figure), ii) network pretraining and transfer learning (see label (c) in the figure), and iii) decision fusion (see label (d) in the figure). These processing stages are described in greater detail in the following subsections. The abbreviations and acronyms in this section are explained in Table 3.

A. DATA ACQUISITION AND PREPROCESSING
1) DEVICE AND ENVIRONMENTAL SETUP
A VPCSB16FK VAIO laptop with an Intel Core i5-2410M 2.3GHz 64-bit CPU and 8GB memory was used as the Wi-Fi signal receiver. The receiver was equipped with the Linux 802.11n CSI tool [40] and an Intel 5300 Network Interface Controller (NIC) to collect the CSI of the signals. The aforementioned tool was run on the laptop under Ubuntu 12.04.05 LTS. The receiver was set to receive one CSI packet every 0.0001 second (10kp/s) in view of the higher packet loss at lower packet rates as well as the relatively long range between the transceivers in our setting [41]-[43]. In order to gather as many packets as possible over multiple paths, 3 external antennas were deployed at the receiver side. An IpTime A1004 router with 2 transmission antennas was used as the Wi-Fi signal transmitter. The transmitter was set to transmit Wi-Fi signals in the 2.4GHz frequency band. To the best of our knowledge, the primary differences between the 2.4GHz and 5GHz signals are the speed and the range of coverage. A 2.4GHz Wi-Fi signal is able to cover a larger area but it sacrifices the speed. In contrast, the 5GHz band provides a faster speed over a smaller area. We observed higher noise in the signals collected via the 5GHz band than in those collected via the 2.4GHz band during the data collection process. In view of the noise issue, we have utilized the 2.4GHz signals in this study. Under this device setup, subjects were requested to write their signatures in the air while sitting at two different positions with four writing directions (see Figure 3 (a)). Since our router utilizes omnidirectional antennas, it can suffer from wireless interference caused by other devices such as microwave ovens and cordless phones. To minimize such interference, we either turned off such devices during the data collection process or removed them from the data collection environment. The signals based on four facing directions of the user were recorded to observe the impact of user orientation.
The layout of the experimental environment is shown in Figure 3 (b). The data collection was performed in a typical office room of dimension 4m×6m with a desktop, tables, chairs, and bookshelves. The transmitter (Tx) and the receiver (Rx) were placed near the two ends of the room. The distance between Rx and Tx was approximately 4.84m and they were placed about 82cm and 110cm above the floor, respectively. The 3 external antennas at the receiver end were placed approximately 45cm apart. During data acquisition, each user was asked to sit at one of the positions labelled as Tx-side or Rx-side and perform his/her personal gesture of in-air handwritten signature while the transmitter and the receiver were in operation. Some samples of the received signals containing the disturbances caused by the in-air signatures are shown in Figure 1.

2) DATA COLLECTION AND PREPROCESSING
The above setup was used to collect the Wi-Fi CSI signals, which contained signals distorted by the in-air handwritten signatures, from 100 subjects. For each subject, 10 samples were collected for each of the 2 sitting positions with each position having 4 signature facing orientations. In order to enact a realistic usage environment during the data collection process, the users were seated at a comfortable posture at each of the four orientations for each of the two positions without any other restrictions of the body pose. The users can use either the left or the right hand, with or without specific finger posture, to perform the signature gesture according to their habits or preferences. Moreover, there is neither the need to use a specific language nor the need to use their real signature for the experimentation so long as the same signature pattern has been used for each orientation and position. Each of the collected samples is of approximately 12 second duration. A total of 8000 signal sequences were collected to form the database for experimentation.
The collected Wi-Fi CSI signals consist of several sub-carriers for each packet in the Tx-Rx antenna pair. The CSI of a sub-carrier can be modeled as
$$h_c = |h_c| e^{j\theta_c}, \tag{1}$$
where $c \in \{1, 2, \cdots, C\}$ denotes the sub-carrier index, with $C$ being the maximum number of sub-carriers, and $|h_c|$ and $\theta_c$ respectively denote the amplitude and the phase of the sub-carrier. Since our Intel 5300 NIC had a firmware issue in extracting the phase information from sub-carriers in the 2.4GHz frequency band, we only exploited the amplitude of the CSI of the collected signals. Under our 2 × 3 Tx-Rx configuration, there are 6 (2 × 3) streams, each stream has 7,000 to 15,000 packets on average, and each packet has 30 sub-carriers.
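As a small illustration of the amplitude-only representation, the sketch below converts complex CSI values into per-stream amplitudes. The input layout `(P, n_tx, n_rx, C)` and the function name are assumptions made for this example; the actual output layout of the CSI tool may differ.

```python
import numpy as np


def to_amplitude_streams(csi):
    """Map complex CSI of shape (P, n_tx, n_rx, C) to amplitudes of shape (S, C, P).

    S = n_tx * n_rx is the number of Tx-Rx streams. The array layout is an
    assumption for illustration, not the format produced by the CSI tool.
    """
    p, n_tx, n_rx, c = csi.shape
    amp = np.abs(csi)  # |h_c|; the phase np.angle(csi) is discarded
    return amp.reshape(p, n_tx * n_rx, c).transpose(1, 2, 0)


# Hypothetical capture: 5 packets, 2 Tx antennas, 3 Rx antennas, 30 sub-carriers.
csi = 2.0 * np.exp(1j * 0.3) * np.ones((5, 2, 3, 30))
amp = to_amplitude_streams(csi)  # shape (6, 30, 5)
```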
Since the signals collected at a fast receiving rate are computationally expensive to process and likely to contain missing values, we adopt several preprocessing steps to reduce the computational load and enhance the signal quality for our user identification system. Our preprocessing steps include linear interpolation, resampling, low-pass filtering and removal of DC components. To reduce the computational complexity, the $C$ sub-carrier streams of each packet in the Tx-Rx antenna pair were averaged into a single stream, which resulted in a 1-dimensional signal. The averaging process can be written as:
$$\bar{h}_{s,p} = \frac{1}{C} \sum_{c=1}^{C} |h_{s,c,p}|, \tag{2}$$
and the averaged signal stream can be expressed as:
$$\bar{\mathbf{h}}_s = [\bar{h}_{s,1}, \bar{h}_{s,2}, \cdots, \bar{h}_{s,P}], \tag{3}$$
where $s \in \{1, 2, \cdots, S\}$ denotes the streaming index, with $S$ denoting the maximum number of streams, $p \in \{1, 2, \cdots, P\}$ denotes the packet index, and $P$ denotes the maximum number of collected CSI packets. Subsequently, the signal streams are packed as:
$$H = [\bar{\mathbf{h}}_1^T, \bar{\mathbf{h}}_2^T, \cdots, \bar{\mathbf{h}}_S^T]^T \in \mathbb{R}^{S \times P}. \tag{4}$$
The collected raw data tends to have missing packets and shows irregular packet counts $P$ (i.e., 7000 to 15000 under our configuration) despite the fixed packet receiving rate. To address this issue, we firstly apply a linear interpolation [44] to fill the missing values. After the linear interpolation process, the CSI packets are filtered by a Butterworth filter [45] to smooth the highly jagged signal. The Butterworth filter has been chosen for this application since our signals of interest (signal distortions caused by signatures) mainly lie in the low frequency range while the collected signals contain many high frequency components due to environmental noise. We have observed that the Butterworth filter handles the large number of packets in the collected signals efficiently while showing good system performance. Subsequently, we perform resampling [46] to uniformly adjust all the packet sizes from $P$ to $P'$, where $P'$ denotes the length of the resampled stream.
Since the Wi-Fi signals can be easily affected by the surroundings (e.g., the location of objects on the table), the collected Wi-Fi signal contains a DC component arising from minor changes in the data acquisition environment. A shifting and subtraction algorithm described in [47] has been applied to remove the DC component. Figure 4 shows an illustration of the shifting and subtraction process.
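The averaging, interpolation and resampling steps can be sketched as follows for a single Tx-Rx stream. This is a minimal illustration under stated assumptions: missing packets are marked as NaN, resampling is done by linear interpolation onto a uniform grid (the resampling method of [46] may differ), and the Butterworth low-pass filtering and DC removal are left out. All function names are hypothetical.

```python
import numpy as np


def fill_missing(x):
    """Fill NaN entries of a 1-D stream by linear interpolation over the packet index."""
    x = np.asarray(x, dtype=float)
    idx = np.arange(x.size)
    ok = ~np.isnan(x)
    return np.interp(idx, idx[ok], x[ok])


def resample_linear(x, new_len):
    """Resample a 1-D stream to new_len points on a uniform grid (assumed method)."""
    old_grid = np.linspace(0.0, 1.0, num=len(x))
    new_grid = np.linspace(0.0, 1.0, num=new_len)
    return np.interp(new_grid, old_grid, x)


def preprocess_stream(amp, new_len=500):
    """amp: (C, P) sub-carrier amplitudes of one stream, NaN where a packet is missing."""
    averaged = np.mean(amp, axis=0)       # average the C sub-carriers: (C, P) -> (P,)
    filled = fill_missing(averaged)       # linear interpolation of missing packets
    return resample_linear(filled, new_len)


# Hypothetical stream: 30 sub-carriers, 10 packets, packet 3 missing.
amp = np.tile(np.arange(10.0), (30, 1))
amp[:, 3] = np.nan
stream = preprocess_stream(amp, new_len=19)
```

Running every stream through the same `new_len` gives fixed-length inputs regardless of how many packets were actually received.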
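Without and with consideration of the sub-carrier variations, two types of CSI streams (called the 1-dimensional and the 2-dimensional signals, respectively) have been considered and packed into $\tilde{H}_{1d} \in \mathbb{R}^{S \times P'}$ and $\tilde{H}_{2d} \in \mathbb{R}^{S \times C \times P'}$ as:
$$\tilde{H}_{1d} = [\tilde{\mathbf{h}}_1^T, \tilde{\mathbf{h}}_2^T, \cdots, \tilde{\mathbf{h}}_S^T]^T, \tag{5}$$
$$\tilde{H}_{2d} = [\tilde{H}_1, \tilde{H}_2, \cdots, \tilde{H}_S], \tag{6}$$
where $\tilde{\mathbf{h}}_s \in \mathbb{R}^{P'}$ and $\tilde{H}_s \in \mathbb{R}^{C \times P'}$ respectively denote the preprocessed averaged stream and the preprocessed sub-carrier streams of the $s$-th Tx-Rx stream, and $P'$ is the resampled stream length. Considering the sampling instance, the data can be further packed into tensor form $\tilde{X}_{1d} \in \mathbb{R}^{N \times S \times P'}$ and $\tilde{X}_{2d} \in \mathbb{R}^{N \times S \times C \times P'}$:
$$\tilde{X}_{1d} = [\tilde{H}_{1d}^{(1)}, \tilde{H}_{1d}^{(2)}, \cdots, \tilde{H}_{1d}^{(N)}], \tag{7}$$
$$\tilde{X}_{2d} = [\tilde{H}_{2d}^{(1)}, \tilde{H}_{2d}^{(2)}, \cdots, \tilde{H}_{2d}^{(N)}], \tag{8}$$
where $N$ denotes the maximum number of samples. Under our 2 × 3 Tx-Rx configuration, $S$ and $C$ respectively become 6 and 30. $P'$ and the shifting distance $k$ are empirically set at 500 and 5, respectively. A sample set of results for each preprocessing step is shown in Figure 5.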
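The resulting tensor shapes can be checked with a short sketch. Random values stand in for preprocessed CSI amplitudes, the number of samples is a toy value, and deriving the 1-dimensional tensor by averaging the already-preprocessed 2-dimensional one is a simplification made here for brevity.

```python
import numpy as np

# S = 6 streams, C = 30 sub-carriers and resampled length 500 follow the text;
# N = 4 is a toy value and the data are random placeholders.
N, S, C, P_RES = 4, 6, 30, 500
rng = np.random.default_rng(0)

samples_2d = [rng.random((S, C, P_RES)) for _ in range(N)]  # one (S, C, P') array per sample
X_2d = np.stack(samples_2d)   # 2-dimensional signals: (N, S, C, P')
X_1d = X_2d.mean(axis=2)      # sub-carriers averaged out: (N, S, P')
```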

B. NETWORK PRETRAINING AND TRANSFER LEARNING
Part (c) of Figure 2 illustrates our transfer learning based on the Kernel and the Range space projection (KAR learning). As explained in Section II-B, transfer learning is a methodology where the network weights trained on a certain source domain are taken and used in another target domain for learning refinement. It has been studied to address the problem of insufficient training data in deep learning [18]. Since our Wi-Fi in-air signature dataset is not large enough to train a complex network structure, the conventional transfer learning may not generalize well for prediction. Moreover, most of the works that adopted the Gradient Descent Algorithm to retrain the transferred model (called GD-retrain hereafter) are iterative and time-consuming. To avoid the iterative learning process in the retraining, we adopt the KAR learning [22]-[25] in this work. The KAR learning aims to solve the network in a single operating pass where no iterative search is needed. The main advantage of this approach is that neither descent nor gradient computation is needed for network learning. Moreover, it does not require tuning of hyper-parameters such as the learning rate and the momentum setting.

1) PRETRAINING
Firstly, the model is pretrained utilizing every orientation at, say, the Tx (or Rx) position based on the gradient descent algorithm [20], [21], where this position is not used for identity prediction. Based on the CNN structure described in [19], the feature extractor of the proposed pretraining model consists of three convolutional layers with activation functions followed by pooling layers. The activation function and the pooling layer are the commonly used ReLU and Max-pooling. Each convolution layer has several filters to be trained. Each filter convolves with the input and forwards its result to the next layer. A Fully Connected Feedforward Neural Network (FCN) serves as a classifier for pretraining.
Prevalent techniques for CNNs such as Batch Normalization (BN) [48] can be utilized for the pretraining. The pretrained model which shows the best result is saved and the parameters of the saved feature extractor are transferred for use in identity recognition at a different position. Let us denote the pretrained weights by $\hat{W}_k$, $k = 2, \ldots, l$. These pretrained weights will be used to perform the weights update in the retraining stage that utilizes data from a different position.

2) RETRAINING
After the pretraining stage, we have the feature extractor pretrained with the Wi-Fi signature signals collected from every orientation at the Tx side. Since the Wi-Fi signature signals obtained from different positions are in a similar data domain [11], we can treat the pretrained convolutional layers as a feature extractor and retrain only the classification layers for identity recognition. We shall describe the procedure of KAR learning for retraining our transferred model in the sequel. Consider the new input Wi-Fi signature signals collected from, say, the Rx (or Tx) side (for each single orientation) or even from a different collection environment. These novel signals are fed into the pretrained model to generate the transferred features. Suppose $X \in \mathbb{R}^{n \times m}$ and $Y \in \mathbb{R}^{n \times q}$ denote respectively the input features to the FCN (in other words, features generated from the transferred feature extractor) and the corresponding one-hot encoded matrix indicating the class label from the new position. The symbols $n$, $m$, $q$ respectively denote the number of input samples, the number of features output by the extractor, and the number of identities. By utilizing the pretrained weight matrices $\hat{W}_2, \cdots, \hat{W}_l$ obtained from the above stage, the network can be retrained according to [22], [24] utilizing the forward propagation of the network and the functional inverse based solution for $W_1$ as follows:
$$\sigma(\cdots \sigma(XW_1)\hat{W}_2 \cdots)\hat{W}_l = Y, \tag{9}$$
$$\sigma(\cdots \sigma(XW_1)\hat{W}_2 \cdots) = Y\hat{W}_l^{\dagger}, \tag{10}$$
$$XW_1 = \sigma^{-1}(\sigma^{-1}(\cdots \sigma^{-1}(Y\hat{W}_l^{\dagger})\hat{W}_{l-1}^{\dagger} \cdots)\hat{W}_2^{\dagger}), \tag{11}$$
$$W_1 = X^{\dagger}\sigma^{-1}(\sigma^{-1}(\cdots \sigma^{-1}(Y\hat{W}_l^{\dagger})\hat{W}_{l-1}^{\dagger} \cdots)\hat{W}_2^{\dagger}), \tag{12}$$
where $\sigma(\cdot)$ and $\dagger$ respectively denote an invertible activation function (with inverse $\sigma^{-1}(\cdot)$) and the Moore-Penrose inverse operation [49].
After $W_1$ is retrained following (12), it is back-substituted into (9) to retrain $W_2$ based on:
$$\sigma(\cdots \sigma(\sigma(XW_1)W_2)\hat{W}_3 \cdots)\hat{W}_l = Y, \tag{13}$$
$$W_2 = [\sigma(XW_1)]^{\dagger}\sigma^{-1}(\cdots \sigma^{-1}(Y\hat{W}_l^{\dagger}) \cdots \hat{W}_3^{\dagger}). \tag{14}$$
This functional inverse based learning is repeated until $W_l$ is retrained as shown in (15):
$$W_l = [\sigma(\cdots \sigma(\sigma(XW_1)W_2) \cdots W_{l-1})]^{\dagger}Y. \tag{15}$$
After $W_1, W_2, \cdots, W_l$ have been learned, the one-hot output prediction matrix can be estimated as follows:
$$\hat{Y} = \sigma(\cdots \sigma(\sigma(XW_1)W_2) \cdots W_{l-1})W_l, \tag{16}$$
where $\hat{Y} \in \mathbb{R}^{n \times q}$. We shall call this learning process for our retraining KAR-retrain.
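The layer-by-layer retraining can be sketched in NumPy as below. This is an illustrative reading of the procedure rather than the reference implementation of [22]-[25]: the choice of `arcsinh` as the invertible activation, the `rcond` cutoff in the pseudoinverse, the absence of bias terms, and the toy layer sizes and random "pretrained" weights are all assumptions.

```python
import numpy as np


def sigma(z):
    # Invertible activation (assumed here; any bijective activation could be used).
    return np.arcsinh(z)


def sigma_inv(a):
    return np.sinh(a)


def pinv(a):
    # Moore-Penrose inverse; rcond drops numerically negligible singular values.
    return np.linalg.pinv(a, rcond=1e-10)


def kar_retrain(X, Y, W_hat):
    """Single-pass retraining of all FCN layers.

    X: (n, m) transferred features, Y: (n, q) one-hot labels,
    W_hat: [W2_hat, ..., Wl_hat] pretrained weights. Returns [W1, ..., Wl].
    """
    W = [None] + [np.asarray(w, dtype=float) for w in W_hat]
    n_layers = len(W)
    for k in range(n_layers):
        A = X
        for j in range(k):                       # forward pass through retrained layers
            A = sigma(A @ W[j])
        T = Y
        for j in range(n_layers - 1, k, -1):     # invert the layers that follow layer k
            T = sigma_inv(T @ pinv(W[j]))
        W[k] = pinv(A) @ T                       # least-squares solution for layer k
    return W


def predict(X, W):
    A = X
    for w in W[:-1]:
        A = sigma(A @ w)
    return A @ W[-1]


# Toy example: 8 samples, 20 features, hidden sizes 16 and 8, 4 identities.
rng = np.random.default_rng(0)
n, m, h1, h2, q = 8, 20, 16, 8, 4
X = rng.standard_normal((n, m))
Y = np.eye(q)[np.arange(n) % q]
W_hat = [rng.standard_normal((h1, h2)), rng.standard_normal((h2, q))]  # stand-ins for pretrained weights
W = kar_retrain(X, Y, W_hat)
Y_hat = predict(X, W)
```

No learning rate or iteration count appears anywhere: each layer is obtained from one pseudoinverse, which is the property that makes the retraining single shot. In this small over-parameterized toy setting the retrained network should reproduce the training labels.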

C. DECISION FUSION
According to [50], multibiometrics can be utilized to overcome limitations inherent in each unibiometric system. According to [50], [51], multibiometric fusion can be performed at different levels, namely the data level, the feature level, the score level, the rank level and the decision level. Among these fusion levels, the score-level fusion is among the most commonly used due to the ease of accessing scores generated by commercial matchers [50], [51]. Moreover, it is known to produce the best classification accuracy performance [47], [50], [51]. For example, simple non-learning based algorithms such as the SUM-rule, the MAX-rule and the MIN-rule were performed and compared in [52]-[55]. Apart from the above fusion means, learning based algorithms, such as SVM [56] and TER [57], can be adopted for score level fusion [47], [58]-[60]. However, learning based fusion requires additional computational cost for training. For simplicity, we adopt the score-level fusion based on the rule-based operation. To take advantage of a multibiometric system, we prepare two different kinds of signals in the preprocessing stage of the proposed system, which can be regarded as multi-samples of the same biometric modality captured with certain variations. As illustrated in part (d) of Figure 2, let $\hat{Y}_{1d} \in \mathbb{R}^{n \times q}$ and $\hat{Y}_{2d} \in \mathbb{R}^{n \times q}$ respectively denote the predicted score matrices of the 1-dimensional and the 2-dimensional Wi-Fi signals from the same user in our proposed system. After a SoftMax normalization [61], the scores for the 1-dimensional and the 2-dimensional Wi-Fi signals both lie within [0, 1]. We fuse these two normalized scores by an element-wise rule-based operation to get the final score matrix $\hat{Y} \in \mathbb{R}^{n \times q}$:
$$\hat{Y} = L(\hat{Y}_{1d}, \hat{Y}_{2d}), \tag{17}$$
where $L$ denotes an element-wise rule-based operation, such as SUM, MAX, or MIN.
After score fusion, the classification accuracy can be calculated by comparing the fused one-hot label with the ground truth label (i.e., Y). VOLUME 9, 2021
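The fusion step described above can be illustrated with a minimal sketch. It assumes the two score matrices are plain nested lists of raw (pre-SoftMax) scores; the names `softmax`, `fuse_scores`, `predict` and the toy scores are hypothetical and are not taken from the authors' implementation.

```python
import math

def softmax(row):
    """Normalize one row of raw scores into [0, 1] so that the row sums to 1."""
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Element-wise fusion rules (the operator L in the text).
RULES = {
    "SUM": lambda a, b: a + b,
    "MAX": max,
    "MIN": min,
}

def fuse_scores(y_1d, y_2d, rule="SUM"):
    """Fuse two n x q score matrices element-wise after SoftMax normalization."""
    op = RULES[rule]
    fused = []
    for row_a, row_b in zip(y_1d, y_2d):
        sa, sb = softmax(row_a), softmax(row_b)
        fused.append([op(a, b) for a, b in zip(sa, sb)])
    return fused

def predict(fused):
    """Rank-1 decision: index of the highest fused score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]

# Toy example with n = 2 samples and q = 3 identities (made-up scores).
y_1d = [[2.0, 1.0, 0.1], [0.2, 0.3, 0.1]]
y_2d = [[1.5, 0.2, 0.3], [0.1, 0.2, 2.5]]
labels = predict(fuse_scores(y_1d, y_2d, "SUM"))
```

In the second toy sample the 1D scores are nearly flat while the 2D scores clearly favor the third identity, so the fused decision follows the more confident branch.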

IV. EXPERIMENTS
The main goal of this study is to verify the effectiveness of the proposed system for user identification and verification using the Wi-Fi based in-air handwritten signature signals. Firstly, we introduce the details of our dataset and parameter settings in Section IV-A. Secondly, we provide the experimental results and discussion in Section IV-B. Essentially, we evaluate several pretraining models to select the best architecture for each of the two differently preprocessed data (i.e., the 1- and 2-dimensional data). Subsequently, we compare the result of KAR learning with that of the GD algorithm at the retraining stage. We also show results of non-transfer learning based methods (i.e., training from scratch at each orientation using SVM, ELM and CNN). Next, the impacts of ambient wireless interference and packet rates are evaluated. Finally, we compare all the methods in terms of the verification and identification accuracies as well as the elapsed training time. Table 4 summarizes our experiments with brief descriptions.

A. EXPERIMENTAL SETTINGS
1) DATASETS
Table 5 summarizes the dataset for our experimentation. An expanded version of the Wi-Fi in-air signature dataset from [11] is utilized. Under the acquisition setup described in Section III-A, the dataset consists of 8000 samples collected from 100 subjects. The subjects were requested to draw their signature in the air while sitting at 2 different positions (i.e., the Tx side and the Rx side as shown in Figure 3 (b)). At each position, the subjects were asked to face 4 different directions (i.e., front, right, left and back).
For each orientation, 10 samples were collected for each subject, resulting in 1000 samples per orientation and 8000 samples in total (i.e., 10 samples × 8 orientations × 100 subjects).
In order to evaluate the impact of ambient wireless signal interference, another dataset has been collected with and without ambient interference. The ambient interference has been produced by two smartphones, which were carried by each subject and the experimenter. This dataset has been collected at a different location, with the user sitting in between the Tx side and the Rx side, in order to also observe the impact of the geographical location. Following the same acquisition protocol described above, this dataset consists of 200 samples from 5 subjects (i.e., 10 samples × 4 orientations × 5 subjects). According to [16], [62], [63], the performance of a Wi-Fi CSI-based recognition system highly depends on the granularity of the captured CSI signal. Therefore, in order to also study the impact of receiving packet rates on the recognition accuracy, the data has been collected using two different receiving packet rates (1kp/s and 10kp/s). As the set of genuine-users under the identification mode is relatively small for representative learning, only the verification mode is studied here. Table 6 summarizes the four subsets of data collected for this study.

2) EVALUATION PROTOCOL
The experiments were performed using a desktop computer equipped with an i7 processor (3.70GHz), together with an NVIDIA GeForce GTX 1080 Ti GPU and 32GB of RAM. Both transfer-based learning and non-transfer-based learning algorithms were evaluated.
For evaluations of transfer-based learning methods, in the pretraining stage the dataset was divided into two groups, namely a training set and a test set. For example, the dataset (4000 samples) on four writing directions at the Tx side was separated into a training set (3200 samples, 80%) for training, and a test set (800 samples, 20%) for evaluation. A 5% portion of the training set (i.e., the validation set) was used to fine-tune the hyper-parameters. In the retraining stage, the performances were evaluated and averaged from five-fold cross-validation tests. For example, the dataset (1000 samples) of a single direction at the Rx side was divided into a training set (800 samples, 80%), and a test set (200 samples, 20%). Subsequently, these two positions were swapped for pretraining (Rx side) and retraining (Tx side). For non-transfer-based learning methods, the five-fold cross-validation tests have been utilized for performance measure.
[Table caption fragment; beginning not recovered] ... k, s, p, c) where the symbols c and k respectively indicate the number of outputs and the channel size. We have k_1 and k_2 for the 1D and the 2D signals, respectively. s and p respectively denote the stride and the padding. The pooling layer has three parameters given by (k, s, p).
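The split sizes quoted in the protocol can be reproduced with a small sketch. It assumes a plain random (non-stratified) shuffle, which the text does not specify; `split_indices`, `k_fold` and the seed are hypothetical helpers for illustration only.

```python
import random

def split_indices(n, test_fraction, seed=0):
    """Shuffle range(n) and split it into (train, held_out) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_held = round(n * test_fraction)
    return idx[n_held:], idx[:n_held]

def k_fold(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Pretraining: 4000 samples at one position -> 3200 train / 800 test.
train, test = split_indices(4000, 0.20)
# 5% of the training set is held out for hyper-parameter tuning.
# Note: fit and val hold positions within `train`, not original sample ids.
fit, val = split_indices(len(train), 0.05)

# Retraining: 1000 samples of one direction -> five folds of 800 / 200.
fold_sizes = [(len(tr), len(te)) for tr, te in k_fold(1000, k=5)]
```

A stratified split per subject would be a natural alternative when every identity must appear in each fold; the sketch does not enforce that.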
For the identification task, the rank-1 identification accuracy is adopted as the metric for performance evaluation. As mentioned in Section III-C, the system produces the final score matrix $\hat{Y} \in \mathbb{R}^{n \times q}$, where $n$ and $q$ respectively denote the number of input samples and the number of identities. The predicted identity label $\hat{y} \in \mathbb{R}^{n}$ can be obtained from $\hat{Y}$ by determining the rank-1 (i.e., the highest-scoring) identity for each sample. The identification accuracy is calculated by comparing the predicted label $\hat{y}$ with the ground truth label $y$:
$\text{Identification accuracy} = \dfrac{n(\hat{y} \cap y)}{\text{Total number of input samples } n}$
where $n(\hat{y} \cap y)$ denotes the number of samples whose predicted label matches the ground truth label.
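As a minimal sketch of the rank-1 identification accuracy defined above, the following example takes a score matrix as a nested list and counts the samples whose top-scoring identity equals the ground truth. The function names and the toy scores are illustrative assumptions.

```python
def rank1_labels(scores):
    """Return, for each row of the n x q score matrix, the index of its top score."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

def identification_accuracy(scores, y_true):
    """Fraction of samples whose rank-1 identity equals the ground-truth label."""
    y_pred = rank1_labels(scores)
    matches = sum(p == t for p, t in zip(y_pred, y_true))
    return matches / len(y_true)

# Toy example: n = 4 samples, q = 3 identities (scores are made up).
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]
y_true = [0, 1, 2, 1]
acc = identification_accuracy(scores, y_true)  # 3 of the 4 samples match
```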
For the verification task, the degree of matching between two biometric templates is measured. In our implementation, the known target labels (1 for genuine-users and 0 for impostors) of the template pairs from the training set have been used to learn the networks (KAR-retrain, GD-retrain, CNN-scratch) [64]. The genuine-users refer to matching of templates drawn from the same user while the impostors refer to matching of templates drawn from different users. The verification accuracy is subsequently computed based on the population of test matches obtained from the genuine-users and the impostors in terms of the EER [32]. The EER is obtained based on the intersection of the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) curves. The FAR is the number of false accept counts over the size of the impostor population. The FRR is the number of false reject counts over the size of the genuine-user population. Since the ratio of the genuine-user and the impostor populations is highly imbalanced, we randomly subsample the impostor pairs so that their number matches that of the genuine pairs. Also, limited by our memory constraint for KAR learning, only the 1-dimensional data is utilized for the verification task. All the experiments are implemented using the PyTorch [65] deep learning framework.
Table 7 summarizes the parameter settings adopted in our experiments. For preprocessing, the packet size for resampling in (2) and (4) has been empirically fixed at 500 and the shifting distance k in Figure 4 has been empirically selected as 5. For the transfer learning (i.e., pretraining + retraining), there are several parameters to be set. At the pretraining stage of our proposed system, the CNN was trained starting with a learning rate (η) of 0.0001 utilizing the Cross Entropy Loss [66]. The Adam optimizer [67] with an L2 penalty of 0.0001 was adopted for the training. After half of the total epochs had elapsed, the η value was halved. The training epochs and batch size were empirically set at 120 and 64, respectively. The trained feature extractor which showed the best classification accuracy during training iterations was subsequently saved and transferred.
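The EER used for the verification task in the evaluation protocol can be sketched as follows. The example assumes that a pair is accepted when its matching score is at or above a threshold and approximates the FAR/FRR intersection by scanning the observed scores; the balanced subsampling of impostor pairs is omitted. The function names and the toy scores are hypothetical.

```python
def far_frr(genuine, impostor, threshold):
    """FAR and FRR at one threshold; a pair is accepted when score >= threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)  # false accepts
    frr = sum(s < threshold for s in genuine) / len(genuine)     # false rejects
    return far, frr

def equal_error_rate(genuine, impostor):
    """Approximate the EER by scanning every observed score as a threshold."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the mean of FAR and FRR where the two curves are closest.
            best = (gap, (far + frr) / 2)
    return best[1]

# Made-up matching scores for five genuine and five impostor pairs.
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.7, 0.5, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine, impostor)
```

With these toy scores the two curves meet at a threshold of 0.6, where one impostor pair is accepted and one genuine pair is rejected.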

3) PARAMETER SETTINGS FOR TRANSFER-BASED LEARNING
In order to compare with the adopted KAR-retrain at the retraining stage, the GD algorithm was evaluated starting with an η value of 0.0001 for the 1-dimensional data and 0.0005 for the 2-dimensional data. Similar to the pretraining stage, the η value was halved when the training iterations passed half of the total epochs. The training epochs were empirically set at 50. For the GD algorithm, all other settings including the loss function, the optimizer and the batch size were chosen to be the same as those of the pretraining stage. At the testing stage, a score-level fusion with element-wise rule-based operations (i.e., SUM, MAX, and MIN) was adopted.
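The learning-rate schedule described for the pretraining and GD-retrain stages can be written as a simple step function. This is a sketch that assumes epochs are counted from 0 and that the halving takes effect from the first epoch of the second half; the exact switching epoch is not stated in the text, and `learning_rate` is a hypothetical helper.

```python
def learning_rate(epoch, total_epochs, base_lr):
    """Step schedule: base_lr for the first half of training, base_lr / 2 afterwards."""
    return base_lr if epoch < total_epochs / 2 else base_lr / 2

# GD-retrain on 1D data: 50 epochs starting from 1e-4 (values from the text).
schedule_1d = [learning_rate(e, 50, 1e-4) for e in range(50)]
# GD-retrain on 2D data: 50 epochs starting from 5e-4.
schedule_2d = [learning_rate(e, 50, 5e-4) for e in range(50)]
# Pretraining: 120 epochs starting from 1e-4.
schedule_pre = [learning_rate(e, 120, 1e-4) for e in range(120)]
```

In PyTorch, a comparable schedule could be obtained with `torch.optim.lr_scheduler.MultiStepLR` using a single milestone at half the epochs and `gamma=0.5`, although the text does not say which scheduler was used.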

4) PARAMETER SETTINGS FOR NON-TRANSFER-BASED LEARNING
For non-transfer-based learning, we evaluate three learning models, namely the SVM [56], the Extreme Learning Machine (ELM) [68], and a CNN trained from scratch which utilized a similar structure to that used in the pretraining stage. Before applying the SVM, the dimension of the input signals was reduced to d dimensions using the Principal Component Analysis (PCA) [69] in view of the heavy computational overhead. Several reduced feature dimensions d ∈ {50, 100, ..., 300} were selected and compared in Section IV-B2. Subsequently, common kernel functions such as linear, polynomial, and Radial Basis Function (RBF) kernels were also investigated for the SVM. The degree of the polynomials and the RBF kernel coefficients were empirically set as shown in Table 7.
In a similar manner, for ELM, the input features were reduced to d dimensions by PCA. Several sizes of the hidden neurons α ∈ {2000, 4000, 6000, 8000} were compared in Section IV-B2. The sigmoid function was used as the activation function in ELM.
For training from scratch using CNN, the training epochs and learning rate were empirically set at 100 and 0.0001, respectively. All other settings (i.e., the loss function, the optimizer and the batch size) were chosen to be similar to that in the pretraining stage.

1) RESULTS OF TRANSFER LEARNING MODELS
a: EVALUATION OF PRETRAINING MODELS
Six CNN models have been evaluated to determine the best architecture for pretraining each of the two different types of experimental signals. Table 8 details the architectures of the CNN models utilized for the pretraining study. As shown in the table, CNN_1 and CNN_3 have no BN layer [48] while CNN_2, CNN_4, and CNN_5 have. CNN_6 downsamples the feature-map with strided convolution instead of a pooling layer.
In order to increase the size of the dataset, a random horizontal flipping and a random vertical flipping with probability 0.5 have been used for the 2-dimensional (2D) Wi-Fi data while only a horizontal flipping is utilized for the 1-dimensional (1D) data. All the models are trained from scratch using the dataset from every orientation at the Tx side (or the Rx side). The identification accuracies are presented in Table 9. For the 1D data, CNN_6 shows the best averaged identification accuracy while CNN_4 shows the best result for the 2D data. Following this observation, we use CNN_6 and CNN_4 as the pretraining models for the 1D and the 2D data, respectively.
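The flipping augmentation mentioned above can be sketched with plain lists. This example assumes that a horizontal flip reverses the order of elements along each row (or along the whole sequence for 1D data) and that a vertical flip reverses the row order; which signal axes these correspond to in the authors' data is not stated. All function names are hypothetical.

```python
import random

def hflip(x):
    """Reverse each row of a 2D array, or the whole sequence for 1D data."""
    if x and isinstance(x[0], list):
        return [row[::-1] for row in x]
    return x[::-1]

def vflip(x):
    """Reverse the row order of a 2D array."""
    return x[::-1]

def augment(x, is_2d, rng, p=0.5):
    """Apply a random horizontal flip, plus a random vertical flip for 2D data."""
    if rng.random() < p:
        x = hflip(x)
    if is_2d and rng.random() < p:
        x = vflip(x)
    return x

rng = random.Random(0)          # fixed seed so that runs are repeatable
sample_1d = [1, 2, 3, 4]
sample_2d = [[1, 2], [3, 4]]
aug_1d = augment(sample_1d, is_2d=False, rng=rng)
aug_2d = augment(sample_2d, is_2d=True, rng=rng)
```

In a PyTorch pipeline, `torchvision.transforms.RandomHorizontalFlip` and `RandomVerticalFlip` offer the same kind of operation for image-like tensors.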
Under the verification mode, only the 1D data has been used for the convolutional network method due to our memory constraints. Therefore, only CNN_6 for the 1D data is used for the pretraining. The EER performance of CNN_6 is 4.3% at the Tx side and 2.7% at the Rx side.

b: EVALUATION OF RETRAINING ALGORITHMS
Table 10 shows the average identification accuracies of the two retraining algorithms (KAR-retrain and GD-retrain) evaluated under the parameter settings described in Section IV-A. For the 1D data, the KAR-retrain yields a higher average accuracy (98.05% at the Tx side and 98.65% at the Rx side) than the GD-retrain (96.55% at the Tx side and 98.175% at the Rx side) at the retraining stage (see the "1D" columns of Table 10). For the 2D data, the GD-retrain (97.85% at the Tx side and 98.25% at the Rx side) outperforms the KAR-retrain (95.25% at the Tx side and 96.45% at the Rx side) as shown in the "2D" columns of Table 10.
With score fusion, the KAR-retrain shows comparable or slightly higher accuracy (see the "Score fusion" columns of Table 10). Among the three fusion rules (i.e., SUM, MAX and MIN), both learning algorithms achieve the highest accuracy using the SUM-rule operation (named fusion-SUM hereafter). With fusion-SUM, the KAR-retrain shows marginally higher accuracy (99.75% at the Tx side and 99.875% at the Rx side) than the GD-retrain. Table 13 shows the EER performance of KAR-retrain and GD-retrain under the verification mode. The GD-retrain shows a slightly lower EER than the KAR-retrain at both the Tx and the Rx sides (2.9% at the Tx side and 2.2% at the Rx side).

2) RESULTS OF NON-TRANSFER LEARNING MODELS
a: EVALUATION OF TRAINING FROM SCRATCH USING SVM
Since applying the SVM directly on the raw dataset is computationally expensive, we utilize the PCA to reduce the number of features (called PCA-SVM hereafter). Figure 6 shows the average identification accuracies plotted over the reduced feature dimensions. For the 1D data, the SVM with RBF kernel (SVM-RBF) shows the best performance over all feature sizes except for the case of the reduced features of 50 dimensions, where the SVM with linear kernel (SVM-linear) shows marginally better accuracy. For the 2D data, the SVM-linear shows the best performance in all cases. Among these studied cases, applying the SVM-linear to the reduced features of 50 dimensions shows the best performance on both types of data (i.e., 1D and 2D data) in terms of the average identification accuracy, while the SVM with polynomial kernel (SVM-poly) shows the worst performance. Therefore, we utilize the former case for score fusion to compare with the proposed system. The "PCA-SVM" column of Table 11 shows the results of score fusion. Among the three fusion rules, fusion-SUM achieves the highest result (96.075% at the Tx side and 96.825% at the Rx side).
TABLE 11. The identification accuracy (%) of training from scratch using PCA-SVM with fusion-SUM and CNN evaluated from all four facing directions and two subject positions. All the performances are averaged from 5-fold cross validation tests.
For the verification task, the PCA-SVM shows an average EER of 12.5% at the Tx side and 7.8% at the Rx side (see Table 13). The number of reduced features adopted was five and the kernel of SVM adopted was RBF.

b: EVALUATION OF TRAINING FROM SCRATCH USING ELM
Similar to the SVM case, we reduce the number of features using PCA before applying ELM on the dataset (named PCA-ELM hereafter). Figure 7 shows the average accuracies of PCA-ELM plotted over the reduced dimensions. For the 1D data, the ELM with 8,000 hidden neurons (ELM-8000) shows the best performance in all cases. Among these studied cases, applying the ELM-8000 to the reduced features of 50 dimensions shows the best result. For the 2D data, the ELM with 6,000 hidden neurons (ELM-6000) shows results comparable to those of ELM-8000. However, applying the ELM-6000 to the reduced features of 50 dimensions shows the best performance over the other cases. Therefore, similar to the SVM case, we utilize the former case for score fusion to compare with the proposed system. The results of score fusion are presented in the "PCA-ELM" column of Table 11. Among the three fusion rules, fusion-SUM achieves the highest performance (96.275% at the Tx side and 96.725% at the Rx side).
For the verification task, the PCA-ELM shows an average EER of 11.7% at the Tx side and 8.8% at the Rx side. Similar to the above case (SVM), five reduced features with 8000 hidden neurons have been adopted.

c: EVALUATION OF TRAINING FROM SCRATCH USING CNN
We train the CNN from scratch using data collected at each direction as described in Section IV-A. The "CNN" column of Table 11 shows the results of training from scratch using CNN (called CNN-scratch hereafter). The selected CNN architecture is indicated next to each data type. As seen from the table, the average accuracy at the Tx side is 95.1% for the 1D data and 94.7% for the 2D data, while the average accuracy at the Rx side is 97.35% for the 1D data and 97.05% for the 2D data. Similar to the SVM and ELM cases, score fusion with a rule-based operation improves the performance. Among the three fusion rules, the highest result is obtained with fusion-SUM (99.45% at the Tx side and 99.85% at the Rx side).
The verification accuracy of CNN_6 is reported in Table 13 in terms of the EER. The results show an average EER of 3.0% at the Tx side and 2.3% at the Rx side for CNN_6.
Table 12 shows the average verification results (in terms of the EER, %) recorded based on 5-fold cross-validation tests. Under the scenarios with wireless interference, the data subset collected at the higher packet rate (10kp/s, 0.56%) shows a lower average EER than that at the lower packet rate (1kp/s, 3.94%). Similarly, with no wireless interference, a lower average EER is observed at the higher packet rate (10kp/s, 0.13%) than that at the lower packet rate (1kp/s, 2.56%). In terms of the total average, the case without interference (1.34%) has a lower EER (better performance) than the case with interference (2.25%).

4) COMPARISON OF ALL THE METHODS
a: COMPARISON OF IDENTIFICATION ACCURACY AND EQUAL ERROR RATE
Figure 8 plots the average identification accuracies (from all four facing directions at each position) of all the compared learning algorithms over each signature orientation. According to Figure 8(a), the PCA-SVM and the PCA-ELM show comparable average accuracies (96.075% and 96.275% at the Tx side and 96.825% and 96.725% at the Rx side). These accuracies are significantly lower than those of KAR-retrain, GD-retrain and CNN-scratch. Figure 8(b) shows an enlarged plot for KAR-retrain, GD-retrain and CNN-scratch. These results show either better or comparable accuracy of KAR-retrain relative to GD-retrain and CNN-scratch. Table 13 shows the EER of all the compared methods. Similar to the identification case, the PCA-SVM and the PCA-ELM show similar average EERs (12.5% and 11.7% at the Tx side and 7.8% and 8.8% at the Rx side), which are worse than those of KAR-retrain, GD-retrain and CNN_6. However, KAR-retrain shows an EER comparable to those of GD-retrain and CNN_6 (about 0.8% higher EER on average) in verification mode.

b: COMPARISON OF ELAPSED TRAINING TIME
Table 14 shows the time taken to train the compared algorithms in seconds. All the compared algorithms have been trained using the GPU except for PCA-SVM, which has been trained using the CPU due to the unavailability of codes for GPU.
Non-transfer learning methods: among the three non-transfer learning methods in the table, the CNN-scratch shows the longest training time. Owing to their shallow architecture, both PCA-SVM and PCA-ELM show a much faster training time than that of the CNN-scratch. However, this speed comes at the price of a compromised identification accuracy.
Transfer learning methods: for the two evaluated methods based on transfer learning, the retraining time is compared between KAR-retrain and GD-retrain. The results in Table 14 show a much faster training speed of KAR-retrain than that of GD-retrain, both of which have been built upon a pretraining overhead (with pretraining time = 64.49 secs (1D), 162.41 secs (2D) utilizing data of all four directions).
Comparing between the transfer learning methods and the non-transfer learning methods, the former shows faster adaptation when there are more unseen user positions (apart from the studied Tx and Rx positions) for retraining. In other words, the transfer learning in our system trains only the last three layers in a single shot manner whenever a new user position is added while not compromising the identification accuracy. Such a fast adaptation to new user positions is a clear advantage over training from scratch for real world applications.

5) SUMMARY OF RESULTS
The results are summarized and discussed as follows:
- As observed from Table 9, making the CNN architecture deeper does not help to improve the performance for the 2D data, while it enhances the performance for the 1D data. We infer that this is because the 2D data has the same number of samples as the 1D data, but with richer features. This may indicate that training a CNN on 2D data needs many more samples in view of the higher dimensional features. Hence, training the deeper CNN using 2D data shows lower performance than that of the shallower CNN.
- From Figure 8, the proposed KAR-retrain outperforms the compared methods in terms of the average identification accuracy. Although the GD-retrain shows accuracy comparable to that of the KAR-retrain (see Table 10 for details), the KAR-retrain is shown to learn the transferred features with much lower computational cost (see the "Transfer learning" column of Table 14). This is because learning a classifier (i.e., FCN) using the GD algorithm requires iterative search while the KAR learning learns the weights of the classifier in a single operating pass.
- Among the compared methods, the CNN-scratch also shows a competitive accuracy (see Figure 8). However, its computational cost is about 10 to 16 times heavier than that of the KAR-retrain when the pretraining overhead is not considered. The main reason is the iterative search of the GD algorithm used to train the CNN-scratch.
- Although the PCA-SVM and the PCA-ELM classify the data with high computational efficiency (see Table 14), they achieve the lowest identification accuracy among the compared methods. Besides, the two learning algorithms incur an additional cost of using PCA for dimension reduction. Compared with GD-retrain and CNN-scratch, the KAR-retrain provides a balance between identification accuracy and computational efficiency.
- Since the proposed system trains only the final three layers of the network for transfer learning, the adaptation of the network to new user positions and orientations is much faster than that of training from scratch.
- Under the verification mode, the EER performance for KAR-retrain is observed to be much better than that of PCA-SVM and PCA-ELM and comparable to that of GD-retrain and CNN_6 (see Table 13).
- The difference in experimental location does not have an apparent impact on the verification performance (see Table 12).
- A higher receiving packet rate of 10kp/s shows a better verification accuracy than a lower receiving packet rate of 1kp/s. We have observed that even though severe packet loss has occurred during data acquisition, the higher packet rate of 10kp/s can capture about 1.5 to 2 times more packets than that at 1kp/s. Hence, setting the receiving packet rate as high as possible is advantageous for capturing accurate in-air signatures.
- Under the presence of wireless interference, the verification accuracy is observed to be slightly degraded compared with that without ambient interference (see Table 12). This shows the feasibility of the proposed system for application under practical scenarios.
- In this feasibility study, the improvements in terms of recognition accuracy and training time have been obtained based on the data sets captured under a similar setting. Moreover, as shown in Table 12, the proposed KAR-retrain performs favorably well on the dataset captured at a different location (0.13% at most in terms of EER). This is a clear sign that our system can effectively handle the varying patterns of captured signals at a different location by adopting the functional inverse based transfer learning.

6) FUTURE WORKS
Compared with the studied methods, the proposed KAR-retrain shows promising performance with low retraining time. The study using different packet rates shows improved verification accuracy for the higher receiving packet rate. Moreover, the interference study shows minor degradation of verification accuracy with ambient interference. In order to improve the robustness of the system for a production environment, several aspects can be investigated in future. These include recognition of multiple in-air signatures and handling of spoofing attacks. Since the signature gesture can be visible during authentication, a challenging topic for future research would be whether a forgery of the personal signature gesture can be detected. Recognition of multiple users could be another challenging topic.

V. CONCLUSION
In this paper, a novel Wi-Fi enabled system based on in-air handwritten signatures was proposed for user identification, and its feasibility was established. The proposed system utilized the variation of CSI amplitude information caused by the in-air writing movements as patterns for recognition.
To address the sensitivity of the captured patterns to different user positions, transfer learning was adopted to avoid training from scratch for each position. KAR learning was adopted in the retraining stage to reduce the computational cost of learning. Subsequently, two differently preprocessed features were fused at the score level for performance enhancement. Our experimental results based on a moderate-sized in-air signature dataset showed promising identification accuracy of the proposed system with relatively low computational cost during transfer.