Air-Writing Recognition Based on Deep Convolutional Neural Networks

Air-writing recognition has received wide attention due to its potential application in intelligent systems. To date, some of the fundamental problems in isolated writing have not been addressed effectively. This paper presents a simple yet effective air-writing recognition approach based on deep convolutional neural networks (CNNs). A robust and efficient hand tracking algorithm is proposed to extract air-writing trajectories collected by a single web camera. The algorithm addresses the push-to-write problem and avoids restrictions on the users’ writing without using a delimiter or an imaginary box. A novel preprocessing scheme is also presented to convert the writing trajectory into appropriate forms of data, making the CNNs trained with these forms of data simpler and more effective. Experimental results indicate that the proposed approach not only obtains much higher recognition accuracy but also reduces the network complexity significantly compared to the popular image-based methods.


I. INTRODUCTION
With the rapid growth of artificial intelligence technology, many intelligent applications have been developed, such as smart TVs and intelligent robots. The most natural way for humans to communicate with these intelligent systems is dynamic gestures. In recent years, air writing has become one of the most popular dynamic gestures. It is defined as writing alphanumeric characters with hand or finger movements in a three-dimensional (3D) free space. Air writing is particularly useful for user interfaces that do not allow the user to type on a keyboard or write on a touchpad/touch screen, or for text input for intelligent system control [1].
Air-writing recognition is closely related to motion gestures or sign language recognition. Motion gesture recognition methods can be roughly divided into two categories: device-based and device-free. The device-based method requires the use of either handheld or worn devices to obtain hand (or finger) movement in three dimensions, for example, handheld pointing devices such as Wii [1], inertial sensors attached to a glove [2], [3], or motion sensors on the watch [4]. However, handheld or worn devices and sensors are troublesome and complicated to use; thus, device-based methods are not commonly used. By contrast, in the device-free method, users do not need to hold or wear any devices; hence, this method is more convenient than the device-based method. Device-free methods can be further divided into vision-based and radio-based methods. The former utilizes 2D or 3D cameras to capture gesture input images. The latter uses radio sensors such as radar [5]-[7] or WiFi [8]-[11] to obtain gesture signals.
The associate editor coordinating the review of this manuscript and approving it for publication was Li Zhang.
Air writing can be realized in three manners [1]: isolated, connected, and overlapped air writing. In isolated writing, the letters are written in an imaginary box with fixed height and width in the field of view of an image, one at a time. In connected writing, multiple letters are written from left to right, which is similar to writing on a paper. In the last manner, one can write multiple letters stacked contiguously one over another in the same imaginary box. We study the isolated writing style in this paper.
Isolated writing is the most essential and popular method. Motion characters are isolated alphanumeric letters written in a unistroke. The steps involved in air-writing recognition generally include hand/finger tracking, feature extraction and classification. The fundamental problems in isolated writing include [1], [12]: (a) tracking of hand and/or fingers, (b) segmentation of writing acts (or push-to-write), (c) restrictions on the users' writing due to the limitation of an imaginary box, and (d) intraclass variability of the writing patterns of a letter. For vision-based methods, the first problem has been addressed, but different solutions must be used for 2D and 3D image sensors. 2D camera-based systems often utilize color markers on fingers to increase tracking performance since finger tracking without markers is challenging. 3D camera-based systems address the hand/finger tracking problem well simply using the depth information provided by 3D image sensors such as Kinect [12], Leap Motion Controller (LMC) [13], or Intel RealSense camera [14].
VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Air writing lacks a reference position on the writing plane and thus lacks the beginning and end points of a stroke. Therefore, it needs to automatically detect the start and end coordinates of the characters written in the air. This is referred to as segmentation of writing acts, or the so-called push-to-write problem. One of the possible solutions is to use a specific posture to signal the endpoint of a writing act [1], e.g., a fist posture. However, this will increase the number of gestures that users must remember. When depth information is available, the segmentation of writing acts can be done by merely using a depth threshold [15].
In summary, 3D camera-based systems address the first two problems more conveniently than 2D camera-based systems. However, 3D systems are more complex and expensive.
The imaginary box limits the range of writing. It reduces the variations of letter input such as position, scaling or rotation of the written image. This alleviates the burden of the subsequent processing. Nevertheless, from the users' perspective, this method causes inconvenience and restrictions of users on writing.
In this paper, we design a simple yet effective air-writing recognition approach based on deep convolutional neural networks (CNNs) using a single low-cost 2D web camera. Our approach solves the first three problems in a convenient manner. Furthermore, it can work in a real-time smart-TV-like environment. The major contributions of this article are: (a) A robust air-writing trajectory acquisition algorithm based on a web camera. The algorithm combines skin and moving features to detect the moving skin region and then applies the Camshift algorithm to track the moving hand. It performs hand tracking only, thus avoiding the complicated procedures for finger tracking. In addition, the proposed algorithm solves the push-to-write problem without using a delimiter. Furthermore, it does not utilize an imaginary box; hence, users can write freely in the air without any restrictions. (b) A novel data preprocessing scheme. The scheme normalizes the x and y coordinate sequences of the writing trajectory and then combines them into 1D and 2D arrays. The two types of data arrays are employed to train 1D-CNN and 2D-CNN. These simple data arrays make the designed CNNs simpler and more effective than the use of complex written images. (c) A CNN-based air-writing recognition system using a low-cost web camera. It achieves real-time recognition with a high accuracy of more than 99% and very low network complexity. It outperforms the popular approaches using written images as input. The remainder of this paper is organized as follows. Section II discusses the related prior work. Section III describes the proposed method in detail. The experimental results are presented in Section IV. Finally, the conclusions are drawn in Section V.

II. RELATED WORK
This work presents a vision-based approach; hence, only the vision-based methods that utilize 2D and 3D cameras in the literature are discussed in the following.
Many studies have been carried out with 2D technology. Air-writing recognition can also be considered in parallel to hand gesture recognition. The steps involved in vision-based 2D hand gesture recognition are hand/finger detection and tracking, feature extraction and classification. An early vision-based work by Oka et al. [16] used a complex device with an infrared and color sensor for fingertip tracking and recognition.
To simplify the acquisition process of the writing trajectory based on generic 2D video cameras, Roy et al. [17] used a marker of a fixed color for writing in the air. The marker tip can be easily detected by color-based segmentation. This work also presented a velocity threshold of writing to achieve the segmentation of writing acts. The proposed method achieved 97.7%, 95.4% and 93.7% recognition rates in person-independent evaluations over English, Bengali and Devanagari numerals, respectively. To incorporate flexibility in marker choice and stable motion tracking under varying lighting conditions, Rahman et al. [18] improved the marker tip tracking scheme by a marker calibration mechanism. They presented a dual network configuration consisting of RNN-LSTM (recurrent neural network-long short-term memory) networks for noise elimination and digit recognition. The proposed method yielded a recognition rate of 98.75% for single-digit recognition and 85.27% for multidigit recognition. Due to the variation in writing speed of different users, we argue that it may be difficult to set an appropriate velocity threshold value. Recently, Misra et al. [19] developed a hand gesture recognition scheme to recognize letters, numbers, arithmetic operators and ASCII characters using a red marker placed on the finger for fingertip detection. The scheme achieved a recognition rate of 96.95% for the classification of 58 gestures. The above marker-based schemes impose behavioral constraints on the users. Therefore, a marker-free approach is a better option.
Marker-free fingertip detection/tracking is very challenging because a face that is a moving object with similar skin tone to hands is present in video frames, making hand detection and hence fingertip detection much more complicated. The preceding step of fingertip detection is hand segmentation. Numerous works for hand segmentation and fingertip detection based on 2D cameras have been performed in recent years [20]. These works can be divided into two categories: model-less and model-based approaches. The former utilizes color and motion cues, which are simple and can operate in real time but are often less robust with respect to environmental variations such as illumination changes [21], [22]. By contrast, the latter usually provides higher robustness but incurs a high computational cost and requires a large amount of training data, making it unsuitable for real-time application [22]. The air-writing recognition system in [20] proposed a new writing hand pose detection algorithm for the initialization of air writing. Furthermore, the work used a distance-weighted curvature entropy for robust fingertip detection and tracking. In addition, it also proposed a termination criterion based on the moving velocity of the fingertip to serve as a delimiter and mark the completion of the air-writing gesture. Character recognition experiments gave a mean accuracy of 96.11%.
Recently, several air-writing methods based on 3D image sensors have been developed. Reference [12] presented a Kinect-based online handwriting recognition system. The authors in [13] used LMC to obtain the 3D positions of fingertips, the center of the palm and the orientation of the hand. Reference [14] developed an air-writing recognition scheme using 3D trajectories of fingertips acquired by an Intel RealSense 3D depth camera.
The writing trajectory can be obtained after the detection/tracking of fingertips is completed. The subsequent processing is to recognize the trajectory, generally including feature representation (extraction) and classification in traditional machine learning. The vision-based representation contains 3D model-based and appearance-based approaches [23]. The appearance-based approach is more widely used than the 3D model-based approach. The appearance-based model is further categorized into color-based, silhouette geometry, deformable gabarit, and motion-based models [23]. Based on these models, a wide variety of distinguishing features for the representation of gestures have been proposed in recent years [24].
The methods based on traditional machine learning extract features in a hand-designed manner and then train a classification model. While those methods are robust, they have some limitations in the generalization of the models for many cases. Recently, some deep learning-based approaches have been presented, such as [30]-[33]. The work in [30] mapped 3D fingertip coordinates acquired with LMC into a trajectory image that was used to train a 2D-CNN model. Reference [31] presented air-writing recognition based on a fusion framework that combines 2D-CNN and BLSTM to model the spatial and temporal features of gestures. The method achieved 99.25% and 99.83% for the alphabet gesture and the numeric gesture, respectively. Reference [32] developed dynamic hand gesture recognition with a 3D-CNN that fuses the motion volume of normalized depth and image gradient values. Reference [33] presented an approach for activity and gesture recognition with 3D spatiotemporal data based on a combination of a CNN and an LSTM (long short-term memory) network. The CNN was utilized to extract relevant features from 3D skeleton data, and the LSTM was applied to tackle the activity recognition. In [34], the authors proposed a gesture recognition system based on the data collected by an RGB camera and a depth sensor. By combining 3D-CNN and LSTM networks to extract spatiotemporal features of the gesture sequence, the system achieved a recognition rate of 97.8% for eight selected gestures. Reference [35] developed an air-writing recognition system using 3D trajectories collected by a depth camera. The LSTM recognizer of the system achieved the highest recognition rates of 99.17% and 99.32% for two different datasets.
The deep-learning-based methods stated above perform better than the conventional methods in recognition rate. However, most of these methods are very complex since they use 2D/3D networks and/or written images. This work aims to develop a simple yet effective system using a 1D or 2D network that utilizes only the writing trajectory data instead of images.

III. PROPOSED METHOD
The proposed air-writing method is shown in FIGURE 1. It includes three stages: trajectory acquisition, data processing and network. The image sequence is acquired with a web camera. Based on the image sequence, a novel hand tracking algorithm is presented to calculate the trajectory of a stroke that a user writes in the air. Then, the trajectory data are processed and converted into two kinds of forms: 1D arrays and 2D arrays. The two kinds of data are formed into trajectory datasets, which are used to learn CNN models in the offline training phase. During online prediction, the system receives real-time data from the web camera and then predicts the digit (or symbol) that the user writes using the learned models. We describe the three main stages of the proposed system as follows.

A. TRAJECTORY ACQUISITION
The purpose of this unit is twofold: to acquire the 2D image that the user writes in the air and to record the coordinate sequence of a stroke, called the trajectory of writing. The trajectory is formed by the coordinates of the center of a moving hand. Thus, detection and tracking of the moving hand from the 2D image sequence is essential in this unit.
Hand detection/tracking has been studied for a long time. However, it is still a challenging issue if both robustness and real-time execution are required. In this paper, we combine skin and moving features to detect the moving skin region and then apply the Camshift algorithm [36] to track the moving hand. The proposed algorithm is robust and can operate in real time.
Hand detection using skin features is a simple and fast method. However, it is prone to errors due to the interference of skin-like objects, lighting changes, and skin differences between users. To remove the interference of still skin-like objects, a moving feature is included in our skin-pixel detection algorithm. Moreover, to adapt to the skin feature variation due to lighting and user changes, we extract the skin feature of the face region of the user who is writing. Specifically, we use a face detection algorithm to extract the face region of a particular user and calculate the histogram of the H channel in HSV color space. Then, we apply backprojection on the whole image to detect other regions of the image that have the same histogram. The backprojection is calculated from the histogram. It replaces every pixel by the probability of its value occurring under that histogram.
The detected regions above can be hands or naked body parts. By combining moving features, we can further remove the naked body and detect the moving hand simply using a logical AND operator. Here, the simple frame differencing method is employed to detect moving pixels using adjacent frames, as shown in FIGURE 2(a) and FIGURE 2(b). The binary image after the AND operation may have a few small holes and/or noise. We remove them with morphological operations, after which the detected hand is clean and complete, as shown in FIGURE 2(c).
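The histogram backprojection step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the bin count, the hue range of 0 to 179 (an OpenCV-style convention), the 0.5 threshold and the function names `hue_histogram` and `backproject` are all assumptions made for illustration.

```python
def hue_histogram(hues, bins=16, hue_max=180):
    """Histogram of hue values, scaled so that the peak bin equals 1.0."""
    hist = [0] * bins
    for h in hues:
        hist[min(h * bins // hue_max, bins - 1)] += 1
    peak = max(hist) or 1
    return [count / peak for count in hist]


def backproject(hue_image, hist, hue_max=180):
    """Replace every pixel by the histogram value of the bin its hue falls in."""
    bins = len(hist)
    return [[hist[min(h * bins // hue_max, bins - 1)] for h in row]
            for row in hue_image]


# Hypothetical data: hue samples from the detected face region and a tiny 2x3 frame.
face_hues = [10, 12, 11, 13, 10, 12]
frame_hues = [[11, 100, 12],
              [90, 10, 95]]

hist = hue_histogram(face_hues)
prob = backproject(frame_hues, hist)
# Assumed threshold of 0.5 turns the probability map into a binary skin mask.
skin_mask = [[1 if p > 0.5 else 0 for p in row] for row in prob]
print(skin_mask)
```

Because the histogram is taken from the current user's face, the mask follows that user's skin tone under the current lighting, which is the adaptation the text describes.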
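The combination of skin and motion cues can be sketched as follows. This is an illustrative example only: the difference threshold of 25 and the helper names `motion_mask` and `logical_and` are assumptions, and the morphological cleanup mentioned in the text is omitted.

```python
def motion_mask(prev_gray, curr_gray, threshold=25):
    """Frame differencing: mark pixels whose intensity changed by more than threshold."""
    return [[1 if abs(c - p) > threshold else 0 for p, c in zip(prow, crow)]
            for prow, crow in zip(prev_gray, curr_gray)]


def logical_and(mask_a, mask_b):
    """Pixel-wise AND of two binary masks of the same size."""
    return [[a & b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


# Hypothetical 2x3 grayscale frames and a skin mask.
prev_gray = [[10, 10, 10],
             [10, 10, 10]]
curr_gray = [[10, 200, 10],
             [10, 10, 180]]
skin = [[1, 1, 0],   # (0,0): still skin such as a face, (0,1): moving hand
        [0, 0, 0]]   # (1,2): moving but not skin-colored

moving = motion_mask(prev_gray, curr_gray)
moving_skin = logical_and(skin, moving)
print(moving_skin)
```

Only the pixel that is both skin-colored and moving survives, so a still face and a moving non-skin object are both rejected.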
Finally, we apply the Camshift algorithm to track the moving hand region and record the center coordinates of the hand region of every frame of a gesture that corresponds to writing a character. The sequences of the coordinates form the trajectory of the character.
The procedure for obtaining the image and trajectory of air writing of a digit (symbol) is as follows. A user sits down in front of a web camera. When the system detects the face of the user, the air-writing session begins. When the user raises his or her hand and writes a character by moving his or her finger, the system will detect the moving hand. The frame immediately after the moving hand is detected is regarded as the start of a stroke. The frame immediately after the moving hand disappears is the end of the stroke. The center coordinates of the hand in all frames between the start and end of the stroke are recorded, forming the trajectory of the digit, as illustrated in FIGURE 3. In addition, the image of the written character is also recorded. The above procedure solves the push-to-write problem without a delimiter in a simple manner.
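The delimiter-free segmentation rule above can be sketched as a small function over per-frame tracking results. This is a simplified illustration: the function name `extract_stroke` and the convention of using `None` for frames without a detected moving hand are assumptions, and a practical system would likely tolerate short detection dropouts.

```python
def extract_stroke(centers):
    """Return the hand-center trajectory of the first stroke.

    centers: one entry per frame, either an (x, y) hand center or None
    when no moving hand is detected in that frame.
    """
    trajectory = []
    for center in centers:
        if center is not None:
            trajectory.append(center)   # hand visible: stroke in progress
        elif trajectory:
            break                       # hand disappeared: stroke has ended
    return trajectory


# Hypothetical tracking output for seven consecutive frames.
centers = [None, None, (100, 200), (110, 205), (120, 215), None, None]
stroke = extract_stroke(centers)
print(stroke)
```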

B. DATA PROCESSING
1) 2D IMAGE
As stated before, we convert the handwritten data into a 2D image. The original size of the captured image is 640 × 480. The user most likely writes commands in different positions in air. To attack the shift variance, we transform the captured image into an image that has a size of 360×360 and is located in the middle of a window. The resulting image is shown in the top row of FIGURE 4.
The transformation is performed using Eqs. (1) and (2), together with the definitions in Eqs. (3)-(5). In these equations, $x_i$ and $y_i$ are the original coordinates in the x-axis and y-axis, respectively, and $x'_i$ and $y'_i$ are the transformed coordinates. $x_{max}$ and $y_{max}$ are the maximal values of coordinates in the x-axis and y-axis, respectively. The purpose of $1.4 \times r$ in Eqs. (1) and (2) is to leave $0.2r$ margins on the left boundary and right boundary when the coordinates are plotted as an image. To reduce the computational load in training, the normalized image with a size of 360 × 360 is further resized to 36 × 36 with linear interpolation. The resized image set is used to implement the existing popular approaches for comparison.

2) 1D TRAJECTORY
To reduce network complexity, we normalize the coordinates of the original trajectory and then transform them into a 1D sequence. The normalization first redefines the maximal and minimal values of the coordinates in Eqs. (6)-(9), for example

$$y_{max} = y_{avg} + \frac{r}{2} \tag{8}$$

The average values $[x_{avg}, y_{avg}]$ and the range $r$ above are defined in Eqs. (3)-(5). Using the new maximal and minimal values defined in Eqs. (6)-(9), we can normalize the coordinates into [-1, 1] by Eqs. (10) and (11), for example

$$y'_i = \frac{2\,(y_i - y_{min}) - (y_{max} - y_{min})}{r} \tag{11}$$

with the x coordinate treated in the same way. It is noted that the x coordinate and y coordinate are normalized with the same value $r$, which is the length of the long axis of the written image. Thus, the aspect ratio of the width and height of the written image can be preserved. If normalization is performed independently on the x-axis and y-axis, the aspect ratio will be lost. Our experience indicates that the above aspect-ratio preservation will significantly improve the performance. Examples of the normalized x-coordinate and y-coordinate sequences are illustrated in FIGURE 4.
The time required to complete a writing action of a digit (or character or symbol) generally differs between users. Therefore, the data lengths for different writing actions are not the same. Our experience indicates that 3 seconds is sufficient to complete the writing of a digit for general users. The frame rate of the camera in our system is 30 frames per second (fps). Thus, the data length of a written digit is not greater than 90 points. Here, we set the data length to 100 to allow some tolerance. The results obtained from Eqs. (10) and (11) are then upsampled by linear interpolation to obtain 100 points of data for each dimension. To further study the effect of the data arrangement on system performance, we arrange the data in two ways as follows.
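A minimal sketch of the aspect-ratio-preserving normalization is given below. Since the definitions of the averages and the range are not reproduced here, the sketch assumes that each average is the midpoint between the minimal and maximal coordinate and that r is the larger of the two coordinate extents (the long axis); the function name `normalize_trajectory` is also hypothetical.

```python
def normalize_trajectory(xs, ys):
    """Map both coordinate sequences into [-1, 1] using one shared range r."""
    # Assumption: the "average" is the midpoint of the bounding box.
    x_avg = (max(xs) + min(xs)) / 2
    y_avg = (max(ys) + min(ys)) / 2
    # Assumption: r is the length of the long axis of the written character.
    r = max(max(xs) - min(xs), max(ys) - min(ys))
    if r == 0:
        raise ValueError("trajectory has zero extent")
    # New symmetric bounds of width r around the averages.
    x_min, x_max = x_avg - r / 2, x_avg + r / 2
    y_min, y_max = y_avg - r / 2, y_avg + r / 2
    norm_x = [(2 * (x - x_min) - (x_max - x_min)) / r for x in xs]
    norm_y = [(2 * (y - y_min) - (y_max - y_min)) / r for y in ys]
    return norm_x, norm_y


# Hypothetical stroke that is wider (100 px) than it is tall (40 px).
norm_x, norm_y = normalize_trajectory([100, 150, 200], [300, 320, 340])
print(norm_x, norm_y)
```

In this example the x sequence spans the full interval while the y sequence spans only 40% of it, so the width-to-height ratio of the stroke is retained.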
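Upsampling a variable-length sequence to a fixed length of 100 points by linear interpolation can be sketched as follows. The function name `resample_linear` and the choice to keep the first and last samples as endpoints are assumptions made for illustration.

```python
def resample_linear(values, target_len=100):
    """Resample a sequence to target_len points by linear interpolation."""
    n = len(values)
    if n == 0:
        raise ValueError("empty sequence")
    if n == 1:
        return [float(values[0])] * target_len
    out = []
    for k in range(target_len):
        # Fractional position of output sample k on the input index axis.
        pos = k * (n - 1) / (target_len - 1)
        i = min(int(pos), n - 2)
        frac = pos - i
        out.append(values[i] * (1 - frac) + values[i + 1] * frac)
    return out


# Hypothetical normalized sequence with only three samples.
resampled = resample_linear([0.0, 1.0, 0.0])
print(len(resampled), resampled[0], resampled[-1])
```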

a: 1D_PAD
The x-coordinate sequence is padded with 14 zeros at both ends to form a [1,128] sequence. The y-coordinate sequence is processed using the same method. The resulting x and y sequences are then concatenated into a 1-D array [1,256]. The zero-padding is used to isolate the x and y coordinates to avoid their interference with each other in the convolution operation.

b: 2D_NO-PAD
The x coordinate and y coordinate sequences without padding are arranged in a 2D array [2,100], where the x sequence is placed in the first row, and the y sequence is placed in the second row.
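The two data arrangements can be sketched with plain Python lists. The function names `arrange_1d_pad` and `arrange_2d_no_pad` are hypothetical; the padding width of 14 and the resulting sizes follow the description above.

```python
def arrange_1d_pad(xs, ys, pad=14):
    """1D_PAD: zero-pad each 100-point sequence to 128 and concatenate to 256."""
    padded_x = [0.0] * pad + list(xs) + [0.0] * pad
    padded_y = [0.0] * pad + list(ys) + [0.0] * pad
    return padded_x + padded_y


def arrange_2d_no_pad(xs, ys):
    """2D_NO-PAD: x sequence in the first row, y sequence in the second row."""
    return [list(xs), list(ys)]


# Hypothetical normalized, resampled sequences of 100 points each.
xs = [0.5] * 100
ys = [-0.5] * 100

array_1d = arrange_1d_pad(xs, ys)
array_2d = arrange_2d_no_pad(xs, ys)
print(len(array_1d), len(array_2d), len(array_2d[0]))
```

In the 1D arrangement, 28 zeros separate the last x sample from the first y sample, so a short convolution kernel never sees x and y values in the same window.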

C. CONVOLUTIONAL NEURAL NETWORK DESIGN
A basic CNN is composed of several convolutional layers for feature extraction, each of which is usually followed by a pooling layer. The last convolutional layer is also followed by one or more fully connected (dense) layers for classification.
For the 1D and 2D trajectory data stated above, we design a 1D-CNN and 2D-CNN, respectively, to recognize the input digits (or directional symbols). The typical architectures of our proposed 1D-CNN and 2D-CNN for recognizing digits are shown in FIGURE 5(a) and FIGURE 5(b), respectively, and consist of several 1D or 2D convolutional blocks. The architectures for directional symbols are similar; hence, they are omitted here. Each convolutional block contains convolution, maximal pooling, batch normalization, and activation function. The CNN (1D or 2D) applies batch normalization after convolution and before activation because it helps to improve the performance and stability of neural networks [37]. The ReLU function is adopted as the activation function in the hidden layers to avoid the vanishing gradient problem [38]. The dense block consists of more than one dense layer. The softmax activation function is employed in the output dense layer that maps the real-value input into the prediction probability in the range of [0,1]. Dropout is also employed between the two hidden layers since it is beneficial for avoiding overfitting [38].
We apply the minibatch gradient descent (MBGD) algorithm [39] to learn the CNN model. MBGD computes the gradient of the loss function $l$ with respect to the parameter set $\phi$ for every minibatch of $n$ training examples and then performs an update iteratively to obtain the optimal parameter set (corresponding to the minimal loss function) by

$$\phi = \phi - \rho\, \nabla_{\phi}\, l\left(\phi;\, x^{(i:i+n)};\, y^{(i:i+n)}\right) \tag{12}$$

where $x$ and $y$ are the target output and the predicted output of the network, respectively, $\nabla_{\phi}$ is the gradient operator, and $\rho$ is the learning rate. MBGD utilizes the backpropagation (BP) scheme to compute the gradient of the loss function. In this work, we choose cross-entropy in Eq. (13) as the loss function.
In MBGD training, choosing a suitable fixed learning rate is difficult. A learning rate that is too small will lead to slow convergence, while a learning rate that is too large will hinder convergence and cause the loss function to oscillate around the minimum or even cause divergence. To solve this problem, several gradient descent optimization algorithms with different learning rate schedules have been reported such as Adagrad, Adadelta, Adam and RMSprop [39]. The Adam (adaptive moment estimation) algorithm is employed in this work since it has been experimentally proven to be effective [39]. Adam estimates the individual adaptive learning rates of different parameters according to the first and second moments of the gradients of the loss function. The update algorithm of the parameter set of the network is given by [39]

$$\phi_{t+1} = \phi_t - \frac{\eta}{\sqrt{\hat{v}_t} + \varepsilon}\, \hat{m}_t$$

where $\eta$ is a fixed learning step size, $\varepsilon$ is a very small constant, and $\hat{m}_t$ and $\hat{v}_t$ are the first and second moments after bias correction, respectively, that are calculated by

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

where $g_t$ denotes the gradient of the loss function at time $t$, and $\beta_1$ and $\beta_2$ are the decay constants for the first and second moments, respectively.
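The minibatch update rule can be illustrated on a toy problem. The sketch below is not the CNN training procedure: it fits a two-parameter linear model with a squared-error loss instead of cross-entropy, and the names `minibatch_gd` and `mse_grad`, the learning rate and the batch size are all assumptions chosen to keep the example short.

```python
def minibatch_gd(phi, grad_fn, xs, ys, rho=0.05, batch_size=2, epochs=1000):
    """Repeat phi <- phi - rho * gradient over consecutive minibatches."""
    n = len(xs)
    for _ in range(epochs):
        for i in range(0, n, batch_size):
            grads = grad_fn(phi, xs[i:i + batch_size], ys[i:i + batch_size])
            phi = [p - rho * g for p, g in zip(phi, grads)]
    return phi


def mse_grad(phi, x_batch, y_batch):
    """Gradient of the mean squared error of the toy model y = w * x + b."""
    w, b = phi
    m = len(x_batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(x_batch, y_batch)) / m
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(x_batch, y_batch)) / m
    return [grad_w, grad_b]


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = minibatch_gd([0.0, 0.0], mse_grad, xs, ys)
print(round(w, 3), round(b, 3))
```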
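For illustration, the Adam update can be written out in a few lines of plain Python. This is a sketch on a toy quadratic objective rather than a network loss; the function name `adam_minimize`, the step size and the number of steps are assumptions, while the decay constants 0.9 and 0.999 are the commonly used defaults.

```python
import math


def adam_minimize(grad_fn, phi, eta=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=5000):
    """Adam: per-parameter step sizes from bias-corrected gradient moments."""
    m = [0.0] * len(phi)   # first-moment estimates
    v = [0.0] * len(phi)   # second-moment estimates
    for t in range(1, steps + 1):
        g = grad_fn(phi)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]   # bias correction
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        phi = [p - eta * mh / (math.sqrt(vh) + eps)
               for p, mh, vh in zip(phi, m_hat, v_hat)]
    return phi


def toy_grad(phi):
    """Gradient of the toy loss (phi0 - 3)^2 + (phi1 + 1)^2."""
    return [2 * (phi[0] - 3.0), 2 * (phi[1] + 1.0)]


phi_opt = adam_minimize(toy_grad, [0.0, 0.0])
print([round(p, 2) for p in phi_opt])
```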

IV. EXPERIMENTAL RESULTS
A. DATASET CREATION
Our work aims to develop an air-writing system for smart-TV control. Since no public dataset for this purpose is available, we create two types of datasets. One is a digit dataset, including 0 to 9 with different writing directions: clockwise and anticlockwise; hence, it contains a total of 20 symbols. The other is a pure directional symbol dataset that includes 16 symbols. FIGURE 6(a) and FIGURE 6(b) show the images of the symbols in the two datasets. Each of the two datasets contains a training set and test set. The training and test sets were obtained by 6 and 8 volunteers, respectively, with ages ranging from 20 to 30 years old. To improve the robustness of our system, the volunteers for the collection of training data and test data are completely different. For the digit dataset, the training set size and test set size are 12,000 and 1,600, respectively. For the directional symbol dataset, they are 9,600 and 1,280, respectively. K-fold cross-validation is the most popular method in various applications of machine learning [38], [40]. We apply K-fold cross-validation for tuning the hyperparameters using the training sets. To find the best K value, we divide the training set into training and validation subsets with different size ratios and then carry out training and validation for each size ratio. The result indicates that K = 5 (size ratio = 4:1) achieves the highest recognition rate; therefore, we used 5-fold cross-validation in this work.
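The 5-fold split of a training set can be sketched as index bookkeeping. This is an illustrative example only: the function name `k_fold_indices` is hypothetical, and the sketch uses contiguous folds without shuffling, whereas a practical split would usually shuffle or stratify the samples first.

```python
def k_fold_indices(n_samples, k=5):
    """Return k (train_indices, validation_indices) pairs with contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds


# Digit training set size from the text: 12,000 samples, K = 5.
folds = k_fold_indices(12000, k=5)
print([(len(train), len(val)) for train, val in folds])
```

With K = 5, every fold trains on 9,600 samples and validates on 2,400, which is the 4:1 size ratio mentioned above.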

B. OPTIMIZATION OF CNN CONFIGURATIONS AND PERFORMANCE EVALUATION
This subsection discusses the design of optimal CNN configurations and the evaluation of performance in terms of two metrics: recognition rate and network complexity presented in [8]. Here, the total number of parameters of the CNN is used to evaluate the network complexity. A highly complex network often requires a large amount of training data to avoid overfitting and involves a high computational cost. We apply the popular deep learning platform Keras to calculate the two metrics [41].
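To make the complexity metric concrete, the parameter count of a small 1D-CNN can be computed by hand. The layer sizes below are hypothetical and are not the configurations reported in the tables; only the input length of 256 and the 20 output classes come from the text. The batch-normalization count of four values per channel follows the Keras convention of including the two moving statistics in the total.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_size * in_channels + 1) * filters


def batchnorm_params(channels):
    """Scale, shift, moving mean and moving variance per channel."""
    return 4 * channels


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return (in_units + 1) * out_units


# Hypothetical network: input of length 256 with 1 channel,
# two conv blocks ('same' padding, pool size 4), then a 20-way output layer.
conv1 = conv1d_params(3, 1, 8)       # sequence length 256 -> 64 after pooling
bn1 = batchnorm_params(8)
conv2 = conv1d_params(3, 8, 16)      # sequence length 64 -> 16 after pooling
bn2 = batchnorm_params(16)
flatten_units = 16 * 16              # 16 time steps x 16 channels
output = dense_params(flatten_units, 20)

total = conv1 + bn1 + conv2 + bn2 + output
print(total)
```

In this example most of the parameters sit in the dense layer, which suggests why shrinking the flattened feature map with an additional convolution-and-pooling block can lower the total count.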

1) OPTIMIZATION OF CNN CONFIGURATIONS
The design of CNN configurations involves hyperparameter optimization, that is, setting various hyperparameters, including the number of hidden layers, nodes of every layer, batch size, and learning rate. The goal of the hyperparameter optimization in our work is to find the set of hyperparameters that obtains the highest recognition rate for a given network complexity. Some general strategies such as grid search and random search have been presented to find the best network that achieves the highest recognition rate [29]. However, these strategies do not consider the metric of the network complexity. Therefore, in this work, we apply a process of trial and error based on our experience to obtain an improved network configuration that balances recognition accuracy and network complexity.
For 1D zero-padding data, we design two 1D-CNNs: the first has two convolutional layers and is denoted as 1D-2, and the second has three convolutional layers and is denoted as 1D-3. The leftmost columns in TABLE 1 and TABLE 2 show the best configurations of 1D-2 and 1D-3, respectively. The other two columns list the network parameters for the two datasets. It is obvious that the 3-layer CNN achieves a higher recognition rate with much lower network complexity for the digit set than the 2-layer CNN.
Similarly, we also train two 2D-CNN models for [2,100] data. The results are shown in TABLE 3 and TABLE 4, respectively, and indicate that the 3-layer CNN is much improved with respect to network complexity (4548 vs. 8252 on average of two datasets) at the cost of a small decrease (approximately 0.2%) in recognition rate.

2) PERFORMANCE EVALUATION
Currently, there is no standard air-writing dataset available for smart TV control. Therefore, we implement the popular networks using the written images of our datasets. The results in terms of recognition rate and total number of parameters are compared with those of our methods. Here, three popular networks widely used in the literature [30]-[35] are implemented for comparison: (a) pure 2D-CNN, (b) 2D-CNN plus LSTM (2DCNN-LSTM), and (c) 2D-CNN plus SVM (2DCNN-SVM). The first is an end-to-end pure CNN approach that uses 2D-CNN to complete both the feature extraction and classification tasks. The last two are hybrid approaches that utilize 2D-CNN for feature extraction and then apply LSTM or SVM for classification. The results of the three methods are shown in TABLE 5-7. It is noted that 2DCNN-SVM is implemented in a simple manner, as reported in [42]. Specifically, at the output layer of the CNN, instead of the conventional softmax function with the cross entropy function, the Euclidean norm with the squared hinge loss is used [42].
The best models in every case of the proposed CNNs that are trained with the 1D form and 2D form of the writing trajectory are selected and compared with the three popular models mentioned above in TABLE 8. By using the 1D form and 2D form of the writing trajectory data, we obtain the first two models in this table. It is noted that the values of the recognition rate or total number of parameters in this table are calculated from the average of two datasets, i.e., digit and direction. We conclude that 1D data formed by concatenating the x-coordinate and y-coordinate sequences achieve the best performance in terms of the recognition rate and network complexity. Moreover, the proposed approach using trajectory data in 1D or 2D form as the input of CNN is superior to the popular methods that use written images as the input.

V. CONCLUSION
In this paper, we have proposed deep CNNs for the recognition of air-writing digits and special direction symbols for smart-TV-like control. A robust air-writing trajectory acquisition algorithm based on a web camera is developed that performs hand tracking only, avoiding the use of complicated procedures for finger tracking. By preprocessing the writing trajectory, we obtain one-dimensional and two-dimensional data that are utilized to design 1D-CNN and 2D-CNN, respectively. Through careful design and optimization of hyperparameters, the proposed CNNs achieve excellent performance with a recognition rate greater than 99%.
Among our proposed networks, 1D-CNN is slightly better than 2D-CNN. The two CNN models based on trajectory data significantly outperform the existing popular methods using written images. In addition, the network complexity of our proposed neural networks is much lower than those of the popular methods, and our systems can operate in real time.