Action Recognition From Thermal Videos

Human action recognition using a camera-based surveillance system remains a challenging task. In particular, action recognition is difficult when a human is not visible in an image captured in a dark environment. The existing studies have utilized near-infrared (NIR) and thermal cameras to solve this problem. Compared to NIR cameras, thermal cameras enable long- and short-distance objects to be visible without an additional illuminator. However, thermal cameras have two major disadvantages: a halo effect and a temperature similarity. A halo effect occurs around an object with a high temperature. In a human object, such a halo effect is similar to a shadow under the body area. It is more difficult to segment a human area from an image with a halo effect. Moreover, if the background and human object have similar temperatures, it becomes more difficult to segment the human area. These disadvantages influence not only the accuracy of the segmentation of the human area but also the performance of human action recognition. Unfortunately, no studies have considered these issues. To address these problems, this study proposes the cycle-consistent generative adversarial network (CycleGAN)-based methods for removing halo effects from thermal images and restoring the areas of the human bodies. In addition, this study also considered a method for creating a skeleton image from a thermal image to analyze body movements. To extract more spatial and temporal features from skeleton image sequences thus created, a method for human action recognition that combines a convolutional neural network (CNN) and long short-term memory (LSTM) was proposed. In an experiment using an open database (Dongguk activities & actions database (DA&A-DB2)), the proposed method demonstrated a better performance than the existing methods.


I. INTRODUCTION
Human action recognition using a camera-based surveillance system remains a challenging task. In particular, action recognition is difficult when a human is not visible in an image captured in a dark environment. Existing studies have utilized near-infrared (NIR) and long wavelength infrared (LWIR) cameras to solve this problem. LWIR cameras (thermal cameras) enable long- and short-distance objects to be visible without an additional illuminator, whereas NIR cameras need an additional illuminator to make only short-distance objects visible in a dark environment. A thermal camera makes an object visible in either a dark or bright environment by measuring temperatures ranging from -40 °C to +80 °C. This study utilized a thermal camera to acquire data on long-distance objects in either a dark or bright environment.
A halo effect and temperature similarity are two challenges in images obtained using a thermal camera. A halo effect occurs around an object with a high temperature. The higher the temperature of the object, the larger the halo effect. Such a halo effect is similar to a shadow under the area of a human body in a thermal image. However, in a thermal image, the pixel value of the area of the object is quite similar to that of the area with a halo effect. By contrast, in a visible image, the pixel value of the shadow is lower than that of the area of the object. For this reason, segmenting a human area from an image with a halo effect is more difficult than segmenting a human area from an image with a shadow. The size and pixel value of a halo effect depend on the material of the ground upon which the object stands. In addition, because the area of a halo effect is usually connected with that of an object in a thermal image, it is difficult to segment the area of the human. When the background and human have similar temperatures, segmenting the area of the human becomes even more difficult. To date, no studies have dealt with these issues.
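The difference between a shadow and a halo can be illustrated with a toy example. The following sketch is not taken from the paper: the pixel values, the threshold of 128, and the function name `foreground_rows` are assumptions chosen only for illustration. It thresholds one vertical column of pixels that passes through a person and the ground beneath.

```python
def foreground_rows(column, threshold=128):
    """Return the indices of pixels brighter than the threshold (assumed foreground)."""
    return [i for i, value in enumerate(column) if value > threshold]

# Hypothetical pixel columns, top to bottom: background, body, ground, background.
# Visible image: the shadow under the body (rows 4-5) is darker than the body.
visible_column = [40, 200, 210, 205, 30, 25, 40]
# Thermal image: the halo under the body (rows 4-5) is almost as bright as the body.
thermal_column = [40, 200, 210, 205, 190, 185, 40]

visible_fg = foreground_rows(visible_column)  # body only
thermal_fg = foreground_rows(thermal_column)  # body merged with halo
```

Under these assumed values, a single threshold isolates the body in the visible column, whereas in the thermal column the halo rows stay attached to the body rows, which is the segmentation difficulty described above.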
This study proposes a new method to address these issues. The proposed method utilizes convolutional neural network (CNN)- and cycle-consistent generative adversarial network (CycleGAN)-based methods for removing halo effects from thermal images and restoring the areas of the human bodies. This study also examines a new skeleton generation method to analyze the body movement of a long-distance object for action recognition in a thermal image. Depending on the temperature of the environment, the pixel value of the entire body area of a long-distance object in a thermal image is frequently either 0 (black) or 255 (white). In either case, body joint features such as the ankles, knees, hips, wrists, elbows, and shoulders are barely visible and difficult to detect. To solve this problem, a deep learning-based method for converting original thermal images into skeleton images is examined. No existing studies have utilized a deep learning-based method for recognizing the action of a long-distance object in a thermal image to address the above issues. This study attempts to extract more spatial and temporal features, and proposes a method for human action recognition that combines a CNN with long short-term memory (LSTM).
The remaining sections of this paper are organized as follows. Section II reviews the existing studies dealing with action recognition, skeleton generation, and deep learning. Section III examines the contributions of the proposed method. Section IV describes the proposed method in detail. Section V presents the experiment results and those of a comparative experiment. Finally, Section VI provides some concluding remarks.

II. RELATED WORKS
Current methods of human action recognition can be classified mainly into two groups: deep learning-based methods and handcrafted feature-based methods without deep learning.
Among the existing studies in the latter group, the methods in [1] and [2] extract Fourier descriptors that are invariant to scale and rotation from silhouette images and utilize those features for human action recognition through a support vector machine (SVM) and neural network (NN). However, although they are useful in expressing the shapes of objects, Fourier descriptors have difficulty expressing different actions with the same shape. For example, a person standing has the same shape as a person lying down, but the two are performing different actions. In [3], human action recognition is applied using a local descriptor-based scale invariant feature transform (SIFT) and Zernike moment features. However, it takes a long time to extract such features during the test phase. In [4]-[6], local spatiotemporal features are created by applying a corner detection method and attempting an action recognition using an SVM. Unfortunately, when the background is not clear and includes different objects, the number of detected corners increases, thereby decreasing the accuracy of the human action recognition. In [7] and [8], motionlets and motion saliency methods are proposed, which demonstrate high accuracy only when an object has been clearly segmented from its background. In [4], [5], and [9], methods for extracting motion features using space-time interest points and a histogram of oriented gradients (HoG) are proposed. However, these methods require a long time to extract the features and are sensitive to noise and illumination. In [9]-[14], various handcrafted features such as a gait flow image (GFI), gait history image (GHI), motion history image (MHI), and accumulated motion image (AMI) are extracted to achieve human action recognition. When these features are extracted, body areas are detected in continuous images to obtain the gravity center points of each area. Based on the gravity center points, all areas are combined into a single image. However, the accuracy of the gravity center points depends on the detection accuracy of the body areas and the background noise, and errors in these points decrease the extraction accuracy for the handcrafted features. In [15], a method for achieving faint action recognition is considered, which utilizes information of the width and height of the human areas detected in thermal images. In [16], human action recognition is conducted based on a convexity defect feature point. However, the accuracy of the action recognition was not satisfactory because many inaccurate feature points were detected owing to the background noise. In addition, it takes a long time to calculate the contours, polygons, convex hulls, and convexity defects in each frame. In [17], the gait energy image (GEI) based ethnicity determination method is examined. In [18], action recognition is conducted by extracting point-cloud features from silhouette sequences. All studies mentioned thus far were conducted based on handcrafted features, and deep learning-based human action recognition has been attempted in the following ways.
In [19], two CNNs are used, namely, one utilizing optical flow features and the other utilizing the original images as inputs for action recognition. In [20] and [21]; [22]-[24]; and [19], [25], and [26], depth images, joint map data, and visible light images are used, respectively, for action recognition. Such input data contain many spatial features but lack temporal features. To solve this problem, the study in [27] utilized skeleton information as the input of a recurrent neural network (RNN) for action recognition. However, an RNN suffers from the vanishing and exploding gradient problem. For example, as the length of the sequential input features increases, important information may disappear, or trivial information may be accumulated. In [28]-[30], attempts were made to address this problem by proposing LSTM network-based action recognition methods using skeleton information. The LSTM-based methods use input, output, and forget gate functions to solve the vanishing and exploding gradient problem. For action recognition, the study in [31] used joint distance maps as the input of a CNN and skeleton joint information as the input of an LSTM to extract spatial and temporal information, and combined their output scores. In [32]-[34], a simultaneous learning method connecting a CNN and an LSTM is proposed.
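The vanishing and exploding behavior mentioned above can be made concrete with a deliberately simplified sketch. It assumes that the gradient is multiplied by the same scalar factor at every time step, which stands in for the norm of the recurrent Jacobian of an RNN; the function name `gradient_scale` and the factors 0.9 and 1.1 are illustrative assumptions, not values from the cited studies.

```python
def gradient_scale(per_step_factor, num_steps):
    """Scale of a gradient after being multiplied by the same factor at every time step."""
    scale = 1.0
    for _ in range(num_steps):
        scale *= per_step_factor
    return scale

# A factor slightly below 1 shrinks the gradient toward zero over 100 steps,
# whereas a factor slightly above 1 makes it grow without bound.
vanishing = gradient_scale(0.9, 100)
exploding = gradient_scale(1.1, 100)
```

The longer the input sequence, the more extreme both effects become, which is why gated architectures such as the LSTM are preferred for long sequences.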
However, there have been no recognition methods using a CNN and an LSTM for the various actions of long-distance objects. This study proposes a method for recognizing various actions including waving with one hand, waving with two hands, punching, kicking, sitting, standing, walking, running, lying down, leaving, and approaching. The proposed CNN-LSTM structure is interconnected, and sequential learning is conducted using input images. There are only a few existing studies on human action recognition using a thermal camera. Some of the existing studies [9], [13], [15] did not utilize a deep learning algorithm such as CNN-LSTM. The present study proposes a thermal camera-based method for recognizing various actions of a long-distance object under both dark and bright environments using a deep learning algorithm.
In addition, this study also proposes a human action recognition method that generates and utilizes skeleton images from thermal images. The existing action recognition methods of the studies in [22]-[24], [27], [29], [31], and [35], which are based on skeleton information, utilized skeletons generated beforehand. There are already existing methods for extracting skeleton information from depth images [36]-[39], visible light images [40], [41], and thermal images [42]. However, no methods that can generate skeleton images directly from thermal images have been proposed. There are also no methods for generating a skeleton image of a long-distance object from thermal images obtained under various environments. To improve the performance of the proposed human action recognition method, this study examined CycleGAN-based methods of image restoration and the removal of a halo effect using thermal images obtained under various environments.
Table 1 shows a brief summary of related studies, which are compared with the proposed method.

III. CONTRIBUTIONS
Our method is a novel approach compared to previous studies in the following ways:
- To date, no CNN-LSTM based methods have been proposed to recognize various actions of a long-distance object in thermal images, such as waving with one hand, waving with two hands, punching, kicking, sitting, standing, walking, running, lying down, leaving, and approaching. Accordingly, this study proposes a CNN-LSTM-based method for recognizing various actions of a long-distance object in thermal images.
- There are no methods for changing low-quality thermal images obtained under various environments into high-definition (HD) thermal images. Accordingly, this study proposes a method for generating HD thermal images from low-quality thermal images of long-distance objects through CycleGAN. To develop the proposed method, hyper-parameters, the number of filters, the numbers of convolution layers, and the sizes of the filters in the existing CycleGAN were modified according to the experiments.
- There are no methods for analyzing and removing halo effects of objects in various long-distance environments. Accordingly, this study proposes a halo effect removal method using the modified CycleGAN considering its high ability of image transformation.
- Some methods for matching a short-distance object with a skeleton or extracting skeleton information in a thermal image have been developed. However, no method has been proposed for generating a skeleton image from a thermal image. In addition, only a few studies on applying a thermal image-based skeleton have been conducted. Accordingly, this study proposes a method for generating a skeleton image directly from an original thermal image through a deep learning algorithm.
- The developed CNN model, the data generated, and the Dongguk activities and actions database (DA&A-DB2) were released [43] for a fair performance evaluation by other researchers.

IV. PROPOSED METHOD
A. Camera settings and human detection
This section provides a simple description of the camera setup and human detection method. Figure 1
- Inevitably higher loss of temporal information
- Performance is affected by shadows, variations in illumination, and human clothing of various colors

RNN-based
Method: Using skeleton joint information [27]
Advantages:
- Good at extracting temporal information
- Appropriate features are extracted in various environments and camera settings
Disadvantages:
- Inevitably higher loss of spatial information; encounters the vanishing and exploding gradient problem
- Performance is affected by shadows, variations in illumination, and human clothing of various colors

LSTM
Method: Using skeleton joint information [28]-[31]
Advantages:
- Good at extracting temporal information, while overcoming the vanishing and exploding gradient problem
- Appropriate features are extracted in various environments and camera settings
Disadvantages:
- Inevitably higher loss of spatial information
- Performance is affected by shadows, variations in illumination, and human clothing of various colors

CNN-LSTM
Method: Using visible light images [32]-[34]; using skeleton joint information [35]
Advantages:
- Good at extracting spatial and temporal information while overcoming the vanishing and exploding gradient problem
- Appropriate features are extracted in various environments and camera settings

environment, variations in illumination, and severe shadows), this study utilized a thermal camera. The details of the object detection method are shown in [45]. As indicated in the thermal image on the right side of Figure 1, a region of interest (ROI, red dashed box) larger than the detected area (green) was applied.
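Applying an ROI that is larger than the detected area can be sketched as follows. The paper does not state how much larger the ROI is, so the margin ratio of 0.2, the 640 x 480 image size used in the example, and the function name `expand_roi` are all assumptions made for illustration.

```python
def expand_roi(box, image_width, image_height, margin_ratio=0.2):
    """Enlarge a detection box (x, y, w, h) by a margin and clip it to the image.

    margin_ratio is a hypothetical parameter: the fraction of the box width
    and height that is added on each side.
    """
    x, y, w, h = box
    dx = int(round(w * margin_ratio))
    dy = int(round(h * margin_ratio))
    left = max(0, x - dx)
    top = max(0, y - dy)
    right = min(image_width, x + w + dx)
    bottom = min(image_height, y + h + dy)
    return left, top, right - left, bottom - top

# Example with an assumed 640 x 480 thermal frame.
roi = expand_roi((100, 50, 40, 80), 640, 480)
```

Clipping to the image bounds keeps the crop valid when a person is detected near the border of the frame.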

B. Overall procedure of proposed method
Figure 2 shows an overall flowchart of the proposed method. As indicated in Figure 1, the original cropped image is used as the input. During the halo effect removal phase, the halo effect is removed from the input image using a CycleGAN network. In the skeleton generation phase, a skeleton is generated from a thermal image using a CNN. In the action recognition phase, human action recognition is conducted using a CNN-LSTM and sequential skeleton images. The details of each phase are described step by step in the following sections.

C. Image restoration
1. Overall procedure of image restoration
This section describes the method for restoring the thermal camera images. When thermal images are obtained in various environments, if the background and object have similar temperatures, the images acquired are similar to those shown in Figure 3
Accordingly, this study utilized the CycleGAN network [46] and conducted image restoration, as illustrated in Figure 4, to convert the thermal images of Figure 3(a) into those of Figure 3(b). As shown in Figure 4, the cropped thermal image mentioned in Section IV.A was used as the input. In

2. Description of CycleGAN and discriminator CNN structures
This section describes the CycleGAN network structures used for the image restoration methods in detail.
CycleGAN achieves high-quality image transformation with a model trained using unpaired training data. Because the proposed image restoration method used unpaired training data, it applied CycleGAN. In addition, before being used, the original CycleGAN model was fitted to our database by modifying the hyper-parameters, number of filters, number of convolution layers, and size of the filters, as indicated in Tables 2-4. Tables 2-4 describe the detailed structures of the generator, residual block, and discriminator used by the CycleGAN, respectively.
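What allows CycleGAN to learn from unpaired data is the cycle-consistency term of the original CycleGAN formulation [46]: an image translated to the other domain and back should return to itself. The sketch below illustrates only that term, under clear assumptions: the two "generators" are plain Python functions on short lists of pixel values rather than the networks of Tables 2-4, and the names `l1_distance` and `cycle_consistency_loss` are hypothetical. The adversarial terms and the loss weights used in this study are not shown.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y, g, f):
    """L1 cycle loss: F(G(x)) should reproduce x, and G(F(y)) should reproduce y."""
    forward = l1_distance(f(g(x)), x)
    backward = l1_distance(g(f(y)), y)
    return forward + backward

# Toy stand-ins for the two generators (assumptions, not trained networks).
g = lambda img: [p + 10 for p in img]        # domain X -> domain Y
f_exact = lambda img: [p - 10 for p in img]  # exact inverse of g
f_rough = lambda img: [p - 4 for p in img]   # imperfect inverse of g

x = [0, 50, 100]   # sample from domain X
y = [10, 60, 110]  # unrelated sample from domain Y (unpaired)

loss_exact = cycle_consistency_loss(x, y, g, f_exact)
loss_rough = cycle_consistency_loss(x, y, g, f_rough)
```

Note that the loss never compares `x` with `y` directly, which is why no pixel-aligned image pairs are needed for training.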

D. Removal of halo effect
This section describes the method of the halo effect removal.
VOLUME XX, 2019
As shown in Figure 6(a), a halo effect appears like a shadow under the area of the human body. This study examined methods for removing halo effects, as indicated in Figure 6(b), from the images shown in Figure 6(a). When an object is detected in the images of Figure 6(a), the body area is connected to the area of the halo effect, as shown in the images of Figure 6(c). Thus, the accuracy of the detection method for the object is degraded. However, if the object is detected in the images of

E. Skeleton generation
This section describes the method of skeleton generation. It is difficult to detect an object or extract skeleton information from an image obtained using a visible light camera in a dark environment, where a person is barely visible. Some methods for detecting a human body in a dark environment using a thermal camera have been developed. However, no method has been proposed to extract the skeleton information of a detected object. Accordingly, this study proposes a method for extracting a skeleton image from a thermal image obtained in a dark environment. The existing methods for extracting skeleton information from images obtained in a bright environment were implemented, as shown in Figure 8
We generated skeleton images using the open-source CNN proposed in [51]. The network in [51] was originally proposed for style transfer and super-resolution reconstruction based on perceptual loss, and we adopted this network for skeleton generation. Detailed explanations of this CNN can be found in [51]. The structure of this CNN was the same as that of the generator of CycleGAN. The structure was fitted to our database by modifying the hyper-parameters, number of filters, number of convolution layers, and size of the filters, as shown in Table 2.
As illustrated in Figure 9, the original thermal image was used as the input image of the CNN, and the skeleton image was used as the output image. The skeleton in each training image was drawn thicker than a conventional skeleton. In an additional experiment, the output image extracted by the CNN was postprocessed (size filtering and morphological operations) to generate a narrow skeleton in an image, as illustrated in Figure 9.
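The size-filtering part of this postprocessing can be sketched as follows. The paper does not specify the connectivity, the size threshold, or the morphological operators, so the 8-connectivity, the `min_size` parameter, and the function name `remove_small_components` are assumptions. The sketch works on a binary mask stored as a list of lists and does not cover the morphological operations.

```python
from collections import deque

def remove_small_components(mask, min_size):
    """Keep only 8-connected foreground components with at least min_size pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    cleaned = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                cr, cc = queue.popleft()
                component.append((cr, cc))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < height and 0 <= nc < width
                                and mask[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            # Components below the size threshold are treated as noise and dropped.
            if len(component) >= min_size:
                for pr, pc in component:
                    cleaned[pr][pc] = 1
    return cleaned

# Toy skeleton output: a 5-pixel stick figure plus one isolated noise pixel.
noisy = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
]
filtered = remove_small_components(noisy, min_size=3)
```

In practice the same step is usually done with a library routine for connected-component labeling; the pure-Python version is shown only so that the logic is visible.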

F. Action recognition
1. Overall procedure of action recognition
We propose an action recognition method that extracts the action information of a long-distance object from thermal images obtained in a dark or bright environment by using a CNN stacked with an LSTM (CNN-LSTM). As illustrated in Figure 10, the human action recognition was conducted by adopting a sequence of skeleton images as the input. As shown in Figure 11, this study attempted to recognize 11 actions: waving with one hand, waving with two hands, punching, kicking, sitting, standing, walking, running, lying down, leaving, and approaching.
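Feeding a sequence of skeleton images to the CNN-LSTM requires grouping consecutive frames into fixed-length clips. The sketch below uses a clip length of five frames, which is the value the paper reports as optimal in its description of the CNN-LSTM structure; whether consecutive clips overlap is not stated, so the `stride` parameter and the function name `make_clips` are assumptions.

```python
def make_clips(frames, clip_length=5, stride=1):
    """Group a frame sequence into fixed-length clips using a sliding window.

    stride=1 gives maximally overlapping clips; stride=clip_length gives
    non-overlapping clips. Trailing frames that do not fill a clip are dropped.
    """
    last_start = len(frames) - clip_length
    return [frames[start:start + clip_length]
            for start in range(0, last_start + 1, stride)]

# Frame indices stand in for skeleton images in this example.
overlapping = make_clips(list(range(7)))
non_overlapping = make_clips(list(range(10)), stride=5)
```

Each clip would then be passed through the CNN frame by frame, with the resulting feature sequence given to the LSTM.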

2. Description of CNN-LSTM structure
To solve the problem of long-term dependencies, an LSTM was applied to various research areas such as action recognition [32], text recognition [52], gait recognition [53], caption generation [54], speech recognition [55], person re-identification [56], and gait diagnosis [57]. Regarding the long-term memory and temporal information, LSTM-based methods have turned out to be most effective in solving the vanishing and exploding gradient problem. Accordingly, this study utilized an LSTM to extract temporal information from sequential images. In addition, this study also connected a CNN to an LSTM to extract the spatial features. To enhance the accuracy of human action recognition, various CNN structures were redesigned and tested. Table 5 shows the most appropriate structure as demonstrated through the test results. The optimal number of frames (5 frames) for the CNN-LSTM of Table 5 was determined experimentally with the training data as the number that yielded the highest accuracy of human action recognition. In

A. Description of experiment setup and database
Open databases for action recognition acquired from visible light camera environments have already been developed [58]-[60]. However, no databases have been acquired from a thermal camera environment. This study conducted an experiment using the DA&A-DB2 database [43], which contains thermal images of long-distance objects obtained in various environments (time zones, weather, seasons, and camera settings) and locations. Although the database consists of both visible light images and thermal images, this study used only thermal images. The database consists of 16 sub-datasets including a total of 266,261 images. Figure 14 and Table 6 provide the detailed information of the database. A desktop computer was used for training and testing. The specifications of the desktop computer include an Nvidia graphics card (Nvidia GeForce GTX TITAN X [61]), an Intel CPU (Core i7-6700 CPU @ 3.40 GHz (with eight CPUs)), and 32 GB of RAM. The proposed method was implemented using the Python-based Keras application programming interface (API) with a TensorFlow backend engine [62] and the OpenCV library [63]. Figure 15 and Table 7 show the height of the camera, the distance between an object and the location of the camera (horizontal distance), and another distance between an object and a camera (D distance). Table 8 shows the types of human actions and the number of images for each action.

B. Training of CycleGAN and CNN-LSTM models
This

C. Testing
1. Testing of image restoration
This section describes the testing results of the image restoration method. Figure 16 shows the image restoration results. The corresponding binarized images are also shown to see how the results differ. As illustrated in Figure 16, the binarized version of the restored image expresses the human body area more accurately than the binarized version of the original image.
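The comparison of binarized images is made visually in the paper. As a hedged illustration of how such a comparison could be quantified, the sketch below binarizes two toy images with a fixed threshold and scores each mask against a reference body mask using intersection over union (IoU). The pixel values, the threshold of 128, the reference mask, and the use of IoU are all assumptions and are not part of the reported experiments.

```python
def binarize(image, threshold):
    """Map each pixel to 1 (foreground) if it reaches the threshold, else 0."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]

def intersection_over_union(mask, reference):
    """Overlap between a binary mask and a reference mask (1.0 means identical)."""
    intersection = union = 0
    for mask_row, ref_row in zip(mask, reference):
        for m, r in zip(mask_row, ref_row):
            intersection += 1 if (m and r) else 0
            union += 1 if (m or r) else 0
    return intersection / union if union else 1.0

# Hypothetical 4 x 4 crops: the body occupies the central 2 x 2 block.
original = [            # body and background have similar temperatures
    [120, 121, 119, 120],
    [120, 126, 131, 120],
    [121, 125, 133, 119],
    [120, 120, 121, 120],
]
restored = [            # after restoration the body stands out clearly
    [30, 31, 29, 30],
    [30, 220, 225, 30],
    [31, 218, 222, 29],
    [30, 30, 31, 30],
]
body = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]

iou_original = intersection_over_union(binarize(original, 128), body)
iou_restored = intersection_over_union(binarize(restored, 128), body)
```

With these assumed values, half of the body is lost when the original crop is binarized, whereas the restored crop recovers the whole body area.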

2. Testing of halo effect removal
This section presents the testing results of the halo effect removal method. Figure 17 shows the results of the halo effect removal method. The corresponding binarized images are also shown to see how the results differ. As illustrated in Figure 17, the binarized images with the halo effects removed express the human body area more accurately than those with the halo effects.

3. Testing of skeleton generation
This section describes the testing results of the skeleton generation method, the images of which are shown in Figure 18.

4. Testing of human action recognition
This section presents the testing results of human action recognition. The 11 methods of Table 9 were applied to the test. Figure 19 shows the input images for each method. Tables 10-25 provide the accuracies of the 11 methods for the action recognition. Table 26 lists the details of the processing time. The processing time was measured in the experiment environment described in Section V.A.
Table 10. Confusion matrix of the results of human action recognition using method 1 (unit: %).
As shown in Table 25, method 10 using image restoration and skeleton generation, and method 11 (the proposed method) using halo effect removal and skeleton generation, produced the highest accuracies. In other words, the image restoration, halo effect removal, and skeleton generation were effective at improving the accuracy of human action recognition.
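For readers who wish to reproduce tables of this kind on their own predictions, a confusion matrix in percent can be computed as sketched below. The sketch assumes that each row corresponds to an actual class and is normalized to sum to 100%, which is consistent with the "unit: %" in the caption of Table 10 but is not confirmed by the text; the function names and the toy labels are hypothetical.

```python
def confusion_matrix_percent(actual, predicted, labels):
    """Row-normalized confusion matrix: matrix[a][p] is the percentage of
    samples of actual class a that were predicted as class p."""
    counts = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    matrix = {}
    for a in labels:
        total = sum(counts[a].values())
        matrix[a] = {p: (100.0 * counts[a][p] / total if total else 0.0)
                     for p in labels}
    return matrix

def overall_accuracy_percent(actual, predicted):
    """Percentage of samples whose predicted class equals the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)

# Toy example with two of the 11 action classes.
labels = ["walking", "running"]
actual = ["walking"] * 4 + ["running"] * 4
predicted = ["walking", "walking", "walking", "running",
             "running", "running", "walking", "walking"]

matrix = confusion_matrix_percent(actual, predicted, labels)
accuracy = overall_accuracy_percent(actual, predicted)
```

The diagonal entries give the per-class recognition rates, and the off-diagonal entries show which actions are confused with each other.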

For example, as shown in Figures 3(a) (the 1st and 3rd images), 6(a) (the 1st and 3rd images), 11(i), 14(j), and 17(b) and (d) (the 1st images), there are many cases in which the temperature of the human body is similar to that of the background or severe halo effects occur, which produces incorrect segmentation of the body area and consequent errors in human action recognition. Therefore, we use CycleGAN to make the body area more distinct from the background and to remove halo effects.
As shown in Table 9, methods 1 and 2 show the cases before and after CycleGAN, respectively. In addition, methods 3 and 10 (or 11) show the cases before and after CycleGAN, respectively. As shown in Table 25, the accuracy of method 2 is higher than that of method 1. In addition, the accuracies of methods 10 and 11 are higher than that of method 3. From these results, we confirm that CycleGAN can improve the overall accuracy of human action recognition.
We use the skeleton image for the CNN-LSTM instead of the full frame because the motion information based on the skeleton image can be more distinctive than that of the full frame. As shown in Table 9, methods 2 and 10 show the cases of action recognition without and with skeleton generation, respectively. In addition, methods 4 and 11 show the cases of action recognition without and with skeleton generation, respectively. As shown in Table 25, the accuracies of action recognition by methods 10 and 11 are higher than those by methods 2 and 4, respectively, which confirms that the skeleton image is more effective for human action recognition than the full frame data.
According to Table 26, the processing speed of both methods 10 and 11 was approximately 10 frames per second (fps). The following section describes a comparative experiment with the existing methods.

D. Comparisons
This section compares the proposed method with the existing methods. Table 27 shows the accuracies of five existing methods of human action recognition. As is clear from Table 27, the proposed method shows higher recognition rates than the existing methods.
This study proposed human action recognition methods using thermal image restoration, halo effect removal, and skeleton generation. These approaches were combined in various ways to produce different results. Various techniques including CycleGAN, CNN, and CNN-LSTM were adopted for the proposed methods. In addition, an experiment was conducted using the DA&A-DB2 open database, which was built solely for the present study. Many databases have been acquired using thermal cameras [68]-[75]. However, most of them are for pedestrian or object detection, and there is no existing database including images of human action with halo effects. Therefore, we collected our DA&A-DB2 database for experiments. For fair performance evaluation, this database is released to other researchers as shown in [43]. The proposed methods were compared with five existing methods. In a comparative experiment, the proposed methods achieved the highest accuracy. Moreover, the proposed methods using image restoration, halo effect removal, and skeleton generation were effective and efficient for human action recognition.
Because the existing state-of-the-art methods used images in which the front or back side of the body area is captured by the camera, as shown in Figure 8(a), joint positions can be easily detected from the skeleton image. However, our database frequently includes cases in which the joint positions are difficult to detect, as shown in Figure 8(f) and the people on the right in Figures 11(j) and (k). Therefore, we use the skeleton image as the input to the CNN-LSTM instead of joint positions.
In further studies, we will focus on the removal or intensity reduction of halo effects on thermal images, which are caused by more diverse objects and machines in various environments. We will also develop a method for improving the processing time using a lighter model with fewer parameters and fewer CycleGAN and CNN-LSTM layers.

Figure 1. Example of camera setup and experiment environment.
(a). By contrast, if the background and object have different temperatures, the images acquired are similar to those shown in Figure 3(b). When the object was detected in the images of Figure 3(a), a portion of the human body area disappeared or was cut out, as shown in Figure 3(c), which decreased the detection accuracy for the object. However, when the object was detected in the images shown in Figure 3(b), the results were good, as indicated in Figure 3(d).

Figure 4, Conv, BN, ReLU, and Add denote the convolutional layer, batch normalization layer, rectified linear unit, and addition function, respectively. As illustrated in Figure 5(b), unpaired training data were used to train CycleGAN.

Figure 5. Example of paired and unpaired training data. Examples of (a) paired and (b) unpaired images.

Figure 6(b), the detection results are satisfactory, as shown in Figure 6(d). For the halo effect removal method, the existing CycleGAN structure was fitted to our database by modifying the hyper-parameters, number of filters, number of convolution layers, and size of the filters, as presented in Tables 2-4. The cropped thermal image mentioned in Section IV.A was used as the input of the CycleGAN network, as illustrated in Figure 7. Here, Conv, BN, ReLU, and Add denote the convolutional layer, batch normalization layer, rectified linear unit, and addition function, respectively, as shown in Figure 7. As illustrated in Figure 5(b), unpaired training data were used to train the CycleGAN.

Figure 6. Example of captured thermal images and results of background subtraction method: (a), (b) thermal images and (c), (d) results of background subtraction using images in (a) and (b), respectively.
(a)-(d). Because the image in Figure 8(a) has much more spatial information of the joints than the image in Figure 8(c), a skeleton was made by detecting the locations of the joints, as shown in Figure 8(b) [40], [41]. In the case of Figure 8(c), where little spatial information of the joints is given, a thinning method was applied to make a skeleton, as shown in Figure 8(d) [47]-[50]. However, Figures 8(e) and 8(f) may have spatial information, as shown in Figure 8(a), or no such information, as shown in Figure 8(c), depending on the environment where the thermal image is acquired. Moreover, if the method of Figure 8(d) is applied to the image of Figure 8(e), or the method of Figure 8(b) is applied to that of Figure 8(f), it is difficult to generate the desired skeleton. Thus, this study proposes a method for generating a skeleton from thermal images such as those shown in Figures 8(e) and 8(f).

Figure 8. Examples of previous skeleton generation methods and captured thermal images.(a) A visible light image, (b) an example of joint detection, (c) a binary image of (a), (d) an example of skeleton generation using the image in (c), and (e), (f) captured thermal images.

Figure 9. Skeleton generation using the proposed method.
Figure 12, x(t), y(t-1), and c(t-1) denote the current input, previous output, and previous cell values; y(t) and c(t) indicate the current output and cell values; i(.), f(.), and o(.) indicate the input, forget, and output gate functions (sigmoid function); and I(.) and O(.) are the input and output activation functions (tanh function), respectively. Finally, blue arrows, the red box, and dashed black boxes indicate weighted connections, gate functions, and previous information, respectively.
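The roles of these symbols can be illustrated with a single LSTM step written for scalar values. This is a sketch of the standard LSTM formulation without peephole connections, which is assumed here because the exact wiring of Figure 12 cannot be recovered from the text; the function name `lstm_step`, the weight layout, and the key `"g"` for the candidate input are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, y_prev, c_prev, weights):
    """One scalar LSTM step.

    weights maps each of "i", "f", "o" (gates) and "g" (candidate input)
    to a tuple (w_x, w_y, bias). Returns the current output y(t) and cell c(t).
    """
    def pre_activation(name):
        w_x, w_y, bias = weights[name]
        return w_x * x_t + w_y * y_prev + bias

    i_gate = sigmoid(pre_activation("i"))    # input gate i(.)
    f_gate = sigmoid(pre_activation("f"))    # forget gate f(.)
    o_gate = sigmoid(pre_activation("o"))    # output gate o(.)
    candidate = math.tanh(pre_activation("g"))  # input activation I(.)

    c_t = f_gate * c_prev + i_gate * candidate  # keep old memory, add new
    y_t = o_gate * math.tanh(c_t)               # output activation O(.)
    return y_t, c_t

# With the forget gate open and the input gate closed, the cell value is
# carried over almost unchanged, which is how long-term memory is retained.
keep_memory = {"i": (0.0, 0.0, -20.0), "f": (0.0, 0.0, 20.0),
               "o": (0.0, 0.0, 0.0), "g": (0.0, 0.0, 0.0)}
y_keep, c_keep = lstm_step(x_t=1.0, y_prev=0.0, c_prev=2.0, weights=keep_memory)
```

Because the cell value is updated by addition rather than repeated multiplication through a nonlinearity, gradients along the cell path decay far more slowly than in a plain RNN.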

I (shown in Figure 14(a)): humidity of 62.6%, wind speed of 1.3 m/s, 21.9 °C, afternoon, cloudy, autumn. Object is clear at the current position but is difficult to see in the left-upper position where the area is brighter in the thermal image.
II (shown in Figure 14(b)): 6.0 °C, afternoon, cloudy, humidity of 39.6%, wind speed of 1.9 m/s. Object is clear at the current position but is difficult to see in the upper position where the area is brighter in the thermal image.
III (shown in Figure 14(c)): 14.0 °C, afternoon, sunny, humidity of 43.4%, wind speed of 3.1 m/s. The temperature outside the building is increased by the air heating system of the building in the thermal image.
IV (shown in Figure 14(d)): 1.2 °C, morning, humidity of 73.0%, wind speed of 1.6 m/s. The temperature of the window of the building changes over time, making it difficult to visualize the objects in the thermal image; the reflection of the object in the window makes it difficult to visualize the object.
V (shown in Figure 14(e)): 1.0 °C, afternoon, humidity of 50.6%, wind speed of 1.7 m/s. The intensity of trees and leaves increases when it is sunny in a cold environment.
VI (shown in Figure 14(f)): 31.3 °C, noon, humidity of 43.4%, wind speed of 3.1 m/s. The temperatures of the human body and background are similar, making it difficult to segment the human area from the background.
VII- ... effect is shown below the human area in the thermal image; the top part of the object is not visible in the thermal image owing to a ...; the temperature of the background is higher than that of the human body.
IX (shown in Figure 14(i)): 18.9 °C, night, humidity of 62.6%, wind speed of 1.3 m/s. A halo effect is shown below the human area in the thermal image; the temperatures of the human body and background are similar; the object is not seen in the visible light image.
X (shown in Figure 14(j)): 10.9 °C, dark night, humidity of 48.3%, wind speed of 2.0 m/s. The dataset was collected at night and the halo effect is shown below the human area; the object is not seen in the visible light image.
XI (shown in Figure 14(k)): 10.9 °C, dark night, humidity of 48.3%, wind speed of 2.0 m/s. The object is not shown in the visible light image; the reflection of the object in the window makes it difficult to visualize the object.
XII (shown in Figure 14(l)): 20.2 °C, dark night, humidity of 58.6%, wind speed of 1.2 m/s. The temperature outside the building is increased by the air heating system of the building in the thermal image; the object is not seen in the visible light image.
XIII (shown in Figure 14(m)): -2.0 °C, dark night, humidity of 50.6%, wind speed of 1.8 m/s. The intensity of the trees and leaves is high owing to sunlight during the daytime.
XIV (shown in Figure 14(n)): 12.0 °C, dark night, humidity of 63.1%, wind speed of 1.5 m/s. A halo effect is shown below the human area in the thermal image; the object is not seen in the visible light image.
XV (shown in Figure 14(o)): 28.0 °C, night, humidity of 45.1%, wind speed of 1.6 m/s. The temperature of the background is higher than that of the human body owing to sunlight during the daytime; the object is not seen in the visible light image.
XVI (shown in Figure 14(p)): 10.0 °C, dark night, humidity of 63.1%, wind speed of 1.5 m/s. A halo effect is shown below the human area in the thermal image; the object is not seen in the visible light image.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2931804, IEEE Access, VOLUME XX, 2019.
This section describes the training phase of the proposed methods in detail. All methods used images with a size of 224 × 224 pixels for training and testing. When the generator of CycleGAN was trained using the image restoration method, the cycle-consistency loss, identity loss, training epoch, learning rate, mini-batch, loss function, and optimizer were set to 10.0, 1.0, 6,000, 0.00001, 1, mean squared error [64], and adaptive moment estimation (Adam) [65], respectively. When the generator was trained using the halo effect removal method, the learning rate and training epoch were set to 0.00001 and 5,000, respectively. In the case of the skeleton image generation method, the learning rate and training epoch were set to 0.00001 and 2,000, respectively. In the case of the human action recognition method, the training epoch, learning rate, momentum, mini-batch, optimizer, and loss function were set to 5, 0.0001, 0.9, 10, Adam, and categorical cross entropy, respectively. The epoch number was determined based on the outcome images obtained during training. The learning rate was determined based on the epoch number.
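For reference, the settings listed above can be gathered into one configuration, together with a sketch of how the generator objective might combine its terms. The sketch reads the values 10.0 and 1.0 as weights on the cycle-consistency and identity terms, which is the usual CycleGAN convention but an assumption here; the dictionary keys and the function `generator_objective` are hypothetical names.

```python
# Hyper-parameters as stated in the text; the key names are illustrative only.
TRAINING_CONFIG = {
    "image_size": (224, 224),
    "image_restoration": {
        "lambda_cycle": 10.0,      # assumed to be the weight of the cycle-consistency term
        "lambda_identity": 1.0,    # assumed to be the weight of the identity term
        "epochs": 6000,
        "learning_rate": 1e-5,
        "batch_size": 1,
        "loss": "mean_squared_error",
        "optimizer": "adam",
    },
    "halo_removal": {"epochs": 5000, "learning_rate": 1e-5},
    "skeleton_generation": {"epochs": 2000, "learning_rate": 1e-5},
    "action_recognition": {
        "epochs": 5,
        "learning_rate": 1e-4,
        "momentum": 0.9,
        "batch_size": 10,
        "optimizer": "adam",
        "loss": "categorical_crossentropy",
    },
}


def generator_objective(adversarial, cycle, identity, lambda_cycle, lambda_identity):
    """Weighted sum of the three generator loss terms (assumed combination)."""
    return adversarial + lambda_cycle * cycle + lambda_identity * identity


cfg = TRAINING_CONFIG["image_restoration"]
total = generator_objective(0.5, 0.1, 0.2, cfg["lambda_cycle"], cfg["lambda_identity"])
print(total)
```

The loss values passed in the last lines are arbitrary placeholders; only the weights and schedules come from the text.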

Figure 16. Examples of image restoration. (a)-(f) Examples 1-6, respectively, where the first, second, third, and fourth images are the original image, restored image, binarized version of the original image, and binarized version of the restored image, respectively.

Figure 18. Examples of skeleton generation. Left and right image pairs in (a)-(d) show examples 1-8, respectively, where the first and third images are the original images whereas the second and fourth images are the generated skeleton images.

Table 11. Confusion matrix of the results of human action recognition using method 2 (unit: %).
Table 12. Confusion matrix of the results of human action recognition using method 3 (unit: %).
Table 13. Confusion matrix of the results of human action recognition using method 4 (unit: %).
Table 14. Confusion matrix of the results of human action recognition using method 5 (unit: %).

G. BATCHULUUN et al.: Preparation of Papers for IEEE Access (February 2019)

Table 17. Confusion matrix of the results of human action recognition using method 8 (unit: %).

Table 19. Confusion matrix of the results of human action recognition using method 10 (unit: %).
Table 20. Confusion matrix of the results of human action recognition using method 11 (unit: %).

Table 3. Detailed description of the residual block (Conv, ReLU, BN, and Add indicate the convolutional layer, rectified linear unit, batch normalization layer, and addition function, respectively).
Total params: 296,192
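The reported total can be cross-checked with a short calculation. The layer sizes of the residual block are not given in this part of the text, so the sketch assumes two 3 × 3 convolutions with 128 input and 128 output channels (with biases), each followed by a batch normalization layer counted with four parameters per channel (scale, shift, moving mean, and moving variance); under these assumptions the count matches 296,192.

```python
def conv2d_params(kernel, in_channels, out_channels, bias=True):
    """Parameter count of a square 2-D convolution."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def batchnorm_params(channels):
    """Scale, shift, moving mean, and moving variance per channel."""
    return 4 * channels


def residual_block_params(channels=128, kernel=3):
    """Two Conv + BN pairs; ReLU and Add have no parameters.

    channels=128 and kernel=3 are assumed values, not taken from the text.
    """
    per_pair = conv2d_params(kernel, channels, channels) + batchnorm_params(channels)
    return 2 * per_pair


print(residual_block_params())  # 296192
```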

Table 4. Detailed description of the structure of the discriminator CNN (Conv, LReLU, and InsNorm indicate the convolutional layer, leaky rectified linear unit, and instance normalization layer, respectively).
Layer number | Layer type | Size of feature map (height × width × channel) | Number of filters
Total number of parameters: 2,268,521
Total number of trainable parameters: 156,656,458

Table 15. Confusion matrix of the results of human action recognition using method 6 (unit: %).

Table 18. Confusion matrix of the results of human action recognition using method 9 (unit: %).

Table 24. Overall accuracies of human action recognition methods (unit: %).
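As an illustration of how figures of this kind are commonly derived, the sketch below builds a row-normalized confusion matrix in percent and an overall accuracy from predicted labels. The class names and labels are placeholders rather than the database's actual action classes, and the overall accuracy is taken here as the share of correctly classified samples, which may differ from the definition used for Table 24.

```python
def confusion_matrix_percent(true_labels, predicted_labels, classes):
    """Return (counts, percent); rows are true classes, columns are predictions."""
    index = {name: k for k, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    percent = []
    for row in counts:
        total = sum(row)
        # Each row is normalized by the number of samples of that true class.
        percent.append([100.0 * v / total if total else 0.0 for v in row])
    return counts, percent


def overall_accuracy_percent(counts):
    """Correctly classified samples as a percentage of all samples."""
    total = sum(sum(row) for row in counts)
    correct = sum(counts[k][k] for k in range(len(counts)))
    return 100.0 * correct / total if total else 0.0


# Placeholder classes and labels, for illustration only.
classes = ["walking", "running", "standing"]
y_true = ["walking", "walking", "running", "running",
          "standing", "standing", "standing", "standing"]
y_pred = ["walking", "running", "running", "running",
          "standing", "standing", "standing", "walking"]

counts, percent = confusion_matrix_percent(y_true, y_pred, classes)
for name, row in zip(classes, percent):
    print(name, row)
print(overall_accuracy_percent(counts))  # 75.0
```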