Generating Realistic Smoke Images With Controllable Smoke Components

Smoke image generation is an important method for addressing sample-category imbalance in smoke detection, with applications to forest smoke detection, smoky vehicle detection, etc. This article presents two Controllable Smoke Image Generation Neural Networks (CSGNet and CSGNet-v2). More specifically, CSGNet generates various smoke images by integrating a smoke component control module (SCM) and a smoke image synthesis module (ISM) with a multi-scale strategy. By setting specified smoke component latent codes, we can generate smoke images with specified smoke components; by fine-tuning the latent codes in SCM, we can fine-tune the smoke components in the generated images. To further improve CSGNet, CSGNet-v2 adds a smoke image fine-tuning module (IFM) to make the generated smoke images more realistic. Experimental results show that our methods achieve the best results to date in generating smoke images with controllable smoke components.


I. INTRODUCTION
Smoke image generation is a very important method for addressing sample-category imbalance in smoke detection, with applications such as forest smoke detection, smoky vehicle detection [1], [2], etc., in which smoke samples are much scarcer than non-smoke samples. This article focuses on smoke image generation for vehicle rear regions, and our methods can also be transplanted to other applications to generate images containing transparent substances.
Smoke image generation is a subtopic of image generation, whose methods can be divided into two categories: Generative Adversarial Network (GAN)-based methods [3] and Variational Auto-encoder (VAE)-based methods [4]. GANs-based methods often face training stability problems, while VAEs-based methods often produce blurred images.
Different from all existing methods, we propose a new idea for generating smoke images based on close observation of smoke characteristics. Smoke is usually composed of many tiny particles, which scatter and absorb light from light sources or environmental reflections [5]. Light gradually attenuates in the air and finally enters a camera to form an image. This gives smoke its typical transparency, and also means that smoke images can be synthesized from non-smoke components and smoke components. (The associate editor coordinating the review of this manuscript and approving it for publication was Filbert Juwono.)
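This compositing view can be illustrated with a simple alpha-blending sketch; the function names and values below are illustrative, not the proposed model itself:

```python
import numpy as np

def composite_smoke(non_smoke, smoke_color, alpha):
    """Blend a smoke layer over a non-smoke image.

    non_smoke, smoke_color: float arrays in [0, 1] with shape (H, W, 3).
    alpha: per-pixel smoke opacity in [0, 1] with shape (H, W, 1).
    """
    return (1.0 - alpha) * non_smoke + alpha * smoke_color

background = np.zeros((4, 4, 3))   # dark road patch
smoke = np.full((4, 4, 3), 0.8)    # light gray smoke color
opacity = np.full((4, 4, 1), 0.5)  # half-transparent smoke
fused = composite_smoke(background, smoke, opacity)
```

Where the opacity is zero, the non-smoke pixel is preserved exactly, which is why the non-smoke components of a composited image can remain fully realistic.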
Based on the above considerations, we propose two Controllable Smoke Image Generation Neural Networks (CSGNet and CSGNet-v2). The main contributions are summarized as follows: (1) CSGNet is proposed to generate various smoke images by integrating a smoke component control module (SCM) and a smoke image synthesis module (ISM) with a multi-scale strategy. By adjusting the smoke component latent codes in SCM, we control the smoke components in the generated smoke images.
(2) CSGNet-v2 is proposed to make the generated smoke images more realistic by adding a smoke image fine-tuning module (IFM). The smoke in the generated images is still controlled by adjusting the smoke component latent codes in SCM. (3) Our methods have three main advantages: a) they can control the smoke in the generated images to produce various smoke images or sequences, such as specified light-smoke images, which existing methods cannot do; b) they can generate smoke images together with their smoke component images, which are useful in smoke detection, smoke segmentation, smoke pollution level estimation for smoky vehicles, smoke motion characteristic estimation for forest smoke, etc., which existing methods also cannot do; c) by generating smoke images through adding various smoke components to existing non-smoke images, the training step is more stable than in GANs-based methods, the generated smoke images are less blurry than those of VAEs-based methods, and the non-smoke components in the generated images are more realistic since they come from the real world.
This article is organized as follows. Section II introduces related work on image generation. Section III describes the proposed CSGNet for generating smoke images with controllable smoke components. Section IV describes the proposed CSGNet-v2 for generating more realistic smoke images. Extensive experiments are reported in Section V. Finally, we conclude the article in Section VI.

II. RELATED WORK
In this section, we introduce related work on image generation, especially methods that can generate images with some controllable characteristics. We roughly divide existing image generation methods into two categories: GANs-based methods and VAEs-based methods.

A. GANS-BASED METHODS
The original GAN [3] was first proposed by Goodfellow in 2014 and used for image generation. Many GANs-based variants have since been proposed, but we mainly introduce those that can generate images with some controllable characteristics.
Chen et al. [6] proposed InfoGAN, an information-theoretic extension of GAN that learns disentangled representations in a completely unsupervised manner, such as separating writing styles from digit shapes on the MNIST dataset [7]. Mirza and Osindero [8] proposed the conditional GAN (CGAN), constructed by simply feeding the data to be conditioned on to both the generator and discriminator, for example to generate MNIST digits conditioned on class labels. Li et al. [9] proposed Triple-GAN, a flexible game-theoretical framework for classification and class-conditional generation in semi-supervised learning, which disentangles classes and styles and transfers smoothly at the data level via class-conditional interpolation in the latent space. Borrowing from the style transfer literature, Karras et al. [10] proposed an alternative generator architecture for GANs that achieves an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity on human faces) and stochastic variation in the generated images (e.g., freckles, hair). Zhu et al. [11] proposed CycleGAN, which learns to translate an image from a source domain to a target domain in the absence of paired examples using an adversarial loss, enabling collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Choi et al. [12] proposed StarGAN, which performs image-to-image translation for multiple domains by simultaneously training on multiple datasets with different domains within a single network. Shaham et al. [13] proposed SinGAN, which can be learned from a single natural image to capture the internal distribution of patches within the image and generate high-quality, diverse samples that carry the same visual content.

B. VAES-BASED METHODS
The original VAE [4] was proposed by Kingma in 2014 and also used for image generation. Many VAEs-based variants have since been proposed.
Sohn et al. [14] proposed a scalable deep conditional generative model (CVAE) for structured output variables using Gaussian latent variables. Bao et al. [15] proposed a variational GAN that combines a VAE with a GAN (VAE-GAN) to synthesize images of fine-grained categories, such as faces of a specific person or objects in a category. Huang et al. [16] proposed the introspective variational autoencoder (IntroVAE) for synthesizing high-resolution photo-realistic images with stable training and a well-behaved latent manifold. Higgins et al. [17] proposed beta-VAE for the automated discovery of interpretable factorized latent representations from raw image data in a completely unsupervised manner, introducing an adjustable hyperparameter beta that balances latent channel capacity and independence constraints against reconstruction accuracy. Xiao et al. [18] proposed a supervised learning model called DNA-GAN to disentangle different factors of images with DNA-like latent representations, in which each individual piece of the encoding represents an independent factor of variation. Burda et al. [19] proposed the importance weighted autoencoder (IWAE), which has the same architecture as the VAE but uses a strictly tighter log-likelihood lower bound derived from importance weighting to learn richer latent space representations. Larsen et al. [20] proposed an autoencoder that leverages learned representations to better measure similarities in data space by combining a VAE with a GAN and using the learned feature representations in the GAN discriminator as the basis for the VAE reconstruction objective.
Various GANs-based and VAEs-based methods have been proposed and have achieved great success. However, they still have some disadvantages for our controllable smoke image generation task: 1) they cannot generate smoke images with controllable smoke components; 2) they cannot generate smoke images together with their smoke component images, which are useful in smoke detection and smoke segmentation; 3) the smoke images generated by VAEs-based methods are usually blurred.
To solve all the above issues, this article proposes the CSGNet model. To further improve the authenticity of the smoke images generated by CSGNet, this article proposes the CSGNet-v2 model.
VOLUME 8, 2020

III. CONTROLLABLE SMOKE IMAGE GENERATION NEURAL NETWORKS
In this section, we introduce CSGNet in detail, including its network structure, the smoke component control module (SCM), the smoke image synthesis module (ISM), and the loss function.

A. NETWORK STRUCTURE
The overall network structure of the CSGNet model can be seen in Fig.1. The proposed CSGNet consists of two main modules: the smoke component control module (SCM) and the smoke image synthesis module (ISM). SCM is used to learn the latent distribution of smoke components, and ISM is used to synthesize smoke images from non-smoke component images and smoke component images.
During training, non-smoke component images, smoke component images, and their manually synthesized smoke images are given. We first feed the smoke component images into SCM to obtain decoded smoke components. Then we feed the non-smoke component images and smoke component images into ISM together to generate smoke images.
During testing, only non-smoke images are given. We first randomly generate a smoke component latent code vector and feed it into the smoke component decoder to obtain a smoke component image. Then we feed the non-smoke component image and the smoke component image into ISM together to generate a new smoke image.
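This test-time pipeline can be sketched as follows; `decoder` and `ism` stand in for the trained SCM decoder and ISM, and the placeholder lambdas exist only to make the sketch self-contained:

```python
import torch

def generate_smoke_image(decoder, ism, non_smoke, z_dim=10):
    """Sample a random smoke component latent code, decode it into a
    smoke component image, then fuse it with the non-smoke image."""
    z = torch.randn(non_smoke.size(0), z_dim)  # random latent code
    smoke_components = decoder(z)
    return ism(non_smoke, smoke_components)

# Placeholder networks so the sketch runs end to end.
decoder = lambda z: torch.zeros(z.size(0), 3, 128, 128)
ism = lambda bg, smoke: bg + smoke
image = generate_smoke_image(decoder, ism, torch.rand(1, 3, 128, 128))
```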

B. SMOKE COMPONENT CONTROL MODULE (SCM)
SCM is used to learn the latent distribution of smoke components. We first feed smoke component images into this module to obtain the mean vector and the logarithm of the variance vector, which define a Gaussian distribution. Then we sample from this distribution to obtain smoke component latent codes. Finally, we decode the latent codes to obtain a smoke component image.
A detailed design of SCM can be seen in Fig.2. By adjusting each dimension of the latent code vector, we can control the reconstructed smoke components.
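Assuming a standard VAE-style reparameterization for the sampling step described above (the 10-dimensional code size matches the experiments in Section V; other details are illustrative), the sketch looks like:

```python
import torch

def reparameterize(mu, logvar):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), keeping the
    sampling step differentiable for training."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

mu = torch.zeros(2, 10)      # encoder mean vectors (illustrative)
logvar = torch.zeros(2, 10)  # encoder log-variance vectors
z = reparameterize(mu, logvar)
```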

C. SMOKE IMAGE SYNTHESIS MODULE (ISM)
ISM is used to synthesize smoke images from non-smoke component images and smoke component images. A multi-scale strategy inspired by [21] is adopted to accelerate training and exploit information at different scales.
A detailed design of ISM can be seen in Fig.3. We first concatenate the non-smoke components and the smoke components. Then three different scales are used to aggregate context information from different regions, improving the ability to capture global information. A shortcut connection is used to accelerate training.
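A minimal PyTorch sketch of such a block is given below; the channel widths, dilation rates, and layer choices are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """ISM-like block: concatenate non-smoke and smoke inputs, aggregate
    context at three scales, and add a shortcut to the non-smoke image."""
    def __init__(self, ch=16):
        super().__init__()
        self.inp = nn.Conv2d(6, ch, 3, padding=1)  # 3+3 concatenated channels
        # Three dilation rates stand in for "three different scales".
        self.branches = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in (1, 2, 4)]
        )
        self.out = nn.Conv2d(3 * ch, 3, 3, padding=1)

    def forward(self, non_smoke, smoke):
        x = torch.cat([non_smoke, smoke], dim=1)
        h = F.relu(self.inp(x))
        ctx = torch.cat([F.relu(b(h)) for b in self.branches], dim=1)
        return self.out(ctx) + non_smoke  # shortcut connection

m = MultiScaleFusion()
y = m(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
```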

D. LOSS FUNCTION
The loss function of CSGNet is given by

L_CSGNet = λ_SCM · L_SCM + λ_ISM · L_ISM,    (1)

where N is the total number of training samples and M is the training batch size. λ_SCM and λ_ISM are the weight coefficients of the two terms, and L_SCM and L_ISM are the loss terms of SCM and ISM, respectively, given by

L_SCM = λ_SCM1 · ||O_SCM^smoke − I_smoke||^2 + λ_SCM2 · D_KL(q_φ(z|x) || p_θ(z)),    (2)

L_ISM = ||O_ISM^fusion − I_fusion||^2,    (3)

where x is the input data, including the non-smoke component image I_non-smoke, the smoke component image I_smoke, and the synthesized smoke image I_fusion. O_SCM^smoke is the output smoke component image of SCM, and O_ISM^fusion is the output smoke image of ISM. λ_SCM1 and λ_SCM2 are the weight coefficients of the smoke component reconstruction error ||O_SCM^smoke − I_smoke||^2 and the KL divergence D_KL(q_φ(z|x) || p_θ(z)), which measures the similarity between q_φ(z|x) and p_θ(z). The weights of all loss terms are set to 1.
Given smoke component latent codes z and observation data x, the mapping x → z is described by the probability distribution q_φ(z|x), and the mapping z → x by p_θ(x|z). The distribution q_φ(z|x) in equation (2) is used to approximate the intractable posterior p_θ(z|x). The likelihood p_θ(x|z) can be set to a Bernoulli or Gaussian distribution.
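Under this Gaussian-posterior reading, the loss can be sketched in PyTorch; the squared-error norms, closed-form KL term, and unit weights are our reading of equations (1)-(3), not released code:

```python
import torch
import torch.nn.functional as F

def csgnet_loss(o_smoke, i_smoke, o_fusion, i_fusion, mu, logvar,
                w_scm=1.0, w_ism=1.0, w_scm1=1.0, w_scm2=1.0):
    """Sketch of the CSGNet loss: an SCM term (reconstruction + KL)
    plus an ISM fusion reconstruction term, all weights defaulting to 1."""
    recon = F.mse_loss(o_smoke, i_smoke)
    # Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    l_scm = w_scm1 * recon + w_scm2 * kl
    l_ism = F.mse_loss(o_fusion, i_fusion)
    return w_scm * l_scm + w_ism * l_ism

ones = torch.ones(2, 3)
zeros = torch.zeros(2, 3)
# Perfect smoke reconstruction, standard-normal posterior, unit fusion error.
loss = csgnet_loss(ones, ones, ones, zeros, zeros, zeros)
```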

IV. CONTROLLABLE SMOKE IMAGE GENERATION NEURAL NETWORKS-V2
In this section, we introduce CSGNet-v2 in detail, including its network structure, the smoke image fine-tuning module (IFM), and the loss function.

A. NETWORK STRUCTURE
The overall network structure of CSGNet-v2 can be seen in Fig.4, where X and Y denote two image domains, namely the synthetic smoke images and the real smoke images. G1: X→Y and G2: Y→X denote two learned mapping functions between the domains; this part of IFM is inspired by [11]. The discriminators D1 and D2 are used to judge whether an input image belongs to X or Y.
The CSGNet-v2 consists of two main modules including the CSGNet and the IFM. CSGNet is used to generate controllable smoke images, and the IFM is used to make the generated smoke images more realistic.
B. SMOKE IMAGE FINE-TUNING MODULE (IFM)
Given a non-smoke image, we can use CSGNet to synthesize various smoke images. However, these images often carry a synthetic look and visible traces. To make them more realistic, the IFM is adopted.
We first feed a synthetic smoke image I_synthesis into G1: X→Y to obtain a translated smoke image G1(I_synthesis). Then we feed G1(I_synthesis) into D1 to judge whether it belongs to X or Y, and also feed G1(I_synthesis) into G2: Y→X to make its output as close as possible to I_synthesis. Simultaneously, we feed a real smoke image I_real into G2: Y→X to obtain G2(I_real). Then we feed G2(I_real) into D2 to judge whether it belongs to X or Y, and also feed G2(I_real) into G1: X→Y to make its output as close as possible to I_real. Trained in this way, G1: X→Y can transform a synthetic smoke image into a realistic smoke image. G1 is used to generate images that look similar to images from domain Y, while D1 is used to distinguish translated samples G1(x) from real samples y. G1 aims to minimize this objective against the adversary D1, which tries to maximize it.

C. LOSS FUNCTION
The loss function of CSGNet-v2 is given by

L_CSGNet-v2 = L_CSGNet + L(G1, G2, D1, D2),    (4)

where L(G1, G2, D1, D2) contains adversarial losses and cycle consistency losses. The adversarial losses are used to match the distribution of the generated images to the data distribution of the target domain, and the cycle consistency losses are used to prevent the learned mappings G1 and G2 from contradicting each other. The weights of the different loss terms are set to 1.
The term L(G1, G2, D1, D2) can be expressed as

L(G1, G2, D1, D2) = λ_GAN1 · L_GAN1 + λ_GAN2 · L_GAN2 + λ_cyc · L_cyc,    (5)

where λ_GAN1, λ_GAN2, and λ_cyc are the weight coefficients of the adversarial losses L_GAN1 and L_GAN2 and the cycle consistency loss L_cyc. These terms can be expressed as

L_GAN1 = E_{y∼p_data(y)}[log D1(y)] + E_{x∼p_data(x)}[log(1 − D1(G1(x)))],

L_GAN2 = E_{x∼p_data(x)}[log D2(x)] + E_{y∼p_data(y)}[log(1 − D2(G2(y)))],

L_cyc = E_{x∼p_data(x)}[||G2(G1(x)) − x||_1] + E_{y∼p_data(y)}[||G1(G2(y)) − y||_1],

where G1 and G2 are the two mapping functions between the synthetic smoke image domain X and the real smoke image domain Y, x ∼ p_data(x) and y ∼ p_data(y) denote the synthetic and real smoke training image distributions, and D1 and D2 are two adversarial discriminators. D1 is used to distinguish real smoke images y from translated images G1(x), and D2 is used to distinguish synthetic smoke images x from translated images G2(y).
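The cycle consistency term can be sketched in PyTorch as follows; the identity lambdas are stand-ins for the trained mapping networks G1 and G2:

```python
import torch

def cycle_consistency_loss(g1, g2, x, y):
    """L_cyc as in CycleGAN: an L1 penalty on each domain's round trip,
    x -> G1 -> G2 and y -> G2 -> G1."""
    loss_x = torch.mean(torch.abs(g2(g1(x)) - x))
    loss_y = torch.mean(torch.abs(g1(g2(y)) - y))
    return loss_x + loss_y

identity = lambda t: t      # stand-in for trained mapping networks
x = torch.rand(1, 3, 8, 8)  # synthetic smoke image (domain X)
y = torch.rand(1, 3, 8, 8)  # real smoke image (domain Y)
loss = cycle_consistency_loss(identity, identity, x, y)
```

With identity mappings, every round trip reconstructs its input exactly and the loss is zero, which is the behavior the term encourages in the trained generators.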
To obtain a better model, we first train the CSGNet model with the loss L_CSGNet in (1). Then we train the IFM module in CSGNet-v2 with the loss L(G1, G2, D1, D2) in (5). Finally, we train CSGNet and the IFM module together with the loss L_CSGNet-v2 in (4).

V. EXPERIMENTS AND ANALYSIS
In this section, extensive experiments are reported to verify the advantages of our methods. First, we introduce the datasets used in this article. Second, experiments on smoke image generation are presented. Third, experiments on smoke image generation with controllable smoke components are presented. Fourth, experiments on smoke component estimation in the generated smoke images are presented. Fifth, the image quality of the generated smoke images is compared. Sixth, comparison experiments with existing methods are presented. Finally, experiments on smoke component estimation for real smoke images are presented.
Our methods are implemented using the PyTorch framework on a Linux system with CUDA toolkit 9.2 and two NVIDIA RTX 2000-series graphics cards.

A. DATASETS
The datasets used in this article contain two parts: synthetic datasets and real datasets. A detailed description can be seen in Table 1.
Some groups of images from the synthetic datasets can be seen in Fig.5, and some smoke images from real datasets can be seen in Fig.6.
All images have a resolution of 128 × 128 pixels. Note that we focus on smoke images of vehicle rears, and our method can also be transplanted to other applications such as forest smoke generation.

B. EXPERIMENTS ON SMOKE IMAGE GENERATION
This part verifies the effectiveness of smoke image generation.
If we train K models by setting L to 1, 2, . . . , K, the total number of kinds of smoke images increases accordingly. We set L to 10. Given non-smoke images and random smoke component latent codes, Fig.7 shows some smoke images generated by CSGNet and CSGNet-v2. We can see that our methods can generate various smoke images: the non-smoke components in Fig.7(a) are clear, and the smoke in Fig.7(b) is realistic.

C. EXPERIMENTS ON SMOKE IMAGE GENERATION WITH CONTROLLABLE SMOKE COMPONENTS
Different from existing methods, the smoke components in the smoke images generated by our methods are controllable, i.e., the smoke components can be designed or fine-tuned manually while the non-smoke components remain unchanged.
Given a non-smoke image, our method can generate smoke images with a designed smoke component by using designed latent codes selected from I_total kinds of smoke components. In addition, we can fine-tune the smoke components by fine-tuning the smoke component latent codes. Fig.8 shows some smoke images generated by linearly adjusting the last four dimensions of the smoke latent codes while keeping the others at 0. Our methods can control and fine-tune the smoke components in the generated images while keeping the non-smoke components unchanged; the texture of the road surface remains clearly visible. Fig.9 shows some smoke images generated by linearly adjusting the first two dimensions of the smoke component latent codes while keeping the others at −1. We can see that our methods effectively control the smoke components added to the given non-smoke images.
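The latent-code sweeps behind Fig.8 and Fig.9 can be sketched as follows, with a placeholder lambda standing in for the trained SCM decoder:

```python
import torch

def sweep_latent(decoder, base_z, dims, values):
    """Vary the chosen latent dimensions linearly while freezing the
    rest, producing a sequence of smoke components for one image."""
    outputs = []
    for v in values:
        z = base_z.clone()
        z[:, dims] = v  # set the selected dimensions to this sweep value
        outputs.append(decoder(z))
    return torch.stack(outputs)

decoder = lambda z: z * 2.0  # placeholder for the trained SCM decoder
base_z = torch.zeros(1, 10)  # 10-dim smoke component latent code
row = sweep_latent(decoder, base_z, [8, 9], [-1.0, 0.0, 1.0])
```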
In the selected cases, the smoke images generated by CSGNet-v2 are more realistic than those of CSGNet, but the image clarity of CSGNet is better than that of CSGNet-v2.

D. EXPERIMENTS ON SMOKE COMPONENT ESTIMATION IN GENERATED SMOKE IMAGES USING CSGNET
Different from existing methods, CSGNet can generate smoke images together with their smoke component images, which are useful in many applications. Fig.10 gives the smoke components of the smoke images generated in Fig.8 and Fig.9. From this figure, we can clearly see the smoke component characteristics of the generated smoke images.
The smoke in Fig.10(a) is darker than that in Fig.10(b), and the smoke concentration at the center of Fig.10(a) is higher than at the edges. This indicates that the larger the absolute values of the latent codes, the heavier the smoke components.

E. COMPARISON EXPERIMENTS WITH IMAGE QUALITY OF THE GENERATED SMOKE IMAGES
This section evaluates image quality of generated smoke images.
When smoke images are randomly generated, the phenomenon of generating repeated samples is called mode collapse. It arises because the actual data distribution has many peaks, but only a few of them are fitted during training. The proposed models do not suffer from this issue: the diversity of the smoke images generated by CSGNet mainly comes from the diversity of the background images and the smoke component images. There are sufficient and diverse background images, and the smoke component images are also varied, as described in Section V(B). Therefore, the diversity of the generated smoke images is guaranteed.
Some quantitative indicators are used to evaluate the image quality of the proposed models. Currently, IS (Inception Score) [23], FID (Fréchet Inception Distance) [24], GAN-test [25], GAN-train [25], Kernel MMD (Maximum Mean Discrepancy) [26], and 1-NN (Nearest Neighbor) two-sample tests [27] are important strategies for evaluating image quality. Xu et al. [28] concluded that Kernel MMD [26] and 1-NN two-sample tests [27] are the best evaluation indicators. Table 2 shows the image quality evaluation results of the smoke images generated by our methods and by WGAN-GP [29]. Since the mainstream WGAN-GP model is stable in training and can generate realistic samples of considerable quality, we take it as the baseline. The proposed methods perform better than WGAN-GP on three evaluation indexes, which shows that the quality of the smoke images generated by the proposed models is guaranteed and close to that of real black smoke images.
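As an illustration of one of these indicators, a biased kernel MMD estimate can be sketched with NumPy; the Gaussian kernel and bandwidth are illustrative assumptions, not the exact evaluation setup:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise Gaussian kernel values between rows of a and b.
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared kernel MMD between two feature
    sets: small for matching distributions, large for distinct ones."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
close = kernel_mmd2(rng.normal(size=(50, 4)), rng.normal(size=(50, 4)))
far = kernel_mmd2(rng.normal(size=(50, 4)),
                  rng.normal(loc=3.0, size=(50, 4)))
```

In practice the indicator is computed on deep features of real versus generated images rather than raw samples.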

F. COMPARISONS EXPERIMENTS WITH EXISTING METHODS
Some existing methods are transplanted to smoke image generation to verify the advantages of our methods.
The main advantages of our methods are that they 1) can generate smoke images with controllable smoke components and 2) can generate smoke images together with their smoke component images. Since we are the first to achieve advantage 2), we only select some advanced methods related to feature disentanglement for comparison with our methods.
Based on the discussion in Section II, InfoGAN [6], CycleGAN [11], beta-VAE [17], and IntroVAE [16] are used as baseline methods. Note that some non-core parts of these baselines were adapted to our datasets and task, and we tried our best to achieve their best results on our task for fair comparisons. Fig.11 shows the comparison experiments of different methods on controllable smoke image generation. Fig.11(a) shows the results obtained by adjusting the latent codes c [6] in InfoGAN [6]. It seems that c controls the smoke components, but the generated smoke images are not realistic; also, we cannot generate more kinds of images since we set the class number to 10. Fig.11(b) shows the results using CycleGAN [11], but we cannot fine-tune the smoke components in the generated smoke images. This may lead to sample imbalance across categories, since it tends to generate images with heavy smoke. Fig.11(c) shows the results of beta-VAE [17]; none of the ten dimensions controls the smoke components while keeping the non-smoke components unchanged. Fig.11(d) shows the results of IntroVAE [16]. We give a smoke image and its non-smoke components and use IntroVAE to interpolate between the images, but the images must belong to its training dataset, and the diversity of the smoke is limited.
In addition, the training of the first two (GANs-based) methods is unstable, and the images generated by the last two (VAEs-based) methods are blurred. Based on the above analysis, our methods outperform all the baseline methods in controllable smoke image generation.
The advantages of our methods are as follows. Our methods can generate smoke images together with their smoke component images. Using the datasets generated by our methods, we can train models to estimate smoke components, and the smoke component images are useful in many applications. 1) In smoke detection, if the smoke components in an image are higher than a fixed threshold, we can classify the image as smoke; this is a new way to detect smoke. 2) In smoke segmentation, we can segment smoke at different levels by setting different thresholds on the estimated smoke component images using a single trained model, whereas traditional deep learning methods need to label multiple datasets and train multiple models for different levels.
3) In smoke pollution level estimation, by analyzing the smoke components, we can estimate the pollution level of smoke images in smoky vehicle detection. 4) In forest smoke analysis, by analyzing the motion characteristics of smoke components in successive frames, we can forecast the burning direction and trend of a forest fire. All the above applications rely on a dataset of smoke images together with their corresponding smoke component images. Therefore, generating smoke images and their smoke component images is very useful.
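The threshold-based use of smoke component maps in applications 1) and 2) can be sketched as follows; the threshold values and array shapes are illustrative assumptions, not values from the paper:

```python
import numpy as np

def detect_smoke(component_map, pixel_thresh=0.2, area_thresh=0.05):
    """Threshold-based smoke detection on an estimated smoke component
    map: flag the image if enough pixels exceed the component threshold."""
    mask = component_map > pixel_thresh  # per-pixel smoke segmentation
    return bool(mask.mean() > area_thresh), mask

components = np.zeros((8, 8))
components[:4, :4] = 0.9  # a smoky quadrant covering 25% of the image
is_smoke, mask = detect_smoke(components)
```

Sweeping `pixel_thresh` over several values yields multi-level segmentations from the same estimated component map, which is the single-model advantage claimed in application 2).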
Our methods can control the smoke components while keeping the non-smoke components unchanged, which is also useful in many applications. 1) In some special scenes with specified smoke characteristics, we need to generate specific kinds of smoke images, such as light-smoke images; our method can generate them by setting specified smoke component latent codes. 2) Since our methods keep the non-smoke components unchanged, we can generate short smoke videos by fine-tuning the smoke component latent codes, for example in forest smoke scenes where the non-smoke components barely change. 3) For the same non-smoke components, we can add various smoke components by adjusting the smoke latent codes, thus increasing the diversity of the smoke images.
Our strategy of generating smoke images by adding various smoke components to existing non-smoke images has several advantages. 1) Compared with GANs-based methods, CSGNet avoids training instability. 2) Compared with VAEs-based methods, the generated smoke images are less blurry, and a new non-smoke image can be added to our model without retraining. 3) The non-smoke components in the generated smoke images come from the real world, so the images are more realistic. The smoke components may be unrealistic, but we can fine-tune them using CSGNet-v2.

G. SMOKE COMPONENTS ESTIMATION
Our methods can generate smoke images together with their smoke component images. Using the datasets generated by our methods, we can train models to estimate smoke components, which are useful in many applications. Fig.12 shows some experimental results of smoke component estimation for real smoke images using the synthetic smoke images in this article. We can see that the smoke positions in the smoke images correspond exactly to the smoke component positions in the smoke component images, and the higher the smoke concentration in a smoke image, the higher the concentration in its smoke component image. This verifies the effectiveness of our generated datasets.

VI. DISCUSSION AND CONCLUSION
Smoke image generation is an important method for addressing sample-category imbalance in smoke detection. This article presents two smoke image generation models, CSGNet and CSGNet-v2. CSGNet generates various smoke images by integrating the SCM and the ISM with a multi-scale strategy; by adjusting the smoke component latent codes in SCM, we can control the smoke components in the generated smoke images. CSGNet-v2 makes the generated smoke images more realistic by adding the IFM. Compared with existing methods, our methods have three main advantages: 1) they can control the smoke components in the generated smoke images to produce various smoke images or videos; 2) they can generate smoke images together with their smoke component images, which are useful in smoke detection, smoke segmentation, smoke pollution level estimation for smoky vehicles, smoke motion characteristic estimation for forest smoke, etc.; 3) by generating smoke images through adding various smoke components to existing non-smoke images, the training of our models is stable compared with GANs-based methods, the generated smoke images are less blurry than those of VAEs-based methods, and the non-smoke components in the generated images are more realistic since they come from the real world. Our methods achieve promising results in smoke image generation and can improve the performance of forest smoke detection, but they still have some shortcomings: in special cases, the smoke images generated by CSGNet are not realistic.