Unsupervised Segmentation of Fire and Smoke From Infra-Red Videos

This paper proposes a vision-based fire and smoke segmentation system which uses spatial, temporal and motion information to extract the desired regions from the video frames. The fusion of information is done using multiple features such as optical flow, divergence and intensity values. These features extracted from the images are used to segment the pixels into different classes in an unsupervised way. A comparative analysis is done by using multiple clustering algorithms for segmentation. Here the Markov Random Field performs more accurately than other segmentation algorithms since it characterizes the spatial interactions of pixels using a finite number of parameters. It builds a probabilistic image model that selects the most likely labeling using the maximum a posteriori (MAP) estimation. This unsupervised approach is tested on various images and achieves a frame-wise fire detection rate of 95.39%. Hence this method can be used for early detection of fire in real-time and it can be incorporated into an indoor or outdoor surveillance system.


I. INTRODUCTION
Fire and smoke detectors are an important part of firefighting systems and are widely used to monitor building interiors and outdoor environments. Conventional detection systems use built-in sensors that do not raise an alarm until particles reach and activate them. To obtain high precision, the sensors must be densely distributed in close proximity. Hence, in real-world situations, they are highly inefficient, and the delayed response may cost the lives of firefighters and others. As an alternative to these conventional methods, vision-based fire and smoke detection systems have been introduced in the past few years.
The vision-based systems use either the color information of fire and smoke or dynamic motion features [1], [2]. The classical approaches operated on RGB [3], YCbCr [4], CIE L*a*b* [5] or HSI [6] color spaces, and the image pixels were classified according to the appearance model of the fire. However, color information alone gives high false alarm rates because similar colors are present in the surrounding environment. Another approach involving disorder analysis and growth rate was used to minimize these false alarms [7], [8]. One of the most popular methods was wavelet analysis, which used various extracted features for the detection process. Wavelet-domain energy analysis was used to analyze the time-variant behavior of smoke [9]. To improve the performance, an Expectation Maximization based GMM model was obtained by training on the pixels of previously occurred events [10]. The method in [11], on the other hand, used a change detection algorithm that extracts foreground pixels. Nevertheless, that system suffers heavily from changes in illumination and hence requires fine-tuning of the algorithm parameters. In [12], the optical flow vectors were calculated based on the turbulence characteristics of the smoke and were used to eliminate non-smoke disturbances. Other methods calculated the motion direction of smoke by assuming grayscale invariance in the optical flow algorithm [13]. Recent papers also modeled a Markovian process by considering the motion of the fire [14]. Later, Hidden Markov models [15] were used to distinguish between flame and flame-colored objects. These models evaluated the spatial color variations in flame to reach a final decision.
The state-of-the-art architecture of a fire/smoke detector can be summarized in three steps: pixel-wise classification of fire/smoke, region-based segmentation, and analysis of the resulting regions. In this paper, a comparative analysis of different segmentation algorithms is carried out to find the most appropriate one for fire and smoke detection. The experiments are conducted on Infrared (IR) datasets available online. Moreover, different feature extraction methods such as optical flow, SIFT flow and divergence are also evaluated. A feature vector is computed for each of the IR videos and passed on to the segmentation algorithms. In particular, the main methods used in the segmentation of fire and smoke are K-Means Clustering [16], Gaussian Mixture Models (GMM) [17], Markov Random Fields (MRF) [18] and Gaussian Markov Random Fields (GMRF) [19]. Finally, the confusion matrix and accuracy are computed to analyze the efficiency of the system.
The novelty of this approach is the following:
• It fuses intensity, divergence, and optical flow-based features to obtain the most discriminative features for segmentation. The divergence and optical flow features are chosen since they measure the flow at a given point and the displacement of pixels from one frame to the next. These motion features are combined with the intensity values, which represent the variations in temperature, to form the feature vector used for segmentation.
• It cascades the feature extraction with an MRF framework for the segmentation of fire and smoke. The main advantage of the proposed system is that it learns in an unsupervised way by modeling the likelihood of the data. The set of values of the latent variable is known in advance: for this experiment, it can take 3 values corresponding to fire, smoke and background. Since these unsupervised algorithms support pre-training, they can be used in real time for testing. Although various deep learning strategies are currently used to obtain high performance, they are supervised algorithms and hence require a large amount of labeled data for training [39], [40]. In real-time firefighting scenarios, the availability of labeled data of both fire and smoke is limited.
• The proposed system has low computational complexity and few trainable parameters. It also requires less training data to obtain high accuracy in segmentation. Multispectral systems can be developed using these algorithms by mixing IR and UV [41] data. Further, the feature extraction also allows information from different sensors to be fused, so that fire, smoke, and background can be classified accurately.

II. PROPOSED METHOD
The proposed technique has two main stages: feature extraction and segmentation. The first stage characterizes every pixel of a sequence of images by several features. The first feature is the magnitude, which expresses the pixel temperature at every instant. The second is the optical flow, an estimate of the speed of the particles moving in the sequence of images. This feature has two dimensions, which can be expressed in polar or rectangular coordinates. The last feature is the divergence of the velocity vector field of the images.
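As a minimal sketch of how such a per-pixel feature vector could be assembled, the snippet below converts a rectangular flow vector to polar form and stacks it with the intensity and the divergence. The function name `pixel_features` and the ordering of the features are illustrative assumptions, not the layout used in the experiments.

```python
import math


def pixel_features(intensity, u, v, div):
    """Return [intensity, flow magnitude, flow angle, divergence] for one pixel.

    The feature order is an assumption made for this illustration.
    """
    magnitude = math.hypot(u, v)   # speed of the apparent motion
    angle = math.atan2(v, u)       # direction in radians, in (-pi, pi]
    return [intensity, magnitude, angle, div]


# Hypothetical values for a single pixel.
features = pixel_features(intensity=200.0, u=3.0, v=4.0, div=0.5)
```

The rectangular form `[intensity, u, v, div]` would carry the same information; the polar form simply separates speed from direction.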
We use the feature images to segment the pixels into different classes in an unsupervised way. Clustering methods construct a set of class-conditional likelihood functions for each of the possible classes, and posterior probabilities of the classes given the observed pixels. The segmentation is completed by deciding the class of each pixel based on the maximum value of those posteriors (maximum a posteriori criterion). The candidate algorithms for segmentation are K-means, Gaussian Mixture Model (GMM), Markov Random Field (MRF) and Gaussian Markov Random Field (GMRF).
These algorithms are based on a probabilistic model of the observable data $x_i$ as a function of its class. The data can be representative of one of $K$ classes, so each pattern $x_i$ has an associated latent variable $z_i \in \{1, \dots, K\}$. The first two algorithms assume that the latent variables are independent and that the observable data are conditionally independent, that is, $\forall i, j:\ p(x_i \mid x_j, z_i = k) = p(x_i \mid z_i = k)$. The models for K-means and GMM are usually represented by Gaussian functions. In K-means, the $K$ class-conditional distributions are identical and isotropic. Thus, the negative log likelihood of an observation is, up to a constant, proportional to its squared Euclidean distance to each of the $K$ means. The posteriors are simply approximated as 1 for the distribution with the closest mean and 0 for the rest. GMM assumes variable covariance matrices, which gives more flexibility to the model, and the posteriors are computed using Bayes' theorem from the data likelihood and the latent variable priors. The MRF model [36]-[38] uses a likelihood for the data identical to that of the GMM, but it assumes that there is a relationship between a pixel and its neighbors, so the latent variables in the same image are modeled using an undirected graph. The labels are usually updated using the Iterated Conditional Modes (ICM) method. The GMRF models the likelihood in a similar way, and the random variables associated with the pixels are considered to be jointly Gaussian. The methods are summarized below.
A. Feature Extraction
1) Optical Flow: The proposed fire detection algorithm takes advantage of one of the visually detectable characteristics of fire, i.e. motion. Here the motion estimation of fire is done using the Horn-Schunck Optical Flow method [20]. This algorithm makes use of the flow vectors of moving objects over time to detect moving regions in an image. It computes a 2-dimensional vector, known as the motion vector, which indicates the velocity and direction of each pixel between two consecutive frames in a time sequence. The main assumption made in this approach is intensity constancy: intensity values are preserved by moving objects from frame to frame.

Fig. 1: Optical Flow on video frame
Another assumption in optical flow is that objects at time $t$ will still be in the image at time $t+1$, but with a slight displacement. These assumptions can be expressed in terms of intensities as
$$I(x, y, t) = I(x + \Delta x,\ y + \Delta y,\ t + \Delta t).$$
Using the Taylor series expansion of the right-hand side and assuming the movement to be small, we obtain the following first-order approximation:
$$I(x + \Delta x,\ y + \Delta y,\ t + \Delta t) \approx I(x, y, t) + I_x \Delta x + I_y \Delta y + I_t \Delta t.$$
Thus the 2D motion constraint equation is
$$I_x V_x + I_y V_y + I_t = 0,$$
where $V_x$ and $V_y$ are the $x$ and $y$ components of the velocity and $I_x$, $I_y$ and $I_t$ are the derivatives of the image at $(x, y, t)$. This single equation has no unique solution since it contains 2 unknowns. This is called the aperture problem of optical flow systems: the component of the motion perpendicular to the gradient (i.e., parallel to the edge) cannot be determined. Horn and Schunck therefore added a further constraint in the form of a global regularization. It was assumed that the optical flow is smooth over relatively large areas and that the objects in the image undergo only rigid motion. The regularization penalizes the squared magnitude of the gradient of the optical flow, and the flow is obtained by minimizing the global energy functional
$$E = \iint \left[ (I_x u + I_y v + I_t)^2 + \alpha^2 \left( \|\nabla u\|^2 + \|\nabla v\|^2 \right) \right] dx\, dy.$$
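The optical flow vector is denoted as $V = [u(x, y), v(x, y)]^T$ and the regularization constant is given by $\alpha$. The parameter $\alpha$ controls the impact of the smoothing term, so larger values of it lead to a smoother flow. The energy functional is minimized by solving the associated multi-dimensional Euler-Lagrange equations, which are as follows:
$$\frac{\partial L}{\partial u} - \frac{\partial}{\partial x}\frac{\partial L}{\partial u_x} - \frac{\partial}{\partial y}\frac{\partial L}{\partial u_y} = 0, \qquad \frac{\partial L}{\partial v} - \frac{\partial}{\partial x}\frac{\partial L}{\partial v_x} - \frac{\partial}{\partial y}\frac{\partial L}{\partial v_y} = 0.$$
Here $L$ denotes the integrand of the energy functional, and further simplification gives the following two expressions:
$$I_x (I_x u + I_y v + I_t) - \alpha^2 \Delta u = 0, \qquad I_y (I_x u + I_y v + I_t) - \alpha^2 \Delta v = 0.$$
The Laplacian is approximated using finite differences and written as $\Delta u(x, y) \approx \bar{u}(x, y) - u(x, y)$, where $\bar{u}(x, y)$ is the weighted average of $u$ around the pixel location $(x, y)$. Since the solution at each pixel depends on the neighboring flow values, an iterative method was devised to solve the minimization problem. The update is repeated once the neighbors have been updated, and the resulting iterative scheme is
$$u^{k+1} = \bar{u}^{k} - \frac{I_x \left( I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t \right)}{\alpha^2 + I_x^2 + I_y^2}, \qquad v^{k+1} = \bar{v}^{k} - \frac{I_y \left( I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t \right)}{\alpha^2 + I_x^2 + I_y^2}.$$
Thus the velocity components $u$ and $v$ are computed for each pixel in the image. Fig. 1 shows the optical flow vectors on a video frame with fire and smoke.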
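The Horn-Schunck iteration described above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions rather than the implementation used in the experiments: frames are lists of rows, derivatives are simple finite differences, border pixels are replicated, and the local average uses the four direct neighbors with equal weights instead of the weighted neighborhood of the original method. The function name `horn_schunck` and its parameters are hypothetical.

```python
def horn_schunck(frame1, frame2, alpha=1.0, n_iter=100):
    """Estimate the flow (u, v) between two grayscale frames (lists of rows)."""
    h, w = len(frame1), len(frame1[0])

    def at(img, y, x):
        # Replicate border pixels (an assumption; other boundary rules work too).
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    # Spatial derivatives (central differences) and temporal derivative.
    ix = [[(at(frame1, y, x + 1) - at(frame1, y, x - 1)) / 2.0
           for x in range(w)] for y in range(h)]
    iy = [[(at(frame1, y + 1, x) - at(frame1, y - 1, x)) / 2.0
           for x in range(w)] for y in range(h)]
    it = [[float(frame2[y][x] - frame1[y][x])
           for x in range(w)] for y in range(h)]

    u = [[0.0] * w for _ in range(h)]
    v = [[0.0] * w for _ in range(h)]
    for _ in range(n_iter):
        new_u = [[0.0] * w for _ in range(h)]
        new_v = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                # Local averages of the previous flow estimate.
                u_bar = (at(u, y - 1, x) + at(u, y + 1, x)
                         + at(u, y, x - 1) + at(u, y, x + 1)) / 4.0
                v_bar = (at(v, y - 1, x) + at(v, y + 1, x)
                         + at(v, y, x - 1) + at(v, y, x + 1)) / 4.0
                num = ix[y][x] * u_bar + iy[y][x] * v_bar + it[y][x]
                den = alpha ** 2 + ix[y][x] ** 2 + iy[y][x] ** 2
                new_u[y][x] = u_bar - ix[y][x] * num / den
                new_v[y][x] = v_bar - iy[y][x] * num / den
        u, v = new_u, new_v
    return u, v


# A horizontal intensity ramp shifted one pixel to the right.
frame1 = [[10 * x for x in range(8)] for _ in range(5)]
frame2 = [[10 * (x - 1) for x in range(8)] for _ in range(5)]
u, v = horn_schunck(frame1, frame2)
```

For this synthetic ramp the interior values of `u` settle close to 1 pixel per frame and `v` stays at 0, which matches the imposed shift. Larger `alpha` values would spread the border estimates further into the interior.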
2) SIFT Flow: In the optical flow algorithm, the main assumptions are brightness constancy and the velocity smoothness constraint. But the pixel displacements between images of distinct scenes can be much larger than the motion vectors these assumptions allow. Thus, the assumptions used in classical optical flow may not hold. These issues are addressed using the SIFT flow algorithm [21]. First, SIFT descriptors are extracted at each pixel location, and these descriptors are assumed to be constant with respect to the pixel displacement field. The SIFT descriptors characterize brightness-independent and view-invariant image structures. Hence, when the image content differs significantly, matching these SIFT descriptors helps to establish meaningful correspondences across the images. The descriptors can be used even when the pixel displacements are as large as the image itself. The smoothness of the pixel displacement across images is still assumed, since nearby pixels tend to have similar displacements. Thus, the search for corresponding SIFT descriptors across the images is formulated as an optimization problem with the following cost function, as defined in [21]:
$$E(w) = \sum_{p} \min\left( \| s_1(p) - s_2(p + w(p)) \|_1,\ t \right) + \sum_{p} \eta \left( |u(p)| + |v(p)| \right) + \sum_{(p, q) \in \varepsilon} \left[ \min\left( \alpha |u(p) - u(q)|,\ d \right) + \min\left( \alpha |v(p) - v(q)|,\ d \right) \right].$$
The above function consists of a data term, a displacement term, and a smoothness term. The displacement vector at pixel location $p = (x, y)$ is given by $w(p) = (u(p), v(p))$, $\varepsilon$ corresponds to the spatial neighborhood of a pixel, and $s_i(p)$ denotes the SIFT descriptor extracted at location $p$ in image $i$; $t$ and $d$ are truncation thresholds and $\eta$ weights the displacement term. The first term uses an L1 norm to account for outliers in SIFT matching, whereas a thresholded L1 norm is used in the third term, along with the regularization parameter $\alpha$, to model discontinuities in the pixel displacement field. The optimization is done using a dual-layer loopy belief propagation algorithm.

Fig. 2: Sift Flow on video frame

Here the smoothness term is decoupled, which allows $u$ and $v$ to be separated during message passing [22]. Thus the complexity of one iteration of message passing is reduced from $O(n^4)$ to $O(n^2)$. The distance transform [23] is used to reduce the complexity further, since the objective function has the functional form of truncated L1 norms.
3) Divergence: The amount of flux entering or leaving a point is represented using divergence. When flux leaves a closed surface, the divergence is positive, whereas flux contraction gives negative divergence. The divergence operator takes a vector-valued function defining a vector field and outputs the change in density of the flow at each point in the form of a scalar-valued function. Given a vector field $D = U\,\mathbf{i} + V\,\mathbf{j}$, the divergence is given by
$$\nabla \cdot D = \frac{\partial U}{\partial x} + \frac{\partial V}{\partial y}.$$
Hence divergence measures the expansion or compression of an object in the field. By applying divergence, a clear contrast between the widening and narrowing of flow vectors can be visualized, and the change in scale in an image can be quantified. Likewise, the presence of sinks and sources in the flow can be found by applying divergence. A vector field is termed solenoidal when it has zero divergence at every point. Points with divergence less than zero are termed sinks, and points with divergence greater than zero are termed sources.
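A small numerical sketch can make the sign convention concrete. The function below approximates the divergence of a flow field with finite differences (central in the interior, one-sided at the borders). The name `divergence`, the list-of-rows layout and the unit pixel spacing are assumptions made for this illustration, and the field must be at least 2 pixels wide and high.

```python
def divergence(u, v):
    """Finite-difference divergence dU/dx + dV/dy of a 2-D flow field.

    u and v are lists of rows holding the x and y flow components.
    """
    h, w = len(u), len(u[0])
    div = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp the stencil at the borders (one-sided differences there).
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            du_dx = (u[y][x1] - u[y][x0]) / (x1 - x0)
            dv_dy = (v[y1][x] - v[y0][x]) / (y1 - y0)
            div[y][x] = du_dx + dv_dy
    return div


# Flow pointing away from the centre of a 5x5 grid: a source.
source_u = [[x - 2 for x in range(5)] for y in range(5)]
source_v = [[y - 2 for x in range(5)] for y in range(5)]
source_div = divergence(source_u, source_v)
```

For this outward-pointing field every entry of `source_div` is positive, a purely rotational field such as `u = -(y - 2)`, `v = x - 2` gives zero everywhere, and reversing the arrows of the first field turns the source into a sink with negative divergence.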
B. Segmentation Methods
1) K-means: K-means clustering [24] is an unsupervised learning algorithm. It partitions the data points into $K$ clusters such that each data point belongs to the cluster with the closest mean value. Based on feature similarity, the algorithm works iteratively to assign each data point to one of the $K$ clusters. The algorithm takes as input the data set and the number of clusters $K$. The initial estimates for the centroids are either generated randomly or selected from the data set. The algorithm then iterates between two steps:
• Data assignment step: Each centroid defines one of the clusters. Based on the squared Euclidean distance, each data point is assigned to its nearest centroid. Let $c_i$ be a centroid in the set $C$; then each data point $x$ is assigned to a cluster according to
$$\arg\min_{c_i \in C} \| x - c_i \|^2.$$
• Centroid update step: The centroids are recomputed as the mean of all data points assigned to each centroid's cluster,
$$c_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j,$$
where $S_i$ is the set of data points assigned to the $i$-th cluster centroid.
The algorithm iterates between these steps and converges when none of the data points change clusters, at which point the sum of the distances has reached a (local) minimum.
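The two alternating steps can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function name `kmeans` is hypothetical, the initial centroids are passed in explicitly so that the result is reproducible, and an empty cluster simply keeps its previous centroid.

```python
def kmeans(points, centroids, n_iter=100):
    """Cluster points (lists of floats) starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(n_iter):
        # Data assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster: converged
        labels = new_labels
        # Centroid update step: mean of the points assigned to each cluster.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical one-dimensional intensities forming two obvious groups.
labels, centroids = kmeans([[0.0], [1.0], [10.0], [11.0]], [[0.0], [10.0]])
```

Because the result depends on the starting centroids, a practical run would typically repeat the procedure from several initializations and keep the partition with the smallest total distance.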
2) Gaussian Mixture Models: A probabilistic model can be used to represent normally distributed subpopulations within a dataset. Gaussian mixture models [17] are such models; they learn the subpopulations without knowing which subpopulation a data point belongs to. This constitutes a form of unsupervised learning since the subpopulation assignment is unknown. The mixture of Gaussians is represented as
$$p(x_i) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k),$$
where $x_i$ denotes the observed variables. A Gaussian mixture model is characterized by the mixture component weights and by the component means and covariances. In the multivariate case $\mu_k$ denotes the mean whereas $\Sigma_k$ corresponds to the covariance matrix. For each latent variable $z_k$, we define a prior probability $\pi_k$. The total probability distribution normalizes to 1 with the constraint that
$$\sum_{k=1}^{K} \pi_k = 1.$$
When the number of components $K$ is known, expectation maximization (EM) is employed to estimate the parameters of the mixture model. EM is an iterative numerical technique for maximum likelihood estimation with the property that the likelihood of the data never decreases from one iteration to the next. Hence it reaches a local maximum at the end of the procedure. EM consists of two steps. In the Expectation step, the posterior probability $\gamma_{ik}$ that each data point belongs to each cluster is calculated using the current estimates of the mean vectors and covariance matrices. In the Maximization step, the cluster means and covariances are recalculated based on the probabilities calculated in the Expectation step. The steps are repeated until the algorithm converges, providing a maximum likelihood estimate. Thus, the main algorithm is as follows:
• Initialize the means, covariances and mixture component weights and evaluate the log likelihood.
• E-step: Calculate the posterior probability that the data point $x_i$ belongs to component $z_k$,
$$\gamma_{ik} = p(z_k \mid x_i, \pi, \mu, \Sigma) = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}.$$
• M-step: Re-estimate the parameter values using the $\gamma_{ik}$ calculated in the E-step. With $N_k = \sum_{i=1}^{N} \gamma_{ik}$,
$$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^T, \qquad \pi_k = \frac{N_k}{N}.$$
• Evaluate the log likelihood function using the new parameter values,
$$\ln p(x \mid \pi, \mu, \Sigma) = \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k).$$
If convergence has not been reached, the procedure is repeated from the E-step. Finally, density estimation and clustering are done using the fitted model.
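The EM loop listed above can be illustrated for the one-dimensional case, where each covariance matrix reduces to a scalar variance. The sketch below is an assumption-laden illustration rather than the implementation used in the paper: the names `gaussian_pdf` and `em_gmm_1d` are hypothetical, a fixed number of iterations replaces the convergence check, and a small variance floor (`var_floor`) is added to keep a component from collapsing onto a single point.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture with EM, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    n_comp = len(means)
    log_likelihoods = []
    gammas = []
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i][k] and the current log likelihood.
        gammas = []
        log_lik = 0.0
        for x in data:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k])
                     for k in range(n_comp)]
            total = sum(joint)
            log_lik += math.log(total)
            gammas.append([j / total for j in joint])
        log_likelihoods.append(log_lik)
        # M-step: re-estimate means, variances and weights from the gammas.
        for k in range(n_comp):
            n_k = sum(g[k] for g in gammas)
            means[k] = sum(g[k] * x for g, x in zip(gammas, data)) / n_k
            var_k = sum(g[k] * (x - means[k]) ** 2
                        for g, x in zip(gammas, data)) / n_k
            variances[k] = max(var_k, var_floor)
            weights[k] = n_k / len(data)
    return means, variances, weights, gammas, log_likelihoods


# Hypothetical intensities drawn from two well-separated groups.
data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
means, variances, weights, gammas, log_likelihoods = em_gmm_1d(
    data, means=[1.0, 4.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

A hard segmentation follows by assigning each sample to the component with the largest responsibility in `gammas`, which is the maximum a posteriori decision described earlier.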
3) Markov Random Field: In images, neighboring pixels exhibit similar properties such as intensity, texture and color. The Markov random field (MRF) [26] is an undirected graphical model which makes use of this contextual information and represents it in probabilistic terms. In Markov random field theory, a digital image consists of a discrete set of pixels which can be modeled using a set of random variables. Every pixel in an image is called a site, and each site is given a label $y$ which denotes the class assigned to that pixel. Let the sites of an $M \times N$ digital image be described as
$$S = \{ s_i \mid i = 1, \dots, MN \},$$
where $S$ is a rectangular grid. The relations between the sites in $S$ are defined using a neighborhood system, and a set of sites in $S$ is said to be a clique $c$ if every pair of sites in $c$ are neighbors of each other. Hence there exist two random fields: the label random field $y = \{ y_i \mid s_i \in S \}$ and the observable random field $x = \{ x_i \mid s_i \in S \}$.
The main goal of segmentation [27] is to find the optimum estimate of the hidden field $y$ from the observed field $x$, i.e., to estimate the correct class for each pixel. The MRF uses maximum a posteriori probability estimation to minimize the probability of misclassification.
The Hammersley-Clifford theorem [25] states that any MRF can be described by a probability distribution $P(y)$ of Gibbs form,
$$P(y) = \frac{1}{Z} \exp\left( -\frac{U(y)}{T} \right),$$
where $P(y)$, $Z$ and $T$ denote the prior probability, the normalization constant and the temperature parameter respectively. The energy function $U(y)$ can be written as
$$U(y) = \sum_{c \in C} V_c(y),$$
where $V_c(y)$ denotes the potential function. Here $U(y)$ is the sum of the clique potentials $V_c(y)$ over the set $C$ of all possible cliques. It is also assumed that one pixel has at most 4 neighbors. The cliques can be singletons, doubletons or of higher order depending on the neighborhood system. The potential of a clique consisting of two neighboring pixels is given by $V_c(y_i, y_j) = \beta\, \delta(y_i, y_j)$, where $\beta$ is the coupling coefficient; as it increases, the regions become more homogeneous.
The segmentation problem is solved using one of the MRF pixel labeling algorithms, Iterated Conditional Modes (ICM) [28]. This algorithm iteratively optimizes a statistical criterion by approximating the Maximum A-Posteriori (MAP) estimate. In the MAP approach, we define a posterior probability measure $P(y \mid x)$ and try to find an optimal labeling $\hat{y}$ which maximizes this probability. This is equivalent to minimizing the posterior energy function $U(y \mid x)$. ICM is a greedy algorithm which finds a local minimum. Starting from an initial estimate of the labeling, it chooses for each pixel the label giving the largest decrease of the energy function. The posterior energy $U(y \mid x)$ is given by the sum of the likelihood energy function and the prior energy function:
$$U(y \mid x) = U(x \mid y) + U(y).$$
In contrast to other approaches such as simulated annealing, ICM does not allow temporary increases of the energy on the way to the minimum. The ICM algorithm can be summarized using the following steps.
• Initialize by assigning an arbitrary labeling $y$ at step $n = 0$.
• At step $n$, for each site $s_i$, find the label that minimizes the posterior energy given the observation and the current labels of the other sites,
$$y_i^{(n+1)} = \arg\min_{y_i} U\left( y_i \mid x,\ y^{(n)}_{S \setminus \{ s_i \}} \right).$$
• Repeat the above step until convergence is obtained.
4) Gaussian Markov Random Field: A Gaussian Markov random field (GMRF) is an undirected Gaussian graphical model in which the values of the random field at the nodes are jointly Gaussian [29], [35]. GMRFs fit nicely into a Bayesian framework since they are analytically tractable. A GMRF is a continuously-valued random vector having a multivariate Gaussian distribution of the following form:
$$p(y) = \frac{|\Lambda|^{1/2}}{(2\pi)^{n/2}} \exp\left( -\frac{1}{2} (y - \mu)^T \Lambda\, (y - \mu) \right),$$
where $\Sigma^{-1} = \Lambda$ is the inverse covariance matrix and $n$ is the number of nodes. The quadratic form of the exponent is given as follows:
$$(y - \mu)^T \Lambda\, (y - \mu) = \sum_{i, j} \Lambda_{i,j}\, (y_i - \mu_i)(y_j - \mu_j).$$
There is no edge between $y_i$ and $y_j$ in the model when $\Lambda_{i,j} = 0$, and hence the neighborhood system is determined by the matrix $\Lambda$. The nonzero pattern of $\Lambda$ determines whether two nodes are conditionally independent.
Here $\Lambda$ is sparse, that is, $\Lambda_{i,j} = 0$ if and only if $y_i$ and $y_j$ are conditionally independent. In practice, GMRFs are defined using the quadratic energy function [34] given by
$$U(y) = \frac{1}{2} y^T \Lambda\, y - b^T y,$$
where $b \in \mathbb{R}^n$. For Bayesian image processing [30], the image is considered to have the same $M \times N$ rectangular lattice structure as in the MRF case. When a suitable prior $p(y)$ is chosen, the maximum a posteriori (MAP) estimate is computed with the ICM labeling algorithm to find the optimal labels for segmentation.
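The ICM steps listed for the MRF model can be sketched for a single-channel image as follows. This is a hedged illustration and not the authors' implementation: the name `icm` is hypothetical, the likelihood energy assumes one univariate Gaussian per class, and the doubleton potential is taken as a penalty of `beta` for every 4-neighbor carrying a different label, which is one common reading of the coupling term.

```python
import math


def icm(image, labels, means, variances, beta=1.0, max_sweeps=10):
    """Refine an initial labeling by Iterated Conditional Modes."""
    h, w = len(image), len(image[0])
    labels = [row[:] for row in labels]  # work on a copy
    n_classes = len(means)
    for _ in range(max_sweeps):
        changed = False
        for y in range(h):
            for x in range(w):
                neighbours = [labels[ny][nx]
                              for ny, nx in ((y - 1, x), (y + 1, x),
                                             (y, x - 1), (y, x + 1))
                              if 0 <= ny < h and 0 <= nx < w]

                def energy(k):
                    # Likelihood energy: negative Gaussian log density
                    # (constants dropped).
                    likelihood = ((image[y][x] - means[k]) ** 2
                                  / (2.0 * variances[k])
                                  + 0.5 * math.log(variances[k]))
                    # Prior energy: beta for each disagreeing neighbour.
                    prior = beta * sum(1 for n in neighbours if n != k)
                    return likelihood + prior

                best = min(range(n_classes), key=energy)
                if best != labels[y][x]:
                    labels[y][x] = best
                    changed = True
        if not changed:
            break  # a full sweep without changes: local minimum reached
    return labels


# Hypothetical 3x3 frame: a mildly bright pixel surrounded by background.
image = [[0, 0, 0], [0, 6, 0], [0, 0, 0]]
initial = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
smoothed = icm(image, initial, means=[0.0, 10.0],
               variances=[25.0, 25.0], beta=1.0)
```

With `beta=1.0` the isolated label in the centre is absorbed by its neighbors, whereas with `beta=0.0` the prior vanishes and each pixel keeps its maximum likelihood label. This is the homogenizing effect of the coupling coefficient described above.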

III. EXPERIMENTS AND RESULTS
The proposed experimental framework reads a captured video and extracts its frames. The infra-red videos are collected from Google and the frame rate of the videos is 30 fps. The information extracted from the first 10 frames acts as prior knowledge for the test images. Initially, training is done using images with fire and smoke, and the primary step during this phase is feature extraction. The main features used for experimentation are the intensity values of the image, the magnitude of the motion vectors, the SIFT flow features and the divergence of the image. The intensity values are taken into consideration since fire has a higher intensity value than smoke and background. The divergence of the image is another feature, which gives the amount of flow passing through a surface surrounding a pixel. Additionally, the SIFT flow features preserve spatial discontinuities and help to match pixel-wise SIFT features between two images.
The motion features are computed using the Horn-Schunck Optical Flow algorithm [31], which computes the optical flow for all the pixels in a frame. Since the optical flow is the distribution of apparent velocities of different objects in the frame, estimating it between frames allows the velocities of the objects in the video to be measured. The velocity along the x and y directions as well as the magnitude and direction can be calculated for consecutive frames. The computed feature vectors [32] are then given as input to different clustering techniques, namely K-Means, GMM, MRF and GMRF, to segment the smoke and fire from the desired frame. Fig. 4 shows the basic block diagram of the proposed algorithms. Each of these clustering methods [33] provides the indices of the pixels belonging to each of the classes. These pixel indices are mapped to the original frames to segment the region of interest. The final classification into fire, smoke, and background is done based on the intensity values belonging to those clusters: the cluster with the highest intensity values is labeled as fire, the cluster with intermediate intensity values as smoke, and the cluster with the lowest intensity as background. The performance evaluation is done by calculating the accuracy values for the different algorithms. However, classification accuracy alone is not sufficient to select a model since it hides the details required to better understand the performance of the model. Hence the confusion matrix was computed to overcome the limitations of using only the accuracy as a decision parameter for performance evaluation. The estimation of the confusion matrix involved manual labeling of the test frames to obtain the ground truth. A pixel-wise comparison is done between the segmented and ground truth values to obtain a summary of the predictions made by the algorithm for each class.
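The intensity-based naming of the clusters described above can be sketched as follows. The function name `name_clusters` and the flat-list inputs are assumptions for this illustration, and the sketch presumes that exactly three clusters are present.

```python
def name_clusters(intensities, labels):
    """Map cluster ids to class names by ranking their mean intensity.

    Assumes exactly three clusters: the brightest is named fire, the
    middle one smoke and the darkest one background.
    """
    sums, counts = {}, {}
    for value, k in zip(intensities, labels):
        sums[k] = sums.get(k, 0.0) + value
        counts[k] = counts.get(k, 0) + 1
    # Cluster ids sorted from lowest to highest mean intensity.
    ranked = sorted(sums, key=lambda k: sums[k] / counts[k])
    return dict(zip(ranked, ["background", "smoke", "fire"]))


# Hypothetical pixel intensities and the cluster id each pixel received.
intensities = [10, 12, 100, 110, 240, 250]
cluster_ids = [2, 2, 0, 0, 1, 1]
names = name_clusters(intensities, cluster_ids)
```

The clustering algorithms return arbitrary cluster indices, so a ranking step of this kind is what turns an unsupervised partition into named fire, smoke and background regions.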

A. Sample Segmentation comparison for different algorithms
Fig. 5 and Fig. 7 show the segmentation results using different combinations of feature extraction and clustering algorithms. The main features used for this experimentation are the divergence, intensity, SIFT flow and optical flow values. It can be seen that the cascaded system using these feature vectors and MRF segments fire and smoke more successfully than the rest of the algorithms. The other segmentation algorithms miss some pixels in the inner parts of the smoke and fire. The MRF-based approach was able to detect smoke regions which appear blurry and indistinguishable to the human eye. Hence this prior information can be of paramount importance for first responders during real-time firefighting situations.
In comparison with the test results shown in Fig. 5, it is evident that the proposed algorithm is able to disregard the unwanted artifacts from the frames.Thus, it is observed that only MRF is able to capture both the static and dynamic properties of the area of interest.The clear definition of the shape of fire and smoke will help in further analysis such as

B. Confusion matrix
The confusion matrix is an evaluation metric widely used for the analysis of semantic segmentation. It is a square matrix in which each row holds the instances of a true class and each column holds the instances of a segmented class. Hence $C_{mn}$ represents the number of pixels of class $m$ which are classified as class $n$. Fig. 6 shows the comparative analysis of the confusion matrices for the different methods. It can be seen that both GMM and MRF based segmentation were able to segment fire more accurately than the other algorithms. In the case of smoke, MRF was able to perform the segmentation with 90% accuracy, whereas the rest of the methods were unable to distinguish accurately between background and smoke. Further, Fig. 8 shows the final accuracy calculated from the confusion matrix for the proposed methods. It can be seen that the feature extraction methods boost the performance of the segmentation algorithms. It is also observed that feature extraction using optical flow, divergence and intensity values followed by segmentation using MRF gives the highest accuracy of 95.39%.
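The pixel-wise evaluation can be sketched as follows, using the row and column convention stated above (rows are true classes, columns are segmented classes). The function names and the integer class encoding are illustrative assumptions, and the label lists are made-up values rather than data from the experiments.

```python
def confusion_matrix(truth, predicted, n_classes):
    """Count pixels of true class m (row) labeled as class n (column)."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(truth, predicted):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    """Fraction of pixels on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def per_class_accuracy(cm):
    """Fraction of each true class that was labeled correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(cm)]


# Hypothetical flattened label maps: 0 = background, 1 = smoke, 2 = fire.
truth = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(truth, predicted, n_classes=3)
accuracy = overall_accuracy(cm)
```

The per-class values expose exactly the kind of detail that a single accuracy figure hides, for example a method that labels fire well but confuses smoke with background.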
IV. CONCLUSION
This paper introduces a novel method for fire and smoke characterization in IR images. This approach can perform unsupervised testing in real time while being trained in parallel in offline mode. The features proposed for this problem are optical flow, divergence, and intensity. Although SIFT flow features were also tested, they did not give any significant improvement in segmentation compared to the combination of the above features. The unsupervised segmentation methods used for the comparative analysis were K-Means, GMM, GMRF, and MRF. It was found that MRF showed the best classification performance, with an accuracy of 95.39%. Both visual and quantitative tests showed that MRF was able to distinguish fire, smoke, and background more precisely. Although GMM was able to segment most of the fire regions, it gave a much lower accuracy of 76% for smoke segmentation. Thus, the fusion of information in the proposed method was able to produce results that outperform the traditional approaches.
Future work aims to use multispectral data from UV and RGB sensors to make more accurate predictions for real-time firefighting scenarios. We would also like to extend the experimentation to dynamic and complex fire environments to test the robustness of these approaches.