Spatio-Temporal Self-Attention Network for Fire Detection and Segmentation in Video Surveillance

Convolutional Neural Network (CNN) based approaches are popular for various image/video related tasks due to their state-of-the-art performance. However, for problems like object detection and segmentation, CNNs still struggle with objects of arbitrary shape or size, occlusions, and varying viewpoints. This limitation makes them poorly suited to fire detection and segmentation, since flames can have an unpredictable scale and shape. In this paper, we propose a method that detects and segments fire regions with special consideration of their arbitrary sizes and shapes. Specifically, our approach uses a self-attention mechanism to augment spatial characteristics with temporal features, allowing the network to reduce its reliance on spatial factors like shape or size and take advantage of robust spatio-temporal dependencies. Our pipeline has two stages: in the first stage, we extract region proposals using spatio-temporal features, and in the second stage, we classify whether each region proposal is flame or not. Due to the scarcity of large fire datasets, we adopt a transfer learning strategy and pre-train our classifier on the ImageNet dataset. Additionally, our Spatio-Temporal Network is semi-supervised: it needs only one ground-truth segmentation mask per input frame sequence. In our experiments, the proposed method significantly outperforms state-of-the-art fire detection approaches, with a 2-4% relative improvement in F1-score for large-scale fires and a nearly 60% relative improvement for small fires at a very early stage.

setting, these rule-based approaches require constant threshold tuning and may not work well in a real-world environment.
In the past few years, the advent of deep learning has allowed automatic feature extraction. Convolutional Neural Networks (CNNs) are now state-of-the-art in many video/image related tasks [8]-[10]. The shift away from handcrafted features has allowed fire detection approaches to be more robust and adaptive to real-world settings. Some CNN-based examples include [11]-[13]. While these methods achieve good results in fire classification, one limitation is that they struggle when the fire is still small, which is exactly when it should be detected to prevent more damage. Additionally, they assume a single input image and do not take advantage of discriminative temporal features of fire like flickering, luminosity, and changes in color and warmth.
We take a different approach to this problem and consider the fundamental differences between fire and the common objects of interest in object detection/segmentation problems. Fire is unusual because of its unpredictable spatial characteristics. It can be big or small, and it can have arbitrary shapes. This property makes it harder to learn with conventional CNNs and adds a level of complexity to our fire detection and segmentation problem. Additionally, there is a scarcity of large video fire datasets with ground-truth segmentation masks, making supervised learning difficult.
In this paper, we propose a fire detection method that identifies fire regions, which enables first responders to understand the intensity and growth of the fire over time. Additionally, our network is explicitly designed to handle small-sized fires, making it suitable as an early detection system. To be more specific, our pipeline comprises two stages. In the first stage, we parallelize the data propagation through two streams, termed the Spatio-Temporal Network, which treats the spatial and temporal information separately. We design a semi-supervised network for the temporal stream that separates fire region features from the background in a video based on a given keyframe. We combine spatial and temporal features using self-attention, learn robust fire-distinctive dependencies, and extract quality segmentation masks that are used as region proposals. The second stage is an error-correcting mechanism that refines predictions. Additionally, it is designed to learn scale-invariant features to be more robust against arbitrary-sized fires. As one of our contributions, we constructed a fire video dataset with manually created ground-truth segmentation masks. Moreover, for evaluation, we also created a dataset containing videos of small-sized fires, some of which are synthetically generated. We performed several experiments to demonstrate our method's effectiveness, and we show that it compares favorably with the state-of-the-art, including on small-sized fire scenarios.
To summarize, our main contributions are:
• We propose a novel two-stage fire-detection approach. In the first stage, we implement two streams, termed the Spatio-Temporal Network. We design a semi-supervised network for the temporal stream that separates fire region features from the background in a video based on a given keyframe. The spatial stream uses static features from a single frame, such as color and texture.
• Our proposed approach uses self-attention on spatio-temporal features that are discriminative of fire, enabling our network to produce superior segmentation masks to use as region proposals. CNN-based binary classifiers classify these region proposals in the second stage, which is essential because some objects look similar to fire.
• We constructed a video dataset containing manually generated ground-truth segmentation masks. Additionally, since one of our goals is early fire detection, we created synthetic videos with small-sized fires for evaluation purposes.
The paper is organized as follows. Section II discusses some related works and the progression towards the state-of-the-art. In Section III, we explain the challenges related to this work and motivate our approach to justify the design choices we made to solve the problem. We explain our approach in Section IV and discuss evaluations in Section V before finally arriving at a conclusion in Section VI.

II. RELATED WORK
In the last few years, video surveillance has become nearly a de facto standard in various fields, including anomaly detection [14], pedestrian detection [15], and fire detection [11]. Multiple attempts have also been made to find more effective and efficient methods for coding surveillance videos [11], [16], [17].

A. FIRE DETECTION
Early works on image-based fire detection rely on handcrafted features descriptive of fire. For instance, Töreyin et al. [3] use a wavelet transform to extract temporal features, together with rule-based decisions that rely on thresholds to identify fire regions. Chen et al. [6] used the RGB and HSI color spaces to analyze fire behavior across multiple frames and proposed heuristics to detect fire regions. Vipin et al. [18] used the YCbCr color space to separate luminance from chrominance and classify whether pixels belong to fire regions. Recent works are CNN-based and move away from handcrafted features and heuristics [11]-[13]. For instance, Sharma et al. [19] investigated fire detection by fine-tuning the popular VGG16 and ResNet50 networks. Muhammad et al. [20] applied a model similar to GoogLeNet to extract features from the image for early-stage fire detection. In [11], they also explored the light-weight SqueezeNet [21] for fire detection and localization. Dunnings et al. [12] use super-pixels with CNN architectures based on InceptionV1, AlexNet, and VGG16 for fire detection. As part of their effort, the CNN models are simplified by keeping only some convolution, pooling, and dense layers to decrease model complexity while preserving accuracy. Using Multiple Instance Learning, Aktas et al. [22] extend CNN-based fire detection to video sequences. Xie et al. [23] used both deep static features and motion-flicker-based dynamic features for detecting fire. The authors of [24] used a multi-scale feature extraction mechanism based on AlexNet to capture spatial detail of fire in an image, and they apply channel attention to emphasize the contribution of different feature maps. Oh et al. [25] presented a method for detecting wildfires using the light-weight EfficientNet framework, using the focal loss to address the class imbalance problem. Wang et al. [26] proposed suspicious-region localization using a Cauchy-mixture model in a five-dimensional feature space. Moreover, they designed a light-weight Squeeze-and-Excitation ShuffleNet for classifying the suspicious regions. Li et al. [27] designed a dilated convolutional network for fire localization and classification, which even performs better than fine-tuned CNNs. Shen et al. [28] used a one-stage detector, such as YOLO, to detect flames. Kim et al. [29] applied Faster R-CNN to detect suspected fire regions based on spatial features, and then used an LSTM to interpret the dynamic fire behavior. Apart from detection, some CNNs also segment the fire in an image [30], [31]. One limitation of CNN-based approaches is that they struggle when fire regions are small, which is an inherent limitation of conventional CNNs due to their fixed-size receptive fields [32]. To mitigate this problem, we incorporate design decisions [33], [34] that better preserve localization information. Additionally, we incorporate temporal features and show that they can further improve segmentation performance.

B. OBJECT DETECTION AND SEGMENTATION
There is a wide variety of possible applications for object detection [35]-[37] and segmentation [38]-[41], including remote sensing [42]-[44], object counting [45]-[47], and image editing [48]-[52]. In this work, we focus on fire detection and segmentation, which comes with unique challenges. For instance, objects found in popular datasets [53], [54], like cats, dogs, or cars, usually have a defined shape. Fires, on the other hand, are unpredictable: a fire can have an arbitrary shape, size, and even location in the image, making it harder to learn. Additionally, there are no large datasets containing fire with ground-truth segmentation masks, adding another layer of complexity. To address these limitations, our method is trained in a semi-supervised manner and only requires the ground-truth mask of one frame. Additionally, we adopt a transfer learning strategy and pre-train our network on ImageNet [54] to learn background information.

C. TEMPORAL FEATURES
Early approaches on fire detection using handcrafted features harnessed the power of temporal features through wavelet transforms or frame differences [3], [6], [18]. One advantage of using temporal features for this problem is that fire behaves very distinctively across video frames. The presence of fire results in flickering luminosity, changes in color warmth, and rapid optical flow movements. Instead of relying on handcrafted features, we use convolution layers to learn temporal features. We augment spatial with temporal features using a self-attention mechanism, popular in NLP [55], to learn Spatio-temporal dependencies useful for segmenting fire regions.

D. SPATIO-TEMPORAL ATTENTION
Visual attention has been broadly applied in video-related tasks [56]- [59]. Liu et al. [56] enhanced the vanilla LSTM network's ability by appending Spatio-temporal attention for 3D action recognition, which selectively focuses on the action sequence's discriminative joints with the help of global contextual features from skeleton data. In a further study, Liu et al. [57] introduced a dynamic attention mechanism to progressively enhance recognition capability and improve network performance. Wang et al. [58] presented a recurrent Spatio-temporal attention model that adaptively learns essential information from video context to intensify the ability of action representations. Wang et al. [60] introduced a nonlocal module to compute the spatial-temporal dependencies. In work for video captioning, Yan et al. [59] proposed an encoder-decoder architecture by embedding Spatio-temporal attention; thus, the decoder chooses essential regions from the most appropriate temporal segments for word prediction dynamically.

III. MOTIVATION
To reduce losses in fire disasters, we propose a method that can detect and segment fire regions in videos. Unlike traditional sensor-based technology, our approach can recognize small fires, enable early detection, and track their intensity progression through segmentation masks. We argue that since fires change unpredictably in size and shape, special design considerations should be made. To tackle the arbitrary size and shape of fire, we take inspiration from popular network architectures like [33], [34] and incorporate a skip structure between the encoder and the decoder path. Additionally, to reduce our network's reliance on spatial features like size or shape, we incorporate temporal features learned by 3D convolution layers. We use an attention mechanism to augment spatial with temporal features. As shown in our experiments, this strategy allows the network to learn spatio-temporal dependencies that improve its segmentation quality. Moreover, we apply a two-stage pipeline similar to existing object detection networks [35], [61]. The first stage extracts fire regions from the background based on a keyframe, and the second stage classifies each region. However, unlike object detection networks, our region proposals are segmentation masks, which provide information about the fire's size and intensity. This feature makes our method especially useful as a fire detection system.

IV. PROPOSED APPROACH
We propose a fire detection approach sensitive to fires of varying sizes, from small to big. As shown in Fig. 1, our method has two stages: (1) region proposal and (2) classification. In the region proposal stage, we use a spatio-temporal network that adopts self-attention to augment spatial with temporal features and extract high-quality segmentation maps. In the second stage, we use a classifier network to detect and verify fire regions accurately. This section is arranged as follows: in IV-A, we elaborate on the first stage and discuss the Spatio-Temporal Network, followed by the extraction of region proposals in IV-B. Lastly, in IV-C, we discuss the fire classifier used in the second stage.

A. SPATIO-TEMPORAL NETWORK
Network Overview. As its name suggests, the Spatio-Temporal Network takes advantage of spatial and temporal features to extract fire segmentation masks. In Fig. 2, we show an overview of the network. It has three major parts: (1) TemporalNet, (2) SpatioNet, and (3) FuseNet. TemporalNet learns features related to the time component and takes in a sequence of frames $f_i$ to $f_{i+T}$, where $f_i$ is the initial frame and $f_{i+T}$ is the final frame. SpatioNet, on the other hand, only takes in the single frame $f_{i+T/2}$ as input. TemporalNet and SpatioNet each output a 64-channel feature map. In our implementation, TemporalNet takes in frames $f_i$ to $f_{i+14}$, and SpatioNet takes in frame $f_{i+7}$. Inspired by [62], we concatenate the feature maps and pass them through a 1 × 1 convolution layer, which effectively learns how to reduce the number of channels. Finally, FuseNet takes in the output of the 1 × 1 convolution layer and learns spatio-temporal dependencies using a self-attention mechanism. Self-attention between spatial and temporal features extracts important relationships, such as how certain texture regions behave across time. These relationships are beneficial for fire detection and segmentation, as revealed in our experiments, which are discussed in Section V-D.
Training Overview. The network is trained in a multistage manner: we first train SpatioNet and TemporalNet, and then FuseNet. SpatioNet and TemporalNet are trained independently to extract the fire segmentation mask of frame $f_{i+T/2}$. As shown in Fig. 2, these networks each output a 64-channel feature map. During the training stage, however, we augment each of these networks with an additional layer that outputs an $H \times W$ tensor corresponding to a segmentation mask. We phrase the segmentation problem as pixel-wise classification and optimize the networks to minimize the binary cross-entropy, also known as log loss.
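As an illustration, the pixel-wise loss described above can be sketched in plain Python. This is a minimal sketch rather than the training code: the function name `binary_cross_entropy` and the clamping constant `eps` are assumptions introduced here, and the masks are represented as nested lists of per-pixel fire probabilities and 0/1 labels.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy (log loss) over all pixels of an H x W mask.

    pred   -- H x W nested list of predicted fire probabilities in [0, 1]
    target -- H x W nested list of ground-truth labels (1 = fire, 0 = background)
    eps    -- assumed clamping constant that keeps log() away from zero
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


# An uninformative prediction of 0.5 everywhere costs ln(2) per pixel.
loss = binary_cross_entropy([[0.5, 0.5]], [[1, 0]])
```

In a PyTorch implementation, the same quantity would typically come from a built-in loss rather than a hand-written loop.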

1) SpatioNet
Inspired by UNet++ [34], we use skip pathways structure to reduce the semantic gap between encoder and decoder feature maps. As shown in Fig. 3, our SpatioNet utilizes 2D VGG blocks, as depicted in Fig. 4, which concatenate the output of the previous block and the corresponding up-sampled output of the lower block. The dense skip connection enables the shallow layers to share information with deep layers easily. We use this design because localization information can be found in the shallow layers. By connecting it to deeper layers, we improve the network's ability to detect small-size fires, which is critical in early fire detection systems.
In Eq. 1, we formulate the output of each block as $B_{i,j}$. The blocks are denoted as $L_{i,j}$, where $i$ is the level of down-sampling and $j$ is the level of the skip pathway. The function $V(\cdot)$ is a VGG convolutional operation, $C(\cdot)$ is a concatenation, and $U(\cdot)$ is an up-sampling layer. At level $j = 0$, nodes receive only one input, from the previous layer. At level $j > 0$, nodes receive $j + 1$ inputs: $j$ of them are the outputs of the previous nodes in the same skip pathway, and one is the up-sampled output from the lower skip pathway. In our work, we use four layers of up-sampling and down-sampling.
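For reference, this description matches the nested skip-pathway recurrence of UNet++ [34]. One plausible way to write it with the symbols defined above is sketched below; the exact index conventions are an assumed reading of the text, not a verbatim copy of Eq. 1.

```latex
B_{i,j} =
\begin{cases}
V\left(B_{i-1,j}\right), & j = 0, \\[4pt]
V\left(C\left(\left[B_{i,k}\right]_{k=0}^{j-1},\; U\left(B_{i+1,j-1}\right)\right)\right), & j > 0.
\end{cases}
```

Under this reading, the $j > 0$ case concatenates the $j$ earlier outputs of the same level with one up-sampled output from the level below, which gives the $j + 1$ inputs mentioned in the text.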

2) TemporalNet
This sub-network learns features from a series of frames $f_i$ to $f_{i+T}$. These features are especially useful for our purpose because fire has specific temporal behavior. For instance, fire causes the luminosity of frames to flicker or their color temperature to change. It may also exhibit rapid movements across frames. As discussed previously (Section IV-A), we train TemporalNet to output a segmentation mask. Because ground-truth labeling is expensive, we propose an architecture that takes in frames $f_i$ to $f_{i+T}$ but only requires the ground-truth segmentation mask of $f_{i+T/2}$. TemporalNet's architecture is shown in Fig. 5. In the encoder, we use VGG blocks (shown in Fig. 4) with 3D convolutions to capture temporal behavior and max-pooling to reduce the feature map's dimensions. In the decoder, 2D up-sampling recovers the spatial (height and width) resolution and ultimately outputs a segmentation mask of the middle frame $f_{i+T/2}$. This strategy allows for a semi-supervised learning approach that only needs one frame's ground truth per input sequence. Nevertheless, this is not straightforward, since the encoder primarily deals with 4-dimensional temporal information while the decoder deals with 3-dimensional spatial information. To solve this problem, we use 1D max-pooling to reduce the temporal dimension of the feature maps coming from the contracting path of the encoder. The decoder uses 2D VGG blocks. We also incorporate skip connections from the encoder to the decoder path, which extract multi-scale features and, through the 1D max-pooling, retain detailed temporal information. There are five convolutional layers and four max-pooling layers in the encoding path. In the decoding path, there are four up-sampling layers and four convolutional layers.
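The bridge between the temporal encoder and the spatial decoder can be illustrated with a small sketch. It assumes, for simplicity, that the temporal axis of a single-channel feature volume is collapsed in one max-pooling step; the text does not state whether the pooling is applied in one step or several, and the helper names `temporal_max_pool` and `middle_frame_index` are hypothetical.

```python
def temporal_max_pool(features):
    """Collapse a T x H x W feature volume to H x W by taking the max over time.

    Assumption: the whole temporal axis is pooled at once, so a 2D decoder
    block can consume the result directly.
    """
    num_frames = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [max(features[t][y][x] for t in range(num_frames)) for x in range(width)]
        for y in range(height)
    ]


def middle_frame_index(start, sequence_length):
    """Index of the frame whose mask supervises the sequence, i.e. f_{i+T/2}."""
    return start + (sequence_length - 1) // 2


# Two frames of a 1 x 2 feature map: the strongest response per pixel survives.
pooled = temporal_max_pool([[[1, 5]], [[3, 2]]])

# A 15-frame input (f_i to f_{i+14}) is supervised by the mask of f_{i+7}.
keyframe = middle_frame_index(0, 15)
```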

3) FuseNet
As shown in Fig. 2, we combine the features extracted from SpatioNet and TemporalNet using concatenation and a 1 × 1 convolution layer. Within FuseNet, we use a self-attention mechanism inspired by [63] to extract dependencies between spatial and temporal features. An overview of FuseNet is shown in Fig. 6. It has a Self-attention module between a down-sampling encoder and an up-sampling decoder. Before sending the feature maps into the Self-attention module, we down-sample them to reduce the computation in the module. Fig. 7 shows an overview of the Self-attention module. Its goal is to obtain a matrix $S \in \mathbb{R}^{N \times N}$, where each entry $S_{ij}$ denotes the impact of the $i$-th position on the $j$-th position. This impact is regarded as self-attention, and it can learn pairwise correlations of features. Since FuseNet takes in spatio-temporal features, it can effectively learn how each image region behaves in time with respect to the other regions. These features are especially critical for fire detection because a fire in one part of the image generally affects the surrounding areas. For instance, the surrounding area's luminosity, color temperature, and shadow movements are correlated with the fire's intensity and behavior. Specifically, to compute the Self-attention matrix $S$, we first use three 1 × 1 convolution layers to transform the encoder output $X$ into three different feature spaces, $X_A$, $X_B$, and $X_C$. We reshape the feature maps $X_B$ and $X_C$ to $\mathbf{B}$ and $\mathbf{C}$, where $\mathbf{B}, \mathbf{C} \in \mathbb{R}^{C \times N}$ and $N = H \times W$, and we reshape and transpose $X_A$ to $\mathbf{A}$, where $\mathbf{A} \in \mathbb{R}^{N \times C}$.
Applying Softmax to the product of $\mathbf{A}$ and $\mathbf{B}$, we get $S$, which is formally defined in the following equation:
$$S = \mathrm{softmax}\left(\mathrm{dot}(\mathbf{A}, \mathbf{B})\right)$$
Next, we perform another matrix multiplication between $\mathbf{C}$ and $S$, and then reshape the result to $\mathbb{R}^{C \times H \times W}$. Finally, we perform an element-wise sum with the input feature maps $X$ to get the final output $Y$, formally defined as follows:
$$Y = \alpha \cdot \mathrm{dot}(\mathbf{C}, S) + X$$
Herein, $\mathbf{C}$ denotes the reshaped output of $X_C$ and dot(·) denotes matrix multiplication. Inspired by [63], $\alpha$ is a learnable parameter. Lastly, in the decoder shown in Fig. 6, we up-sample the feature maps back to the size of the input image and use a 1 × 1 convolution layer to get the final fire segmentation mask.
In our experiments, we show that the Self-Attention module improves the performance of the network.
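The computation performed by the Self-attention module can be sketched on small matrices in plain Python. This is an illustrative sketch under stated assumptions, not the network code: the three projected and reshaped feature maps are passed in directly instead of being produced by 1 × 1 convolutions, the softmax is assumed to normalize each column of the score matrix so that the weights feeding output position j sum to one, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def softmax_columns(m):
    """Softmax over each column (assumed normalization axis)."""
    normalized = []
    for col in zip(*m):
        peak = max(col)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        normalized.append([e / total for e in exps])
    return [list(row) for row in zip(*normalized)]


def self_attention(x, a, b, c, alpha):
    """Return Y = alpha * (c . S) + x with S = softmax(a . b).

    x, b, c -- C x N matrices (channels x flattened positions)
    a       -- N x C matrix (reshaped and transposed projection)
    alpha   -- scalar weight of the attention branch (learnable in the network)
    """
    s = softmax_columns(matmul(a, b))  # N x N attention matrix
    attended = matmul(c, s)            # C x N, each column mixes all positions
    return [
        [alpha * attended[i][j] + x[i][j] for j in range(len(x[0]))]
        for i in range(len(x))
    ]


# One channel, two positions; zero scores give uniform attention weights,
# so every position receives the mean of c (here 2.0) on top of its input.
y = self_attention(x=[[1.0, 3.0]], a=[[0.0], [0.0]], b=[[1.0, 3.0]],
                   c=[[1.0, 3.0]], alpha=1.0)
```

With `alpha` set to zero the module reduces to the identity, so the attention branch can only add information on top of the input features.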

B. FINDING REGION PROPOSAL
After running the Spatio-Temporal Network, we extract the region proposals from the segmentation mask. To obtain them, as shown in Fig. 8, we binarize the segmentation mask and compute a bounding box for each connected component using OpenCV's connected-component labelling, an algorithmic application of graph theory that determines the connectivity of blob-like areas in a binary image. We then extend each connected component's region to find a single region of interest. Specifically, each region is converted from [x, y, width, height] to [x, y, x + width, y + height], overlapping bounding boxes are merged, and this process is repeated iteratively until all overlapping bounding boxes are consolidated into one region proposal, as shown in Fig. 9. These region proposals are classified in the next stage.
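The merging step can be sketched as follows. This is a minimal sketch assuming that boxes arrive in [x, y, width, height] form (for example from OpenCV's `cv2.connectedComponentsWithStats`, which is one plausible way to obtain them) and that boxes whose edges touch count as overlapping; the function names are hypothetical.

```python
def to_corners(box):
    """Convert [x, y, width, height] to [x1, y1, x2, y2]."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def overlaps(a, b):
    """True if two corner-format boxes intersect (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge_boxes(boxes):
    """Iteratively merge overlapping boxes until no two boxes overlap."""
    boxes = [to_corners(b) for b in boxes]
    merged = True
    while merged:  # repeat, because a merged box can reach new neighbours
        merged = False
        result = []
        for box in boxes:
            for i, other in enumerate(result):
                if overlaps(box, other):
                    # Replace the stored box with the union of the two.
                    result[i] = (min(box[0], other[0]), min(box[1], other[1]),
                                 max(box[2], other[2]), max(box[3], other[3]))
                    merged = True
                    break
            else:
                result.append(box)
        boxes = result
    return boxes


# Two overlapping components become one proposal; the distant one stays separate.
proposals = merge_boxes([(0, 0, 10, 10), (5, 5, 10, 10), (100, 100, 5, 5)])
```

The outer loop is what makes the process iterative: a union box may overlap a box that neither of its parts touched, so passes repeat until nothing changes.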

C. FIRE BINARY CLASSIFIER
The Fire Binary Classifier takes in region proposals from the first stage and identifies whether each one contains fire. Inspired by DenseNet [64], our classifier connects each layer to every other layer in a feed-forward way. For each layer, the feature maps of all previous layers are used as input, and its own output feature maps are used as input for all subsequent layers. This strategy mitigates vanishing gradients and enhances feature propagation. In this work, we call our classifier DenseFire, derived from the original DenseNet [64]. In our experiments, we also compare it with the classifiers used in state-of-the-art fire detection approaches, including InceptionV1 [12] and SqueezeNet [11].
Because the fire dataset is small, we adopt a transfer learning strategy and first train our classifier network on another task, indirectly enhancing fire classification performance. Specifically, we pre-train our classifier on ImageNet [65] so that it can learn useful features that discriminate background objects from fire.

V. EXPERIMENTS
In this section, we describe the experimental setting in detail. First, we compare the sub-networks of our Spatio-Temporal Network and evaluate their segmentation quality. We replace our FuseNet with UNet to show that the Self-attention module increases our network's segmentation performance. Then we perform several groups of experiments to demonstrate the viability of our method. We compare our two-stage architecture with other state-of-the-art methods on publicly available and self-created fire datasets, including small-sized fires. We compare the computational cost of different state-of-the-art classifiers on the NTUST fire dataset. Finally, we test the robustness of the proposed framework.

A. IMPLEMENTATION DETAILS
All experiments are conducted on a machine with an Intel(R) Core(TM) i7-7700K CPU, 64 GB of RAM, and an NVIDIA GTX 1080 Ti graphics processing unit (GPU) with 11 GB of memory. All code is implemented using the PyTorch deep learning framework on Ubuntu. We independently train SpatioNet and TemporalNet to optimize a binary cross-entropy loss. After these networks are trained, we freeze their weights and then train FuseNet. This multistage strategy allows us to train our model despite memory constraints. Each network is trained using the Adam optimizer with a learning rate of 3e-4. We set the batch size to 4 and train for 10000 epochs on the NTUST fire dataset.

B. DATASETS
Table 1 includes details about the datasets. Our approach aims to obtain the fire regions from a sequence of frames; therefore, we created our own NTUST dataset containing videos for training and testing. We collected two datasets (to be made available upon acceptance of the manuscript): the NTUST fire dataset and the Small-sized fire dataset, a subset of the NTUST dataset combined with synthetically generated small-sized fire videos.
NTUST fire dataset. We collected a total of 1033 videos, with 559 containing fires and 434 containing normal scenes. These videos contain diverse samples like scenes of burning wood, cars, and trash. They also contain objects similar to fire, like sunsets and flashing lights. Fig. 10 shows some examples from our NTUST fire dataset. For each video, we only take 15 frames and manually create a ground-truth segmentation mask for the 7-th frame. We create only one ground-truth mask per fire video to maximize the variety of scenes in our dataset and because manually creating segmentation masks is a tedious task.
Small-sized fire dataset. We define a small-sized fire as one occupying only 5% of the total pixels in the whole image. We also gathered small-sized fire videos from the internet for the test set. Additionally, we generate synthetic videos by blending fire videos and normal videos frame by frame, as shown in Fig. 11, and use these videos to augment our small-sized fire dataset. In total, we collected 100 small-sized fire videos, together with 200 normal videos sampled from the NTUST dataset. Fig. 12 shows some examples of small-sized fires from our dataset.
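As a rough illustration of how such samples could be produced and checked, the sketch below blends one fire pixel onto a background pixel and tests whether a binary mask qualifies as small-sized. The alpha-blending rule, the reading of the 5% figure as an upper bound, and all function names are assumptions made for this sketch; the exact blending procedure is not specified in the text.

```python
def blend_pixel(fire, background, alpha):
    """Alpha-blend one RGB fire pixel onto a background pixel (assumed rule)."""
    return tuple(round(alpha * f + (1.0 - alpha) * b)
                 for f, b in zip(fire, background))


def fire_fraction(mask):
    """Fraction of pixels labelled as fire in a binary H x W mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def is_small_fire(mask, threshold=0.05):
    """True if the mask contains fire covering at most `threshold` of the image."""
    return 0 < fire_fraction(mask) <= threshold


# A 10 x 10 mask with four fire pixels covers 4% of the image.
mask = [[0] * 10 for _ in range(10)]
mask[4][4] = mask[4][5] = mask[5][4] = mask[5][5] = 1
small = is_small_fire(mask)
```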
To ensure a fair evaluation and quantitatively assess the performance of our proposed method, we also used a publicly available dataset [7] and compared the results with other state-of-the-art techniques.
Foggia dataset [7]. It provides 31 video clips with 62690 frames covering different situations; only 14 of the video clips contain fire scenes. Sample video clips from the Foggia dataset are shown in Fig. 13; the fire region in each video occupies a relatively substantial proportion of the image.

C. EVALUATION CRITERIA
The following metrics are used to examine the quantitative performance of the proposed approach.
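The recall and precision are defined as:
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$
The F1-score is defined as:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The accuracy is defined as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$TP$ represents True Positives, the number of detected fires that are fires in the ground truth. $FP$ represents False Positives, the number of detected fires that are not fires in the ground truth. $FN$ represents False Negatives, the fires that were not detected. $TN$ represents True Negatives, the samples that are not fire in the ground truth and are predicted as not fire.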
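These metrics are straightforward to compute from the four counts. The sketch below is illustrative only; the function names and the convention of returning 0.0 for an empty denominator are assumptions introduced here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_metrics(tp, fp, fn, tn):
    """Recall, precision, F1-score, and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1_score(precision, recall),
        "accuracy": accuracy,
    }


# Hypothetical counts: 8 fires found, 2 false alarms, no misses, 10 true negatives.
metrics = detection_metrics(tp=8, fp=2, fn=0, tn=10)

# A recall of 1.0 with a precision of 0.984 gives an F1-score of about 0.992.
combined = round(f1_score(0.984, 1.0), 3)
```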

Segmentation Mask
In the first stage, our pipeline outputs a segmentation mask using the proposed Spatio-Temporal Network (STNet), a fusion of the sub-networks SpatioNet and TemporalNet. To analyze the individual contributions of SpatioNet and TemporalNet, we show the performance of each sub-network in terms of segmentation quality. As an evaluation metric, we use the dice coefficient (also known as F1-score) shown in Eq. (7), where $H$ and $W$ denote the height and width of the input image, $X$ denotes the semantic ground-truth, and $Y$ denotes the predicted segmentation mask. The dice coefficient is a commonly used metric to evaluate segmentation quality [33], [34].

$$\mathrm{Dice}(X, Y) = \frac{2 \sum_{i=1}^{H} \sum_{j=1}^{W} X_{ij} Y_{ij}}{\sum_{i=1}^{H} \sum_{j=1}^{W} X_{ij} + \sum_{i=1}^{H} \sum_{j=1}^{W} Y_{ij}} \tag{7}$$
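For binary masks, the dice coefficient can be sketched in a few lines. This is a minimal sketch assuming 0/1 masks stored as nested lists; the smoothing constant `eps`, which makes two empty masks score 1.0, is an assumption and not part of the definition above.

```python
def dice_coefficient(truth, pred, eps=1e-7):
    """Dice coefficient between two binary H x W masks.

    truth -- ground-truth mask X (1 = fire, 0 = background)
    pred  -- predicted mask Y
    eps   -- assumed smoothing term that avoids division by zero
    """
    intersection = sum(
        t * p
        for truth_row, pred_row in zip(truth, pred)
        for t, p in zip(truth_row, pred_row)
    )
    total = sum(map(sum, truth)) + sum(map(sum, pred))
    return (2.0 * intersection + eps) / (total + eps)


# The prediction recovers one of two fire pixels: 2 * 1 / (2 + 1) = 0.667.
score = dice_coefficient([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```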
In Table 2, we show the dice coefficient of SpatioNet, TemporalNet, and the Spatio-Temporal Network (STNet) on the test set of the NTUST dataset. Observe that the score of TemporalNet is higher than that of SpatioNet, highlighting the importance of temporal features in fire segmentation. Additionally, the result of our full network, the Spatio-Temporal Network, shows that fusing the spatial and temporal features achieves the best results. We also show the output segmentation masks of each network configuration in Fig. 14. In the first row, we show an example of a small-sized fire. Only SpatioNet failed to detect the fire, which supports our hypothesis that spatial features alone are not robust enough against arbitrarily sized objects. In the second row, we show an input sample containing many brightly colored objects that resemble fire. The output of SpatioNet shows that it is sensitive to these non-fire objects. TemporalNet, on the other hand, is harder to fool because few objects exhibit temporal features similar to fire. However, small patches on the right side of the image are still incorrectly labelled as fire. By combining spatial and temporal features, the Spatio-Temporal Network produces the best segmentation masks.
Fusion. Our FuseNet, as shown in Fig. 6, consists of a Self-attention module that learns global fire dependencies from temporal and spatial features. In this experiment, we analyze the contribution of the Self-attention module in terms of improvements in segmentation quality. First, we obtain the dice coefficient of FuseNet alone, and then we compare the dice coefficient of FuseNet with the Self-attention module against FuseNet with a UNet [33] structure. The resulting dice coefficient scores are shown in Table 3. The Self-attention module achieves a better score than UNet, which justifies its use. We attribute the success of self-attention to its ability to learn how each patch relates to the entire image. For fire segmentation, these relationships are critical because the presence of fire affects its surrounding regions.
Self-attention. Each stream of the Spatio-Temporal Network provides specific information. To further verify the significance of self-attention for the spatial and temporal streams when they produce the output mask directly, we also examine the model with attention added to the features of the individual streams. We compare the dice coefficients of SpatioNet with self-attention, TemporalNet with self-attention, and the Spatio-Temporal Network (STNet) with self-attention on the NTUST dataset. The outcomes are summarized in Table 4. The dice score of SpatioNet is lower than that of TemporalNet (Table 2), and adding self-attention to the individual streams makes little difference (Table 4). SpatioNet achieves a dice score of 0.775 with self-attention and 0.771 without it. Similarly, TemporalNet reaches a dice score of 0.840 with self-attention and 0.839 without it, which are almost identical. Additionally, the Spatio-Temporal Network result shows that fusing the spatial and temporal features with self-attention achieves the best results.
Two-stage classifier. For a more comprehensive validation of the two-stage classification, we use the ROC curve to evaluate the fire detection of our network. The average values of the area under the ROC curves for the NTUST dataset are shown in Fig. 15.a, where the true positive rate is plotted against the false positive rate. Fig. 15.b shows the precision-recall curves. It can also be seen that the AUPRC values for the two-stage pipeline (STNet+DenseFire) are relatively higher than those of STNet alone.

E. RESULTS ON THE NTUST DATASET
Segmentation results on the NTUST dataset. In this work, we use UNet [33] as a baseline. To validate the segmentation performance of the proposed framework, we also compare it against other deep learning-based models, namely UNet++ [34], AttUNet [66] and R2UNet [67]. Qualitative results of our STNet and the other deep CNN methods on the testing set of the NTUST dataset are shown in Fig. 16. The binary masks indicate that our model captures fire information well. UNet++ performs well compared to R2UNet and AttUNet, while the fire areas segmented by the conventional UNet model are the worst among all methods. The quantitative evaluation scores are listed in Table 5: our model achieves higher recall and F1-score (1 and 0.848, respectively).
Comparing results of the fire binary classifiers on the NTUST dataset. We compare our fire binary classifier with other state-of-the-art fire detection methods. In the second stage of our pipeline, we use DenseFire to decide whether the input contains fire. Normally, DenseFire takes in region proposals from the first stage of our pipeline, as shown in Fig. 17. In this experiment, however, we first test DenseFire on its own, without the first stage, and see how it compares to other methods. We compare with InceptionOnFire [12], based on the Inception network [68], and CNNFire [11], based on SqueezeNet [12]. We extracted a total of 13256 images from our NTUST dataset and used 80% for training and 20% for testing. We train each network as a binary classifier (fire or normal), and Table 6 reports each network's recall, precision, and F1-score. We observe that DenseFire achieves the best F1-score.

Method                 Recall   Precision   F1-score
InceptionOnFire [12]   0.9870   0.9499      0.9681
EMNFire [13]           0.7422   0.9729      0.8420
CNNFire [11]           0...

Sometimes, it is difficult to distinguish a real fire from a fire-like object at a long distance by relying only on the above rules. Therefore, we consider a two-stage classifier, in which the regions proposed by the first stage are re-classified by a binary classifier. The performance of each classifier combined with STNet is measured in terms of recall, precision, and F1-score and presented in Table 7. We can observe that the performance is further enhanced: the classifiers discard some of the region proposals that STNet identifies as fire. Our STNet+DenseFire achieves a recall of 100%, a precision of 98.4% and an F1-score of 99.2%, which indicates a fire detection system better suited to practical use.
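The F1-scores quoted above follow directly from the reported precision and recall values, since the F1-score is their harmonic mean. The short sketch below recomputes them; the helper names are illustrative assumptions.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the text above.
reported = {
    "InceptionOnFire": (0.9499, 0.9870),
    "EMNFire": (0.9729, 0.7422),
    "STNet+DenseFire": (0.984, 1.0),
}
recomputed = {name: f1_score(p, r) for name, (p, r) in reported.items()}
```

Rounded to the reported precision, the recomputed values agree with the stated F1-scores of 0.9681, 0.8420 and 0.992.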

F. RESULTS ON THE SMALL-SIZED FIRE DATASET
Segmentation results on the small-sized fire dataset. To test the versatility of our segmentation network, we compare it against various deep learning-based models, namely UNet++ [34], AttUNet [66] and UNet [33], on the small-sized fire dataset. Fig. 18 shows visual comparisons with these models. In row one, UNet cannot segment small fires, while UNet++ and AttUNet can do so only partially.
Using the proposed approach, we can segment fire regions with excellent quality. In row two, the proposed method and UNet++ correctly recognize the fire region, while AttUNet overestimates it. In the third row, UNet and AttUNet cannot accurately segment the fire, while UNet++ exceeds the fire area. Overall, STNet performs better than UNet++, AttUNet, and UNet. Our quantitative results, shown in Table 8, also confirm that our F1-score is the best among all methods, indicating a high degree of specificity and sensitivity in identifying small fires.
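For binary segmentation masks, the Dice coefficient and the pixel-level F1-score are the same quantity, 2TP / (2TP + FP + FN). The following minimal sketch computes it for masks stored as nested lists of 0/1 values; the function names and the toy masks are illustrative assumptions, not our evaluation code.

```python
def confusion_counts(pred_mask, true_mask):
    """Count pixel-level true positives, false positives and false negatives."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def dice_coefficient(pred_mask, true_mask):
    """Dice = 2TP / (2TP + FP + FN); defined as 1.0 when both masks are empty."""
    tp, fp, fn = confusion_counts(pred_mask, true_mask)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


# Toy 3x3 masks (hypothetical): 1 marks a fire pixel.
truth = [[0, 1, 1],
         [0, 1, 0],
         [0, 0, 0]]
pred = [[0, 1, 1],
        [0, 0, 0],
        [1, 0, 0]]
toy_dice = dice_coefficient(pred, truth)  # TP=2, FP=1, FN=1
```

Because a small fire covers only a few pixels, a single missed or spurious pixel changes this score noticeably, which is one reason small fires are a demanding test for segmentation.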
The classification CNNs have some convolutional layers followed by a few fully connected layers. In such a CNN, the image is converted into a vector, which is what is used for fire recognition. These networks are effective for fire recognition when fires are relatively large (Table 6). However, small fires still give them trouble (Table 9). The convolutional layers reduce the images from high to low resolution, and the fully connected layers cause a loss of spatial information. Consequently, the small-fire features extracted in the first layers (and there are only a few of them to begin with) disappear between the layers and are never actually used for classification. In Table 9, the very low recall values for InceptionOnFire, CNNFire, EMNFire and ShuffleNet reveal that these classifiers mislabel most small fires. In contrast, the segmentation network has no fully connected layers and contains only convolutional layers. The image is converted into a vector and then converted back to an image through an exact mapping that preserves the original structure, which is also known as pixel-based classification. STNet therefore provides a far more granular understanding of the fire in the video. As can be seen from Table 9, STNet has the best recall but a low precision, which implies a high number of false positives. DenseFire alone achieves poor recall but the best precision. This shows that detecting small fires is a nontrivial problem and that better performance can be achieved with a two-stage approach.
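The loss of small-fire features under repeated resolution reduction can be made concrete with simple arithmetic. The sketch below assumes a hypothetical classifier with a 224×224 input and five stride-2 stages; these numbers are illustrative assumptions and are not taken from the networks compared here.

```python
def size_after_downsampling(size_px: float, num_stride2_stages: int) -> float:
    """Side length (in feature-map cells) left after halving the resolution repeatedly."""
    return size_px / (2 ** num_stride2_stages)


INPUT_SIZE = 224   # assumed input resolution (hypothetical)
STAGES = 5         # assumed number of stride-2 stages (hypothetical)

final_map = size_after_downsampling(INPUT_SIZE, STAGES)
footprints = {fire_px: size_after_downsampling(fire_px, STAGES)
              for fire_px in (8, 32, 128)}
```

Under these assumptions the final feature map is 7×7 cells. A fire that is 128 pixels wide still spans 4 cells, whereas an 8-pixel fire shrinks to a quarter of a single cell, so its evidence is mixed with the surrounding background before the fully connected layers ever see it.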
In Table 10, we compare the results of our STNet using different binary classifier architectures in its second stage. The two-stage approaches achieve significantly better results than the single-stage approaches (presented in Table 9). The classifier in the second stage discards region proposals mistakenly identified as fire by STNet, which explains the success of the two-stage approaches.
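The interaction of the two stages can be sketched as a simple filter: the first stage proposes candidate regions with high recall, and the second stage keeps only those that the binary classifier accepts. In the sketch below, `propose` and `classify` are hypothetical stand-ins for STNet and DenseFire, and the boxes, scores and the 0.5 threshold are illustrative assumptions.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def two_stage_detect(frame,
                     propose: Callable[[object], List[Box]],
                     classify: Callable[[object, Box], float],
                     threshold: float = 0.5) -> List[Box]:
    """Keep only the region proposals that the second-stage classifier scores as fire."""
    proposals = propose(frame)                       # stage 1: high recall, some false positives
    return [box for box in proposals
            if classify(frame, box) >= threshold]    # stage 2: discard non-fire proposals


# Toy stand-ins (hypothetical): three proposals, one of them a fire-like false positive.
TOY_SCORES = {(10, 10, 20, 20): 0.95, (50, 40, 8, 8): 0.80, (90, 5, 30, 12): 0.10}


def toy_propose(frame) -> List[Box]:
    return list(TOY_SCORES)


def toy_classify(frame, box: Box) -> float:
    return TOY_SCORES[box]


detections = two_stage_detect(None, toy_propose, toy_classify)
```

In this toy run the low-scoring third proposal is discarded, which raises precision without reducing the recall obtained in the first stage.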

Method                 Recall   Precision   F1-score
InceptionOnFire [12]   0.1      1           0.182
CNNFire [11]           0.04     0.667       0.075
EMNFire [13]           0...

To validate the proposed framework's fire segmentation performance, we compare it against different deep learning-based models on the Foggia dataset, namely UNet++ [34], AttUNet [66] and UNet [33]. The qualitative results of our STNet and the other deep CNN methods are presented in Fig. 19. In the first row, AttUNet and UNet produce incorrect segmentations based on the color of the fire, and UNet++ overestimates the fire area. The proposed architecture, however, is able to distinguish it. In the second row, UNet is unable to segment the fire, while AttUNet exceeds the fire area. In the third row, all comparison methods except the proposed one show shortcomings in their segmentation results. Table 11 shows the quantitative evaluation metrics obtained on the Foggia dataset, which indicate that the proposed method produces better results than the other methods.
Comparing results of the fire binary classifiers on the Foggia dataset. We compare our results with other fire detection algorithms, namely InceptionOnFire [12], CNNFire [11], EMNFire [13], and ShuffleNet [69], using recall, precision and F1-score as metrics. The experimental outcomes are shown in Table 12. ShuffleNet [69] reaches a recall of 0.845, which is worse than the others. CNNFire [11] and EMNFire [13] perform similarly in terms of recall; however, the precision of EMNFire [13] is better than that of CNNFire [11]. DenseFire shows satisfactory performance in terms of precision. It is evident from Table 12 that our STNet+DenseFire surpasses the other methods in precision, recall, and F1-score, indicating a more reliable fire detection ability.

H. ANALYSIS OF COMPUTATIONAL COST
This section examines the performance of different deep learning models in terms of computational complexity, model complexity, and inference rate for fire detection.
The computational complexity of each deep learning model is estimated in multiply-adds, based on its floating-point operations (FLOPs). We compare the computational complexity of several deep learning models for fire detection: CNNFire [11], EMNFire [13], GNetFire [20], ShuffleNet [69], DenseFire and UNet [33]+DenseFire. As shown in Table 13, DenseFire requires 415×10^6 FLOPs. DenseFire (96.9% accuracy on the NTUST dataset and 80.3% on the small-sized fire dataset) has a 50% lower FLOPs count than CNNFire, yet CNNFire performs worse than DenseFire on both datasets. GNetFire requires 1500×10^6 FLOPs and performs well on the NTUST dataset (F1-score of 0.917 and accuracy of 90.2%), whereas on the small-sized fire dataset it only reaches an F1-score of 0.251 and an accuracy of 32.5%. EMNFire has the lowest FLOPs count, 27.7% lower than that of DenseFire. Compared to EMNFire, DenseFire improves the accuracy by 1.1% on the NTUST dataset and 31.5% on the small-sized fire dataset. ShuffleNet requires 542×10^6 FLOPs, a 27.7% higher FLOPs count than DenseFire. The accuracy of DenseFire is higher than that of ShuffleNet by 8.1% on the NTUST dataset and 15.2% on the small-sized fire dataset. Furthermore, EMNFire, GNetFire and ShuffleNet attain F1-scores of 0.180, 0.251 and 0.350, respectively, surpassing DenseFire in F1-score on the small-sized fire dataset. Because the above classifiers are not robust on challenging scenes, we also explore two-stage classifiers. The two-stage UNet+DenseFire classifier needs 1705×10^6 FLOPs, which is the highest. Its performance as measured by the F1-score improves significantly on both datasets. On the NTUST dataset, the F1-score reaches 0.895 and the accuracy 80.6%. On the small-sized fire dataset, it reaches an F1-score of 0.742 and an accuracy of 76.1%. Our two-stage STNet+DenseFire classifier requires 935×10^6 FLOPs. STNet+DenseFire obtains an accuracy of 99.5% on the NTUST dataset and 96.5% on the small-sized fire dataset. This implies that a two-stage classifier increases the computational cost but also improves performance. Our STNet+DenseFire achieves F1-scores of 0.992 and 0.941 on the NTUST and small-sized fire datasets, respectively, outperforming the other methods given in Table 13.
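To illustrate how a multiply-add count arises, the sketch below counts the multiply-adds of a single standard convolutional layer and computes the relative difference between two FLOPs totals. The layer dimensions are illustrative assumptions (they do not describe any layer of the compared networks); the two totals passed to the comparison are the 935×10^6 and 1705×10^6 figures quoted above.

```python
def conv2d_multiply_adds(h_out: int, w_out: int, c_in: int, c_out: int, kernel: int) -> int:
    """Multiply-adds of a standard (non-grouped) convolution.

    Every output value (h_out * w_out * c_out of them) needs
    kernel * kernel * c_in multiply-adds.
    """
    return h_out * w_out * c_out * kernel * kernel * c_in


def relative_difference_percent(a: float, b: float) -> float:
    """Percentage by which a differs from b (negative when a is smaller)."""
    return 100.0 * (a - b) / b


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 56x56 output map.
layer_macs = conv2d_multiply_adds(56, 56, 64, 128, 3)

# Totals quoted in the text, in units of 10^6 FLOPs.
stnet_vs_unet = relative_difference_percent(935, 1705)
```

Summing such per-layer counts over a whole network gives totals of the kind listed in Table 13. With the quoted totals, the two-stage STNet+DenseFire pipeline needs roughly 45% fewer FLOPs than UNet+DenseFire.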
Model complexity is also a standard metric for evaluating deep learning models. Counting the number of learnable parameters allows us to analyze the complexity of a model and helps determine how much GPU memory it needs. Table 13 also lists the number of parameters for existing CNNs and our proposed network. The two-stage UNet+DenseFire classifier requires 36.7×10^6 parameters, while our STNet+DenseFire requires 8.5×10^6 parameters. ShuffleNet has 5.4×10^6 parameters and achieves an F1-score of 0.884 and an accuracy of 89.4% on the NTUST dataset, but only an F1-score of 0.350 and an accuracy of 65.2% on the small-sized fire dataset. Although CNNFire has the lowest parameter count, lower than ours, it yields the worst performance on the small-sized fire dataset.
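The link between parameter count and memory can be sketched with a back-of-the-envelope estimate. The code below counts the learnable parameters of one convolutional layer and converts a parameter total into the memory needed to store the weights, assuming 32-bit floating-point storage (4 bytes per parameter); that storage format and the example layer are assumptions, and activations or optimizer state would add to the figure.

```python
def conv2d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Learnable parameters of a standard convolution: weights plus optional biases."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def weights_megabytes(num_params: float, bytes_per_param: int = 4) -> float:
    """Memory needed to store the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1e6


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels.
layer_params = conv2d_params(64, 128, 3)

# Parameter totals quoted in the text.
stnet_densefire_mb = weights_megabytes(8.5e6)
unet_densefire_mb = weights_megabytes(36.7e6)
```

Under the 4-byte assumption, the 8.5×10^6 parameters of STNet+DenseFire occupy about 34 MB, compared with about 147 MB for the 36.7×10^6 parameters of UNet+DenseFire.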
The inference rate, measured in frames per second (fps), is also a vital evaluation metric for a fire detection method. Table 14 compares the fps of the methods on an NVIDIA GTX 1080 Ti graphics processing unit (GPU) with 11 GB of memory. One-stage algorithms such as CNNFire [11], EMNFire [13], ShuffleNet [69], GNetFire [20] and DenseFire detect quickly, at more than 22 frames/s. CNNFire and EMNFire operate faster than our approach. EMNFire attains an inference rate of 65 fps while maintaining an accuracy of 95.8% on the NTUST dataset, but it only reaches an accuracy of 38.8% on the small-sized fire dataset (Table 13). Similarly, CNNFire achieves an inference rate of 47 fps while maintaining an accuracy of 94.4% on the NTUST dataset, but it has the worst accuracy, 21.7%, on the small-sized fire dataset (Table 13). ShuffleNet has an inference rate similar to ours but does not match our performance. As shown in Table 13, our two-stage approach is more reliable than the others on both datasets. We reach an inference rate of 32 fps with our STNet+DenseFire model. Thus, our model is fast enough for real-time fire detection while maintaining an F1-score of 0.992 on the NTUST dataset and 0.941 on the small-sized fire dataset. Although our two-stage method is slower than EMNFire, both its F1-score and accuracy are considerably higher. Our method achieves 96.5% accuracy on the small-sized fire dataset, which outperforms EMNFire by 57.7 percentage points. It is worth mentioning that the detection accuracy of our approach on the small-sized fire dataset exceeds that of the other methods by a large margin. In future work, we will further reduce model complexity to improve the inference rate for fire detection, providing a better balance between accuracy and inference speed.
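An inference rate in fps is the inverse of the average per-frame latency, so the quoted rates can be translated into a time budget per frame. The sketch below shows one generic way to measure fps for an arbitrary per-frame function and to convert fps into milliseconds per frame. It is not the benchmarking code behind Table 14, and `process_frame` is a hypothetical placeholder for a detector.

```python
import time
from typing import Callable, Sequence


def measure_fps(process_frame: Callable[[object], object],
                frames: Sequence[object],
                warmup: int = 2) -> float:
    """Average frames per second of process_frame over the given frames."""
    for frame in frames[:warmup]:
        process_frame(frame)          # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return float("inf") if elapsed == 0 else len(frames) / elapsed


def latency_ms(fps: float) -> float:
    """Average time per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Per-frame budgets implied by the rates quoted in the text.
budget_ours = latency_ms(32)      # STNet+DenseFire
budget_emnfire = latency_ms(65)   # EMNFire
```

At 32 fps the two-stage pipeline spends 31.25 ms per frame, compared with about 15.4 ms for EMNFire at 65 fps.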

I. MODEL ROBUSTNESS
In real-world scenarios, surveillance videos mostly show normal scenes. A robust fire detection algorithm should therefore produce a minimum of false positives and false negatives on normal videos. Like false negatives, false alarms are costly: false alarm call-outs create a considerable drain on the fire and rescue service and cause substantial disruption and loss of productivity to businesses. Moreover, firefighters diverted from real emergencies by false alarms may delay emergency response times, placing others at risk, such as children in schools or people in hospitals and airports. Thus, in addition to comparing computational costs with state-of-the-art methods, we also test the robustness of our network in detecting fire in video sequences. Fig. 20 shows three fire videos selected from the different datasets. Top row: the NTUST dataset provides an indoor scene. Middle row: the small-sized fire dataset covers cases in which the camera is far from the scene or the fire is at an initial stage. Bottom row: the Foggia dataset provides an outdoor location. We examine various conditions: a) no rotation; rotation by b) 90, c) 180 and d) 270 degrees clockwise around the horizontal axis; e) fire entirely occluded by some object; and f) noise added to the video to evaluate performance under possible attacks. From Fig. 20, we can see that the proposed method performs well in most cases, which also indicates that it is effective at detecting fires in unknown conditions with varied atmospheres.
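Perturbations of the kind listed above can be generated with a few lines of code. The sketch below operates on a grayscale frame stored as a nested list of pixel values in [0, 255] and implements in-plane 90-degree clockwise rotation, rectangular occlusion, and additive Gaussian noise. The function names, the in-plane interpretation of the rotation, the occluder shape and the noise model are all assumptions made for illustration; they do not reproduce the exact conditions of Fig. 20.

```python
import random


def rotate90_cw(img):
    """Rotate a frame by 90 degrees clockwise (apply 2 or 3 times for 180 or 270)."""
    return [list(row) for row in zip(*img[::-1])]


def occlude(img, top, left, height, width, value=0):
    """Return a copy of the frame with a rectangle overwritten by a constant value."""
    out = [row[:] for row in img]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = value
    return out


def add_gaussian_noise(img, sigma, seed=0):
    """Return a copy with zero-mean Gaussian noise added, clipped to [0, 255]."""
    rng = random.Random(seed)  # fixed seed keeps the perturbation reproducible
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in img]


# Toy 2x2 frame (hypothetical pixel values).
frame = [[1, 2],
         [3, 4]]
rotated = rotate90_cw(frame)
rotated_180 = rotate90_cw(rotated)
covered = occlude(frame, 0, 0, 1, 2)
```

Running a detector on each perturbed copy of a test frame and comparing the outputs with the unperturbed result gives a simple robustness check in the spirit of the conditions above.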

VI. CONCLUSION
In this paper, we proposed a two-stage architecture for early fire detection in videos, incorporating design strategies that can accurately detect small-sized fires. Specifically, in the first stage we combined spatial features with temporal features using a self-attention module to extract quality segmentation masks, which are used as region proposals. In the second stage, we classified each region proposal using a state-of-the-art classifier. Due to the lack of fire datasets, we employed semi-supervised learning, where we only needed a single ground-truth segmentation mask per frame-sequence input. Additionally, we adopted a transfer learning strategy and pre-trained our classifier on the ImageNet dataset. To train and evaluate our network, we constructed a fire video dataset with ground-truth segmentation masks. Since our goal is early detection, we also created a dataset of small-sized fires for evaluation. Using several evaluation metrics, we compared our approach with other methods and showed that it performs best. The state-of-the-art performance of our proposed model can be attributed to the combination of learned temporal and spatial features, which allows our model to detect fire based on its behaviour over time and on spatial features that can vary widely. Future work will be devoted to building a lightweight model that runs on devices with computational or memory constraints.