FTNet: Feature Transverse Network for Thermal Image Semantic Segmentation

Thermal imaging is a process of using infrared radiation and thermal energy to collect information about objects. It is superior to visible imaging in its ability to operate in darkness and tolerate illumination variations. In addition, it has the potential to penetrate smoke, aerosol, dust, and mist, which are critical inhibitors for visible imaging applications, including semantic segmentation. Unfortunately, current state-of-the-art image semantic segmentation methods (i) mainly concentrate on visible spectrum images and do not adequately capture the context of corresponding pixels, particularly edge details in thermal images, and (ii) accept a trade-off between higher accuracy and lower speed, or vice versa. Here, a novel end-to-end trainable convolutional neural network architecture, the feature transverse network (FTNet), is proposed to solve the aforementioned problems. FTNet captures and optimizes feature representations at multiple resolution scales, thereby improving the capability to process high-resolution images and producing quality output at a lower computational cost. Extensive experiments were conducted on publicly available benchmark thermal datasets, including SODA, MFNet, and SCUT-Seg, to demonstrate the effectiveness of the proposed FTNet compared to state-of-the-art methods. The comparison covers multiple aspects, including the quantitative accuracy and speed of the various approaches. The source code is available at https://github.com/shreyaskamathkm/FTNet.


I. INTRODUCTION
Image segmentation is the process of partitioning images into multiple segments [1], and it is one of the most challenging tasks in computer vision. It paves the way towards scene understanding, whose importance is highlighted by the increasing number of applications that benefit from inferring knowledge from imagery, including autonomous driving [2]-[4], computational photography [5], [6], biomedical analysis [7], [8], and augmented reality [9]-[13].
Semantic image segmentation (SS) is a high-level task formulated as a classification problem of pixels with semantic labels [11]. Semantic segmentation algorithms identify regions of different objects in the scene by grouping parts of the image that belong to the same object of interest and assigning a label to each pixel of an input image. In contrast, instance segmentation treats multiple objects of the same class as distinct individual instances. Panoptic segmentation assigns two labels to each pixel of an image, namely a semantic label and an instance id. Identically labeled pixels belong to the same semantic class, and their instance ids distinguish its instances. Despite significant advancements, semantic image segmentation is still considered a challenging task due to the imaging limitations of the visible spectrum under adverse environmental conditions. For instance, visible cameras are susceptible to lighting conditions and become ineffective in total darkness. Furthermore, their imaging quality decreases significantly in adverse environmental conditions, such as rain and smog [11].
The associate editor coordinating the review of this manuscript and approving it for publication was Xiaojie Su.
Thermal imaging is a process of utilizing infrared radiation and thermal energy to gather information about objects. It is superior to visible imaging in its ability to operate in darkness and across illumination variations. It also offers the capability to penetrate smoke, aerosol, dust, and mist [14]. The global thermal imaging market for the mobility industry is expected to reach $3.22 billion by 2025 [15]. The growth of the market can be attributed to growing awareness and lower prices of thermal cameras. This has led to their application in many computer vision tasks, such as detection [16], [17], tracking, segmentation [18]-[21], and individual and emotion identification [22]-[25]. Thermal image-based computer vision will be instrumental in improving driver-assist systems, which are increasingly essential in consumer car models. These sensors offer additional information to existing autonomous driving sensory systems, which strive to improve performance in identifying objects within a vehicle's surroundings to enhance driving reactions. However, the inter-class variance of objects in thermal images is extremely low, which makes accurate labeling near boundaries difficult, results in a large amount of semantic ambiguity, and intensifies the challenge of semantic segmentation. The state-of-the-art (SOTA) semantic segmentation approaches focus on diminishing semantic ambiguity using the rich context information of visible images. However, redundant and noisy semantic information from thermal images may clutter the final semantic maps. An example of ambiguous boundaries and the noise induced by thermal sensors is illustrated in Figure 2.
Convolutional neural networks (CNNs) are among the most effective and widely used deep learning (DL) architectures in computer vision tasks, including classification, detection, and segmentation. Semantic segmentation models usually follow an encoder-decoder architecture. In the encoder stage, a deep CNN typically computes a feature hierarchy layer by layer, which yields an inherent multi-scale pyramid shape. In the decoder stage, a high-semantic feature map is up-sampled and fused with the previous layer's feature map through lateral connections to recover higher spatial dimensions. After extracting spatial details, the network predicts the class for each pixel to complete the segmentation process.
However, this progress has come with a voracious appetite for computing power which will rapidly become technically and economically prohibitive [27].
This article presents a novel end-to-end trainable convolutional neural network architecture, named feature transverse network (FTNet), to address these issues. The proposed FTNet network is designed and optimized to perform image segmentation of thermal images. FTNet consists of two main components: a high-low feature traversing part and an edge guidance part. The architecture is equipped with skip connections between these two components to use high-resolution image details during the reconstruction. An example of the results obtained using FTNet is illustrated in Figure 1.
FIGURE 2. Illustration of RGB and thermal images from the MFNet dataset [26]. The object boundaries in thermal images are ambiguous and noisy compared to their RGB counterparts, which adversely affects segmentation.
Some of the notable contributions of FTNet include:
1) a unified end-to-end trainable network that captures discriminative thermal image features from multiple resolutions and combines them in a fully connected approach;
2) a network that captures and optimizes feature representations at multiple resolution scales, thereby improving the capability of handling high-resolution images and producing quality output at a lower computational cost;
3) a network whose main representations are shared between the semantic segmentation and edge guidance structures, which means that FTNet simultaneously achieves semantic segmentation and edge detection without significantly increasing the model complexity;
4) extensive experiments on challenging thermal semantic segmentation tasks using benchmark datasets including SODA [11], MFNet [26], and SCUT-Seg [28], which validate the performance of the proposed model compared with state-of-the-art methods such as MCNet [28], PSPNet [29], DeepLabv3 [30], and HRNet [31];
5) the source code, which is made available on GitHub for the research community.
The remainder of the paper is organized as follows. In section II, the recent related literature is reviewed. A detailed description of the FTNet architecture and its analysis is provided in section III. Section IV presents the experimental results, including training details, ablation studies, and benchmark results. Finally, a brief discussion and conclusion are provided in sections V and VI, respectively.

II. RELATED WORK
This section provides an overview of some of the most prominent DL architectures used by the computer vision community for visible and thermal image semantic segmentation. Some of the earliest segmentation approaches include thresholding [32]-[34], histogram-based bundling, region growing [35], k-means clustering [36], watersheds [37], active contours [38], and graph cuts [39]. Most traditional semantic segmentation algorithms are based on low-order visual information of the images. Therefore, the semantic maps produced by these methods are often not ideal for complex segmentation tasks that require artificial auxiliary information [40]. In contrast, DL architectures have shifted the paradigm in the field of segmentation with remarkable performance improvements on popular benchmarks [41].
Long et al. [42] developed one of the first semantic segmentation DL architectures using a fully convolutional network. It was able to produce an output of the corresponding size from an input of arbitrary size with efficient inference. However, it did not utilize the global context information efficiently. Noh et al. [52] generated dense segmentation masks using a sequence of deconvolution operations. The network consisted of deconvolution and unpooling layers, which alleviated a few of the existing limitations.
Eliminating downsampling may increase resolution; however, it affects the receptive field in subsequent layers, increasing context loss. To overcome this, Chen et al. [53] and Yu and Koltun [54] used dilated convolution to enlarge the receptive field of neural networks. Chen et al. [30] further combined cascaded and parallel modules of dilated convolutions.
Badrinarayanan et al. [43] introduced unpooling layers for upsampling as a replacement for transposed convolutions. This network eliminated the parameters required for learned upsampling, thereby achieving a balance between memory and precision. Liu et al. [55] proposed a method that models global context directly instead of relying on the largest receptive field of the network. This method merged the output of the global pooling layer from previous layers with the current map of the posterior layer to generate the final classifier prediction with both having the same size [56].
Multi-scale feature analysis has been extensively studied and has been deployed in various neural network architectures. One of the most prominent multi-scale feature analysis models was FPN [57], [58], which introduced multi-scale feature fusion by setting a top-down pathway. Following this, various feature pyramid-based architectures have been introduced; for example, PANet [48] suggests an additional bottom-up path augmentation for preserving the local context, and NAS-FPN [59], which exploits the neural architecture search framework, was introduced for object detection. Wang et al. [31] proposed a multi-branch parallel structure that can efficiently utilize the fine-grained spatial information, which is generally lost in encoder-decoder-based models due to the downsampling and upsampling process. However, it does not consider global context information and boundary information [60]. Zhao et al. [29] further developed a method to learn feature representations at different scales. However, it has sizeable model complexity and computational requirements.
Edge guidance is simple but effective in indicating the semantic separation between different regions [61], [62]. A few traditional high-order conditional random field (CRF) [63] and CNN-based [64], [65] semantic segmentation methods utilize superpixels to retain boundary information. However, superpixel-based approaches are unlearnable and not robust [66]. Liu et al. [66] addressed this issue using edge loss reinforced structures constructed from an encoder and a decoder to retain spatial boundary information for remote sensing images.
In the thermal image segmentation domain, Li et al. [11] designed a gated feature-wise transform layer to adaptively embed edge information as the guidance of a semantic segmentation network. This network extracted edges utilizing the HED (Holistically nested Edge Detection) network [67] and embedded the edge features into a network proposed by Chen et al. [30]. Xiong et al. [28] developed a thermal image semantic segmentation method that utilized multilevel edge knowledge to get more edge and shape features. Ha et al. [26] incorporated RGB and thermal information to perform segmentation. This network utilized two separate encoders to extract features from visible and thermal images and fuse them in the decoder to produce the probability map for the semantic segmentation results. Other methods that used RGB and thermal images are described in [68]- [70]. Table 1 provides a chronological list of various other image segmentation methods, along with a brief explanation for each method.

III. PROPOSED METHOD
This section presents an end-to-end trainable convolutional neural network architecture called the feature transverse network (FTNet). A high-level flow diagram of the proposed system is provided in Figure 3. This paper aims to construct a function $f(I)$ that maps each pixel of an input image $I$ of arbitrary size $(m, n)$ to a class label, producing a label map with the same dimensions. This network combines the low-level layers, which have poor semantic features and high resolution, with the high-level layers, which have rich semantic features and low resolution. Following this theme, the novel FTNet comprises an encoder network, a corresponding transverse decoder network, and a final pixel-wise classification layer. This network aims at capturing and optimizing feature representations at multiple resolution scales, thereby improving the capability of handling high-resolution images and producing quality output at a lower computational cost than the SOTA techniques. Additional details of these components are provided in the following subsections.

A. ENCODER NETWORK
Since the main focus of the proposed network is to build a position-sensitive model capable of pixel-level classification, FTNet employs existing SOTA backbones that follow the design rule of LeNet-5 [71]. The spatial size of the features in these classification-based backbones is gradually reduced from a high-resolution representation to a low-resolution representation, thereby allowing FTNet to capture features with different representation capabilities.
For simplicity of exposition, consider the case in which ResNet50 is used as a backbone. The basic structure of the encoder network is visualized in Figure 3 (a). The encoder network aims at acquiring features at different resolutions by subsampling at various stages. The convolutions in these networks can be divided into four stages, and the output of each stage's last block from the encoder network can be represented as $\{E_i \mid i = 1, 2, 3, 4\}$. This bottom-up pathway extracts and establishes a feature pyramid by incorporating features from the last convolutional layer in each stage. The extension to other backbones is straightforward.
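To make the feature pyramid concrete, the following minimal sketch computes the spatial size of each stage output for a given input size. It assumes the usual ResNet50 stage strides of 4, 8, 16, and 32 (no dilation); the function name `stage_shapes` is hypothetical and not part of the released code.

```python
def stage_shapes(height, width, strides=(4, 8, 16, 32)):
    """Return the (height, width) of each encoder stage output E_1..E_4.

    The strides are the usual ResNet50 stage strides (an assumption here).
    Sizes are rounded up, which is what padded stride-2 layers produce.
    """
    # -(-a // b) is integer ceiling division.
    return [(-(-height // s), -(-width // s)) for s in strides]


# A 640 x 480 thermal image (width x height) gives four pyramid levels.
print(stage_shapes(480, 640))
```

For a 640 × 480 input, the coarsest stage is only 20 × 15 pixels, which illustrates why the decoder has to recover spatial detail from the earlier stages.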

B. FEATURE TRANSVERSE NETWORK (DECODER)
Recent developments have demonstrated the need to explore all the multiple-resolution representations for a broad range of vision problems [31]. Following this, a transverse network with a top-down path comprising feature aggregation is introduced to produce the final semantic features at high resolution. An illustration of the proposed decoder network is displayed in Figure 3. The proposed FTNet considers the features from the stages $\{E_i \mid i = 1, 2, 3, 4\}$, which have resolutions $r = 1/4$, $1/8$, $1/16$, and $1/32$, respectively. These feature maps are passed through a set of residual units $U$ as proposed by He et al. [72]. The illustration of this unit is provided in Figure 3 (d). Each residual unit can be defined as formulated in equation (1):
$$x_{l+1} = R\big(I(x_l) + \mathcal{F}(x_l;\, \omega_l, b_l)\big) \tag{1}$$
where $x_l$ is the input feature map for the $l$-th residual layer, $\omega_l = \{\omega_{l,k} \mid 1 \le k \le K\}$ and $b_l$ are the associated sets of weights and biases, respectively, $K$ denotes the number of weights, $\mathcal{F}$ denotes the combination of layers CONV → BN → ReLU → CONV → BN, $R$ denotes the ReLU activation function, and $I$ is the identity map, which may comprise a weight $\omega_s$ when the feature maps do not have the same number of channels.
A set of residual units along each resolution forms a residual stream, giving one stream output per stage $i = 1, 2, 3, 4$. These streams are fused in a fully connected fashion to take advantage of information exchange across multi-resolution representations. The fused multi-resolution feature map at the $i$-th stage is a summation of the feature maps from the different streams, each transformed by a corresponding function $f_{ij}(\cdot)$. In certain cases, when dilated convolutions are used for semantic segmentation purposes, the last three layers comprise dilated convolutions. To support this mechanism, the corresponding residual streams contain dilated convolutions to maintain the same resolution. A broad formulation of both these cases is given in (2). The function $f_{ij}(\cdot)$ depends on the feature resolutions and is formulated in (3).
To downsample (↓) by a factor of 4, two strided convolutions with kernel size 3 × 3 are utilized. For upsampling (↑), a resize convolution with a bilinear kernel followed by a convolutional layer of kernel size 1 × 1 is utilized. No function is applied when the input and output feature maps have identical resolutions and lie along the same residual stream. When the residual stream is different, a dilated convolution is applied to maintain the same resolution. Finally, all the feature maps are upsampled to the original resolution and passed through the final pixel-wise classification layers, which comprise CONV → BN → ReLU → CONV with convolution kernel size 1 × 1.
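The resolution-dependent choice of operation described above can be sketched as a small selection routine. This is only an illustration: the function name `transfer_ops` and the string labels are hypothetical, strides are expressed relative to the input image, and generalising "two strided convolutions for a factor of 4" to one stride-2 convolution per factor of 2 is an assumption.

```python
def transfer_ops(src_stride, dst_stride, same_stream=False):
    """Name the layers that bring a feature map from src_stride to dst_stride.

    Strides are relative to the input image (4, 8, 16, 32 for E_1..E_4).
    The result only names the layers; it does not build them.
    """
    if src_stride == dst_stride:
        # Same resolution: nothing on the same stream; across streams
        # (the dilated case) a dilated convolution keeps the resolution.
        return [] if same_stream else ["dilated conv"]
    if src_stride < dst_stride:
        # Downsample: one stride-2 3x3 convolution per factor of 2.
        ratio = dst_stride // src_stride
        return ["conv 3x3, stride 2"] * (ratio.bit_length() - 1)
    # Upsample: bilinear resize followed by a 1x1 convolution.
    factor = src_stride // dst_stride
    return [f"bilinear resize x{factor}", "conv 1x1"]


print(transfer_ops(4, 16))   # high-resolution stream feeding a coarser one
print(transfer_ops(32, 4))   # coarsest stream feeding the finest one
```

Summing the transformed maps from all source streams at each target stream then gives the fully connected fusion.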

C. EDGE GUIDANCE (EG)
Thermal image features are coarse due to low resolution and contrast. The object boundaries are ambiguous due to thermal crossover, and the images are noisy due to the design of thermal sensors. Because of these concerns, the reconstruction of the semantic maps generally depends on low-level features and edge details.
Considering these observations, edge map detection is introduced in the decoder section. The edges are extracted from the $E_3$ layer and passed through CONV → BN → ReLU. The result is upsampled to the original resolution, passed through a CONV layer, and finally appended to the feature maps before the final pixel-wise classification layers are applied. It should be noted that the edge ground truth is obtained by a simple calculation of the semantic ground truth gradient, which does not require additional labeling effort. A detailed study of edges extracted from various parts of the encoder is provided in later sections.
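As an illustration of deriving edge labels from the semantic ground truth without extra annotation, the sketch below marks a pixel as an edge wherever the label changes towards its right or lower neighbour. This is one simple discrete gradient chosen for illustration; the exact operator and edge thickness used by the authors may differ, and `binary_edge_map` is a hypothetical name.

```python
def binary_edge_map(labels):
    """Mark a pixel as edge (1) when its right or lower neighbour has a
    different class id, i.e. where the label map's forward difference
    is non-zero.

    `labels` is a list of rows of integer class ids.
    """
    h, w = len(labels), len(labels[0])
    edges = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            right = c + 1 < w and labels[r][c + 1] != labels[r][c]
            below = r + 1 < h and labels[r + 1][c] != labels[r][c]
            if right or below:
                edges[r][c] = 1
    return edges


semantic = [
    [0, 0, 1],
    [0, 0, 1],
    [2, 2, 2],
]
for row in binary_edge_map(semantic):
    print(row)
```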

D. LOSS
An edge-based loss function is employed to ensure the prediction of crisp edges along the boundaries of semantic maps. In edge detection, the labels for edges and background are highly imbalanced. A binary cross-entropy with an adaptive balancing mechanism proposed by Xie and Tu [67] is utilized to overcome this issue. For an image whose edge ground truth $e$ comprises $Z^{+}$ edge pixels and $Z^{-}$ background pixels, and a prediction $\hat{p}$, the edge loss can be formulated as shown in (4):
$$L_{edge} = -\sum_{m} \left[ \frac{Z^{-}}{Z^{+} + Z^{-}}\, e_m \log \hat{p}_m + \frac{Z^{+}}{Z^{+} + Z^{-}}\, (1 - e_m) \log (1 - \hat{p}_m) \right] \tag{4}$$
This loss function handles the class imbalance by weighting each class by the pixel share of the other class, so that the two classes carry equal total weight irrespective of the ratio of $Z^{+}$ to $Z^{-}$.
A cross-entropy loss defined in (5) is utilized to supervise the generated semantic maps:
$$L_{sem} = -\sum_{m} y_m \log \eta(x_m \mid z) \tag{5}$$
where $\eta(x_m \mid z)$ denotes the probability at pixel $m$ with the parameter $\eta$, and $y_m$ is the ground truth. The total loss adopted to train FTNet is denoted as:
$$L = \alpha L_{edge} + \beta L_{sem} \tag{6}$$
where $\alpha$ and $\beta$ are continuous hyper-parameters that denote the weights for the edge and semantic losses, respectively. For the experimental results, $\beta$ was fixed to 1, while $\alpha$ was varied. More discussion is provided in the later sections. This configuration helps in obtaining refined, spatially consistent semantic maps with crisp boundaries.
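A minimal sketch of the two loss ingredients, written in plain Python over flat lists, is given below. It assumes the class-balanced binary cross-entropy in the style of Xie and Tu [67], with each class weighted by the other class's share of the pixels, and a simple weighted sum for the total loss. The names `balanced_bce` and `total_loss` are hypothetical, and the default of 20 for the edge weight is the value reported as best in the ablation study.

```python
import math


def balanced_bce(pred, target, eps=1e-7):
    """Class-balanced binary cross-entropy over flat lists.

    pred: edge probabilities in (0, 1); target: 1 for edge, 0 for background.
    Each class is weighted by the other class's share of the pixels, so
    both classes carry the same total weight.
    """
    z_pos = sum(target)
    z_neg = len(target) - z_pos
    total = z_pos + z_neg
    w_pos, w_neg = z_neg / total, z_pos / total
    loss = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        loss -= w_pos * t * math.log(p) + w_neg * (1 - t) * math.log(1.0 - p)
    return loss


def total_loss(edge_loss, semantic_loss, alpha=20.0, beta=1.0):
    """Weighted sum of the edge and semantic losses."""
    return alpha * edge_loss + beta * semantic_loss


# One edge pixel among four; an uninformative prediction of 0.5 everywhere.
edge = balanced_bce([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 0])
print(edge, total_loss(edge, semantic_loss=1.0))
```

In the example, the single edge pixel contributes as much to the loss as the three background pixels combined, which is the intended effect of the balancing weights.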

IV. EXPERIMENTAL RESULTS
This section provides the performance evaluation of the FTNet. After describing the experimental settings, datasets, and training details, the performance comparisons with SOTA methods are provided to demonstrate the effectiveness and generalization ability of the proposed FTNet.

A. DATASET
For training and testing purposes, the SODA dataset [11] was employed. It comprises 2,168 annotated images, among which 1,168 images are used for training and 1,000 images are used for testing. Due to the small scale of this dataset, synthetic images were utilized for pretraining the network. These synthetic images were obtained by translating the Cityscapes [73] dataset from RGB space to thermal space. The original Cityscapes dataset contains 5,000 images, including 2,975 training images, 500 validation images, and 1,525 test images. Following the training protocol described in [11], all the training, validation, and testing images were combined for pretraining purposes. As the dataset does not include ground truth edge maps, they were generated following the strategy used in [74]. In this protocol, the ground truth semantic maps are utilized to generate edges for each class. However, as FTNet incorporates only binary edge information, the protocol was adapted to produce binary edge maps instead of generating edges for each class.
An example set of images utilized for training is provided in Figure 4.

B. TRAINING DETAILS
For training the network, a progressive learning algorithm is adopted. Initially, the encoder parameters are loaded with pretrained ImageNet weights. For pretraining the model with thermal features, synthetic thermal Cityscapes images were utilized. The thermal input images and their corresponding masks were augmented by random cropping (from 640 × 480 to 480 × 480), random scaling in the range (0.5, 2), and random horizontal flipping. The SGD optimizer [75] with a base learning rate of 0.01, momentum of 0.9, and weight decay of 0.0001 is employed to train the model. As the decoder section is randomly initialized, its learning rate is increased by a factor of 10.
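The key point of the augmentation step is that the image and its mask must receive identical random transforms. The sketch below illustrates this for cropping and horizontal flipping on nested lists; random scaling is omitted for brevity, and the function name `paired_crop_flip` is hypothetical rather than taken from the released code.

```python
import random


def paired_crop_flip(image, mask, crop_h, crop_w, rng=random):
    """Apply one random crop and one random horizontal flip to both inputs.

    image and mask are lists of rows with identical height and width.
    The random parameters are drawn once and reused for both arrays.
    """
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)
    left = rng.randint(0, w - crop_w)
    flip = rng.random() < 0.5

    def apply(arr):
        rows = [row[left:left + crop_w] for row in arr[top:top + crop_h]]
        return [row[::-1] for row in rows] if flip else rows

    return apply(image), apply(mask)


# Toy 4 x 6 "image" whose mask is a copy, so alignment is easy to check.
image = [[10 * r + c for c in range(6)] for r in range(4)]
mask = [row[:] for row in image]
img_out, mask_out = paired_crop_flip(image, mask, 3, 4, random.Random(0))
print(img_out == mask_out)
```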
The batch size was set to 16. The network was trained for 100 epochs, and a poly learning rate policy with a power of 0.9 was used to decay the learning rate.
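For reference, the poly policy scales the base learning rate by $(1 - t/T)^{0.9}$, where $t$ is the current step and $T$ the total number of steps. A minimal sketch follows; the function name `poly_lr` and the step count of 1000 are illustrative assumptions.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Learning rate at `step` out of `total_steps` under the poly policy."""
    return base_lr * (1.0 - step / total_steps) ** power


# The encoder uses the base rate; the randomly initialised decoder uses 10x.
total_steps = 1000  # illustrative value only
for step in (0, 500, 999):
    encoder_lr = poly_lr(0.01, step, total_steps)
    decoder_lr = 10 * encoder_lr
    print(step, encoder_lr, decoder_lr)
```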
As the Cityscapes dataset labels are different from those of the SODA dataset, the last layer of the network is discarded after pretraining and adjusted to match the number of classes in the SODA dataset. The protocol for training remains the same except for the initial learning rate, which is reduced by a factor of 10.
The experiments were conducted on the PyTorch platform [76]. The models were trained on 2 V100 GPUs, and it takes around 12 hours to complete FTNet training (Cityscapes pretraining and SODA training). To evaluate FTNet, Intersection over Union (IoU), also known as the Jaccard Index, is utilized. It is the ratio of the intersection of the pixel-wise classification results and the ground truth to their union. A higher percentage indicates that the predicted class maps are closer to the ground truth.
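The metric can be sketched in a few lines of plain Python. The version below computes per-class IoU over flat lists of class ids and averages over the classes that occur. Skipping classes that appear in neither the prediction nor the ground truth is an assumption of this sketch, and benchmark protocols typically accumulate the counts over the whole test set rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Per-class IoU and their mean over flat lists of class ids.

    Classes absent from both prediction and ground truth are skipped.
    """
    ious = {}
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious[c] = inter / union
    return ious, sum(ious.values()) / len(ious)


per_class, miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(per_class, miou)
```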

C. ABLATION STUDIES
Extensive ablation studies on the various architectural components of FTNet architecture were performed. The ResNet 50 architecture was utilized as the encoder for baseline results. The number of filters in the decoder was set to 32, and the number of residual units U for each residual stream i was set to 4. Equal weights (α = β = 1) were provided to both edge and semantic maps for all the ablation studies except for loss weight analysis.

1) IMPACT OF ENCODER STAGES EXTRACTIONS AND DILATION
This ablation study evaluates the model's performance when features are extracted from various encoder stages and analyzes the effect of dilation. The results are provided in Table 3. Initially, all the resolution scales, including the image space, were utilized to reconstruct semantic maps. Subsequently, the image space was discarded, and further analysis led to the elimination of features from the $E_0$ stage. For semantic segmentation techniques, down-sampling may cause loss of spatial information; however, it is required to understand the scenes and reconstruct the semantic maps with finer details.
Excluding down-sampling may increase resolution; however, it affects the receptive field in subsequent layers, increasing context loss. To overcome this, dilated convolutions were employed to adjust receptive fields of feature points without decreasing the resolution of feature maps [30].
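The effect of dilation on the receptive field and on the output resolution follows from standard convolution arithmetic: a kernel of size $k$ with dilation $d$ covers $k + (k - 1)(d - 1)$ input positions. The short sketch below, with hypothetical helper names, compares a strided and a dilated 3 × 3 convolution on a 60-pixel-wide feature map.

```python
def effective_kernel(kernel, dilation):
    """Number of input positions spanned by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one spatial dimension."""
    span = effective_kernel(kernel, dilation)
    return (size + 2 * padding - span) // stride + 1


# Strided 3x3 convolution: the resolution is halved.
print(conv_output_size(60, 3, stride=2, padding=1))
# Dilated 3x3 convolution (d = 2): same resolution, 5-pixel span.
print(conv_output_size(60, 3, stride=1, padding=2, dilation=2), effective_kernel(3, 2))
```

Because the dilated variant keeps the larger feature map, every subsequent layer processes more positions, which is consistent with the higher FLOP count of the dilated configurations.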
Replacing strided convolutions with dilated convolutions in FTNet provided superior results in all three cases. However, there exists a trade-off between the accuracy and the number of FLOPs. The models in all three cases have the same number of trainable parameters, but FTNet with dilated convolutions requires approximately 100 G more FLOPs in each case. As indicated by Minaee et al. [41], an ideal model should consider multiple aspects, including quantitative accuracy, speed (inference time), and storage requirements (memory footprint). Following this suggestion, FTNet aims at decreasing the FLOPs, thereby decreasing inference time while achieving higher accuracy. As FTNet with $\{E_i \mid i = 1, 2, 3, 4\}$ without dilation has lower FLOPs with acceptable accuracy, the rest of the ablation study utilizes strided convolutions in the model.

2) IMPACT OF EDGE GUIDANCE (EG)
Considering that thermal images are generally blurry and lack color information when compared to the visible domain, low-level features and edge details are crucial for generating semantic maps. In deep CNNs, there is a trade-off between semantics and resolution at the low-level and high-level layers. This trade-off is quantitatively shown in Table 2. Edges were extracted from each stage $\{E_i \mid i = 1, 2, 3, 4\}$, and various combinations of $E_i$ were investigated. It can be verified that extracting edges from the $E_3$ stage provides the best mIoU: the mIoU increased by 1.21 percentage points, from 55.5% to 56.71%. This confirms that high-level semantics with sufficient resolution provide better edges. This improvement indicates that edge guidance increases the learning effectiveness of the network by capturing varying structures to encode meaningful features.
On the contrary, combining edges from various encoder stages does not perform as expected for thermal imagery. This is because the initial encoder stages have poor semantics, whereas the high-level layers, especially $E_4$, have rich semantics but low resolution. This was validated by examining the $E_1$ and $E_4$ combination: edges extracted from $E_1$ alone provided 56.14% and from $E_4$ alone 55.99%, while the combination of both resulted in 55.03%, which is subpar.

3) IMPACT OF ENCODER TOPOLOGIES
The successful use of CNNs in image classification tasks has accelerated the research in architectural design. Since then, numerous network architectures have been proposed to address this task. Typically, these networks are used as encoders for complex tasks such as object detection, classification, and semantic segmentation. This section aims at evaluating the performance of the FTNet decoder with various encoder architectures on the SODA dataset. For a fair comparison, all the decoder components of the FTNet network were fixed, and only the encoder was replaced. The results of this study are provided in Table 4. It can be seen that the mIoU scores of deep stem ResNet-based architecture underperformed while ResNet and ResNeXt provided superior mIoU results when the filter sizes were set to 32 and 64. The ResNet and ResNeXt models were further investigated with 128 filter sizes and two residual units. The ResNeXt model using this setting outperformed other architectures from the ResNet family. It indicates that the inclusion of cardinality is of paramount importance to achieve better semantic maps. This feature of the ResNeXt model is more effective and of central importance, in addition to the dimensions of width and depth.

4) IMPACT OF EDGE LOSS WEIGHTS
As the total loss defined in Section III-D comprises two components, namely the semantic loss and the edge guidance loss, it is necessary to bring them to the same order of magnitude to obtain optimum results. A small edge loss weight may lead to a failure of edge supervision, while a large weight may dominate the semantic loss. It is necessary to balance them, as the semantic loss is always higher than the edge loss. In this ablation study, different α values were used to empirically determine the best edge loss weight. Table 5 provides the complete set of variations and their corresponding mIoU scores. When setting α = β = 1, the boundaries are not crisp when compared to the results obtained with α = 20 and β = 1. This discrepancy shows that balancing the magnitudes of the different losses is crucial for better accuracy. From the table, it can be determined that α = 20 provides superior accuracy.
The performance was tested on the MFNet dataset [26], SODA dataset [11], and SCUT-Seg dataset [28] to demonstrate the generalization ability of FTNet. The MFNet dataset is a public semantic segmentation dataset of driving scenes with RGB-T images. It comprises 1569 images divided into 784, 392, and 393 images for training, validation, and testing, respectively. The SCUT-Seg dataset is a set of thermal images collected in different driving scenarios. It is an extension of the SCUT FIR Pedestrian Dataset [79] and consists of 1345 training and 665 testing images. Furthermore, weights pretrained on the Cityscapes data were utilized as initial parameters. The rest of the training details remain the same.
For better understanding, the per-class IoU is provided in Table 6. The complete set of mIoU scores for all datasets is provided in Table 7. These results demonstrate the performance of FTNet when compared to the SOTA methods. FTNet achieves 60.09% mIoU on the SODA dataset, while the top four SOTA methods, MCNet, HRNet, PSPNet, and DeepLabV3, achieve 50.32%, 58.33%, 58.87%, and 58.37%, respectively. Furthermore, FTNet's performance on the MFNet and SCUT-Seg datasets is notable when compared to SOTA methods.
Qualitative evaluation is essential in image segmentation assessment, and the segmentation maps of FTNet and SOTA methods are assessed based on the human visual system. This analysis is done using human observers [80]-[84]. Human visual analysis is critical in identifying characteristics of algorithms that quantitative metrics may not identify correctly. For instance, in a less detailed or incorrectly labeled dataset, quantitative metrics will automatically penalize segmentation algorithms for correctly segmenting the object.
The qualitative comparisons are provided in Figure 6 and Figure 7. These examples show complex indoor and outdoor scenarios with numerous object instances in multiple scales and partial occlusion. These scenes were also captured with diverse lighting conditions, including day and night. These outdoor examples show challenging scenarios with various objects, such as cars and pedestrians, in close proximity to each other and far apart. FTNet effectively addresses these challenges and yields reliable semantic maps. Figure 6 panels [a,c] and Figure 7 panels [a,b] show that the person class and other objects have well-defined edges compared to SOTA methods. Figure 6 panel c shows that the two pedestrians near the car have crisp boundaries with finer labels representing the input.
Similarly, in Figure 6 panel b, the car has better edges, and FTNet, PSPNet, DANet, and UNet++ could detect the pole. However, FTNet was able to detect the pole with higher similarity to the input image. In Figure 7 panel b, the shapes of the person and monitors are more clearly defined when compared to the other methods. In Figure 7 panel c, most of the SOTA results are erroneous, but FTNet produces a clear semantic map of chairs and tables. Overall, FTNet yielded acceptable results even though the SODA dataset has few indoor data representations. The proposed network reconstructs semantic maps with a higher correlation to the ground truth despite the poor quality of the thermal images. These examples illustrate the ability of FTNet to perform better than the other models in circumstances where ambiguous object boundaries are introduced by thermal crossover.
Since the processing time of CNN-based semantic segmentation is crucial, the inference speed of the network was measured and tabulated. A 640 × 480 image was passed through the network 300 times, and the average over these passes was taken as the computation time of a single run. This experiment was repeated ten times, and the average over the ten runs is reported as the inference time in Table 8. The table lists the runtime of the different approaches without any optimizations, together with the number of parameters. Runtime was measured on a system with an Intel i9-9900K 3.60GHz CPU and an Nvidia RTX 2080 Ti GPU. The results show that the runtime of FTNet is comparable to that of other SOTA methods. In terms of model size and computation, FTNet has 2.23M fewer parameters and requires 37.42G fewer FLOPs than MCNet. Although FTNet has 3.9M more parameters than HRNet, its inference time is comparable and its accuracy is higher. These observations demonstrate the potential of FTNet for edge devices and intelligent systems such as automated driving and video surveillance applications.
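The timing protocol described above (300 passes per run, ten runs, average of the run averages) can be sketched as follows. This is a minimal illustration and not the authors' benchmarking script: the function name `average_inference_time` and the placeholder workload `dummy_forward` are assumptions, and the placeholder stands in for a forward pass of the network on a 640 × 480 image.

```python
import statistics
import time


def average_inference_time(run_once, passes_per_run=300, num_runs=10):
    """Time `run_once` following the protocol in the text.

    Each run executes `run_once` `passes_per_run` times and records the
    mean time per pass; the result is the mean over `num_runs` such runs.
    """
    run_means = []
    for _ in range(num_runs):
        start = time.perf_counter()
        for _ in range(passes_per_run):
            run_once()
        elapsed = time.perf_counter() - start
        run_means.append(elapsed / passes_per_run)
    return statistics.mean(run_means), run_means


def dummy_forward():
    # Hypothetical placeholder for network(image); replace with a real forward pass.
    sum(i * i for i in range(1000))


mean_time, run_means = average_inference_time(dummy_forward)
print(f"mean time per pass: {mean_time * 1e3:.4f} ms")
print(f"throughput: {1.0 / mean_time:.1f} passes per second")
```

When timing a GPU model, kernels usually execute asynchronously, so the device generally needs to be synchronized (for example with `torch.cuda.synchronize()` in PyTorch) before each clock reading. Whether the reported measurements did this is not stated in the text.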

E. DISCUSSION
To the best of the authors' knowledge, the most similar works to the FTNet are MCNet [28] and HRNet [31].
MCNet introduced multiple structures to preserve boundary information rather than post-fine-tuning the semantic segmentation results. It utilizes two feature representations, E_1 and E_4 (see Figure 3 (a)), from a dilated encoder network, and it employs a loss function that spans multiple levels of a correlation matrix correction module. In contrast, FTNet does not use dilated networks, which reduces the number of parameters. FTNet exploits all the feature representations {E_i | i = 1, 2, 3, 4} in a transverse structure to aid the network in producing high-quality semantic maps. Furthermore, a novel edge guidance mechanism is developed to produce crisp boundaries. Finally, a weighted loss function is explored to ensure that the edge and semantic losses have the same order of magnitude. This loss function preserves the boundary information along with accurate semantic maps.

FIGURE 6. Qualitative comparison of thermal image semantic segmentation of outdoor environments with SOTA methods. In addition to the segmentation output, the edge map reconstructed from FTNet is provided. It can be seen from the images that FTNet has provided clear boundaries when compared to SOTA methods. In panel (b), the car has finer boundaries near the tires and the pole was detected with clear distinction even though they were missing in the ground truth. In panels (a) and (c), the person class was segmented more finely while SOTA methods had ambiguous maps.
HRNet extracts features from high-resolution feature maps in parallel with the low-resolution feature maps. The extracted feature maps from multiple parallel streams are fused to obtain high-resolution representations. However, the encoder network is not interchangeable with existing encoder backbones such as VGG, ResNet, and ResNeXt.
In contrast, FTNet is carefully designed to transverse through multiple streams of existing serially connected encoder networks. This mechanism exploits all the feature maps at various resolutions, including the low-resolution feature maps that carry high-level semantic information. Adding the edge guidance counterpart to this network substantially improves performance, as shown in both the quantitative and qualitative analyses. Existing SOTA encoder networks such as Xception [85], DenseNet [86], and MobileNet [87] can be repurposed with the decoder of FTNet for various applications, including image denoising and recoloring. Furthermore, the decoder in FTNet can be readily extended to incorporate the outputs of dilated convolution in applications where the resolution must be preserved.
The benchmarking datasets used in this article [11], [26], [28] have promoted research on semantic segmentation of thermal images. However, the annotations provided in these datasets are coarse and less detailed than those of RGB datasets. This is most likely due to the extremely low inter-class variance of objects in thermal images, which makes accurate labeling near boundaries difficult. Additionally, some of the labels in the SODA dataset are misclassified. These challenges can be seen in the input and ground truth images of Figure 6 and Figure 7. Moreover, the distribution of the semantic labels is highly imbalanced in these benchmark datasets. For example, the SODA dataset comprises 1,304 road images but only 75 monitor images. These issues negatively affect the performance of both the proposed and the SOTA methods.
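The degree of imbalance in the example above can be quantified with a short calculation. The two counts are the ones quoted in the text; the helper name `imbalance_ratio` is hypothetical, and the remaining SODA classes are omitted because their counts are not given here.

```python
def imbalance_ratio(image_counts):
    """Ratio between the most and least frequent classes."""
    return max(image_counts.values()) / min(image_counts.values())


# Counts quoted in the text for two SODA classes; other classes are not listed.
soda_counts = {"road": 1304, "monitor": 75}

ratio = imbalance_ratio(soda_counts)
print(f"road images per monitor image: {ratio:.1f}")  # about 17.4
```

In this pair of classes, road images outnumber monitor images by a factor of roughly 17, so a network sees far fewer training examples of the rarer class.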

V. CONCLUSION
In this work, a novel deep learning-based semantic segmentation network, FTNet, was presented. This network explores multi-resolution representations to perform accurate pixel-wise classification. The proposed FTNet is an end-to-end trainable architecture with a ResNeXt encoder and a novel transverse-based decoder network that is efficient in terms of parameters, operations, and computation time. This transverse-based network captures discriminative features from multiple resolutions and combines them in a fully connected fashion to achieve semantic maps close to the ground truth. An edge guidance mechanism is proposed to overcome the poor quality, single-channel nature, and blurry object boundaries of thermal images. The introduction of a weighted loss further improves spatial boundary information and reduces semantic ambiguity. Extensive quantitative analysis demonstrated that FTNet achieved mIoU of 60.08%, 47.12%, and 66.73% on the SODA, MFNet, and SCUT-Seg datasets, respectively, with 33.44M parameters and 94.55G FLOPs. Furthermore, the qualitative analysis showed that FTNet reconstructed rich semantic maps with crisp boundaries. These results show that FTNet can potentially improve thermal image perception in intelligent systems such as automated driving and video surveillance applications, computational photography, biomedical analysis, and augmented reality.
As part of future work, the authors intend to explore dilated convolution with reduced parameters and to evaluate the system's performance on RGB datasets. Furthermore, FTNet will be tested on other position-sensitive vision applications, such as facial landmark detection, image super-resolution, image recoloring, and image denoising. Another promising direction is to apply the proposed model in other domains such as hyperspectral imaging.