FHI-Unet: Faster Heterogeneous Image Semantic Segmentation Design and Edge AI Implementation for Visible and Thermal Image Processing

Semantic segmentation is the process of clustering pixels belonging to the same object class within a frame. Deep convolutional neural network-based semantic segmentation requires large-scale computation and annotated training data to reach real-time inference speeds. Heterogeneous image segmentation, which must categorize each pixel across two modalities, is an even more challenging task, since the heterogeneous semantic segmentation method extracts the features of visible and thermal images separately. We designed an efficient architecture with a multi-hybrid-autoencoder and decoder for Faster Heterogeneous Image (FHI) semantic segmentation. The proposed architecture has fewer layers, resulting in fewer parameters, higher inference speed, and higher Intersection over Union (IoU). The specialty of this architecture is its discrete autonomous feature extraction framework, which processes the RGB image and Thermal (T) image inputs with individual convolutional layers. The 4-channel (RGBT) convolution features are then combined to reduce computational complexity and make the model performance more robust. The proposed FHI-Unet semantic segmentation model was evaluated on the NVIDIA Xavier NX edge AI platform with standard accuracy under real-time inference requirements. The proposed FHI-Unet model achieves the highest mIoU of 43.67 and the fastest real-time inference of 83.39 frames per second in the edge AI implementation. Compared with existing works on the Multi-spectral Semantic Segmentation Dataset, the proposed approach improves inference speed by 31.36%, mAcc by 7.16%, and mIoU by 5.1%.


I. INTRODUCTION
Semantic segmentation is a fundamental technique for autonomous applications. Human-computer interaction, virtual reality, and medical image analysis all rely on semantic segmentation. Semantic segmentation performs fine-grained reasoning by employing compact categorization and labeling for each pixel. The use of the convolutional neural network has increased considerably in recent years with the rapid development of deep learning applications. Computer vision-based object detection models identify the location of each object, classify it, determine its class number, predict its direction, and more [1]. However, an object detection model marks a bounding box corresponding to each class in the frame but explains nothing about the object's shape. The image segmentation technique creates a pixel-wise mask for each object in an image, which provides a more granular understanding of the picture. Image segmentation techniques make a massive impact on autonomous security systems [2], military surveillance [3], self-driving cars [4], traffic congestion systems [5], granular analysis of medical images [6], and so on. The research community has made encouraging progress on Convolutional Neural Network (CNN) architectures for semantic segmentation, for example, coarse-to-fine semantic segmentation [7], DeepLab [8], and PASCAL VOC [9] methods for learning image representations. The Fully Convolutional Network (FCN) [10] is a foundational work that accepts input images for semantic segmentation and converts the fully connected layers of classification networks such as AlexNet [11], VGG16 [12], and GoogLeNet [13] into convolution layers. However, the FCN reduces the size of the final predictions because of its several convolutional strides and spatial pooling operations, resulting in the loss of fine-grained image information and erroneous predictions.
The capability of the deep neural network, the amount of training data, the quality of the input images, and the lighting source of the image and video inputs all play a significant role in robust performance. Some neural networks [14], [15] have been built on 3-channel RGB input images from near-infrared visible cameras. Unfavorable lighting conditions, such as darkness, cloudy or foggy weather, and glare from automobile headlights, cause significant RGB image deterioration [16]. The thermal imaging camera creates heat radiation pictures, which remain usable in various lighting conditions [17]. In this work, we fuse thermal and visible images to obtain more accurate semantic segmentation. Commercial applications such as remote sensing, autonomous surveillance, automotive driving assistance systems, military surveillance, and embedded module systems require faster inference speed and appropriate object segmentation.
In this study, the image data are divided into two categories based on camera function: thermal images and visible-light images. A thermal image maps an object's varying degrees of thermal radiation energy into a temperature distribution map. Its perception range makes it suitable for use at night, in cloudy and foggy weather, and in the presence of glare from opposing headlights at extreme and abnormal temperatures [18]. The RGB visible image contains rich information such as object color, texture, and clear boundaries, from which it is comparatively easy to extract features in a well-lit environment; however, its discrimination ability decreases in a dim environment. Therefore, combining the characteristics of these two inputs lets them complement each other to alleviate environmental interference and obtain better segmentation results, which is called heterogeneous image semantic segmentation. If the RGB and thermal images are convolved together directly without separate processing, it is hard to improve the precision. Therefore, we adopted the concept of the Unet architecture and extended it into the proposed FHI-Unet model. We developed an independent convolutional network with a multi-hybrid-autoencoder that extracts features from the RGB and Thermal (T) image inputs separately.
The visible and thermal pictures are semantically segmented using the multi-hybrid-autoencoder and decoder of the proposed heterogeneous image segmentation architecture. We utilize an autonomous encoder for the 4-channel (RGBT) inputs and a feature fusion encoder to match the heterogeneous dual image features, extracting the thermal and visible image features separately. The significant contributions of this work are listed below.
1. We designed an independent convolutional network with a multi-hybrid-autoencoder that extracts features from the RGB and Thermal (T) image inputs separately, which reduces the computational complexity of the proposed architecture.
2. A feature fusion encoder combines and fuses the 4-channel (RGBT) convolution features, which enhances inference speed.
3. The proposed FHI-Unet model has fewer layers, fewer parameters, and lower read and write memory, which increases the FPS and accuracy.
4. The proposed design has been implemented on the NVIDIA Xavier NX edge AI platform to investigate faster heterogeneous image semantic segmentation.

II. RELATED WORK
Semantic segmentation algorithms require large-scale, high-quality data to keep performance robust while dealing with numerous instabilities. Semantic segmentation methods can be categorized into traditional approaches and deep learning algorithms. Sparse representation approaches [19], k-means clustering [20], Markov random fields [21], random forests [22], and clustering [23] count as traditional approaches. These traditional techniques are being progressively replaced by convolutional neural networks (CNNs). In recent years, researchers have been investigating CNN-based algorithms for semantic segmentation alongside the rapid growth of deep learning. The PSPNet [24] technique employs the dilated convolution method. The state-of-the-art DeepLabv3+ [25] network aggregates multi-scale contexts through two sub-networks to enlarge the receptive field, leading to higher-resolution and compact FCN-based pixel prediction. Visible and thermal image fusion [26] enables few-shot semantic segmentation based on bimodal images. The edge-conditioned convolutional neural network [27] for thermal image semantic segmentation is built on a feature-wise transform layer. The GMNet [28] organizes feature extraction into multiple levels for feature fusion.
In recent years, encoder-decoder-based models have been actively investigated for semantic segmentation with the popularity of deep learning algorithms. The dependability and flexibility of encoder-decoder-based models make them well suited to real-world applications such as robots and autonomous systems. The ABMDRNet [29] multimodality feature fusion network employs bi-directional image-to-image translation through a two-stage network. The SegNet [30] is a deep convolutional neural network that employs an encoder and decoder to conduct semantic pixel-wise segmentation. Its encoder consists of the 13 convolutional layers of VGG16, which serve for down-sampling and max-pooling, and the stored pooling indices resolve the pixel location information loss caused by multiple pooling layers. The decoder employs the associated max-pooling index values for up-sampling, and finally a Softmax classifier predicts each pixel's class in the output feature map. The Unet [31] model predicts tiny medical image segmentations by linking the encoder and decoder through symmetric connections. Two convolutions are used at the encoder for each of the four down-samplings by max pooling, while the decoder uses up-convolution for up-sampling and concatenates the corresponding-size encoder feature map. The MFNet [32] was designed as a dual-encoder architecture, with the RGB and thermal images processed by two parallel encoder branches. The encoder fuses the RGB and thermal feature maps using element-wise summation and sends them to the decoder for convolutional operation with nearest-neighbor interpolation for up-sampling. The RTFNet [16] consists of an RGB encoder, a thermal encoder, and a decoder to extract features using RGB and thermal data fusion, with ResNet [33] as the backbone network; the thermal feature maps are fused into the RGB encoder through element-wise summation. The MMNet [34] consists of a two-stage network for feature extraction and detail refinement. The GMFNet [35] is composed of three parallel Unets for per-modality and multimodal fusion. The FuseSeg [36] architecture uses a dense native representation to introduce laser range scanner data; the method's effectiveness lies in fusing LiDAR and RGB data for segmenting LiDAR point clouds. The dual attention network for image segmentation [37] extracts spatial dependencies of the feature map through a location-channel attention mechanism.
Some of the semantic segmentation models mentioned above perform well. However, the complexity of their architectures and frame structures leads to computational problems that require costly graphics cards and prevent embedded systems from processing in real time, trading off accuracy against speed. Figure 1 displays the proposed Faster Heterogeneous Image (FHI) semantic segmentation architecture. The proposed FHI-Unet architecture consists of two modules: the multi-hybrid-autoencoder for feature extraction and the decoder for feature map sampling. The first autoencoder extracts the 4-channel RGB and Thermal (T) input image features independently at the initial stage, then performs several convolutional operations, batch normalization with the Leaky ReLU activation function, and max-pooling in the following steps. A second feature fusion autoencoder then combines the 4-channel (RGBT) convolution features for further processing. The convolutional feature fusion speeds up the model's operation and computation, and the individual input feature extraction autoencoder saves operation time and accelerates performance. We employed customized convolutions to implement this experiment; their computation is faster than that of typical deep convolutional architectures [38]-[41].

III. THE PROPOSED NETWORK
The complexity of a typical convolution is measured across all input and output channels. If the input image size is 240 × 320 with 4-channel RGBT data, the output channel count is 32, and the kernel size (K) is 3 × 3, the total is 88.473M flops for the initial layer. The proposed FHI-Unet uses a multi-hybrid-autoencoder to extract the RGB and Thermal (T) picture features individually, generating 8 output feature channels for each input channel with the same kernel size, so the proposed FHI-Unet needs only 22.118M flops for the first stage. Following this process, the second and third stages also reduce the overall flops. Table 1 illustrates the multi-hybrid-autoencoder operation details for the proposed FHI-Unet semantic segmentation architecture. Each stage has different convolutional layers, different input and output channels, and different feature sizes, where H, W, and C stand for the height, width, and channel number of each image, respectively. The 1st, 2nd, and 3rd stages each have individual convolutional operations. The 1st stage extracts the RGB and thermal image features separately. The 2nd and 3rd stages create hybrid concatenation layers for the output features. The 4th and 5th stages perform the combined convolution and pass the information to the decoder for further processing.
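The first-stage saving can be checked with a short calculation. The sketch below is illustrative only: the hypothetical helper conv_flops models the per-channel extraction as a grouped convolution (groups = 4), which is one way to obtain 8 output channels per input channel; it is not the authors' implementation.

```python
def conv_flops(h, w, c_in, c_out, k, groups=1):
    """Multiply-accumulate count of a k x k convolution over an h x w feature map."""
    return h * w * k * k * (c_in // groups) * c_out

# Typical convolution: 240 x 320 RGBT input (4 channels) -> 32 output channels.
standard = conv_flops(240, 320, c_in=4, c_out=32, k=3)           # 88,473,600
# Proposed first stage: each input channel is convolved independently
# (groups = 4), yielding 8 feature channels per input channel.
grouped = conv_flops(240, 320, c_in=4, c_out=32, k=3, groups=4)  # 22,118,400

print(f"standard: {standard / 1e6:.3f}M flops, grouped: {grouped / 1e6:.3f}M flops")
```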
The decoder performs feature map up-sampling and restores the target to 480 × 640 resolution; its design is mainly for recovering the up-sampled size of the input image features. Since the encoder used four down-sampling operations, the decoder also performs four up-sampling operations to recover a feature map of the same size. There are different ways to calculate the recovery up-samples and reduce the computational effort; we used the nearest-neighbor interpolation method to reduce computational complexity. In addition, the up-sampling uses same-size feature maps from the decoder and encoder, adding them together to reduce the loss. Table 2 demonstrates the detailed computational operation of the decoder, in which up-pooling and Conv BN Leaky ReLU blocks are utilized.

IV. MULTI-HYBRID-AUTOENCODER AND DECODER COMPUTING INSTRUCTIONS
The encoder and decoder computing instructions cover down-sampling and up-sampling. The convolution kernel settings, layer computations, shortcut blocks, and all other aspects of the multi-hybrid-autoencoder and decoder operation are described in depth in the following subsections.

A. CONV + BN + LEAKY RELU
Each convolution is followed by batch normalization and the Leaky ReLU activation function. The Leaky ReLU solves the dying ReLU problem, which terminates backpropagation whenever the value is less than 0, and is calculated by equation (1):

$$f(x) = \begin{cases} x, & x \geq 0 \\ 0.2x, & x < 0 \end{cases} \tag{1}$$

Figure 2 shows the convolution, batch normalization, and activation function details, where the Leaky ReLU negative slope is 0.2.
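As a concrete illustration, a Conv + BN + Leaky ReLU block of the kind described above can be sketched in PyTorch as follows; aside from the 0.2 negative slope, the layer settings are assumptions, not the authors' exact configuration.

```python
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k=3):
    """Convolution followed by batch normalization and Leaky ReLU (slope 0.2)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),  # equation (1)
    )
```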
B. MAX-POOLING
The max-pooling function picks the maximum value within each kernel window; the highest value has the most significant impact on the image [43]. With a kernel size of 2 × 2, each spatial dimension is halved while the retained maxima increase the receptive field. In this study, the max-pooling operation reduces the feature maps after the convolutions as down-sampling: an input feature map of H × W is scaled down by a 2 × 2 pooling layer in the encoder, so the output feature map becomes H/2 × W/2, keeping the maximum of each 2 × 2 kernel window. Figure 3 shows how the feature map shrinks from stage 1 to stage 5 in the autoencoder operation, saving computing time.
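A minimal demonstration of this halving effect, assuming a 240 × 320 feature map with 32 channels:

```python
import torch
import torch.nn.functional as F

# 2 x 2 max-pooling keeps the maximum of each window, halving H and W.
x = torch.randn(1, 32, 240, 320)      # (batch, channels, H, W)
y = F.max_pool2d(x, kernel_size=2)    # -> torch.Size([1, 32, 120, 160])
print(y.shape)
```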

C. CONCATENATE LAYER
A concatenation layer accepts multiple inputs and concatenates them along a specific dimension. The entries must have the same size in all other dimensions, which increases the precision of learning [44]. The autoencoder concatenates the thermal (T) image features with the visible RGB image features at the same size, and the channel features are integrated and simplified, as shown in Figure 4. In stage 3, the autoencoder combines the same-size RGBT features and joins them along the channel axis, making it easy to perform the decoder convolution operations.
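In PyTorch terms, this fusion amounts to a channel-axis torch.cat; the feature map sizes below are illustrative assumptions.

```python
import torch

rgb_feat = torch.randn(1, 64, 60, 80)   # hypothetical RGB feature map
t_feat = torch.randn(1, 64, 60, 80)     # hypothetical thermal feature map, same size
fused = torch.cat([rgb_feat, t_feat], dim=1)   # concatenate channels -> (1, 128, 60, 80)
```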

D. CONV + SHORTCUT LAYER
Shortcut connections skip the imperfect low-level feature training layers by transferring features immediately to the high-level layers [33], solving gradient drifting problems. We incorporated shortcut connections in the designed architecture. The purpose of the Conv + Shortcut layer is to perform one convolution on the input feature map followed by a residual joining, which avoids the disappearance of the backpropagation gradient during training. The Conv + Shortcut layer speeds up the convergence of the architecture, as shown in Figure 5. The residual function overcomes the vanishing gradient problem and mitigates the degradation problem during stages four and five of the multi-hybrid-autoencoder operations.
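A minimal residual sketch of such a layer follows, assuming a single 3 × 3 convolution with batch normalization and Leaky ReLU (the exact settings are not specified in the text):

```python
import torch.nn as nn

class ConvShortcut(nn.Module):
    """One convolution on the input feature map followed by residual joining."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        # The identity path keeps the gradient flowing during backpropagation.
        return x + self.conv(x)
```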

E. UP-SAMPLING
The up-sampling operation transforms a small image input into a larger image output. Feature map up-sampling works by repeating the row and column features at the decoder to restore the target resolution. The up-sampling rate can be considerably high while still guaranteeing the quality of the up-sampled results [45]. In this work, the nearest-neighbor interpolation method is employed for feature up-sampling, doubling each row and column of the input data (see Figure 6). From stage 5 to stage 1, the decoder uses 2 × 2 nearest-neighbor interpolation to reduce the complexity and speed up the computation.
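In PyTorch this corresponds to nearest-mode interpolation with a scale factor of 2; the tensor shape below is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 30, 40)
# Nearest-neighbor up-sampling repeats each row and column, doubling H and W.
y = F.interpolate(x, scale_factor=2, mode="nearest")   # -> (1, 64, 60, 80)
```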

F. SHORTCUT BLOCK
The shortcut block in the decoder acquires context information, creates semantic characteristics, and enables feature fusion between multiple output resolutions [46]. The shortcut block preserves the detailed features of the encoder and adds them to the decoder feature map, as shown in Figure 7. From stage 5 to stage 1, all feature resolutions are added instead of concatenated. This shortcut block technique takes less time to compute the output feature maps and significantly reduces the system's memory requirements.
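The difference from the stage-3 concatenation is that the fusion here is an element-wise sum, which keeps the channel count unchanged; a one-line sketch with assumed shapes:

```python
import torch

decoder_feat = torch.randn(1, 64, 60, 80)   # up-sampled decoder feature map
encoder_feat = torch.randn(1, 64, 60, 80)   # same-resolution encoder features
out = decoder_feat + encoder_feat           # element-wise addition, not torch.cat
```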

V. DATA ANALYSIS AND EXPERIMENT
A. THE DATASET
The dataset is one of the most important factors in machine learning performance when dealing with deep neural networks. It is critical to collect and construct a comprehensive degraded-condition image dataset before designing a semantic segmentation model for degraded conditions. The RGB-Thermal multi-spectral semantic segmentation dataset [47] with pixel-level annotation was used for this experiment; it provides pixel-by-pixel labeling for visible and thermal images. The dataset consists of three-channel visible images with a 100-degree horizontal field of view and one-channel thermal images with a 32-degree field of view. The data are stored in 4-channel PNG format, with 1568 training images (820 daytime and 749 nighttime), 392 validation images, and 393 test images. Most of the images in the dataset are road scenes. The dataset has nine object categories: background, car, person, bike, curve, car stop, guardrail, color cone, and bump, each rendered in its own color.

B. TRAINING DETAILS
We used the PyTorch framework to conduct the experiments on the proposed faster heterogeneous image semantic segmentation architecture. An AMD 5600 and Intel Core i7 CPU, an NVIDIA 3090 graphics card with 24GB of memory, CUDA 11.1, and cuDNN v8.0.4 were employed in the training procedure. For the FHI-Unet experimental evaluation, we used Frames Per Second (FPS), Mean Accuracy (mAcc), and Mean Intersection over Union (mIoU) as evaluation metrics, an Adam (Adaptive moment estimation) optimizer for weight updates, the Cross-Entropy Loss (CEL) function for the training loss, and a batch size of 4.
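A minimal sketch of this training setup follows; FHIUnet, train_loader, and the learning rate are placeholders and assumptions, since the text specifies only the optimizer, loss function, and batch size.

```python
import torch
import torch.nn as nn

model = FHIUnet(in_channels=4, num_classes=9).cuda()       # placeholder model class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam; lr is assumed
criterion = nn.CrossEntropyLoss()                          # CEL training loss

for rgbt, labels in train_loader:       # assumed DataLoader with batch_size=4
    optimizer.zero_grad()
    logits = model(rgbt.cuda())         # (4, 9, H, W) per-pixel class scores
    loss = criterion(logits, labels.cuda().long())
    loss.backward()
    optimizer.step()
```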

C. EVALUATION METRICS
Many real-world applications demand faster inference speed in the production environment; hence, measuring network latency correctly is one of the most significant aspects of deploying a deep network. The inference speed of the proposed FHI-Unet model on multispectral image semantic segmentation is calculated by equation (2):

$$\text{Inference speed} = \frac{\text{Running test time}}{\text{Test numbers}} \tag{2}$$

Two validation measures are used to assess the heterogeneous image semantic segmentation performance. The first is the Accuracy (Acc) per class of pixels (equation 3), and the second is the Intersection over Union (IoU) per class (equation 4):

$$\text{Acc}_i = \frac{TP_i}{TP_i + FN_i} \tag{3}$$

$$\text{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i} \tag{4}$$

''mAcc'' indicates the mean value of the accuracy, and ''mIoU'' represents the mean value of the Intersection over Union. The values of mAcc and mIoU are calculated by equations (5) and (6), where the total number of object categories is µ = 9:

$$\text{mAcc} = \frac{1}{\mu}\sum_{i=1}^{\mu}\text{Acc}_i \tag{5}$$

$$\text{mIoU} = \frac{1}{\mu}\sum_{i=1}^{\mu}\text{IoU}_i \tag{6}$$
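These metrics can be sketched from a confusion matrix as follows; the helper below is illustrative only (classes absent from the test set would need masking to avoid division by zero).

```python
import numpy as np

def mean_metrics(conf):
    """conf[i, j] counts pixels of true class i predicted as class j (mu x mu)."""
    tp = np.diag(conf).astype(float)
    fn = conf.sum(axis=1) - tp
    fp = conf.sum(axis=0) - tp
    acc = tp / (tp + fn)            # equation (3), per-class accuracy
    iou = tp / (tp + fn + fp)       # equation (4), per-class IoU
    return acc.mean(), iou.mean()   # equations (5) and (6): mAcc, mIoU
```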
Fewer parameters imply less computational complexity for a convolutional neural network, and the number of parameters influences the memory size. More computation likewise makes the network more complex and lengthens the model's execution time. Assuming that the input channel count is C_in, the convolution kernel size is K × K, the output channel count is C_out, the input feature map size is H_in × W_in, and the output feature map is H_out × W_out, the number of parameters and the computational quantity are given by equations (7) and (8), where the group number G is 1 when no groups are used:

$$\text{Params} = \frac{K \times K \times C_{in}}{G} \times C_{out} \tag{7}$$

$$\text{FLOPs} = H_{out} \times W_{out} \times \frac{K \times K \times C_{in}}{G} \times C_{out} \tag{8}$$

VI. EXPERIMENTAL RESULTS
A. MODEL COMPARISON ON GPU AND EDGE AI PLATFORMS
In this experiment, we considered the real-time inference speed, Intersection over Union, and accuracy on the edge AI platform. Six models, namely RTFNet, FuseSeg, FuseNet, MFNet, SegNet, and U-net, are compared with the proposed FHI-Unet on the GPU as well as the Nvidia Xavier NX Edge AI platform. Among these models, FuseSeg and RTFNet had the slowest inference speeds on the GPU and Nvidia Xavier NX Edge AI platforms (marked in red); neither model is suitable for real-time applications. The MFNet architecture performs well on parameters, memory, Madd, flops, and read-write memory, but its FPS is still similar to FuseNet's and lower than SegNet's and Unet's. The proposed FHI-Unet semantic segmentation model achieved the highest and fastest inference speed on both the GPU and Edge AI platforms. Besides, its parameter, memory, Madd, flops, and read-write memory figures are better than those of RTFNet, FuseSeg, FuseNet, SegNet, and Unet. For real-time applications, the inference speed accelerates the performance of devices, and the proposed FHI-Unet model achieved the maximum FPS on both platforms (marked in green) among these models. Therefore, the proposed FHI-Unet model could be a good solution for real-time applications on Edge AI devices.

B. EDGE AI IMPLEMENTATION
For the Nvidia Xavier NX Edge AI implementation, Table 4 demonstrates the performance of FuseNet, MFNet, SegNet (4C), Unet (4C), and the proposed FHI-Unet segmentation model. FuseNet achieved the best object detection accuracy (mAcc) of 52.35, noted in green, while the proposed FHI-Unet model achieved the second-highest accuracy of 50.84, marked in purple. MFNet and Unet (4C) show decent object detection accuracy, whereas SegNet (4C) performed worst (red) in terms of accuracy. On the other hand, for the Intersection over Union of object detection, the proposed FHI-Unet model achieved the best mIoU of 43.60 (marked in green), and MFNet has the second-highest mIoU (shown in purple). FuseNet and Unet (4C) show average intersection-over-union performance, whereas SegNet (4C) has the lowest mIoU of only 29 (marked in red). All models show different values for each class of object segmentation, but the 'Guardrail' class accounts for only 0.0095 percent of the pixels, making it difficult for all models to classify. Figure 8 shows the FHI-Unet model implementation on the Nvidia Xavier NX edge AI platform for heterogeneous image semantic segmentation; we connected the Nvidia Xavier NX device to our desktop computer for the system implementation. The RGB-Thermal ''multi-spectral semantic segmentation dataset'' introduced with the MFNet model was used for this experiment. Table 4 details the results of the proposed FHI-Unet, which achieved the best FPS of 83.39 and an mIoU of 43.59. Figure 9 illustrates the performance of FuseNet, MFNet, SegNet (4C), Unet (4C), and the proposed FHI-Unet model on the Nvidia Xavier NX Edge AI platform in terms of mIoU and inference speed. FuseNet has the lowest mIoU and inference speed, whereas the proposed FHI-Unet image semantic segmentation model outperforms the rest of the models. Table 5 compares the flops, mAcc, mIoU, and FPS of Unet (4C) and the proposed FHI-Unet on the Nvidia Xavier NX edge AI platform. The proposed FHI-Unet model requires less computation and has higher mAcc and mIoU values than the Unet (4C) model, and it also achieved a better inference speed on the edge AI platform, which is state-of-the-art performance.
To evaluate the segmentation results of the different models, we considered four RGB images and four thermal images as inputs, covering night-view and daytime perspectives, shown in the first and second rows of Figure 10. The third row shows the ground truth of the fused RGBT images. The performance of FuseNet, MFNet, SegNet (4C), Unet, and the proposed FHI-Unet models is evaluated based on the segmentation results in columns (a), (b), (c), and (d), where wrong predictions and failed segmentations are marked by red circles.
Columns (a) and (b) present the segmentation results for the night-view images, while columns (c) and (d) present the results for the daytime images. FuseNet made some wrong predictions and segmentations in columns (a) and (d) and was unable to predict and segment the bicycles in column (c). Similarly, MFNet also made wrong predictions and segmentations; for example, it was unable to predict the person and the full car segmentation in column (b), and it made errors in the car and person segmentations in columns (a) and (d). The SegNet (4C) model missed some objects in columns (a), (b), and (d). The color temperature of the bicycle and the background appear similar in the RGB and thermal images of column (c); as a result, the Unet (4C) model is incapable of segmenting the bicycle there, and it also made some wrong segmentations in columns (a) and (d). In contrast, the proposed FHI-Unet model shows excellent heterogeneous image semantic segmentation performance, closely matching the ground truth objects where the other models could not.
FuseNet has the highest accuracy result but still made some wrong predictions and segmentations, while the U-Net (4C) model segments better than FuseNet, MFNet, and SegNet (4C). In terms of object prediction and segmentation, the proposed FHI-Unet model achieved better performance than Unet (4C), even though the Unet (4C) design was adopted as the basis of the proposed FHI-Unet model, which is an extended form of it. The segmentation results show that the proposed FHI-Unet model outperforms the other models. Furthermore, the proposed FHI-Unet model achieved the second-highest accuracy and the best inference speed, Intersection over Union, and object segmentation on the Nvidia Xavier NX platform compared to the other models. Figure 11 illustrates the performance of the proposed FHI-Unet and the other models on the Nvidia Xavier NX platform. FuseNet has better accuracy than MFNet and Unet, while SegNet has the lowest accuracy, and the proposed FHI-Unet achieved the second-highest accuracy. Furthermore, the proposed FHI-Unet shows the best inference speed and the highest mIoU among these approaches, whereas FuseNet and MFNet have the slowest inference speeds and SegNet the smallest mIoU. Finally, in terms of FPS, mIoU, and mAcc, the proposed FHI-Unet model achieves state-of-the-art performance on the Nvidia Xavier NX platform.

VII. DISCUSSION
The PyTorch 1.6 framework was employed for the proposed FHI-Unet, which was implemented on the Nvidia Xavier NX edge AI platform. Considering the need for higher speed in real-time applications, the proposed FHI-Unet model requires less computation and achieves a higher inference speed; it accomplishes edge computing for heterogeneous image segmentation with reduced computational complexity. We used RGB and thermal images for daytime and nighttime as the training dataset, which improved the FPS performance. However, the background makes up most of the total pixels in the dataset, and the number of samples per object category was imbalanced; since the frequency of each item category was not adjusted individually for each image, the accuracy was somewhat low. To improve the accuracy rate, the amount of training data should be increased and the training categories balanced in practice. In the future, we may add convolutional layers to further raise the accuracy of the proposed FHI-Unet semantic segmentation model.

VIII. CONCLUSION
The proposed FHI-Unet semantic segmentation model for visible and thermal image feature fusion minimizes computational complexity and speeds up real-time operation. A multi-hybrid-autoencoder is included in the architecture for individual RGB and thermal image input feature extraction and down-sampling operations, after which another feature fusion encoder combines the 4-channel feature maps for further processing. An efficient decoder is utilized to recover the feature map and compensate for feature loss during up-sampling, which reduces the number of parameters and the computational complexity. The convolutional layers use the Leaky ReLU activation function to avoid terminating backpropagation. The experimental results show that the proposed FHI-Unet model has the highest mean Intersection over Union value (43.39) and an inference speed of 83.39 FPS on the multi-spectral semantic segmentation dataset. The proposed FHI-Unet model could be a suitable approach for real-time applications on edge AI platforms.