Class Based Thresholding in Early Exit Semantic Segmentation Networks

We consider semantic segmentation of images using deep neural networks. To reduce the computational cost, we incorporate the idea of early exit, where different pixels can be classified earlier in different layers of the network. In this context, existing work utilizes a common threshold to determine the class confidences for early exit purposes. In this work, we propose Class Based Thresholding (CBT) for semantic segmentation. CBT assigns different threshold values to each class, so that the computation can be terminated sooner for pixels belonging to easy-to-predict classes. CBT does not require hyperparameter tuning; in fact, the threshold values are automatically determined by exploiting the naturally-occurring neural collapse phenomenon. We show the effectiveness of CBT on Cityscapes, ADE20K and COCO-Stuff-10K datasets using both convolutional neural networks and vision transformers. CBT can reduce the computational cost by up to 23% compared to the previous state-of-the-art early exit semantic segmentation models, while preserving the mean intersection over union (mIoU) performance.


INTRODUCTION
Deep learning is developing fast, and new state-of-the-art models are continuously being announced. The increase in the performance of state-of-the-art models is often due to increased model size [1][2][3]. Larger models can learn more complex patterns and hence achieve higher performance, but they also require more floating point operations, which means they have a higher inference cost. In an age where everything is being decentralized to run at the edge (e.g. mobile phones and IoT devices), it is important to reduce the inference cost of large models so that they can be deployed on resource-constrained devices.
In order to reduce the amount of computation without harming model performance, various methods have been used. Knowledge distillation methods train a smaller model from scratch based on the output of a larger model [4]. Quantization methods reduce the bit length of the model weights [5]. Pruning methods set redundant model weights to zero so that these weights are not used in the computations [5,6]. Early exit networks allow "easy" data samples to exit the model early to save computation [7,8]. Among these methods, early exit networks exploit the fact that real world data is heterogeneous, i.e., not all data samples have the same "difficulty". Early exit networks also have close ties with the phenomenon of neural collapse [7,9].

Fig. 1. Comparison of CBT with the previous state-of-the-art on the Cityscapes dataset for the HRNetV2-W48 model.
The neural collapse phenomenon states that as one travels deeper in a neural network, the intermediate representations of the data samples become more and more disentangled, and distinct clusters can be identified at the last layer, which makes classification easier [9]. Recent works expand on this phenomenon and show that clusters begin to form even at earlier layers [7,10], resulting in a so-called cascading collapse. In the supervised setting, each cluster corresponds to a class that the model is trained on, and the mean of the cluster is referred to simply as the class mean.
For the task of image classification, we have previously shown that one can design a low-complexity early exit network by taking advantage of the neural collapse phenomenon [7]. Specifically, a feature that is sufficiently close to a class mean at any given layer can be allowed an early exit, without significant penalty in classification performance. However, the same design is not immediately applicable to the task of semantic segmentation, since one now needs to perform pixelwise classification. In an image classification task, there is one input and it belongs to one class. Therefore, the intermediate layer outputs will be close to only one class mean, and a meaningful prediction can be made based on the distances to the class means [7]. On the contrary, in semantic segmentation, one input has many pixels that belong to many classes. Moreover, the spatial locations of the pixels matter. These properties make it infeasible to calculate the class means for the pixels using existing algorithms (e.g. [7]). Nevertheless, utilizing the neural collapse phenomenon for semantic segmentation would be particularly useful, because state-of-the-art semantic segmentation models preserve high resolution intermediate representations throughout the model, which increases the amount of computation significantly [11][12][13][14].
In this work, we propose a novel algorithm, "Class Based Thresholding (CBT)", which reduces the computational cost while preserving model performance for the semantic segmentation task. Similar to previous state-of-the-art methods, CBT employs a thresholding mechanism to allow early termination of the computation for confidently predicted pixels. CBT utilizes the neural collapse phenomenon by calculating, for each class, the mean of the prediction probabilities of pixels in the training set. The thresholds for each class are calculated via a simple transformation of the class means. We show the effectiveness of CBT on the Cityscapes [15] and ADE20K [16] datasets using the HRNetV2-W18 and HRNetV2-W48 models [11]. By efficiently utilizing the neural collapse phenomenon, CBT can reduce the computational cost by up to 23% compared to the previous state-of-the-art method while preserving model performance, as shown in Fig. 1.

CLASS BASED THRESHOLDING
We build on the state-of-the-art early exit semantic segmentation method, "Anytime Dense Prediction with Confidence Adaptivity (ADP-C)" [17]. ADP-C adds early exit layers to the base semantic segmentation model and introduces a masking mechanism, based on a single user-specified threshold value t, to reduce the computational cost. If a pixel is predicted confidently at an exit layer, i.e., its maximum prediction probability over all classes is greater than the threshold t, that pixel is masked for all subsequent layers. Any masked pixel is not processed again at later layers. The computational cost is reduced due to the induced feature sparsity.
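As a toy illustration of this masking rule, consider the following minimal NumPy sketch. The function name and array layout are our own; the actual ADP-C implementation masks feature computations inside the network, while this version operates directly on softmax outputs:

```python
import numpy as np

def update_mask(probs, mask, t=0.998):
    """ADP-C-style confidence masking (toy version).

    probs: (K, H, W) softmax probabilities at the current exit layer.
    mask:  (H, W) binary mask; 1 = pixel still active, 0 = already exited.
    A pixel whose top-class probability exceeds t is marked as
    confidently predicted and is skipped at all subsequent layers.
    """
    confident = probs.max(axis=0) > t            # (H, W) boolean
    return mask * (~confident).astype(mask.dtype)

# 2 classes on a 2x2 image: two pixels are confident, two are not
probs = np.array([[[0.999, 0.6], [0.5, 0.999]],
                  [[0.001, 0.4], [0.5, 0.001]]])
mask = np.ones((2, 2))
new_mask = update_mask(probs, mask)
```

At the next exit, only the locations where `new_mask` is 1 would be recomputed.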
Significant room for improvement in ADP-C stems from the observation that the same user-specified threshold value t is used for every class. However, it is more plausible that different threshold values should be used for different classes, and that the threshold values should reflect the dataset and class properties rather than being a single user-specified number. This is because pixels belonging to different classes have different levels of difficulty of being predicted correctly. For example, using t = 0.998 for the person class, as in ADP-C, makes sense because we may want to be very certain about pixels belonging to people. However, pixels belonging to the sky class will often be easier to predict than pixels belonging to the person class, which means the model will be confident about them much sooner. Therefore, a lower threshold value can be used for the sky class without significant penalty in prediction accuracy. Otherwise, unnecessary additional computation will be performed for the sky pixels.
Given a model trained on a semantic segmentation task with K classes, we propose using a different masking threshold value for each class, based on the dataset and class properties. Let T ∈ [0, 1]^K be the threshold vector that we wish to determine, where the k-th element T_k corresponds to class k, and k ∈ {1, 2, . . ., K}. Let there be N exit layers in the model. Let p_n denote the prediction probabilities at layer n, where n ∈ {1, 2, . . ., N}.
Our algorithm is illustrated in Fig. 2. At each exit layer n ∈ {1, 2, . . ., N}, for each class k ∈ {1, 2, . . ., K} in the training set, we calculate the mean of layer n's prediction probabilities over all training set pixels that belong to class k (this set of pixels is denoted S_k). This yields

p_n^k = (1 / |S_k|) Σ_{x ∈ S_k} p_n(x).

The i-th element of p_n^k denotes the average probability of a pixel belonging to class i when the ground truth for that pixel is class k. Next, we compute

P^k = (1 / N) Σ_{n=1}^{N} p_n^k,

which is the average of p_n^k over all layers. Hence, information across layers is shared.
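A minimal NumPy sketch of this per-class, cross-layer averaging follows; the shapes, names, and loop structure are our own assumptions, not the authors' implementation:

```python
import numpy as np

def class_mean_probs(probs_per_layer, labels, K):
    """Compute the averaged per-class probability vectors P^k.

    probs_per_layer: list of N arrays, each (K, H, W) of softmax outputs.
    labels: (H, W) ground-truth class ids.
    Returns P of shape (K, K): row k is the mean prediction probability
    vector p_n^k over the pixels S_k of class k, averaged over all N exits.
    """
    N = len(probs_per_layer)
    P = np.zeros((K, K))
    for p_n in probs_per_layer:
        for k in range(K):
            in_class = labels == k               # pixel set S_k
            if in_class.any():
                P[k] += p_n[:, in_class].mean(axis=1)
    return P / N

# toy check: one exit, 2 classes, a 1x2 image with one pixel per class
probs = np.array([[[0.9, 0.2]],
                  [[0.1, 0.8]]])                 # (K=2, H=1, W=2)
labels = np.array([[0, 1]])
P = class_mean_probs([probs], labels, K=2)
```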
We initialize the threshold T_k to be the difference between the largest and the second largest elements of P^k. This difference serves as a confidence score: if the confidence score is high, the masking threshold should be low so that the computation can terminate sooner. After all components of T are initialized in this manner, we inversely scale T according to two parameters α and β, so that the maximum and minimum class confidence scores are mapped to masking threshold values α and β respectively, where α < β. Specifically, the scaling is done via a single application of the affine update rule

T_k ← β − (β − α) (T_k − min_j T_j) / (max_j T_j − min_j T_j),

which maps the largest confidence score to α and the smallest to β. The inference is performed as follows. Let π ∈ [0, 1]^K be the prediction probabilities for a pixel at an exit layer, and let j = arg max π. If π_j > T_j, this pixel is marked as confidently predicted (predicted as class j) and is incorporated into the mask M as in ADP-C [17]. The mask has 0 at the locations of confidently predicted pixels, and 1 everywhere else. The outputs of subsequent layers at these locations are not calculated; instead, the already computed values are reused.
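The initialization and inverse scaling can be sketched as below. The affine form of the rescaling is our reconstruction from the description (the largest confidence gap maps to α, the smallest to β), so treat it as illustrative rather than the authors' exact formula:

```python
import numpy as np

def cbt_thresholds(P, alpha=0.9, beta=0.998):
    """Turn per-class mean probability vectors into thresholds T.

    P: (K, K) array; row k is the averaged probability vector P^k.
    T_k is initialized as the gap between the largest and second largest
    entries of P^k (a confidence score), then affinely rescaled so the
    most confident class gets threshold alpha and the least confident
    gets beta, with alpha < beta.
    """
    top2 = np.sort(P, axis=1)[:, -2:]            # (K, 2): 2nd-largest, largest
    T = top2[:, 1] - top2[:, 0]                  # confidence scores
    lo, hi = T.min(), T.max()
    return beta - (beta - alpha) * (T - lo) / max(hi - lo, 1e-12)

# class 0 is "easy" (big gap), class 1 is "hard" (small gap)
P = np.array([[0.95, 0.05],
              [0.60, 0.40]])
T = cbt_thresholds(P)
```

Here the easy class receives the lower threshold α, so its pixels can exit sooner.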

RESULTS
We validate the effectiveness of our method on the Cityscapes [15] and ADE20K [16] datasets using the HRNetV2-W18 and HRNetV2-W48 models [11]. We use mean intersection over union (mIoU) as our performance metric and the number of floating point operations (FLOPs) as our computational cost metric. We report performance on the validation sets.
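For reference, a minimal NumPy version of the mIoU metric is shown below; real benchmark code additionally handles ignore labels and accumulates statistics over the whole validation set:

```python
import numpy as np

def mean_iou(pred, gt, K):
    """Mean intersection-over-union over K classes.

    pred, gt: (H, W) integer class maps. Classes absent from both the
    prediction and the ground truth are skipped rather than counted as 0.
    """
    ious = []
    for k in range(K):
        inter = np.logical_and(pred == k, gt == k).sum()
        union = np.logical_or(pred == k, gt == k).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
miou = mean_iou(pred, gt, K=2)                   # (1/2 + 2/3) / 2
```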

Models
HRNet is the state-of-the-art model for the semantic segmentation task. Its intermediate outputs are kept in four different resolutions throughout the model [11]. We use HRNetV2 in our experiments, where the different resolutions are combined before producing an output. Similar to ADP-C, we use HRNetV2-W18 and HRNetV2-W48, where HRNetV2-W18 is more lightweight than HRNetV2-W48 due to the smaller number of channels used.
We attach 3 early exit layers to the HRNetV2-W18 and HRNetV2-W48 models, as in ADP-C. The exit layer structures and positions are exactly the same as in ADP-C. Each model has N = 4 exits in total. We used the pretrained models provided by the authors of ADP-C for our Cityscapes experiments. For the ADE20K experiments, we trained the models ourselves. The training is done using the weighted sum of the exit losses. Similar to ADP-C, we gave all exit losses the same weight of 1. We used a single NVIDIA RTX A6000 GPU for training. We trained the models for 200,000 iterations with a batch size of 16.
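The training objective described above (an equal-weight sum of per-exit losses) can be written as follows. This is a NumPy toy with pixelwise cross-entropy standing in for the actual segmentation loss, not the real training code:

```python
import numpy as np

def pixel_cross_entropy(probs, labels):
    """Mean pixelwise cross-entropy. probs: (K, H, W) softmax outputs,
    labels: (H, W) ground-truth class ids."""
    H, W = labels.shape
    p_true = probs[labels, np.arange(H)[:, None], np.arange(W)[None, :]]
    return float(-np.log(p_true + 1e-12).mean())

def multi_exit_loss(probs_per_exit, labels, weights=None):
    """Weighted sum of the losses over all N exits; following ADP-C,
    every exit gets the same weight of 1 by default."""
    if weights is None:
        weights = [1.0] * len(probs_per_exit)
    return sum(w * pixel_cross_entropy(p, labels)
               for w, p in zip(weights, probs_per_exit))

# perfect prediction at a single exit gives (near) zero loss
labels = np.array([[0]])
probs = np.array([[[1.0]], [[0.0]]])             # (K=2, H=1, W=1)
loss = multi_exit_loss([probs], labels)
```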

Experiments
We have evaluated CBT with numerous α-β pairs and compared it with the baseline ADP-C [17]. We kept β = 0.998 in all our experiments for a fair comparison, because ADP-C achieves its best performance with t = 0.998, as stated in [17]. As shown in Tables 1 and 2, we also used a limited and uniform set of values for α, namely α ∈ {0.7, 0.8, 0.9, 0.95, 0.99}, so as to avoid excessive hyperparameter tuning.
The mIoU and GFLOPs at the first exit do not differ between CBT and ADP-C, because the masking procedure only starts there. Until the first exit, the mask consists of all 1's, so all pixels are operated on. According to the prediction probabilities at Exit 1, the mask changes for the deeper layers.
Looking at Table 1, it can be seen that CBT [0.99, 0.998] decreases the computational cost by 23% while losing only a negligible amount of mIoU, as seen in Fig. 3.
From Table 2, we see that CBT can save computational cost on the ADE20K dataset too, which has significantly more classes than Cityscapes. More specifically, CBT [0.90, 0.998] decreases the computational cost by 6% while losing only 0.97 mIoU for HRNetV2-W48. The reason why the performance at the first three exits is low for both ADP-C and CBT is that the model cannot perform well enough due to the large number of classes. It needs significantly more computation (e.g. 94.31 GFLOPs instead of 15.07) to achieve better performance. This is also why CBT cannot reduce the computational cost on ADE20K as much as it does on the Cityscapes dataset. Another interesting observation is that when the model size is small, CBT can increase the performance by up to 22% compared to ADP-C, even when α is as low as 0.7. We believe this is because when the model is small and the dataset is complex, the model cannot reach high confidence for the majority of the pixels even though it could predict them correctly. Relaxing the threshold helps the model commit to these correct predictions.

CONCLUSION
We proposed a novel algorithm that utilizes the naturally occurring neural collapse phenomenon to reduce the computational cost of early exit semantic segmentation models. Experimental results on different datasets and models suggest that our method is effective in reducing the computational cost without significant penalty in model performance.

Fig. 2. Overview of Class Based Thresholding for N exit layers and K classes. Best viewed in color with zoom.
