Recursive Visual Explanations Mediation Scheme Based on DropAttention Model With Multiple Episodes Pool

In some DL applications such as remote sensing, it is hard to obtain high task performance (e.g., accuracy) with a DL model for image analysis due to the low-resolution characteristics of the imagery. Accordingly, several studies attempted to provide visual explanations or apply the attention mechanism to enhance the reliability of the image analysis. However, structural complexity still remains in obtaining a sophisticated visual explanation with such existing methods: 1) from which layer should the visual explanation be extracted, and 2) to which layers should the attention modules be applied. 3) Subsequently, in order to observe the aspects of visual explanations over such diverse episodes of applying attention modules individually, training cost inefficiency inevitably arises, as the conventional methods require training multiple models one by one. To solve these problems, we propose a new scheme that mediates the visual explanations at the pixel level recursively. Specifically, we propose DropAtt, which generates a pool of multiple episodes by training only a single network once as an amortized model and shows stable task performance regardless of the layer-wise attention policy. From the multiple episodes pool generated by DropAtt, by quantitatively evaluating the explainability of each visual explanation and recursively expanding the parts of explanations with high explainability, our visual explanations mediation scheme adjusts how much to reflect each episodic layer-wise explanation so as to enforce the dominant explainability of each candidate. In the empirical evaluation, our methods show their feasibility for enhancing visual explainability by reducing the average drop by about 17% and increasing the rate of increase in confidence by about 3%.

As the visual explanations derived by CAM-based methods differ according to the target layer from which they are extracted, the human user is burdened with selecting which explanation to trust for making a final decision.
Moreover, several studies [7], [8] attempted to improve task performance by reflecting the attention map back into the task path, and their qualitative results show that a more sophisticated visual explanation can also be derived by applying the attention modules. However, such attentions also show different aspects according to the layer-wise policy of applying attention modules (apply or not), and even the visual explanations derived from layers where the attention modules are not applied are modified by the other attention-applied layers during feed-forward and back-propagation, which makes it even more complex to choose a reliable explanation from such various episodes for making a final decision. In this paper, we denote an episode as one of the possible cases of applying attention modules to the layers of the model.
To resolve this difficulty of choosing a reliable explanation from conflicting explanations that vary with target layers and layer-wise attention policies, we propose a series of methods that integrate such complex explanations into a single explanation at a human-manageable level. The key concept of the proposed methods is to integrate the explanations from multiple episodes by reflecting a partial region of each episodic explanation, where the regional reflecting ratio is adjusted based on two quantitative explainability indicators. However, generating multiple episodes of applying attention modules requires training several models one by one from scratch, which results in an enormous time-consuming overhead.
To realize this practically, we address three main technical issues.
• First, we propose DropAtt to enable handling of various attention episodes that vary with the layer-wise policy of applying attention modules (apply or not) in a training-cost-effective way, allowing several attention-variant episodes to be generated by training a single amortized network once, while maintaining consistent task performance over the multiple episodes.
• In order to ensure consistency between the task path and the explanation integration path, we also construct an adversarial game to search for the initial settings for mediating the explanations, in which a generator that integrates explanations so as to pass as the task path and a discriminator that distinguishes the task path from the explanation integration path compete with each other.
• From such initial settings for mediating the explanations, the episodic layer-wise explanations that show higher explainability on either of the two indicators (multi-disciplinary) are selected as debating candidates, and the regional reflecting ratio of each debate candidate is adjusted incrementally to mediate the conflicts among complementary explanations, which induces improvement of both multi-disciplinary explainability indicators.
To evaluate the feasibility of the proposed methods, we conducted an experimental evaluation on the satellite imagery dataset, and it is shown that the proposed explanation mediating scheme enhances the visual explainability.
In the following sections, the backgrounds of visual explanation and the attention mechanism are presented in Section II. The problem descriptions are delivered in Section III, and the details of the proposed method for solving the problems are presented in Section IV. Finally, an empirical evaluation of the proposed method is addressed in Section V.

II. RELATED WORK
A. VISUAL EXPLANATION
As a method for deriving a visual explanation of the predictions of a task model, quantifying and representing the importance of each pixel in the input image that contributed to the prediction process is a common approach [9]. One of the prevalent methods is to observe changes in the predictions of the task model by applying perturbation or occlusion to the input [10], [11].
As a more convenient way of deriving the visual explanation, it has been widely shown that the activation map of a convolutional layer can localize objects even though the network is trained only on image-level classification, with the help of characteristics of weakly supervised learning [12], [13]. Using such characteristics, the class activation map (CAM) [14] attempts to represent the pixel-wise distinct regions for identifying each class. However, such a structure has restrictions: global average pooling (GAP) must be applied to the last convolutional layer, and the CAM is basically derived only at the last layer.
To make up for such limitations, gradient-weighted class activation mapping (Grad-CAM) [6] produces a localization map emphasizing the pixels in the image that are important for the task prediction by weighting the activation map with its pixel-wise gradients. Extended from Grad-CAM, Grad-CAM++ [15] uses a weighted combination of the positive partial derivatives of the last convolutional layer feature maps, and Ablation-CAM [16] applies ablation analysis to estimate pixel-wise class importance. Moreover, several studies attempt to deliver interpretable explanations in various application domains [17], [18].
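As a concrete reference for the Grad-CAM computation described above, the following is a minimal PyTorch sketch (our illustration, not code from the cited works); the hook-based capture and the function name grad_cam are our own choices.

```python
# Illustrative sketch (not from the cited works): Grad-CAM at one target layer.
# Each activation channel is weighted by its spatially averaged gradient with
# respect to the class score, then the weighted sum is passed through ReLU.
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(x)[:, class_idx].sum()
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    a, g = acts["a"], grads["g"]                   # (B, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)     # channel weights from averaged gradients
    cam = F.relu((weights * a).sum(dim=1))         # (B, H, W)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)   # normalize to [0, 1]
    return cam
```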
However, the visual explanations extracted by such CAM-based methods show different aspects depending on the layer from which the explanation is extracted. In practice [19], the explanations extracted from the rear layers of the task model can largely represent the information of the adjacent final prediction layer, but the precision of localization is degraded due to pooling through the several previous layers. On the contrary, the explanations extracted from the front layers show relatively high localization precision, but redundant information is also included, as the information of the final prediction layer fades out while passing through the rear layers. In an attempt to consider such different aspects of visual explanations according to the target layers, LayerCAM [19] proposed a method of aggregating the visual explanations (local explanations) extracted from multiple layers into one global explanation, but a limit of containing redundant information still remains due to its simple fusion method, which just conducts an element-wise maximization among local explanations.

B. ATTENTION MODULE
Similar to CAM, attention plays an important role in human perception [20], [21], [22]: the human visual system does not process the whole scene at once but focuses on salient parts selected from a series of partial glimpses. Based on this mechanism, several studies commonly attempt to improve task performance by adding a path that generates an attention map in parallel to a layer of the task network and reflecting it back into the task path.
As a representative method, the residual attention network [23] introduces an attention module in the form of an encoder-decoder. CBAM [8] does not compute a 3D attention map directly but separates it into channel attention and spatial attention, and reflects channel-wise and spatial-wise attentive analysis by applying each attention sequentially. Moreover, the attention branch network [7] constructs an attention branch that consists of an attention generating path and an attention-based task inferencing path, and trains the whole network with a loss function in which both the attention branch and the perception branch are bound together.
Such attention modules can also extract attention maps that replace the visual explanations of CAM-based methods, but since class-wise distinguishing features like CAM cannot be extracted except at the last layer, either only the attention maps from the last layer are extracted, or CAM-based methods are orthogonally applied to the other layers to extract visual explanations for each predicted label [7], [8], [24]. However, looking only at the explanations of the last layer lacks precise localization due to the pooling over the several layers, while the visual explanations extracted from the front layers show relatively severe noise but highly precise localization as a trade-off [19].
Moreover, different aspects of visual explanation are also acquired depending on how the attention module is applied to each layer as well as on the target layers, and computational inefficiency arises when observing each aspect individually, as it requires training multiple models one by one as shown in Fig. 1. We address this issue in detail in Section III.

C. QUANTITATIVE METRICS OF VISUAL EXPLAINABILITY
Aforementioned studies [15], [16], [25] on CAM-based visual explanation not only observe the qualitative results but also attempt to evaluate their feasibility quantitatively. Average drop in percentage is introduced to evaluate how much the highlighted regions of the visual explanation contribute to the confidence of the prediction from the model. If the visual explanation correctly emphasized the regions important for making the decision, passing only the highlighted regions (removing the others) should mostly yield a lower drop in confidence of the prediction, and a lower drop in confidence represents better visual explainability. On this basis, the average drop is calculated as the ratio of the drop in confidence of the predictions when the original input image is occluded (redundant regions removed) with the derived visual explanation, averaged over all target input instances.
FIGURE 1. For the conventional methods of attention mechanism [7], [8], development of the attention-based DL model requires a manual decision for adopting the attention module. The attention modules are generally applied to all layers or only to a certain layer in a fixed form, so the model can only reflect a single episode of attention; in order to observe multiple episodes of attention, the heavy overhead of training a separate model for each attention episode is inevitably required.
Moreover, in order to evaluate such interpretability more accurately, remove and retrain (ROAR) [26] measures how much the accuracy decreases when the model is retrained on input data in which a certain ratio of unimportant pixels is gradually removed, which requires heavy retraining overhead instead.
On the contrary, good visual explanations can also encourage the model to concentrate on only the discriminative important regions, which can rather result in an increase in the confidence of the prediction when only the highlighted regions of the image are given as input [15]. Accordingly, the rate of increase in confidence measures how frequently an increase in the confidence of the prediction occurs when passing only the highlighted regions of input images, and a higher rate represents better explainability.
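To make the two metrics concrete, here is a minimal sketch (our illustration under the definitions above, not code from the cited works) that computes the average drop and the rate of increase in confidence for one batch; masking by element-wise multiplication with the normalized explanation map is one common choice and is an assumption here.

```python
# Illustrative sketch (not from the cited works): average drop and rate of
# increase in confidence for a batch of explanations.
import torch

@torch.no_grad()
def explainability_metrics(model, images, cams, class_idx):
    """images: (B,3,H,W), cams: (B,H,W) in [0,1], class_idx: (B,) target labels."""
    conf_full = torch.softmax(model(images), dim=1)[torch.arange(len(images)), class_idx]
    # Keep only the highlighted regions, remove (zero out) the rest.
    masked = images * cams.unsqueeze(1)
    conf_mask = torch.softmax(model(masked), dim=1)[torch.arange(len(images)), class_idx]
    # Average drop (%): lower is better.
    avg_drop = (torch.clamp(conf_full - conf_mask, min=0) / conf_full).mean() * 100
    # Rate of increase in confidence (%): higher is better.
    rate_inc = (conf_mask > conf_full).float().mean() * 100
    return avg_drop.item(), rate_inc.item()
```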
Furthermore, various other metrics have been introduced; for example, a sanity check on saliency maps is introduced to measure how sensitive they are to the model parameters [27]. One study [28] attempts to evaluate how consistently the explanation is derived among data in a model-agnostic way. Another study [29] introduces the fidelity and sensitivity of explanations, which measure how similar the explanation is to the impact of various pixel-wise perturbations and how sensitive it is to pixel-wise perturbation.
Among such various metrics, we adopt average drop and rate of increase in confidence as the main criteria for evaluating the quality of the derived visual explanations.

III. PROBLEM DESCRIPTION
A. COMPLEXITY ON VISUAL EXPLANATIONS OVER MULTI-LAYERS
As described above, CAM-based methods [6], [14], [15] show their feasibility for providing a class activation map as a visual explanation that distinguishes object classes pixel by pixel on input images. However, such CAM-based methods are mainly applied to the last layer of the task model, which shows poor localization precision as the signal propagates through pooling over the several layers. Moreover, some methods like Grad-CAM can also be applied to various layers, but different aspects of visual explanations are then extracted over the different target layers, which rather increases the difficulty of making a final decision by forcing the human user to select a reliable explanation among the several candidates [27], [30].
To identify this problem in practice, we extract and observe the results of Grad-CAM on the 4 residual blocks of a ResNet-18 trained on the land use classification dataset of satellite imagery [31], and the top row of Fig. 2 shows the corresponding results. As shown in the results, the visual explanations from the first and second layers are inconsistent with the visual explanations from the third and fourth layers. The explanations extracted from the rear layers show relatively poor localization performance due to the spatial pooling over the several layers; on the contrary, the explanations extracted from the front layers show high localization precision but also include a relatively large amount of redundant information.
Moreover, in order to quantitatively identify such characteristics, we also examined the average drop in confidence and the rate of increase in confidence, which are widely used as quantitative metrics [15], [16], [25] for evaluating the explainability of a derived visual explanation, on the same dataset and network, and the results are shown in Table 1. As shown in the table, ResNet-18 (without any attention modules) shows a relatively large drop in confidence across all target layers, and in particular, the average drops in confidence on the first and second layers (residual blocks) are inferior to those on the subsequent layers. In other words, as the visual explanations show different aspects according to the target layers, in order to provide a reliable explanation to human users, it is required to weigh the trade-off between the explanations to remove the redundant information and to integrate them into a single global explanation by selecting only the information that practically contributed to the prediction.

B. VARIOUS VISUAL EXPLANATIONS EXTRACTED OVER MULTIPLE EPISODES
In addition, as aforementioned in Section II-B, visual explanations show different patterns according to the layer-wise policy (i.e., episode) of applying attention modules (apply or not) as well as the target layers; therefore it is also necessary to consider the diversity of such attention episodes in the process of extracting a single integrated explanation. In order to observe this problem, in addition to the previous experimental results, we also observe the aspects of the visual explanations when the attention modules are applied to specific layers. We add ABN [7] as attention modules to the first and second residual blocks of the previous ResNet-18, train it with the same dataset, and extract Grad-CAM results at each layer. The corresponding qualitative and quantitative results are given in Fig. 2 (bottom row) and Table 1, respectively.
FIGURE 3. CAM-based methods [6], [15] and attention modules [7], [8] contain the diversity of which layers to apply, which results in difficulty in making a final decision as conflicts among such various explanations may occur. Accordingly, we propose a new scheme of mediating various explanations by selecting and aggregating the complementary parts into a single integrated form.
As shown in Fig. 2, by adding the attention modules, the visual explanations (bottom row) at each layer show different aspects overall compared to the previous results (top row). In particular, even the visual explanation extracted from a layer where the attention module is not applied is affected by the other attention-applied layers. The quantitative results in Table 1 show the corresponding result that the explainability on the attention-unapplied layers (third and fourth blocks) is affected (changed). Moreover, as shown in the results, the explainability can be improved on some layers by applying the attention module, but no distinct tendency between the improvement of explainability and the layer-wise policy (i.e., episode) of applying attention modules is observed. For example, in the results, the average drop is improved at the third layer but the rate of increase in confidence is degraded, while the fourth layer shows the opposite result.
Accordingly, in order to provide higher explainability, it is required to weigh the trade-offs among the various explanations extracted from the various attention episodes as well as target layers through quantitative criteria, to remove the redundant information, and to integrate them into a single global explanation by selecting only the information that practically contributed to the prediction.

C. TRAINING COST INEFFICIENCY FOR GENERATING VARIOUS ATTENTION EPISODES
However, as the attention modules [7], [8] are generally applied to the model in a fixed form as shown in Fig. 1, acquiring multiple episodes of attention modules inevitably requires the heavy computational overhead of training multiple episodic networks one by one. Alternatively, one may consider simply training the super network (i.e., the model with attention modules applied to all layers) and then adaptively applying layer-wise attention modules from the super network in the inference phase, but this results in a critical deviation in task performance when generating multiple episodes.
To identify the problem, we train the super network in practice by adding ABN attention modules to all residual blocks of ResNet-18 with the same dataset, and adaptively remove attention modules from the trained super network according to the episode in the inference phase. Table 2 shows the task performance (top-1 accuracy) with regard to the various episodes. As shown in the results, the task performance drops by up to 40.95% when generating multiple episodes adaptively from the super network, which results in the problem that no meaningful visual explanation can be acquired unless normal prediction performance is maintained.
Therefore, prior to considering the various attention episodes for extracting a sophisticated explanation, in order to acquire the multiple episodes in a training-cost-effective way, a new training method or model structure that can generate various attention episodes while showing stable task performance by training only a single model is required first.

IV. A PROPOSED MODEL
To solve these problems, we propose a scheme of explanation mediation that selects only the complementary parts of each visual explanation from various target layers and attention episodes and integrates them into a single sophisticated explanation, as shown in Fig. 3. Specifically, we propose a new layer architecture (DropAtt) that generates multiple episodes while maintaining stable task performance by training a single amortized model, and we also propose a recursive mediation method that identifies and aggregates the parts of each explanation that enhance the visual explainability from the multiple attention episodes pool produced by DropAtt. The details of each component of our method are presented as follows.

A. DropAtt: GENERATING THE MULTIPLE EPISODES OF ATTENTION MODULES FROM AN AMORTIZED MODEL
First, to overcome the computational inefficiency of training multiple models one by one to generate multiple episodes, which is caused by the existing attention-module methods [7], [8], we propose a new layer architecture that generates multiple attention episodes while maintaining stable task performance by training only a single amortized model.
As aforementioned in Section III, in the case of training the super network (i.e., the model with attention modules applied to all layers) and then adaptively applying layer-wise attention modules from the super network in the inference phase, since the gradient is divided into two backward paths (the original task path of the network and the path for the attention module) when training the network, task information is partly reflected in the attention module. Therefore, the task performance is vulnerable to degradation when an attention module is adaptively removed from a super network that was trained by applying attention modules to all layers in a fixed form.
To overcome such vulnerability, similar to the mechanism of dropout, as a method to ensure consistent task performance over multiple episodes from a single amortized network, we propose DropAttention (DropAtt), which can reflect all possible episodes of applying attention modules in the training phase, so that multiple episodes can be generated while maintaining stable task performance in the inference phase. Fig. 4 shows how the proposed DropAtt works on each layer of the convolutional neural network. As shown in Fig. 4, the proposed DropAtt is applied to each layer of the task network, and the attention module on each layer is randomly applied in the training phase by DropAtt. With the layer-wise attention gating (0/1) random variable $z_l \sim \mathrm{Bernoulli}(p)$, which determines whether to apply the attention module $AB_l(\cdot)$ on the $l$-th layer or not, the feed-forward computation of DropAtt is represented as follows:
$F'_l = F_l + (1 - z_l)\, AB_l(F_l)$, (1)
where $p$ denotes the probability of not applying the attention module (sampled for every data sample), and $F_l$, $F'_l$ denote the input and output of the attention module at the $l$-th layer. In the training phase, the gradient at the input path of the $l$-th layer is calculated as follows:
$\frac{\partial F'_l}{\partial \theta} = \frac{\partial F_l}{\partial \theta} + (1 - z_l)\, \frac{\partial AB_l(F_l)}{\partial \theta}$, (2)
therefore, the layers of the task network can be mainly trained by $\partial F_l / \partial \theta$, while being robust to the randomly passed gradients from the attention paths. In other words, different from the existing attention-module methods [7], [8] that train the model in a fixed form, the proposed DropAtt regularizes the overfit to a specific layer-wise attention by invoking randomness in applying the layer-wise attention modules.
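The following is a minimal sketch of how such a gated layer could be implemented, matching the feed-forward form of Equation (1); the residual-style use of the attention branch output and all names are our assumptions, not the authors' released code.

```python
# Illustrative sketch (our assumption of the DropAtt gating, not the authors' code):
# during training, the attention branch of each layer is dropped at random with
# probability p; at inference, an episode is fixed by setting the gate explicitly.
import torch
import torch.nn as nn

class DropAtt(nn.Module):
    def __init__(self, attention_module: nn.Module, p: float = 0.5):
        super().__init__()
        self.attention = attention_module   # e.g. an ABN-style attention branch AB_l
        self.p = p                          # probability of NOT applying the attention
        self.gate = None                    # set to 0/1 at inference to fix an episode

    def forward(self, feat):
        if self.training:
            z = torch.bernoulli(torch.tensor(self.p, device=feat.device))
        else:
            z = torch.tensor(float(self.gate if self.gate is not None else 0.0),
                             device=feat.device)
        # z = 1: skip the attention branch (identity); z = 0: add its output.
        return feat + (1.0 - z) * self.attention(feat)
```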
In practice, in order to identify the feasibility of DropAtt, we conducted an empirical evaluation of whether the proposed DropAtt maintains task performance regardless of the layer-wise policy of applying attention modules. In the evaluation, ResNet-18 and the satellite imagery land use classification dataset [31] are targeted, and ABN [7] is adopted as the attention module. As shown in Table 2, the task performance (accuracy) drops by up to 40.95% among the various attention episodes from the super network trained by applying attention modules to all layers in a fixed form, but when we train the amortized network using the proposed DropAtt, the accuracy over the various attention episodes remains consistently stable (less than 1% change at maximum). In addition, the task performance of the amortized network itself is also slightly improved by using DropAtt.
Based on the amortized network that can adaptively apply the attention module on each layer and generate multiple episodes of attention through DropAtt, the following subsections deal with the problem of how to integrate visual explanations that show different characteristics/levels over various attention episodes and target layers.

B. EXPLANATION CONSISTENCY VIA CLASS-WISE FEATURE DISCRIMINATOR
Various characteristic visual explanations can be obtained from the amortized model with the help of the proposed DropAtt, and different characteristic visual explanations can also be acquired from different target layers. However, as aforementioned, since the existing CAM-based methods [6] extract different aspects of visual explanations according to the target layers and attention episodes, it is required to weigh the trade-off between the explanations to remove the redundant information, and to integrate them into a single global explanation by selecting only the information that practically contributed to the prediction.
Accordingly, in order to turn these various episode-variant and layer-variant visual explanations into one integrated explanation, we propose a scheme for explanation mediation in which the regional reflecting ratio for each episodic layer-wise explanation is adjusted differently according to the degree of multi-disciplinary explainability of each explanation, and the several episodic layer-wise explanations are synthesized into a single integrated explanation by reflecting each allocated regional ratio of explanation, as shown in Fig. 5.
For practicality of the explanation, in order that the integrated explanation carries information consistent with the predictions of the task path, we induce this by constructing mutual competition between a generator that derives the regional reflection ratios in the explanation path and a discriminator that tries to distinguish the fake predictions of the explanation path from the real predictions of the task path.
Specifically, in the explanation path, each regional reflecting ratio ($0 \le \rho^{(e,l)} \le 1$) for the explanation obtained at the $e$-th attention episode and $l$-th layer is derived by feeding the layer-wise feature map ($z \sim G_z(x)$) as input to the parameterized generator function ($G_\rho(\cdot)$). As one method for selecting the pixel-wise principal region from the episodic layer-wise explanation ($LE^{(e,l)}$), we propose, as an example, to filter out only the top $\rho^{(e,l)} \cdot 100\%$ valued pixels in a masking format (0/1) (we denote the function of this procedure as $R(LE^{(e,l)}, \rho^{(e,l)})$). Accordingly, the integration of several episodic layer-wise explanations is conducted by summing only the allocated regions of each explanation:
$\hat{LE} = \sum_{(e,l)} R(LE^{(e,l)}, \rho^{(e,l)}) \odot LE^{(e,l)}$. (3)
As the existing CAM-based methods [6] only extract the explanation for a specific episode at a certain layer, the extracted visual explanations are vulnerable to containing the redundant information of that episode. On the other hand, as the proposed mechanism of integrating explanations can block the inflow of redundant information from each episodic explanation by adjusting the regional reflection ratio of each explanation individually, by searching the proper value of the reflection ratio recursively, the proposed method can extract a more sophisticated single explanation that aggregates only the parts of the explanations that practically contributed to the prediction.
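A minimal sketch of the top-ρ masking R(·, ·) and the integration of Equation (3) could look as follows (our illustration; the data structures and function names are assumptions):

```python
# Illustrative sketch (not the authors' code): top-rho masking R(LE, rho) and
# integration of episodic layer-wise explanations by their reflection ratios.
import torch

def top_rho_mask(le, rho):
    """le: (H, W) explanation map, rho: scalar in [0, 1]; returns a 0/1 mask
    keeping only the top rho*100% highest-valued pixels."""
    if rho <= 0:
        return torch.zeros_like(le)
    k = max(1, int(rho * le.numel()))
    thresh = torch.topk(le.flatten(), k).values[-1]
    return (le >= thresh).float()

def integrate(explanations, ratios):
    """explanations: dict {(episode, layer): (H, W) map}; ratios: same keys."""
    keys = list(explanations)
    out = torch.zeros_like(explanations[keys[0]])
    for key in keys:
        le, rho = explanations[key], ratios[key]
        out = out + le * top_rho_mask(le, rho)   # sum only the allocated regions
    return out
```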
Based on the integrated explanation, the feature score ($\sum_{i,j} \hat{LE}^{c}_{i,j}$) for estimating each object class can be quantified by summing up the pixel-wise activated values in the explanation ($\hat{LE}^{c}$) for each class ($c$), and a relative prediction probability for a particular class can be calculated by applying softmax over the feature scores of the classes (denoted as $S_e(x, G_\rho(z)) = \frac{\exp(\sum_{i,j} \hat{LE}^{c}_{i,j})}{\sum_{c'} \exp(\sum_{i,j} \hat{LE}^{c'}_{i,j})}$). Likewise, the object class prediction probability in the task path can also be quantified in a softmax form by summing up the class-variant outputs ($f(Y=c\,|\,x,e)$) over the different network episodes (denoted as $S_t(x) = \frac{\exp(\sum_{e} f(Y=c\,|\,x,e))}{\sum_{c'} \exp(\sum_{e} f(Y=c'\,|\,x,e))}$). On this basis, by configuring a parameterized discriminator ($D(\cdot)$) that distinguishes the real class prediction ($S_t(x)$) in the task path from the fake prediction ($S_e(x, G_\rho(z))$) in the explanation path, we can construct the zero-sum two-player ($G_\rho$, $D$) game of the following Lemma 1.

Lemma 1: (Adversarial loss for explanation consistency)
The integrated explanation that is consistent with the estimated output of the task path can be searched for by solving the zero-sum two-player ($G_\rho$, $D$) minimax game, where the adversarial loss is constructed as:
$\min_{G_\rho} \max_{D} \; \mathbb{E}_{x}[\log D(S_t(x))] + \mathbb{E}_{z \sim G_z(x)}[\log(1 - D(S_e(x, G_\rho(z))))]$. (4)
Proof: Equation 4, in which the discriminator $D$ and the generator $G_\rho$ compete with each other, constructs adversarial training, and the solution of such a zero-sum two-player game is obtained at the Nash equilibrium, where the discriminator cannot distinguish the generations of the generator network from the real distribution. □
To solve the corresponding two-player minimax game, as shown in Algorithm 1, the discriminator is first updated from a sampled minibatch of predictions in the task path and a sampled minibatch of inputs for the generator, and then the generator is updated from another sampled minibatch to fool the discriminator. By conducting such update steps iteratively, we can derive the initial regional reflection ratio for each episodic layer-wise explanation, which induces an initial integrated explanation carrying information as consistent as possible with the task predictions.
TABLE 3. Comparison of task performance predicted by using only the derived explanation itself (using $S_e(\cdot)$).
Table 3 shows the task performance (top-1 accuracy) of predicting the task by the derived explanation itself as $S_e(\cdot)$, which shows how much consistent information for the task prediction the derived explanation contains. The hyper-parameter settings for training are the same as in the previous evaluation in Table 2, and the class-wise prediction score is calculated by averaging over all layer-wise explanations for the 5 methods: Grad-CAM, ABN, DropAtt, DropAtt + Discrim., and Whole Proposed Scheme (Algorithm 1) (the detailed settings of each method are described in Section V-C). As shown in the results, the proposed DropAtt itself can derive a more task-consistent explanation than applying Grad-CAM [6] to the network without attention (Grad-CAM) or the visual explanation derived from the network with ABN [7] in a fixed form (ABN). Moreover, applying the proposed class-wise feature discriminator (and generator) with DropAtt further improves the task consistency of the integrated explanation, and our final Algorithm 1 shows the highest task consistency of the derived explanation.
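For reference, the alternating update of the discriminator and the ratio generator described above could be sketched as follows; this is our illustration under the settings of Section V (single fully connected layers), and the single-probability discriminator output and the helper s_expl_fn that assembles S_e from the integrated explanation are hypothetical simplifications.

```python
# Illustrative sketch (not the authors' code): one alternating update of the
# discriminator D and the ratio generator G_rho for explanation consistency.
# s_expl_fn is a hypothetical callable that builds S_e(x, G_rho(z)) from the
# integrated explanation; D is assumed to output a single real/fake probability.
import torch
import torch.nn as nn

bce = nn.BCELoss()

def adversarial_step(D, G_rho, d_opt, g_opt, z, s_task, s_expl_fn):
    real = torch.ones(len(z), 1)
    fake = torch.zeros(len(z), 1)
    # 1) Discriminator update: real task-path predictions vs. explanation-path predictions.
    rho = torch.sigmoid(G_rho(z))                      # regional reflection ratios in [0, 1]
    d_loss = bce(D(s_task), real) + bce(D(s_expl_fn(rho).detach()), fake)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # 2) Generator update: make the explanation-path prediction fool the discriminator.
    rho = torch.sigmoid(G_rho(z))
    g_loss = bce(D(s_expl_fn(rho)), real)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```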

C. RECURSIVE MEDIATION OF EXPLANATIONS FROM DEBATING CANDIDATES
As a method of combining the various explanations, we proposed filtering out only a certain ratio of pixels from each layer-wise episodic explanation to drop the redundant features. However, such a filtering process carries the risk of losing meaningful features as well. Accordingly, in order to minimize such risk, we also constructed an additional step that searches for the appropriate regional reflection ratio for each explanation by identifying whether the pixel-wise features can enhance the explainability, through recursively incrementing the regional reflection ratio and evaluating the explainability. Therefore, different from conventional methods like Grad-CAM that only extract a specific explanation biased toward a certain target layer and attention episode, our new framework of mediating explanations generates a more sophisticated explanation by integrating only the positive aspects of the explanations from various target layers and attention episodes.
The term ''debate'' used in this paper indicates the complementary conflicts among the various episodic layer-wise explanations in terms of multi-disciplinary explainability. Specifically, the debating candidates for deriving the integrated explanation are selected based on two multi-disciplinary explainability indicators (average drop and increase in confidence), which are widely used as quantitative metrics to measure the explainability of various visual explanations [15], [16], [25]. The first explainability indicator considered in this work is the remaining confidence percentage (RCP), which is defined as the average ratio of remaining confidence on a label ($c$) when inferencing ($f(\cdot)$) with the augmented input in which the original image ($x$) from the dataset ($D$) is masked by the visual explanation ($LE$) derived in a certain $e$-th episode (the same concept as the average drop in confidence, but a higher RCP value means higher explainability):
$RCP(LE) = \frac{1}{|D|} \sum_{x \in D} \frac{f^{c}(x \cdot R(LE, \rho_D))}{f^{c}(x)}$, (5)
where $\rho_D$ denotes the regional ratio for passing the principal values in this discipline. The second explainability indicator is the confidence increase ratio (CIR), which is defined as the ratio of the events in which the confidence derived from the masked input is higher than the confidence inferred from the original input:
$CIR(LE) = \frac{1}{|D|} \sum_{x \in D} \mathbb{1}[\, f^{c}(x \cdot R(LE, \rho_D)) > f^{c}(x) \,]$. (6)
Based on these two disciplinary indicators, if an explanation dominant in both indicators is found, the regional reflecting ratio for that explanation can be set to 100%, but in most cases only conflicting explanations appear in terms of the explainability indicators, such as high RCP but low CIR, or high CIR but low RCP.

(Algorithm 1, lines 17-21: if $\hat{LE}'$ improves RCP or CIR, then $\rho^{(e,l)} \leftarrow \rho'^{(e,l)}$ and $\hat{LE} \leftarrow \hat{LE}'$; end if; end for; end for.)
Accordingly, prior to integrating explanations, the process of selecting debating candidates is conducted by selecting only the episodic layer-wise explanations that show dominant explainability on either of the two disciplines (RCP, CIR) over the current integrated explanation ($\hat{LE}$), as shown in Algorithm 1, and then the regional reflecting ratios are mediated among the debating candidate set to improve all the explainability metrics together.
The procedure of mediating the various explanations to derive the integrated explanation ($\hat{LE}$) from the selected debating candidates consists of incrementally adjusting the regional reflecting ratio of each debating candidate, and adopting such an adjustment only when the newly integrated explanation from the mediation shows improvement on either of the two disciplinary explainability indicators without any degradation. By conducting this process iteratively for the various debate candidates and attempting to mediate among them, it is intended that candidate explanations dominant in a specific discipline (RCP or CIR) can complement the integrated explanation within the bounds of not exacerbating the weaker discipline (explainability).
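The recursive mediation loop described above can be sketched as follows (our illustration of the acceptance rule; the callables integrate, rcp_fn, and cir_fn stand for Equation (3) and the two indicators and are assumptions):

```python
# Illustrative sketch (not the authors' code): recursive mediation over the
# debating candidates, accepting a small ratio increment only when it improves
# RCP or CIR without degrading either (cf. Algorithm 1).
def mediate(candidates, ratios, integrate, rcp_fn, cir_fn, eps=0.05, rounds=10):
    """candidates: dict {(episode, layer): explanation map}; ratios: same keys;
    integrate, rcp_fn, cir_fn are callables evaluating the integrated explanation."""
    best = integrate(candidates, ratios)
    best_rcp, best_cir = rcp_fn(best), cir_fn(best)
    for _ in range(rounds):
        for key in candidates:
            trial = dict(ratios)
            trial[key] = min(1.0, trial[key] + eps)      # increment one candidate's ratio
            cand = integrate(candidates, trial)
            rcp, cir = rcp_fn(cand), cir_fn(cand)
            # Adopt only if at least one indicator improves and neither degrades.
            if (rcp >= best_rcp and cir >= best_cir) and (rcp > best_rcp or cir > best_cir):
                ratios, best = trial, cand
                best_rcp, best_cir = rcp, cir
    return best, ratios
```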
In an ideal case, as stated in Lemma 2, the confidence derived from the input masked by a visual explanation can be enhanced by further reflecting the partial regions of another explanation in which the dominant explainability is emphasized, and the RCP or CIR of the updated explanation can be improved according to the dominant explainability of the candidate.
Lemma 2: (Confidence enhancement via adding discriminative regions) The confidence derived from the input masked by a visual explanation can be enhanced by further reflecting the partial regions of another explanation in which the dominant explainability is emphasized.
Proof: From the assumptions, we can derive the corresponding chain of inequalities, where Equation 9 can be derived from $\sum_{i,j} R(LE_A, \rho_A) \cup R(LE_B, \rho_B) \le \rho_D \, HW$, and Equation 10 is equal to $f(x \cdot R(LE_A, \rho_D))$ as $\rho_A < \rho_D$. □
Based on this theoretical basis, we attempt to synthesize a new integrated explanation in the form of extending the partial regions of dominant explainability of the complementary explanation candidates by debating the trade-off between each pair ($\hat{LE}$, $LE^{(e,l)}$). In the real-world environment, as the dominant explainability of the previously integrated explanation is likely to be diluted in the mediation procedure by extending the reflection ratio of a debate candidate excessively, the mediation is conducted by adjusting the regional reflecting ratio of each debate candidate by a small increment ($\epsilon$) but iteratively.
In an ideal case, by conducting these searching and explanation-mediating procedures iteratively, as stated in Theorem 1, an integrated explanation that converges to the maximum value of RCP and CIR within the tangible range ($\forall LE^{(e,l)}$) can be derived.
Theorem 1: (Convergence of the mediated explanation) Under the precondition of Lemma 2, the iteratively mediated explanation $\hat{LE}$ at least converges to
$RCP(\hat{LE}) \rightarrow \max_{\forall (e,l)} RCP(LE^{(e,l)})$, (11)
and
$CIR(\hat{LE}) \rightarrow \max_{\forall (e,l)} CIR(LE^{(e,l)})$. (12)
Proof: For a certain debate candidate $LE^{(e,l)}$, from the assumptions, the updated explanation $\hat{LE}$ after integration over several iterations at least converges to a value close to the dominant indicator (RCP or CIR) of the debate candidate $LE^{(e,l)}$ as follows:
If $RCP(LE^{(e,l)}) > RCP(\hat{LE})$, then $RCP(\hat{LE}) \rightarrow RCP(LE^{(e,l)})$; (13)
If $CIR(LE^{(e,l)}) > CIR(\hat{LE})$, then $CIR(\hat{LE}) \rightarrow CIR(LE^{(e,l)})$. (14)
Therefore, when we update the explanation with several debating candidates that show at least one dominant explainability indicator (RCP or CIR) over several iterations, the updating step inevitably passes through the two debate candidates that show the maximum value on each explainability indicator among all candidates ($\max_{\forall (e,l)} RCP(LE^{(e,l)})$, $\max_{\forall (e,l)} CIR(LE^{(e,l)})$). Accordingly, the updated explanation at least converges to $(\max_{\forall (e,l)} RCP(LE^{(e,l)}), \max_{\forall (e,l)} CIR(LE^{(e,l)}))$. □
However, the precondition for Theorem 1, which is induced from Lemma 2, is not always satisfied in the real-world environment. Therefore, in order to alleviate this gap, as shown in Algorithm 1, we add a procedure of examining whether RCP or CIR improves without any degradation before adopting the mediation at each update step, and we iteratively attempt to mediate among the debating explanations while exploring various episodic layer-wise explanations so that such a precondition is more likely to be satisfied. Moreover, as our methods can be orthogonally applied to various CAM-based visual explanation methods and attention modules, they do not impose any other constraints on the backbone task network, except for the requirements of deriving CAM-based visual explanations and applying attention modules.

V. EMPIRICAL EVALUATION
A. EXPERIMENTAL SETTINGS
We evaluated the feasibility of our methods empirically. We conducted an evaluation on the land use classification task with the UC Merced satellite imagery dataset [31], which requires explanations of predictions due to the low-resolution characteristics of satellite imagery in the practical domain. In the dataset, 90% of the total data is split into the training set and the remaining 10% is used as the test set.
The task network is trained using SGD with a momentum of 0.9, a batch size of 48, and a weight decay of 0.0001. We trained for 160 epochs in total, where the initial learning rate is set to 0.1 and decayed by 1/10 every 60 epochs.
TABLE 5. Quantitative comparison of the mediated explanation obtained using the proposed scheme. The results are observed on three sets of samples, where the mediation is conducted among the explanations derived from three different sampled episodes in each case, and the mediated explanation achieves improvement in explainability over the explanations of the three episodes in all three samples.
We used the ResNet-18 [32] architecture, where only the single fully connected layer at the end is modified to have 21 output channels for the UC Merced satellite imagery dataset, and the other settings follow ResNet-18 [32]. ABN [7] is adopted as the attention module on each residual block of the task network in parallel, and Grad-CAM [6] is used to extract the layer-wise visual explanations. For each attention module of a residual block, 2 convolutional layers are applied to generate the attention map, with kernel sizes of 1 and 3 respectively, and padding is applied only to the latter one to maintain the output size. Batch normalization layers are applied to each convolutional layer of the attention module, and ReLU is subsequently applied to the former one, while sigmoid is applied to the latter one. Moreover, a single 1 × 1 convolutional layer is additionally applied to the former one to predict the task in parallel, and global average pooling and softmax are applied subsequently to produce the probability score of each label. In the training phase, the sum of the losses from the 4 attention branches and the perception branch is used as the loss function to train the whole task network including the attention modules, likewise to [7]. For DropAttention, p = 0.5 is applied in the training phase. The task network with attention modules is trained using SGD with a momentum of 0.9, a batch size of 48, and a weight decay of 0.0001. We trained for 160 epochs in total, where the initial learning rate is set to 0.1 and decayed by 1/10 every 60 epochs. No data augmentation is applied in the training phase.
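Reading the settings above, one per-block attention branch could be sketched as follows; the channel widths, the single-channel attention map, and the way the map is multiplied back into the features are our assumptions rather than the authors' exact implementation.

```python
# Illustrative sketch (our reading of the attention-module settings above, not
# the authors' code): per-block attention branch with a parallel label predictor.
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    def __init__(self, channels: int, num_classes: int = 21):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, 1, kernel_size=3, padding=1)  # padding keeps size
        self.bn2 = nn.BatchNorm2d(1)
        self.cls = nn.Conv2d(channels, num_classes, kernel_size=1)     # parallel task prediction

    def forward(self, feat):
        h = torch.relu(self.bn1(self.conv1(feat)))
        att = torch.sigmoid(self.bn2(self.conv2(h)))      # attention map in [0, 1]
        logits = self.cls(h).mean(dim=(2, 3))             # global average pooling
        probs = torch.softmax(logits, dim=1)              # label probabilities
        return feat * att, probs                          # attention reflected onto the features
```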
To obtain the initial settings for mediating the explanations, the discriminator and generator are each constructed with a single fully connected layer, where softmax is subsequently applied to the discriminator and sigmoid is subsequently applied to the generator. The discriminator and the generator are trained using the same hyper-parameter settings as the task network, and a single inner-loop step is applied. In the procedure of recursive mediation, ϵ = 5% is applied as the unit step for searching the appropriate regional ratio ρ.

B. FEASIBILITY OF MEDIATING EXPLANATIONS
In order to check the practical feasibility of the proposed scheme for mediating explanations from various attention episodes, we sampled three episodes from the amortized network trained with DropAtt and conducted the proposed scheme of mediating explanations among only the sampled episodic layer-wise explanations. The layer-wise policy of applying attention modules and the target layer (for deriving Grad-CAM) of each sampled episode in Table 5 are presented in Table 4.
As shown in Table 5, the results show that the proposed scheme of mediating explanations can derive an integrated explanation in which both explainability indicators (average percentage drop in confidence, percentage rate of increase in confidence) are improved over the layer-wise explanations in all three different sampled episodes, and it is expected that the proposed scheme of mediating explanations can improve explainability whenever multiple episodes are acquired.
Looking closely at the results of the first sample as an example, the first and second episodes have strength in the rate of increase in confidence, but they show lower performance on average drop than the third episode. As the explainability performance of the first and second episodes is similar to each other, our mediation method allocated the reflection ratio mainly to the second episode (0.4) rather than the first episode (0.1). Moreover, by allocating some portion to the complementary episode (i.e., the third episode), the explanation finally derived through mediation achieves improvement on both explainability metrics over each episodic layer-wise explanation, with the help of complementing each other's strengths and weaknesses. Such complementing is realized mainly by our mediation mechanism, which reflects only partial regions of each derived explanation (i.e., takes only its strong points) and excludes the remaining parts (i.e., abandons weak/noisy points and fills them with the strong points of other explanations instead). The second sample shows aspects similar to the first sample.
In the third sample, the first episode shows lower performance on both explainability metrics than the others. Accordingly, our mediation method does not allocate any reflection ratio to the first episode, and only the complementary pair (second and third episodes) is utilized for mediation.
Overall, our mediation method can mediate complementary explanations derived from different target layers and episodes, and can therefore achieve improvements over the strongest explainability among the applied explanations.

C. QUANTITATIVE AND QUALITATIVE COMPARISON OVER VISUAL EXPLANATION METHODS
We also observed the results of the visual explanations and compared them with the existing methods. We compared our methods with two main baselines: Grad-CAM and ABN. For Grad-CAM, the visual explanation (Grad-CAM [6]) is derived from the task network without applying any attention module, and all the visual explanations derived at each residual block are averaged to be shown as a single image. ABN is the case of visual explanations derived from the fixed model trained with ABN [7] applied to all residual blocks as suggested, and the explanations derived from all residual blocks are averaged into a single explanation.
We also attempted to observe the effects of our methods one by one: DropAtt, DropAtt + Discrim., and the Whole Proposed Scheme (Algorithm 1). For DropAtt, the visual explanation (Grad-CAM [6]) is derived from the task network trained with our DropAtt, and all the layer-wise explanations are also averaged to be shown as a single explanation. In the case of DropAtt + Discrim., three layer-wise episodic explanations are derived from the task model trained with our DropAtt, and they are integrated into a single explanation by mediating the three sampled explanations with the initial reflecting ratios inferred from the generator trained with the proposed class-wise feature discriminator. The Whole Proposed Scheme applies our final Algorithm 1, which mediates explanations from multiple episodes and target layers, and the results presented in Table 6 and Fig. 6 are derived with the settings of ''Sample 3'' in Table 4 as an example.
Among the compared methods, we evaluated the visual explainability of each method based on the aforementioned two main metrics: average drop (lower is better) and rate of increase in confidence (higher is better). As shown in Table 6, Grad-CAM and ABN by themselves show relatively lower explainability than our methods. Moreover, applying DropAtt alone cannot achieve a significant improvement over the two baselines, showing a rather higher value on average drop. However, further applying the proposed way of integrating various explanations, DropAtt + Discrim., shows improvements on both explainability metrics. Finally, by applying the proposed mediating scheme, the Whole Proposed Scheme (Algorithm 1) achieves the highest explainability, improving the average drop by nearly 17% and the rate of increase in confidence by nearly 3% compared to the baseline (ABN). Such results imply that the way of integrating the various explanations that can be acquired from the diversity of deriving visual explanations and applying attentions also plays an important role in providing a reliable explanation.
We also observed the visual explanations of each compared method qualitatively, and Fig. 6 shows the corresponding results. In the results, the leftmost column corresponds to samples of the original input images, and the remaining columns are the visual explanations derived from each compared method: Grad-CAM, ABN, DropAtt, DropAtt + Discrim., and Whole Proposed Scheme (Algorithm 1).
The results show that applying our DropAtt alone cannot produce significant improvements in the visual explanation over the two baselines (results in the second and third columns), although it shows stable task performance over multiple episodes and even an improvement in task performance over the baseline model of [7]. This implies that not only generating the multiple episodes of applying various attention modules, but also the way of properly integrating such various explanations from different target layers and episodes to complement each other, is important for enhancing the quality of the visual explanation.
Subsequently, the results of applying our DropAtt and the proposed class-wise feature discriminator together show better quality of the visual explanation, where the explanation mainly highlights the target objects in the image more precisely. Furthermore, when we additionally apply our explanation mediating scheme with the multi-disciplinary debate, the final results in the rightmost column show further enhancement in the quality of the visual explanation on some instances. In particular, the first row of the results shows such improvement distinctly.

D. ABLATION STUDY: MEDIATED PARTIAL EXPLANATIONS
In order to check how our explanation mediating scheme works, we observe the visual explanation derived from each episode and the partial regions taken from them in the explanation mediation, and Fig. 7 shows the corresponding results with the settings of ''Sample 3'' in Table 4. As shown in the results, the visual explanations from Episode 1 show redundant information over several input instances, where background bias is revealed and even the highlighted regions cannot distinguish the target object from the background. Accordingly, the mediation policy (regional reflection ratio ρ) for Episode 1 is set to zero, blocking the meaningless explanations of Episode 1 from the explanation integration.
In the case of Episode 2, the visual explanations mainly highlight the regions of the target objects over several input images, showing the lowest average drop among the three episodes as shown in Table 4. As Episode 2 also shows lower background bias than the other episodes, the highest value of the mediation policy (0.7) is allocated to Episode 2.
Similar to Episode 2, the visual explanations of Episode 3 also show good explainability, where the rate of increase in confidence of Episode 3 is the highest among the episodes. However, the visual explanations for some input images show background bias or wrong bias (e.g., the visual explanation of Episode 3 on the airplane example image mainly highlights the vehicle rather than the airplane, and some background bias occurs together). As the correct explanations that successfully capture the regions of the target objects are already largely reflected by Episode 2, only a small portion of the mediation policy (0.1) is allocated to Episode 3, within the bounds of emphasizing its advantage (higher rate of increase in confidence) while minimizing its weakness (higher average drop than Episode 2).

E. COMPARISON OF COMPUTATIONAL COST FOR SERVING ON THE REAL DOMAIN ENVIRONMENT
In this section, we provide an additional analysis of the computational cost of the proposed model for evaluating the feasibility of serving the model in the real domain environment. To evaluate the feasibility of serving the analysis model in the space-related application, we consider an onboard AI computing environment as depicted in [1]. In such an environment, utilizing low-power computing resources such as a visual processing unit or an embedded GPU, the onboard computing system performs inferencing of the DL network to analyze the images collected by the satellite system and sends the annotated results to the ground station system. In the ground station system, further analysis with the XAI model can be conducted to provide reliable visual explanations to the human experts for assisting final decision making. In this scenario, the computation of inferencing the task DL network is conducted on the satellite onboard system, and the computations of deriving the visual explanations are conducted at the ground station. Accordingly, we observe the total cost $Cost_{net}$ of processing this series of analyses on each satellite image in terms of the total energy consumption, which is derived as follows:
$Cost_{net} = P^{comp}_{sat} \cdot t_{inf} + PE^{comm}_{sat} \cdot D + P^{comp}_{gs} \cdot t_{expl}$, (15)
where $P^{comp}_{sat}$ and $P^{comp}_{gs}$ represent the power consumption of the computing HW resources in the satellite system and the ground station system, and $t_{inf}$ and $t_{expl}$ represent the processing time of inferencing the task network and the processing time of deriving the visual explanation, respectively. $PE^{comm}_{sat}$ is the power efficiency of the satellite-ground station communication represented as power per data rate (W/bps), and $D$ is the data volume of the captured images. We assume the communication parameter $PE^{comm}_{sat} = 1/200$ W/Mbps, the same as [33], and assume that an NVIDIA Jetson TX-1 is utilized for the satellite onboard system and an NVIDIA RTX 3080 for the ground station system, where $P^{comp}_{sat} = 10$ W and $P^{comp}_{gs} = 320$ W, respectively. As shown in Table 7, Grad-CAM shows the smallest cost as it does not apply any attention modules. DropAtt and DropAtt + Discrim. show lower cost than ABN thanks to DropAtt not needing all the attention modules for deriving the explanations. Our final scheme (Algorithm 1) shows nearly 3% higher cost than ABN, but it enhances the explainability as a trade-off.
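As a worked example of Equation (15) under the stated hardware assumptions, the following sketch computes the per-image energy; the timing values in the example call are hypothetical placeholders, not measurements from the paper.

```python
# Illustrative sketch (not the authors' code): total per-image energy cost of
# Equation (15) under the stated hardware and communication assumptions.
def cost_net(t_inf_s, t_expl_s, data_mbit,
             p_sat_w=10.0, p_gs_w=320.0, pe_comm_w_per_mbps=1.0 / 200.0):
    """Energy in joules: on-board inference + downlink + ground-station explanation."""
    e_sat = p_sat_w * t_inf_s                    # Jetson TX-1 inference energy
    e_comm = pe_comm_w_per_mbps * data_mbit      # (W / Mbps) * Mbit = W * s = J
    e_gs = p_gs_w * t_expl_s                     # RTX 3080 explanation energy
    return e_sat + e_comm + e_gs

# Example with hypothetical timings: 0.05 s inference, 0.2 s explanation, 8 Mbit image.
print(cost_net(0.05, 0.2, 8.0))   # -> 64.54 J
```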

F. FUTURE WORK
Compared with Grad-CAM, the proposed method has a limitation in that additional computational overhead is required due to the recursive search step of finding the proper regional reflection ratios over multiple episodes. However, excluding such a preparation step of recursive search, in the serving phase, the proposed method only requires 13% additional computational cost compared to Grad-CAM, while improving the explainability.
Moreover, in the proposed method of mediating explanations, by taking only the partial regions of each explanation, background bias can be filtered out in some cases (see the results of the harbor example), but it is practically hard to filter out all the wrong regions of explanations (see the results of the airplane example). As our method adjusts the mediation policy by observing explainability over the whole set of data instances, such a limitation on sophisticated mediation inevitably remains. This limitation could be overcome by adjusting the mediation policy while observing changes in explainability for each data instance, but that brings huge computational overhead instead. Nevertheless, we verified that our explanation mediating scheme can improve the explainability by integrating various explanations, and we leave further research on such elaborate mediation as future work.

VI. CONCLUSION
In this paper, we identified two main problems in deriving visual explanations with attention modules applied: 1) heavy computational overhead is required to train a new model from scratch for each episode of applying attention modules, and 2) the visual explanations show diversity (complexity) over the various layer-wise policies of applying attention modules and the various target layers, which makes it hard for the human user to choose a reliable explanation for making a final decision.
In order to overcome such problems, we propose a new method of mediating various attention-variant layer-wise explanations by generating several episodes from the amortized model with the help of DropAtt. Through the multi-disciplinary debate in the mediating process, a single integrated explanation can be derived, which can improve both disciplinary explainability indicators by complementing the strengths and weaknesses of the explanations from multiple episodes and target layers.
From the empirical evaluation, our DropAtt shows stable task performance in applying multiple episodes of attention modules from the amortized model, and even a slight improvement in task performance over the baseline. Moreover, by applying our explanation mediating scheme with multi-disciplinary debate on the multiple episodes generated from the amortized model with the help of DropAtt, it further achieves improvements on both explainability metrics and derives more sophisticated explanations.