Exploring Simple and Transferable Recognition-Aware Image Processing

Recent progress in image recognition has stimulated the deployment of vision systems at an unprecedented scale. As a result, visual data are now often consumed not only by humans but also by machines. Existing image processing methods only optimize for better human perception, yet the resulting images may not be accurately recognized by machines. This can be undesirable, e.g., the images can be improperly handled by search engines or recommendation systems. In this work, we examine simple approaches to improve machine recognition of processed images: optimizing the recognition loss directly on the image processing network or through an intermediate input transformation model. Interestingly, the processing model's ability to enhance recognition quality can transfer when evaluated on models of different architectures, recognized categories, tasks and training datasets. This makes the methods applicable even when we do not have the knowledge of future recognition models, e.g., when uploading processed images to the Internet. We conduct experiments on multiple image processing tasks paired with ImageNet classification and PASCAL VOC detection as recognition tasks. With these simple yet effective methods, substantial accuracy gain can be achieved with strong transferability and minimal image quality loss. Through a user study we further show that the accuracy gain can transfer to a black-box cloud model. Finally, we try to explain this transferability phenomenon by demonstrating the similarities of different models' decision boundaries. Code is available at https://github.com/liuzhuang13/Transferable_RA .


INTRODUCTION
Unlike in image recognition, where a deep network maps an image to a category label, a deep network used for image processing maps an input image to an output image with some desired properties. Examples include super-resolution [1], denoising [2], deblurring [3], and colorization [4].
The goal of such systems is to produce images of high perceptual quality to a human observer. For example, in image denoising, we aim to remove noise that is not useful to an observer and restore the image to its original "clean" form. Metrics like PSNR/SSIM [5] are often used [1], [6] to approximate human-perceived similarity between the processed and original images, and direct human assessment of the output's fidelity is often considered the "gold standard" [7], [8]. Therefore, techniques have been proposed to make outputs look perceptually pleasing to humans [7], [9], [10].
However, while looking good to humans, image processing outputs may not be accurately recognized by image recognition systems. As shown in Fig. 1, the output image of a denoising model can easily be recognized by a human as a bird, yet a recognition model classifies it as a kite. One could train a recognition model only on the output images produced by the denoising model to achieve better performance on such images, or leverage domain adaptation approaches to adapt the recognition model to this domain, but the performance on natural images can then be harmed. Such a retraining/adaptation scheme may also be impractical considering the significant overhead induced by catering to various image processing tasks and models.

Fig. 1: Image processing aims for images that look visually pleasing to humans, not necessarily ones that are accurately recognized by machines. In this work we try to enhance the recognition accuracy of output images. Zoom in for details.
With the fast-growing size of image data, images are often "viewed" and analyzed more by machines than by humans. Nowadays, any image uploaded to the Internet is likely to be analyzed by certain vision systems. It is therefore of great importance for processed images to be recognizable not only by humans, but also by machines. In other words, recognition systems (e.g., an image classifier) should be able to accurately infer the underlying semantic category of the image content. This makes the images easier to search, recommend to interested audiences, and so on, as these procedures are mostly executed by machines based on their understanding of the images. Therefore, we argue that image processing systems should also aim for machine recognizability. We call this problem "Recognition-Aware Image Processing".
arXiv:1910.09185v4 [cs.CV] 10 Sep 2022

It is also important that the enhanced recognizability is not specific to any concrete recognition model, i.e., achieved only when the output images are evaluated on one particular model. Instead, the improvement should ideally transfer when evaluated with different downstream models/tasks, to support its usage without access to possible future recognition systems. One reason is that what model will be used to recognize the processed image may be out of our control, for example when we upload it to the Internet or share it on social media. We may not know what network architecture (e.g., ResNet or VGG) will be used for inference, what object categories the model recognizes (e.g., animals or scenes), or even what task will be performed (e.g., classification or detection). Without these specifications, would it not be hard to enhance the machine recognition accuracy of output images in such an unknown context?
In this work, we explore simple yet highly effective approaches to make image processing outputs more accurately recognized by downstream recognition systems, and demonstrate that these approaches yield transferable accuracy gains across different recognition architectures, categories, tasks and training datasets. The approaches we study add a recognition loss optimized jointly with the image processing loss. The recognition loss is computed using a fixed recognition model pretrained on natural images, and can even be computed without class labels for the training images. It can be optimized directly by the original image processing network, through an intermediate transformation network, or in an unsupervised manner, depending on the use case. Interestingly, we find the accuracy gain from optimizing one recognition model's loss transfers favorably among different recognition architectures, object categories, and recognition tasks, which renders these simple solutions effective even when we do not know what the downstream recognition model is.
We conduct extensive experiments on multiple image processing (super-resolution, denoising, JPEG-deblocking) and downstream recognition (classification, detection) tasks. The results demonstrate that our methods can substantially boost recognition accuracy (e.g., up to 10% absolute, or 20% relative gain), with minimal loss in image quality. Results are also compared with alternative approaches in Section 7. We demonstrate in detail that these methods generate a transferable accuracy boost, whether the downstream model has a different architecture, recognizes different classes, performs a different task, or is even a cloud-based, black-box model. We conduct decision boundary analysis and show that different models' decision boundaries exhibit high similarity, offering an explanation for this transferability phenomenon.
We would like to emphasize that in studying and analyzing these approaches, our contribution does not lie in proposing novel network architectures, training procedures or loss functions, but in demonstrating that these simple methods are surprisingly effective at making processed images more accurately recognized, and that the improved machine recognizability transfers favorably to different contexts. The simplicity of these methods also leads to easy deployment in practice, and they could serve as strong baselines in this relatively underexplored problem. The transferability further facilitates their practical usage in varied and potentially changing deployment environments.
Our contributions can be summarized as follows:
• We propose to study the problem of enhancing the machine recognition of image processing outputs, a desired property considering the amount of images analyzed by machines nowadays.
• We study simple yet effective methods towards this goal, suitable for different use cases. Extensive experiments are conducted on multiple image processing and recognition tasks.
• We show that with simple approaches, the recognition accuracy improvement is transferable among recognition architectures, categories, tasks and datasets, a desirable behavior making the proposed methods applicable without access to downstream recognition models.
• We provide decision boundary analysis of recognition models and show their similarities to gain a better understanding of the transferability phenomenon.
We hope our empirical findings can encourage the community to propose new methods for improving the recognition of processed images, and further study the reason behind the intriguing transferability, which could lead to deeper understanding of neural networks.
There are also a number of works that relate image recognition with image processing. Some works [4], [22], [23], [24], [25] use image recognition accuracy as an evaluation metric for image colorization/super-resolution/denoising, but without optimizing for it during training. [26], [27], [28], [29], [30] investigate how to achieve more accurate recognition on low-resolution or corrupted/noisy images. [31] proposes a method to make denoised images more accurately segmented. [32] introduces a theoretical framework for the classification-distortion-perception tradeoff and conducts experiments with simulated or toy datasets, while our work develops practical approaches for real-world datasets. Most existing works only consider one image processing task or image domain and develop specific techniques, while our simpler approach is task-agnostic, more widely applicable, and the first shown to be transferable. Our work is related to but different from those aiming for robustness of the recognition model [33], [34], [35], as we focus on training the processing models. Our method also shares some similarity with [36], which tries to differentiate input signals by optimizing recognition accuracy using auto-encoders. [37], [38], [39], [40], [41], [42], [43] jointly train a processing model (e.g., dehazing, resizing, face reconstruction) together with a recognition model (e.g., object recognition or face recognition) to achieve better image processing and/or recognition quality. Our problem setting differs from these works in that we assume we do not have control of the recognition model, as it might be on the cloud or decided in the future; thus we adapt the image processing model only. This has the advantage of making the performance gain transferable to different downstream environments, and also ensures the recognition of natural images is not harmed, as the recognition model is fixed. Section 7 includes our comparison with jointly training the recognition model.

METHODS
We first formally define the problem setting of recognition-aware image processing, and then introduce the multiple approaches we examine, each suited to different use cases. We finally introduce the different transferring scenarios we explore. Our methodology, although introduced only in a vision context, can be extended to other domains (e.g., speech) as well.

Problem Setting
In a typical image processing problem, given a set of training input images {I_in^k} and corresponding target images {I_target^k}, we aim to train a network that maps an input to its target. Denoting this network as P (processing), parameterized by W_P, our optimization objective is:

L_proc(W_P) = Σ_k l_proc(P(I_in^k), I_target^k),    (1)

where l_proc is the loss function for each sample (e.g., L_2). The performance is typically evaluated by the similarity (e.g., PSNR/SSIM) between I_target^k and P(I_in^k), or by human assessment.

In recognition-aware processing, we are additionally interested in a recognition task, with a trained recognition model R (recognition). We assume each target image I_target^k is associated with a category label S^k for the recognition task. Our goal is to train a processing model P such that the recognition performance on the output images P(I_in^k) is high when evaluated using R with the category labels {S^k}. In practice, R might not be available (e.g., it is on the cloud), in which case we can resort to other models if the performance improvement transfers among models.

Optimizing Recognition Loss
Given that our goal is to make the output images of P more recognizable by R, it is natural to add a recognition loss on top of the objective of the image processing task (Eqn. 1) during training:

L_recog(W_P) = Σ_k l_recog(R(P(I_in^k)), S^k),    (2)

where l_recog is the per-example recognition loss defined by the downstream recognition task. For example, for image classification, l_recog could be the cross-entropy (CE) loss. Adding the image processing loss (Eqn. 1) and recognition loss (Eqn. 2) together, our total training objective becomes

L(W_P) = L_proc(W_P) + λ L_recog(W_P),    (3)

where λ is the coefficient controlling the weight of L_recog relative to L_proc. We denote this simple solution as "RA (Recognition-Aware) processing", visualized in Fig. 2 (left). Note that once training is finished, the recognition model used as a loss is no longer needed; during inference we only need the processing model P, so no overhead is introduced in deployment. A potential shortcoming of directly optimizing L_recog is that it might deviate P from optimizing the original loss L_proc, so the trained P may generate images that are not as good as if we only optimized L_proc. We will show, however, that with a proper choice of λ, we can substantially boost the recognition performance with nearly no sacrifice in image quality.
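As a concrete illustration, the combined objective L_proc + λ·L_recog can be sketched numerically. This is a minimal NumPy sketch under our own naming; in practice both losses are computed on the outputs of the networks P and R and minimized with a gradient-based optimizer:

```python
import numpy as np

def l_proc(output, target):
    # Per-sample image processing loss: mean squared error (L2)
    return np.mean((output - target) ** 2)

def l_recog(logits, label):
    # Per-sample recognition loss: cross-entropy for classification,
    # computed from raw logits in a numerically stable way
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def ra_objective(output, target, logits, label, lam=1e-3):
    # Total RA training objective: L = L_proc + lambda * L_recog,
    # with lambda weighting the recognition term (1e-3 in the paper)
    return l_proc(output, target) + lam * l_recog(logits, label)
```

With λ = 0 this reduces to plain processing; increasing λ trades image fidelity for recognizability.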

Unsupervised Optimization of Recognition Loss
The solution above requires category labels for the training images, which may not always be available. In this case, we can instead regress the recognition model's output on the target image, R(I_target^k). The recognition objective changes to

L_recog(W_P) = Σ_k l_dis(R(P(I_in^k)), R(I_target^k)),    (4)

where l_dis is a distance measure between two of R's outputs (e.g., L_2 distance, KL divergence). We call this approach "unsupervised RA". Note that it is only unsupervised for training the model P, not necessarily for the model R. The (pre)training of R is not our concern, since in our problem setting R is a given trained model, and it can be trained either with or without full supervision. Unsupervised RA is related to the "knowledge distillation" paradigm [44] used for network model compression, where the output of a large model guides a small model given the same input images. Here instead we use the same recognition model R, but guide the upstream processing model to generate desirable images. It is also related to the perceptual loss/feature loss used in [7], [9], where a feature distance is minimized instead of an output distance. We provide a comparison in Section 7.
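The distance l_dis admits several choices; a minimal NumPy sketch of two common options on recognizer outputs (function names are ours), where the prediction on the target image acts as the "teacher" as in distillation:

```python
import numpy as np

def softmax(x):
    # Stable softmax over a logit vector
    z = np.exp(x - x.max())
    return z / z.sum()

def l_dis_l2(r_processed, r_target):
    # L2 distance between R's outputs (logits) on processed/target images
    return np.sum((r_processed - r_target) ** 2)

def l_dis_kl(r_processed, r_target):
    # KL divergence between R's softmax outputs, with the target image's
    # prediction as the reference ("teacher") distribution
    p = softmax(r_target)
    q = softmax(r_processed)
    return np.sum(p * (np.log(p) - np.log(q)))
```

Both are zero when the processed image elicits exactly the same prediction as the target image, which is the behavior the unsupervised objective encourages.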

Using an Intermediate Transformer
Sometimes we want to prevent the added recognition loss L_recog from causing P to deviate from optimizing its original loss. We can achieve this by introducing an intermediate transformation model T: P's output is first fed to T, and T's output serves as the input to R (Fig. 2 right). T's parameters W_T are optimized for the recognition loss:

L_recog(W_T) = Σ_k l_recog(R(T(P(I_in^k))), S^k).    (5)

Fig. 2: Left: RA processing. In addition to the image processing loss, we add a recognition loss using a fixed recognition model R, for the processing model P to optimize. Right: RA with transformer. "Recognition Loss" stands for the dashed box in the left figure. A transformer T is introduced between the output of P and the input of R, to optimize the recognition loss. We cut the gradient from the recognition loss flowing to P, such that P only optimizes the image processing loss and the image quality is not affected.
With the help of T in optimizing the recognition loss, the model P can now "focus" on its original image processing loss L_proc. The optimization objective becomes:

min_{W_P} L_proc(W_P) + min_{W_T} λ L_recog(W_T).    (6)

In Eqn. 6, P solely optimizes L_proc as in the original image processing problem (Eqn. 1). P is learned as if there were no recognition loss, and therefore the image processing quality of its output is not affected. This can be implemented by "detaching" the gradient generated by L_recog between the models T and P (Fig. 2 right). We term this solution "RA with transformer". Its downside compared with direct optimization using P is that there are two instances of each image (the outputs of P and of T), one "for humans" and the other "for machines". The transformer is therefore best suited when we want to guarantee that the image processing quality is not affected at all, at the expense of maintaining another image. For example, when classifying images, we can present the higher-quality image to users for a better experience and pass the other image to the backend for accurate machine classification.
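The gradient cut between T and P amounts to a single detach call. Below is a minimal PyTorch sketch with toy convolutional stand-ins for P and T (the actual models are SRResNet and a CycleGAN-style transformer) and a placeholder scalar in place of the recognition loss:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the processing model P and transformer T
P = nn.Conv2d(3, 3, kernel_size=3, padding=1)
T = nn.Conv2d(3, 3, kernel_size=3, padding=1)

x = torch.randn(1, 3, 8, 8)   # dummy input image batch
out_p = P(x)
# Detach so the recognition loss cannot backpropagate into P:
# P keeps optimizing only L_proc, leaving its output quality untouched
out_t = T(out_p.detach())
recog_loss = out_t.pow(2).mean()  # placeholder for l_recog(R(out_t), label)
recog_loss.backward()
```

After the backward pass, only T's parameters receive gradients from the recognition loss; P's remain untouched, matching the two decoupled minimizations in Eqn. 6.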

Transferring Scenarios
If using R as a fixed loss could only boost the accuracy on R itself, the use of the method would be limited, since sometimes we have no knowledge of the future downstream recognition model or even task. Thus, we explore several scenarios to see whether processing models trained with the loss of one recognition model R_1 can also boost the performance when evaluated using another model R_2. Here R_2 may not only have a different architecture but can also perform a different vision task, etc. If the improvement is transferable, then we can train our image processing model P with the loss from a generic recognition model, such as one trained on ImageNet; even if the output images from P are later tested on another model, we still gain accuracy compared with a vanilla processing model. Specifically, we examine the following transfer scenarios:
• Transferring to a different model architecture. R_1 and R_2 perform the same vision task, are trained on the same dataset, and only differ in model architecture, e.g., ResNet-50 and VGG-16.
• Transferring to another set of categories. R_1 and R_2 have the same architecture and perform the same task, but are trained to recognize disjoint subsets of categories from the same dataset.
• Transferring to another task and dataset. R_1 and R_2 perform different tasks, for instance image classification and object detection. In most cases, this also means R_1 and R_2 are trained on different datasets with different sets of categories. They can have different architectures as well.
• Transferring to a black-box model. The model R_2 could be a proprietary online model that provides recognition services but does not allow users access to its structure, weights, training dataset(s), or even its set of output categories.
Interestingly, we find that in each of the above cases, the accuracy boost gained on R_1 also transfers to R_2.
This makes our method effective even when we cannot access the target downstream model, where we could use another trained model as the loss function. This phenomenon also implies that the "recognizability" of a processed image can be more general than just the extent it fits to a specific model. More details are presented in the experiments.

Experimental Details
General Setup. We evaluate our methods on three image processing tasks: image super-resolution, denoising, and JPEG-deblocking. In these tasks, the target images are the original images from the datasets. To obtain the input images, for super-resolution we use a downsampling scale factor of 4×; for denoising we add Gaussian noise with a standard deviation of 0.1 to obtain noisy images; for JPEG-deblocking we compress the images to JPEG format with a quality factor of 10. The image processing loss used is the mean squared error (MSE, or L_2) loss. For the recognition tasks, we consider image classification and object detection, two common tasks in computer vision. In total, we have 6 (3 × 2) task pairs to evaluate. Training is performed on the training set and results on the validation set are reported.
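The denoising and super-resolution input degradations can be sketched as follows (a NumPy sketch for images scaled to [0, 1]; the exact downsampling filter used in the paper is an assumption on our part, bicubic resizing being a typical choice):

```python
import numpy as np

def add_gaussian_noise(img, sigma=0.1, seed=0):
    # Denoising input: additive Gaussian noise with std 0.1,
    # clipped back to the valid [0, 1] intensity range
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def downsample_4x(img):
    # Super-resolution input: naive 4x subsampling stand-in
    # (no anti-aliasing filter; a bicubic resize is more typical)
    return img[::4, ::4]
```

JPEG-deblocking inputs would analogously be produced by re-encoding images with a standard JPEG encoder at quality factor 10 (e.g., Pillow's `save(..., format="JPEG", quality=10)`).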
We adopt SRResNet [7] as the architecture of the image processing model P (unless otherwise specified, e.g., in Sec. 4.7), which is simple yet effective in optimizing the MSE loss. Even though SRResNet is originally designed for super-resolution, we find it also performs well on denoising and JPEG-deblocking when its upscale parameter is set to 1 for equal input and output sizes. For the transformer model T, we use the ResNet-like architecture from CycleGAN [45]. The recognition models are ResNet, VGG and DenseNet. Please refer to Section 7 for comparison with alternative approaches.
Throughout the experiments, on both the image processing network and the transformer, we use the Adam optimizer [46] with an initial learning rate of 10 −4 , following the original SRResNet [7]. Our implementation is in PyTorch [47]. The experiments are run on 1-4 NVIDIA TITAN Xp GPUs. The training process finishes in 2-24 hours depending on the model sizes/variants of methods/recognition tasks, and the maximum GPU memory taken is 30GB (multi-GPU) with batch size of 20.
Image Classification. For image classification, we evaluate our method on the large-scale ImageNet benchmark [48], which can be downloaded at http://image-net.org/download. It consists of ∼ 1.2 million training images and 50,000 validation images. We use five pre-trained image classification models, ResNet-18/50/101 [49], DenseNet-121 [50] and VGG-16 [51] with BN [52] (denoted as R18/50/101, D121, V16 in Table  1), on which the top-1 accuracy (%) of the original validation images is 69.8, 76.2, 77.4, 74.7, and 73.4 respectively. We train the processing models for 6 epochs on the training set, with a learning rate decay of 10× at epoch 5 and 6, and a batch size of 20. In evaluation, we feed unprocessed validation images to the image processing model, and report the accuracy of the output images evaluated on the pre-trained classification networks. For unsupervised RA, we use L 2 distance as the function l dis in Eqn. 4. The hyperparameter λ is chosen using super-resolution with the ResNet-18 recognition model, on two small subsets for training/validation from the original large training set, from a grid search from 10 −4 to 100. The λ chosen for RA processing, RA with transformer, and unsupervised RA is 10 −3 , 10 −2 and 10 respectively.
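Evaluation then reduces to feeding unprocessed validation images through the processing model and computing the classifier's top-1 accuracy on its outputs; a minimal sketch of the metric (names are ours):

```python
import numpy as np

def top1_accuracy(logits, labels):
    # logits: (N, num_classes) recognizer outputs on processed images
    # labels: (N,) ground-truth category labels
    preds = logits.argmax(axis=1)
    return 100.0 * float(np.mean(preds == labels))
```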
Object Detection. For object detection, we evaluate on the PASCAL VOC 2007 and 2012 detection datasets (https://pjreddie.com/projects/pascal-voc-dataset-mirror/), using Faster R-CNN [53] as the recognition model. Our implementation is based on the code from [54]. Following common practice [53], [55], [56], we use the VOC 07 and 12 trainval data as the training set and evaluate on the VOC 07 test data. The Faster R-CNN training uses the same hyperparameters as in [54]. For the recognition model's backbone architecture, we evaluate ResNet-18/50/101 and VGG-16 (without BN [52]), obtaining mAPs of 74.2, 76.8, 77.9 and 72.2 on the test set respectively. Given these trained detectors as recognition loss functions, we train the processing models on the training set for 7 epochs, with a learning rate decay of 10× at epochs 6 and 7, and a batch size of 1. We report the mean Average Precision (mAP) on processed images of the test set. As in image classification, we use λ = 10^-3 for RA processing and λ = 10^-2 for RA with transformer.

Evaluation on the Same Recognition Model
We first present results when the R used for evaluation is the same as the R used as the recognition loss in training. Table 1 shows our results on ImageNet classification. ImageNet-pretrained classification models ResNet-18/50/101, DenseNet-121 and VGG-16 are denoted as R18/50/101, D121 and V16. "No Processing" denotes the accuracy on input images (low-resolution/noisy/JPEG-compressed); "Plain Processing" denotes image processing models trained without the recognition loss (Eqn. 1). We observe that plain processing boosts the accuracy over unprocessed images. These two are considered as baselines. From Table 1, RA processing significantly boosts the accuracy of output images over plainly processed ones, for all image processing tasks and recognition models. This is more prominent when the accuracy of plain processing is lower, e.g., in SR and JPEG-deblocking, where we mostly obtain ∼10% accuracy gain (close to 20% in relative terms). Even without category labels, our unsupervised RA still outperforms the baseline methods in most cases, despite achieving lower accuracy than its supervised counterpart. Also in SR and JPEG-deblocking, using an intermediate transformer T brings additional improvement over RA processing.
The results for PASCAL VOC object detection, when evaluated on the same architecture, are shown in Table 2. We observe a similar trend as in classification: using the recognition loss consistently improves the mAP over plain image processing by a notable margin. On super-resolution, RA processing mostly performs on par with RA with transformer, but on the other two tasks using a transformer is slightly better. The model with transformer performs better more often possibly because, with this extra network in the middle, the capacity of the whole system is increased.
For all result tables, each reported accuracy number is based on one run, due to the relatively stable performance we observed (almost all runs within 1%) and the large number of task combinations/architectures to be evaluated. For example, on super-resolution with ResNet-18 as R on ImageNet, five runs of plain processing give accuracies (%): 53.

Transfer between Recognition Architectures
In reality, the R on which we eventually want to evaluate the output images might not be available for us to use as a training loss, e.g., it could be on the cloud, kept confidential, or decided later. In this case, we can train a processing model P using a recognition model R_A (source) that is accessible to us, and after obtaining the trained P, evaluate its output images' accuracy using another, unseen R_B (target). We evaluate model architecture pairs on ImageNet in Table 3 for RA processing, where rows are source models (R_A) and columns are target models (R_B). In each column of Table 3, training with any model R_A produces substantially higher accuracy on R_B than plainly processed images, indicating that the improvement is transferable among recognition architectures. This phenomenon enables us to use RA processing without knowledge of the downstream recognition architecture. We provide the model transferability results of RA processing on object detection in Table 4. Rows indicate the models used as the recognition loss and columns indicate the evaluation models. We see a similar trend as in classification (Table 3): using other architectures as the loss also improves recognition performance over plain processing; the loss model that achieves the highest performance is mostly the evaluation model itself, as can be seen from the fact that most boldface numbers are on the diagonal.
In Table 5, we present the results of transferring between recognition architectures using unsupervised RA. For super-resolution and JPEG-deblocking, a similar trend holds as in (supervised) RA processing: using any architecture in training improves over plain processing. For denoising, however, this is not always the case: some models P trained with unsupervised RA are slightly worse than their plain processing counterparts. A possible reason is that the noise level in our experiments is not large enough, so plain processing already achieves very high accuracy.
In Table 6, we present the results of transferring between architectures when we use a transformer T. We use the processing model P and transformer T trained with R_A together when evaluating on R_B. From Table 6, in most cases the improvement is still transferable, but there are a few exceptions. For example, when R_A is a ResNet or DenseNet and R_B is VGG-16, the accuracy mostly falls behind plain processing by a large margin. This weaker transferability is possibly caused by the fact that no constraint is imposed by the image processing loss on T's output, so T "overfits" more to the specific R it is trained with.

Transfer between Object Categories
What if R_A and R_B recognize different categories of objects? We divide the 1000 classes of ImageNet into two splits, denoted Cat (category) A/B, each with 500 classes, and train two 500-way classifiers (R18) on the two splits, obtaining R_A and R_B. Next, we train two image processing models P_A/P_B with R_A/R_B as the recognition loss, using images from Cat A/B respectively. Note that neither P nor R has seen images/categories from the other split. We evaluate the obtained processing models on both splits in Table 7. We observe that RA still benefits the accuracy even when transferring across categories (e.g., in SR, 60.1% to 66.5% transferring from Cat A to Cat B). The improvement is only marginally lower than that from directly training on the same categories (e.g., 60.2% to 67.8% on Cat B). This suggests RA processing models do not impose category-specific signals on the images, but rather signals that enable wider sets of classes to be better recognized.

Transfer between Recognition Tasks and Datasets
We evaluate task transferability when task A is classification and task B is object detection in Table 8, where rows are classification models used for the RA loss and columns are detection models used for evaluation. There is also a dataset shift, since the models P and R are both trained on ImageNet, while during evaluation P is fed with VOC images and we use a VOC-trained detection model R. We observe that using a classification loss on model A (row) gives an accuracy gain on model B over plain processing in most cases. Such task transferability suggests the "machine semantics" of an image could be a task-agnostic property.
We also evaluate the opposite direction, from detection to classification. The results are shown in Table 9. Here, using RA processing can still consistently improve over plain processing for any pair of models, but we note that the improvement is not as significant as directly training using classification models as loss (Table 1 and Table 3).
Additionally, the results when we transfer the model P trained with unsupervised RA with image classification to object detection are shown in Table 10. In most cases, it improves over plain processing, but for image denoising, this is not always the case. Similar to results in Table 5, this could be because the noise level is relatively low in our experiments.

Transfer to a Black-box, Third-party Cloud Model
We compare the images generated from plain processing and RA models using the "General" model at clarifai.com, a company providing state-of-the-art image classification cloud services. We have no knowledge of the model's architecture or what datasets it was trained on; we can only access the service through its APIs. The model also recognizes over 11,000 concepts, which differ from the 1000 ImageNet categories. For this experiment, we only take the output category with the maximum probability as the prediction. We use the SR processing model trained with R18/ImageNet as the RA model. We randomly sample image indices from the ImageNet validation set, and query clarifai.com for predictions on images generated by both the plain and RA processing models.
From the results, we randomly select 100 instances where clarifai.com gives different predictions on the plain and RA images, to compose a survey for a user study. For each of the 100 instances, the survey presents the user with the target image and both prediction labels generated from the plain/RA images, in randomized left/right order. The survey asks the user to indicate which label(s), in his/her opinion, describe the image to a satisfactory level. The user has the option to choose none, either, or both labels. The survey and instructions can be found at https://tinyurl.com/y698779q. 10 volunteers participated in our survey. The resulting average satisfaction rates for plain and RA super-resolved images are 40.1% and 55.3% respectively. We achieve a 15.2% absolute gain, or 37.9% relative gain, in recognition satisfaction rate, indicating the strong transferability our method provides without knowledge of the black-box cloud model.

Experiments on More Architectures
In previous sections, we use SRResNet [7] as our processing model P. Here we provide more results with other, more recent processing models, including SRDenseNet (SRDNet) [6], Residual Dense Network (RDN) [57], and Deep Back-Projection Networks (DBPN) [58]. We present results in Table 11, with super-resolution as the processing task, ImageNet classification as the recognition task, and R being ResNet-18. The general trend we observed before holds across these architectures.

Experiments on ImageNet-C
We evaluate our methods on the ImageNet-C benchmark [33], which imposes 17 different types of corruption on the ImageNet [48] validation set. Although the ImageNet-C benchmark was originally designed for evaluating recognition models' robustness to corruptions, it is a good testbed for our methods on a broader range of processing tasks. We use the corrupted image as the input to the processing model and the original clean image as the target. Since only corrupted images from the validation set are released, we divide it evenly per class into two halves and train/test on the first/second half. The recognition model used in this experiment is an ImageNet-pretrained ResNet-18.
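The per-class half split described above can be sketched as follows; the `(path, label)` sample list and the helper name are illustrative assumptions, not part of the released code.

```python
from collections import defaultdict

def split_per_class(samples):
    """Split (path, label) pairs evenly per class into train/test halves.

    `samples` is a hypothetical list of (image_path, class_label) pairs for
    one ImageNet-C corruption: the first half of each class is used to train
    the processing model, the second half to evaluate it.
    """
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)

    train, test = [], []
    for label, paths in by_class.items():
        mid = len(paths) // 2
        train += [(p, label) for p in paths[:mid]]
        test += [(p, label) for p in paths[mid:]]
    return train, test

# Toy example: 4 images in each of 2 classes -> 2/2 split per class.
toy = [(f"img{i}.png", i % 2) for i in range(8)]
train, test = split_per_class(toy)
print(len(train), len(test))  # 4 4
```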
In Table 12, we evaluate RA Processing on all 17 types of corruption, with the "corruption level" set to 5 [33]. We observe that RA Processing brings consistent improvements over plain processing, sometimes by an even larger margin than on the tasks considered in Sec. 4.
In Table 13, we experiment with different levels of corruption using two corruption types: "speckle noise" and "snow". We also evaluate our variants, Unsupervised RA and RA with Transformer. We observe that when the corruption level is higher, our methods tend to bring larger recognition accuracy gains. In this setting, using a Transformer could sometimes hurt accuracy compared with plain processing, possibly because the training data in ImageNet-C (half of the validation set) is insufficient for the Transformer's additional parameters, as more parameters typically require more training data. In the majority of other cases, it improves slightly over RA processing.
In Table 14, we examine the transferability of RA Processing between recognition architectures, using the same two tasks, "speckle noise" and "snow", with corruption level 5. Note that the recognition loss used during training is from a ResNet-18, and we evaluate the improvement over plain processing on ResNet-50/101, DenseNet-121 and VGG-16. The improvement over plain processing is transferable among architectures.

Fig. 3: Examples where outputs of RA processing models can be correctly classified but those from plain processing models cannot. PSNR/SSIM/class prediction is shown below each output image. Slight differences between images from the plain and RA processing models can be noticed when zoomed in.

Experiments on Randomized Image Corruption
For super-resolution, the downsampling factor is randomly sampled; for denoising, the noise level is sampled from [0.1, 0.2, 0.3, 0.4, 0.5]; for JPEG-deblocking, the quality factor is sampled from [5, 10, 20, 30, 40]. We further compound these three randomized corruptions sequentially to approximate real-world image distortions. We conduct the experiments using ResNet-18 on ImageNet as the recognition model, and the results are shown in Table 15. In all cases, RA processing boosts the recognition accuracy. In the randomized compound experiment, the relative accuracy gain is even more significant (31.5% → 41.2%, a 30.8% relative improvement).
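The random sampling and sequential compounding can be sketched as follows. The noise levels and JPEG quality factors are those from the text; the super-resolution factors and all three corruption functions are lightweight stand-ins (a real pipeline would use bicubic resampling and an actual JPEG encoder), so treat this only as a structural sketch.

```python
import random
import numpy as np

NOISE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5]   # denoising sigmas (from the text)
JPEG_QUALITIES = [5, 10, 20, 30, 40]        # JPEG quality factors (from the text)
SR_FACTORS = [2, 3, 4]                      # assumption: not given in this section

def downsample(img, factor):
    # Nearest-neighbor down/upsampling as a stand-in for bicubic resampling.
    small = img[::factor, ::factor]
    up = np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
    return up[: img.shape[0], : img.shape[1]]

def add_noise(img, sigma):
    return np.clip(img + np.random.randn(*img.shape) * sigma, 0.0, 1.0)

def fake_jpeg(img, quality):
    # Coarse quantization as a stand-in for real JPEG compression.
    levels = max(2, quality)
    return np.round(img * levels) / levels

def compound_corrupt(img):
    """Apply the three corruptions sequentially with randomly sampled params."""
    img = downsample(img, random.choice(SR_FACTORS))
    img = add_noise(img, random.choice(NOISE_LEVELS))
    img = fake_jpeg(img, random.choice(JPEG_QUALITIES))
    return img

img = np.random.rand(32, 32)  # toy grayscale image in [0, 1]
out = compound_corrupt(img)
print(out.shape)  # (32, 32)
```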

IMAGE PROCESSING QUALITY ASSESSMENT
We compare the output image quality using conventional metrics (PSNR/SSIM). When using RA with a transformer, the output quality of P is guaranteed to be unaffected, so here we evaluate RA processing. We use R18 trained on ImageNet as the recognition loss, and report results with different λs (Eqn. 3) in Table 16. λ = 0 corresponds to plain processing. When λ = 10^-4, PSNR/SSIM are only marginally worse, yet the accuracy obtained is significantly higher. This suggests that the added recognition loss is not harmful when λ is chosen properly. When λ is excessively large (10^-2), image quality is hurt more, and interestingly even the recognition accuracy starts to decrease, which could be due to the change in effective learning rate. A proper balance between the processing and recognition losses is needed for both image quality and accuracy. We also measure image quality using the PieAPP metric [59], which emphasizes perceptual difference: on SR, when λ = 0/10^-4/10^-3, PieAPP (lower is better) = 1.329/1.313/1.323. Interestingly, RA processing can slightly improve perceptual quality measured with PieAPP. In Fig. 3, we visualize examples where the output image is incorrectly classified with a plain processing model but correctly recognized with RA processing. With smaller λ (10^-4 and 10^-3), the image is nearly the same as the plainly processed one. When λ is too large (10^-2), we can see some extra textures when zooming in.
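The weighted objective studied here (Eqn. 3) can be sketched in PyTorch. The networks, shapes, and hyperparameters below are tiny stand-ins for illustration, not the paper's actual models; the key points are that R stays frozen and that only L_proc + λ·L_recog drives P's update.

```python
import torch
import torch.nn as nn

# Stand-in processing model P and recognition model R (not the paper's).
P = nn.Conv2d(3, 3, kernel_size=3, padding=1)
R = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
for p in R.parameters():
    p.requires_grad_(False)  # R is fixed; only P is trained

lam = 1e-4  # the lambda in Eqn. 3
opt = torch.optim.SGD(P.parameters(), lr=0.1)

x = torch.rand(4, 3, 8, 8)          # corrupted input batch
target = torch.rand(4, 3, 8, 8)     # clean target images
labels = torch.randint(0, 10, (4,)) # ground-truth classes

out = P(x)
loss_proc = nn.functional.mse_loss(out, target)           # image processing loss
loss_recog = nn.functional.cross_entropy(R(out), labels)  # recognition loss
loss = loss_proc + lam * loss_recog

opt.zero_grad()
loss.backward()  # gradients flow through the frozen R into P
opt.step()
print(float(loss) > 0)  # True
```

Setting λ = 0 recovers plain processing, matching the first row of Table 16.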

DECISION BOUNDARIES ANALYSIS AND TRANSFERABILITY
Inspired by prior works' analysis of adversarial example transferability [60], [61], we conduct a decision boundary analysis to gain insight into RA processing's transferability. The task used is SR with ImageNet. Due to the high dimensionality of images, we restrict our analysis to a single direction at a time: given an input image x and a direction d (a unit vector of the same dimension as x), we analyze how the output of the recognition model R changes when x moves along d by an amount δ, i.e., when the input is x + δd. We define the boundary distance (BD) of model R, with respect to input x and direction d, as the minimum movement along d required for x to produce a different label at R, or more formally:

BD(R, x, d) = min{δ > 0 : R(x + δd) ≠ R(x)}.

Consider a two-model scenario, with a source model R_s and a target model R_t sharing the same output categories. We define their inter-boundary distance (IBD) as the gap between their boundary distances along d:

IBD(R_s, R_t, x, d) = |BD(R_s, x, d) − BD(R_t, x, d)|.

Assuming the prediction stays the same within the boundary, a small IBD between R_s and R_t means they have close boundaries along the direction d, since x does not need to move far beyond one model's boundary to reach the other's. In this case, changes made to x along d likely have a transferring effect from the source to the target model due to their close boundaries.
We take the image x to be a plain processing output and consider two types of directions: 1. a random direction d_r; 2. the direction obtained by subtracting the plain processing output x from the RA processing output x_s, normalized: (x_s − x)/||x_s − x||_2. The RA processing model here is trained with the source model R_s, so x_s is specific to R_s. We call this the "RA direction" (d_RA), since it points from the plain output x to the RA output x_s. We take all validation images for which the plain processing output x yields the same wrong prediction when fed to R_s and R_t, i.e., R_s(x) = R_t(x) ≠ Ground Truth. For each image, we compute BD(R_s, x, d), BD(R_t, x, d) and IBD(R_s, R_t, x, d) with d being the random direction and the RA direction. We present results with R_s being R18 and R_t being R50, as we observe other model pairs produce similar trends. The average results are shown in Table 17. We first observe that BDs are much smaller along the RA direction than along the random direction. This indicates that moving along the RA direction changes the model's wrong prediction at x faster, possibly to a correct one. More importantly, under either the random or the RA direction, the IBD is always smaller than the source/target BDs, which indicates that the boundaries of R_s and R_t are relatively close, leading to a transferring effect. The result along the RA direction can explain why RA processing leads to transferable accuracy gains, since the RA loss induces this direction as its effect on the processing output x.
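The measurement implied by these definitions is a one-dimensional line search along d. A minimal sketch, where the toy linear "models", step size, and search range are illustrative assumptions:

```python
import numpy as np

def boundary_distance(predict, x, d, step=0.01, max_delta=10.0):
    """Smallest delta > 0 with predict(x + delta*d) != predict(x),
    found by a simple line search (np.inf if no flip within max_delta)."""
    base = predict(x)
    delta = step
    while delta <= max_delta:
        if predict(x + delta * d) != base:
            return delta
        delta += step
    return np.inf

def inter_boundary_distance(pred_s, pred_t, x, d, **kw):
    """Gap between the two models' boundaries along d."""
    return abs(boundary_distance(pred_s, x, d, **kw)
               - boundary_distance(pred_t, x, d, **kw))

# Toy 2-D linear "models" with slightly different decision boundaries.
pred_s = lambda x: int(x[0] + x[1] > 1.0)  # source model R_s
pred_t = lambda x: int(x[0] + x[1] > 1.2)  # target model R_t

x = np.array([0.0, 0.0])                   # stand-in for a processed image
d = np.array([1.0, 1.0]) / np.sqrt(2.0)    # unit direction

bd_s = boundary_distance(pred_s, x, d)
bd_t = boundary_distance(pred_t, x, d)
ibd = inter_boundary_distance(pred_s, pred_t, x, d)
print(bd_s < bd_t and abs(ibd - (bd_t - bd_s)) < 1e-9)  # True
```

In the paper's setting, `predict` would be a recognition model's top-1 label and x a high-dimensional image; the logic is unchanged.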
We further visualize decision boundaries with examples in Fig. 4. We use R18 as the source and each of the other models as the target. We plot d_RA as the horizontal axis and d_r as the vertical axis. The origin represents the plain processing output x, and the color of point (u, v) represents the predicted class of the image x + u·d_RA + v·d_r. From the plot, we can see that different models share similar decision boundaries and tend to change to the same prediction once we move far enough from the origin along a direction. In both examples, we confirm that when we move along the RA direction (to the right along the horizontal axis), the first color we encounter (green for the top example, purple for the bottom) represents the image's correct label. This suggests the signal from the RA loss (the RA direction) can correct the wrong prediction of the plain processing output (x at the origin), and such correction is transferable given the similar decision boundaries among models.
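Such a plot amounts to evaluating the model's prediction on a 2-D grid spanned by the two directions. A minimal sketch; the toy classifier, directions, and grid extent are stand-ins for the paper's models and RA/random directions:

```python
import numpy as np

def prediction_grid(predict, x, d_ra, d_r, extent=1.0, n=5):
    """Predicted class at x + u*d_RA + v*d_r over an (n x n) grid,
    mirroring the 2-D boundary plots: u spans the RA direction
    (horizontal axis), v the random direction (vertical axis)."""
    us = np.linspace(0.0, extent, n)
    vs = np.linspace(-extent, extent, n)
    return np.array([[predict(x + u * d_ra + v * d_r) for u in us] for v in vs])

# Toy linear model and orthogonal directions in 2-D.
predict = lambda x: int(x[0] > 0.5)
x = np.zeros(2)                 # "plain processing output" at the origin
d_ra = np.array([1.0, 0.0])     # stand-in RA direction
d_r = np.array([0.0, 1.0])      # stand-in random direction

grid = prediction_grid(predict, x, d_ra, d_r, extent=1.0, n=5)
print(grid[:, 0].tolist())   # column at u=0: all 0 (origin-side prediction)
print(grid[:, -1].tolist())  # column at u=1: all 1 (label flips along d_RA)
```

Coloring each grid cell by its class value reproduces the style of the Fig. 4 panels.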

COMPARISON WITH ALTERNATIVES
We analyze some alternatives to our approaches. Unless otherwise specified, experiments in this section are conducted using RA processing on super-resolution, with ResNet-18 trained on ImageNet as the recognition model, and λ = 10^-3 where applicable. Under this setting, we achieve 61.8% classification accuracy on the output images.

Training/Fine-tuning the Recognition Model
Instead of fixing the recognition model R, we could train/fine-tune it together with the image processing model P to optimize the recognition loss. Many prior works [37], [38], [40] do train/fine-tune the recognition model jointly with the image processing model. We use SGD with momentum as R's optimizer, and the final accuracy reaches 63.0%. However, since we do not fix R, it becomes a model that specifically recognizes super-resolved images, and we found its performance on the original target images drops from 69.8% to 60.5%. Moreover, when transferring the trained P to ResNet-50, the accuracy is 62.4%, worse than the 66.7% obtained when training with a fixed ResNet-18. This suggests we lose some transferability if we do not fix the recognition model R.

Training Recognition Models from Scratch
We could first train a super-resolution model and then train R from scratch on its output images. Doing this, we achieve 66.1% accuracy, higher than the 61.8% of RA processing. However, R's accuracy on original clean images drops from 69.8% to 66.1%. Alternatively, we could train R from scratch on interpolated low-resolution images, in which case we achieve 66.0% on interpolated validation data but only 50.2% on the original data. In summary, training/fine-tuning R to cater to super-resolved or interpolated images can harm its performance on original images, and it incurs additional overhead for storing models. In contrast, RA processing boosts the accuracy on output images while keeping the performance on original images intact.

Training without the Image Processing Loss
It is possible to train the processing model on the recognition loss L_recog alone, without keeping the original image processing loss L_proc (Eqn. 3). This might presumably lead to better recognition performance, since the model P can now "focus on" optimizing the recognition loss. However, we found that removing the image processing loss hurts recognition performance: the accuracy drops from 61.8% to 60.9%; even worse, the PSNR/SSIM metrics drop from 26.69/0.804 to 16.92/0.263, which is expected since the image processing loss is not optimized during training. This suggests the original image processing loss helps recognition accuracy, since it drives the corrupted image to be restored toward its original form.

Perceptual/Feature Loss
Our unsupervised RA method optimizes the distance between the recognition model's output probabilities on processed and target images. This is related to the perceptual loss (also called feature loss) used in [7], [9], which optimizes the distance between processed and target images in VGG feature space. Note that the perceptual loss was originally proposed to improve output quality from a human observer's perspective. To compare both methods, we follow [7] and optimize the perceptual loss from VGG-16. We find that the perceptual loss yields lower accuracy than unsupervised RA (56.7% vs. 61.0% on the VGG-16 recognition model). This could be because using the final probabilities provides more category-level supervision, while intermediate features improve the outputs from a perceptual perspective.
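The contrast between the two losses can be sketched as follows. The recognizer here is a tiny stand-in split into a feature extractor and a classifier head, and the probability "distance" is taken as a KL divergence for concreteness; the paper's exact distance measure and VGG feature layers may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in recognizer: a feature extractor and a classifier head.
features = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16), nn.ReLU())
classifier = nn.Linear(16, 10)

processed = torch.rand(4, 3, 8, 8)  # output of the processing model
target = torch.rand(4, 3, 8, 8)     # clean target images

# Unsupervised RA: match output probability distributions (no labels needed).
p_log = F.log_softmax(classifier(features(processed)), dim=1)
q = F.softmax(classifier(features(target)), dim=1)
loss_unsup_ra = F.kl_div(p_log, q, reduction="batchmean")

# Perceptual/feature loss: match intermediate features instead.
loss_perceptual = F.mse_loss(features(processed), features(target))

print(loss_unsup_ra.item() >= 0.0, loss_perceptual.item() >= 0.0)  # True True
```

The structural difference is where the supervision is taken from: the final category distribution (unsupervised RA) versus an intermediate feature space (perceptual loss).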

CONCLUSION
We investigated an important yet largely overlooked problem: enhancing the machine recognition of image processing outputs. We found that a set of simple approaches that optimize an additional recognition loss can significantly boost recognition accuracy with little to no loss in image quality. Moreover, the gain in accuracy can transfer to architectures, categories, and vision tasks unseen during training, and even to a black-box cloud model. This indicates that the enhanced recognizability is not specific to one particular model but generalizes to others, making the approaches applicable even when future downstream recognition models are unknown. Finally, we analyzed the reason for this transferability from the perspective of decision boundary similarity between recognition models. We hope our study encourages the community to further improve the recognition of processed images.