ISP Distillation

Nowadays, many of the images captured are 'observed' by machines only and not by humans, e.g., in autonomous systems. High-level machine vision models, such as object recognition or semantic segmentation, assume images are transformed into some canonical image space by the camera Image Signal Processor (ISP). However, the camera ISP is optimized for producing visually pleasing images for human observers and not for machines. Therefore, one may spare the ISP compute time and apply vision models directly to RAW images. Yet, it has been shown that training such models directly on RAW images results in a performance drop. To mitigate this drop, we use a dataset of RAW and RGB image pairs, which can be easily acquired with no human labeling. We then train a model that is applied directly to the RAW data, using knowledge distillation such that the model's predictions for RAW images are aligned with the predictions of an off-the-shelf pre-trained model for the processed RGB images. Our experiments show that our performance on RAW images for object classification and semantic segmentation is significantly better than that of models trained on labeled RAW images. It also reasonably matches the predictions of a pre-trained model on processed RGB images, while saving the ISP compute overhead.


INTRODUCTION
Traditionally, Image Signal Processors (ISPs) are designed to optimize human perception of photographs. The core task is, given a raw measurement from an array of sensors (with different sensitivity to different light frequencies), to produce an image that looks natural to the human observer. To this end, there is a need to estimate the right colors and tones, and also compensate for acquisition process artifacts, such as noise.
However, in many domains, scenes are captured for machine consumption only. Examples include robot cameras, autonomous driving, and security cameras that are automatically monitored. In these domains, the objective is not to produce visually pleasing images, but rather to achieve high performance in a given downstream task, e.g., object recognition. Thus, discarding the ISP and training the model directly on the raw data is tempting in these cases. An ISP is typically a dedicated hardware system designed and optimized for low latency. Discarding it can reduce design costs, silicon area, and the power used. Unfortunately, simply discarding the ISP has been shown in the literature to cause performance degradation for high-level computer vision tasks such as object recognition [5], [9], [13] (note though that for lower-level tasks such as optical flow it might be beneficial [33]). This happens because the ISP serves as a 'normalization' of the data, transforming it into a canonical space that is independent (or less dependent) of the camera used or the capturing environment. Solutions discussed in the literature usually involve adjustments of the ISP for the vision task, either manually [31] or via learning [30]. Unlike alternative methods that focus on designing a minimal ISP for vision, in this paper we focus on completely discarding the ISP and explore how the drop in performance can be mitigated.
Training a model on RAW images requires annotating the data. Human labeling of RAW images is impossible, as they are sometimes almost unrecognizable to humans. The alternative is to use RGB images for labeling and then transfer the labels to RAW. The transfer to RAW can be done by having an inverse model of the ISP ($\mathrm{ISP}_{\mathrm{inv}}$). Given a dataset of labeled RGB images $\{(\mathrm{RGB}_i, l_i)\}$, we can generate a matching labeled RAW dataset $\{(\mathrm{RAW}_i, l_i)\} = \{(\mathrm{ISP}_{\mathrm{inv}}(\mathrm{RGB}_i), l_i)\}$, relying on the pixel alignment between RAW and RGB to get the labels for the RAW [5], [9], [13].
In this work, instead of using an ISP inverse model for transferring labels from RGB to RAW, we employ a dataset of RAW-RGB pairs. Building such a dataset is relatively easy since it requires no human labeling, and by labeling the RGB images we immediately obtain labels for the matching RAW images. To further reduce the labeling cost, instead of manually labeling the RGB images we use a pre-trained model to label them.
With the above-mentioned dataset we can train our model on RAW images with the transferred labels as ground truth. Yet, we still suffer from a big drop in performance compared to working with RGB inputs. To mitigate that, we suggest using Knowledge Distillation (KD) [16] to make the RAW predictions fit the RGB predictions. KD is a technique that is known to work quite well for compressing deep models, i.e., making a smaller model behave similarly to a larger model. Here we use KD for compressing both the heuristically designed ISP and a higher-level deep vision model, e.g., a classifier, into a new deep model for classification with the size of only the latter part. We show the advantage of our approach also when training the network to have similar predictions for short-exposure RAW images. In that case, in addition to the ISP and classification, the model also compresses the 'knowledge' of an ideal (non-existent) denoiser that maps the short-exposure RAW to a longer-exposure RAW image. We show that this technique works well for both classification and semantic segmentation.

Fig. 1. Distilling the knowledge of the ISP algorithm together with the pretrained CNN (e.g. classifier or semantic segmentation) that follows it improves the performance of a CNN applied directly to RAW images. Optionally, the CNN can operate on short-exposure RAW images, effectively distilling the information in the physical process of acquiring a signal with better SNR.

RELATED WORK
ISP for Vision. The ISP consists of a set of algorithms, usually applied sequentially, intended to transform a RAW image into a visually appealing RGB image. The different steps either fix some degradation in the acquisition process (e.g., noise) or just transform the image to better fit human perception (e.g., tone mapping). Recently, some works suggested replacing some or all of these operations with a learned model [6], [26]. However, these designed or learned ISPs are optimized for visual appearance and not for vision tasks.
Simply dropping the ISP does not work well. Several works used simulated RAW images to train a classifier and observed a substantial gap in accuracy [9]. Hansen et al. [13] report a larger gap for smaller models, a ∼16% drop for MobileNet, attributed to the failure of compact models to compensate for the lack of an ISP. Buckler et al. [5] identified the lack of demosaicing and gamma correction as a critical cause of performance degradation. They suggested modifying the imaging sensor such that demosaicing and gamma correction are no longer necessary, which limits the effect of the missing ISP. Some methods suggest optimizing the ISP for downstream vision tasks: [21], [28], [31] suggested tuning the parameters of a traditional (not learned) ISP to improve downstream tasks. Sharma et al. [27] add a component that takes an RGB image processed by the ISP and further enhances it for the downstream task. Diamond et al. [9] suggested jointly learning a low-level processing module, which performs denoising and deblurring, together with the classifier; they train with simulated RAW images. Wu et al. [30] suggested VisionISP, a trainable ISP trained to optimize object detection in an autonomous driving setting. Wang et al. [29] suggested a totally different approach for performing classification on low-quality images (e.g., fog or low contrast): they use a model pre-trained on high-quality images and learn a mapping between the deep representations of the low-quality images and the high-quality images. Essentially, unlike other methods, they perform domain adaptation by mapping the outputs of the classifier rather than the inputs. Other methods for classifying low-quality images include [3], [18], [32]. In this work, instead of designing an image processing module, i.e., modifying the input image for a certain task, we focus on applying the vision model directly to RAW data.

Knowledge Distillation. Compressing larger models into smaller ones was first suggested by Bucilua et al. [4]. The application of this technique to deep neural networks, known as Knowledge Distillation, was suggested by Hinton et al. [16].
The key idea behind KD is that the soft label output (or soft probabilities) of a classifier contains much more information about the data point than the hard label. KD has been widely used for numerous different applications, e.g. [2], [19], [22]. It has also been shown to work for distilling knowledge of non-neural network machine-learning models [11]. In [8], [12] KD was used to transfer knowledge across different types of sensors for either image reconstruction or high-level tasks, and in [17] it was theoretically analyzed. In this work, we use KD to distill not just a deep model, but also the non-learned (manually engineered) algorithms in the ISP and the information gained in the physical process of acquiring a better signal (that has a better SNR).

METHOD
We are interested in training a classification model to operate on images from one modality for which no semantic labels are available. What we do have is a function that maps these images to another modality for which we have a pre-trained model. Alternatively, we might have a dataset of pixel-aligned image pairs from the two modalities, even if a function that maps between them does not exist (or we do not have access to it). In our case, the two modalities are a RAW image and a processed image $\mathrm{RGB} = \mathrm{ISP}(\mathrm{RAW})$. We are also interested in operating on a short-exposure image, $\mathrm{RAW}_{\mathrm{short}}$, where the reference processed image is based on a longer exposure, $\mathrm{RGB}_{\mathrm{long}} = \mathrm{ISP}(\mathrm{RAW}_{\mathrm{long}})$, or on multiple short-exposure RAW images, $\mathrm{RGB}_{\mathrm{long}} = \mathrm{ISP}(\{\mathrm{RAW}^i_{\mathrm{short}}\})$ (clearly, there is no deterministic mapping $f$ such that $\mathrm{RGB}_{\mathrm{long}} = f(\mathrm{RAW}_{\mathrm{short}})$).
Manually labeling RAW images is hard, as they are almost impossible for humans to recognize. The alternative is to use RGB images for labeling and transfer the labels to RAW. The transfer to RAW can be done by having an inverse model of the ISP ($\mathrm{ISP}_{\mathrm{inv}}$). Given a dataset of labeled RGB images $\{(\mathrm{RGB}_i, l_i)\}$, we can generate a matching labeled RAW dataset $\{(\mathrm{RAW}_i, l_i)\} = \{(\mathrm{ISP}_{\mathrm{inv}}(\mathrm{RGB}_i), l_i)\}$, relying on the pixel alignment between RAW and RGB to get the labels for the RAW. But the ISP is not invertible, and thus the inverse model is only an estimate that, due to the one-to-many mapping, should be stochastic. Moreover, designing the inverse model is not trivial. In this work, instead of having an inverse model of the ISP for transferring the labels from RGB to RAW, we use a dataset of RAW-RGB pairs. Furthermore, to reduce the labeling cost, instead of manually labeling the RGB images, we use a pretrained model, $M_{\mathrm{RGB}}$, to label them. We argue that training on the dataset $\{(\mathrm{RAW}_i, l_i)\}$, where $l_i$ are the hard labels (e.g., an integer value or a one-hot encoding), with the cross-entropy (CE) loss is not the best option. Instead of using $l_i$, we use the soft probability distribution vector over the classes, $p_i$, predicted by the pretrained model. Following many works that have shown its benefits, we use the KD loss.
Let $p = M_{\mathrm{RGB}}(\mathrm{RGB})$ and $q = M_{\mathrm{RAW}}(\mathrm{RAW})$ be the probability vectors output by a softmax layer with temperature $T$. The softmax with temperature $T$ for a logit vector $x$ is defined as

$$\mathrm{softmax}_T(x)_j = \frac{\exp(x_j / T)}{\sum_k \exp(x_k / T)}.$$

The temperature $T$ controls the "softness" of the predicted distribution. For smaller $T$ we get a low-entropy vector (an approximation of one-hot), while for higher $T$ we get a high-entropy vector (the probability is distributed more evenly between all classes). The KD loss is given by

$$L_{KD}(p, q) = -\sum_j p_j \log q_j .$$

This loss simultaneously distills the information from the heuristically designed ISP and the CNN model (classifier) pretrained on RGB images, $M_{\mathrm{RGB}}$.
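As a concrete illustration, here is a minimal PyTorch sketch of the temperature-scaled softmax and the KD loss above; the function names and the batch-mean reduction are our own choices, not the paper's code.

```python
import torch
import torch.nn.functional as F

def soft_probs(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Softmax with temperature: exp(x_j / T) / sum_k exp(x_k / T)."""
    return F.softmax(logits / T, dim=-1)

def kd_loss(raw_logits: torch.Tensor, rgb_logits: torch.Tensor,
            T: float = 4.0) -> torch.Tensor:
    """Cross-entropy between the teacher's (RGB) soft targets p and the
    student's (RAW) soft predictions q, averaged over the batch. The T**2
    factor keeps gradient magnitudes comparable across temperatures, as in
    Hinton et al. [16]."""
    p = soft_probs(rgb_logits, T)                   # teacher: p
    log_q = F.log_softmax(raw_logits / T, dim=-1)   # student: log q
    return -(p * log_q).sum(dim=-1).mean() * T * T
```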
An extension of the KD loss that can be beneficial is adding an $\ell_2$ loss on intermediate representation vectors [23]. We found this extension to be useful in our case too, and use the extended form of the loss

$$L = L_{KD}(p, q) + \left\| f_{\mathrm{RGB}} - f_{\mathrm{RAW}} \right\|_2^2 ,$$

where $f_{\mathrm{RGB}}$ and $f_{\mathrm{RAW}}$ are the intermediate representations of the RGB image from $M_{\mathrm{RGB}}$ and of the RAW image from $M_{\mathrm{RAW}}$. Unlike [23], where the representation is captured in the middle of the network, we use the representation from the last layer before the classifier. This is more likely to produce aligned probability predictions, even without balancing the CE loss and the feature loss magnitudes.
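The added feature term can be sketched as follows, assuming both models expose the pre-classifier (penultimate) feature vector; the batch-mean reduction is again our assumption.

```python
import torch

def feature_loss(f_raw: torch.Tensor, f_rgb: torch.Tensor) -> torch.Tensor:
    """Squared l2 distance between the pre-classifier features of the RAW
    student and the RGB teacher, averaged over the batch."""
    return (f_raw - f_rgb).pow(2).sum(dim=-1).mean()
```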
In the case of short-exposure RAW, we use $q = M_{\mathrm{RAW}}(\mathrm{RAW}_{\mathrm{short}})$ while $p$ is still computed from the long-exposure RGB; thus we also implicitly learn to classify low-dynamic-range and extremely noisy images. As commonly done, we use a linear combination of the Cross-Entropy (CE) loss and the KD loss,

$$L = \alpha L_{KD}(p, q) + (1 - \alpha) L_{CE}(q, y) ,$$

where the CE loss is defined as $L_{CE}(q, y) = -\sum_j y_j \log q_j$ ($y$ is the ground-truth one-hot probability vector). In our experiments, we chose $T = 4$ (similar to [16]) and heuristically chose $\alpha = 0.9$ to balance the two loss terms (same order of magnitude).
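Putting the terms together, here is a self-contained sketch of the combined objective with the paper's $T = 4$ and $\alpha = 0.9$; the exact reductions are our assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(raw_logits: torch.Tensor, rgb_logits: torch.Tensor,
               hard_labels: torch.Tensor, T: float = 4.0,
               alpha: float = 0.9) -> torch.Tensor:
    """alpha * L_KD + (1 - alpha) * L_CE. hard_labels are the (pseudo)
    hard labels predicted by the RGB teacher."""
    p = F.softmax(rgb_logits / T, dim=-1)           # teacher soft targets
    log_q = F.log_softmax(raw_logits / T, dim=-1)   # student log-probs
    kd = -(p * log_q).sum(dim=-1).mean() * T * T    # L_KD
    ce = F.cross_entropy(raw_logits, hard_labels)   # L_CE = -sum_j y_j log q_j
    return alpha * kd + (1.0 - alpha) * ce
```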
We also extend the method to the semantic segmentation task, where the class label (or class probabilities) is predicted at each pixel or spatial location. In this case, the loss is averaged over all spatial locations at the output layer (whose resolution might differ from the input's). With $N$ spatial elements and $L_{CE}(i)$, $L_{KD}(i)$ the loss functions at location $i$, the final loss is defined as

$$L = \frac{1}{N} \sum_{i=1}^{N} \big( \alpha L_{KD}(i) + (1 - \alpha) L_{CE}(i) \big) .$$

It is common practice to normalize the inputs so each channel has zero mean and unit STD (calculated on the relevant training set). For the RAW images, we do the same, with the mean and STD calculated separately for the R, G, and B pixels in the Bayer pattern.
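A sketch of this per-color input normalization follows; the RGGB Bayer layout is our assumption (the actual sensor pattern may differ).

```python
import torch

def normalize_bayer(raw: torch.Tensor, mean_rgb, std_rgb) -> torch.Tensor:
    """raw: (H, W) single-channel Bayer image, RGGB layout assumed.
    mean_rgb/std_rgb: per-color (R, G, B) statistics from the train set."""
    out = raw.clone()
    out[0::2, 0::2] = (raw[0::2, 0::2] - mean_rgb[0]) / std_rgb[0]  # R sites
    out[0::2, 1::2] = (raw[0::2, 1::2] - mean_rgb[1]) / std_rgb[1]  # G sites
    out[1::2, 0::2] = (raw[1::2, 0::2] - mean_rgb[1]) / std_rgb[1]  # G sites
    out[1::2, 1::2] = (raw[1::2, 1::2] - mean_rgb[2]) / std_rgb[2]  # B sites
    return out
```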
In practice, for faster training, we initialize the RAW classifier, $M_{\mathrm{RAW}}$, with the weights of the pre-trained RGB model $M_{\mathrm{RGB}}$. With this initialization, a short training suffices (4 epochs in our experiments). Since RAW images have a single channel rather than 3 RGB channels, we need to make some adaptations to use off-the-shelf classifiers with them (especially when initializing with pre-trained models). We do so in a very simple way, by transforming the RAW images to RGB and filling in the missing values using bilinear interpolation, similar to what is done in [26]. There is no loss of information in this bilinear interpolation: the original pixel values are unchanged, and the interpolation only fills in the missing values introduced by the conversion from the 1-channel Bayer pattern to RGB.
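A sketch of this single-channel-to-RGB conversion, under our assumption of an RGGB layout; SciPy's griddata stands in for the bilinear filling, whose exact implementation the paper does not specify.

```python
import numpy as np
from scipy.interpolate import griddata

def bayer_to_rgb(raw: np.ndarray) -> np.ndarray:
    """raw: (H, W) float Bayer image (RGGB assumed). Returns (H, W, 3):
    original samples are kept exactly; missing values are interpolated."""
    H, W = raw.shape
    yy, xx = np.mgrid[0:H, 0:W]
    masks = {
        0: (yy % 2 == 0) & (xx % 2 == 0),  # R sites
        1: (yy % 2) != (xx % 2),           # G sites (two per 2x2 block)
        2: (yy % 2 == 1) & (xx % 2 == 1),  # B sites
    }
    rgb = np.zeros((H, W, 3), dtype=np.float64)
    for c, m in masks.items():
        pts = np.stack([yy[m], xx[m]], axis=1)
        rgb[..., c] = griddata(pts, raw[m], (yy, xx),
                               method="linear", fill_value=0.0)
    return rgb
```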

EXPERIMENTS
To validate our approach we used two test cases. In the first, we test performance when operating on noisy and mosaiced images, i.e., discarding the denoising and demosaicing preprocessing. In the second, we test performance when the full ISP is discarded.

Discarding Denoising and Demosaicing
We first test our method on synthetically generated RAW images, since for these images we can compare to the classification performance of training with ground-truth labels. In this experiment, we limit the simulated ISP (that we want to discard) to include only denoising and demosaicing. Thus, the RAW images are simulated by mosaicing the RGB images according to the Bayer pattern and adding Gaussian noise of varying STD (with image intensities in [0, 1]). Under such a distortion, the top-1 accuracy of ResNet18 on ImageNet drops from 69.76% to 29.23%. Training the model on the distorted (RAW) images using either ground-truth labels or hard labels produced by a pretrained ResNet18 (based on the clean RGB images) improves performance to ∼57%. Using the proposed ISP Distillation, we improve the performance by more than 4%. Similar trends are observed for MobileNetV2. Fig. 3 shows the improvement is consistent across noise levels, including for STD = 0, where the distortion is demosaicing only.
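A sketch of the RAW simulation used in this experiment as we understand it: Bayer mosaicing plus additive white Gaussian noise. The RGGB layout and the clamping to [0, 1] are our assumptions.

```python
import torch

def simulate_raw(rgb: torch.Tensor, noise_std: float) -> torch.Tensor:
    """rgb: (3, H, W) image in [0, 1], H and W even. Returns an (H, W)
    mosaiced, noisy RAW image (RGGB layout assumed)."""
    _, H, W = rgb.shape
    raw = torch.empty(H, W)
    raw[0::2, 0::2] = rgb[0, 0::2, 0::2]  # R
    raw[0::2, 1::2] = rgb[1, 0::2, 1::2]  # G
    raw[1::2, 0::2] = rgb[1, 1::2, 0::2]  # G
    raw[1::2, 1::2] = rgb[2, 1::2, 1::2]  # B
    return (raw + noise_std * torch.randn_like(raw)).clamp(0.0, 1.0)
```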

Discarding the Full ISP
To test the effect of discarding the full ISP, we use real data based on images from the HDR+ dataset [14]. This dataset includes 3640 bursts (containing 28461 images in total). Each burst has between 2 and 10 short-exposure RAW photos, each generally 12-13 Mpixels, depending on the camera used for capturing. The images in a burst are generally captured with the same exposure time and gain. The dataset also provides, for each burst, a merged RAW image generated by aligning the short-exposure RAW images and combining them to produce a single high-dynamic-range RAW. This merged RAW is then processed by their ISP to produce the final RGB image; see [14] for more details. We performed two kinds of experiments. In the first, we want to distill the ISP together with a pre-trained classification model. Thus, we use the merged RAW as input to our model, trying to mimic the predictions on the final RGB. In the second experiment, we choose a single short-exposure RAW and train our model to mimic the predictions made by the RGB model on the final RGB, produced from the merged RAW. We always choose the short-exposure RAW that is pixel-aligned with the merged RAW (this information is provided in the dataset).
For evaluation we used both a larger pseudo-labeled dataset and a smaller manually labeled dataset. For the pseudo-labeled dataset, the top-1 and top-5 accuracies are measured as the agreement of our model with the predictions of the pre-trained classification model on the final RGB. The images in the original HDR+ dataset are of very high resolution, but popular classifier architectures expect images in the range of 200-300 pixels. While we could down-sample the images, doing so would have removed the effect of the Bayer pattern (and potentially the effect of the noise too), and we are interested in understanding the ability of the model to overcome these artifacts. Therefore, we chose to split the images into smaller 256 × 256 non-overlapping patches. For training, we use all the patches originating from the bursts in the training set. For testing, since many of the patches do not contain an object, we only use those for which the pre-trained RGB classifier predicted an object with probability p > 0.8 (see the sketch below). The manually labeled test set contains 211 images out of the pseudo-labeled test set described above.

Table 2 compares the performance of training the RAW model with predicted labels vs. applying ISP Distillation. Our approach shows a substantial improvement of +9% top-1 accuracy for ResNet18 and +6% for MobileNetV2. It also exhibits a similar advantage for the model trained on short-exposure RAW, where the improvement is +8% and +11% for ResNet18 and MobileNetV2, respectively. A noticeable advantage exists in all experiments for top-5 accuracy too, bringing the accuracy to 97-98%. This suggests that the probability distribution predicted by the RAW model closely matches that of the RGB model. Table 3 compares the performance on the small manually labeled test set. While the results are a bit lower compared to the pseudo-labeled test set, we observe similar trends.
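A sketch of the patch pipeline, with `rgb_model` as a placeholder for the pretrained RGB classifier; the unfold-based splitting is our own implementation choice.

```python
import torch

def extract_patches(img: torch.Tensor, size: int = 256) -> torch.Tensor:
    """img: (C, H, W). Returns (N, C, size, size) non-overlapping patches,
    cropping any remainder at the right/bottom borders."""
    C, H, W = img.shape
    img = img[:, : H - H % size, : W - W % size]
    patches = img.unfold(1, size, size).unfold(2, size, size)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, C, size, size)

def confident_patches(patches: torch.Tensor, rgb_model,
                      thresh: float = 0.8) -> torch.Tensor:
    """Keep only patches the pretrained RGB classifier is confident about."""
    with torch.no_grad():
        probs = torch.softmax(rgb_model(patches), dim=-1)
    keep = probs.max(dim=-1).values > thresh
    return patches[keep]
```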
For the short-exposure RAW experiment, we also compared to a sequential combination of a network that performs the ISP part (DeepISP [26]) followed by the classifier, where both are trained end-to-end to optimize classification performance. The computational cost vs. accuracy trade-off is summarized in Table 4. DeepISP's flop count is ∼2.5G (compared to ResNet18's 2G and MobileNetV2's 0.3G). Note that the classifier alone, trained with ISP Distillation, performs almost as well as the combination of models, which is more computationally demanding. The expressive power of the joint ISP and classifier is probably limited by the constraint of having a 3-channel tensor at their interface. When training just the classifier with our approach, both models are effectively compressed into the classifier alone.

Semantic Segmentation
We tested the extension to semantic segmentation, where the KD loss is applied at each spatial location, as described in Sec. 3. For this experiment we used the DeepLabV3 model [7] with a ResNet101 backbone [15]. The model used for the RGB images was trained on the Pascal VOC dataset [10] and a subset of the COCO dataset [20] containing images with objects from the 20 VOC classes. As in the classification case, the same pretrained weights were used to initialize the model operating on RAW images. Figures 2 and 4 provide representative examples of the semantic segmentation results. To quantitatively assess the model's performance, we compute the Mean Intersection Over Union (mIOU) between the pretrained RGB model's outputs and our RAW model's outputs (i.e., the agreement between the two). The quantitative results are summarized in Table 5. We observed an improvement of ∼2.5% mIOU over the training-with-labels baseline, for both RAW images and short-exposure RAW images. This is a significant improvement, though not as large as the one for classification. We suspect that the per-pixel output adds higher-level information that provides better guidance to the student, and thus the effect of having soft labels rather than hard labels is smaller.

Table 7. Ablations (performed on the ImageNet experiment): removing the feature loss results in a drop in performance; using a stronger teacher (ResNet50) improves performance.
Partial model finetuning. Since we initialize our RAW model with ImageNet pre-trained weights, we want to test how many of the layers need to be adapted to accommodate the RAW input. Is it just about local artifacts, so that finetuning only the first layers is enough, or are there global distortions that require the higher-level features from deeper layers to adapt too? Table 6 shows that training just the first layers is indeed beneficial; most of the gains come from finetuning the first and second quarters of the network. Training the first half of ResNet18 is almost as good as training the full model (see the sketch below).
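A sketch of how such partial finetuning can be set up on a torchvision ResNet18; mapping the "first half" to the stem plus layer1 and layer2, and freezing everything else (including the classifier head), are our assumptions.

```python
import torchvision

# Student initialized from ImageNet-pretrained weights, as in the paper.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Train only the first half of the network; freeze the rest.
trainable = {"conv1", "bn1", "layer1", "layer2"}
for name, param in model.named_parameters():
    param.requires_grad = name.split(".")[0] in trainable
```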
Features loss and stronger teacher. Knowledge Distillation was originally used for distilling knowledge from stronger (more parameters, deeper) models into weaker ones. We verify that performance can be further improved by using a stronger teacher in our case as well. We test the case where the teacher architecture is ResNet50 and the student is either ResNet18 or MobileNetV2. The teacher and student feature dimensions differ, and even if they were the same, since the models are pretrained independently, points in the two unaligned embedding spaces cannot be compared. For this reason, we drop the $\ell_2$ loss on the features in this case. Table 7 shows that dropping the features loss results in a drop of about 1% in performance. Adding the stronger teacher improves the performance by an amount comparable to the effect of the features loss. In all our experiments we use the features loss. It might be possible to obtain even higher performance by combining the features loss with a stronger teacher (by having the same feature dimension and enforcing embedding space alignment).

Things that did not work
For completeness, we also report ideas we hypothesized should improve performance but empirically did not.
Gradual blending. We are trying to solve a domain adaptation problem from RGB to RAW. Unlike the usual domain adaptation task, in this case we have pixel-aligned pairs of images from both domains. Thanks to this pixel alignment, we can generate infinitely many intermediate domains by interpolating between the RAW and RGB images, i.e., we can define a set of new domains $D_\alpha$ such that the images in $D_\alpha$ are $\{\alpha \cdot \mathrm{RAW}_i + (1 - \alpha) \cdot \mathrm{RGB}_i\}$. This approach was used by [1] for the case of shifting from classification of images where the background is masked out to classification of the same images with the background present. We expected that gradually (linearly) shifting between the domains, i.e., first training the model to classify images from domains with small $\alpha$ and finishing with $\alpha = 1$ (pure RAW), would help the model learn better (see the sketch below). But, in our experiments, the model ultimately converged to a similar performance as directly training with $\alpha = 1$.
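A sketch of the blending curriculum; the linear schedule over training steps is one possible (assumed) instantiation.

```python
import torch

def blended_batch(rgb: torch.Tensor, raw: torch.Tensor,
                  alpha: float) -> torch.Tensor:
    """Pixel-aligned blend defining the intermediate domain D_alpha
    (alpha = 0: pure RGB, alpha = 1: pure RAW). Both tensors are assumed
    to live in the same 3-channel space, cf. the bilinear fill-in above."""
    return alpha * raw + (1.0 - alpha) * rgb

def alpha_schedule(step: int, total_steps: int) -> float:
    """Linearly shift from the RGB domain to the RAW domain over training."""
    return min(1.0, step / total_steps)
```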
Localized distillation. We observed in the case of semantic segmentation that even training with hard labels yields very strong results. We hypothesized that the localized semantic signal helps the student learn better by providing more information. To test this, we performed the experiment of distilling both the ISP and a classifier (similarly to Section 4.2), but requiring spatial consistency between the teacher and the student. To keep the spatial information, we removed the global pooling layer and replaced the fully connected layer with a 1 × 1 convolution layer (with the same weights), as shown in the sketch below. We then applied the KD loss at each spatial location. However, in our experiments, this did not improve the results beyond what we got with the 'global' KD loss.
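A sketch of this architectural change on a torchvision ResNet18: drop the global average pooling and replace the fully connected layer with an equivalent 1 × 1 convolution reusing its weights, so class logits are produced at every spatial location. The exact wiring is our illustration.

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Turn the final fully connected layer into an equivalent 1x1 convolution.
fc = model.fc
conv = nn.Conv2d(fc.in_features, fc.out_features, kernel_size=1)
conv.weight.data.copy_(fc.weight.data.view(fc.out_features, fc.in_features, 1, 1))
conv.bias.data.copy_(fc.bias.data)

# Backbone without the global pooling and fc layers.
features = nn.Sequential(*list(model.children())[:-2])

def spatial_logits(x: torch.Tensor) -> torch.Tensor:
    """Class logits at each spatial location, e.g. (B, 1000, H/32, W/32);
    the KD loss can then be applied per location."""
    return conv(features(x))
```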

CONCLUSIONS
We have shown that it is possible to distill the knowledge of not just pre-trained models but also a heuristically designed ISP, to improve the performance of classification and semantic segmentation models on RAW images. We have also shown an improvement for short-exposure RAW, distilling the information of the physical process of acquiring a better signal. Our proposed ISP Distillation is a step towards reaching similar performance on RAW images as on RGB in high-level vision tasks. It can advance the deployment of vision models on RAW images in domains where the images are consumed by machines and not humans, saving the compute cost of the ISP.