Learning Tone Curves for Local Image Enhancement

Image enhancement methods can be formulated as global transformations, local transformations, pixel-wise processing, or a mixture of these operations. Global transformations are limited in enhancing local image regions. Existing local and pixel-wise methods mitigate this issue, but give rise to the additional challenge of limited interpretability. Bridging the gap between global and local methods, we propose a local tone mapping network (LTMNet) that learns a grid of tone curves to locally enhance an image. Tone curves are commonly used by photo-editing software and offer an intuitive representation to photographers, facilitating subsequent customization of the image. Tone curves are also widely used in image signal processors (ISPs), making our method easy to deploy on cameras. Because existing datasets contain image enhancement and photofinishing beyond global and local tone mapping, we also propose a new dataset representative of local tone mapping—the LTM dataset. We evaluate our method on this new dataset as well as MIT-Adobe and HDR+ datasets. We show that the proposed LTMNet outperforms existing methods in local tone mapping while achieving competitive performance modeling additional photofinishing. Furthermore, we show that our method can be assistive in user-interactive photo-editing tools. Our code, model, and data will be released publicly at https://github.com/SamsungLabs/ltmnet.


I. INTRODUCTION
Cameras capture valuable moments in our daily life in the form of photographs. Most cameras use dedicated image signal processors (ISPs) to process the captured sensor image into the final output image. ISPs apply several steps in a pipeline fashion to process images. One of the key operations is tone mapping. Tone mapping is an essential step in the photo enhancement stages of ISPs and has a major impact on the quality of the final image by enhancing the contrast and color tones of the image.
A tone map converts an input pixel intensity to a new output intensity. Generally, the same or different tone maps are applied to the R, G, and B channels of a color image. This operation is efficient to perform in hardware using a lookup table (LUT). Tone maps are often called by other names: for example, tone curve, transfer function, and 1D LUT. Tone mapping is widely used in dynamic range compression (e.g., [3], [4], [5]), reducing a high dynamic range (HDR) image to a lower dynamic range (LDR) while preserving an aesthetically pleasing appearance. In this paper, however, we place our focus on transformations within the same dynamic range, rather than HDR to LDR.
Tone mapping can be applied in two manners, global and local. Global tone mapping (GTM) maps each pixel value to another value regardless of pixel location. For a typical real-world ISP, GTM alone is rarely sufficient to adequately enhance an image. In particular, GTM lacks flexibility and produces over/under-enhanced local regions, as shown in Fig. 1-(B). Local tone mapping (LTM), on the other hand, spatially adjusts image regions using different tone curves based on local characteristics. LTM offers more fine-grained control and helps bring out highlights. As shown in Fig. 1-(D), our LTM method provides transformations tailored to a local region, for example, increasing the visibility of regions of dense content (the bushes) and increasing the contrast of the shadow regions to provide more vibrant imagery effects. Many existing methods either perform GTM only (e.g., [1], [6]), or perform pixel-wise enhancements (e.g., [2], [7]), which output an enhanced image rather than transfer functions. Little work focuses on explicitly learning local tone curves, which can be efficiently integrated into existing ISPs as 1D LUTs, making them more convenient than pixel-wise processing and more powerful than GTM.
The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.
Contribution: We propose a deep-learning approach to local image enhancement that estimates local tone curves. Unlike existing methods, our approach is trained to output a grid of local tone curves instead of pixel-wise processing. Tone curves are more intuitive for post-processing editing and implementation in hardware. Because tone curves can be applied to images of any size, our method is not limited to a specific resolution. We also introduce a new image dataset representative of local tone mapping, consisting of tone-mapped and non-tone-mapped image pairs. In addition, we provide a tool for interactive LTM manipulation that can be used to manually fine-tune the tone curves predicted automatically by our method.

II. RELATED WORK
There is a large body of work on image enhancement; only representative works are presented here. We divide these methods into three main categories based on the way they process the image: (1) methods that apply global transfer functions or LUTs to the whole image; (2) methods that apply local enhancement to local image regions; (3) methods that apply pixel-level mapping from input to enhanced image. Some methods may combine two or more of global, local, and pixel-wise processing.

A. GLOBAL ENHANCEMENT
Traditional image enhancement methods apply pre-defined global transfer functions, such as gamma correction, or a transfer function estimated from the intensity distribution, such as histogram equalization [8] and its extensions: contrast-limited histogram equalization (CLHE) [9] and histogram modification framework (HMF) [10].
Recent methods use neural networks to predict global transformations. For example, the method in [11] proposes a neural network to implement CLHE and HMF. SpliNet [12] performs personalized enhancements using a learned global tone curve. The method in [6] learns a 3D lookup table to achieve fast and robust enhancement. White Box [13] selects the best sequence of global enhancement operations from a pre-defined set based on deep reinforcement learning (RL) guided by generative adversarial networks (GANs). Distort-and-recover [14] also uses RL to explicitly model the step-wise nature of the human retouching process. Similarly, [15] uses RL and unpaired images.
As mentioned in the introduction, global transformations can under- or over-enhance local image content, which motivates local enhancement methods.

B. LOCAL ENHANCEMENT
Many methods extend histogram equalization techniques to be locally adaptive [16]-[19]. One prominent example is adaptive histogram equalization (AHE) [20], [21], which involves equalizing a set of histograms computed from local image regions, typically a grid of patches. One further extension is contrast-limited AHE (CLAHE) [9], where the contrast amplification is limited by clipping the computed histograms. CLAHE is an industry standard adopted by many camera ISPs and typically used as a local tone mapping operator; however, it requires careful parameter tuning.
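To make the clipping idea concrete, the following is a minimal sketch of a contrast-limited equalization LUT for a single 8-bit tile. It is a simplified illustration, not the exact algorithm of [9]: the function name `clipped_equalization_lut`, the definition of `clip_limit` as a multiple of the mean bin count, and the single-pass redistribution of the clipped mass are all assumptions of this sketch.

```python
import numpy as np

def clipped_equalization_lut(tile, clip_limit=4.0, levels=256):
    """Build a contrast-limited equalization LUT for one 8-bit tile.

    clip_limit is a multiple of the mean bin count (an assumption of this
    sketch; implementations define the limit in different ways).
    """
    hist = np.bincount(tile.ravel(), minlength=levels).astype(np.float64)
    limit = clip_limit * hist.mean()
    excess = np.maximum(hist - limit, 0.0).sum()
    # Clip the histogram and spread the clipped mass evenly over all bins.
    hist = np.minimum(hist, limit) + excess / levels
    cdf = np.cumsum(hist) / hist.sum()
    return np.round(cdf * (levels - 1)).astype(np.uint8)

# A low-contrast tile whose values all lie in [80, 120).
rng = np.random.default_rng(0)
tile = rng.integers(80, 120, size=(64, 64), dtype=np.uint8)
lut = clipped_equalization_lut(tile)
enhanced_tile = lut[tile]  # applying the LUT is a single table lookup
```

The resulting LUT is non-decreasing and stretches the narrow input range, while the clip limit bounds how steep the curve can become.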
Some methods perform color transformations for local image enhancement. Color palette-based methods [22], [23] interpolate colors based on a sparse set of colors in the palette. However, updating the palette requires user interaction or example images. Representative color transform (RCT) [24] learns and transforms a set of representative colors in the image, globally and locally. HDRNet [25] learns a bilateral grid [26] of 3 × 4 affine transformation matrices from a down-sampled image. Each matrix maps an input color to an output color. These affine coefficients are then applied to the full-resolution input image through bilateral guided upsampling [27]. HDRNet is expressive at modeling complex transformations. However, the matrices are difficult to visualize and interpret. In contrast, our method learns a grid of 1D curves, which are intuitive to photographers and easy to interpret and edit. StarEnhancer [28] also learns a set of curves that transform an image based on both intensity values and pixel location. The curves can be manually fine-tuned, but it is difficult to pinpoint adjustments to a specific spatial coordinate. Our method, in comparison, explicitly maps each curve to a local region.

FIGURE 2. Overview of our local image enhancement pipeline. We first learn a grid of tone curves using a neural network. Each predicted tone curve corresponds to one patch of the input image. The predicted tone curves are then applied to each patch with tile-based interpolation.

C. PIXEL-WISE ENHANCEMENT
Traditional pixel-wise methods perform base-detail layer decomposition to enhance an image's high-frequency details. These methods include bilateral filtering [29], Laplacian operators [30], guided filtering [31], and just-noticeable-difference (JND) transform [32]. Numerous recent methods use convolutional neural networks (CNNs), especially encoder-decoder architectures [33]. The method in [34] maps input images to enhanced images using per-pixel quadratic color transforms. The work in [35] maps low-quality smartphone images to corresponding high-quality DSLR images. Some methods employ GANs, such as WESPE [36] and deep photo enhancer (DPE) [37]. EnhanceGAN [38] applies weak supervision using binary labels of image aesthetic quality to estimate piece-wise transfer functions on the CIELab color space. PieNet [39] incorporates user preferences by injecting a preference vector into its base network. CSRNet [1] processes pixels independently using a multi-layer perceptron (MLP) modulated by a global feature vector extracted from a condition network. Neural curve layers (CURL) [2] predicts a sequence of global transfer functions applied in different color spaces while using a backbone CNN for local enhancement. Both [7] and [40] learn global tone curves and a pixel-wise residual map for local enhancement. IceNet [41] personalizes local contrast enhancement by predicting per-pixel gamma-correction values based on a global brightness parameter and a scribble map, both interactively provided by the user. Our method combines automatic image enhancement with the option of manual post-editing to minimize user effort while allowing interactivity.
Pixel-wise methods are less explainable as it is difficult to identify what operations are performed on the input image, whereas our method explicitly specifies the transformations.

D. OTHER METHODS
Another set of methods addresses underexposure enhancement, such as DeepUPE [42], DRHT [43], and [44]. Similarly, other methods focus on low-light image enhancement, such as [45]-[47]. Zero-reference deep curve estimation (Zero-DCE) [48] is a pixel-wise curve-based method that targets low-light images without reference images. Some methods rely on physical models of image formation (e.g., the Retinex theory of color vision [49]). Such methods include exposure correction methods based on separation of scene reflectance and illumination [50], illumination estimation [42], [51], and modeling of camera response functions [52].
As an alternative to CNNs, STAR [53] is a fast and lightweight backbone network for multiple image enhancement tasks, such as white-balancing, low light image enhancement, and photofinishing.
Global transformations may not be sufficient to model the highly non-linear mapping between low-quality and high-quality images. Methods based on pixel-wise processing are usually hard to interpret, fine-tune, or integrate into ISPs or photo-editing software. Our method is based on learning local tone curves for local image regions; this makes it more flexible than global enhancement methods. Also, tone curves are well understood, interpretable, and widely used in many camera ISPs and photo-editing software. To the best of our knowledge, our method is the first to learn local tone mapping automatically in a data-driven manner instead of relying on manual tuning.

III. LTMNet
Our method, illustrated in Fig. 2, aims to perform local image enhancement through learning a grid of local tone maps, inspired by the well-established CLAHE algorithm [9]. Given an input image $x \in \mathbb{R}^{H \times W \times C}$, we use a neural network LTMNet to predict a set of local tone maps (LTMs) $T \in \mathbb{R}^{M \times N \times C \times L}$ from the input image:

$$T = \mathrm{LTMNet}(x), \tag{1}$$

where $H$, $W$, and $C$ are image height, width, and number of channels, respectively. $M$ and $N$ are the height and width, respectively, of a grid of image patches. $L$ is the number of intensity levels, typically 256 for 8-bit integer images. LTMNet layers serve two purposes: feature extraction and tone curve prediction. For feature extraction, a wide range of architectures can be used, as long as the receptive fields of the output neurons composing the tone curves cover the image patches on which they are applied. A tone curve prediction head can be stacked on top of the feature extraction layers to ensure tone curve entries are in the desired shape and range (i.e., $M \times N \times C \times L$). For efficiency, we design LTMNet such that the input image is always resized to a fixed input size (e.g., 512 × 512).

A. LOCAL TONE CURVES
The output of LTMNet, $T$, represents a set of transfer functions (i.e., tone curves) that are applied to the input image to adjust its local contrast, brightness, and colors. LTMNet predicts a number of tone curves or 1D lookup tables (LUTs) for each image patch in an $M \times N$ grid. For a typical standard RGB (sRGB) image, three 1D LUTs are predicted for each patch, one for each R, G, and B channel:

$$T_{m,n} = \{ t^{R}_{m,n},\; t^{G}_{m,n},\; t^{B}_{m,n} \}. \tag{2}$$

Each tone curve is represented by a 1D LUT that has $L$ entries, $t \in \mathbb{R}^{L}$. Each entry maps an input pixel intensity to an output enhanced intensity. The predicted local tone curves are applied to the input image using bilinear interpolation between each set of local tone curves in order to produce a smooth and artifact-free locally tone-mapped image $\hat{y} \in \mathbb{R}^{H \times W \times C}$:

$$\hat{y} = \mathrm{Interp}(x, T). \tag{3}$$
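As an illustration of how a grid of 1D LUTs of shape M × N × C × L indexes into an image, the sketch below applies each patch's own curves directly, with no blending between patches (the interpolation step is described in the next subsection). The function name `apply_curves_per_patch` and the choice of equally sized patches are assumptions made for this example.

```python
import numpy as np

def apply_curves_per_patch(x, T):
    """Apply each patch's own tone curves, without any interpolation.

    x: float image in [0, 1], shape (H, W, C).
    T: tone curves with values in [0, 1], shape (M, N, C, L).
    """
    H, W, C = x.shape
    M, N, _, L = T.shape
    idx = np.rint(x * (L - 1)).astype(np.int64)        # LUT index of every pixel
    rows = np.minimum(np.arange(H) * M // H, M - 1)    # patch row of each pixel row
    cols = np.minimum(np.arange(W) * N // W, N - 1)    # patch column of each pixel column
    m = rows[:, None, None]
    n = cols[None, :, None]
    c = np.arange(C)[None, None, :]
    return T[m, n, c, idx]                             # broadcast lookup, shape (H, W, C)

L = 256
identity = np.linspace(0.0, 1.0, L)
T = np.tile(identity, (8, 8, 3, 1))    # identity curves for an 8 x 8 grid
T[0, 0] = identity ** 0.5              # brighten only the top-left patch
x = np.full((64, 64, 3), 0.25)
y = apply_curves_per_patch(x, T)
```

Applied this way, the brightened patch has a hard edge against its neighbors, which is the kind of boundary artifact that interpolating between curves is meant to avoid.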

B. LOCAL TONE CURVE INTERPOLATION
A predicted tone curve $t_{m,n}$ is most appropriate for the center pixel of patch $(m, n)$ in the $M \times N$ grid. Intuitively, all other pixels in the patch are influenced by the tone curves of neighboring patches to varying degrees, according to the distance of the pixel to the neighboring patch centers. This way, the tone curve for each pixel smoothly transitions to another, resulting in a continuous output image free of boundary artifacts. Our tone curve interpolation module, Interp, transforms every non-center pixel by a combination of the neighboring tone curves whose patch centers are closest to it, as shown in Fig. 3. Pixels in the center region of the image are bilinearly interpolated, combining the influence of the four neighboring tone curves.
Specifically, suppose $(i_1, j_1)$, $(i_2, j_1)$, $(i_1, j_2)$, $(i_2, j_2)$ are the $(x, y)$ coordinates of the four patch centers closest to location $(i, j)$ of input image $x$, in the order of top left, top right, bottom left, and bottom right, respectively. Moreover, suppose $t_1, t_2, t_3, t_4$ are the predicted tone curves of the four patch centers in the same order. The interpolated pixel value at $(i, j)$ is given by Equation 4:

$$\begin{aligned}
\hat{y}(i, j) ={} & \frac{(i_2 - i)(j_2 - j)}{(i_2 - i_1)(j_2 - j_1)}\, t_1\big[x(i, j)(L - 1)\big] + \frac{(i - i_1)(j_2 - j)}{(i_2 - i_1)(j_2 - j_1)}\, t_2\big[x(i, j)(L - 1)\big] \\
& + \frac{(i_2 - i)(j - j_1)}{(i_2 - i_1)(j_2 - j_1)}\, t_3\big[x(i, j)(L - 1)\big] + \frac{(i - i_1)(j - j_1)}{(i_2 - i_1)(j_2 - j_1)}\, t_4\big[x(i, j)(L - 1)\big],
\end{aligned} \tag{4}$$

where $x(i, j) \in [0, 1]$, and $[\cdot]$ indicates rounding to the nearest integer. $L$ is the number of intensity levels, typically 256 for 8-bit integer images.

Similarly, pixels in the border region are linearly interpolated. Take a location $(i, j)$ in the top or bottom border region as an example; suppose $(i_1, j_1)$ and $(i_2, j_1)$ are the $(x, y)$ coordinates of the two patch centers closest to location $(i, j)$, in the order from left to right. Moreover, suppose $t_1$ and $t_2$ are the predicted tone curves of the two patch centers in the same order. The interpolated pixel value at $(i, j)$ is given by:

$$\hat{y}(i, j) = \frac{i_2 - i}{i_2 - i_1}\, t_1\big[x(i, j)(L - 1)\big] + \frac{i - i_1}{i_2 - i_1}\, t_2\big[x(i, j)(L - 1)\big]. \tag{5}$$

Finally, pixels in the four corner regions are not interpolated. Suppose $t$ is the predicted tone curve of the patch center closest to a position $(i, j)$ in one of the corner regions; the tone-mapped pixel value at $(i, j)$ is given by:

$$\hat{y}(i, j) = t\big[x(i, j)(L - 1)\big]. \tag{6}$$

The input image to both the tone curve prediction network LTMNet and the interpolation module Interp can take on any shape because the application of tone curves only transforms pixel values and is independent of the image's spatial dimensions. The final output is a continuous locally tone-mapped image $\hat{y}$, with the same resolution as the input image $x$.
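The three cases described here (bilinear blending in the center region, linear blending along borders, and a single curve in the corners) can be sketched in NumPy as follows. This is a hedged illustration rather than the released implementation: the function name `interp_tone_curves`, the assumption of equally sized patches, and the use of clipped grid coordinates to obtain the border and corner cases are choices made for this example. The code indexes pixels as (row, column), whereas the text uses (x, y) coordinates.

```python
import numpy as np

def interp_tone_curves(x, T):
    """Apply a grid of tone curves, blending between the nearest patch centers.

    x: float image in [0, 1], shape (H, W, C).
    T: tone curves with values in [0, 1], shape (M, N, C, L).
    """
    H, W, C = x.shape
    M, N, _, L = T.shape
    idx = np.rint(x * (L - 1)).astype(np.int64)

    # Continuous grid coordinates: a pixel at the center of patch m has gy == m.
    # Clipping collapses the two neighbors along an axis into one, which gives
    # linear blending on the borders and a single curve in the corners.
    gy = np.clip((np.arange(H) + 0.5) * M / H - 0.5, 0, M - 1)
    gx = np.clip((np.arange(W) + 0.5) * N / W - 0.5, 0, N - 1)
    m0 = np.floor(gy).astype(np.int64)
    n0 = np.floor(gx).astype(np.int64)
    m1 = np.minimum(m0 + 1, M - 1)
    n1 = np.minimum(n0 + 1, N - 1)
    wy = (gy - m0)[:, None, None]    # weight of the lower neighbor
    wx = (gx - n0)[None, :, None]    # weight of the right neighbor
    c = np.arange(C)[None, None, :]

    def look(m, n):
        # Output of the curves of patch (m, n) at every pixel's intensity.
        return T[m[:, None, None], n[None, :, None], c, idx]

    top = (1 - wx) * look(m0, n0) + wx * look(m0, n1)
    bottom = (1 - wx) * look(m1, n0) + wx * look(m1, n1)
    return (1 - wy) * top + wy * bottom

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(40, 40, 3)) / 255.0
T = np.tile(np.linspace(0.0, 1.0, 256), (8, 8, 3, 1))   # identity curves
y_hat = interp_tone_curves(x, T)                        # equals x for identity curves
```

Because the four weights always sum to one, identity curves leave the image unchanged, and a pixel located exactly at a patch center receives only that patch's curve.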

C. TONE CURVE CONSTRAINTS
For each image patch in the $M \times N$ grid, its corresponding lookup table maps each pixel value to some other value according to the table entries. The entries in the lookup table are enforced to be non-decreasing to maintain intensity rank consistency. Furthermore, maximum intensity is kept unchanged in the LUT to preserve information in the overexposed regions. The tone curve constraints are implemented through integrating and normalizing non-negative output neurons:

$$t_l = \frac{\sum_{k=0}^{l} t'_k}{\sum_{k=0}^{L-1} t'_k}, \tag{7}$$

where $t'_k$ is one output neuron from the last layer of the neural network, which is followed by a sigmoid activation to constrain the neurons such that $t'_k \in [0, 1]$. Integration of $t'$ enables $t_l$, an entry in tone curve $t$, to be non-decreasing over the range $l \in [0, L - 1]$.
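The constraint described above can be sketched as a sigmoid followed by a cumulative sum and a normalization. The function name `constrained_tone_curve` is hypothetical, and the cumulative sum is this sketch's discrete stand-in for the integration mentioned in the text.

```python
import numpy as np

def constrained_tone_curve(logits):
    """Turn unconstrained network outputs into a monotone tone curve.

    logits: array of shape (..., L), raw outputs before the sigmoid.
    """
    increments = 1.0 / (1.0 + np.exp(-logits))    # sigmoid, each value in (0, 1)
    cumulative = np.cumsum(increments, axis=-1)   # discrete integration
    return cumulative / cumulative[..., -1:]      # normalize so the last entry is 1

rng = np.random.default_rng(0)
curve = constrained_tone_curve(rng.normal(size=256))
```

Since every increment is positive, the curve is non-decreasing by construction, and the normalization pins the maximum intensity to 1 regardless of the raw outputs.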

D. LOSS FUNCTIONS
We use two loss functions to drive model training: $L_1$ loss and perceptual loss [54]. The $L_1$ loss penalizes the difference between the predicted image $\hat{y}$ and its corresponding ground-truth image $y$. For the perceptual loss, we use the first convolutional layers of the initial two blocks of VGG19 [55] (block1_conv1 and block2_conv1), pretrained on ImageNet [56], and minimize the squared $L_2$ distance between the features of the predicted and target images. Since the predicted and target images differ only in terms of low-level features, such as brightness, contrast, and color, only layers of the initial two VGG blocks are used for the loss function. Deeper VGG layers are not used because they primarily encode high-level information, such as object shape and spatial arrangement [57], which is already identical between our paired images. Our loss function is

$$\mathcal{L} = \lambda_{l_1} \lVert \hat{y} - y \rVert_1 + \lambda_p \sum_{k=1}^{2} \lVert \phi_k(\hat{y}) - \phi_k(y) \rVert_2^2, \tag{8}$$

where $\phi_k$ indicates VGG19 features from the first convolutional layer in the $k$-th block. We empirically set the $L_1$ loss weight $\lambda_{l_1}$ to 3.0 and the perceptual loss weight $\lambda_p$ to $10^{-4}$.
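The weighting of the two terms can be illustrated with a small NumPy sketch. The feature extractors are passed in as callables; in the paper they are VGG19 layers, whereas here simple image gradients are used as hypothetical stand-ins so that the example runs without a pretrained network. Averaging (rather than summing) each term is also an assumption of this sketch.

```python
import numpy as np

def ltm_loss(y_hat, y, feature_fns, lambda_l1=3.0, lambda_p=1e-4):
    """Weighted sum of an L1 term and a feature-space squared-L2 term."""
    l1 = np.abs(y_hat - y).mean()
    perceptual = sum(np.mean((f(y_hat) - f(y)) ** 2) for f in feature_fns)
    return float(lambda_l1 * l1 + lambda_p * perceptual)

# Stand-in "features" (NOT VGG19): horizontal and vertical image gradients.
def grad_x(img):
    return np.diff(img, axis=1)

def grad_y(img):
    return np.diff(img, axis=0)

rng = np.random.default_rng(0)
target = rng.random((32, 32, 3))
prediction = target + 0.1          # a uniform brightness offset
loss = ltm_loss(prediction, target, [grad_x, grad_y])
```

With these stand-in features a uniform brightness offset changes only the L1 term, since the gradients of the two images are identical.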

E. NETWORK ARCHITECTURE
The architecture of LTMNet is shown in Fig. 4. The first layer size is 512 × 512 × 4, followed by a sequence of convolutional, non-linear activation [58], and max pooling layers. Either the number of layers or the pool size of the last pooling layer can be adjusted such that the output shape is consistent with the shape of the tone curves $T$.

FIGURE 4. Architecture of LTMNet, with L = 256.

IV. DATASETS

A. EXISTING DATASETS
Two commonly used datasets in image enhancement are MIT-Adobe FiveK [59] and HDR+ [60]. MIT-Adobe FiveK contains 5,000 pairs of input and enhanced images retouched by five professional experts. However, this dataset involves mostly global tone mapping among other photo retouching operations [61]. The HDR+ dataset consists of 3,640 image bursts, which make up 28,461 images in total. Each burst is processed into a merged, aligned, and enhanced single output high dynamic range (HDR) image. This dataset includes strong local tone mapping and is more suitable for evaluating our method. However, it also includes other photofinishing operations, such as sharpening and hue/saturation adjustment.
For evaluation on the HDR+ dataset, we prepare image pairs as follows. We process the raw-RGB merged frame into a gamma-corrected sRGB image using a simulated image signal processor (ISP) [62] and use it as input. We use the final photofinished JPEG image as output. We prepared around 2,000 image pairs.

B. OUR LTM DATASET
As illustrated in Fig. 5, in a typical ISP, tone mapping and other photofinishing operations, such as color manipulation, are often performed in separate stages. To the best of our knowledge, there is no image dataset involving local tone mapping only; existing datasets include global tone mapping or local tone mapping mixed with other photofinishing operations. To overcome this issue, we used CLAHE [9], a widely adopted industry standard for local tone mapping in ISPs, to generate a dataset of image pairs. Each pair consists of an sRGB image with global gamma correction and the corresponding locally tone-mapped image using CLAHE. We used MIT-Adobe FiveK to generate our dataset. A major limitation of CLAHE is that it requires manual tuning of its parameters, the grid size and the contrast limit. Instead of manually tuning these parameters for each image, we perform a grid search on the parameters for each image and automatically select the parameter values that produce the image with the highest score under a non-reference image quality metric. We use neural image assessment (NIMA) [63] as the non-reference metric as it corresponds well with human perception. Specifically, out of all versions of an enhanced image, NIMA is able to select one without artifacts. Appendix A provides further justifications for our choice of NIMA. Fig. 6 showcases examples of grid-searched images over 15 parameter combinations. Although the images selected by NIMA are mostly artifact-free, some poor-quality ones may still be selected, which are then manually removed. In the end, we removed 91 images out of 2,500 (< 4%). These images are mostly ones with large homogeneous regions (e.g., sky) that may not require local processing. Finally, we round down our LTM dataset to consist of 2,000 image pairs.
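The per-image parameter selection described above amounts to an exhaustive search over parameter pairs, keeping the version with the highest quality score. The sketch below shows that loop in isolation. The callables `enhance` and `score` are hypothetical placeholders (in the paper they would be CLAHE and NIMA), and the parameter values are illustrative only, chosen so that 3 × 5 = 15 combinations are searched; the actual values used for the dataset are not given here.

```python
from itertools import product

def select_best_version(image, enhance, score, grid_sizes, clip_limits):
    """Return (best_score, best_params, best_image) over all parameter pairs."""
    best = None
    for grid, clip in product(grid_sizes, clip_limits):
        candidate = enhance(image, grid, clip)
        s = score(candidate)
        if best is None or s > best[0]:     # keep the first best on ties
            best = (s, (grid, clip), candidate)
    return best

# Toy stand-ins so the sketch runs end to end (not CLAHE, not NIMA).
def toy_enhance(image, grid, clip):
    return image * grid / clip

def toy_score(candidate):
    return -abs(candidate - 2.0)

grid_sizes = (2, 4, 8)             # illustrative values only
clip_limits = (1, 2, 3, 4, 5)      # illustrative values only
best_score, best_params, best_image = select_best_version(
    1.0, toy_enhance, toy_score, grid_sizes, clip_limits
)
```

Because the search is independent per image, it parallelizes trivially across a dataset.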

C. QUANTIFYING LOCAL TONE MAPPING
To estimate the extent of local tone mapping in each dataset, we perform the following experiment. We compute the root mean squared error (RMSE) of the best-fit fourth-degree polynomial between the input and output image intensities, averaged over all images in the dataset. This metric gives an indication of how much the transformation between an image pair deviates from a single global transfer function, and hence, it also indicates how much local processing exists in the images. The RMSEs for MIT-Adobe, HDR+, and our LTM dataset are 0.0229, 0.0483, and 0.0404, respectively. The results indicate that the MIT-Adobe dataset does not contain much local processing, while HDR+ contains significant local processing, including local tone mapping. Our LTM dataset contains noticeable local processing, but unlike the other datasets, it is restricted to local tone mapping only. Fig. 7 shows an example of the fitted transfer functions between example images from the three datasets.
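A minimal sketch of this measurement for one image pair is given below, using `np.polyfit` to fit the polynomial. The function name `global_fit_rmse` is hypothetical, and fitting all channels jointly on intensities in [0, 1] is an assumption, since the text does not specify how channels are handled.

```python
import numpy as np

def global_fit_rmse(x, y, degree=4):
    """RMSE between y and the best-fit polynomial of x (intensities in [0, 1])."""
    xs, ys = x.ravel(), y.ravel()
    coeffs = np.polyfit(xs, ys, degree)
    residual = ys - np.polyval(coeffs, xs)
    return float(np.sqrt(np.mean(residual ** 2)))

rng = np.random.default_rng(0)
x = rng.random((64, 64))

# A purely global transformation is captured by a single curve.
y_global = x ** 2
# A spatially varying transformation is not: each half uses a different curve.
y_local = np.where(np.arange(64)[None, :] < 32, np.sqrt(x), x ** 2)

rmse_global = global_fit_rmse(x, y_global)
rmse_local = global_fit_rmse(x, y_local)
```

In this toy case the global pair yields a near-zero RMSE, while the pair with two spatially separated curves leaves a large residual that no single transfer function can explain.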

V. EXPERIMENTS
For the following experiments, we use the LTMNet architecture shown in Fig. 4 that contains six convolutional layers and produces a 3D grid of 8 × 8 × 3 tone curves of size 256.

A. EVALUATION ON THE LTM DATASET
We evaluated our method on our LTM dataset because it contains local tone mapping only, which avoids the effect of other photofinishing operations that exist in other datasets. We compare our method against state-of-the-art (SOTA) image enhancement methods: CURL [2], Zero-DCE [48], HDRNet [25], CSRNet [1], and Pix2Pix [64]. We use the following metrics: peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [65], and learned perceptual image patch similarity (LPIPS) [54]. Table 1 shows the performance of our method and SOTA methods on the LTM dataset. Our method outperforms all SOTA methods in all metrics. Fig. 8 shows some visual comparisons. Our LTMNet produces visually enhanced images with vivid local contrast while avoiding structural and color artifacts. CURL and CSRNet seem to be limited at enhancing local contrast, while Pix2Pix and HDRNet are prone to structural or color degradations. Zero-DCE has a relatively low performance because it uses non-reference loss functions. Additional results are provided in Appendix B.

FIGURE 8. Visual comparison of our LTMNet against SOTA methods: CURL [2], HDRNet [25], CSRNet [1], Pix2Pix [64], and Zero-DCE [48], on our LTM dataset. Our LTMNet produces visually enhanced results while avoiding structural and color artifacts.
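For reference, PSNR, the first of the metrics listed above, can be computed as in the following sketch. It assumes images scaled to [0, 1] (so the peak value is 1); the function name `psnr` is chosen for this example.

```python
import numpy as np

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with maximum value `peak`."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

target = np.zeros((16, 16, 3))
pred = np.full((16, 16, 3), 0.1)   # uniform error of 0.1, so MSE = 0.01
value = psnr(pred, target)         # approximately 20 dB
```

Higher PSNR means lower mean squared error; SSIM and LPIPS complement it by measuring structural and learned perceptual similarity, respectively.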

B. EVALUATION ON THE HDR+ DATASET
To verify our method's capability of modeling generic local tone mapping effects in addition to the CLAHE algorithm, we also evaluated our method on HDR+ against SOTA methods. The quantitative results are shown in Table 1. Visual comparisons are shown in Fig. 10. Our LTMNet method yields comparable results to SOTA. LTMNet does not outperform SOTA methods on the HDR+ dataset because this dataset contains more processing beyond local tone mapping, such as sharpening and hue/saturation adjustments, while our LTMNet is limited to using local tone curves only. Also, other methods use pixel-wise processing, which is more expressive in modeling fine local detail enhancement, such as sharpening. To see if pixel-wise processing can help our LTMNet in modeling the additional photofinishing in HDR+ images, we append a small residual network with 1.6K parameters to LTMNet, naming this model "LTMNet + Res." This model closes the gap with SOTA methods in terms of SSIM and LPIPS, while boosting PSNR by a large margin. This indicates the effectiveness of pixel-wise processing in modeling additional photofinishing operations. Additional results are provided in Appendix B.

C. CHOICE OF GRID SIZE
To select the best size for the tone curve grid, we evaluated multiple grid sizes on our LTM dataset, as shown in Table 2. Grid size 8 × 8 produces the best results for all metrics.

D. CHOICE OF CONTROL POINTS
We performed experiments with smaller numbers of control points for the LUTs, as shown in Table 5. Control points are interpolated with monotone cubic splines. Fewer control points yield lower performance but use fewer parameters.

E. TRAINING AND HYPERPARAMETER SETTINGS
For both experiments on the HDR+ dataset and our LTM dataset, 1,400 images are used for training, 100 used for validation, and 500 used for testing. All visual results in the paper are sampled from the test set. For fairness of comparison, all SOTA methods are re-trained on the two datasets. For training our model, we use Adam [66] as the optimizer with a learning rate of 0.001. Models are trained for 150 epochs and 250 epochs for the LTM dataset and HDR+ dataset, respectively, both with a batch size of 20. We augment the input with random flip to generalize the models for inputs of different orientations.

F. INTERACTIVE EDITING OF TONE CURVES
In addition to automatic local tone mapping, our method can be used in an interactive setting and integrated with photo-editing software. Users can apply our method to produce an automatically enhanced photo, and then manually enhance a local region of the image by modifying the local tone curve corresponding to that region. Fig. 9 shows a use case for integrating our method with interactive editing of local tone curves. We also prepared a video to show local tone curve editing: link.
After producing the locally tone-mapped image using our method, the user selects a point $\hat{y}(i, j)$ on the image to modify the patch containing the point. The tone curve applied at location $(i, j)$ is a weighted average of tone curves predicted at its closest patch centers. Suppose $(i, j)$ is located in the center region; the tone curve applied at $(i, j)$ can be computed as follows:

$$\bar{t} = \sum_{k=1}^{4} w_{ijk}\, t_k, \tag{9}$$

where $t_k$ is one of the four component tone curves in Equation 4, and $w_{ijk}$ represents the weight given to a component tone curve at location $(i, j)$, which decreases with the distance of its patch center from point $(i, j)$. For example, $w_{ij1} = \frac{(i_2 - i)(j_2 - j)}{(i_2 - i_1)(j_2 - j_1)}$, which corresponds to the first weight term in Equation 4. Similarly, the interpolated tone curves at the border regions can be inferred using Equation 5.
Afterwards, the user defines a target tone curve at location $(i, j)$. Step 2 in Fig. 9 presents three possible options: (1) selecting from a set of preset tone curves, (2) using the cumulative distribution function of the selected region, and (3) using a self-defined LUT. The target tone curve $t^*$ can be treated as a scaled version of $\bar{t}$:

$$t^* = s \odot \bar{t}, \tag{10}$$

where elements of $s$ are the scaling factors transforming each entry in $\bar{t}$ to the target tone curve entry. Given a target tone curve, the scaling factors can be computed by element-wise division of the target tone curve $t^*$ by the original tone curve $\bar{t}$. Next, the tone curves predicted at the closest patch centers, namely $t_k$, $k \in \{1, \dots, 4\}$, are modified so that their interpolated result matches exactly with the target tone curve. This can be achieved by simply multiplying the component tone curves by the same scaling factors, such that:

$$\tilde{t}_k = s \odot t_k, \quad k \in \{1, \dots, 4\}. \tag{11}$$

Finally, the edited image $\tilde{y}$ is obtained as $\mathrm{Interp}(x, \tilde{T})$, where the tone curve set $\tilde{T}$ contains the edited tone curves.

FIGURE 10. Visual comparison of our LTMNet against SOTA methods: CURL [2], HDRNet [25], CSRNet [1], Pix2Pix [64], and Zero-DCE [48], on the HDR+ [60] dataset. Our LTMNet produces visually enhanced results while avoiding structural and color artifacts.
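The editing step can be checked numerically: because interpolation is linear in the component curves, scaling every component by the same element-wise factors scales their blend by those factors too. The sketch below demonstrates this for one selected point. The function name `edit_tone_curves`, the example curves and weights, and the small `eps` guard against division by zero are assumptions of this illustration.

```python
import numpy as np

def edit_tone_curves(curves, weights, target, eps=1e-8):
    """Scale the component curves so that their blend equals the target curve.

    curves:  (K, L) component tone curves (K = 4 in the center region).
    weights: (K,) interpolation weights at the selected point, summing to 1.
    target:  (L,) target tone curve chosen by the user.
    """
    blended = weights @ curves                   # curve currently applied at the point
    scale = target / np.maximum(blended, eps)    # element-wise scaling factors
    return curves * scale                        # same factors for every component

L = 256
base = np.linspace(0.01, 1.0, L)
curves = np.stack([base ** g for g in (0.6, 0.8, 1.2, 2.2)])   # four example curves
weights = np.array([0.4, 0.3, 0.2, 0.1])                       # bilinear weights at (i, j)
target = base ** 0.5                                           # user-selected target

edited = edit_tone_curves(curves, weights, target)
reblended = weights @ edited      # matches the target curve up to rounding
```

Only the curves of the patches nearest to the selected point change, so the edit stays local while still blending smoothly into neighboring regions through the usual interpolation.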

VI. EVALUATION ON MIT-ADOBE FiveK DATASET
In Section IV, we have discussed the lack of local processing in the MIT-Adobe FiveK [59] dataset. We provide additional evidence by comparing our local tone mapping model with a global tone mapping model trained on MIT-Adobe FiveK. Results are shown in Table 3. The local tone mapping (LTM) model has grid size 8 × 8. The global tone mapping (GTM) model has grid size 1 × 1. The GTM+LTM model predicts both an 8 × 8 grid of local tone curves and a global tone curve, with the local tone curves applied to the globally tone-mapped image. The quantitative results indicate that performance increases when the model architecture enables more global tone mapping effects, which suggests that the MIT-Adobe FiveK dataset is better modeled by global, rather than local, transformations, and thus is unsuitable for our local tone mapping task.
Although the dataset is not well suited to our task, for completeness we still evaluated on MIT-Adobe FiveK; the results are shown in Table 4. We used 1,000/100/500 images for training/validation/testing. LTMNet with a 1 × 1 grid (i.e., GTM) outperforms other methods; with an 8 × 8 grid (i.e., LTM), performance on PSNR is worse because, as mentioned, MIT-Adobe contains mostly GTM images. However, there is no significant decrease in perceptual metrics (a 0.01 difference in SSIM), which indicates that a finer grid can still model global operations.

FIGURE 14. Example of all 15 versions of an image enhanced by CLAHE [9]. CLAHE requires careful parameter tuning and not all parameter combinations produce high-quality results. We use a non-reference metric, NIMA [63], to automatically select the CLAHE parameters that give the most visually pleasing version of an enhanced image. The version with the highest NIMA score is highlighted in red. NIMA is able to select images without halo artifacts.

VII. LIMITATIONS AND FUTURE WORK
Our method can experience halo artifacts when the input image has a foreground object that is transformed by a drastically different function from the background scene. This is a result of pixels close to the object boundaries receiving influence from two different transfer functions. As illustrated in Fig. 11, background pixels close to the flower are interpolated by both the tone curves predicted for the gray background and the tone curves predicted for the yellow flower. Influence from the flower's tone curves results in a dark halo. This is a limitation of CLAHE [9] as well, which uses the same interpolation scheme. This issue may be addressed by semantic segmentation, which separates prominent objects in a scene so that each segment has its own transformation functions, unaffected by neighboring segments. Furthermore, we may learn different grid sizes for each segment, so that homogeneous segments are assigned a smaller grid to reduce the amount of spatial variation, and textured segments are assigned a larger grid size to leverage more expressive local enhancements.
Another potential future direction is to condition our network on tunable parameters to allow both automatic enhancements and manual tuning. Although the output images from our method can be adjusted by post-editing the predicted tone curves, our network itself is fully automatic. This poses challenges if the user would like to make customized adaptations to the neural network based on personal preferences, for example, tuning a few parameters so that the network consistently produces different styles for different scene categories. We would like to investigate strategies that tackle these challenges in our future work.

VIII. CONCLUSION
We proposed LTMNet, a method for local image enhancement that learns a grid of local tone curves. LTMNet enhances local image regions more effectively than global transformations and offers higher interpretability than pixel-wise methods. LTMNet outperforms existing methods in local tone mapping and achieves competitive results in modeling additional photofinishing operations. In addition, we proposed a new dataset representative of local tone mapping (LTM dataset) that, unlike existing datasets, represents only global and local tone mapping. An advantage of our method is that it can be easily integrated into both camera ISPs and user-interactive image editing tools.

APPENDIX A NIMA FOR LTM DATASET PREPARATION
We performed a user study to compare NIMA [63] with three other commonly used non-reference metrics: BRISQUE [67], NIQE [68], and PIQE [69]. We randomly selected 100 images from our dataset. For each image, we produced 15 CLAHE versions and selected the best version using all four metrics (NIMA, BRISQUE, NIQE, and PIQE). We asked 40 users to select the image they prefer from the four "best" versions. The average user preference (i.e., the percentage of time one image version is preferred over the others) is shown in Table 6. The results are statistically significant; applying the ANOVA test, we obtain $F_{3,39} = 10.33$ and $p < 0.0001$. The results indicate that NIMA aligns with user preference better than other metrics. The NIMA paper and other works [70], [71] confirm that NIMA aligns with perceptual quality well. We have obtained informed consent for the user study.

APPENDIX B ADDITIONAL RESULTS
Figs. 12 and 13 showcase more qualitative comparisons between our method and SOTA methods, on our LTM dataset and the HDR+ dataset [60], respectively.