Estimation of the canopy height model from multispectral satellite imagery with convolutional neural networks

The canopy height model (CHM) is a representation of the height of the top of vegetation from the surrounding ground level. It is crucial for the extraction of various forest characteristics, for instance, timber stock estimations and forest growth measurements. There are different ways of obtaining the vegetation height, such as through ground-based observations or the interpretation of remote sensing images. The severe downside of field measurement is its cost and acquisition difficulty. Therefore, utilizing remote sensing data is, in many cases, preferable. The enormous advances in computer vision during the previous decades have provided various methods of satellite imagery analysis. In this work, we developed the canopy height evaluation workflow using only RGB and NIR (near-infrared) bands of a very high spatial resolution (investigated on WorldView-2 satellite bands). Leveraging typical data from airplane-based LiDAR (Light Detection and Ranging), we trained a deep neural network to predict the vegetation height. The provided approach is less expensive than the commonly used drone measurements, and the predictions have a higher spatial resolution (less than 5 m) than the vast majority of studies using satellite data (usually more than 30 m). The experiments, which were conducted in Russian boreal forests, demonstrated a strong correlation between the prediction and LiDAR-derived measurements. Moreover, we tested the generated CHM as a supplementary feature in the species classification task. Among different input data combinations and training approaches, we achieved the mean absolute error equal to 2.4 m using U-Net with Inception-ResNet-v2 encoder, high-resolution RGB image, near-infrared band, and ArcticDEM. The obtained results show promising opportunities for advanced forestry analysis and management. We also developed the easy-to-use open-access solution for solving these tasks based on the approaches discussed in the study cloud-free composite orthophotomap provided by mapbox via tile-based map service.

manned Aerial Vehicle (UAV)-based approaches; and 3) satellite remote sensing data. All aforementioned approaches have advantages and limitations connected with acquisition time and cost (Fig 1). The first data source is forest inventory documents, usually treated as field-based observations. They are available for some regions and useful in addressing forest owners', governmental, and independent organizations' needs [14]. However, these data do not cover all regions of practical interest [15]. Furthermore, such data actualization is time-consuming and cost intensive in difficult-to-access areas. An alternative approach is to use remote sensing data.
The remote sensing approach draws on both active and passive sensing technologies. During active sensing such as Light Detection and Ranging (LiDAR) measurements, the sensor measures time between the emitted light time and its return time to estimate the distance of an object (a surface). This technology allows digital elevation models to be produced. Passive remote sensing measures radiation that is emitted or reflected by the object in different spectral wavelengths. Spectral bands obtained this way can be used for future analysis and to calculate the height value in landcover extraction.
A common approach builds on UAV assessment. A UAV with LiDAR sensors is a powerful tool for forest height estimation. It obtains canopy height data with minor errors, meeting the precision requirements for almost all forestry tasks. However, such equipment is more expensive than a spectral aerial camera system, thus there remains the challenge of obtaining the same information using low-cost methods [16]. Many works use LiDAR data as a reference and aim to find a cheaper height data source. A detailed review of the alternative approaches to LiDAR sensing is presented in [17], [18]. Thus, most of the current studies in the sphere of canopy height estimation use UAVs with optical aerial systems [19]- [24]. Despite its advantages over fieldbased observations, when large regions have to be processed, the labor involved in working with vast and remote areas is problematic. Satellite data address this issue, providing a cheaper option for forest monitoring [17]. Point cloud data that is useful for estimation of the canopy height can also be derived from satellite imagery using photogrammetry approach. The comparison of such photogrammetry approach and high-density LiDAR measurements is presented in [25], where authors showed photogrammetry method is slightly less accurate (difference in R 2 is about 0.07) compare to the LiDAR method for height measurements of the forest region in New Zealand. The important benefit of the photogrammetry method is that it could provide information for the larger scale compare to the LiDAR method, however it requires special high resolution imagery which is not always available for the particular region. The other limitation of the photogrammetric method is that it is able to characterise only the upper canopy and is not able to perform vertical characterisation of the forest such as can be done by laser scanning. The comparison of the photogremmetry obtained by unmanned aerial systems and areal laser scanning for the forest inventory in Oregon was presented in [26], where authors stated that photogrammetry is slightly less accurate compare to laser scanning (difference in R 2 for height estimation is about 0.15). However photogrammetry is easier to integrate to existing forest monitoring methodologies.
Our work is focused on using satellite images for CHM estimation as it is more preferable data source than LiDAR derived measurements in terms of cost and spatial coverage. Neural networks allows us to conduct image processing automatically. We set up the hypothesis that neural networks can extract significant spatial features from very high-resolution RGB images of 1 m to improve performance of CHM estimation. It was expected that developing a satellite-based solution compatible with a high-resolution UAV approach would further enable the prediction of advanced forest characteristics. Thus, this study's objectives and contributions were: 1) to develop a method for vegetation height estimation utilizing deep neural networks and different configurations of input data varying spectral compound (reducing to Blue, Green, and Red), spatial resolution and by adding topography features; 2) to assess the generated height map, conducting a further investigation into the classification of dominant forest species (conifer and deciduous). For this, multispectral imagery was incorporated with height data; 3) to create the software toolchain to train a neural network to predict CHM using single satellite non-stereo imagery. 4) to develop the easy-to-use open-access solution for the community which is now available by the following resource [27]. The underlying code will be shared: https://github.com/LanaLana/forest_height.

II. RELATED WORK
For canopy height estimation studies, spectral satellite imagery can be distinguished by the following characteristics: spatial resolution, spectral range, and availability. The majority of works use a spatial resolution much higher than 20 m to tackle the canopy height evaluation problem. This approach is justified for particular tasks when large-scale maps are produced. In [28], they conducted a 30 m spatial resolution canopy height evaluation with Landsat imagery and showed the dynamics over 29 years in the Darwin region. In [29], they employed Landsat 7 and 8 time-series data (30 m spatial resolution) to estimate tree heights in Africa. GLAS (Geoscience Laser Altimeter System) height measurements from the ICESat satellite were used as reference data (60 − 70 m spatial resolution). The same height data source was mentioned in [30]. In [31], they used Sentinel-2 images that were resampled to a 20 m pixel size to predict Mangrove forest canopy height. Other studies involving Sentinel-2 data are reported in [32]- [34]. In [35], they assessed SAR images from ALOS PALSAR, and upsampled them from 30 to 5 m as a LiDAR elevation model. The cases of very high-resolution FIGURE 1: Cost comparison of different forest height measurement approaches (diagram is not to scale) (3.7 m) images from the Planet Dove implementation are presented in [36]. However, the target height map resolution for that study was 1 hectare. Very high-resolution (2 m) WorldView-2 satellite imagery was used in [37], but the working resolution was adjusted to 5 m.
Another important data characteristic is the spectral range and the number of channels. A wider wavelength range is available for satellites with low spatial resolutions (Landsat, Sentinel) than for some very high-resolution satellites. For instance, Planet (3-5 m resolution) and GeoEye (2 m resolution) satellites have Blue, Green, Red, and NIR bands; RapidEye (6 m resolution) has Red Edge. The GeoEye panchromatic channel has a 0.4 m resolution and allows RGB to be enhanced. WorldView-2 provides eight spectral bands with a resolution of 2 m. An additional source of very high remote sensing data is Basemaps, with RGB bands such as those provided by Maxar one [38]. Nevertheless, the majority of works focus on using only the wide multispectral range (more than eight bands), sacrificing the spatial resolution. From the aforementioned satellite-based studies, the minimal number of spectral bands (Blue, Green, Red, NIR) was only considered in [36]. However, the goal of the work was the creation of a large-scale country wide map, so the spatial resolution of the analysis was 1 hectare. Therefore, the issue of minimizing the number of required satellite bands for forest height estimation has not yet been well studied.
Satellite data are frequently accompanied with data of other sensing techniques. In [39], they combined four Kompsat-3 multispectral bands and PALSAR-1 radar images resampled into 2.8 m to train a neural network. Few studies have implemented this into self-contained spectral satellite data [33], [40]- [42]. However, the spatial resolution of the Sentinel and Landsat images (lower than 10 m) considered in these studies is not high enough to extract small details on the surface. Thus, the satellite spatial resolution of 1-m per pixel is still beyond the scope of the majority of studies.
Data availability is also a significant aspect of implementation in practice. An image's properties affect its cost. Sentinel and Landsat images are open source, while WorldView, Planet, and RapidEye are commercial and contain a greater amount of the spatial information required in applied tasks.
After data acquisition, the obvious question of data processing arises. Computer vision algorithms enable highquality automatic satellite imagery analysis. Such methods are usually based on key feature extraction from input spectral bands to describe some object, which can be a pixel or set of pixels. Then, the algorithm aims to ascribe a label (for classification tasks) or a value (for regression tasks) to the object. The processing methods for expansive forestry areas using satellite images are classical machine learning models, such as Random Forest [43] or Support Vector Machine [44]. Their main advantages are simplicity and straightforward interpretation in the case of linear models. Generally, spatial characteristics are not taken into consideration, and an algorithm relies on spectral values or precalculated vegetation indices. In [28], a combination of 14 vegetation indices and spectral bands were used in the Random Forest model to predict the canopy height using Landsat images. Moreover, the strong correlation between the normalized difference vegetation index (NDVI) and canopy height has been well emphasized in aerial photography [16], [35]. Despite the importance of spectral data, other vital features can also be processed. For instance, there is a strong correlation between forest height and canopy width, as discussed in [32], in which the canopy volume was estimated using only the crown projected area and the crown diameter combined in a particular regression equation. The deep neural network-based approach is more capable than classical machine learning methods for the following reasons: the texture and spatial features extracted by the neural networks include sufficient VOLUME 1, 2021 information about landcover; it not only handles spectral values, but also the aforementioned spatial characteristics of an object available, for instance, in UAV-based tasks [45].
Tree height is correlated with tree diameter for each forest species [46]. In [47], tree height was estimated from the exponential equation, including diameter at breast height value. The crown form depends on the tree species; accompanied by the crown diameter, it can provide important features for a neural network. Tree height can also be derived from spectral information only, as it depicts meaningful vegetation characteristics such as chlorophyll content [48].

A. STUDY AREA
The study area is located in the Arkhangelsk region of northern European Russia with coordinates between 45 • 16 ′ and 45 • 89 ′ longitude and between 61 • 31 ′ and 61 • 57 ′ latitude (Fig 2). The investigated territory belongs to the middle boreal zone. The region's climate is humid, with the warmest month being July when the temperature rises to 17 • C. The topography is flat, with a height difference in a range between 170 and 215 m above sea level [49]. The main species present in the region are pine, spruce, aspen, and birch.

B. REFERENCE DATA
We used forest inventory and LiDAR-derived data covering the area of about 50 thousand hectares. LiDAR measurements were continued in the end of August of 2017 and 2018 by Leica ALS 80 HP scanner. Then the Canopy Height Model (CHM) with a 1 m spatial resolution was generated from LiDAR-derived point clouds.
The inventory data were collected in accordance with the official Russian inventory regulation in 2018 and 2019 [50]. It included such characteristics as canopy height, species percentage distribution, and age. This data was organized as a set of individual stand coordinates with appropriate characteristics based on the assumption that the crop was homogeneous. A species class markup was used in additional experiments presented as a raster map of dominant conifer and deciduous classes. The statistics of this data are shown in Table 3.
However, the shift in geo-referencing between the satellite data and LiDAR-derived measurements makes the target at 1 m spatial resolution less useful. As the typical shift lies between 2 and 3 m, the high-resolution CHM will show erroneous value for the particular point in the satellite image. This forced us to downsample the height map to 5 m to make the target value for each point represent the mean value of the area including the true location.
The distribution of the height over the study region is shown in the Figure 4. Although, height is usually represented as a continues value, height categories are essential for practical use in power lines services. Height classes are often required instead of continues values for decision making within protected areas [51]. The reason is that different categories (dangerous vegetation overgrowing) have different  importance and estimation in particular categories have to be more precise to reduce accidents on power lines corridors.

C. THE TEST REGION SELECTION
The training and test area was from the same satellite images, but without overlapping. The test region was manually chosen to include a diversity of height classes. The total test area was equal to 13% of the initial dataset. The spatial location is presented in the

D. SATELLITE DATA
We used Sentinel-2 and WorldView-2 satellite imagery to check the high and very high spatial resolution data sources.
The boreal location of the study area resulted in a lack of cloudless images. All images were from the boreal growing season (from May to August). Image IDs and dates are presented in tables 1, 2. WorldView imagery was downloaded from GBDX [52]. For the height estimation task, we used Red, Green, Blue, and Near-Infrared bands, while for the species classification problem, all eight bands were considered. The resolution of the WorldView images was 1, 2, or 5 m depending on the experiment statement. For CNN-based tasks, image values in the range from 0 to 1 are usually used [53], [54]. Therefore, pixel values were brought into a range between 0 and 1 using Equation 3. For the spatial resolution adjustment, the pansharpening procedure was implemented using a panchromatic band which was obtained in the imagery bundle with multispectral data from the data vendor. We did not consider any predefined cloud mask for WorldView. However, during training, pixels with particular properties were eliminated from consideration (see subsection III-G). This allowed us to clean the dataset from erroneous labels.
where mean, std are the mean and standard deviation of the image. In equations 1, 2, we calculate m and M (minimum and maximum of the preserved dynamic range). The standardization of the imagery according to the whole dataset statistics proves profitable for the neural network training compared to a simple scaling of the entire value range [55]. For the additional analysis, freely available Sentinel data were downloaded in L1C format from EarthExplorer USGS [56] and preprocessed using Sen2Cor [57] to an L2A format. Pixel values were brought into a range between 0 and 1 using Equation 3. We used the B02, B03, B04, B05, B06, B07, B08, B11, B12, and B8A bands, which were adjusted to a 10 m resolution. 60m bands were discarded as they are more affected by atmosphere than the land surface. 20 to 10 m bands were upsampled with the nearest neighbor method to avoid initial data corruption (they can be unambiguously downsampled back to exactly initial 20m data).
Both for Sentinel and WorldView, each image covered the entire study area, and images were considered separately without any spatial averaging (the same as in [58]).
As supplementary features, we used a freely available high-resolution digital elevation model (DEM), ArcticDEM [59], covering boreal regions (Fig 5). It provides a resolution of 2 m. For some experiments, the resolution was upsampled to 1 m by interpolation (see the section III-E).
Both the satellite and LiDAR data were co-registered through geo-referencing, the same as in [37].
We used cloud-free composite orthophotomap provided by mapbox [60] via tile-based map service as an example of free-available high-resolution RGB data-source. This image covered the same test region and was used just for the developed model assessment. We chose this data-source, because model implementation without expensive input data demands is crucial for open-access platform that can handle a more available images. The spatial resolution was 1 m per pixel, and the preprocessing was the same as for WorldView data.

E. FEATURE SELECTION FOR DEEP NEURAL NETWORK
Convolutional neural networks take a tensor as an input. The feature selection to create this tensor is fundamental. To find the best input data representation for the CHM estimation problem, we established a range of experiments. Firstly, we conducted a study with the WorldView bands.
The workflow of our research is shown in the Fig 6. For each experiment, the RGB bands were used constantly. The variable part concerned the resolution changing and the supplementary features (NIR and ArcticDEM), which were combined with the RGB channel in a single input tensor for the neural network model. We studied the original (2 m),  pansharpened (1 m), and downsampled (5 m) images. For the experiments with the 1 m resolution, bands were upsampled to the target resolution by bilinear interpolation. We used bilinear interpolation for image resampling to avoid aliasing emerging in nearest neighbor and halo inherent to higherorder interpolation methods, which are more problematic for neural networks than bilinear smoothing. A reference CHM was used during the training procedure to estimate the model's error. To minimize data mismatches, reference and predicted height maps were intersected with the forest cover mask before the loss function calculation stage. Therefore, we conducted the following experiments for the WorldView images: 1) RGB original resolution 2 m; 2) RGB pansharpened to 1 m; 3) RGB pansharpened to 1 m + ArcticDEM upsampled to 2 m; 4) RGB + NIR original resolution 2 m; 5) RGB + NIR original resolution 2 m + ArcticDEM upsampled to 2 m; 6) RGB pansharpened to 1 m + NIR upsampled to 1 m; 7) RGB pansharpened to 1 m + NIR upsampled to 1 m + ArcticDEM upsampled to 1 m; 8) RGB downsampled to 5 m resolution.
To assess the importance and restriction of the spatial resolution, we also checked the model's performance for the WorldView RGB bands downsampled to 5 m.
We conducted the following study to compare model's performance for high-resolution RGB images and less detailed but richer in terms of the spectral information Sentinel data with 10 bands, upsampled to 10 m. There were two experiments: 1) Multispectral bands; 2) Multispectral bands + ArcticDEM downsampled to 10 m. VOLUME 1, 2021

F. STRATEGIES FOR HEIGHT PREDICTION AND EVALUATION METRICS
Regression may naturally lead to richer (continuous) estimations for practical implementations than rigid class-based output maps. Therefore, we considered both regression and classification tasks for a comparative analysis. The regression problem statement means that we ascribe each pixel with a particular value corresponding to the height parameter. Then, the loss can be estimated as an error between real height value (CHM value) and the predicted value. The considered metrics are root mean square error (RMSE), mean absolute error (MAE), and mean bias error (MBE): where y is the mean target value among all pixels (mean CHM value),ŷ i is the predicted value of the i th pixel, y i is the target value of the i th pixel (CHM value), and n is the pixel number. Test regions results were computed for all images in WorldView or Sentinel datasets. Using the same reference data we can also solve classification task. When we formalized the problem as a classification task, we divided the continuous values of height into various classes. The choice of such a division often depends on an applied task's demands. For our study, we chose intervals 0 − 4, 4 − 10, 10 − 20, and > 20 m. We rely on the amount of classes and intervals of height that described [61]. We slightly shifted the boundaries of the height intervals, described in [61] according to the suggestion inventory data provider from Arkhangelsk region. After splitting the continuous dataset to the aforementioned classes we can compute the portion of the wrong estimated pixel classes and use F1-score [62] for evaluation of the trained classification models.
where TP denotes true positive, FP denotes false positive, and FN denotes false negative. The above formulas were applied in a per-class basis. To compute results, test regions from all images were used.
This refers to the area assessment, while in terms of regression, we strove to optimize each pixel value. Therefore, these two approaches can lead to a different local optimum. For example, if we split heights between 0 and 30 m into the following buckets: 0 − 4, 4 − 10, 10 − 20, and 20 − 30, then it is not important that we do not ascribe the exact values but some value from the correct bucket to some pixels. Then, it is clear that regression predictions can also be represented in terms of classification.
For the classification task, the multiclass weighted crossentropy loss function was used to make the predictions more balanced even for classes with fewer representatives. The same approach was implemented for the regression task. We compared the simple RMSE loss (Equation 10) and the weighted RMSE loss (Equation 11). For heights with fewer representatives, the penalty for the wrong prediction was increased by predefined weights. The weights were inversely proportional to the height distribution. There was also a threshold for the height when the weight was equal to 1 (no extra penalty). The range of weights and the threshold were chosen empirically, as shown in the Figure 7.
whereŷ i is the predicted value of the i th pixel, y i is the target value of the i th pixel, N is the number of relevant (nonmasked) pixels, weights(y i ) is the extra penalty depending on the target value of the i th pixel.
We needed to manage the temporal mismatch (such as logging) between LiDAR scanning and satellite imagery. To do so, we used two heuristics. The first one was that pixels labeled as forest by the forest cover model but with a height of less than 1 m were considered to be a forest logging. The forest cover model classifies pixels covered with clouds as non-forested. Therefore, the second heuristic was that pixels not labeled as forest but with CHM > 5 m were considered clouds. Reference and predicted height values for these pixels were not used in the loss function calculation during the training procedure (they were treated as masked). Thus, the mask of relevant pixels was defined by the following equations: height_mask = (logging == 0) * (cloud == 0) (14) where forest mask was obtained by the neural network model trained to predict forest cover with a high accuracy, especially in terms of small details using RGB bands. The model was implemented in the GeoAlert service [63].

G. EXPERIMENTAL SETTINGS
For all the neural network models, training was performed on the Skoltech supercomputer Zhores [64], using Keras [65] with a Tensorflow [66] backend. The source code containing the implementation details is available in the aforementioned repository.
Both for the regression and classification task, U-Net [67] with an Inception-ResNet-v2 [68] encoder was used (Figure 8). U-Net is a popular CNN architecture in the remote sensing domain which has shown high performance in various problems [69], [70]. The upsampling layers follow the U-Net's downsampling layers. Skip connections between layers allow the convolutional neural network to manipulate vital information at large spatial scales avoiding losing local information. Skip connections also facilitate gradient flow during the training procedure that was highlighted in [71]. We substituted the original VGG encoder with a ResNetbased one as it has shown high results in various works [72]. Residual connections in the Inception-ResNet-v2 encoder support shortcuts leading to better prediction quality [73] and enabling substantial simplification of the Inception blocks. We used the original U-Net decoder, where every step consists of an upsampling of the feature map followed by a 2x2 convolution. That halves the number of feature channels. The expansive path also includes concatenation with the cropped feature map from the contracting path and two 3x3 convolutions followed by a ReLU. The total number of parameters in the neural network is 62M where the encoder includes 54M. The decoder has 5 blocks, while the encoder part consist of 8 blocks. The models' implementation was based on opensource library [74].
Each model was trained 25 epochs for 200 training and 100 validation steps with a decreasing learning rate from 0.001 using RMSprop [75] optimizer and early stopping with patience 5 epochs. For the classification task as an activation function for the last layer, the softmax function was chosen. As an activation function for the last layer's regression model, we used linear function.
For all models, geometrical augmentation was implemented. This involves random rotations, and vertical and horizontal flipping. For models using the RGB channels only, we implemented color transformation. For this task, the albumentations library [76] was used.

H. CLASSICAL MACHINE LEARNING METHODS
We also conducted experiments with classical machine learning methods to compare different approaches in canopy height estimation. Two approaches were considered: Random Forest (RF) [43] and Gradient Boosting (GB) [77]. These approaches are widely used in the remote sensing domain due to relatively high performance in various tasks. For the RF method, we implemented 300 decision trees with maximum depth equal to 8, as these parameters shown the best quality. We also compared it to decision tree numbers 100, 200, 400, 500, 600, and maximum depth values equal to 4, 5, 6, 7, 8, 9, 10. In the GB method the parameters were 200 estimators with learning rate equal to 0.1, and maximum depth equal to 7, that were also set empirically (the same grid was considered to choose number of trees and maximum depth as in the RF case). For both two methods the implementation was used from scikit-learn [78]. A proper feature space is essential for machine learning algorithms, namely in classical one. The features were selected according to the study described in [79] as more relevant for vegetation properties estimation from Sentinel images. Therefore, the following vegetation indices were computed and accomplished initial multispectral bands resulting in Sentinel-derived features: the Normalized Difference Vegetation Index (NDVI), the Simple Ratio Index (SRI), the red-edge Normalized Difference Vegetation Index (RENDVI), and the Anthocyanin Reflectance Index 1 (ARI1). Thus, each pixel was considered as an input VOLUME 1, 2021

I. FOREST-TYPE CLASSIFICATION MODEL
To estimate the quality of the developed models, we considered a forest-type classification problem. To train the neural network model to predict two species (conifer and deciduous), we leveraged both WorldView and Sentinel imagery. The problem was defined as the per-pixel semantic segmentation task. Forest inventory characteristics were used as reference data. Eight WorldView bands were intersected with the forest mask. Both for the Sentinel and WorldView imagery, a height map or age map was used as an additional channel. This was done to make the model more robust in terms of species diversity resulting from different forest ages. Therefore, the neural network input was formed of 10 bands. As mentioned above, there are two familiar sources of height values: LiDAR-derivied data and forest inventory characteristics. The difference is in the data representation. Forest inventory characteristics establish height for each individual stand (small region joined according to some similar value of features such as tree species, age, density). Although real height within each stand can differ for each pixel, all pixels corresponding to a particular stand have the same height value. Thus, for this experiment we used both inventory-and LiDAR-derived height data.
We compared model predictions according to the next strategies of data leveraging: 1) just multispectral data; 2) multispectral data and CHM data; 3) multispectral data and inventory height data; 4) multispectral data and inventory age data; 5) multispectral and artificially generated CHM by the best model height. For these experiments, we trained a smaller U-Net model with the Resnet-34 encoder [80]. Individual stands from the dataset were randomly split into a training and testing set shown in Table 3. During training, the cross-entropy loss function was computed in a per-pixel manner. For testing, the F1-score was estimated for each individual stand. The predicted class for the individual stand was defined as a dominant class among all pixels within the stand. Each forest classification model was trained 25 epochs for 200 training and 100 validation steps with a decreasing learning rate from 0.001 using RMSprop [75] optimizer and early stopping with patience 5 epochs. The activation function for the last layer was soft-max.

IV. RESULTS
The achieved metrics for the regression models are shown in Table 4. The best quality predictions, using WorldView imagery with MAE 2.47 m (Exp. 9), were achieved with a combination of Red, Green, Blue pansharpened bands, the NIR band, and the supplementary ArcticDEM raster with resolution upsampled to 1 m (Fig 9). The smaller region is presented in the Fig 10. For the Sentinel imagery, only two experimental modes were considered: with ArcticDEM and without ArcticDEM. For both the Sentinel and WorldView data, ArcticDEM usage allowed us to improve the prediction results (for Sentinel, the MAE improved from 4.1 to 3.9 m, and for WorldView, the MAE improved from 2.9 to 2.58 m). The pansharpening procedure also contributed to the final result, decreasing the error from 3.3 to 3.1 m (Exp. 1 and Exp. 2) for the WorldView RGB model. The NIR band usage demonstrated an error reduction from 2.9 to 2.58 m (Exp. 3 and Exp. 7). This effect is linked to vegetation condition, which is reflected by the NIR wavelength. Additional weights during the loss computation reduced the MAE from 2.58 to 2.47 m (Exp. 7 and Exp. 9).
In Table 5, we can see a comparison between the regression model and the classification model (Fig 12). These two    height between 4−10 m. This is mainly caused by the spatial distribution specificity of the class, and it often occurs due to the small regions between crowns and depends dramatically on the satellite and LiDAR geo-reference data. For this study, we used LiDAR data downsampled to 5 m, while the WorldView imagery resolution was 1 or 2 m. This allowed us to save high-resolution spatial surface characteristics.
To assess the importance of texture information, we experimented with RGB bands downsampled to 5 m. The MAE for this case was 4.4 m. This result is lower than that of the Sentinel images (4.1 m) and confirms that when we reduced the spectral information, we faced stricter demands for spatial resolution.
We checked the generated height in the forestry task of species classification. The results are presented in Table  6. The first objective of the experiment was to show how supplementary features can enhance the quality of applied tasks. Both LiDAR and inventory data helped to improve classification in comparison with simple multispectral data. The second goal was to show that the generated height is of sufficient quality to beat the base model using just satellite data. We did not intend to conduct a comparison between WorldView and Sentinel sources. For this reason, in both experiments, observation dates were not equal in the data used. The superior results for the Sentinel imagery, as compared with the WorldView data, were partially due to the wider dataset.
We also evaluated the regression model trained using RGB WorldView (pansharpened to 1 m resolution) image on a cloud-free composite orthophotomap provided by mapbox [60] and covering the same test area. For this experiment, the MAE was equal to 3.5, and the RMSE was 4.6. Prediction example is shown in Figure 11. This promising result allows cheaper CHM estimation for large areas using only highresolution free-available satellite RGB data.
We conducted experiments with classical machine learning algorithms using Sentinel-derived features to compare this approach to the proposed one, namely the CNN-based with high-resolution data. The best results were achieved for the GB algorithm and combination of Sentinel-derived features with ArcticDEM, where MAE was equal to 4 and RMSE was equal to 5.4 4. VOLUME 1, 2021

V. DISCUSSION
It is challenging to perform a fair comparison between the majority of studies related to height estimation for various reasons. The main reason is the difference in height distribution. For example, in [37], the predicted height was limited by 30 m, the spatial resolution was 5 m, and the final RMSE was 2.2 m. However, according to the presented plots, the mean value was less than 10 m, while in our study, it was about 15 m. In [28], the validation pixels range was defined as being from 0 to 25 m, with a mean value of 7 m. The model's spatial resolution was 30 m. For this height distribution, an RMSE from 2.3 to 4.1 m was achieved. In [31], they studied the ranges between 0 to 18 m and 3 to 15 m, by leveraging satellite (both spectral and radar) data with a 20 m resolution. In contrast to our work, field-based observations with a sampling frequency of the 10 largest trees per inventory plot were used as reference material. Therefore, the achieved result (an RMSE of 1.48 m) cannot be compared with our model's performance. Other obstacles impeding a fair comparison are the species diversity and regional conditions.
It is worth mentioning that although ArcticDEM provides a stable improvement in canopy height estimation (see table  4, Exp. 6 and Exp. 7), it does not cover central or southern regions. For these areas, more powerful base models need to be implemented, leveraging just satellite imagery.
We showed that high-resolution WorldView 3-bands images provided more significant features than low resolution Sentinel with 10 spectral bands (see table 4, Exp. 2 and Exp. 10). However, resolution adjustment from 2 m to 5 m for the same WorldView dataset leads to a loss of important information, in particular texture information (see table 4, Exp. 2 and Exp. 8). The aforementioned experiments, which was performed on the same dataset and using the same NNs with only one difference -the adjusted spatial resolution, showed that neural networks can extract additional spatial features from very high-resolution optical images of 1 m. Thus we experimentally confirmed the initial hypothesis that by using high resolution data it is possible to make CHM estimation more accurate.
Creating the model with only high-resolution RGB channels allows it to be implemented in more available satellite images, such as RGB mosaic basemaps (google, yandex, and mapbox). Therefore, an opportunity to replace WorldView data with satellite images derived from other sources, making the provided model more universal. We made a prediction for cloud-free composite orthophotomap provided by mapbox [60] using the CNN model trained on RGB 1 m bands. The achieved quality (MAE = 3.5) confirms the opportunity for further model application for basemaps analysis.
There are the following directions for future research. The first involves improving the co-registration between LiDAR and satellite data. Now the developed RGB-based model shows the ability to reconstruct the main patterns corresponding to the CHM (Fig 10); large individual trees and spots within forest are detected successfully. However, satellite data has a slight shift in comparison with LiDAR data. Im-proving co-registration would allow the model's performance to be assessed more accurately for resolutions of less or equal to 1 m and also could probably improve the poor performance for the class of 4-10 m.
The ability of the model to be transferred to new regions is another essential question. As we did not have data from other regions, it is impossible to judge the model robustness for new areas. Moreover, for some regions, the ArcticDEM layer is not available; therefore, additional training for new areas might improve prediction quality. However, the neural network approach has proven to be powerful enough to extract the necessary spatial information and adapt to changing natural conditions. Augmentation and image diversity are often applied to overcome this weakness in real-life applications.
Another possible objective for future research is a canopy height estimation for areas with complex topography. Neural network models rely on landcover's spectral and texture characteristics, making the initial approach promising even when topography is not flat. However, shadows on slopes pose additional challenges to the multispectral satellite image analysis. LiDAR data additional preprocessing is also considered for study areas with complex topography [81].
In this study, we used all available images both for training and testing (splitting them into training and testing regions) as it is a common choice in the remote sensing domain [82]. However, in the future work, image-based cross-validation techniques can be used and robustness for new environmental conditions can be considered [83].

VI. CONCLUSIONS
Overall, in this study we confirm the hypothesis that neural networks can extract significant spatial features from very high-resolution RGB images, which can be used for more precise canopy height estimation. We also checked whether it is possible to get an accuracy of canopy height estimation by using of satellite-based solutions compatible with measurements obtained by UAV approach. For checking our assumptions we analysed the potential of very high-resolution images with limited spectral information in the task of canopy height model estimation. We created a software toolchain based on a state-of-the-art neural network architecture that enable us to extract spatial features from very high-resolution images. The proposed approach led to a reduction in the mean absolute error to 2.4 m, while leveraging just four spectral bands and the supplementary features from ArcticDEM. However, in southern regions where ArcticDEM is not available and without other sufficiently accurate DEM, the model achieved an MAE of 2.9 m. We also examined how generated height can be successfully used in the forest classification task. Our canopy height model estimation results using RGB bands indicated the prospect of replacing expensive LiDAR sensing data with easily attainable satellite data. Depending on the region of study, our technique allows a customer to promptly collect all the necessary relevant forestry inventory information without ground-based observations. We also developed and shared the easy-to-use open source solution which gives a new possibilities for the community to solve similar tasks. In future works, we are planning to include texture data, indexes and other attributes that can be obtained using ArcticDEM in the modeling procedure. index profile in mountainous forests," ISPRS journal of photogrammetry and remote sensing, vol. 132, pp. 77-87, 2017.
[82] E. Saralioglu and O. Gungor, "Semantic segmentation of land cover from high resolution multispectral satellite images by spectral-spatial convolutional neural network," Geocarto International, pp.