TAIGA: a novel dataset for multitask learning of continuous and categorical forest variables from hyperspectral imagery

The spectral and spatial resolutions of modern optical Earth observation data are continuously increasing. To fully utilize the data, integrate them with other information sources and create applications relevant to real-world problems, extensive training data are required. We present TAIGA, an open dataset including continuous and categorical forestry data, accompanied by airborne hyperspectral imagery with a pixel size of 0.7 m. The dataset contains over 70 million labeled pixels belonging to more than 600 forest stands. To establish a baseline on TAIGA dataset for multitask learning, we train and validate a convolutional neural network to simultaneously retrieve 13 forest variables. Due to the size of the imagery, the training and testing sets were independent, with strictly no overlap for patches up to 45×45 pixels. Our retrieval results show that including both spectral and textural information improves the accuracy of mapping key boreal forest structural characteristics, compared with an earlier study including only spectral information from the same image. TAIGA responds to the increased availability of hyperspectral and very high resolution imagery, and includes the forestry variables relevant for forestry and environmental applications. We propose the dataset as a new benchmark for spatial-spectral methods that overcomes limitations of widely used small-scale hyperspectral datasets.


I. INTRODUCTION
Satellite-based remote sensing is a widely accepted tool in forest inventory and for monitoring the extensive forests covering approximately one-third of the world's land surface. This essential service is largely based on high-to mediumresolution multispectral optical instruments with pixel sizes of 10 m or more. While the availability of imagery with better spatial resolution, one meter or less (called Very High Resolution, VHR) is increasing rapidly, its utility in forestry applications has not been demonstrated yet [1]. VHR images are useful in visual analyses, but lack large-scale automated processing chains required for routine forest applications.
A natural approach to automate VHR image processing is deep learning by making use its breakthroughs in computer M. Mõttus vision and speech recognition. Over the past few years, it has been increasingly applied in Earth Observation (EO) [2]. Deep learning methods have been successfully adapted to the spectral, spatial and temporal specificities of EO imagery [3], [4]. In parallel with the ongoing enhancements in spatial image resolution, HyperSpectral Imaging (HSI) aims for the same in the spectral dimension, collecting continuous reflectance spectra using hundreds of narrow reflectance bands, providing the best discriminating power in the spectral domain [5]. The imagery produced by a HSI sensor is often viewed as a data cube with three continuous axes: one spectral and two spatial ones. Building upon the legacy of earlier machine learningbased spatial-spectral methods [6]- [8], deep learning has also been applied to HSI data cubes, with Convolutional Neural Networks (CNNs) being the most popular approach [9]. CNN models can extract discriminatory features from the data cube and exploit its spectral and spatial information. The use of both spectral and spatial information for hyperspectral image analysis has been shown to improve the accuracy in image classification [10]- [12]. Several reviews are already available on machine learning in HSI analysis [13]- [15], indicating that the field has already an established status. Hyperspectral image classification is the most popular method for scene analysis from HSI [16]. Here again, CNNs have been found to be more accurate compared with other deep learning methods or the more traditional classification algorithms, e.g., support vector machines (SVMs) [10], [17]. Recently, 3D-CNN architectures, which use both spectral and spatial information, have been used for classification of hyperspectral data [12], [18]. In addition, novel deep CNNs that integrate both spatial context and spectral signature have been designed for HSI classification in a few studies [11], [19]. Despite numerous works on deep learning for HSI, there remains research gaps and a fundamental limitation related to commonly used HSI datasets.
Prediction of continuous biophysical variables from HSI EO data with CNNs is not yet well studied [20], [21]. Still, this data-method combination has been used in a few studies. For instance, Liu et al. [22] achieved good accuracy in predicting soil clay content with a 1D CNN approach (the coefficient of determination, R 2 = 0.83). CNN has also been found to have great potential for cyanobacteria detection and quantification with HSI [23].
Multitask learning, defined as jointly solving a set of prediction problems by sharing information across tasks [24], has attracted even less attention in scientific publications, but the situation is improving rapidly [25], [26]. In practical EO, it corresponds to the joint classification and regression (variable estimation) with a training dataset of mixed continuous and categorical variables. Furthermore, we chose multitask learning because in forest mapping and many other real-world problems, a coherent retrieval of not only target classes, but also numerous parameters for each detected class, is essential. With a multitask learning approach, the model has an opportunity to learn or take into account those hidden correlations between variables. Multitask learning also saves time and computing resources compared with training a separate CNN model for each variable.
In HSI analysis, the viability of multitask learning has been demonstrated in superpixel-based classification [27] and combining samples from different hyperspectral images to improve classification accuracy [28]. However, the machine learning methods used in EO usually only make use of the spectral information [29] with multitask learning used to to improve image classification. The full potential of multitask machine learning in EO, and HSI analysis in general, is yet to be unleashed.
Whereas numerous machine learning and deep learning methods have been proposed to make use of the spatial and spectral characteristics of hyperspectral imagery, the key limitations of the approach are the adequate training of the model and a statistically sound accuracy assessment on an independent dataset. The lack of large-scale labeled HSI datasets has pushed practitioners to rely heavily on a limited selection of small aerial imagery (e.g., Indian Pines, Salinas and University of Pavia [27], [28]), each consisting of a single image with only a few hundred pixels in each spatial dimension. Training and evaluating models on such small images with randomly sampled large patches as inputs leads to spatial overlap between training and testing sets (directly or through the receptive field of deep networks), resulting in a lack of independence between training and testing sets, and ultimately to over-optimistic accuracy assessment [14]. Although alternative sampling schemes have been proposed to mitigate these effects [30]- [33], these cannot fully overcome the physical limitation caused by the small number of pixels. Fixed data splits between training and testing sets have been introduced for these datasets by IEEE Geoscience and Remote Sensing Society (GRSS) 1 , which further reduce the amount of data for training models and do not completely solve the overlap when larger patches are used. Larger HSI datasets have also been released such as the Houston University datasets from GRSS Data fusion contests in 2013 [34] and 2018 [35], and the Chikusei dataset [36]. These datasets include fixed disjoint training and testing sets that mitigate the overlap issue, yet many pixels are left unlabeled.
The motivation of the manuscript is the need for a largescale HSI dataset, which would enable both deep-learning based multitask learning and a statistically sound accuracy assessment on an independent test set. We present TAIGA, an openly available VHR HSI dataset covering the visible to near-infrared spectral regions (VNIR, 400-900 nm) acquired above a boreal forest in Central Finland. The imagery, offering 70 million labeled pixels, is accompanied by extensive and continuous stand-wise forestry data from regularly updated nationwide measurements for over 600 forest stands. It exceeds the current standard benchmark hyperspectral datasets both in image size and the amount of available labelled data, and enables the joint classification and regression of forest variables. We obtain baseline results for this unique dataset by applying CNNs for multitask learning of the forest variables using strictly independent samples for testing and training. In addition, we compare the results with a previous exercise on forest variable retrieval using only spectral information, therefore quantifying the advantages of combining spectral and spatial features in analysis of VHR HSI data in forestry applications.

A. Study site
The HSI data were collected in the southern boreal zone of Finland, in the vicinity of Hyytiälä forestry field station (61 • 50'44"N, 24 • 17'10"E, Figs. 1, 2). The study area is mostly covered by managed boreal forests, agricultural fields and wetlands. The main overstory species of the forests are, in the order of dominance, Scots pine (Pinus sylvestris), Norway spruce (Picea abies) and Silver birch (Betula pendula), either as mixed or single-species stands growing on mostly Podzolic soils. As a result of management, forest density varies usually between 500 and 1500 trees/ha. Different shrubs, lichens and mosses cover the forest floor, no bare soil is visible. The 30year average annual precipitation at Hyytiälä is 711 mm, and the annual mean temperature is 3.5 • C [37]. The area has a mean elevation of approximately 150 m above sea level (asl) and does not exhibit large variations in elevation, with some undulating hills reaching less than 200 m asl. . The hyperspectral data were projected to UTM geographical coordinates using satellite and inertial positioning data and a digital elevation model based on nationwide laser scanning data with the Atcor software package (ReSe Applications GmbH, Switzerland) and resampled onto an orthogonal grid. The data were further converted to top-of-canopy reflectance with the Atcor package (ReSe Applications GmbH, Switzerland) using sun photometer-measured aerosol optical properties provided by the the AErosol RObotic NETwork (AERONET, https://aeronet.gsfc.nasa.gov), and mosaicked. A comprehensive description of the data preprocessing chain applied to the airborne data is reported by Markiet et al. [38]. The measured area was approximately 9 km by 3 km. Based on a comparison with fixed ground points, the geolocation accuracy of the data is approximately 2-3 m, determined mostly by the accuracy of the top-of-canopy digital surface model and the inertial measurement system used during the acquisitions.

C. Finnish forestry data
We used field data based on the forest resource information gathered and processed by the Finnish Forest Centre (FFC) on privately owned Finnish forests. The inventory process is executed by FFC as a statutory assignment together with a few Finnish forestry and remote sensing companies. General outline of the laser scanning based stand level inventory is described by Maltamo et al. [39] and a more detailed description of the inventory process by Holopainen et al. [40]. The inventory process is summarized below.
The forest data is collected by nationwide field measurements and airborne remote sensing -Airborne Laser Scanning (ALS) and aerial imaging. The number of reference plot measurements per one inventory area, approximately 100,000-200,000 ha, is about 600-800. The plots typically have a radius of nine meters, and variables measured in the field (by species) are mean height, basal area, stem count, mean diameter and age. The reference plots are distributed to represent well the existing forest variation. The ALS campaign is performed during a single growing season (leaf-off or leaf-on). If the ALS data is acquired on a different year than the field measurements, the reference plots can be grown with models to match the correct epoch. The aerial imagery are fourchannel images (visible and near-infrared).
In the inventory process, the numerical forest variables are imputed from ALS data for an inventory unit, a 16 m × 16 m cell. The units form a nationwide grid. Statistical nonparametric models are built for each variable for predicting them from the height distribution and the intensity of the ALS data, calibrated with the field plot data. The texture metrics and the spectral statistics of the aerial imagery are used and auxiliary data source, e.g. in species determination. Due to their empirical nature, models are specific to inventory areas. In addition, information about different forestry related soil properties (e.g., soil type and fertility class) are retrieved for the grid cells from external data sources.
The stand-wise data by FFC contains information characteristic to the stand (e.g., soil type, fertility class) or to a class of trees growing inside the stand, called a "stratum". Each stratum quantifies the contribution of trees similar in species, age, height, etc. to the forest stand. The cumulative characteristics of a stand, such as its basal area, above-ground woody biomass or stem volume, were obtained as sums of the corresponding values of the strata; mean tree height and diameter at breast height (DBH) were computed as basal-area weighted average of the tree heights of all strata in a stand. Similarly, the species composition of a stand was calculated on the basis of basal area. The forest stand boundaries were contracted by 10 m buffer in further processing to avoid geolocation errors and border or neighborhood effects in the hyperspectral data.
In Finnish forestry, forests and their resource data are managed on stand level. A stand is an aggregation of trees that are sufficiently uniform in species composition, size, arrangement and age. Definition of stands starts by a segmentation of forest area into "micro stands" using canopy height model and aerial imagery. Forest stands are then delineated by combining similar neighboring micro stands into economically viable forest units (Fig. 3).
The stand level forest resource data are aggregated from the grid cells using the vector files with stand geometries. If one stand geometry includes enough complete or nearly complete grid cells, the variables of the stand are computed using these grid cells. If not enough grid cells fall inside the stand, some partly intersecting grid cells are also taken into account using their weighted averages, decreasing the accuracy of the stand variables. The weight is comprised of two elements: grid cell surface area proportion within a stand and weight factor describing the data reliability of a grid cell. The latter takes into account the poorer reliability of a cell that partly intersects the stand compared to one falling completely inside it.
Forests in different parts of the country are measured each year. The nation-wide forest variable data are continuously updated based on these measurements, forest management notices and empirical forest growth models. We used the forestry data with a reference date in the end of year 2017. Classification of site fertility in Finnish forestry is based on forest floor vegetation [41]. A majority of the study sites belonged to the following four fertility classes, listed in decreasing order of fertility: herb-rich, mesic, sub-xeric, and xeric heath forests [42] (codes 2-5 in the Finnish system). The few stands belonging to other classes did not constitute a sufficient sample and were merged with their closest class among the four listed above -either herb-rich heath or xeric forest. Additionally, we significantly reduced the number of soil types given in the forestry database. In Finnish boreal forests, soil is covered by mosses, herbs, lichens and shrubs and will not have a direct effect on the reflectance signal. We therefore merged the ten distinct soil types in the database into two: mineral and organic soils (coded as 1 and 2 in the data file).
Using the FFC data, we calculated the percentages of pine, spruce and broadleaf (i.e., birch) in each stand based on their basal areas. The species with the largest basal area was considered the main tree species (note that this differs slightly from the definition used by the FFC data). For each stand, we retrieved the key forest variables, mean diameter at breast height (1.3 m, DBH), stem density (trees per hectare), basal area (cross-section area of tree trunks at 1.3 m, in square meters per hectare), and mean tree height.
In addition to traditional forest variables given in the database, we estimated the leaf area index (LAI) of each stand using the forestry data and the biometric regressions published by Repola [43], [44] for Finnish boreal forest. We used the species-specific specific leaf area and shoot silhouette to total area ratios suggested by Majasalmi et al. [45] for the same geographic area, to obtain the total LAI and shoot-corrected effective LAI for each stratum. The effective LAI values take into account the effect of the clumping of conifer needles into shoots on the transmission of incident sunlight and were considered as potential predictors of the reflective properties of forests. For broadleaves, effective LAI was equal to the LAI. Note that the effective LAI values used here are abstract concepts and are not supposed to correspond to the optical canopy transmission measurements as only shoot-level clumping was considered, clumping of leaves around branches and into crowns was ignored due to lack of reliable information. The list of the stand characteristics used in the study is given in Table I. The dataset containing the hyperspectral image, stand boundaries and the associated forest variables will hereafter be referred to as The Artificial Intelligence dataset for forest Geographical Applications (TAIGA). The python code used to process the forest variables in TAIGA from the open data by FCC is available on GitHub in the taiga-baseline directory in https://github.com/aalto-cbir/AIROBEST.
Part of the stands within the hyperspectral image were not suitable for further analyses. We created a list of unsuitable stands using thresholds for minimum area, the Normalized Diffrence Vegetation Index (NDVI) computed from red-and near-infrared reflectance [46], and effective LAI. The LAI and NDVI thresholds, 0.86 and 0.61, respectively, were selected visually and the minimum area of 0.5 hectares is a general recommendation from FFC. The LAI threshold excluded stands that were logged between the flight campaign (June 2017) and the epoch of the forest data (end of 2017), the NDVI threshold excluded additional recently logged stands with little leaf cover. The unsuitable stands were mapped as "no data" in TAIGA.
The TAIGA dataset (https://doi.org/10.23729/ fe7ce882-8125-44e7-b0cf-ae652d7ed0d5), including the hyperspectral image, forest variable data, and metadata are available through the Fairdata IDA service (https://etsin.fairdata.fi), a continuous service for safe research data storage organized by the Ministry of Education and Culture of Finland. The data are stored as geolocated rasters in ENVI format. The data are freely available under Creative Commons Attribution 4.0 International (CC BY 4.0) license.

III. METHODS
A. Data pre-processing for the CNN We predicted three categorical and 10 continuous forest variables (Table I) from the first 110 bands in the airborne HSI data. The bands with wavelengths above 910 nm were discarded due to their high noise level. To make the hyperspectral and forest data structurally similar, we created raster images of the forest variables, where the values of all forest variables were constant within the same stand, with the exception of 10-m buffers, in line with the processing of the hyperspectral imagery. The continuous variables were normalized between 0 and 1 after clipping to the 98 th percentile.
The stands were split into training, validation and test sets (Table II) so that the training and testing samples were never overlapping, and no single pixel appeared in both training and testing sets. The input image patch size used by the 3D CNN was 45×45 pixels. The patch size was selected large enough to include relevant spatial context for the analysis of submetric hyperspectral images. We tried several sizes, from 27×27 to 91×91 pixels, with 45×45 offering the best compromise quantified by the accuracy measures (RMSE and rRMSE) for basal area and LAI, while still strictly forbidding any spatial overlap between training and test sets selected within the TAIGA dataset [47], [48]. The center pixels of the patches were picked from the stands with stride of 13×13 pixels so that a 27×27 image window centered in the sample center was entirely inside the buffered borders of a forest stand. We ignored all non-forest pixels (i.e., water, roads, agricultural fields and residential areas).

B. TAIGA hyperspectral analysis CNN pipeline
Our baseline 3D CNN pipeline was inspired by the models studied by Chen et al. [20], who found out that 3D CNN could simultaneously utilize both spatial and spectral features from the hyperspectral image. For this work, we studied and experimented with also various other CNN models that were designed for hyperspectral imagery, including Lee's model [19] and He's model [49]. These models performed well on single task learning on small datasets. However, the performance degraded significantly when applied to multitask learning problems. Among these models, Chen's model showed the most promising results, while the Lee's and He's models provided clearly inferior results.
The similarities between our model and Chen's model include the use of the same feature extraction part (up to the shared FC layer in our architecture) that utilizes 3D convolution operations, and the use of the same data augmentation methods. Considering the complex nature of the data and tasks, aside from those similarities, our model has some major differences in comparison to Chen's model. Our model has separate output layers for multitask learning that can predict both discrete and continuous variables. Beside oversampling minority classes, we also applied different approaches to tackle the class-imbalanced problem, including cost-sensitive learning, class rectification loss [50] and focal loss [51]. Since out model was designed for multitask learning, we applied GradNorm [52] and uncertainty loss [53] to balance learning among specified tasks. We also carried out extensive numerical experiments for finding the optimal choices for the 3D CNN model configuration, including the image patch size, the number of convolution kernels, kernel sizes, various regularization and sampling techniques. The detailed experiments and results of these experiments are presented in [54].
In this article, we describe the best-performing variant of our experiments as the TAIGA baseline method. The pipeline of the architectures is illustrated in Fig. 4. We adopted the hard parameter sharing approach for our multitask learning problem, which means that the pipeline includes two parts: shared layers and task-specific layers.
The shared part of the pipeline consists of four convolutional (Conv) layers (Conv1 to Conv4 in Fig. 4) interleaved with two pooling (Pool) layers and followed by a shared fully connected (FC) layer. At each Conv layer, we used a stride of one and a padding of zero for the 3D convolution operation. For the pooling operation, a kernel of size 2×2×1 was used. After each pooling layer, we applied the non-linear ReLU activation function.
We used kernels with sizes of 3×3×54, 3×3×32 and 3×3×32 for the three simple convolution layers Conv1, Conv3 and Conv4, respectively, in our baseline model. We adopted the idea of multi-scale convolutions in our model, and used a multi-scale convolution block that comprises four convolution layers with the kernel sizes of 1×1×1, 1×1×3, 1×1×5, 1×1×11, respectively. These convolution filters are used to exploit the spectral correlations. No pooling is applied after these layers. The outputs of the convolution layers are then summed together.
The output from the Conv4 layer was flattened and fed to the shared FC layer, which produced a feature vector with a size of 512. Each task, i.e., one forest variable, had two task-specific stacked FC layers, which learned the mappings between the extracted features and the output for each individual task. The first FC layer takes a 512-length feature vector from the shared FC layer and transforms it into a 200-length feature vector. If the task is a classification task with C classes, this vector is mapped to a vector of length C in the last FC layer and then fed into the Softmax function to compute the probabilities for all the classes. Otherwise, if the task is a regression task, the last FC layer is used to predict the output directly.
We used cross-entropy loss for the classification tasks and mean squared error loss for the regression tasks. The model was trained with optimized batch size of 32 samples whose maximum value was determined by the memory constraints together with the image patch size. The learning rate was optimized in our experiments as described in [54], where the value 10 −4 was found to be optimal in the range [10 −7 , 10 −1 ]. We used the Adam optimizer with the weight decay η at 10 −4 , ρ 1 at 0.9 and ρ 2 at 0.999, which are commonly used as their default values.
Since the data is imbalanced, we used sampling with synthetic minority class samples as described in more detail in [54] to improve the classification performance of the categorical variables. For oversampling the minority classes, we used radiation-based and mixture-based methods as described in [20]. Class rectification loss [50] and focal loss [51] are the two loss functions that were used to mitigate the class-imbalance problem. We used the hyperparameter values recommended in the original papers for these loss functions. With the focal loss, our model achieved better results on our multi-class multi-label classification tasks (see Pham [54] for details). Oversampling seemed to lead to model overfitting by seeing rare instances too often.
The code for the model and the baseline experiments for the TAIGA benchmark are available in the taiga-baseline directory in https://github.com/aalto-cbir/AIROBEST.

IV. BASELINE RESULTS
Our 3D CNN model produces predictions of the 13 categorical and continuous variables for each data pixel in test set sample. As an example, a subset of the map for above-ground woody biomass is shown in Fig. 5. The stand-wise predictions, used for accuracy assessment, were obtained as arithmetic means of the pixel-wise values for continuous variables or as majority estimates for the categorical ones.
For the categorical variables, we computed the overall (micro) and the mean class (macro) accuracies and formed the confusion matrices. The micro accuracy tends to favor the model's good performance for the majority class or classes, whereas the macro accuracy gives equal weight also to the minority classes. For the continuous variables, we used the basic statistics commonly used in forest remote sensing studies, such as the root-mean-square error (RMSE), relative RMSE (rRMSE), relative mean bias (rBias), and coefficient of determination (R 2 ). Additionally, we computed the widths of the 90% and 95% confidence intervals for the continuous variables. Additional experiments and their results in the same setting can be found in [47].
The retrieval accuracy metrics presented in Table III can be considered very good for forest variable retrieval from EO data [56] with R 2 ≥ 0.7. The accuracy was generally better than the results obtained in a previous study based on the same data [55], but excluding spatial information. The most accurately predicted variable was tree height (rRMSE 12%). Tree species composition was predicted with the largest uncertainty -the 95% confidence interval was approximately 50 percentage points for all three species. Besides species composition, the rRMSE value for stand density (41%) was higher than that of other structural variables.
For two selected key structural variables, woody biomass and mean tree height, we can see that the estimated values lie in several clusters along the diagonal of the predicted vs. target plot (Fig. 7) and the tails of the error histograms fall off reasonably rapidly (Fig. 8).
The histograms of field data on the variables are not uniform indicating that the management regime of forests has undergone some changes during the rotation period: the tree height distribution has minima at approximately 12 and 22 m (Fig. 7). The structure in the target tree height distribution is not fully represented in the estimated height with an underestimation of the tree heights for the tallest stands. A similar effect of smoothing of the parameter distribution is visible also for woody biomass.
The overall accuracies varied largely among the categorical variables. The largest values were generally on the main diagonal of the confusion matrices ( Fig. 6) with the exception of mesic sites predominantly predicted to be herb-rich (Fig. 6a).
We also wanted to verify that the multitask learning approach is not only beneficial in the sense of efficiency in training a single network model instead of multiple ones, but also leads to prediction accuracies that are not much degraded. First, we studied the multitask predictions of the categorical and continuous variables only, so practically having two models instead of one. Second, we ran experiments where we predicted each categorical and each continuous variable separately, which is the basic setup of non-multitask learning. When only the categorical variables were predicted, the average overall micro accuracy was 78.7% and the mean class macro accuracy 75.3%. It thus seems that multitask learning of all variables together has been beneficial for the micro accuracy but not for macro accuracy. For the prediction of only all the continuous variables, the rRMSE value was 23.6% and ant the R 2 value 0.79. These are both slightly better than those in Table III, but the obtained average rBias value −3.1% is worse than the result in the table. Among the single categorical variables, we chose to evaluate predictions of the fertility class and obtained micro and macro accuracies of 63.0% and 52.9%, respectively, which can be seen to be inferior to the multitask prediction results. Correspondingly, basal area was predicted as a singletask manner and the rRMSE, rBias and R 2 values were 14.2%, −4.0% and 0.76, respectively, and again the rRMSE and R 2 values are somewhat better than the multitask results. It thus seems that the multitask learning approach does not either improve nor degrade the accuracy of the predictions results considerably and the observed gains and losses due to it are approximately equal in their size.

V. DISCUSSION
The value of the zettabytes of Earth observation data produced by the numerous remote sensing satellites can only 128 128 128 128 128 64 32  be utilized by integrating it with other information sources [57]. This is especially true for developing and validating novel machine learning tools for practical, real-life application. These actions require a large number of reliably labeled pixels accompanied by sufficient spatial extent in the spatially registered imagery to allow spatial separation of the training and testing subsets. Thankfully, many national policies answer this need by opening up geospatial datasets, such as the highquality forest variable data forming a part of TAIGA used in this study.
The forestry dataset used in this study includes both continuous and categorical variables and is tightly related to the real-world problems faced by remote sensing scientists. The forest variables in Table I are relevant not only to the forestry community but also to biologists, Earth system scientists, and many others. A simple classification approach, using categorical variables, does not answer the information need for many important mapping activities. The multitask learning approach presented here was chosen to address these gaps. With our experiments we also showed that the multitask approach gave prediction accuracies comparable to those of singletask learning while providing the benefit of hugely reduced complexity in model training.
Furthermore, the amount of published studies on multitask learning in remote sensing is still fairly limited. We release the TAIGA dataset openly and propose to use it as a benchmark for hyperspectral multitask learning in remote sensing, hoping to encourage more activity and developments in this field of research.
The results presented here are good compared with the general accuracy obtained for forests with remote sensing [56]. Utilizing both spatial and spectral information yielded  Table III. Abbreviations: h-rich = herb-rich, sub-x = sub-xeric, miner. = mineral, org. = organic.
better results compared with earlier Gaussian process regression based results with the same datasets and evaluation procedure [55]. Use of structural information improves forest variable retrieval somewhat, in line with the analysis by White et al. [1]. High spatial and spectral resolution data will not be available jointly on a large scale, at least not in an operational fashion, in the foreseeable future. The relatively small advantage of high-resolution imagery does not likely encourage development of specialized systems for forest monitoring with VHR HSI instruments. However, operational hyperspectral imaging on a global scale is becoming a reality with high-resolution (pixel size 20-30 m) instruments [58]. While there are parts of our planet for which VHR imagery is not readily available [59], the availability of these data is improving literally by the day. We suggest that combining VHR and HSI data from different sensors with advanced machine learning approaches may considerably improve the robustness of forestry data, especially for regions where national forest inventory data are not available. The 3D CNN model used to produce the baseline results of this article was the outcome of a series of experiments and improvements [54] that were left out from this article for the sake of brevity. The most important design choice in our CNN model was the use of multitask learning in one network instead of devising multiple separate models, one for each output forest variable. We are currently in the process of improving our CNN model for both the categorical and continuous variables.
In particular, we look forward to studying the performance of the novel transformer architectures [60].
As with many natural observations, TAIGA data are heavily imbalanced, as is seen from the confusion matrices (Fig. 6). Furthermore, the reference forestry data has its limitations. It is based on a large number of forest plot measurements interpolated by mainly airborne laser scanning, which means that some inaccuracies are unavoidable. Unfortunately, only a very small number of the plots were within the HSI area, making their direct use impossible. Airborne HSI data are acquired even for a relatively large area within hours to days, depending on weather. Satellite sensors acquire their imagery near-instantaneously. Field plots, on the other hand, may take months to years to acquire on a regional scale, with national measurements spread over several years. The different temporal scales direct towards the use of well-maintained forest variable databases, which are updated based on forest management action notices and growth models, and can be used to retrieve forest information for the instance of image acquisition. The use of stand level forest variable estimates and accuracy metrics in this study was motivated by the purpose of the FFC forestry data: they are computed to minimize errors at stand level, driven by the requirements of practical forest management. National forest inventories, for example, are designed to minimize bias at regional to national scales.
Naturally, a machine learning algorithm cannot exceed the (largely unspecified) error in the stand level forest variables and the room for improving prediction accuracy is limited. We know that tree height is very reliably estimated from airborne laser scanning data in Finland with rRMSE values below 7% [61]. An rRMSE value of 12% for this variable (Table III) indicates that there is room for improvement. The indirect nature of the reference data also distorts the distributions of the forest variables. Still, a lot of structure is retained in the FFC data (Fig. 7), and some -although clearly less -reproduced by the baseline retrieval algorithm. In addition to the accuracy metrics in Table III, the structure in the predicted distributions is a prerequisite for a successful EO-based forest variable retrieval algorithm.
Up-to-date and accurate forest data are needed for almost all regions of the globe. Technical developments are taking place which will make hyperspectral (CHIME [58]; EnMAP [62]; PRISMA [63]) and VHR EO data available everywhere, promising improvements for future forest monitoring systems. Machine learning can provide data processing algorithms capable of handling the large volume of high-dimensional data. To develop and test these algorithms, datasets such as TAIGA presented here will be essential. Our results, although promising, need to be further improved and extended to be truly applicable on national to continental scales. To this end, it would be beneficial to release also other benchmark datasets for multitask learning in different biomes. We hope that the TAIGA dataset and the baseline results presented in this paper will support improvement of the tools required to implement them.

VI. CONCLUSIONS
We have presented the TAIGA dataset containing a hyperspectral image with more than 70 million pixels labeled with both categorical and continuous forest variables. The dataset is available via the Fairdata IDA service (https://ida.fairdata.fi). We have demonstrated a baseline system with a multitask convolutional neural network which makes use of the spectral and spatial information in the data, available in the taiga-baseline directory in https://github.com/aalto-cbir/AIROBEST. The estimations presented in this article are more accurate than the ones made with spectral information only. The openly available data and the presented baseline results will help to make better use of the ongoing improvements in Earth observation technology. Since then he has worked at VTT as a research scientist in Earth Observation, with a main focus on machine learning and change detection for optical satellite images. He is an author of 60 articles in scientific journals and conferences. His current research interests include deep learning for multispectral and hyperspectral images, unsupervised change detection and satellite image time series analysis, with applications to environment monitoring.
Hai Cu is pursuing his Master degree in Aalto University, major in Machine Learning, Data Science and Artificial Intelligence with the thesis topic focusing on the reliability of the forest variable predictions from Deep learning hyperspectral model. He has participated in AIROBEST project since early 2020.