Compression Supports Spatial Deep Learning

Over the last decades, spatial computing has become increasingly data driven, especially where remote sensing imagery is involved. Satellites deliver huge volumes of images, and the number of available datasets keeps growing. Deep learning on this data for scene classification therefore demands large storage capacities and high computational budgets, consuming and blocking valuable hardware resources, energy, and time. In this article, we discuss how aggressive compression algorithms can cut wasted transmission and resources for selected land cover classification problems. To compare the different compression methods against classification performance, the satellite image patches are compressed in two ways. The first is image quantization, which reduces the bit depth of the data. The second is lossy and lossless compression via image file formats, such as JPEG and TIFF. Classification performance is evaluated with convolutional neural networks (CNNs) such as VGG16. The experiments indicate that not all remote sensing image classification problems benefit from using the full available information. Moreover, compression can set the focus on specific image features, leading to smaller storage needs and reduced computing time at comparably small costs in quality and accuracy. All in all, quantization and embedding into file formats do support CNNs in estimating image labels by strengthening the relevant features.


I. INTRODUCTION
Observing the Earth remotely is one of the most fascinating branches of spatial data generation. Nowadays, it is possible to use multiple sensor modalities (RGB aerial images, multi- and hyperspectral images, SAR images, etc.) to observe the Earth's surface from aerial and especially satellite platforms.
These data sources form the backbone of environmental research and the investigation of global challenges such as deforestation, climate change, or urbanization. Most important to these research activities are the number and regularity of observations, the spatial coverage, the quality of images including their resolution, and the accessibility as well as exploitability of this data in a practical scenario. Unfortunately, increasing resolution, quality, number of observations, and coverage expand the amount of data that is generated, which decreases the accessibility and exploitability of this data. We are currently at a point where much research claiming global relevance is either not global (reducing "global" to a few small locations spread over the globe, often not covering more than one climate or cultural zone) or not reproducible for average universities, let alone individuals and nongovernmental organizations. In this area, a great deal of services is provided by major Internet companies, such as Google offering Google Earth or Amazon hosting at least Sentinel data free of direct charge. However, these offers have immense indirect costs, implying both a vendor lock-in and empowering web companies to govern the data we use and the way we use our data. This is about to form the next Internet oligopoly, in which a few actors get access to all intellectual property free of charge and can thus occupy a strategic position for our data-driven future.
To some extent, however, the majority of the data generated clearly does not contribute to the solution of a concrete given goal. For example, the high radiometric resolution of current satellite missions like Sentinel does not help in low-cardinality classification tasks such as land use. Another issue refers to global cloud disturbances in optical satellite missions like Sentinel-2 and GF-1, which lead to massive numbers of invisible pixels in multitemporal images [1]. Similar issues related to noise pollution, especially for hyperspectral images (HSIs), have long been noticed by the community [2]. In addition, the data structure in granules (e.g., Sentinel) leads to huge amounts of irrelevant data downloads: a polygon on Earth, such as the boundary of a city, will intersect with quite a few granules, and many of the delivered images might not actually contain data within the city footprint because the satellite only touched an edge of the granule. Furthermore, many derived products deliberately ignore many of the available channels (e.g., by keeping only the RGB bands) in optical remote sensing.
In this article, we want to discuss how aggressive compression (especially compression with a high loss component) can help in reducing wasted transmission and storage costs, saving time and energy for selected problems, and thereby mitigating the reproduction crisis implied by the current trend of data explosion. We cluster pixel data for an aggressive quantization and use various lossless and lossy compression file formats to represent the whole dataset, evaluating how these parameters affect the performance of a simple classification system based on VGG16 [3], ResNet [4], and MobileNet [5], all pretrained on ImageNet. The process is visually represented in Fig. 1, which is separated into three main parts. First is the source input data, which gets quantized and embedded into image file formats in the second part of the figure. The last step is to estimate the label of the quantized and embedded images using a convolutional neural network (CNN). The contributions of this article are summarized in three points.

1) We propose a processing pipeline that quantizes images and embeds them into a range of selected file formats, where some of the formats apply lossy or lossless compression. Compared to the baselines, image quantization can lead to similar accuracies without a significant performance drop. In some cases, the accuracies are even slightly higher than with the source data.

2) The conducted experiments on three different remote sensing datasets show that the quantization algorithms allow aggressive compression and cut data communication and storage needs significantly. This leads to more efficient training on GPUs and faster processing on custom hardware.

3) The experiments conducted with the processing pipeline show that the performance remains comparable under compression. Quantization and embedding into file formats can support the feature map within the images and lead to comparable results by making some features more prominent than others.

The rest of this article is organized as follows. Section II briefly discusses related work. Section III introduces the proposed approach, including details of image quantization by clustering and the embedding into file formats. In Section IV, we present the considered datasets and evaluate the classification results. Finally, Section V concludes this article.

II. RELATED WORK

A. Compression in Deep Learning
The compression of data remains an important and current topic in all fields and is studied intensively. It has also found its way into the classification of image data, such as the detection of land cover types, using machine learning and deep learning algorithms. One line of work attempts to improve the compression of images with deep learning. Another investigates the influence of compression, or rather of its artifacts, on the performance of a given neural network. This article focuses on the latter. One approach studies how the quantization of images affects CNNs. The experiment performed in [6] deals exactly with this case, combining quantization and pruning of data. For this purpose, all data were quantized to different levels and unimportant parameters were removed by pruning before classification. In a subsequent classification using CNNs, for example VGG16 or ResNet50, the results were compared with each other. However, these experiments concern imagery in general, not the context of remote sensing such as land cover classification, and the compression there is based, among other things, on pruning rather than on conventional compression-based formats such as JPEG. The experiment described in [7] addresses exactly this kind of format-based compression and its influence on classification methods; in contrast to the previously mentioned publication, the influence on machine learning algorithms is investigated there. Furthermore, although the images are embedded into the JPEG file format, the compression is not pushed to its limits. Other publications, such as [8] and [9], also deal with the influence of compression on the classification of image data using deep learning, but mostly not in the area of satellite-based remote sensing and conventional compressors. In addition, there are techniques aiming at more efficient handling of satellite images, such as hashing-based image search and retrieval [10], [11], [12], which can significantly reduce the amount of memory required for storing massive satellite images in auxiliary archives; in this article, we focus instead on the compression of the satellite images themselves to better support spatial deep learning. To our knowledge, little research has been done on the impact of quantization in combination with conventional compression-based image formats, such as JPEG or PNG, on the performance of CNNs in the context of satellite-based remote sensing.

B. Art of Storing Data
In the context of spatial computing, there are roughly two groups of data: the raw data and preprocessed versions of it, covering, for example, the tiling of images, the generation of labels, and so on. While raw products are stored without compression to avoid any loss of information, this work focuses on the preprocessed data, specifically images. These can be stored in various ways, which can be grouped into datasets stored using file formats that apply compression algorithms to the images and those that store the images without compression. In the domain of computer vision, ImageNet is a famous dataset consisting of over 1,130,000 individual images stored as JPEG. One needs to take into consideration that for spatial images, such as in the domain of remote sensing, spatial information can be added to the metadata. Additionally, while ImageNet provides three bands, Sentinel-2 datasets can provide up to 13 bands. Moreover, high-dimensional remote sensing data, such as Sentinel-2 multispectral or even HSI data, can be challenging for deep learning models, as in [13] and [14], especially considering their huge data volumes and high computational complexity [15]. To this end, the developments in these two domains (computer vision and remote sensing) are not directly comparable. The dataset BigEarthNet, provided and described in [16], [17], covers 590,326 16-bit remote sensing images, where each of the 12 included bands is stored as an individual TIFF file without any compression. The same holds for other datasets, such as [18] and [19], although some of them store all bands in one image and others provide only 8-bit images. While one advantage of TIFF is that this file format can hold more than three or four channels in one file [20], this advantage does not apply when individual files are used for each band.
Another technique for storing data is to use hierarchical data structures instead of common formats, such as JPEG or TIFF, on the file system of the operating system. Hierarchical data structures can save multiple images into one single file and keep them accessible, which is especially relevant for high-performance computing (HPC) systems. One such format is HDF5, provided by [21], which can hold data such as RGB images but also images with a high number of channels. Additionally, it is possible to define subsets of the data: while the HDF5 file holds all images, only a subset of them needs to be accessed. Another format is TensorFlow's TFRecords, provided by [22]; similar to HDF5, it can store data in various formats within one single file. The format is specialized to hold data for deep learning processes and is suitable for HPC systems, which makes it easy to combine with the approach proposed in this article.
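To make this pattern concrete, the following is a minimal sketch using h5py; the file name, dataset names, and shapes are illustrative and not taken from this article:

```python
import h5py
import numpy as np

# Illustrative stand-in data: 1000 RGB patches of 64 x 64 pixels.
patches = np.random.randint(0, 256, size=(1000, 64, 64, 3), dtype=np.uint8)
labels = np.random.randint(0, 10, size=(1000,), dtype=np.int64)

# Store all patches and labels in one HDF5 container instead of
# thousands of individual image files on the file system.
with h5py.File("patches.h5", "w") as f:
    f.create_dataset("images", data=patches, compression="gzip", compression_opts=9)
    f.create_dataset("labels", data=labels)

# Access only a subset: HDF5 slices lazily, so the full dataset
# never has to be loaded into memory at once.
with h5py.File("patches.h5", "r") as f:
    batch = f["images"][:128]
    batch_labels = f["labels"][:128]
```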

III. PROPOSED APPROACH
In this section, we develop an end-to-end method to investigate and evaluate the impact of compressing images by reducing their bit depth combined with embedding them into selected file formats. The scope is to evaluate the impact of certain compressive combinations. Additionally, with this end-to-end method we evaluate whether the performance, in terms of time and power consumption, can be optimized by choosing a different combination than the established ones. Before we formulate the problem, it is necessary to define the scope. More concretely, we formulate the following three hypotheses, which will be discussed throughout this article based on experimentation with a given basic task from the family of land cover classification on visible light imagery (RGB).
1) Hypothesis I: Even (almost) trivial approaches to image quantization can increase the accuracy of models based on fine-tuning existing models.

2) Hypothesis II: Quantization allows aggressive compression and cuts data communication and storage costs significantly.

3) Hypothesis III: Only a very small fraction of the information encoded in remote sensing imagery is relevant to land cover classification.

The overall design of the end-to-end method, which is built step by step in this section, is represented in Fig. 1. The visualization shows the architecture used to investigate the impact of compressing images on CNNs. The input data are first quantized to a range of given target bit depths and afterwards embedded into a range of file formats, some of which apply lossy compression algorithms. For each parameter pair of quantization and file format encoding, an individual CNN is trained on the pixel values to estimate the class. Finally, all training results are compared to see how much the compression of the input data influences the classification.
Problem definition: More formally, we can define a classification network with parameters $\Theta$ as $F_\Theta(\cdot)$. Additionally, a dataset $D$ consists of images and labels, where each image $x$ has a corresponding label $y$, forming a pair $(x, y)$. The defined network estimates a label $\hat{y}$ from a given input image by $\hat{y} = F_\Theta(x)$. In order to train the network, its loss function $L(\hat{y}, y)$ needs to be minimized. The following equation defines the minimization of the loss:

$$\Theta^* = \arg\min_\Theta \sum_{(x, y) \in D} L(F_\Theta(T(x)), y) \tag{1}$$

where $\Theta^*$ is the optimal parameter set and $T$ denotes a combination of quantizing an input image to a given bit depth $b$ and embedding it into a specific file format. Therefore, the optimal parameter set $\Theta^*$ is influenced by $T$. The next subsections further investigate the quantization step and the image compression using file formats.
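As an illustration of the role of $T$, the following sketch composes the two steps in code; Pillow's built-in palette quantization stands in here for the k-means procedure described in Section III-A, and all names are ours rather than the article's:

```python
from io import BytesIO

import numpy as np
from PIL import Image

def transform(x: np.ndarray, b: int, fmt: str, **save_args) -> np.ndarray:
    """T: quantize image x to 2**b colors, then round-trip it through
    a file format, e.g., transform(x, b=4, fmt="JPEG", quality=10)."""
    img = Image.fromarray(x).quantize(colors=2 ** b).convert("RGB")
    buf = BytesIO()
    img.save(buf, format=fmt, **save_args)  # embed into the file format
    buf.seek(0)
    return np.asarray(Image.open(buf).convert("RGB"))
```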

A. Image Quantization
One of the compressive methods in the architecture of Fig. 1 is image quantization, where we compress the image by reducing the number of bits available per channel. In this way, the color space of the quantized image $x_q$ is smaller than that of the source image $x$. In this context, it is important to differentiate between lossy and lossless compression. Compression is lossless when the source object can be reproduced exactly from its compressed representation. In contrast, lossy compression achieves a higher compression rate but accepts differences between the reproduced object and the source. Since image quantization reduces the bit depth of the image and therefore loses information, it is a lossy compression procedure [23].
There are several methods to quantize an image. One basic method is to reduce the number of bits per pixel and channel by clustering; in this article, the k-means algorithm [24] is selected and applied to each image.
The k-means algorithm is a rather basic algorithm and consists of several steps. Its basic concept is to measure the distances between given data points, such as $\{R, G, B\}$ pixel values. The dissimilarity measure can be chosen depending on the type of data; a common choice is the Euclidean norm. After initializing $k$ random cluster centers $\mu_k$, all distances are measured and each data point $x_n$ is assigned to one cluster by

$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $\|\cdot\|$ denotes the Euclidean norm and the binary indicator variable $r_{nk}$ signals cluster membership. Combining these equations, the k-means distortion function is defined as

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2 \tag{3}$$

where $N$ is the number of data points. Each value in the $\{R, G, B\}$ image channels is then represented by its cluster; clustering the values of the source image $x$ into, e.g., 64 clusters is equivalent to $\log_2(64) = 6$ bits. Finally, the quantization of an image $x$ to a given number of bits $b$ can be defined as

$$x_q = J_{\mathrm{RGB}}(x, 2^b) \tag{4}$$

where $x_q$ is the compressed image and $2^b$ is the number of clusters [23], [24]. The image quantization process using the function $J_{\mathrm{RGB}}$ reduces the color space. Since k-means is used for the quantization process, the computational complexity is a quadratic term, $O((w \cdot h \cdot p)^2)$, where $w$ represents the width of the image, $h$ its height, and $p$ the number of channels. The effects are visualized in Fig. 2.
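A minimal sketch of this per-image quantization $J_{\mathrm{RGB}}(x, 2^b)$ with scikit-learn's k-means could look as follows; variable names are ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_image(x: np.ndarray, b: int, seed: int = 0) -> np.ndarray:
    """Quantize an (H, W, 3) uint8 image to 2**b colors with k-means."""
    h, w, p = x.shape
    pixels = x.reshape(-1, p).astype(np.float64)  # one row per pixel
    km = KMeans(n_clusters=2 ** b, n_init=1, random_state=seed).fit(pixels)
    # Replace every pixel by the center of its assigned cluster.
    centers = km.cluster_centers_[km.labels_]
    return centers.round().astype(np.uint8).reshape(h, w, p)

# Example: reduce a 24-bit image to 4 bits (16 colors): x_q = quantize_image(x, b=4)
```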

B. Image Compression Using Data Formats
The second type of compressive method within the architecture of Fig. 1 is the embedding of raw images into file formats like JPEG. These file formats roughly consist of two parts. On the one hand, there is the algorithm that compresses the image; the PNG format, for example, uses ZLIB and thereby supports multiple modes [25]. The other part is the encoding of the image into the file-format-specific notation. Similar to image quantization, there are two types of image file formats. The first is the lossless compression type $C_{\mathrm{lossless}}$, whose goal is to reconstruct the source image without any information loss:

$$x = C_{\mathrm{lossless}}^{-1}(C_{\mathrm{lossless}}(x)). \tag{5}$$

The second type is the lossy compression $C_{\mathrm{lossy}}$, which accepts differences between the reconstructed and the source image. This can be defined as

$$x \approx C_{\mathrm{lossy}}^{-1}(C_{\mathrm{lossy}}(x)). \tag{6}$$

The two compressive image format types from (5) and (6) can be combined as $C = \{C_{\mathrm{lossless}}, C_{\mathrm{lossy}}\}$ to define image compression with file formats. All in all, the task is to embed and compress the source image $x$ into several selected formats. This can be defined as

$$x_c = C(x) \tag{7}$$

where $x_c$ stands for the compressed image. This image compression process can be done in linear time.
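The distinction between (5) and (6) can be checked directly with a round trip through each format type; a small sketch, assuming 8-bit RGB input:

```python
from io import BytesIO

import numpy as np
from PIL import Image

def roundtrip(x: np.ndarray, fmt: str, **save_args) -> np.ndarray:
    """Encode x into a file format in memory and decode it again."""
    buf = BytesIO()
    Image.fromarray(x).save(buf, format=fmt, **save_args)
    buf.seek(0)
    return np.asarray(Image.open(buf).convert("RGB"))

x = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
# Lossless (5): PNG reconstructs the source image exactly.
assert np.array_equal(roundtrip(x, "PNG"), x)
# Lossy (6): JPEG generally does not, especially at low quality factors.
assert not np.array_equal(roundtrip(x, "JPEG", quality=10), x)
```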

BMP: The image format $C_{\mathrm{BMP}} = \mathrm{BMP}(x)$ is a lossless image format that supports eight bits per RGB layer. This file format does not use a compressor and stores the pixel values inefficiently: since the format natively supports a total bit depth of 24 bits for a three-layer image, lower bit depths of a source image $x$ or quantized image $x_q$ have to be stored in a 24-bit container [26].

TIFF: The image format $C_{\mathrm{TIFF}} = \mathrm{TIFF}(x)$ is a lossless image format that supports a wide range of bit depths. Additionally, TIFF can hold more than the three $\{R, G, B\}$ channels (plus an additional band for alpha values); for example, it is possible to save the 13 channels of a multispectral image into one single file [20].

PNG: The image format $C_{\mathrm{PNG}} = \mathrm{PNG}(\mathrm{ZLIB}(x))$ is a lossless image compression format and supports a wide range of bit depths. The file format compresses the source images with the ZIP library, which supports nine compression levels (CL). While a level of zero stores the PNG as an uncompressed image, a level of one yields the fastest compression and a level of nine a maximally compressed image [25]. Fig. 3 visualizes the different (selected) options; one can clearly see that, especially in a small-scale image, there are no visible differences between the individual levels.
JPEG: The image format $C_{\mathrm{JPEG}} = \mathrm{JPEG}(\mathrm{DCT}(x))$ is a lossy image format that supports eight bits per RGB layer, while in theory depths between one and 24 bits per layer are possible. Additionally, the file format uses the quality factor (QF) parameter to adjust the level of compression. While a value of 100 yields an almost uncompressed image, a QF of one indicates a maximally compressed image [27], [28], [29]. Fig. 4 visualizes different QFs; one can clearly see that very low QFs lead to a high information loss and change the color space of the images.

Fig. 1 shows the architecture of the setup in this article: after compressing the images, the next step is the classification of the images $x_{qc}$. The task of the CNN is to estimate the class $\hat{y}$ and to compare different combinations of image quantization and embedding methods. First, we quantize the image with the described algorithm $J_{\mathrm{RGB}}(x, 2^b)$ to bring it to a given target bit depth $b$. This step reduces the number of available bits for each $\{R, G, B\}$ pixel value. In edge cases, such as compressing a 24-bit source image to a two-color image, the objects are no longer differentiable by color, only by their boundaries. Second, we embed the images $C(x)$ into a file format that might compress the data, depending on its type. While PNG's compressor ZLIB applies a lossless compression, JPEG causes an information loss, especially at low QFs, which promotes the occurrence of artifacts, such as blocking, in the images. The combination of these two applied compression algorithms can be defined as

$$x_{qc} = C(J_{\mathrm{RGB}}(x, 2^b)). \tag{8}$$
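In practice, embedding an image under the format parameters described above maps onto one save call per format; a sketch with Pillow, where the file names are illustrative:

```python
from PIL import Image

img = Image.open("patch.png").convert("RGB")  # illustrative input file

img.save("patch.bmp", format="BMP")                        # no compressor
img.save("patch.tif", format="TIFF")                       # lossless container
img.save("patch_cl9.png", format="PNG", compress_level=9)  # ZLIB level CL = 9
for qf in (1, 10, 25, 50, 75):                             # lossy QF sweep as in the text
    img.save(f"patch_qf{qf}.jpg", format="JPEG", quality=qf)
```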

C. Image Classification Using the CNN
Furthermore, we can define the classifying CNN as $F(\cdot)$ with parameters $\Theta$. This classifier estimates the label $\hat{y}$ of an input image-label pair $(x, y)$ by $\hat{y} = F_\Theta(x)$. In order to train the network, the loss $L(\hat{y}, y)$ needs to be minimized. The training can formally be defined as

$$\Theta^* = \arg\min_\Theta \sum_{(x, y) \in D} L(F_\Theta(x), y) \tag{9}$$

where $\Theta^*$ describes the optimal parameters and $D$ denotes the dataset $x$ belongs to. Finally, the process of compressing an input image and estimating its class can be formally defined as

$$\hat{y} = F_{\Theta^*}(C(J_{\mathrm{RGB}}(x, 2^b))). \tag{10}$$

D. Evaluation
Outperforming baselines is not the scope of this work, nor is a comparison of existing image quantization methods. Image compression can have a significant impact on the data, especially on the size of the color space, and aggressive compression also influences the storage size. The focus here is to evaluate the impact that aggressive compression has on the result. Therefore, the input images are quantized to a range of bit depths and embedded into four different file formats, and the performance metrics are compared over all parameter combinations. Additionally, selected models are deployed to custom hardware to evaluate whether there are any changes in performance or power consumption.

IV. CLASSIFICATION OF COMPRESSED IMAGES
This section applies the architecture built in Section III to three datasets to evaluate the impact of the different compression methods. To evaluate the performance on the datasets and the impact of the compression, we have chosen the CNNs VGG16, ResNet50, and MobileNet, because these are basic networks that are, among other things, sensitive to colors and the color space. For each dataset, the images are scaled to a resolution of 224 × 224 pixels. All networks are trained for ten epochs with the original ImageNet top-layer configuration (see [3], [4], [5]) and a batch size of 128 (except for AID@ResNet50, where the batch size is 32). Additionally, all models are pretrained on ImageNet. Furthermore, the Adam optimizer has been chosen with a learning rate of $10^{-4}$, except for the VGG16 and MobileNet models on the dataset AID, which are trained with a learning rate of $10^{-5}$. All experiments are run on two Nvidia Tesla A100 GPUs with 80 GB of memory each.
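A sketch of this training configuration in Keras follows; since the exact data pipeline is not spelled out in the text, the placeholder `train_ds`, the class count, and the reconstructed VGG16 top layers are assumptions:

```python
import tensorflow as tf

NUM_CLASSES = 35  # e.g., RSI-CB256; placeholder, set per dataset

# VGG16 pretrained on ImageNet, fine-tuned on 224 x 224 patches; the two
# 4096-unit layers mirror the original VGG16 top-layer configuration.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds.batch(128), epochs=10)  # train_ds: (image, label) pairs
```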
While the datasets are presented next, the remainder of this section evaluates the classification results on hardware. We consider two different kinds: the first is a GPU, and the second covers the evaluation of model inference on custom hardware, specifically a field programmable gate array (FPGA). This type of flexible and programmable hardware is widely researched in current space-industry projects. Additionally, certain versions of FPGAs are labeled as radiation safe, which is desirable for a range of onboard processing tasks on an embedded platform.

A. Datasets
This section describes the datasets that are used in this work. An overview of the datasets is represented in Table I.
RSI-CB256: The land cover dataset RSI-CB256 is a collection of satellite images from different areas of the world, provided by [18]. This includes urban areas, such as New York or Berlin, as well as larger areas like entire countries. The dataset contains in total 24,747 images in 35 different classes grouped into seven parent classes. The image categories form a single-class problem, meaning that each image can be uniquely assigned to one specific class. Furthermore, the images are satellite-based remote sensing data with a ground resolution between 0.22 m and 3 m per pixel. Each image has a height and width of 256 pixels in the sRGB color space [18]. The total storage size of the dataset is 4,734 MiB.
Unfortunately, this dataset comes without spatial metadata and without a train-test split. Hence, we do not know the origin of each image patch, and it is not unlikely that spatial correlation between the train, test, and validation sets leads to an overestimation of the model performance in terms of accuracy. However, as the train-test split is the same across all our experiments, this does not affect any of the hypotheses.
EuroSAT: The dataset EuroSAT is a collection of georeferenced Sentinel-2 satellite images for large-scale patch-based land use and land cover classification, provided in [30]. Sentinel-2 images are acquired by multispectral sensors onboard the satellites covering 13 bands and are freely accessible through the Copernicus Earth observation program. The dataset consists of images from 34 European countries and contains ten classes. The images have a pixelwise resolution of 10 m, and the image patches are 64 × 64 pixels in size. Each class contains 2,000 to 3,000 images, accumulating to a total of 27,000 images. The images considered here are optical images with three channels, i.e., red, green, and blue (RGB), available as TIFF files with a bit depth of 8 bits per channel. The total storage size of the dataset is 3,904 MiB.

Fig. 5. Each source dataset has been trained and serves as a comparative baseline; shown is a sample from the dataset RSI-CB256 on VGG16 (x-axis: epochs; y-axis: training accuracy / loss).
AID: The AID dataset is a collection of satellite images taken from Google Earth, provided by [31]. Each image corresponds to one of 30 classes, for example, airport or forest. The dataset includes 10,000 images with a height and width of 600 pixels. As with RSI-CB256, the dataset has no spatial metadata. Furthermore, the images are stored in the JPEG format with a QF of 100. The total storage size of the dataset is 2,616 MiB.

B. Performance Comparison on GPU
The demonstrated architecture is designed to evaluate the impact of compression on images with spatial context. This work focuses on two hardware types: the first is a GPU, the second is custom hardware, such as an FPGA. This part of the section covers the impact of compression when using GPUs.
1) Classification Baseline: Before evaluating the impact of compression, we need to create a baseline to compare the results against. The scope here is not to reach the current state-of-the-art models and accuracies, but rather to create stable models that are unlikely to overfit. This matters more than outperforming the current baselines, because overfitting must be kept to a minimum across the wide range of bit depths and image file formats used for the datasets. All models are trained as explained before, pretrained on ImageNet and using the original top-layer configuration. Furthermore, each dataset has been split into three parts, and the evaluation of the models is done on a separate test set. Fig. 5 visualizes the training process on the dataset RSI-CB256 as a sample. Additionally, the baseline classification results are stated in Table II.

2) Evaluation on the Basis of Accuracy: As described theoretically, each dataset is quantized and embedded into different file formats. We applied a very harsh quantization approach by taking the k-means of a representative set of a few million pixels sampled from the whole dataset, with $k = 2^b$ and $b = \{1, 2, 3, 4, 5, 6, 7, 8\}$. This results in $k$ cluster centers, each representing a color. Next, the image is quantized by assigning each pixel to its nearest cluster center and storing the representative color of the cluster in a new image. Such an image essentially encodes dominant and high-contrast spatial structures and is difficult to interpret even for humans. Additionally, all datasets are embedded into the image formats TIFF, BMP, PNG with a ZLIB compression level CL = 9, and JPEG with a range of QFs, QF = {1, 10, 25, 50, 75}.
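A sketch of this dataset-wide variant (shared cluster centers fitted once on sampled pixels, then applied to every image) could look as follows; the sampling size and names are ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_palette(images, b: int, n_samples: int = 2_000_000, seed: int = 0) -> KMeans:
    """Fit 2**b shared cluster centers on pixels sampled across the dataset."""
    rng = np.random.default_rng(seed)
    # For a sketch we pool all pixels; a streaming sampler would scale better.
    pool = np.concatenate([img.reshape(-1, 3) for img in images]).astype(np.float64)
    idx = rng.choice(len(pool), size=min(n_samples, len(pool)), replace=False)
    return KMeans(n_clusters=2 ** b, n_init=1, random_state=seed).fit(pool[idx])

def apply_palette(img: np.ndarray, km: KMeans) -> np.ndarray:
    """Replace each pixel by the representative color of its nearest center."""
    labels = km.predict(img.reshape(-1, 3).astype(np.float64))
    return km.cluster_centers_[labels].round().astype(np.uint8).reshape(img.shape)
```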
The evaluation of the quantized images is done with the trained baselines. Fig. 6(a) shows the classification results for all trained parameter configurations on the dataset RSI-CB256 with the model VGG16. The reference line, i.e., the quantized image embedded into a TIFF container, shows a high accuracy across all bit depths. It is also apparent that many bit depths reach high performance. While the original dataset reaches an accuracy of 98% and requires a storage size of 4,734 MiB, other configurations reach comparable values. For example, PNG with a ZLIB compression level CL = 9 partly reaches higher values, specifically an accuracy of 96.53% at $k = 7$ with a required space of 958 MiB. Another example is JPEG with QF = 10 and $k = 4$, which reaches an accuracy of 97.25% at a required space of 107 MiB. This means a performance drop of 0.75% while using 2.26% of the originally required disk space. Additionally, it is visible that stronger compression and quantization parameters tend to cause a higher performance drop.
The results for the models ResNet50 [Fig. 6(b)] and MobileNet [Fig. 6(c)] are comparable to VGG16. The performance decrease between JPEG with QF = 1 and the rest is larger. Additionally, lower bit depths tend to have a slightly higher performance. All other parameter combinations cluster within a smaller bandwidth, which shows a larger degree of robustness. This can have two reasons: first, the small color space sets a focus on the characteristic features of the image, and second, a tendency to overfit.
The results for the dataset EuroSAT, visualized in Fig. 6(d), are similar in the case of VGG16. For ResNet50 and MobileNet, the performance drop of JPEG with QF = 1 is larger. Additionally, the performance is higher for smaller color spaces, more pronouncedly than was the case for RSI-CB256. The reduction of the color space sets a scope on the characteristic features of the image, which leads to these performance gains.
The results for the dataset AID, visualized in Fig. 6(g), differ in some aspects. It is interesting to see that, even though the key aspects of the datasets are similar except for the number of images and the image resolution, the accuracies are distributed over a wider bandwidth at higher quantization rates, while the bandwidth of the accuracies narrows as the number of bits per channel increases. The reference image file format for this dataset reaches high accuracies and is partly still outperformed by PNG with CL = 9 and by JPEG with QF = 10 and higher.
All in all, the experiments lead to similar outcomes. Very high compression rates, such as the image format JPEG with QF = 1 and $k = 1$, tend to decrease the performance significantly in the presented cases, and the models tend to overfit more easily in those edge cases; at the same time, compression supports the CNN by setting the focus of the images on the most characteristic features. Consequently, some combinations of quantization and compression do support the classification task and increase performance. Additionally, the results support the hypotheses, because even with a small fraction of the information it was possible to achieve a performance comparable to the presented baseline classifications (see Table II).
3) Evaluation With Respect to the Training Time: Besides comparing accuracies, the time needed for training can also be a deciding factor, as it often drives either vertical or horizontal scaling of hardware. Therefore, it is also considered in this work. In order to compare the models based on time, the training time for each dataset and configuration has been measured. Here, the precise hardware configuration of the training setup is not the point; more important is the comparison of how much time is saved in the overall training process for each parameter configuration. On this basis, we evaluate the impact of quantization and of the embedding into file formats on images with spatial information.
The training time on the dataset RSI-CB256 using the model VGG16 is visualized in Fig. 7(a), where each line represents a specific quantization and image format configuration. The reference line here stands for the image file format TIFF. The visualization shows three important aspects. First, the reference line, TIFF, is the slowest at all quantization levels and needs between 400 and 420 s of training time. It is interesting to see that if the images are embedded into the file format PNG with a ZLIB compression level CL = 9, the training time can be reduced to 350 s at high quantization rates and 310 s at lower bit-depth compression rates. Additionally, the formats BMP and JPEG reduce the training time to consistently less than 300 s, which is a speedup of 25%. The models ResNet50 and MobileNet achieved similar result patterns; only the bandwidth of the needed training time differs.
The dataset EuroSAT, visualized in Fig. 7(d), delivers roughly similar results. In the case of the VGG16 models, the results distribute in a similar pattern relative to each other, but the absolute training times differ, which is not surprising. The results for the dataset AID trained on VGG16, visualized in Fig. 7(g), show slightly different patterns relative to each other than the previous datasets. For this experiment, the dataset was stored using the file format PNG with ZIP compression; the reference line has been stored using PNG with a compression level CL = 9, which requires 2,923 MiB on disk. This configuration leads to a training time of around 150 s on average with the model VGG16. The file formats BMP, PNG using ZLIB with CL = 9, and TIFF lead to longer training times than the reference file format JPEG with QF = 100. Even with respect to the accuracy of the estimated labels, JPEG reaches similar and better results than the other file formats. For example, JPEG with QF = 75 and $k = 8$ reaches an accuracy of 90.8% with a training time of 139 s. Compared to TIFF with $k = 8$, which reaches an accuracy of 90.8% with a training time of 240 s, PNG with CL = 9 and $k = 8$ reaches an accuracy of 90.13% with a training time of 182 s using VGG16; this configuration saves 24.16% of training time compared to the file format TIFF. JPEG with QF = 100 (reference) and $k = 8$ outperforms this value and reaches an accuracy of 90.7% with a training time of 152 s. Although this may not amount to much time for this particular dataset, it can have a significant impact on the speed of the process and the prediction result for very large datasets. While MobileNet achieved similar result patterns, with only the time frame of the training differing, the performance in terms of time for ResNet50 is not as stable and consistent as for the other models.
Furthermore, this shows that time can be saved by a careful choice of file format and quantization level. The right compression does not only lead to similar or slightly better results, it also saves computation time on the hardware in each training cycle. As with the evaluation on the basis of accuracy, a lower training time could be achieved for specific parameter combinations; for example, JPEG with QF = 75 leads to a lower training time at a comparably small decrease in accuracy.

C. Compressed Image Classification on Custom Hardware
When considering the performance of a CNN, there are several aspects beyond the number of correctly estimated labels. One example is how long a device needs to predict the labels of a defined set of objects; another is the amount of energy the computing device needs to estimate the labels. In this part of the work, we evaluate the runtime and the power consumption on custom hardware. This evaluation is done on an FPGA, specifically a Xilinx ZCU102.
Herein, the objective of testing on an FPGA is to evaluate the inference performance of models trained elsewhere (e.g., on CPUs or GPUs) in an embedded setup and to investigate the gains in inference speed and power consumption. To limit the complexity, we select a specific dataset and models from the previous subsection. There are several aspects to consider when choosing a dataset: to show the impact clearly, the time difference for the training between the reference image file format and the other formats should be maximal; additionally, the performance in terms of accuracy should be robust over the different quantization levels (as shown in Fig. 6). Based on these criteria, the EuroSAT dataset with the network VGG16 has been chosen.
After selecting the dataset and the model trained on the GPU, the model has been deployed to the FPGA. In this process, the model has been quantized to 8-bit integers. Additionally, a small subset of 1,000 images has been selected from the evaluation set of the dataset for the experiments on the FPGA. On the hardware, the default configuration uses three DPUs and only one thread.
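The deployment in this article goes through the Xilinx DPU toolchain, whose API is not reproduced here; as a stand-in, the following sketch shows the analogous 8-bit integer post-training quantization with TensorFlow Lite, with calibration data and shapes being illustrative:

```python
import numpy as np
import tensorflow as tf

def representative_data():
    # Small calibration set standing in for samples from the
    # evaluation split (shapes and values are illustrative).
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

# `model` is the Keras model trained on the GPU.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_int8_model = converter.convert()  # fully 8-bit integer model
```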
Three important rails need to be considered to measure the power consumption of the FPGA. First is the I/O rail MGT, which is relevant for the codecs and the video signals; it stays near the idle power most of the time because these components are not utilized. Second is the energy level of the programmable logic (PL) part of the power management. Last is the power consumption of the processor system (PS). The total power consumption is therefore the sum of these three contributions. Fig. 8 visualizes the comparison of the power consumption between the reference image format [see Fig. 8(a)] and the JPEG format with QF = 50 [see Fig. 8(b)], sampled every 15 ms. The total power consumption graph consists of multiple sections: right after the start of recording, the graph jumps from an idle power of 8.1 W to 8.3 W, which indicates the start of the program. The section between the start and the next jump at index 995 is the preprocessing of the images, which in this case only covers reading them. The last section, where the power reaches a maximum of 13.31 W, is the phase where the labels are estimated using the DPU. One can clearly see that the consumption is higher for the PL than for the PS power. Additionally, there is no significant difference between the power consumption when using the TIFF format and JPEG. While there is no noticeable difference in terms of energy, there is a difference in the runtime of the two programs. The main impact lies in the preprocessing part, where reading the TIFF images takes 13.4 s and reading the JPEG files takes 10.7 s, a saving of 25.23% of preprocessing time per single run. Table III gives an overview of the runtime in seconds for each presented model. This factor becomes even bigger when using four threads instead of one, because only the time needed for the inference itself becomes shorter, while the preprocessing phase stays comparable (TIFF: 15.2 s, JPEG: 12.7 s). Compared to the total runtime of the task, larger performance increases can be reached by the right choice of threads; nevertheless, the combination minimizes the total runtime. Fig. 9 visualizes the impact of using four threads instead of only one. The total runtime is influenced by a wide range of factors: besides considering other image file formats to optimize the preprocessing time before estimating the labels, there is the choice of the CNN architecture. While VGG16 has a deep architecture, networks like ResNet50 have properties that counteract the cost of this depth. Fig. 10 shows the power consumption of the different phases for the network ResNet50. The inference time is significantly shorter than that of VGG16; consequently, the preprocessing takes an even larger share of the total runtime.

V. CONCLUSION
In this article, we have investigated the problem of training a CNN on images that have been quantized to smaller color spaces and then embedded into different file formats. The target was to leverage different image compression methods to improve performance in terms of time, accuracy, or power consumption, studied on a GPU and an FPGA.
On the GPU, three different datasets have been trained on three different CNN architectures. These datasets have been quantized to a range of smaller color spaces, and each image has additionally been embedded into a number of selected file formats. The evaluation on these benchmark datasets showed that some combinations lead to faster training than others. For example, a compromise between classification accuracy and training time, such as compression using JPEG with a QF of 75, reduces the training time drastically compared to the file format TIFF. This performance gain comes with a comparably small reduction in the number of correctly estimated labels.
Furthermore, some quantization and compression parameters have been selected based on their performance in terms of time and accuracy. The evaluation showed that a drastic reduction in energy consumption could not be reached in this setup with this architecture. However, the time for reading the images differs between file formats, which leads to higher efficiency and shorter computation time for each program; for example, JPEG outperformed TIFF with respect to the computation time of the full program. While a quarter of the preprocessing time can be saved by selecting JPEG over TIFF at a comparatively small loss of accuracy, the selection of a different network architecture reduces the overall computing time drastically.
Based on these results, the proposed approach showed that image quantization and embedding into file formats can lead to accuracies without a significant performance loss compared to the baselines. This is because the compression changes the feature map and sets the focus on certain aspects of the image. Additionally, the efficiency of the training and of the image processing on custom hardware has been increased. In future work, one can benefit from the findings of this article by extending the proposed approach to related remote sensing tasks, for instance, hardware-oriented remote sensing data compression and the processing of high-dimensional satellite images (e.g., HSI).
Gabriel Dax received the B.Sc. and M.Sc. degrees in computer science from the Salzburg University of Applied Sciences, Salzburg, Austria, in 2017 and 2019, respectively. He is currently working toward the Ph.D. degree with the Professorship of Big Geospatial Data Management, the School of Engineering and Design, Technical University of Munich, Munich, Germany.
His research interests include but are not limited to data compression, deep learning on custom hardware, and probabilistic data structures.