DeepCGP: A Deep Learning Method to Compress Genome-Wide Polymorphisms for Predicting Phenotype of Rice

Genomic selection (GS) is expected to accelerate plant and animal breeding. During the last decade, the volume of genome-wide polymorphism data has increased, raising concerns about storage cost and computational time. Several studies have attempted to compress genome data or to predict phenotypes. However, compression models often fail to preserve adequate data quality after compression, and prediction models are time consuming and rely on the original, uncompressed data. A combined application of compression and genomic prediction modeling using deep learning could resolve these limitations. We propose a Deep Learning Compression-based Genomic Prediction (DeepCGP) model that compresses genome-wide polymorphism data and predicts phenotypes of a target trait from the compressed information. The DeepCGP model contains two parts: (i) an autoencoder model based on deep neural networks to compress genome-wide polymorphism data, and (ii) regression models based on random forests (RF), genomic best linear unbiased prediction (GBLUP), and Bayesian variable selection (BayesB) to predict phenotypes from the compressed information. Two rice datasets with genome-wide marker genotypes and target-trait phenotypes were used. The DeepCGP model achieved up to 99% prediction accuracy for a trait even after 98% compression. Among the three methods, BayesB required the most computational time but showed the highest accuracy; it could only be applied to the compressed data. Overall, DeepCGP outperformed state-of-the-art methods in terms of both compression and prediction. Our code and data are available at https://github.com/tanzilamohita/DeepCGP.


INTRODUCTION
By 2050, 70% more food production will be required to keep pace with the expected increase in food demand and ongoing climate change on a global scale [1]. To meet this challenge, we need to enhance genetic gains in plant breeding through novel technologies [2], [3]. One such technology is the use of genome-phenotype associations [4], [5], [6], which include genome-wide association studies (GWAS) [7] and genomic selection (GS) [8]. In GWAS, candidate genes are discovered based on associations, and SNPs with a strong impact on the trait are selected. GS, on the other hand, usually does not aim to select important SNPs but to predict genotypic values based on all SNPs. When selecting SNPs, a major issue is the correlation among SNPs (linkage disequilibrium). GS is expected to be effective in improving complex traits (e.g., crop yield) that are controlled by a large number of genes and have therefore been difficult to improve [9].
The use of genomic data is progressing in various fields, and a massive amount of genomic data has been generated [10] as a resource for plant breeding [6]. Furthermore, with the introduction of high-throughput sequencing technologies, the number of data samples also tends to be large, resulting in challenges for the storage and analysis of genomic data in the fields of genomics, bioinformatics, and quantitative genetics [11]. Moreover, the increasing size and dimensionality of data [12] have intensified the need for data compression and compression-based data analysis. The ability to compress genomic data will not only make it easier to store and analyze data, but also aid in streamlining the exchange of data via Web APIs [13], [14].
To effectively analyze high-dimensional data, deep learning (DL) techniques [15] have been introduced in various fields, including genomics, genetics, and breeding. Several DL methods exist [16], [17], [18], [19] that can compress genomic data without compromising model performance. Wang et al. introduced DeepDNA, a single-sequence compression method that compresses human mitochondrial genome data using hybrid convolutional and recurrent deep neural networks [20]. In DeepDNA, each compressed sequence can have a different dimension even when the input sequences are all of the same size. Goyal et al. introduced DeepZip, which uses recurrent neural networks to compress single-sequence genomic and text data [21]. A few recent studies have compressed genomic data without deep learning. Yilmaz et al. introduced Macarons, a non-deep-learning SNP selection method that uses the correlations between SNPs to avoid selecting redundant SNP pairs [22]. Macarons is fast, but it must select SNPs separately for each trait.
For GS, accurate prediction of phenotypes (strictly speaking, genotypic values) of a target trait is a central and recurring problem in quantitative genetics. Consequently, several genomic prediction methods have been proposed based on machine learning [23], [24], [25], [26], [27], [28] and quantitative genetic models, especially under a Bayesian paradigm [8], [29], [30], [31]. González et al. compared BayesA and Bayesian LASSO with two machine learning algorithms (boosting and random forests [RFs]) to predict disease occurrence in simulated and real datasets [32]. Although the differences between the methods were small, RF outperformed the other methods in most cases. Abdollahi-Arpanahi et al. compared the predictive performance of two deep learning methods (multilayer perceptron [MLP] and convolutional neural network [CNN]), two ensemble learning methods (RF and gradient boosting), and two parametric methods (genomic best linear unbiased prediction [GBLUP] and BayesB) using real and simulated datasets [33]. The authors pointed out that the predictive performance of the deep learning methods was marginally better than that of the parametric methods for large datasets.
Generally, previously proposed methods in the literature had the following limitations: (i) the quality of the information after compression was uncertain, and (ii) the original data were used for prediction by machine-learning methods. In contrast, the method proposed in this study predicts phenotypes of target traits based on compressed genome-wide polymorphism data instead of original (i.e., uncompressed) data. Despite several cycles of compression, the proposed method retains high-quality information, and its prediction accuracy is similar to that of genomic prediction based on the original data, which quantifies the quality of our compression method. Furthermore, we used multiple autoencoder networks, whose calculation cost increases linearly with the number of genome-wide polymorphisms (i.e., the dimension of genomic data), whereas the calculation cost of other popular methods increases quadratically, which is another novel aspect of the proposed method (Supplementary Section S1, which can be found on the Computer Society Digital Library, available online). To the best of our knowledge, no existing method in animal and plant breeding predicts the phenotypes of a target trait from compressed genome-wide polymorphism data using deep learning.
In this study, we developed a deep learning approach, Deep Learning Compression-based Genomic Prediction (DeepCGP), to compress high-dimensional genome-wide polymorphism data and predict phenotypes (estimated genotypic values) of rice agronomic traits from the compressed information. DeepCGP consists of two models: (i) an autoencoder model to compress genome-wide polymorphism data; and (ii) a regression model to predict the phenotypes of a target trait based on the compressed genome-wide marker data. To demonstrate the usage of DeepCGP, we used two different rice genome datasets: C7AIR, consisting of 7098 SNPs (single-nucleotide polymorphisms), and HDRA, consisting of 700000 SNPs. We demonstrated that DeepCGP could predict the phenotypes of a target trait from compressed genome-wide polymorphism data with accuracy similar to prediction based on the original genome-wide polymorphism data. Additionally, we compared the compression-based prediction performance of three genomic prediction methods (GBLUP, BayesB, and RF) to assess the general potential of compression-based genomic prediction across regression-model choices.

Deep Autoencoder
To compress genome-wide polymorphism data, we used a deep autoencoder [34], [35] (Fig. 1). This autoencoder is composed of two symmetrical deep belief networks with multiple hidden layers: (i) an encoder network that maps an input x_i ∈ R^d to a hidden representation h(x_i)^(l+1) as in Equation (1), and (ii) a decoder network that maps the hidden representation h(x_i)^(l+1) back into a reconstruction x'_i^(l) as in Equation (2):

h(x_i)^(l+1) = f(W^(l) x_i^(l) + b^(l)),    (1)

x'_i^(l) = g(W'^(l) h(x_i)^(l+1) + b'^(l)),    (2)

where f is the encoding activation function, W^(l) is the encoding weight matrix, b^(l) is the encoding bias vector, g is the decoding activation function, W'^(l) is the decoding weight matrix, and b'^(l) is the decoding bias vector from the l-th (input) layer to the (l+1)-th (hidden) layer. The activation function of each layer except the middle (code) layer and the decoder output layer is ReLU [36], which maps negative values to zero. The activation function of the middle and decoder output layers is the sigmoid [36], which maps outputs to the range [0, 1].
The reconstruction error was calculated with the mean squared error (MSE) function:

MSE = (1/n) Σ_{i=1}^{n} (x_i − x'_i)^2,

where x_i and x'_i are the measured and reconstructed values, respectively, and n is the number of measured values, with i ∈ [1, n].
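The encoder, decoder, and MSE described above can be sketched in NumPy. This is an illustrative single-layer version with random (untrained) weights and hypothetical layer sizes (28 inputs, code size 3, matching one separated network), not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # ReLU: maps negative values to zero (used in the non-code hidden layers)
    return np.maximum(z, 0.0)

def sigmoid(z):
    # Sigmoid: maps outputs into [0, 1] (used in the code and output layers)
    return 1.0 / (1.0 + np.exp(-z))

d, code = 28, 3                                     # hypothetical sizes
W, b = 0.1 * rng.normal(size=(code, d)), np.zeros(code)   # encoder parameters
W2, b2 = 0.1 * rng.normal(size=(d, code)), np.zeros(d)    # decoder parameters

def encode(x):
    # h(x) = f(Wx + b); sigmoid here because this is the middle (code) layer
    return sigmoid(W @ x + b)

def decode(h):
    # x' = g(W'h + b'); sigmoid in the decoder output layer
    return sigmoid(W2 @ h + b2)

x = rng.integers(0, 2, size=d).astype(float)        # one-hot-style toy input
x_rec = decode(encode(x))
mse = np.mean((x - x_rec) ** 2)                     # reconstruction error (MSE)
```

Because both the input and the sigmoid reconstruction lie in [0, 1], the MSE is bounded by 1 here.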

Overview of DeepCGP
The DeepCGP model (Fig. 2) can compress genome-wide polymorphism data and use the compressed information to predict rice phenotypes.
DeepCGP consists of two models. The first is an autoencoder model that compresses genome-wide polymorphism data. The second is a regression model that takes the compressed information generated by the autoencoder as input and attempts to predict the genotypic values of a target trait. Regression models such as random forests (RF), GBLUP, BayesB, etc., can be used for prediction.
In the first model, our aim was to compress the genome-wide polymorphism data to the maximum extent. To achieve this, we generated several separated networks and trained the separated autoencoder models. The separated autoencoder models compressed the data, which we define as Compress_1, c_1. To compress the genome data further, c_1 was used as the input, and the separated autoencoder models were trained for a second compression, defined as Compress_2, c_2. In this manner, the separated autoencoder models can compress any genomic data.
After compression, a regression model is established to predict the phenotypes of a target trait for the genotypes (plants/lines). In this study, we used rice germplasm accessions.
In the present study, the models were trained in three steps:
Step 1: The separated autoencoder models were trained to optimize Equation (5) and compress the genome-wide polymorphism data.
Step 2: The compressed information was mapped to the phenotypes of the target traits of the rice germplasm accessions.
Step 3: After mapping each compression level to the phenotypes, a regression model was trained.

Datasets and Data Pre-Processing
In this study, we used two datasets of different sizes to demonstrate the broad applicability of our model. Based on these datasets, we evaluated the model with two metrics: (i) how precisely the data are compressed, where a better compression model is expected to minimize the loss of information during compression, and (ii) how successfully the compressed genome-wide polymorphism data can predict genotypic values of a target trait, where we expect prediction performance close to that obtained with the original data.
C7AIR: The first dataset was the Cornell-IR LD Rice Array (C7AIR) [37], a second-generation SNP array from the Rice Diversity project with 7098 markers genotyped on 189 rice accessions. The 189 lines had estimated genotypic values for plant height.
HDRA: The second dataset was the high-density rice array (HDRA) [38]. The HDRA dataset consists of 1568 diverse inbred rice varieties with 700000 SNPs. Among these lines, genotypic averages of 34 traits were estimated for 388 lines [39], with some missing records. As 29 genotypes had 10 or more missing records and 359 had fewer than 10, we chose 359 lines and 18 traits (Table 1). The genotype dataset was formatted as a bed matrix from the VCF file, in which each entry was scored as 0, 1, or 2, with 1 denoting a heterozygous genotype. Since the accessions used in the study were all inbred lines and were expected to be homozygous at most SNPs, we treated 1 as a missing value and converted 0 and 2 to the categorical values A (adenine), C (cytosine), G (guanine), and T (thymine). We then saved the output in CSV format, using the 'gaston' package [40] in R for this conversion.
We pre-processed the categorical values (A, C, G, and T) for both datasets by applying one-hot encoding with a 4-bit coding scheme; that is, x ∈ R^{d×4}, where d is the length of the genome sequence. "A," "C," "G," and "T" are encoded as "1000", "0100", "0010", and "0001", respectively. The C7AIR and HDRA datasets have approximately 13% and 10% missing genotypes, respectively; we therefore encoded the missing value "N" as "0000" (Supplementary Fig. S2, available online).
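The 4-bit scheme above can be sketched as follows; the function name and array layout are illustrative, not the paper's implementation:

```python
import numpy as np

# 4-bit one-hot codes for the nucleotides; missing ("N") maps to all zeros.
CODE = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0], "T": [0, 0, 0, 1],
        "N": [0, 0, 0, 0]}

def one_hot(seq):
    """Encode a genotype string into a (len(seq) x 4) binary matrix."""
    return np.array([CODE[base] for base in seq], dtype=np.int8)

encoded = one_hot("ACGTN")      # (5, 4) matrix, one row per base
flat = encoded.reshape(-1)      # flattened to length 4*d, as used as model input
```

Flattening each line's (d × 4) matrix yields the 4d-dimensional input vectors (e.g., 4 × 7098 = 28392 for C7AIR) described below.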
After one-hot encoding the raw data, the dimensions of the C7AIR and HDRA data were 189 × 28392 and 1568 × 2800000, respectively. As the dimension of the input data was large, an input data splitting technique was applied, which reduced the computational time. We used NumPy's hsplit to split the one-hot encoded array horizontally (axis=1, i.e., over the 28392 and 2800000 columns for C7AIR and HDRA, respectively). Each split contained 189 × 28 (C7AIR) or 1568 × 28 (HDRA) values, that is, an input layer with 28 neurons in each network. Accordingly, 1014 and 100000 separated autoencoder networks were employed for the C7AIR and HDRA datasets, respectively (Supplementary Fig. S3, available online).
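The splitting step can be sketched with NumPy's hsplit, here on a zero placeholder matrix with C7AIR's dimensions:

```python
import numpy as np

n_lines, n_features, width = 189, 28392, 28   # C7AIR after one-hot encoding
X = np.zeros((n_lines, n_features), dtype=np.int8)

# Split the one-hot matrix column-wise into blocks of 28 features each,
# one block per separated autoencoder network (28392 / 28 = 1014 networks).
splits = np.hsplit(X, n_features // width)
```

Each element of `splits` is then the input to one 28-neuron autoencoder network.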

Implementation for Compression Modeling
An autoencoder model was utilized to compress the genome-wide polymorphism data. Each dataset was divided into training (60%), testing (20%), and validation (20%) sets using scikit-learn's 'train_test_split'. To achieve optimum performance of the compression model for both datasets, we used the Keras wrapper class 'KerasRegressor', which permitted us to tune hyperparameters (Table 2) with scikit-learn's 'RandomizedSearchCV'. Since the dimension of the C7AIR dataset is low, we tuned its hyperparameters on the whole dataset. For the HDRA dataset, we tuned the hyperparameters on a small subset of the training data, i.e., the first 1000 splits, each of size 1568 × 28.
For the C7AIR genotype data, the selected model had three hidden layers in both the encoder and decoder networks. In Compress_1, the input layer of a network has 28 nodes, the first hidden layer has 14 nodes, and the second hidden layer has 7 nodes, with a code size of 3. To compress the data further, the first compression output (c_1) was used as the input to Compress_2. In Compress_2, the input layer of a network has 36 nodes, the first hidden layer has 28 nodes, and the second hidden layer has 10 nodes, with a code size of 5. Both compressions were trained with the Adam optimizer using a learning rate of 0.001. ReLU activation was applied to all layers of the encoder and decoder except the middle and last layers, for which we applied the sigmoid activation function. The model was trained with the MSE loss, and the minibatch size was 52 for Compress_1 and 32 for Compress_2. The number of epochs was set to 200 for both compressions. The architecture selected for the HDRA genotype data was very similar, except for the number of compressions, the training epochs, and the network structure. Because the HDRA dataset has a very high dimension, it was compressed to a greater extent. The numbers of nodes per layer were [28, 14, 7, 3], [30, 15, 5], and [25, 14, 5] for Compress_1, Compress_2, and Compress_3, respectively. Compress_1 was trained for 200 epochs with a batch size of 52, Compress_2 for 100 epochs with a batch size of 32, and Compress_3 for 150 epochs with a batch size of 32. The remaining parameters were the same as those for the C7AIR network.
The compression model was implemented using Keras functional API [41], which is written in Python and built on top of Tensorflow.

Random Forests (RF)
In the present study, random forests (RF) [42], [43] were used to predict the phenotypes of a target trait. RF is an ensemble machine learning algorithm consisting of individual decision trees, often hundreds to thousands, where each tree is built from a bootstrap sample of the original data. The final random forest predictor is computed by averaging the tree predictors; for out-of-bag error estimation, only the trees whose bootstrap sample does not include the given observation are averaged. Each tree minimizes the average mean squared generalization (predictive) error, which is used to assess predictive accuracy. The construction of the RF algorithm can be described in the following steps [44]:
1. Draw ntree bootstrap samples from the original or compressed marker scores.
2. Grow a random forest tree T_b for each bootstrap dataset. At each node:
   i. Randomly select mtry variables for splitting.
   ii. Grow the tree so that each terminal node has no fewer than the minimum node size of cases.
3. Aggregate the predictions of the trees {T_b}_1^B (by majority voting for classification, or by averaging for regression).
The RF regression predictor can be expressed as:

ŷ_i = (1/B) Σ_{b=1}^{B} T_b(x_i),

where each predictor T_b(x_i) is a decision tree [45] constructed from the bootstrap sample drawn at iteration b of the marker genotype scores (or the compressed scores) x_i, for b = 1, ..., B bootstrap samples.
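The tree-averaging described above can be illustrated with scikit-learn's RandomForestRegressor on toy data (the paper itself uses the 'ranger' R package; the dimensions and data here are invented for illustration). The sketch checks that the forest prediction equals the mean of the individual tree predictors T_b(x):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy stand-in for compressed marker scores (n lines x m compressed features)
# and phenotypes; dimensions are illustrative, not the paper's.
X = rng.random((100, 20))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)  # one informative feature

# n_estimators plays the role of ntree; max_features="sqrt" mirrors an
# mtry of sqrt(p) candidate variables per split.
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X, y)

# The forest prediction is the average over the B tree predictors T_b(x).
pred = rf.predict(X[:5])
tree_avg = np.mean([t.predict(X[:5]) for t in rf.estimators_], axis=0)
```

For regression, `rf.predict` and the explicit average over `rf.estimators_` agree, which is exactly the ensemble-average equation above.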

GBLUP and BayesB
Moreover, we used GBLUP and BayesB, two commonly used Bayesian regression methods for genomic prediction.
The GBLUP model equation is:

y = 1μ + Wu + e,

where y is the vector of phenotypes of a target trait, μ is the grand mean, 1 is a vector of ones (all-ones vector), u is the vector of genotypic values, W is the design matrix relating the genotypic values to the samples (i.e., varieties/lines), and e is the vector of residual errors. As we had only one phenotypic record per variety/line in this study, W is an identity matrix of size n, where n is the number of varieties/lines. The vector u is assumed to follow a multivariate normal distribution u ~ N(0, Gσ²_g), where 0 is a vector of zeros (all-zero vector), σ²_g is the genetic variance explained by genome-wide polymorphisms, and G is the genomic relationship matrix calculated as ZZ'/m, where Z is the matrix of original or compressed marker scores and m is the dimension of those scores. Each column of Z is scaled to mean 0 and variance 1 prior to the calculation of G.
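The genomic relationship matrix described above can be computed in a few NumPy lines; the marker scores and dimensions here are toy values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy marker-score matrix Z (n lines x m markers or compressed scores).
n, m = 50, 200
Z = rng.integers(0, 2, size=(n, m)).astype(float)

# Column-standardize Z (mean 0, variance 1), as described in the text,
# then form the genomic relationship matrix G = ZZ'/m.
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
G = Z @ Z.T / m
```

With this scaling, G is symmetric and its diagonal averages to 1, the usual normalization for a genomic relationship matrix.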
The model equation of BayesB is:

y = 1μ + Xa + e,

where X is the matrix of unscaled original or compressed marker scores, and a is the vector of the corresponding marker effects. When the marker scores are uncompressed, each element of X represents an SNP genotype, where 0 represents the homozygous genotype of the reference allele and 1 represents the homozygous genotype of the non-reference allele. When the marker scores are compressed, each element of X takes a value of 0 or 1 according to the compressed data. The prior distribution of a marker effect a_k (the k-th element of a) is assumed to be normal with zero mean and marker-specific variance σ²_{a_k}, and each variance σ²_{a_k} is assumed to follow the same scaled inverse chi-square distribution. A detailed explanation of the BayesB model can be found in [29], [31].

Implementation for Prediction Modeling
In the present study, three prediction models, RF, GBLUP, and BayesB, were used. We used the compressed information as input, extracting the compressed data as a matrix from each dataset. We prepared the estimated genotypic values of a target trait, omitting missing entries, and arranged them in the same order as the compressed data. A prediction model was built separately for each trait. To evaluate the accuracy of the prediction models and to compare accuracy among compression levels, 10-fold cross-validation with five repetitions was applied, and the results were averaged. We used the same folds for all compression levels so that the results were directly comparable. Prediction ability was evaluated as the correlation coefficient between the estimated and predicted genotypic values. We also evaluated the accuracy of a prediction model based on the original uncompressed genome-wide polymorphism data; before building this model, we converted the original data by coding the A, T, G, and C genotypes as 0 and 1 and replacing missing values (NA) with the corresponding averages of the 0s and 1s. The RF model was implemented using the 'ranger' R package [46], one of the fastest and most memory-efficient packages for analyzing high-dimensional data [42]. To train the RF model, we used the default settings of the 'ranger' function (num.trees: 500; mtry: the square root of the number of variables). To implement GBLUP and BayesB, we used the 'BGLR' package [47] in R. The MCMC (Markov chain Monte Carlo) sampler was run for 25000 iterations with a 5000-iteration burn-in period for both GBLUP and BayesB.
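The repeated cross-validation scheme above, with correlation as the accuracy measure, can be sketched as follows (in Python with scikit-learn rather than the paper's R packages; the data, model, and dimensions are toy stand-ins). Fixing `random_state` is one way to reuse identical folds across compression levels, as the paper requires:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold

rng = np.random.default_rng(3)
X = rng.random((120, 15))                        # toy compressed marker scores
y = X @ rng.normal(size=15) + rng.normal(scale=0.1, size=120)  # toy phenotypes

# 10-fold cross-validation with five repetitions, as in the paper; the fixed
# random_state makes the same folds reusable for every compression level.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=42)
cors = []
for train, test in cv.split(X):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train], y[train])
    # Prediction ability = correlation between observed and predicted values.
    cors.append(np.corrcoef(y[test], model.predict(X[test]))[0, 1])

accuracy = float(np.mean(cors))   # averaged over the 50 fold-repetition pairs
```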
All experiments in this study were conducted on a PC with an Intel(R) Core(TM) i9-10980XE 3.00 GHz CPU, 128 GB RAM, an RTX 3090 GPU, and a 64-bit Windows 10 Pro operating system.

Compression of Genome-Wide Polymorphism Data

The first experiment in this study was aimed at demonstrating the compression ability of DeepCGP for genome-wide polymorphism data. The C7AIR and HDRA datasets were used to train the separated stacked autoencoders, and the model was evaluated by calculating the training time and information loss. Furthermore, the compression ratios were calculated for both datasets; the compression ratio is defined as the dimension reduction relative to the uncompressed size:

compression ratio (%) = (1 − h/x) × 100,

where h is the dimension after compression and x is the dimension before compression. Table 3 lists the dimensions of the compressed data, training time, MSE loss, and compression ratio for the C7AIR and HDRA datasets.
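The compression-ratio definition is trivial to compute; in the example below, the reduction from 7098 to 3042 dimensions is a hypothetical pair chosen to reproduce the reported 57.14% ratio for C7AIR:

```python
def compression_ratio(h, x):
    """Percent reduction in dimension: (1 - h/x) * 100,
    where h is the dimension after compression and x before."""
    return (1 - h / x) * 100

# Hypothetical C7AIR example: 7098 SNP dimensions compressed to 3042 scores,
# consistent with the 57.14% ratio reported in the text.
ratio = compression_ratio(3042, 7098)
```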

Prediction of Phenotypes Based on the Compressed Data
To evaluate the accuracy of the models and to investigate the compressed data, the prediction models were fitted to the compressed data. Figs. 3A and 3B show the prediction accuracy of RF at different compression levels, including no compression, for both datasets. We defined the compression levels according to the compression-ratio percentage: 0% (original uncompressed data), 57% (57.14%), and 94% (94.01%) for the C7AIR dataset, and 0% (original uncompressed data), 57% (57.14%), and 98% (98.57%) for the HDRA dataset. For the C7AIR dataset (Fig. 3A), an accuracy similar to that of the original data (with an average difference of less than approximately 3%) was attained even at 94% compression. For the HDRA dataset (Fig. 3B), the accuracy obtained after 98% compression outperformed that of the original data (by approximately 5% on average) for all 18 selected traits (Table 1). Thus, DeepCGP could successfully predict phenotypes even after high-level compression. The predictive performance of RF was compared with that of two quantitative genetic models, BayesB [8] and GBLUP [48] (Supplementary Tables S1 and S2, available online), both commonly used in genomic prediction; Figs. 4A and 4B display the predictive performance of BayesB, GBLUP, and RF for the C7AIR and HDRA datasets.
We compared the predictive performance on the original uncompressed data with that on the compressed data for both datasets. In the C7AIR dataset, the largest predictive performance was achieved by RF (0.72), followed by GBLUP (0.68) and BayesB (0.67), despite 94% compression. In contrast, after 98% compression, the largest predictive performance in the HDRA dataset was delivered by BayesB (0.64), followed by GBLUP (0.63) and RF (0.60). The results suggest that RF yielded the highest prediction accuracy for both the original uncompressed data and the compressed data of the low-dimensional dataset (i.e., C7AIR). In contrast, it is difficult to apply BayesB to a high-dimensional original uncompressed dataset (i.e., HDRA) owing to its computational requirements; we therefore did not calculate the prediction accuracy for the original uncompressed HDRA dataset, which is marked N/A in Fig. 4B. However, BayesB was applied to the compressed HDRA data, and its prediction accuracy outperformed both GBLUP and RF: after 98% compression, the predictive performance of the BayesB model was 0.01 and 0.04 higher than that of GBLUP and RF, respectively. Figs. 5A, 5B, and 5C show the prediction accuracies of the RF, GBLUP, and BayesB models, respectively, for the 18 selected traits (Table 1) of the HDRA dataset. After 98% compression, the predictive accuracy for trait id 16 with GBLUP was higher than at lower compression levels. The prediction times for RF, GBLUP, and BayesB were shorter at higher compression levels (Table 4); RF required the least time for both datasets at all compression levels.
For the HDRA dataset, BayesB took a long time to predict even after compression; it could not be applied to the original data (i.e., 0% compression) owing to its computational requirements and is therefore marked N/A.

Comparison With Other Compression Methods
We compared the compression performance of DeepCGP with that of Macarons, an SNP selection method that takes the correlations between SNPs into account to avoid selecting redundant SNP pairs. To compare Macarons with DeepCGP, we first selected SNPs using Macarons by setting k to 300000 (57%), 50000 (93%), and 10000 (98%). We then predicted phenotypes with random forest regression, using the same cross-validation folds as for DeepCGP to ensure that the results were directly comparable. Fig. 6 shows the prediction performance of DeepCGP and Macarons, compared at three compression levels (57%, 93%, and 98%) for the HDRA dataset; the y-axis shows the averaged prediction accuracy (Supplementary Figs. S4 and S5, available online). Although the selection step of Macarons is fast, the downside of the approach is that SNPs must be selected separately for each trait. In DeepCGP, by contrast, the data can be compressed for all traits in a single run. Moreover, the prediction accuracy of our deep-learning-based method DeepCGP was higher than that of Macarons, which suggests that a deep-learning-based compression method can learn more meaningful information than a non-deep-learning-based one.

DISCUSSION
High-dimensional genome-wide polymorphism data are extensively utilized in plant and animal breeding; this necessitates the development of innovative platforms that can considerably reduce the resources required for storage and processing. Studies have shown that the intrinsic biological patterns found in genomic data provide a unique opportunity for researchers to compress high-dimensional genome-wide polymorphism data. Several individual studies have compressed genome data or predicted phenotypes. However, in most studies, the quality of the data after compression is uncertain, and the compressed data are not used in the prediction method. For instance, a fast reference-free genome compression method [16] used an autoencoder to compress genome data, which maintained the compression ratio at an acceptable level while reducing the compression time for a small part of the gene; however, it did not report the quality of the data obtained after compression. In contrast, the method proposed in this study is scalable to high-dimensional data owing to its design, which uses a large number of autoencoders in parallel and iteratively, and it retains high-quality information in the compressed data, which can be used for any kind of data analysis in place of the original data. Montesinos-López et al. suggested that DL prediction performance was higher than that of conventional genomic prediction models for high-density datasets [49]. Li et al. provided an integrated framework to conduct GWAS and GS in crops with an environmental dimension, which enhanced prediction performance in breeding for future climates [50]. However, to the best of our knowledge, no research to date has combined a compression model with a prediction model.
In this study, we developed a DL-based compression and genomic prediction model, DeepCGP, which can support breeding while considerably reducing the storage requirements of DNA sequence data. The most prominent advantage of using DL for compression is its ability to learn meaningful information from the underlying genetic architecture; the method can model complex patterns with less intense computational requirements than other algorithms. The experimental results obtained in this study are promising, as we were able to predict phenotypes while evaluating the robustness of the compressed data. The compression level of our model can be adjusted depending on storage requirements or the desired prediction accuracy. In addition, we investigated the predictive performance of three popular prediction methods, RF, BayesB, and GBLUP, to evaluate the potential of compression-based analysis. The results showed that the predictive performance of BayesB was slightly higher than that of GBLUP and RF. However, applying BayesB to the original uncompressed HDRA data was not possible, as the method was extremely time-consuming for high-dimensional data (Table 4). For this reason, compressing high-dimensional genomic data is important for applying methods such as BayesB. Compression is also important for addressing the computational challenges of managing large-scale genomic data, including storage, processing, complex data analyses, visualization, retrieval, and sharing [51]. Transporting large genome-wide marker datasets between databases (via the Internet) and sharing data among multiple databases through APIs (e.g., the Breeding API, BrAPI) [13] require transport efficiency as well as computational efficiency, both of which can be achieved by compressing the genome-wide marker data.
In addition, deep learning is still improving and is currently not well suited to suggesting SNP sets. In other words, finding SNP sets using deep learning could be an important and large research theme, although we did not attempt it in this paper. Future work includes analyzing the gradients of each element of a neural network that predicts phenotypes from SNP data.
A potential limitation of our approach is that we used only diverse rice germplasm data to predict phenotypes from the compressed data. We have not yet conducted experiments on other datasets, such as soybean or human genome data. Hence, researchers should use this new method with caution, as DeepCGP may lose information when applied to other datasets.

CONCLUSION
In conclusion, a novel deep learning model, DeepCGP, was introduced as a new paradigm to compress genome-wide polymorphism data and successfully predict phenotypes from the compressed information. The DeepCGP methodology can potentially take complex modeling into account; for example, the lower-dimensional compressed data allow us to explicitly include interactions among polymorphisms (epistasis) in BayesB owing to the smaller number of variables in the compressed data. Another novelty of the proposed method is that it provides a combined application of DL and genomic prediction, which may substantially improve the computational efficiency of DL by using compressed data as input variables. The proposed method also provides a strong alternative for compressing high-dimensional genomic data and predicting phenotypes from the compressed data, which saves storage as well as computational time.
Chyon Hae Kim received the doctor's degree in engineering from Waseda University in 2008. He is director and CTO of Sky Ocean Technology Co., Ltd., a visiting associate professor with Iwate University (2020-), and a technical adviser of AISing Ltd. (2020-). He is an invited researcher.

Shimono Hiroyuki received the doctor's degree from Hokkaido University in 2003. He is a professor with the Faculty of Agriculture, Iwate University. He received the Japan Prize in Agricultural Sciences, Achievement Award for Young Scientists (2010), and the Award for Young Scientists of the Japanese Society of Crop Science (2010). His research focuses on agronomy, phenotyping technologies, stress physiology, and simulation modeling.
Akio Kimura received the master's degree in computer and information sciences from the Graduate School of Engineering, Iwate University, in 1993, and joined Sony Corporation, where he was engaged in research and development of magnetic recording. In 1995, he joined Iwate University as an assistant professor, and he is now an associate professor in the Department of Systems Innovation and Engineering. He is engaged in research related to image processing, computer vision, and machine learning. He holds a Dr. Eng. degree and is a member of IEICE, IPSJ, the Institute of Image Electronics Engineers of Japan, and the Society for Art and Science.