Shape Based Deep Estimation of Future Plant Images

Plants exhibit dynamic changes as they grow. For example, a new leaf may appear suddenly, and then rotate and fold over time. Therefore, it is difficult to predict the growth of plants. For accurate growth prediction, it is important to predict the shapes, colors, and textures of the leaves. Conventional methods simply use RGB images to predict the next plant image in a single pass. In this paper, we propose a novel deep network which is divided into two subnets for shape estimation and color reconstruction. For shape estimation, four gray time-series images are first aligned to a future target using a spatial transformer network (STN). They are then fused using a U-Net with two LSTMs to generate a future shape image. The color reconstruction subnet fuses the predicted shape with an RGB plant image to restore the color information. In addition, for shape estimation we use gray images, which retain texture information, instead of binary images, which carry only simple shape information, or RGB images, which carry too much information. The proposed deep network can robustly generate future plant images for plant growth prediction. It is evaluated using our proprietary dataset as well as two public datasets for different types of plants. The experimental results demonstrate that our proposed network predicts the leaf shape more accurately and restores the RGB color information. As a result, our method can create accurate future plant images.


I. INTRODUCTION
Growing plants takes much effort. For instance, plants need to be constantly supplied with appropriate amounts of water, humidity, sunlight, temperature, and nutrients, and these environmental factors ensure high-quality growth [1], [24]-[27], [36], [37]. To improve productivity in plant cultivation, the environmental factors can be controlled adaptively, depending on the plant growth status. Sensors and monitoring systems have been used in a number of agricultural technologies to increase plant production [2], [28]-[31], [38]. However, they require expensive equipment to measure plant growth, and the measurement is time-consuming and destructive. If plant growth could be predicted with a cheap, remote sensing device such as a vision sensor, that would contribute to cost reduction and automatic environment control without human intervention.
The associate editor coordinating the review of this manuscript and approving it for publication was Pinjia Zhang.
There have been very few studies on plant growth prediction. Existing studies focused on literary analysis, especially on leaf prediction as one aspect of plant growth, and used existing backbone networks without proposing a new network [3]-[5], [15], [32]. These studies also predict the next plant image in a single pass through the network, and they use only RGB images when predicting plant growth. This paper addresses the gap left by these previous studies. We propose a new network structure suited to plant growth prediction, which uses gray images in addition to RGB images and divides the prediction into two stages, rather than predicting the plant growth in a single pass, to further improve the prediction accuracy.
In this paper, we propose a novel deep network that generates a future plant image from past and current ones for plant growth prediction. To predict the growth accurately, the proposed method consists of two subnets. Fig. 1 illustrates the basic concept of the proposed deep network architecture. First, the overall shape of the plant is predicted from a number of input gray images without a background. Unlike conventional methods that work in the RGB domain, we predict the shape of the plant in the gray domain. This reduces the interference of the texture and background in the plant image during the shape prediction [3]-[5]. Each individual input gray image is first aligned to a future target with a spatial transformer network (STN). The resulting aligned images enter the network and are fused in the shape subnet (a U-Net with two LSTMs) to generate a future shape image. Second, because the predicted shape image is grayscale, it does not have sufficient texture information. For colorization and texture replenishment, it passes through the color subnet (an auto-encoder network), where it is fused with an RGB image that is simply predicted by an STN from the current RGB image. Finally, the future RGB plant image is generated by the color subnet.
The main contributions of the paper are summarized as follows. First, the growth prediction task is decomposed into two processes, shape prediction and RGB prediction. Separating these processes improves the accuracy of the shape prediction by minimizing the interference of the RGB information in the affine transform performed by the STN. In addition, the prediction performance is improved by using gray images, which exclude the color information that is unnecessary for shape prediction. Second, all the shape inputs are geometrically transformed into the future target shape before they enter the network. This leads to more effective shape prediction compared to direct fusion without the STN. Finally, the proposed method is evaluated with our proprietary dataset as well as two public datasets for different types of plants. We built the new dataset from images acquired in an actual plant factory.

II. RELATED WORKS
A. STN BASED VIDEO PREDICTION
Although CNNs have been widely used, they still have some limitations regarding spatial invariance. One of the limitations is that a CNN cannot detect objects well when they change in size, rotate, or shift in space. In contrast, the spatial transformer network (STN) can manipulate data spatially within the network [6]-[9]. It was proposed to learn invariance to spatial changes such as size changes, rotations, and position translations. Even today, it is still widely applied to motion estimation and future frame prediction, and is also used in modified forms depending on the purpose [33]-[35], [39], [40]. In [10], a network with dual adversarial training is proposed. The network is divided into frame and flow branches. The frame branch directly predicts future frames, and the flow branch predicts the future flows. The outputs of the two branches are fused to predict the final future frame. In [11], the authors proposed a network that predicts transformation parameters in the transformation domain rather than directly predicting the future frames. The affine transformation parameters for the final frame prediction are learned via a CNN by fusing a series of successive affine transformation parameters from the input video. That is, the network generates the affine transformation parameters for future frames.
Even among video prediction studies that do not use an STN, several methods have been proposed, mainly based on LSTM. In [12], the authors proposed a network that predicts future frames by separating video information into motion and content. The network is trained in an end-to-end manner (rather than separately), and the motion and content parts are combined after passing through different encoders. A network that performs the three processes of pose estimation, pose prediction, and image generation was proposed in [13]. After the pose of the time-series input images is estimated, the future frames are predicted in the pose domain. They are then reconstructed from the pose input through the image generation subnet.

B. PLANT IMAGE PREDICTION
In general, a clean plant image can be obtained only when it is accompanied by post-processing such as noise removal [20]. However, in a stable environment such as our plant factory, noise or distortion hardly occurs in the acquired images.
FIGURE. The architecture of the shape subnet. The $(t-3)$-th to $t$-th gray shape images pass through the spatial transformer network (STN) to be aligned to the target $(t+1)$-th image. The results from the STN are passed through two LSTMs to estimate the $(t+1)$-th shape image. Lastly, the color is reconstructed by employing an RGB image.
Meanwhile, a number of deep learning methods have been studied to predict future plant images. Research on plant growth prediction mainly adopts the approach of future video generation. Conventional plant growth prediction commonly uses an auto-encoder structure with ConvLSTM, which is often used in video prediction. Multiple auto-encoders corresponding to the individual inputs are fused through ConvLSTM [14]. In [15], the network simultaneously takes RGB images of a plant and their labels as inputs, and produces both label and RGB images at the output. Although it is a simple network architecture consisting of only an auto-encoder with ConvLSTM, it introduces four loss functions. In [3], the network uses a GAN in addition to an auto-encoder with ConvLSTM. Furthermore, by extending [15], several auto-encoders are hierarchically fused at 1/2 and 1/8 resolution via ConvLSTM. The method was also modified by replacing the LSTM-based fusion with simple concatenation of CNN-based channels.
On the other hand, there are studies that measure the features or contours of individual leaves rather than the overall appearance of plants and utilize them for growth prediction or recognition [21], [22]. From another perspective, growth prediction has also been studied by focusing on indicators such as live weight rather than on the appearance of plants [23].
The proposed network differs substantially from these conventional plant image generation methods. The future frame generation task is divided into shape and RGB prediction processes, which are performed in the gray and RGB domains, respectively. The predicted gray shape image is then combined with the RGB image estimated by an STN for colorization. In addition, extensive experiments are performed using three datasets, which consist of our own dataset as well as two public ones.

III. THE PROPOSED METHOD
As plants grow, their leaves change dynamically. For example, the leaves gradually enlarge and new leaves are generated. In addition, their movements over time are diverse, ranging from bending to spreading. In this paper, we attempt to predict the future plant image from a number of past and current images. This technology is particularly useful in plant factories, where the environmental factors for plant cultivation can be intelligently controlled according to the status of plant growth. One of the simplest methods to quantify the amount of growth is to measure the leaf area. Thus, it is very important to accurately estimate the shape of the plant. This is the reason why the proposed estimation process is divided into 'shape estimation' and 'color reconstruction' parts. Fig. 2 illustrates the overall architecture of the proposed deep network. Gray and RGB images are fed into the network as inputs for shape estimation and color reconstruction, respectively. The shape of the leaves is estimated with gray images, while the color information is recovered with an RGB image which is the affine-transformed version of the current one. The proposed deep network is divided into two subnets.
VOLUME 10, 2022
First, an affine transformation is performed by the STN to estimate the amount of plant growth from a past or current time to the future target time. To avoid interference from the background during the transformation, the leaves are segmented in advance. In addition, our experiments showed that the color information can be an obstacle to accurate affine transformation. Thus, the shape estimation is performed in the gray domain. After the STN, the input images at distinct times are geometrically aligned to the future ground truth shape, and the resulting affine-transformed images are fed into the shape subnet to predict the future shape of the input plant. The color information needs to be replenished because the generated image is gray. For the task of color reconstruction, the affine-transformed RGB image is fused with the estimated shape image through an auto-encoder. The affine transformation is expected to be more accurate from the current image than from past images. Thus, we use an RGB image transformed from the current time only for the fusion. The two images are hierarchically fused by connecting the encoder of the RGB image to the decoder of the shape, as shown in Fig. 2.

A. SPATIAL TRANSFORMER NETWORK
The spatial transformer network (STN) applies an affine transformation to its input and estimates the affine parameters. The STN consists of three parts: a localization network, a grid generator, and a sampler. Fig. 3 shows the detailed structure of the STN. The localization network takes the input image and outputs a set of affine transformation parameters, $\theta$. We input each of the four gray shape images (from $S_{t-3}$ to $S_t$) to the localization network, and each individual shape image is affine-transformed to the target image, $S_{t+1}$. This affine operation with the STN is expressed by
$$\theta_{t+1} = f_{loc}(S_{t-k}), \quad k = 0, 1, 2, 3,$$
where $\theta_{t+1}$ is the set of affine parameters output from the localization network $f_{loc}$. Note that no parameters are estimated from the $t$-th RGB image, $I_t$, itself; the parameters obtained for $S_t$ are used directly for it because the STN performs poorly on RGB images. In other words, $f_{loc}(I_t)$ is set equal to $f_{loc}(S_t)$ as follows.
$$f_{loc}(I_t) = f_{loc}(S_t)$$
The grid generator then creates a sampling grid, which is a set of points to be sampled from the input image. The predicted transformation parameters $\theta_{t+1}$ are used to generate the sampling grid. The grid generator applies the affine transformation as
$$\begin{pmatrix} x_i^{t} \\ y_i^{t} \end{pmatrix} = \theta_{t+1} \begin{pmatrix} x_i^{t+1} \\ y_i^{t+1} \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^{t+1} \\ y_i^{t+1} \\ 1 \end{pmatrix},$$
where $(x_i^{t+1}, y_i^{t+1})$ are the target coordinates of the regular grid in the shape image $S_{t+1}$, $(x_i^{t}, y_i^{t})$ are the source coordinates in the shape image $S_t$ that define the sample points, and $\theta_{t+1}$ is the affine transformation matrix, which applies cropping, translation, rotation, scaling, and skewing to the input image. The affine transformation matrix consists of six parameters, which are produced by the localization network.
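As a rough illustration of the grid generator, the sketch below maps every target pixel to its source coordinates with a 2 × 3 affine matrix. It is a minimal sketch rather than the authors' implementation: the function name `affine_grid`, the normalization of coordinates to [-1, 1], and the two example matrices are assumptions made here for illustration.

```python
from typing import List, Tuple

# 2 x 3 affine matrix: ((t11, t12, t13), (t21, t22, t23)).
Theta = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def affine_grid(theta: Theta, height: int, width: int) -> List[List[Tuple[float, float]]]:
    """Return, for every target pixel, the source coordinates (x_s, y_s).

    Coordinates are normalized to [-1, 1] (an assumed convention), so theta
    acts on the homogeneous target coordinate (x_t, y_t, 1).
    """
    grid = []
    for row in range(height):
        # Normalized y of the target pixel; a single row maps to 0.
        y_t = 2.0 * row / (height - 1) - 1.0 if height > 1 else 0.0
        grid_row = []
        for col in range(width):
            x_t = 2.0 * col / (width - 1) - 1.0 if width > 1 else 0.0
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            grid_row.append((x_s, y_s))
        grid.append(grid_row)
    return grid


# Hypothetical example matrices.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
zoom_in = ((0.5, 0.0, 0.0), (0.0, 0.5, 0.0))  # samples only the central half of the source
```

With `zoom_in`, the target grid samples only the central region of the source, so the plant appears enlarged in the output, which is the kind of change a growth-aligned transform has to express.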
When the transformation is constrained to the attention form
$$\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix},$$
it allows cropping, translation, and isotropic scaling by varying $s$, $t_x$, and $t_y$. Finally, the sampler takes the input image and the sampling grid, and produces the affine-transformed output by performing bilinear sampling to generate the shape image $S_{t+1}$.
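The bilinear sampling step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `bilinear_sample`, the [-1, 1] coordinate normalization, and the zero padding outside the image are all choices made here.

```python
import math
from typing import List

Image = List[List[float]]


def bilinear_sample(image: Image, x_s: float, y_s: float) -> float:
    """Sample `image` at normalized source coordinates (x_s, y_s) in [-1, 1].

    Pixels outside the image are treated as 0 (zero padding is an assumption).
    """
    height, width = len(image), len(image[0])
    # Convert normalized coordinates to continuous pixel indices.
    x = (x_s + 1.0) * (width - 1) / 2.0
    y = (y_s + 1.0) * (height - 1) / 2.0
    x0, y0 = math.floor(x), math.floor(y)
    wx, wy = x - x0, y - y0  # fractional offsets used as interpolation weights

    def pixel(r: int, c: int) -> float:
        return image[r][c] if 0 <= r < height and 0 <= c < width else 0.0

    top = (1.0 - wx) * pixel(y0, x0) + wx * pixel(y0, x0 + 1)
    bottom = (1.0 - wx) * pixel(y0 + 1, x0) + wx * pixel(y0 + 1, x0 + 1)
    return (1.0 - wy) * top + wy * bottom
```

Applying this function at every source coordinate produced by the sampling grid yields the affine-transformed image. Because the interpolation weights vary smoothly with the coordinates, the operation is differentiable with respect to the affine parameters.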

B. SHAPE ESTIMATION
After the STN alignment, the future plant shape is predicted by a U-Net with two LSTMs. For shape estimation, we use gray plant images rather than binary or RGB images. The plant is segmented from the original plant image, and the resulting image is converted into grayscale. The gray plant image has no background, so the network can focus on the leaves alone without interference from the surrounding background.
Our experiments showed that using gray images results in better shape prediction. For shape estimation, each of the four sequential gray images that has passed through the STN is fed into the encoder, which is connected to a decoder by two LSTMs. In detail, the two LSTMs work at 1/2 and 1/8 of the image size within the auto-encoder structure. After passing through the encoder, the 1/2-size feature passes through the first LSTM, and the 1/8-size feature passes through the second LSTM, as shown in Fig. 4. The four time-series images are sequentially combined by the U-Net with two LSTMs, and this sequential combination is made at the two distinct hierarchical layers. Likewise, the outputs of the two LSTMs are combined by decoder LSTMs, as shown in Fig. 4. By using two LSTMs, we can predict a detailed and accurate plant shape, $S_{t+1}$. The final shape output $S_{t+1}$ is obtained by fusing the outputs from the two decoder LSTMs with a U-Net decoder.
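To make the idea of sequentially combining four time-ordered features concrete, the sketch below folds a short sequence through a standard LSTM recurrence. It is a deliberately simplified illustration and not the authors' network: the paper's LSTMs operate on encoder feature maps, whereas here each feature value is updated independently with scalar weights, and the weight values and helper names (`lstm_step`, `fuse_sequence`, `KEYS`) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x: Sequence[float], h: Sequence[float], c: Sequence[float],
              w: Dict[str, float]) -> Tuple[List[float], List[float]]:
    """One LSTM update applied independently to each feature value."""
    new_h, new_c = [], []
    for x_i, h_i, c_i in zip(x, h, c):
        i = sigmoid(w["wi"] * x_i + w["ui"] * h_i + w["bi"])    # input gate
        f = sigmoid(w["wf"] * x_i + w["uf"] * h_i + w["bf"])    # forget gate
        o = sigmoid(w["wo"] * x_i + w["uo"] * h_i + w["bo"])    # output gate
        g = math.tanh(w["wg"] * x_i + w["ug"] * h_i + w["bg"])  # candidate state
        c_next = f * c_i + i * g
        new_c.append(c_next)
        new_h.append(o * math.tanh(c_next))
    return new_h, new_c


def fuse_sequence(frames: Sequence[Sequence[float]], w: Dict[str, float]) -> List[float]:
    """Fold time-ordered feature vectors into one fused feature vector."""
    h = [0.0] * len(frames[0])
    c = [0.0] * len(frames[0])
    for frame in frames:  # oldest frame first, current frame last
        h, c = lstm_step(frame, h, c, w)
    return h


KEYS = ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg")
weights = {k: 0.5 for k in KEYS}  # untrained, hypothetical values
frames = [[0.1, 0.2], [0.2, 0.3], [0.3, 0.5], [0.4, 0.8]]  # four time steps
fused = fuse_sequence(frames, weights)
```

In the proposed subnet this kind of recurrence is applied at two scales (1/2 and 1/8 of the image size), so that both fine and coarse features are fused over time before decoding.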

C. COLOR RECONSTRUCTION
The shape subnet described above generates only a gray shape image. The color information must therefore be reconstructed, and the estimated shape image also still lacks texture information. We fuse the estimated RGB image, $I_{t+1}$, with the estimated shape through an auto-encoder.
The RGB image $I_{t+1}$ is the version of the current RGB image that is aligned to the target. Note that this alignment uses the parameter set $\theta_{t+1}$ already obtained in the shape subnet; no parameters are estimated from the RGB image itself. The final shape output $S_{t+1}$ from the shape subnet and the $(t+1)$-th RGB image $I_{t+1}$ are employed for restoring the color information. $S_{t+1}$ and $I_{t+1}$ pass through their respective encoders and are hierarchically fused at a decoder. In detail, the RGB image $I_{t+1}$ passes through the encoder, and its resolution is reduced to 1/2 and 1/4 by convolution. The resulting features are then concatenated to the shape decoder. The color reconstruction process is detailed in Fig. 5. Finally, we obtain the predicted $(t+1)$-th plant image from the $(t-3)$-th to $t$-th plant images. Because it passes through the shape and color subnets in sequence, the final output tracks the growth well and shows detailed color and texture for each leaf.

D. LOSS FUNCTION
We used the L1 loss to train our network. The network was also trained with other losses, such as the mean square error and SSIM, but the best results were obtained with the L1 loss. There are six L1 losses. Four of these are STN losses, and the remaining two are the shape and texture losses. The total loss is as follows.
$$L = \mu_1 L_{STN} + \mu_2 L_{SHAPE} + \mu_3 L_{COLOR},$$
where $\mu_1$, $\mu_2$, and $\mu_3$ are coefficients that are determined empirically. The first term, $L_{STN}$, is the loss after passing through the spatial transformer network. It is calculated for each of the four gray shape inputs. The second term, $L_{SHAPE}$, is the L1 loss between the final shape output image (which is the output of the shape subnet) and the $(t+1)$-th gray shape ground truth. The last term, $L_{COLOR}$, is the L1 loss between the final RGB output of the overall network and the $(t+1)$-th RGB ground truth.
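One plausible way to write the three terms explicitly is sketched below. The notation is an assumption introduced here for illustration and may differ from the authors' own: $\hat{S}^{(k)}_{t+1}$ denotes the STN-aligned version of the input $S_{t-k}$, $\tilde{S}_{t+1}$ the output of the shape subnet, $\tilde{I}_{t+1}$ the final RGB output, and $S_{t+1}$ and $I^{gt}_{t+1}$ the gray and RGB ground truths. That each STN output is compared against the $(t+1)$-th gray ground truth, and that the four STN losses are simply summed, are likewise assumptions.

```latex
\begin{align}
L_{STN}   &= \sum_{k=0}^{3} \bigl\| \hat{S}^{(k)}_{t+1} - S_{t+1} \bigr\|_1, \\
L_{SHAPE} &= \bigl\| \tilde{S}_{t+1} - S_{t+1} \bigr\|_1, \\
L_{COLOR} &= \bigl\| \tilde{I}_{t+1} - I^{gt}_{t+1} \bigr\|_1 .
\end{align}
```

Under this reading, the four summands of $L_{STN}$ together with $L_{SHAPE}$ and $L_{COLOR}$ account for the six L1 losses mentioned in the text.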

E. USE SCENARIO
Practical usage is illustrated in Fig. 7. A plant is captured every day (at a one-day interval). Four snapshots (three past and one current) are required as the network input, and the network then generates the plant image of the next day. From the network output, we can estimate the plant growth state on the next day. In this way, the network can predict the future plant image from just the four most recent images.
Although the time interval is set to one day in this paper, the network can be trained for any interval. However, the prediction performance decreases as the time interval increases. Increasing the time interval while preserving the performance is left for future work. See also Fig. 13, which compares the performance between one-day and two-day intervals.
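The use scenario above amounts to a sliding window over the daily captures. The sketch below builds (inputs, target) pairs for a given interval; the function name `make_samples`, its parameters, and the use of day numbers in place of images are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_samples(daily_images: Sequence[T], num_inputs: int = 4,
                 interval: int = 1) -> List[Tuple[List[T], T]]:
    """Pair every run of `num_inputs` captures with the capture one interval later."""
    samples = []
    span = num_inputs * interval  # offset from the first input to the target
    for start in range(len(daily_images) - span):
        inputs = [daily_images[start + k * interval] for k in range(num_inputs)]
        target = daily_images[start + span]
        samples.append((inputs, target))
    return samples


# Day numbers stand in for the captured images of one plant.
days = list(range(1, 19))
one_day = make_samples(days, interval=1)
two_day = make_samples(days, interval=2)
```

With 18 daily captures, the one-day setting pairs days 1 to 4 with day 5 and so on, while the two-day setting pairs days 1, 3, 5, and 7 with day 9, which leaves fewer samples per plant.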

A. EXPERIMENTAL SETTING
Our proposed network is trained with the plant datasets. The input image of size 128 × 128 is used as it is, without dividing the image into patches. The network is implemented using the PyTorch framework on a PC with two NVIDIA RTX 2080 Ti GPUs [16]. We adopted the Adam optimizer for loss optimization. The batch size is 8 [17]. The initial learning rate is 0.0001 and is divided by 10 every 30k iterations.
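The learning-rate schedule described above is a step decay, which can be sketched as a small function. The function name and parameter names are illustrative assumptions; only the values (initial rate 0.0001, division by 10, every 30k iterations) come from the text.

```python
def learning_rate(iteration: int, base_lr: float = 1e-4,
                  decay: float = 0.1, step: int = 30_000) -> float:
    """Step decay: multiply the base rate by `decay` once every `step` iterations."""
    return base_lr * decay ** (iteration // step)
```

In PyTorch, the same behavior is commonly obtained with `torch.optim.lr_scheduler.StepLR` using `step_size=30000` and `gamma=0.1`, stepped once per iteration; the text does not state which mechanism was actually used.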

B. TRAINING DATASET
We used a variety of datasets for the experimental evaluations on different types of plants. In particular, we grew plants in a plant factory to create our own dataset. The acquired plant images were cropped and segmented to build this dataset. The proposed network is evaluated with three datasets: Aberystwyth [19], Komatsuna [20], and our own Butterhead. The Aberystwyth and Komatsuna datasets consist of time-series RGB leaf images and their corresponding binary shapes.
We grew green Butterhead lettuce, which is one type of lettuce, directly in a plant factory. Our own dataset, which is named "Butterhead", consists of several series of Butterhead growth images captured at one-day intervals over a period of 18 days. The dataset pre-processing step is shown in Fig. 6. First, a camera is attached to the ceiling of the plant factory, Butterhead lettuce seeds are planted in soil, and they are grown for 18 days. The acquired Butterhead images are cropped to 128 × 128 so that the plant is located in the center of the image. Next, a binary leaf shape image is obtained from the resulting RGB image by segmentation. Because the background is almost uniform, the leaves in the Butterhead images can be easily segmented by appropriately thresholding the RGB values. The binary shape is then multiplied with the RGB image to extract only the foreground plant leaves without the image background. Finally, the images acquired in the previous step are converted to gray. The remaining two datasets go through the same process.
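The pre-processing chain (threshold, mask, convert to gray) can be sketched per pixel as below. The text does not give the actual threshold, so the green-dominance rule in `is_leaf` and its `margin` value are hypothetical, and the BT.601 luma weights used for the gray conversion are likewise an assumption.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def is_leaf(pixel: Pixel, margin: int = 20) -> bool:
    """Hypothetical threshold rule: green exceeds red and blue by `margin`."""
    r, g, b = pixel
    return g - max(r, b) >= margin


def to_gray_shape(rgb: List[List[Pixel]]) -> List[List[float]]:
    """Mask out the background, then convert the foreground to gray."""
    gray = []
    for row in rgb:
        gray_row = []
        for r, g, b in row:
            mask = 1 if is_leaf((r, g, b)) else 0  # binary leaf shape
            # Multiply by the mask, then apply BT.601 luma weights (assumed).
            gray_row.append(mask * (0.299 * r + 0.587 * g + 0.114 * b))
        gray.append(gray_row)
    return gray
```

Background pixels become exactly zero, so the resulting gray image contains only the leaves while still carrying their texture, unlike the binary mask.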
For data augmentation, we rotated the plant images by 90, 180, and 270 degrees and flipped them along the x- and y-axes. As a result, the training data comprise a total of 1,674 images in the Aberystwyth dataset, 448 images in the Komatsuna dataset, and 972 images in our Butterhead dataset. During training, the four $(t-3)$-th to $t$-th time-series images and one $t$-th RGB image are used as inputs, and the $(t+1)$-th gray shape image and RGB image are obtained as outputs.
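The rotations and flips used for augmentation can be sketched on images stored as nested lists. The function names, the clockwise rotation direction, and the choice to keep the original image in the returned set are assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]


def rotate90(image: Image) -> Image:
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image: Image) -> Image:
    """Mirror left-right (flip about the vertical axis)."""
    return [row[::-1] for row in image]


def flip_vertical(image: Image) -> Image:
    """Mirror top-bottom (flip about the horizontal axis)."""
    return [list(row) for row in image[::-1]]


def augment(image: Image) -> List[Image]:
    """Return the original, its three rotations, and its two flips."""
    r90 = rotate90(image)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return [image, r90, r180, r270, flip_horizontal(image), flip_vertical(image)]
```

For a time series, the same transform would need to be applied to all four inputs and to the target so that the growth between frames stays geometrically consistent.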

C. COMPARISON WITH THE EXISTING METHODS
Various experiments have been conducted to demonstrate that our network is well suited to plant growth prediction. For comparison, plant growth prediction is evaluated with existing video prediction networks [12], [13] as well as with plant growth prediction methods. Video frame generation is very similar to plant growth prediction because both tasks predict how an object changes over time. Video prediction estimates the next motion of a moving object; if the moving object is regarded as a plant, object motion corresponds to plant growth in our work.
The future plant images generated by U-Net [3], U-Net-LSTM [3], MC-Net [12], and HP-Net [13] are evaluated. The experiments with the existing methods were conducted by directly feeding a sequence of plant images into the networks instead of video frames. Fig. 9 shows the experimental results. Fig. 9 (a) shows the RGB ground truth, while Figs. 9 (b) [3] and 9 (c) [3] show the experimental results of the existing growth prediction methods. The only difference between the two methods is how the multiple encoded inputs are fused. Figs. 9 (d) [12] and 9 (e) [13] show the respective results of MC-Net and HP-Net, which are used for video prediction.
FIGURE 9. Comparison of plant image prediction results from three datasets. The first and second rows are from the Aberystwyth [19] dataset, the third and fourth rows are from our Butterhead dataset, and the last two rows are from the Komatsuna [20] dataset. Note that the background is not included in the task of plant image prediction, and only the plant object is predicted. Artifacts in the background occur during color reconstruction.
As shown in Fig. 9 (b), the existing U-Net does not predict leaf growth correctly. The shapes of the individual leaves are distorted. Moreover, the leaves are blurred and heavy artifacts occur. U-Net fails to detect the growth movement in the time-series data because of its simple concatenation fusion, consequently leading to inaccurate prediction of the subsequent leaf.
Replacing concatenation with LSTM for fusion improves the growth prediction of the individual leaves, as shown in Fig. 9 (c). Compared with U-Net, we can see that U-Net-LSTM estimates the shape of individual leaves more accurately. The importance of LSTM in time series data is also confirmed in the other methods. Note that all the methods except for Fig. 9 (b) adopt one or more LSTMs in the network architecture.
The first result image in Fig. 9 (d) is similar to the ground truth. It is free of leaf artifacts, the leaf boundary is less distorted, and the individual leaves are generated at the right size. However, the second image shows more artifacts than the first. At the beginning of plant growth, the amount of growth over time can be clearly recognized at a glance. However, as the plant matures, its growth rate decreases. For this reason, the leaf shape prediction is poorer in the second result than in the first. MC-Net is originally divided into motion and content encoders, just as our proposed network is divided into shape estimation and color reconstruction subnets. For the comparative experiments, we use the motion encoders to predict the leaf shape and the content encoders to replenish the leaf color. Fig. 9 (d) shows that the adoption of content encoders in MC-Net improves the textures of the plant leaves compared to Fig. 9 (b) and Fig. 9 (c), for which no content reinforcement modules are used.
Fig. 9 (e) achieves a similar performance with fewer artifacts for both the first and second images, regardless of the degree of plant growth. However, when the individual leaves are observed closely, they have grown less, so they are comparatively smaller or shorter. In addition, the overall shape of the leaves is distorted compared to the ground truth. HP-Net originally consists of pose estimation and image generation. As with MC-Net, for the comparative experiments we predict the shape of the plant leaves with the pose estimation and enhance the texture inside the leaves using the image generation. Three convolutional LSTMs are used for the pose estimation. Because of the poor pose estimation, the leaves do not grow properly and their shapes are distorted. The image generation subnet creates the final image by concatenating the difference between the t-th and (t + n)-th poses with the t-th RGB image. The leaf texture in Fig. 9 (e) is blurred. These experimental results show that the image generation subnet performs poorly in plant growth prediction.
The proposed method in Fig. 9 (f) shows highly accurate prediction of plant leaves. It is superior to the existing methods in terms of plant boundary, artifacts, and blur. In particular, the proposed method predicts the plant leaf shapes better than the existing methods. This is because the STN first aligns the plant leaves, and they are then fused with LSTMs for leaf shape prediction. Our experimental results demonstrate that the blur is reduced while the texture of the leaves is enhanced.
In terms of execution time, the conventional U-Net-LSTM [3] takes 3.298 seconds while the proposed method takes 3.351 seconds. The existing method predicts a future plant image directly from RGB inputs. On the other hand, the proposed method consists of two subnets (shape estimation and color reconstruction), so the network is more complex and the additional pass through the STN takes longer. However, the usage scenario of the proposed method does not require real-time processing. The proposed prediction is used to control the plant cultivation environment, for which the interval is relatively long (at least one day). In other words, the prediction is made every day, and the environmental factors of the plant factory are controlled based on the prediction.
Fig. 10 shows the intermediate outputs of the proposed network shown in Fig. 2. Initially, the shape inputs are aligned to the target, $S_{t+1}$, by passing through the STN. Fig. 10 (b) shows the STN output for the input $S_t$. The aligned outputs are then fused by the U-Net with two LSTMs to estimate the shape output, which is shown in Fig. 10 (c). If the IoU values (the measure of binary shape estimation) are compared between Figs. 10 (b) and (c), the accuracy of the shape output is higher than that of the STN output, as expected. The result confirms that the shape estimation subnet can generate a shape image that is more accurate than the affine-transformed version. The final RGB output in Fig. 10 (d) shows that the leaf texture is restored through the RGB reconstruction subnet.
Table 2 shows the SSIM and PSNR comparisons of the results from U-Net, U-Net-LSTM, MC-Net, HP-Net, and the proposed network. Note that SSIM and PSNR are quantitative evaluation metrics for digital images. The former measures the structural similarity of the scene, while the latter measures the error of the image signals. The formulas for SSIM and PSNR are as follows.
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$
where $\mu$ is the mean, $\sigma$ is the standard deviation ($\sigma_{xy}$ being the covariance of $x$ and $y$), and $C_1$ and $C_2$ are predefined constants, and
$$\mathrm{PSNR} = 10 \log_{10} \frac{s^2}{\mathrm{MSE}},$$
where $s$ is the maximum value in the image and MSE indicates the mean square error. Detailed explanations are provided in [41], [42].
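The two metrics can be sketched in a few lines for images flattened to 1-D sequences. This is a simplified illustration rather than the evaluation code behind Table 2: `global_ssim` computes SSIM once over the whole image instead of averaging over local windows, and the constants $C_1 = (0.01 s)^2$ and $C_2 = (0.03 s)^2$ are the commonly used defaults, assumed here because the text does not state them.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for two equally sized images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def global_ssim(x: Sequence[float], y: Sequence[float], max_value: float = 255.0) -> float:
    """SSIM computed from whole-image statistics (no sliding window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * max_value) ** 2  # assumed default constant
    c2 = (0.03 * max_value) ** 2  # assumed default constant
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Higher values are better for both metrics: PSNR grows as the pixel-wise error shrinks, and SSIM approaches 1 as the means, variances, and covariance of the two images agree.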
Overall, U-Net-LSTM and MC-Net achieve better results than HP-Net and U-Net among the conventional networks, and our proposed network achieves the best quantitative results. Fig. 11 shows the intermediate results of the proposed network. The $(t-3)$-th to $t$-th input images are aligned to the $(t+1)$-th image with the STN as a preprocessing step. As can be seen in Fig. 11, the accuracy of the STN is higher when the time interval between the input image and the target image is short. The STN output of $S_t \rightarrow \hat{S}_{t+1}$, which has the shortest time interval, is the most similar to the ground truth, as expected. By aligning each input to the target as a preprocessing step, the shape estimation subnet works efficiently and can generate a more accurate shape.

D. EFFECT OF SHAPE IMAGE TYPES
The proposed shape estimation subnet runs in the gray domain. Its performance is compared for three image types: binary, gray, and RGB. As shown in Fig. 12, prediction with binary images causes blur and shape distortion in the final RGB output, resulting in poor visual quality and inaccurate shape estimation. RGB inputs are poor at predicting the shapes of the leaves, leading to shape boundary artifacts. Note that for RGB inputs, the color reconstruction subnet is not needed, and the shape estimation subnet works only on RGB. The proposed gray inputs achieve less blur and better texture restoration for each leaf, as confirmed in Fig. 12 (e). Comparing Fig. 12 (c) with Fig. 12 (d) and Fig. 12 (e), we can see the importance of the separate shape estimation. In addition, gray images, which are simpler than RGB images and contain more information than binary images, are the most suitable for the shape estimation subnet. Recall that for RGB inputs, the shape estimation subnet generates the final RGB output without the color reconstruction subnet, unlike for the binary and gray inputs.

E. EFFECT OF TIME INTERVAL
To identify the effect of the time interval on the plant growth prediction accuracy, the interval between the time-series plant images fed into the network is increased. The proposed network is evaluated for two time intervals, one day and two days. As shown in Fig. 13, the plant growth prediction is better for the one-day interval than for the two-day interval, as expected.

F. ABLATION STUDIES
The proposed network is evaluated by changing its structure to demonstrate the contribution of each component. First, the shape output in the shape estimation subnet is removed to assess its importance as a constraint. Fig. 14 shows the results of the experiment with and without the shape output. Note that 'with shape output' means adding a shape loss at the output of the shape estimation subnet. As shown in Fig. 14 (b), the leaf boundary is not restored properly in the case of 'without shape output'. The results are closer to the input than to the ground truth. In the case of 'with shape output' in Fig. 14 (c), the plant leaf shape prediction is better and free of artifacts, and unlike 'without shape output', it is closer to the ground truth than to the input. This confirms the significance of the shape constraint.
Next, our network is compared with the network without the STN preprocessing to assess the effectiveness of the STN. Without STN preprocessing, the boundaries of the leaves are not clear and their shape distortion is severe. On average, the IoU value is lower when STN preprocessing is not performed. In particular, the second row of Fig. 15 (b) shows a significant difference in IoU values between the results with and without STN preprocessing. As a result, we can see that aligning the plant shape before the shape estimation subnet contributes to a clearer and less distorted shape.
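The IoU measure used in this comparison can be sketched for binary masks as follows. The function name and the convention of returning 1.0 when both masks are empty are assumptions; how the gray shape output is binarized before computing IoU is not stated in the text, so any nonzero value is treated as foreground here.

```python
from typing import List, Sequence


def iou(pred_mask: Sequence[Sequence[float]], true_mask: Sequence[Sequence[float]]) -> float:
    """Intersection over union of two equally sized masks."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            p_on, t_on = bool(p), bool(t)  # nonzero counts as foreground (assumed)
            intersection += p_on and t_on
            union += p_on or t_on
    return intersection / union if union else 1.0
```

An IoU of 1 means the predicted leaf region coincides exactly with the ground truth, and lower values indicate missing or spurious leaf area.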
Finally, we evaluate the performance of different shape estimation subnet structures. The proposed subnet is compared with two alternative structures. One uses a single LSTM, and the other fuses the multiple inputs by concatenation instead of the proposed LSTM fusion. As confirmed in the red box in Fig. 16 (b), the predicted leaf is still too small. This means that 'one LSTM' fails to predict the plant growth properly. For concatenation, in the red box of Fig. 16 (c), we observe a gray boundary artifact in the outer area of the leaf. This is an error caused by the poor plant shape prediction. The proposed method shown in Fig. 16 (d) is the most similar to the ground truth. This demonstrates that fusing with two LSTMs produces the best results.

V. CONCLUSION
In this paper, we proposed a new plant growth prediction network, which consists of two subnets: shape estimation and color reconstruction. In the former, the shape of the plant leaves is estimated with gray images to increase the accuracy of the affine transformation by the STN. The latter then performs color reconstruction by the hierarchical fusion of shape and RGB images with an auto-encoder. Unlike existing networks, we first align the four shape inputs with a future target as a preprocessing step, and this achieves better shape prediction. We evaluated the network on three different plant datasets, of which the Butterhead dataset was created by growing the plants ourselves in a plant factory. Diverse experiments demonstrate that the proposed network shows superior prediction performance for plant growth compared with the existing plant growth prediction method and video frame generation methods.
In the future, we will study the prediction of plant growth using other information (e.g., illumination on leaves) in addition to plant images.