Multi-Task Deep Learning With Dynamic Programming for Embryo Early Development Stage Classification From Time-Lapse Videos

Time-lapse imaging is a technology used to record the development of embryos during in-vitro fertilization (IVF). Accurate classification of embryo early development stages can provide embryologists with valuable information for assessing embryo quality, and hence is critical to the success of IVF. This paper proposes a multi-task deep learning with dynamic programming (MTDL-DP) approach for this purpose. It first uses MTDL to pre-classify each frame in a time-lapse video into an embryo development stage, and then uses DP to optimize the stage sequence so that the stage number is monotonically non-decreasing, which usually holds in practice. Different MTDL frameworks, e.g., one-to-many, many-to-one, and many-to-many, are investigated, and the one-to-many MTDL framework is shown to achieve the best compromise between performance and computational cost. To our knowledge, this is the first study that applies MTDL to embryo early development stage classification from time-lapse videos.


I. INTRODUCTION
In-vitro fertilization (IVF) [1]–[3] is a frequently used technology for treating infertility. The process involves the collection of multiple follicles for fertilization and in-vitro culture. Cultivation, selection and transplantation of the embryo are the key steps in determining a successful implantation during IVF [4], [5]. During the development of embryos, morphological characteristics [6] and kinetic characteristics [7] are highly correlated with the outcome of transplantation.
Time-lapse videos have been widely used to monitor embryos during their cultivation in various reproductive medicine centers [8]. A time-lapse video records the embryonic development process in real time by taking photos of the embryos at short time intervals [9]. Thus, a large amount of time-series image data is produced for each embryo in this process. At the final stage of embryo selection, an embryologist reviews the entire embryo development process to score and sort the embryos. Studies with different time-lapse equipment reported improved prediction accuracy of embryo implantation potential by analyzing the morphokinetics of human embryos at early cleavage stages [8]–[12]. These features have been shown to be statistically significant to the final outcome of the transplantation [7].
There have been only a few approaches to analyzing time-lapse image data [9], [13]–[18]. Due to the limitations of the time-lapse technology, stereoscopic cells at different heights overlap in the images when photographed. It is difficult for even an experienced embryologist to accurately count the number of cells in a single time-lapse image when there are more than eight cells. Therefore, most research has focused on the early development stages of embryos. Wong et al. [9] identified several key parameters that can predict blastocyst formation at the 4-cell stage from time-lapse images, and employed sequential Monte Carlo based probabilistic model estimation to monitor these parameters and track the cells. Wang et al. [13] presented a multi-level embryo stage classification approach, using both hand-crafted and automatically learned embryo features to identify the number of cells in a time-lapse video. Conaghan et al. [14] used an automated and proprietary image analysis software, EEVA™ (Early Embryo Viability Assessment), which exhibited high image contrast through the use of darkfield illumination, to track cell divisions from the one-cell stage to the four-cell stage. Their experiments verified that the EEVA Test can significantly improve embryologists' ability to identify embryos that would develop into usable blastocysts. There are also several other studies on embryo selection using EEVA™ [19]–[22], but they did not provide the details of the EEVA Test used. Jonaitis et al. [15] compared the performance of neural networks, support vector machines and nearest neighbor classifiers in detecting cell division times. Khan et al. [18] used a deep convolutional neural network (CNN) to classify the number of cells, and also semantic segmentation to extract the cell regions in a time-lapse image [16]. Ng et al. [17] combined late fusion networks with dynamic programming (DP) to predict different cell development stages, and obtained better results than a single-frame model.
Multi-task learning has been successfully used in many applications, such as natural language processing [23], speech recognition [24], and computer vision [25]. Its basic idea is to share representations among related tasks, so that each trained model may have better generalization ability [26]. This paper proposes a multi-task deep learning with dynamic programming (MTDL-DP) approach, which first uses MTDL to pre-classify each frame in a time-lapse video into an embryo development stage, and then uses DP to optimize the stage sequence so that the stage number is monotonically non-decreasing, which usually holds in practice. To our knowledge, this is the first study that applies MTDL to embryo early development stage classification from time-lapse videos.
The remainder of this paper is organized as follows: Section II introduces four classification frameworks for time-lapse video analysis. Section III proposes our MTDL-DP approach. Section IV presents the experimental results. Finally, Section V draws conclusions.

II. CLASSIFICATION FRAMEWORKS
This section introduces four frameworks for embryo early development stage classification from time-lapse videos. We first describe our dataset and the baseline network architecture, and then extend it to the many-to-one, one-to-many and many-to-many MTDL frameworks.

A. Dataset
The time-lapse video dataset used in our experiments came from the Reproductive Medicine Center of Tongji Hospital, Huazhong University of Science and Technology, Wuhan, China. It consisted of 170 time-lapse videos extracted from incubators, recorded by an EmbryoScope+ time-lapse microscope system at a 10-minute sampling interval. Each frame in a video is a grayscale 800 × 800 image, with a well number in the lower left corner and a time marker (time after fertilization) in the lower right corner, as shown in Fig. 1. The embryo is surrounded by some granulosa cells in the microscope field. The scale bar in the upper right corner indicates the size of the cells. Each video began about 0-2 hours after fertilization and ended about 140 hours after fertilization. We only used the first N = 350 frames of each video, which were manually labeled with the embryo development stages. Therefore, we had a total of 170 × 350 = 59,500 labeled frames in the experiments.
As in [17], we focused on the first six embryo development stages, which include initialization (tStart), the appearance and breakdown of the male and female pronucleus (tPNf), and the appearance of 2 through 4+ cells (t2, t3, t4, t4+). We counted the number of images in each embryo development stage in the dataset; the summary is shown in Fig. 2. Note that t3 was rarely observed in our dataset.

B. The Baseline One-to-One Classification Framework
Let x_n be the n-th frame in a time-lapse video. For image classification, a standard one-to-one classification framework learns a mapping

f: x_n → y_n,

where y_n ∈ L is the stage label of x_n, and L is the label set of the embryo development stages. When information of the previous and future frames is used, the standard one-to-one classification framework can be extended to the many-to-one, one-to-many and many-to-many MTDL frameworks, as illustrated in Fig. 3.
We used ResNet [27], which won the 2015 ImageNet classification competition, to process individual video frames. Table I shows our baseline ResNet50 model. The input image had three channels (RGB), each with 224×224 pixels (the 800×800 images were resized). The model was initialized with ResNet weights pre-trained on ImageNet [28], which helps reduce overfitting on small datasets.

C. The Many-to-One MTDL Framework
The many-to-one MTDL framework, shown in Fig. 3(b), is frequently used in video understanding [29]–[31], because multiple frames of the same video usually have the same label, and hence they can be considered together to predict the final label. Many-to-one makes better use of input context information than one-to-one.
Many-to-one performs the mapping

f: (x_{n−τ}, ..., x_n, ..., x_{n+τ}) → y_n,

where τ is the number of neighboring frames before and after the current frame (the input context window size is hence 2τ + 1).
There are two common approaches to fuse time domain information from the 2τ + 1 frames: Conv Pooling [32] and Late Fusion [30].
1) Conv Pooling: This is a convolutional temporal feature pooling architecture, which has been extensively used for video classification, especially with bag-of-words representations [33]. Image features are computed for each frame and then max-pooled. The pooled features are then sent to fully connected layers for the final classification. A major advantage of this approach is that the spatial information in multiple frames, output by the convolutional layers, is preserved through a max-pooling operation in the time domain. Experiments [32] verified that Conv Pooling outperformed all other feature pooling approaches on the Sports-1M dataset, using a 120-frame AlexNet model [34].
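The temporal pooling step can be sketched in a few lines (NumPy; the shapes are hypothetical toy dimensions, and in the actual model the pooled map feeds the fully connected layers):

```python
import numpy as np

def conv_pooling(frame_features: np.ndarray) -> np.ndarray:
    """Temporal max pooling over the input context window.

    frame_features: (2*tau + 1, C, H, W) convolutional outputs, one per frame.
    Returns a single (C, H, W) map: the element-wise maximum over time,
    so the spatial layout of the conv features is preserved.
    """
    return frame_features.max(axis=0)

# Toy example: tau = 1 (3 frames), 2 channels, 4x4 spatial maps.
feats = np.arange(3 * 2 * 4 * 4, dtype=float).reshape(3, 2, 4, 4)
pooled = conv_pooling(feats)
print(pooled.shape)  # (2, 4, 4)
```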
2) Late Fusion: In Late Fusion, all frames in the input context window are encoded by identical ConvNets. The final representations after all convolutional layers are concatenated and passed through a fully connected layer to generate the classifications. The concatenation can involve a subset of the frames in the input context window [30], or all frames in that window [17]. Previous research [17] demonstrated that Late Fusion ConvNets using 15 frames and a DP-based decoder outperformed Early Fusion in predicting embryo morphokinetics in time-lapse videos.
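The fusion step itself reduces to a concatenation followed by a learned linear map, which can be sketched as follows (NumPy, with hypothetical dimensions; in the real model the embeddings come from the shared ConvNet and `W`, `b` are trained):

```python
import numpy as np

def late_fusion(frame_features: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Concatenate per-frame feature vectors, then apply one fully connected layer.

    frame_features: (2*tau + 1, D) -- one D-dim ConvNet embedding per frame.
    W: ((2*tau + 1) * D, num_classes) fusion-layer weights; b: (num_classes,) bias.
    Returns class logits of shape (num_classes,).
    """
    fused = frame_features.reshape(-1)  # concatenation along the feature axis
    return fused @ W + b

tau, D, n_classes = 1, 8, 6
feats = np.ones((2 * tau + 1, D))
W = np.zeros(((2 * tau + 1) * D, n_classes))
b = np.arange(n_classes, dtype=float)
print(late_fusion(feats, W, b))  # [0. 1. 2. 3. 4. 5.]
```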

D. The One-to-Many MTDL Framework
One-to-many, shown in Fig. 3(c), maps each input to multiple outputs; this is also called a multi-task net [35] in deep learning. This paper uses hard parameter sharing of hidden layers [26], as illustrated in Fig. 4. The parameters of the convolutional layers are shared among different tasks, whereas those of the fully connected layers are trained separately.
In one-to-many, each x_n is used in classifying the 2τ + 1 stages centered at n, i.e., it learns the one-to-many mapping

f: x_n → (ŷ_{n−τ}, ..., ŷ_{n+τ}).

x_n's classification for the stage at time index t ∈ [n−τ, n+τ] is a probability vector p̂_t(x_n) ∈ R^{|L|×1}. At each frame index n, the corresponding label is thus estimated by the 2τ + 1 neighboring inputs x_t, t ∈ [n−τ, n+τ]. We need to aggregate them to obtain the final classification, which can be done by an ensemble approach.
Because each frame x_n is involved in 2τ + 1 outputs, the total loss on a training frame x_n is computed as the sum of the losses on all involved outputs:

ℓ(x_n) = Σ_{t=n−τ}^{n+τ} w_t ℓ_t(x_n),

where w_t is the weight for the t-th output, and y_t is the true label for Frame t. w_t = 1 and the cross-entropy loss were used in this paper. The cross-entropy loss on the t-th output can be written as

ℓ_t(x_n) = −log p̂_{t,y_t}(x_n),

where p̂_{t,y_t}(x_n) is the y_t-th element of p̂_t(x_n).
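The per-frame multi-task loss above can be sketched as follows (NumPy; `multitask_ce_loss` is an illustrative helper, not code from the paper):

```python
import numpy as np

def multitask_ce_loss(head_probs, labels, weights=None):
    """Total one-to-many loss for one input frame x_n.

    head_probs: (2*tau + 1, |L|) softmax outputs p_t(x_n), one row per output head.
    labels: (2*tau + 1,) true stage y_t for each covered frame index t.
    weights: per-head weights w_t (all ones in the paper).
    Returns sum_t -w_t * log p_{t, y_t}(x_n).
    """
    head_probs = np.asarray(head_probs)
    labels = np.asarray(labels)
    if weights is None:
        weights = np.ones(len(labels))
    picked = head_probs[np.arange(len(labels)), labels]  # p_{t, y_t} per head
    return float(np.sum(-weights * np.log(picked)))

# tau = 1: three heads over a 6-stage label set, hypothetical head outputs.
probs = np.full((3, 6), 0.5)
print(round(multitask_ce_loss(probs, [0, 1, 1]), 4))  # 3 * -log(0.5) = 2.0794
```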

E. The Many-to-Many MTDL Framework
Many-to-many can be viewed as a combination of one-to-many and many-to-one. Each input frame is processed by a separate CNN. Late Fusion was used, and the parameters of the fully connected layers were also trained separately, as shown in Fig. 3(d).

III. MULTI-TASK DEEP LEARNING WITH DYNAMIC PROGRAMMING (MTDL-DP)
This section introduces our proposed MTDL-DP approach.

A. Ensemble Learning for MTDL
As mentioned in Section II-D, a multi-task net has multiple outputs. The easiest approach to obtain the final classification for a specific frame is to choose the middle output of the network. A more sophisticated approach is ensemble learning [36]. We consider two common probabilistic aggregation approaches in this paper: the additive mean and the multiplicative mean.
Let p̂_n(x_t) be the predicted probability vector at frame index n, given frame x_t, t ∈ [n−τ, n+τ], as illustrated in Fig. 5. The ensemble probability p̂_n at frame index n, aggregated by the additive mean, is

p̂_n = (1/(2τ+1)) Σ_{t=n−τ}^{n+τ} p̂_n(x_t).    (6)

If the multiplicative mean is used,

p̂_n = Π_{t=n−τ}^{n+τ} p̂_n(x_t).    (7)

Since each p̂_n(x_t) is a vector, the summation in (6) and the multiplication in (7) are element-wise operations. The final classification label ŷ_n for frame x_n is obtained by probability maximization:

ŷ_n = argmax_{l∈L} p̂_{n,l},

where p̂_{n,l} is the l-th element of p̂_n.
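Both aggregation rules can be sketched in a few lines (NumPy; note that for the final argmax, averaging versus summing the vectors in (6), and normalizing versus not normalizing the product in (7), make no difference):

```python
import numpy as np

def aggregate(preds: np.ndarray, mode: str = "multiplicative") -> int:
    """Ensemble the 2*tau + 1 probability vectors p_n(x_t) for frame index n.

    preds: (2*tau + 1, |L|) -- one probability vector per contributing input.
    Returns the aggregated stage label argmax_l of the ensemble probability.
    """
    if mode == "additive":
        p = preds.mean(axis=0)   # element-wise average, as in (6)
    else:
        p = preds.prod(axis=0)   # element-wise product, as in (7)
    return int(np.argmax(p))

# Three hypothetical outputs for the same frame index over a 2-stage label set.
preds = np.array([[0.6, 0.4],
                  [0.7, 0.3],
                  [0.1, 0.9]])
print(aggregate(preds, "additive"), aggregate(preds, "multiplicative"))  # 1 1
```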

B. Post-processing with DP
The number of cells in the development of an embryo is almost always non-decreasing [37].However, this is not guaranteed in the classification outputs of MTDL.We use DP to adjust the classifications so that this constraint is satisfied.
For each video, the ground-truth stages {y_n}_{n=1}^N form a sequence. At Frame n, MTDL outputs a probability vector p̂_n = [p̂_{n,1}, ..., p̂_{n,|L|}]^T before probability maximization, where p̂_{n,l} is the estimated probability that Frame n is at Stage l. We define E(ŷ, P) as the total loss for an estimated stage sequence ŷ = {ŷ_n}_{n=1}^N, given the model output probability matrix P = [p̂_1, ..., p̂_N]. The total loss is the sum of the per-frame losses, E(ŷ, P) = Σ_{n=1}^N e(ŷ_n, p̂_n), which must be minimized subject to the monotonicity constraint ŷ_{n+1} ≥ ŷ_n, ∀n.
Two common per-frame losses [17] were used. The first is the negative label likelihood (LL):

e_LL(ŷ_n, p̂_n) = −log p̂_{n,ŷ_n}.    (9)

The second is the earth mover (EM) distance:

e_EM(ŷ_n, p̂_n) = Σ_{l∈L} p̂_{n,l} |l − ŷ_n|.    (10)

The final classification stage sequence ŷ* = {ŷ*_n}_{n=1}^N is then obtained as

ŷ* = argmin_{ŷ} Σ_{n=1}^N e(ŷ_n, p̂_n),  s.t. ŷ_{n+1} ≥ ŷ_n, ∀n,    (11)

which can be easily solved by DP, as shown in Algorithm 1.
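Since Algorithm 1 is not reproduced in this excerpt, a minimal sketch of such a DP decoder is given below (NumPy; `monotone_decode` is an illustrative name). It fills cost[n][l] = loss[n][l] + min over l' ≤ l of cost[n−1][l'] using a running prefix minimum, then backtracks:

```python
import numpy as np

def monotone_decode(loss: np.ndarray) -> list:
    """DP decoder: minimize sum_n loss[n, y_n] subject to y_{n+1} >= y_n.

    loss: (N, |L|) per-frame losses for assigning stage l at frame n
          (e.g., the LL or EM per-frame loss computed from the model outputs).
    Returns the optimal non-decreasing stage sequence (0-indexed stages).
    """
    N, L = loss.shape
    cost = np.empty((N, L))
    cost[0] = loss[0]
    for n in range(1, N):
        # best cost of ending frame n-1 at any stage <= l, via prefix minimum
        prefix_best = np.minimum.accumulate(cost[n - 1])
        cost[n] = loss[n] + prefix_best
    # Backtrack: each earlier stage may not exceed the one chosen after it.
    seq = [int(np.argmin(cost[-1]))]
    for n in range(N - 2, -1, -1):
        seq.append(int(np.argmin(cost[n][: seq[-1] + 1])))
    return seq[::-1]

# Per-frame argmax on these probabilities would give [0, 1, 0, 1] -- not monotone.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.1, 0.9]])
print(monotone_decode(-np.log(probs)))  # [0, 1, 1, 1]: the dip at frame 2 is smoothed
```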

C. MTDL-DP
Our proposed MTDL-DP consists of three steps: 1) construct a multi-task net with the one-to-many or many-to-many MTDL framework; 2) use the multiplicative mean to aggregate the predictions of the multi-task net; and, 3) post-process with DP using the EM distance per-frame loss. Its pseudocode is given in Algorithm 2.
The one-to-many MTDL framework can also be replaced by the many-to-many MTDL framework.

IV. EXPERIMENTAL RESULTS
This section investigates the performance of our proposed MTDL-DP.

A. Experimental Setup
We created training/validation/test partitions by randomly selecting 70%/10%/20% of the videos from the dataset, i.e., 41,650/5,950/11,900 frames, respectively. We resized each frame to 224×224 so that it could be used by ResNet50, our baseline model. Random rotation and flip data augmentation were used. All MTDL frameworks were initialized with the weights trained by one-to-one (ResNet50). Then, the convolutional layer parameters were frozen, and the fully connected layers were further tuned.
We used the cross-entropy loss function, the Adam optimizer [38], and early stopping to reduce overfitting, in all experiments. The multiplicative mean and the EM distance per-frame loss were used in MTDL-DP. All experiments were repeated five times, and the mean results are reported.

B. Classification Accuracy
First, we considered MTDL only, without DP. The classification accuracies are shown in the left panel of Table II, with τ ∈ {1, 4, 7} (the context window size was 2τ + 1). All MTDL frameworks outperformed the one-to-one framework, suggesting that using neighboring input or label information in multi-task learning was indeed beneficial.
For the many-to-one MTDL framework, when τ increased, the performance of Late Fusion also increased, whereas the performance of Conv Pooling decreased. This is intuitive, because more input information was ignored in Conv Pooling when τ increased. The classification accuracies with DP post-processing are shown in the right panel of Table II. Post-processing increased the classification accuracies for all classifiers and all τ; e.g., the five classifiers achieved 2.3%, 1.2%, 2.1%, 1.5%, and 2.0% accuracy improvements at τ = 1, respectively. However, as τ increased, the improvements became less obvious. After post-processing, the many-to-many and one-to-many frameworks had higher accuracies than the many-to-one framework, and only many-to-many consistently outperformed one-to-one for all τ, suggesting that post-processing may be more beneficial when more input and output information is utilized.

C. Root Mean Squared Error (RMSE)
We also computed the root mean squared error (RMSE) between the true video label sequences and the classifications. The RMSEs without DP post-processing are shown in the left panel of Table III. All MTDL frameworks had lower RMSEs than the one-to-one framework, suggesting again that using neighboring input or label information in multi-task learning was beneficial.
The results after DP post-processing are shown in the right panel of Table III. DP post-processing reduced the RMSE for all MTDL frameworks and all τ, suggesting that DP was indeed beneficial. Though all MTDL frameworks outperformed the one-to-one framework only at τ = 1, the many-to-many framework consistently outperformed one-to-one for all τ.

D. Training Time
The training times of the different models, averaged over five runs, are shown in Table IV. The training times of the many-to-one and many-to-many MTDL frameworks increased about linearly with the input context size; however, the training time of the one-to-many MTDL framework was insensitive to τ, which is an advantage.

E. Comparison of Different Ensemble Approaches
We also compared the performance of the different ensemble approaches introduced in Section III-A, without DP post-processing. The CNN models were constructed using the one-to-many and many-to-many MTDL frameworks. The results are shown in Figs. 6 and 7. Both the additive mean and the multiplicative mean achieved performance improvements, and the multiplicative mean slightly outperformed the additive mean. As τ increased, the performance of the many-to-many MTDL framework improved. The one-to-many MTDL framework performed best at τ = 4.

F. Comparison of Different Losses in DP Post-Processing
Next, we studied the effect of different per-frame losses in DP post-processing. The RMSEs for different τ and different MTDL frameworks are shown in Fig. 8. The EM loss always gave smaller RMSEs than the LL loss.
The true stage labels, and the classified labels before and after DP, in two time-lapse videos are shown in Fig. 9. Clearly, DP smoothed the classifications, and its outputs were closer to the ground-truth labels. The confusion matrices for the one-to-many MTDL framework, using the multiplicative mean and τ = 1, are shown in Fig. 10(a) before DP post-processing and in Fig. 10(b) after DP post-processing. The diagonal shows the classification accuracy of each individual cell stage. Post-processing improved the accuracy of all embryonic stages except t3, whose classification accuracy before DP (16%) was much lower than the others. There may be two reasons for this: 1) Stage t3 had far fewer training examples in our dataset (see Fig. 2), and hence was not trained adequately; and, 2) the low accuracy of t3 may also be due to multipolar cleavages from the zygote stage, which occur in 12.2% of human embryos [39].
V. CONCLUSION

Accurate classification of embryo early development stages can provide embryologists with valuable information for assessing embryo quality, and hence is critical to the success of IVF. This paper has proposed an MTDL-DP approach for automatic embryo development stage classification from time-lapse videos. In particular, the one-to-many and many-to-many MTDL frameworks performed the best. Considering the trade-off between training time and classification accuracy, we recommend the one-to-many MTDL framework in MTDL-DP, because it achieves performance comparable to the many-to-many MTDL framework at much lower computational cost.
To our knowledge, this is the first study that applies MTDL to embryo early development stage classification from timelapse videos.

Fig. 2. Percentage of frames in different embryo development stages.

Fig. 3. Different classification frameworks: (a) one-to-one; (b) many-to-one; (c) one-to-many; (d) many-to-many. The convolutional layers are denoted by 'C'. Blue and red rectangles denote the flatten layer and the max-pooling layer, respectively. Orange rectangles denote the fully connected and softmax layers.

Algorithm 2: MTDL-DP

Input: N, the number of frames in a time-lapse video; D, a set of labeled time-lapse videos; {x_n}_{n=1}^N, the frames to be labeled; τ, the number of left and right neighboring frames in the context window.
Output: ŷ*, the labeled stage sequence.

1) Use the one-to-one framework to train a baseline model f_0 from D.
2) Initialize an MTDL model whose convolutional layer parameters are identical to those of f_0.
3) Fine-tune the fully connected layer parameters of the MTDL model on D.
4) For n = 1, ..., N: use the MTDL model to compute p̂_t(x_n), t = n−τ, ..., n+τ.
5) For n = 1, ..., N: compute p̂_n by (7), and the per-frame loss e_EM(ŷ_n, p̂_n) in (10).
6) Solve for ŷ* in (11) by Algorithm 1.
7) Return the optimized stage sequence ŷ*.

TABLE II: CLASSIFICATION ACCURACIES FOR DIFFERENT CLASSIFICATION FRAMEWORKS AND τ, BEFORE AND AFTER DP POST-PROCESSING.

TABLE III: RMSES FOR DIFFERENT CLASSIFICATION FRAMEWORKS AND τ, BEFORE AND AFTER DP POST-PROCESSING.

TABLE IV: TRAINING TIME FOR DIFFERENT CLASSIFICATION FRAMEWORKS AND τ.