Current Progress and Challenges in Large-Scale 3D Mitochondria Instance Segmentation

In this paper, we present the results of the MitoEM challenge on mitochondria 3D instance segmentation from electron microscopy images, organized in conjunction with the IEEE-ISBI 2021 conference. Our benchmark dataset consists of two large-scale 3D volumes, one from human and one from rat cortex tissue, which are 1,986 times larger than previously used datasets. At the time of paper submission, 257 participants had registered for the challenge, 14 teams had submitted their results, and six teams participated in the challenge workshop. Here, we present eight top-performing approaches from the challenge participants, along with our own baseline strategies. After the challenge, annotation errors in the ground truth were corrected without altering the final ranking. Additionally, we present a retrospective evaluation of the scoring system, which revealed that 1) the challenge metric was permissive toward false positive predictions; and 2) size-based grouping of instances did not correctly categorize mitochondria of interest. Thus, we propose a new scoring system that better reflects the correctness of the segmentation results. Although several of the top methods compare favorably to our own baselines, substantial errors remain unsolved for mitochondria with challenging morphologies. Thus, the challenge remains open for submission and automatic evaluation, with all volumes available for download.

only crucial to basic neuroscience research, e.g., neuron type identification [1], but also informative to clinical studies, e.g., bipolar disorder [2] and diabetes [3]. High-resolution imaging technologies like electron microscopy (EM) have been used to reveal their detailed 3D geometry at the nanometer level with the terabyte data scale [4]. Consequently, to enable an in-depth biological analysis, we need high-throughput and robust 3D mitochondria instance segmentation methods. Publicly accessible datasets that can exemplify the challenges are also of essential importance for understanding the empirical gain of segmentation approaches in this field.
The goal of this study is to (1) analyze the current progress in the mitochondria segmentation task based on the results of the Large-scale 3D Mitochondria Instance Segmentation challenge (MitoEM)¹ at the IEEE International Symposium on Biomedical Imaging (ISBI) 2021, and (2) present an in-depth analysis of the state-of-the-art evaluation metrics for identifying mitochondria instance segmentation errors, which reveals the difficulties of the current approaches and can guide the creation of the next generation of mitochondria segmentation models. To the best of our knowledge, MitoEM was the first open comparison of mitochondria instance segmentation algorithms on EM volumes. Moreover, we describe the associated annotated dataset of two 3D EM image stacks at the scale of 32.8 × 32.8 × 30 µm, which are freely available from the challenge website and are two of the few large-scale 3D image volumes suitable for testing instance segmentation algorithms.
A. Previous Works
1) Mitochondria Segmentation Datasets: The de facto benchmark dataset for evaluating methods of mitochondria segmentation from EM images is the EPFL Hippocampus dataset [5], referred to as the Lucchi dataset in this paper. This dataset includes two EM image volumes along with corresponding binary segmentation masks. Subsequently, Kasthuri et al. [6] provided annotation for mitochondria masks for selected regions within a 3-cylinder volume. Additionally, Casser et al. [7] improved the annotation quality for both datasets through the implementation of a consistent annotation protocol for mask boundaries. Despite these efforts, the datasets remain small in size, less than 0.3 Gigavoxels and (5 µm)³ physically, which does not adequately capture the complexity of mitochondria morphology. The complete image stack measures 2048 × 1536 × 1065 voxels, yet only approximately 35% of it was manually annotated, comprising two sub-volumes, each with dimensions of 1024 × 768 × 165 voxels. Furthermore, the provided binary masks are not easily converted into instance segmentation masks, which are necessary for detailed biological analysis as the instances of mitochondria can be connected to each other.
2) Instance Segmentation Evaluation Metrics: The evaluation of instance segmentation results can be done at either the pixel level or the instance level. The pixel-level metrics assume high-quality ground truth instance masks and measure the correctness of the pixel grouping with a clustering-based criterion, such as the Rand index [8]. However, as dataset sizes grow, it becomes increasingly difficult to manually refine all masks for pixel-level accuracy. As a result, instance-level metrics are more commonly used for large-scale datasets. For each predicted instance mask, if its intersection-over-union (IoU) score with a ground truth mask is higher than a predefined threshold, it is considered a true positive. Conversely, predictions that fall below the IoU threshold are considered false positives, while ground truth instances without a matching true positive prediction are considered false negatives. For biomedical image datasets, metrics based on true positives, false positives, and false negatives, such as accuracy, are widely used in the literature [9], [10], [11]. In the case of natural 2D images, popular methods, such as Mask R-CNN-based approaches, typically predict the confidence for each instance detection, and the average precision (AP) metric is used to average results over different detection thresholds [12], [13]. In addition, instances are usually divided into small/medium/large groups for separate evaluations. Wei et al. [14] provided an efficient implementation of the AP metric for instances inside 3D volumes. To further break down the analysis of instance matching results, Ka et al. [15] proposed association metrics, categorizing them into one-to-one, over-segmentation, under-segmentation, many-to-many, missing, and background. In summary, the combination of these metrics and categories allows for a comprehensive evaluation of instance segmentation methods in the context of biomedical imaging applications. While these metrics are often used individually, their collective utilization provides a more thorough assessment of performance.
3) Machine Learning Methods: Despite the advances in large-scale instance segmentation for neurons from EM images [16], [17], similar efforts for mitochondria have been largely overlooked in the field. The lack of a large-scale, public dataset has led to the majority of recent mitochondria (semantic) segmentation methods being benchmarked on the Lucchi dataset [5], where mitochondria instances are small in number, simple in morphology, and relatively sparse in distribution. Even in non-public datasets [18], [19], the complexity of mitochondrial shapes is limited by the small size of the dataset and the use of non-mammalian tissue. In the field of mitochondria semantic segmentation, previous studies have employed a variety of techniques to segment the Lucchi dataset. Early works have leveraged traditional image processing and machine learning techniques [20], [21], [22], [23], while recent methods made use of 2D or 3D deep learning architectures for mitochondria segmentation [7], [24], [25], [26]. Furthermore, Liu et al. [27] proposed an instance segmentation approach by means of a modified Mask R-CNN [28], while Xiao et al. [29] achieved instance segmentation through a tracking approach. However, it remains uncertain how the performance of these methods, developed on small datasets, would extend to larger datasets (e.g., a (30 µm)³ cube) for neuroscience analysis, where mitochondria exhibit more complex variations in appearance and shape.

A. Dataset
The basis for this challenge is our previously released large-scale 3D mitochondria instance segmentation benchmark, known as the MitoEM dataset [14]. The MitoEM dataset comprises two 3D EM image stacks, each measuring 32.8 × 32.8 × 30 µm in size, with a voxel dimension of 8 × 8 × 30 nm. These image stacks originate from distinct sources, one from adult rat brain tissue (MitoEM-R) and the other from adult human brain tissue (MitoEM-H). Notably, the MitoEM dataset represents a substantial increase in scale, being approximately 1,986 times larger² than the previous Lucchi benchmark [5]. From the 1,000 consecutive slices of each stack, ground truth mitochondria instance labels were provided for the first 500 slices and split into training (400 slices) and validation (100 slices) subsets. The annotations of the remaining 500 slices of each volume were kept private and used as the test set. For information regarding the dataset acquisition and annotation strategy, we refer readers to Wei et al. [14].
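As a rough sanity check of the 1,986× figure, the ratio can be approximated from physical volumes. The sketch below assumes one MitoEM stack of 4096 × 4096 × 1000 voxels at 8 × 8 × 30 nm and one annotated Lucchi sub-volume of 1024 × 768 × 165 voxels at an isotropic 5 nm voxel size. The 5 nm value and the choice of comparing a single stack against a single sub-volume are assumptions not stated in this section, and the helper name is illustrative only.

```python
def physical_volume_um3(shape_vox, voxel_nm):
    """Physical volume in cubic micrometers of a voxel grid given as (x, y, z)."""
    volume = 1.0
    for n_vox, size_nm in zip(shape_vox, voxel_nm):
        volume *= n_vox * size_nm / 1000.0  # extent along one axis, nm -> um
    return volume


# Assumed sizes (see the lead-in): one MitoEM stack vs. one Lucchi sub-volume.
mitoem = physical_volume_um3((4096, 4096, 1000), (8, 8, 30))
lucchi = physical_volume_um3((1024, 768, 165), (5, 5, 5))
ratio = mitoem / lucchi

print(f"MitoEM: {mitoem:.1f} um^3, Lucchi: {lucchi:.2f} um^3, ratio: {ratio:.0f}")
```

Under these assumptions the ratio rounds to the reported value; with a different voxel size or a voxel-count comparison, the figure would differ.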
1) Improved Annotation (V2): After the initial release of the MitoEM dataset, we identified three consistent categories of annotation errors (as depicted in Fig. 1). These errors include instances of organelles with a similar dark appearance that were mistakenly labeled as mitochondria, instances of neighboring mitochondria that were falsely merged into a single mitochondrion, and instances of mitochondria-on-a-string (MOAS) [30] that were occasionally incomplete due to their thin microtubule connections. According to the findings reported by Zhang et al. [30], MOAS have been identified as a novel phenotype that exhibits increased prevalence during disease progression and the accumulation of mutations in both rat and human brain analyses. Therefore, accurate segmentation of this type of mitochondria, without inducing any splitting, holds significant importance in understanding its role and implications in cellular biology.
To rectify these annotation errors, we engaged the expertise of three neuroscience specialists with in-depth knowledge of EM images and mitochondria morphology. Each expert independently scrutinized the previous annotations, meticulously comparing them to the visual information depicted in the EM images. In cases where discrepancies or differences of opinion arose among the experts, collaborative discussions were held to resolve these issues and reach a consensus. Through this collaborative effort, we consolidated the necessary changes, resulting in a more accurate and reliable ground truth for the challenge.
² Calculated by comparing the size of the MitoEM dataset with that of the Lucchi benchmark.
Consequently, the number of confirmed instances in the MitoEM-H dataset was reduced from 24.5K to 19K, while in the MitoEM-R dataset, it was reduced from 14.4K to 10.8K. These revised annotations were subsequently updated and uploaded to the Grand Challenge platform in December 2021, and all participating methods were re-evaluated accordingly. Notably, despite the modifications in the annotations, the overall rankings on the leaderboard remained largely unaltered.

B. Evaluation Metric
In our initial release of the challenge, we used the evaluation metric proposed by Wei et al. [14], which computes the AP-75 score for small/medium/large groups of instances based on the instance size. However, upon conducting an analysis of the errors in the challenge submissions, we recognized the need to make certain improvements to the evaluation metric.
1) Improved Metric: From AP to Accuracy: We found that the AP-based metrics that were originally designed for top-down instance segmentation methods, such as Mask R-CNN [28], are not well-suited for our challenge. In our case, most submitted methods employed a bottom-up approach for instance segmentation, in which there is no estimation of the detection confidence score for each instance. To address this issue, Wei et al. [14] approximated the confidence score with the size of the instance, which can lead to unintuitive evaluation results, as further discussed in Section IV. After careful consideration, we decided to adopt the popular accuracy metric [10] for evaluating the challenge submissions. This metric matches predicted instances with ground truth instances, providing a more intuitive evaluation of the methods' performance.
2) Improved Instance Grouping: From Volume to Cable Length: In our initial release of the challenge, we used a splitting rule based on the volume to categorize mitochondria instances into small, medium, and large groups. However, we noticed that this approach was not effective for correctly categorizing complex mitochondria instances, such as the MOAS. For that reason, we opted for the cable length³ instead, using length thresholds of 1 µm and 4 µm to split the mitochondria into three groups: small, medium, and large (as in the original MitoEM release). Under this new categorization, the numbers of small, medium, and large mitochondria instances are, respectively, 5106, 3608, and 164 in MitoEM-H and 1292, 3832, and 524 in MitoEM-R. A visualization of the mitochondria of each new split is depicted in Fig. 2. Notably, all instances classified as MOAS are now grouped within the large category, aligning with our previous expectations. A quick inspection reveals that (1) the human tissue contains many more small mitochondria than the rat tissue, and (2) the large mitochondria from the human tissue are notably thinner than those of the rat tissue. It is important to note that these differences between human and rat tissues may not be generally extrapolated, as making such statements would require additional samples and references to establish reference ranges.
All these changes in the evaluation became effective in July 2022 on the Grand Challenge platform. In contrast to the improved annotation, these modifications resulted in significant alterations in the leaderboard, as described in Section V.

C. Organization
The challenge was accepted to ISBI 2021 in October 2020 and officially announced in November 2020. This announcement was accompanied by the creation of a dedicated website and the preparation of an evaluation system. The two image volumes, MitoEM-R and MitoEM-H, were made immediately available to participants to enable them to begin developing their methods. Participants performed the segmentation on their own computers. The challenge was widely advertised and was open to any interested participants. A total of 257 individuals registered for the challenge and 14 teams submitted their results. For comparison, we also used two "internal submissions" corresponding to our 2D and 3D baseline methods (U2D-BC and U3D-BC, Section III-A). To lower the barrier of entry for the challenge, an initial version of the code of U3D-BC was made publicly available. The teams were also asked to submit a description of their method. Eight teams were invited to a workshop on April 13, during the ISBI 2021 conference, and to participate in the writing of this article. The winners of the challenge were announced at this workshop.
Some of the teams that participated in the challenge did not register for the conference or participate in the workshop. However, six teams did submit short papers and presented their methods. The results announced at the workshop (ranked using the original AP-75 metric) are given in Table V in the Appendix. Those results may be based on updated submissions. Since the workshop, the challenge has remained open to submissions, and all image volumes, as well as their ground truth labels, are available for download. The testing labels continue to be confidential.

III. SUMMARY OF SEGMENTATION METHODS
In this section, we present our proposed baseline methods together with the evaluated segmentation methods submitted by the eight teams who successfully completed the challenge. An overview of the principal components of each method, including our baseline methods, can be found in Table I. Detailed information about the algorithms employed by each team is provided in their respective manuscripts, which were submitted according to the MitoEM challenge policies. These manuscripts are available on the challenge webpage under the "manuscripts" tab.

A. Open-Source Baseline Methods
To enhance the accessibility of our challenge, we have developed and released two open-source baseline methods: U2D-BC and U3D-BC. These methods are designed to handle 2D and 3D input images, respectively. Both approaches make use of a U-Net-based architecture [36] to predict binary foreground segmentation masks and instance contour masks (referred to as BC in the methods' name). Following the prediction, the two outputs are thresholded and combined. Next, a connected components operation is applied to generate distinct, non-touching mitochondria instance seeds. Subsequently, a marker-controlled watershed algorithm [37] is applied, using three key components: 1) the inverted foreground probabilities as the input image (representing the topography to be flooded), 2) the generated instance seeds as the marker image (defining starting points for the flooding process), and 3) a binarized version of the foreground probabilities as the mask image (constraining the extent of object expansion). The collective implementation of these components facilitates the creation of individual mitochondria instances (see Fig. 3 for a visual representation).
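The post-processing pipeline described above can be sketched as follows. This is a minimal illustration built on SciPy and scikit-image, not the BiaPy or PyTorch Connectomics implementation. The function name `bc_to_instances`, the default threshold values, and the specific way the two thresholded outputs are combined (confident foreground that is not contour) are assumptions made for the example.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed


def bc_to_instances(fg_prob, contour_prob,
                    seed_thr=0.9, contour_thr=0.5, fg_thr=0.5):
    """Turn foreground (B) and contour (C) probabilities into instance labels.

    fg_prob, contour_prob: float arrays of identical shape with values in [0, 1].
    Threshold defaults are illustrative, not the values used in the challenge.
    """
    # Seeds: confident foreground voxels that are not on a predicted contour.
    seed_mask = (fg_prob > seed_thr) & (contour_prob < contour_thr)
    # Connected components yield distinct, non-touching instance seeds.
    seeds, _ = ndi.label(seed_mask)
    # Binarized foreground limits how far each seed may grow.
    mask = fg_prob > fg_thr
    # Marker-controlled watershed on the inverted foreground probabilities.
    return watershed(-fg_prob, markers=seeds, mask=mask)


# Toy volume: one foreground block split in two by a predicted contour band.
fg = np.zeros((3, 8, 16), dtype=float)
fg[:, 2:6, 2:14] = 0.95
contour = np.zeros_like(fg)
contour[:, 2:6, 7:9] = 0.9
instances = bc_to_instances(fg, contour)
print("labels found:", np.unique(instances))
```

In this toy case, the contour band separates the block into two seeds, and the watershed then reassigns the contour voxels to one of the two instances so that the full foreground mask is covered.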
1) U2D-BC: The core architecture of the U2D-BC method consists of a 5-level U-Net. The initial level of the U-Net comprises 16 filters, which are doubled in each subsequent level. Dropout regularization is applied in each block, with the dropout rate gradually increasing from 0.1 to 0.3 (in the bottleneck layer), and then decreasing back to 0.1 in the upsampling layers. Exponential linear units (ELU) [38] are employed as activation functions. To perform upsampling in the decoder, transposed convolutions are used, following the approach proposed in [26]. The model was trained using an input size of 256 × 256 and optimized until convergence, approximately 180 epochs, over a reduced version of the dataset (20% of training data) with stochastic gradient descent (SGD) using a fixed learning rate of 0.002. The reduced training set was created by selecting one image every four slices along the z-axis. This decision was based on the repetitive nature of the slices along this axis, where no significant 3D information would be lost in the process. The intention behind using a smaller training dataset for the U2D-BC model was to expedite the training process and provide a baseline 2D network for competitors to build upon. We further applied median filtering in the y-z axes to improve the network output predictions. The model was implemented with BiaPy [39] and can be reproduced based on the tutorial provided by the challenge organizers.⁴
2) U3D-BC: The U3D-BC method utilizes a 5-level residual U-Net architecture, inspired by Lee et al. [40]. The model incorporates batch normalization as a regularization technique and employs ELUs as activation functions. Transposed convolutions are used for upsampling. To account for the anisotropy of the datasets, the model is trained with an input size of 225 × 225 × 17 for the x, y, and z axes. Notably, feature maps are not downsampled or upsampled along the z-dimension, and each residual block consists of a combination of 2D and 3D convolutions [40]. The model was optimized for 150 epochs over the entire training data, with an initial learning rate of 0.04 and cosine learning-rate scheduling [41]. We also applied Gaussian blending and test-time augmentations (self-ensemble) to improve the prediction quality. The model was implemented with PyTorch Connectomics [42] and can be reproduced based on the tutorial provided by the challenge organizers.⁵
In comparison to our previous work [14], we made improvements to the implementation details in order to achieve superior results. Specifically, we have incorporated a number of additional data augmentation techniques, including misalignment (which simulates xy-plane displacement during data acquisition in microscopes), CutBlur [43], CutNoise, and motion-blur. These augmentations supplement the brightness, flip, elastic transform, and missing parts augmentations used in the original MitoEM paper [14]. Furthermore, we increased the probability and intensity of all augmentations to enhance the robustness of the models.
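For readers unfamiliar with cosine learning-rate scheduling, the following sketch shows one common form of the schedule using the U3D-BC settings reported above (initial rate 0.04, 150 epochs). The per-epoch update, the absence of warm restarts, and the final rate of zero are assumptions; the actual PyTorch Connectomics configuration may differ.

```python
import math


def cosine_lr(epoch, total_epochs=150, base_lr=0.04, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (no warm restarts assumed)."""
    # Clamp the progress to [0, 1] so out-of-range epochs stay well defined.
    progress = min(max(epoch, 0), total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 75, 150):
    print(epoch, round(cosine_lr(epoch), 6))
```

The rate starts at the initial value, reaches half of it at the midpoint, and decays smoothly to the minimum at the last epoch.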

B. Participants' Methods
The following methods by the participating teams produced successful results that were submitted to the challenge. Note that the method names used here may differ from the team names found on the MitoEM webpage.
• VIDAR (USTC)⁶: […] restored using the trained interpolation network before performing segmentation. Finally, the semantic masks and instance boundaries are used to synthesize a 3D affinity volume, which enables hierarchical agglomeration [46] for extracting individual instances.
• IIPPR (SJTU)⁷: The submissions of this team were based on the U3D-BC baseline method provided by the challenge organizers. The main difference from the original U3D-BC configuration is the input size of 256 × 256 × 32 with an overlap of 128 × 128 × 16. To separate touching instances, ground truth masks were eroded using a 3 × 3 kernel, while instance contours were extracted through morphological erosion with a 7 × 7 kernel. For MitoEM-H, they used threshold values of 0.7, 0.6, and 0.6 to extract seeds, instance contours, and foreground, respectively. For MitoEM-R, they used threshold values of 0.85, 0.6, and 0.8. Moreover, they removed all objects with fewer than 1024 voxels based on the fact that all mitochondria instances in the challenge volumes are at least 2000 voxels in size.
• VGG (NEL-BITA)⁸ [33]: This team used a contrastive learning [47], [48] framework, employing a representative voxel sampling strategy and a loss function that includes a voxel-wise similarity term to increase the similarity of voxels from the same class and the separability of voxels from different classes. Additionally, an inter-frame consistency term is included to enhance the sensitivity of the 3D model to changes in image content from frame to frame. The backbone network used is a classic 3D U-Net [44], which outputs binary masks and boundary maps. A marker-controlled watershed [37] algorithm is then applied to extract the final instances. Feature maps are extracted from the last two layers of the backbone decoder to capture voxel features and build positive and negative pairs based on their classes. This enables the use of contrastive learning to maximize the similarity between feature vectors of the same class while minimizing the similarity between feature vectors of different classes. Similarly, the consistency loss term is designed to enhance the feature similarity between voxels belonging to the same class at the same position in adjacent slices and contrastively decrease the similarity between voxels from different classes.
⁷ R. Xin, H. Liu, H. Chen.
⁸ Z. Li, J. Zhao, X. Chen.
• EMBL (Heidelberg)⁹: This team used a 5-level 3D U-Net [44]. The network predicts foreground probabilities and long-range affinity maps [40], specifically targeting pixel distances of 1, 3, and 9 along the x- and y-axes, and 1, 2, and 3 along the z-axis, taking into account the dataset's anisotropy. For the same reason, no pooling is performed across the z-axis in the first two pooling layers of the 3D U-Net. The network is trained using the Dice score as the loss function. To obtain an instance segmentation from the foreground and affinity predictions, Mutex Watershed [49] is applied in parallel on the predictions of subvolumes of the entire dataset. To segment only mitochondria, the segmentation algorithm is applied solely to the foreground mask obtained by thresholding the foreground probability predictions at 0.5. Finally, the whole-volume instance segmentation is obtained by solving a Multicut clustering problem [50].
• CEM-PDL (NIH)¹⁰: A Panoptic-DeepLab model [51] with a ResNet50 [52] backbone is trained to perform instance segmentation in 2D slices. More specifically, the model has three outputs: semantic masks, instance centers, and instance center regressions (offset from each pixel to its corresponding center). Instances are obtained by simple post-processing (assigning each pixel to the closest predicted center). The backbone network uses weights pre-trained on CEM500K [53], a large dataset of EM images. Training is performed on patches of size 512 × 512 and the inference is applied to the full-size image (4096 × 4096). Several post-processing methods are used, including z-filtering, 2D watershed to split false mergers, and the Hungarian algorithm [54] and the Intersection-over-Area merging strategy to merge false splits. This method has been further developed since submission to the MitoEM challenge into an open-source model called MitoNet [34].
• FCI (London)¹¹: […] stacks. Due to memory constraints, one training stack was presented per epoch. An initial model was trained to predict binary masks on both human and rat data, which was then fine-tuned on MitoEM-R and subsequently on MitoEM-H. The final weights from the MitoEM-R binary mask model served as the initial weights for the MitoEM-R boundary prediction model, and the weights from this model were used as initialization for the model predicting boundaries in MitoEM-H. To improve boundary predictions, the team combined predictions from all three views of the volumes after reslicing the data in the xz and yz planes and interpolating the z-scale from 30 nm to 10 nm, resulting in a voxel size of 8 × 8 × 10 nm [56]. Individual instances were then extracted using marker-controlled watershed [37] after creating seeds by subtracting the boundary masks from the binary masks.
• ABCS (FNL)¹²: The ABCS team trained two simplified 3D versions of the original U-Net architecture [36] […] Training and validation samples were randomly extracted from the corresponding quadrants at runtime and fed to the GPU. To account for boundary conditions, the original volumes were padded with blank pixels along all three axes. Basic data augmentation techniques, including flipping in all three axes, were applied during training.
During inference, ensemble prediction with patch overlap was performed. Blank paddings were added to the two testing volumes as required by the sample extraction process. The combined segmentation outputs from both trained networks showed slightly better performance compared to the individual networks.
• H2RNet (Zurich)¹³: The H2RNet method is a hybrid instance segmentation approach that combines 2D and 3D processing. Initially, the method performs a segmentation on individual 2D slices of the volume using a modified HRNet [57] network. The HRNet has two heads, one for predicting the energy surface and the other for estimating the curvature of mitochondria in each 2D slice. The outputs of both heads are fused to obtain the final prediction. The input size of the network is set to 256 × 256, but during inference, a patch size of 1024 × 1024 is used to minimize border effects or artifacts from reassembled image crops. For training, a weighted mean-square-error loss function is employed. The weights are determined based on the frequency of a given value for the energy head, and bending loss [58] is used to compute the weights for the contour head.
Unlike some other methods, H2RNet does not require watershed post-processing in 2D, as a cut-off from the learned surface energy is used as a hyper-parameter in the prediction. Due to computational limitations, the 2D predictions are downsampled before applying marker-controlled watershed [37] in 3D to connect regions across sections. The connected regions are then upsampled using nearest-neighbor interpolation.
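As an illustration of one step mentioned in the list above, the grouping used in the CEM-PDL entry (assigning each pixel to the closest predicted center) can be sketched in a few lines of NumPy. This is not the team's code: the function name `group_pixels`, the array layouts, and the convention that the regression output stores (dy, dx) offsets pointing from each pixel to its center are assumptions.

```python
import numpy as np


def group_pixels(semantic_mask, centers, offsets):
    """Assign each foreground pixel to its nearest predicted instance center.

    semantic_mask: (H, W) boolean foreground mask.
    centers:       (K, 2) array-like of predicted (y, x) center coordinates.
    offsets:       (2, H, W) array with per-pixel (dy, dx) offsets to the center.
    Returns an (H, W) int32 label image with 0 as background.
    """
    labels = np.zeros(semantic_mask.shape, dtype=np.int32)
    centers = np.asarray(centers, dtype=float).reshape(-1, 2)
    ys, xs = np.nonzero(semantic_mask)
    if centers.shape[0] == 0 or ys.size == 0:
        return labels
    # Each pixel "votes" for the location it believes its center is at.
    voted = np.stack([ys + offsets[0, ys, xs], xs + offsets[1, ys, xs]], axis=1)
    # Squared distance from every vote to every predicted center: (P, K).
    dist2 = ((voted[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels[ys, xs] = dist2.argmin(axis=1) + 1
    return labels


mask = np.ones((4, 6), dtype=bool)
centers = [(1, 1), (1, 4)]
zero_offsets = np.zeros((2, 4, 6))
print(group_pixels(mask, centers, zero_offsets))
```

The pairwise distance matrix keeps the sketch short but scales with the number of pixels times the number of centers, so a full 4096 × 4096 slice would likely need chunking or a spatial index.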

IV. HITS AND MISSES OF CURRENT EVALUATION METRICS
Instead of using a single set of metrics, as is common practice in the literature of 3D instance segmentation of biomedical images, we decided to perform an in-depth analysis of the most commonly used metrics to create a compact and informative error report to debug 3D instance segmentation methods. Moreover, this analysis allowed us to decide on the optimal metric on which to base the final challenge ranking.

A. State-of-the-Art 3D Instance Segmentation Metrics
There are three commonly used sets of metrics for the 3D instance segmentation task: AP-based, matching-based, and association-based.
1) AP-Based Metric: The AP metric [12] relies on the calculation of other metrics such as IoU, precision, and recall. Let an instance be a set of pixels/voxels belonging to an object; then the IoU measures the overlap between two instances (A, B) and can be calculated as
$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|},$$
where $|\cdot|$ denotes the number of pixels (in 2D) or voxels (in 3D). Precision and recall are then defined as
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
where the true positive (TP), false positive (FP), and false negative (FN) values depend on a predefined IoU threshold value (to consider two overlapping instances the same) and a probability confidence threshold (to consider the instance a mitochondrion). More specifically, a predicted instance is considered a TP if its IoU value with a ground truth instance is larger than the IoU threshold value; otherwise, it is considered a FP. Moreover, ground truth instances without matching predictions are considered FN. For a fixed IoU threshold, a precision-recall curve can be created for a set of different confidence threshold values. The AP is the area under the precision-recall curve:
$$AP = \int_{0}^{1} p(r)\, dr,$$
where $p$ is precision and $r$ is recall. Ideally, the precision-recall curve decreases monotonically, since increasing recall tends to decrease precision. In practice, this does not hold consistently, and the curve shows a zigzag pattern. Hence, the precision at each recall level $r$ is interpolated by taking the maximum precision measured at any recall that equals or exceeds $r$:
$$p_{\mathrm{interp}}(r) = \max_{\tilde{r} \geq r} p(\tilde{r}),$$
where $p(\tilde{r})$ is the measured precision at recall $\tilde{r}$. Then, the AP is commonly approximated as the mean interpolated precision at a set of eleven equally spaced recall levels (from 0.0 to 1.0 with 0.1 increments):
$$AP \approx \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{\mathrm{interp}}(r).$$
In the present scenario, we employ AP-75, i.e., the AP computed with a minimum IoU of 75% required to classify a detection as a TP (following [14]); otherwise, the ground truth instance is considered a FN and the prediction a FP. In this manner, the IoU is calculated at the pixel level, whereas TP, FP, and FN are defined at the instance level.
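The eleven-point approximation of the AP can be illustrated with a short NumPy sketch. It assumes that precision and recall have already been measured at several confidence thresholds, that the interpolated precision is the maximum precision over all points whose recall is at least the queried level, and that levels with no such point contribute zero. The function name `ap_11_point` is illustrative, and this is not the evaluation code of Wei et al. [14].

```python
import numpy as np


def ap_11_point(precision, recall):
    """Eleven-point interpolated average precision from paired P/R samples."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    interpolated = []
    for level in np.linspace(0.0, 1.0, 11):
        reachable = recall >= level
        # Best precision among operating points that reach this recall level.
        interpolated.append(precision[reachable].max() if reachable.any() else 0.0)
    return float(np.mean(interpolated))


# Two operating points: (recall 0.55, precision 1.0) and (recall 1.0, precision 0.5).
ap = ap_11_point(precision=[1.0, 0.5], recall=[0.55, 1.0])
print(f"AP = {ap:.4f}")
```

In this toy case, the six recall levels up to 0.5 take precision 1.0 and the remaining five take 0.5, so the result equals 8.5/11.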
A drawback of the AP metric is that it requires sorting predictions by confidence, which is not provided by most bottom-up segmentation approaches. Wei et al. [14] heuristically used the instance size as the prediction confidence, which can lead to undesirable biases for method ranking.
2) Matching-Based Metrics: Metrics based on matching focus on quantifying correctly predicted instances, transforming instance segmentation results into an object detection framework. In this paradigm, the emphasis shifts from uniquely labeled instances to detecting the presence or absence of instances. This transformation is achieved by establishing a criterion for instance overlap, commonly measured through IoU. Unlike traditional segmentation evaluations that rely on nuanced pixel-level overlaps, this approach simplifies assessment by classifying instances as successful (TP) based on a predefined IoU threshold. This aligns with decision-making processes in detection problems, providing a streamlined and robust evaluation strategy.
These metrics can combine informative statistics, i.e., TP/FP/FN, into a single value to rank the methods. More specifically, we use accuracy, which is defined as follows:

$$\mathrm{Accuracy} = \frac{TP}{TP + FP + FN}.$$

To decide which predicted instance corresponds to which ground truth instance, we make the following definitions. Let P be the set of predicted instances and G the set of ground truth instances:

$$P = \{p_1, p_2, \ldots, p_n\},$$

where p_i are the predicted instances and n is the number of instances in the predicted set, and

$$G = \{g_1, g_2, \ldots, g_m\},$$

where g_j are the ground truth instances and m is the number of instances in the ground truth set. Following previous work [9], [10], to decide the optimal assignment between predicted and ground truth instances, we use the Hungarian algorithm [54], whereby a predicted instance cannot be assigned to multiple ground truth instances (and vice versa). In our case, the optimal assignment is given by minimizing the following cost:

$$\min_{X} \sum_{i=1}^{n} \sum_{j=1}^{m} C_{i,j} X_{i,j},$$

where X is a boolean matrix, wherein an element X_{i,j} is true if and only if row i is assigned to column j, and C is the cost matrix, defined as

$$C_{i,j} = -\mathbb{1}\left[\mathrm{IoU}(p_i, g_j) \geq T\right] - \frac{\mathrm{IoU}(p_i, g_j)}{2N},$$

where T is the threshold 0.75, as in AP-75, and N is the number of assignments (the minimum between the number of predicted instances and the number of ground truth instances).
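The assignment step and the resulting accuracy can be sketched as follows. This is a toy illustration under stated assumptions, not the challenge's implementation: the cost used here (a thresholded indicator term plus an IoU term scaled by 2N) is one plausible form consistent with the definitions of T and N above, the function name `assign_and_score` is hypothetical, and the optimal assignment is found by exhaustive search over permutations instead of the Hungarian algorithm. Exhaustive search returns the same minimum-cost assignment but is only practical for a handful of instances; for real volumes, a solver such as `scipy.optimize.linear_sum_assignment` would be used.

```python
from itertools import permutations
from typing import List, Tuple


def assign_and_score(iou_matrix: List[List[float]], thr: float = 0.75
                     ) -> Tuple[List[Tuple[int, int]], int, int, int, float]:
    """Minimum-cost one-to-one assignment and the resulting accuracy.

    iou_matrix[i][j] holds IoU(p_i, g_j); it must be non-empty, with
    rows indexing predictions and columns indexing ground truth.
    """
    n_pred, n_gt = len(iou_matrix), len(iou_matrix[0])
    n_assign = min(n_pred, n_gt)  # N in the text
    # Assumed cost: reward pairs above the threshold, break ties by IoU.
    cost = [[-float(v >= thr) - v / (2 * n_assign) for v in row]
            for row in iou_matrix]
    # Enumerate every one-to-one assignment of size N (toy sizes only).
    if n_pred <= n_gt:
        candidates = (list(zip(range(n_pred), perm))
                      for perm in permutations(range(n_gt), n_pred))
    else:
        candidates = (list(zip(perm, range(n_gt)))
                      for perm in permutations(range(n_pred), n_gt))
    pairs = min(candidates, key=lambda ps: sum(cost[i][j] for i, j in ps))
    # Only assigned pairs that reach the threshold count as true positives.
    tp = sum(iou_matrix[i][j] >= thr for i, j in pairs)
    fp, fn = n_pred - tp, n_gt - tp
    return pairs, tp, fp, fn, tp / (tp + fp + fn)


# Hypothetical IoU values: two predictions, three ground truth instances.
pairs, tp, fp, fn, accuracy = assign_and_score(
    [[0.9, 0.1, 0.0],
     [0.2, 0.8, 0.0]])
```

Here both predictions are matched above the threshold and one ground truth instance is left unmatched, so the accuracy is 2/3.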
3) Association-Based Metrics: Many segmentation methods need to set hyper-parameters to control the ratio between false-split and false-merge errors. Thus, a pie chart displaying the proportion of each type of segmentation association error [15] is critical for interpreting the results. Using the previously computed IoU values as in the matching-based metrics, let A denote the set of pairs (p_i, g_j) of associated predicted and ground truth regions. Let us define the two sets A(g_j) = {p_i | (p_i, g_j) ∈ A} and A′(p_i) = {g_j | (p_i, g_j) ∈ A}, corresponding to the predicted instances associated with ground truth instance g_j and the ground truth instances associated with predicted instance p_i, respectively. Then, different cases of the resulting reciprocal mapping are considered:
• One-to-one, when there is an exact match between g_j and p_i: A(g_j) = {p_i} and A′(p_i) = {g_j}.
• Over-segmentation, when one instance in the ground truth is divided into two or more in the prediction: |A(g_j)| > 1 and ∀ p_i ∈ A(g_j), A′(p_i) = {g_j}.
• Under-segmentation, when two or more instances in the ground truth are merged in the prediction: |A′(p_i)| > 1 and ∀ g_j ∈ A′(p_i), A(g_j) = {p_i}.
• Missing, for instances of the ground truth that are not captured in the prediction: A(g_j) = ∅.
• Many-to-many, all other cases.
In summary, background associations (predicted instances with no associated ground truth instance) are typically expressed as a percentage of the total number of predicted instances, whereas the remaining associations are expressed as a percentage of the total number of ground truth instances. Consequently, the cumulative percentage of these non-background associations amounts to 100%.
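The case analysis above can be sketched in a few lines. The sketch takes the association set A as given, as a collection of (prediction, ground truth) identifier pairs, and assigns one category to each ground truth instance. The function name `classify_associations`, the string identifiers, and the treatment of background as predictions with an empty A′(p_i) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def classify_associations(pairs: Iterable[Tuple[str, str]],
                          pred_ids: List[str], gt_ids: List[str]
                          ) -> Tuple[Dict[str, str], List[str]]:
    """Label each ground truth instance and list background predictions."""
    a_of_gt: Dict[str, Set[str]] = defaultdict(set)    # A(g_j)
    a_of_pred: Dict[str, Set[str]] = defaultdict(set)  # A'(p_i)
    for p, g in pairs:
        a_of_gt[g].add(p)
        a_of_pred[p].add(g)

    labels: Dict[str, str] = {}
    for g in gt_ids:
        preds = a_of_gt.get(g, set())
        if not preds:
            labels[g] = "missing"
        elif len(preds) == 1:
            (p,) = preds
            gts_of_p = a_of_pred[p]
            if gts_of_p == {g}:
                labels[g] = "one-to-one"
            elif all(a_of_gt[h] == {p} for h in gts_of_p):
                # p covers several ground truths, each seen only by p.
                labels[g] = "under-segmentation"
            else:
                labels[g] = "many-to-many"
        elif all(a_of_pred[p] == {g} for p in preds):
            # g is covered by several predictions that touch only g.
            labels[g] = "over-segmentation"
        else:
            labels[g] = "many-to-many"

    # Assumed definition: predictions associated with no ground truth.
    background = [p for p in pred_ids if not a_of_pred.get(p)]
    return labels, background


# Hypothetical associations: two small, three medium, one large instance.
labels, background = classify_associations(
    pairs=[("p1", "s1"), ("p2", "s2"), ("p2", "m1"), ("p3", "m2"),
           ("p4", "m3"), ("p5", "L"), ("p6", "L"), ("p7", "L")],
    pred_ids=["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9"],
    gt_ids=["s1", "s2", "m1", "m2", "m3", "L"])
```

Dividing the count of each label by the number of ground truth instances gives the non-background percentages, which sum to 100% because every ground truth instance receives exactly one label.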

B. Discussion on 3D Instance Segmentation Metrics
To better understand the pros and cons of each metric, we created a toy example with ground truth mitochondria instances of different sizes and realistic model predictions (see Fig. 4). The ground truth volume contains only six instances: one large (MOAS type), three medium, and two small mitochondria based on their cable lengths. The prediction presents an over-segmentation of small and medium instances, a merger of two mitochondria, and several split errors in the MOAS. The corresponding AP-75, association, and accuracy values are shown in Table II.
1) AP-75 Overvalues Small-Size Instances: In our ISBI challenge, we developed an efficient implementation of AP-75 for 3D volumes. Due to the lack of a confidence prediction for each instance, we sorted mitochondria instances by size, resulting in small instances having the lowest confidence values. Therefore, when a small instance is merged with a medium one in the prediction, the small instance is considered an FN. Additionally, the large instance in the ground truth is split into several instances that do not reach the minimum IoU of 75% with the ground truth, so most of those instances are considered medium FPs. This means the large mitochondrion is only matched with the blue instance that represents its bottom part (since it is the largest among all pieces). Although the prediction contains several small FPs, as well as more small and medium FPs from the remaining MOAS pieces that are not matched with it (i.e., all but the blue instance), the AP-75 values for small instances are still high. Note that other drawbacks of the AP metric have been discussed from different angles in recent work [59].
2) Accuracy Metric Provides a Good Overall Evaluation: As shown in Table II, the association metrics are useful for understanding the fate of the ground truth instances in the prediction but do not provide information on the overlap between the prediction and the ground truth. On the other hand, the matching metrics do provide this information by considering a prediction as a TP if the IoU with ground truth is greater than 75%. However, the association metrics have multiple values, rather than just a single one, which complicates the direct comparison of the performance of different methods. For example, it is not clear whether a low under-segmentation rate is better or worse than a low over-segmentation rate, or whether many-to-many is worse than the previous two. These questions depend on the specific task at hand. Therefore, it is useful to have a single number, such as accuracy, to enable easy comparison of the performance of different models. In the toy example, there are many small FPs in the prediction, as previously mentioned, which results in low values for all matching metrics except recall. For medium instances, only the one merged with the small instance is not considered a TP due to its low IoU (< 0.75).

3) Association Metrics Provide a Detailed Breakdown of Errors: Examining the association metrics helps us to understand where and how the prediction fails. A missing value of zero in all cases indicates that all ground truth instances have been captured by the prediction. More specifically, out of the two small mitochondria in the ground truth, one has been correctly predicted and is labeled as one-to-one. The other one was merged with a medium mitochondrion, resulting in both the small and the medium instance being labeled as under-segmentation. The remaining medium mitochondria are also labeled as one-to-one. Finally, the ground truth MOAS that was divided into medium-sized pieces in the prediction is labeled as over-segmentation.

V. ANALYSIS OF CURRENT PROGRESS ON MITOEM
In this section, we leverage the evaluation metrics defined in the previous section to analyze in detail the performance of the participant and baseline methods in the challenge.

A. Overall Performance
1) Matching-Based Evaluation: The matching metric values corresponding to the top submissions of all methods on the leaderboard are presented in Table III. The IIPPR method demonstrates superior performance compared to VIDAR in most cases, except for recall. This trend is also observed in other methods, where high recall comes at the cost of precision. For instance, U3D-BC, VGG, and U2D-BC exhibit much higher recall than precision values, indicating a larger number of false positives. A detailed breakdown of the matching-based metrics for each instance category (small, medium, and large) can be found in Table VI in the Appendix. Notably, all methods segment small and medium mitochondria better than large mitochondria in both the MitoEM-R and MitoEM-H datasets. Furthermore, segmenting large mitochondria is evidently more challenging in MitoEM-H than in MitoEM-R, as confirmed by visual inspection in Fig. 2.
2) Association-Based Evaluation: Table IV presents the association metric values for all ranked methods, including our own baseline methods. The absolute numbers of association types per instance category for each participant method are illustrated in Fig. 10 in the Appendix. In both human and rat tissues, the IIPPR method achieves the highest one-to-one value, representing the percentage of correctly associated ground truth instances. Furthermore, as shown in Table IV, IIPPR exhibits very low over-segmentation values, in contrast to other methods such as U3D-BC, EMBL, or VGG. This aligns with their previously observed high recall values (Table III), which are a result of a larger number of false positive instances. However, assessing the methods solely on the percentage of correctly assigned instances (one-to-one value) is insufficient, as a high one-to-one value can be accompanied by a high number of background associations, as observed for the VGG and U2D-BC methods.
To gain further insights into the types of association errors made by the top three methods (IIPPR, VIDAR, and EMBL), we present two analyses in Fig. 6: (1) an overview of the errors relative to the ground truth instances, and (2) their absolute magnitudes for method comparison. Generally, Fig. 6 reveals that the relative magnitude of missing instances is similar among the top three methods for both tissues. However, the absolute magnitudes indicate better performance for IIPPR and VIDAR compared to EMBL. The top methods tend to exhibit over-segmentation rather than under-segmentation (except for IIPPR in human tissue). This highlights the challenges faced by these methods in accurately segmenting the most difficult instances in MitoEM, particularly the MOAS-type mitochondria. This observation is also supported by the high number of over-segmentation associations for large mitochondria, as illustrated in Fig. 10 in the Appendix.

B. Comparison Across Skeleton Length
Overall, the complexity of mitochondria is influenced by the length of the skeleton. The length of the skeleton can vary depending on the type and size of the cell in which the mitochondria are located. Based on the overall performance of the methods, we have identified a clear issue of over-segmentation of large mitochondria in both tissues. However, we have not yet considered the number of instances that the splitting or merging of instances involves. Therefore, it is important to compare the number of instances associated with over-segmentation, under-segmentation, and many-to-many associations to determine which type of error has the most significant impact.
Let us define the sets OS(A), US(A), and MM(A) corresponding to the over-segmentation, under-segmentation, and many-to-many associations in A, respectively. We then define the subset S of association splits as

$$S = OS(A) \cup MM(A).$$

Similarly, we define the subset M of association mergers as

$$M = US(A) \cup MM(A).$$

To compare the number of instances in each subset, we use ||S||_p and ||M||_g, where ||.||_p is the number of elements related to prediction instances in the subset (i.e., |A′(p_i)|) and ||.||_g is the number of elements related to ground truth instances in the subset (i.e., |A(g_j)|). In Fig. 7, we present the number of split and merger instances as a function of the cable length (measured in the ground truth instances) for both MitoEM-H and MitoEM-R, focusing on the results of the top two methods (VIDAR and IIPPR). It can be observed that, in all cases, the number of instances associated with splits tends to increase with cable length, while the number of instances associated with mergers remains relatively low across different lengths. This trend can be attributed to the presence of MOAS-type mitochondria, where larger structures tend to consist of a higher number of smaller constituent elements. Furthermore, the results for MOAS in human tissue exhibit a greater number of splits than those in rat tissue. This discrepancy is likely due to the thicker connections present in rat MOAS within this specific dataset, as depicted in the middle of Fig. 7, which make them easier to segment in 3D. However, these differences in size may not be representative of all humans and rats. Therefore, further investigation and a larger sample size would be necessary to validate and establish reference ranges for mitochondrial sizes in these species.

Fig. 8 provides a detailed analysis to identify the types of mitochondria that exhibit the highest failure rates for the top three methods. When considering the absolute number of cumulative association errors (missing, over-segmentation, under-segmentation, and many-to-many), the results align with the ranking presented in Table III, with IIPPR performing the best, followed by VIDAR, and finally EMBL. However, in terms of false negatives, VIDAR outperforms IIPPR. This finding is consistent with the results discussed in Section V-A, which indicate that VIDAR is capable of identifying more instances, albeit at the expense of higher false positive rates and lower precision.
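The split and merger counts introduced earlier in this subsection can be illustrated with a short sketch. It rests on one reading of the definitions, stated here as an assumption: S collects the over-segmentation and many-to-many pairs, M collects the under-segmentation and many-to-many pairs, ||S||_p counts the distinct predicted instances appearing in S, and ||M||_g counts the distinct ground truth instances appearing in M. The function name `split_merge_counts` and the identifiers are hypothetical.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (predicted instance id, ground truth instance id)


def split_merge_counts(os_pairs: Iterable[Pair],
                       us_pairs: Iterable[Pair],
                       mm_pairs: Iterable[Pair]) -> Tuple[int, int]:
    """Return (||S||_p, ||M||_g) under the assumptions stated above."""
    splits = set(os_pairs) | set(mm_pairs)    # S: pairs involved in splits
    mergers = set(us_pairs) | set(mm_pairs)   # M: pairs involved in mergers
    n_split_preds = len({p for p, _ in splits})    # distinct predictions in S
    n_merged_gts = len({g for _, g in mergers})    # distinct ground truths in M
    return n_split_preds, n_merged_gts


# Hypothetical example: one MOAS split into three pieces, and one
# prediction that merges a small and a medium mitochondrion.
n_split, n_merged = split_merge_counts(
    os_pairs={("p5", "L"), ("p6", "L"), ("p7", "L")},
    us_pairs={("p2", "s2"), ("p2", "m1")},
    mm_pairs=set())
```

Grouping these counts by the cable length of the ground truth instances involved would give curves of the kind plotted in Fig. 7.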

VI. DISCUSSION ON REMAINING CHALLENGES
Despite the notable improvements achieved during the competition, there are still several challenges that the research community needs to address.

A. Model Challenge
In the current setting of fully supervised learning with a 40-10-50% data split, the IIPPR method serves as a strong baseline, achieving an overall accuracy score of 0.770. However, for practical deployment on recent petabyte-scale datasets [17], instance segmentation methods must achieve even higher accuracy to make the proofreading process feasible at scale (preferably above 0.9 based on our own proofreading experience). In addition to the inherent challenges posed by the datasets, such as complex geometries and crowded instances, there remains an open challenge in accurately segmenting "large" instances, particularly MOAS instances with super-thin connections, as they often result in over-segmentation. To address this issue, the VIDAR team at USTC's lab has proposed the use of knowledge distillation training [60] as a potential solution. Furthermore, while the majority of methods show significantly better performance for "small" instances compared to "large" instances in terms of accuracy, all methods demonstrate much better results for "medium" instances (refer to Table VI in the Appendix). This observation holds true even when the dataset contains a larger number of "small" instances than "medium" instances, as observed in the case of MitoEM-H. This indicates that the current architectures are better suited to a specific length of mitochondria, leaving room for improvement in designing methods that can handle various lengths effectively.

B. Limited Label Challenge
While the challenge was conducted within a fully supervised learning framework, it is important to acknowledge that in practical scenarios, the availability of labeled data is often limited to around 5-10% of the entire volume. Therefore, it is crucial to develop data-efficient methods that can achieve high accuracy with a limited amount of annotation. This includes exploring new data augmentation techniques [61], investigating unsupervised learning approaches [62], exploring semi-supervised learning methods, and leveraging active learning strategies. By addressing the limited label challenge, we can enable the development of models that effectively utilize a small amount of labeled data to achieve accurate segmentation results. We believe that our MitoEM dataset can also serve as a valuable resource for simulating and evaluating these learning settings.

C. Proofreading Challenge
Regarding the suitability of a scoring system based on accuracy, one should assess the purpose of the segmentation result and its subsequent processing. In particular, for large datasets such as MitoEM, the current strategy assumes a proofreading phase after automatic segmentation. In that sense, a metric that does not penalize false positive predictions as much as false negative ones may be the most appropriate. In fact, eliminating false positives has proven much faster than correcting false negatives when proofreading 3D instances [63]. In a more general framework, the association and matching metrics provided by our in-depth analysis help us complete the big picture in terms of evaluation.

VII. CONCLUSION
In this paper, we present the results of the ISBI 2021 challenge on MitoEM, the first large-scale mitochondria instance segmentation challenge that thoroughly benchmarks state-of-the-art methods in the field. To gain insight into the common errors of the proposed methods and identify current challenges that remain unresolved, we analyze the performance of the methods using various types of evaluation metrics.
The release of MitoEM had the dual goal of attracting new computer vision researchers to the problem of EM mitochondria segmentation and pushing the state of the art forward. We believe that the challenge was successful in this regard, as the participants improved on our own initial baseline methods. Furthermore, the competition received a very positive reaction from the community and had good attendance at its corresponding workshop at ISBI 2021.
After conducting a comprehensive analysis of the challenge results, we identified consistent annotation errors and addressed them by releasing an updated version of the ground truth labels (V2). Furthermore, through a thorough examination of the state-of-the-art evaluation metrics, we identified issues with the evaluation system based on the AP-75 metric and updated the challenge and method ranking using accuracy, which is a more robust metric that takes into account false negatives and over-segmentations more effectively. Nevertheless, the current accuracy values are still insufficient for fully automatic segmentation; therefore, the challenge remains open for submissions.
Finally, we would like to highlight the potential of our large-scale annotated dataset for a wide range of applications beyond its original purpose. The dataset can be used for tasks such as deep feature pre-training, 3D shape analysis, and testing novel approaches including active learning or domain adaptation. The availability of this dataset provides valuable opportunities for researchers to explore new directions and tackle various challenges in the field of mitochondria segmentation.
As future work, we will consider expanding the MitoEM dataset to create new iterations of the challenge using the newly proposed scoring system, thereby improving the limited generalizability of results produced on only two EM datasets.

APPENDIX
The original challenge leaderboard, which initially ranked the methods based on AP-75 performance, is presented in Table V. For a detailed breakdown analysis of matching-based metrics per mitochondria category, we refer to Table VI, which showcases the results of the top-performing submissions from both the participant and baseline methods.
Fig. 9 illustrates visual examples of common segmentation errors made by each participant method. The examples cover all mitochondria categories and tissues, allowing for a visual inspection of the errors made by different methods.
Additionally, to provide a comprehensive understanding of the associations per mitochondria category (small, medium, and large), we present the distribution of associations in Fig. 10 for the best submissions among all participant methods.

Fig. 1. Common annotation errors in the initial MitoEM dataset [14] (v1): (a) organelles that look similar to mitochondria and often give rise to false positives, (b) false merges of mitochondria, and (c) incomplete segmentation. These errors were fixed after another round of expert proofreading (v2).

Fig. 3. Processing pipelines of our open-source baseline methods (U2D-BC/U3D-BC). The model predicts foreground and contour probabilities that are fused to create three inputs for a marker-controlled watershed [37] to produce individual instances.

Fig. 4. Synthetic example of mitochondria instance segmentation. Left to right: ground truth instances, same instances color-coded by size (small in red, medium in blue and large in green), model prediction, matching errors (FP in blue and TP in green) and association errors (one-to-one in green, background in blue, over-segmentation in yellow and under-segmentation in magenta).

Fig. 5. 3D visualization of MOAS instances for error inspection. We show the ground truth and segmentation results from the two top-performing models (IIPPR and VIDAR) in one MOAS instance per dataset. Different colors represent different instances.

Fig. 6. Summary of association errors in MitoEM for the top three methods: IIPPR, VIDAR, and EMBL. The pie charts illustrate the proportions of association errors relative to the ground truth instances. The bar plots below depict the absolute magnitudes of the association errors (left), and the total number of false positives (FP) and false negatives (FN) for each method (right).
Some visual examples of over-segmentation of MOAS specifically for the top two methods are shown in Fig. 5. Additionally, examples of common errors for all methods in all mitochondria categories are shown in Fig. 9 in the Appendix.

Fig. 7. Split vs. merger instances distribution for the top two methods (VIDAR and IIPPR) on the MitoEM-H (top) and MitoEM-R (bottom) test sets. Each data point represents the number of instances for mitochondria of different lengths, with vertical lines indicating the standard deviation. The size of each data point is proportional to the number of instances within that length bin. Representative instances of various cable lengths are displayed in the middle and connected to their respective bins by dashed lines. The skeleton length is evenly divided into 15 bins and measured in the ground truth instances.

Fig. 8. Summary of the absolute number of error types per instance category for the three top-performing methods. The errors shown include cumulative association (i.e., missing, over-segmentation, under-segmentation and many-to-many) errors (on the left) and false negatives (on the right) for each method.

Fig. 9. Some examples of common segmentation errors by the analyzed methods in small, medium and large mitochondria of MitoEM-H and MitoEM-R tissue from the test set. Every instance is given a unique color. The scale bar represents 0.5 µm.

Fig. 10. Distribution of types of associations for all participant methods on the MitoEM-H (top) and MitoEM-R (bottom) test sets for small (left), medium (center) and large (right) mitochondria. The methods are ordered from left to right by lowest-to-highest value of AP-75.

TABLE I
OVERVIEW OF THE MitoEM PARTICIPANT METHODS. BASELINE METHODS FROM THE CHALLENGE ORGANIZERS ARE ALSO SHOWN (MARKED WITH *). CE - CROSS-ENTROPY, WBCE - WEIGHTED BINARY CROSS-ENTROPY, MSE - MEAN SQUARED ERROR, WMSE - WEIGHTED MSE, SGD - STOCHASTIC GRADIENT DESCENT, HA - HIERARCHICAL AGGLOMERATION, MCWS - MARKER-CONTROLLED WATERSHED, MWSMC - MUTEX WATERSHED AND MULTICUT, CC - CONNECTED COMPONENTS, HUA - HUNGARIAN ALGORITHM. (†) REUSE U3D-BC CODE

to simulate different fields of view. The first network had an input size of 128 × 128 × 64, and the second network had an input size of 256 × 256 × 64. Each 4096 × 4096 × 1000 volume was vertically split into four quadrants, resulting in four quadrants of size 1024 × 1024 × 500 as sub-volumes for four-fold cross-validation training. During each fold of the cross-validation training, three out of the four quadrants were used as the training set, and the remaining quadrant was used as the validation set.

TABLE II
AP-BASED, MATCHING-BASED AND ASSOCIATION-BASED METRICS EVALUATION OF THE SYNTHETIC EXAMPLE OF FIG. 4. ASSOCIATION METRICS ARE EXPRESSED IN %

TABLE III
MATCHING-BASED METRICS OF ALL METHODS ON THE MitoEM CHALLENGE LEADERBOARD. THE RANKINGS PRESENTED IN THIS TABLE ARE ORDERED BY OVERALL ACCURACY, THUS DIFFERING FROM THE ORIGINAL CHALLENGE LEADERBOARD, AS DISCUSSED IN THE MANUSCRIPT. THE BASELINE METHODS FROM THE CHALLENGE ORGANIZERS (MARKED WITH *) ARE SHOWN BUT NOT INCLUDED IN THE RANKING. THE BEST SCORES ARE SHOWN IN BOLD

TABLE IV
ASSOCIATION-BASED METRICS (IN %) OF ALL METHODS ON THE MitoEM CHALLENGE LEADERBOARD. BASELINE METHODS FROM CHALLENGE ORGANIZERS (MARKED WITH *) ARE SHOWN BUT NOT INCLUDED IN THE RANKING. THE TERMS 'CORRECT', 'MISSING', 'OVER', 'UNDER', AND 'MANY' REPRESENT 'ONE-TO-ONE', 'MISSING', 'OVER-SEGMENTATION', 'UNDER-SEGMENTATION', AND 'MANY-TO-MANY' ASSOCIATIONS, RESPECTIVELY. THE 'BACKGROUND' PERCENTAGE IS CALCULATED RELATIVE TO ALL PREDICTED INSTANCES, WHILE THE REMAINING ASSOCIATION VALUES ARE CALCULATED RELATIVE TO THE NUMBER OF GROUND TRUTH INSTANCES AND COLLECTIVELY ADD UP TO 100%. THE BEST SCORES ARE INDICATED IN BOLD

TABLE V
THE MitoEM CHALLENGE LEADERBOARD AS ANNOUNCED AT THE WORKSHOP AT ISBI 2021. THE METHODS ARE RANKED ACCORDING TO THEIR AP-75 SCORES, WITH THE HIGHEST SCORES DISPLAYED IN BOLD. THE RANKINGS PRESENTED IN THIS TABLE ALIGN WITH THE ORIGINAL CHALLENGE LEADERBOARD, BUT DEVIATE FROM THOSE PRESENTED IN THE PRESENT MANUSCRIPT DUE TO THE MODIFICATION OF THE EVALUATION METRIC. THE BASELINE METHODS FROM THE CHALLENGE ORGANIZERS (MARKED WITH *) ARE DISPLAYED BUT WERE NOT INCLUDED IN THE RANKING

TABLE VI
MATCHING-BASED METRICS OF ALL METHODS ON THE MitoEM CHALLENGE LEADERBOARD PER CATEGORY. THE BASELINE METHODS FROM THE CHALLENGE ORGANIZERS (MARKED WITH *) ARE SHOWN BUT NOT INCLUDED IN THE RANKING. BOLD AND UNDERLINED NUMBERS DENOTE THE 1ST AND 2ND SCORES, RESPECTIVELY