Nuclei Detection for 3D Microscopy With a Fully Convolutional Regression Network

Advances in three-dimensional microscopy and tissue clearing are enabling whole-organ imaging with single-cell resolution. Fast and reliable image processing tools are needed to analyze the resulting image volumes, including automated cell detection, cell counting and cell analytics. Deep learning approaches have shown promising results in two- and three-dimensional nuclei detection tasks, however detecting overlapping or non-spherical nuclei of different sizes and shapes in the presence of a blurring point spread function remains challenging and often leads to incorrect nuclei merging and splitting. Here we present a new regression-based fully convolutional network that located a thousand nuclei centroids with high accuracy in under a minute when combined with V-net, a popular three-dimensional semantic-segmentation architecture. High nuclei detection F1-scores of 95.3% and 92.5% were obtained in two different whole quail embryonic hearts, a tissue type difficult to segment because of its high cell density, and heterogeneous and elliptical nuclei. Similar high scores were obtained in the mouse brain stem, demonstrating that this approach is highly transferable to nuclei of different shapes and intensities. Finally, spatial statistics were performed on the resulting centroids. The spatial distribution of nuclei obtained by our approach most resembles the spatial distribution of manually identified nuclei, indicating that this approach could serve in future spatial analyses of cell organization.


I. INTRODUCTION
The explosion of new tissue clearing protocols in the last decade has revolutionized the field of three-dimensional (3D) microscopy [1]. It is now possible to image intact organs, developing embryos and even whole mice [2]- [5]. Imaging tissue samples without sectioning preserves tissue integrity, cell shape and organization, and provides a more complete view of the patterns and connections that enables organs to function. Whole-organ imaging with single-cell resolution is bringing us toward organism-level systems biology, where molecular and cellular interactions can be studied across organs, and the relationship between structure and function can be explored at larger scales [6]. Additionally, spatially resolved transcriptomics, especially fluorescence in situ hybridization (FISH) and in situ sequencing (ISS) can give "barcodes" to cells, adding even more functional information to each image [7], [8]. Just as diseases affect the inner function of individual cells, diseases can also bring subtler changes in different but interacting cell populations over whole organs and organisms. Tissue clearing, fast 3D microscopy and more targeted fluorescent staining are now allowing us to study these large-scale systems [2].
However, new data analysis tools are needed to extract information from these large multichannel 3D datasets. To quantify protein expression levels across organs, or to determine how diseases affect cell migration or cellular differentiation, tasks such as segmentation (e.g., cell, tissue, layer), tracking, and spatial analysis of fluorescent markers have to be performed. And with hundreds of thousands of cells per mm 3 of tissue, such data processing efforts would be overly time-consuming and impractical if performed manually on large 3D image sets. Automatic data processing is thus essential to obtain quantitative measurements for analysis.
The spatial organization of cells plays an important role both in healthy and diseased organs, and studying cell organization would further our understanding of embryonic development, aging, disease pathways, and treatment strategies. However, few studies have looked into quantifying such organizations, with most experiments focusing on qualitative observations, or quantitative results obtained from a few cells. Fields such as graph theory and spatial statistics include many formal techniques to characterize spatial point distributions. Groups are starting to use such tools to analyze cellular distributions [9]- [13], but they require the location of each cell. The powerful toolsets of graph theory and spatial statistics could be used to study the 3D architecture of tissue at the cellular level if a robust 3D centroid detection algorithm could obtain a reliable spatial distribution.
In practice, 3D cell detection can be complicated by many factors. Whole tissue, organs and embryos have a multitude of different cell types, which are varied in shape and size, and have heterogeneous appearances when stained with fluorescent markers. Thus, few assumptions should be pre-built in cell detection algorithms. Additionally, low magnification microscope objectives have poor resolution in the depth axis, and the out-of-plane signal can make segmentation difficult. This is especially true since many existing algorithms were developed for 2D images of tissue sections and cell monolayers. However, using high magnification objectives with superior depth sectioning is not always an appropriate solution, since low magnification objectives have the long working-distances necessary for imaging deep in tissue, and their large field-of-view enables faster image acquisition. Additionally, imaging whole organs with high magnification objectives would generate a significant data burden.
An optimal 3D cell detection algorithm would accurately find the centroid position of cells in unsectioned tissue. It would perform optimally even on images acquired with low magnification objectives regardless of cell type, shape, and size. It would also translate easily from sample to sample within a study, and even adapt to different organs or animal models. Accurate detection of cell centroids would also enable spatial analysis of cell organization.
In this study we:

1.
Propose a new segmentation-regression network to locate nuclei centroids in high density intact tissue.

2.
Demonstrate the robustness of our approach by testing it in two different tissue types (cardiac embryonic tissue and brain stem tissue).

3.
Perform spatial statistics analysis to determine which nuclei detection approach best preserves the true spatial organization of each tissue type.

II. RELATED WORK
Multiple methods for automatic nuclei detection have been developed for different imaging modalities, including approaches designed for volumetric datasets. Before the recent advances in deep learning, approaches for cell segmentation included nuclei thresholding, matching predetermined templates to the image, the watershed transform, and deformable models [14]. To improve watershed results, algorithms have been developed to either identify seeds, or to iteratively refine the results based on the expected size and shape of the segmented objects [11], [15]. In the case of non-spherical nuclei found in muscle cells, complicated sequences of ellipse fitting, ellipse refinement and clustering have been used [16]. However, with the growing availability of fast and affordable graphics processing units (GPUs), approaches entirely based on deep-learning have become increasingly popular.
Deep learning is a promising approach when it comes to identifying objects with very few predetermined characteristics such as the nuclei of multiple cell types, which are of varied shapes and sizes. Most nuclei detection algorithms can be separated into two classes: either one attempts to segment each nucleus, or nucleus centroids can be directly detected without the need for prior object segmentation. The first option can provide many characteristics for each nucleus in addition to segmentation, such as the centroid location, the nucleus volume, and nucleus orientation. However, the second option may perform better in datasets where nuclei are clustered or overlapping, and where accurate segmentation is difficult. Recently, a third option has appeared, with multiple algorithms combining both a segmentation arm and a centroid detection arm, thus benefiting from the strengths of both approaches.
We now review recent studies falling within these three broad categories.
Many deep learning nuclei segmentation approaches in 2D and 3D are based on the U-net architecture [17]. U-net implementations are usually trained on two pixel classes, nucleus and background, and as a result are not separating touching nuclei without additional processing. In early publications, a thin line of "background" pixels was used to separate neighboring cells, and these pixels were heavily weighted during training to improve the final segmentation [17], [18].U-net implementations with three pixel classes (nucleus, nucleus boundary, background) have also been used in 2D with or without a variable weight distribution for training [19], [20]. The three pixel classes concept has also been extended to 3D, and has provided good segmentation results when paired with vectors pointing toward the nuclei center during training [21]. U-net has also been combined with other approaches, such as Mask R-CNN to separate clusters of nuclei in histology slides and 2D fluorescence images [22], and Hessian analysis to separate connected objects in 2D and 3D datasets [23].
Nuclei segmentation approaches have also used other neural network architectures beside U-net to segment nuclei. The idea of boundary detection was built in DCAN, a network with two output branches for segmenting nuclei and nuclei boundary separately in histology images [24]. Segmentation networks inspired by the 3D U-net architecture have also been proposed, such as DeepSynth which segments nuclei after training on synthetic images, and separates touching nuclei with a 3D watershed transform [25]. The V-net architecture [26], which is similar to 3D U-net but uses the Dice loss to prevent class imbalance, has also been used for cell segmentation using a multi-class approach to detect cells and cell boundaries [27].
Centroid locations can be calculated once nuclei are segmented, or they can be detected directly by a network. Regression neural networks have been used to find object positions, where they can be easier to train than classification networks [28], [29]. For cell detection, a centroid probability map can be calculated by a network, with the local maxima identified in post-processing indicating the cell centroid. A regression approach to detect nuclei has been advocated for by Kainz et al. who compared classification and regression approaches in random forests to localize cells in histology images [30]. Regression also performed well in convolutional neural networks [31], [32], fully convolutional networks (FCN) [33], and residual networks [34], especially on high cell density 2D images with clustered or overlapping objects. Regression-based networks were also robust and translated easily between different imaging modalities and cell types [33], [34]. Höfener et al.
systemically tested different FCN implementations for centroid regression in histology images, and highlighted the importance of the post-processing steps to best identify centroids, recommending median filtering and Gaussian smoothing [35]. Finally, Hirsh and Kainmueller compared the centroids predicted by their proposed nuclei segmentation approach to a regression network inspired by Höfener et al. and found the regression method performed similarly to their own approach [21].
Recently, different groups have proposed pairing segmentation and centroid detection algorithms to improve accuracy, especially volumetric datasets with high nuclei density. One approach has used two 3D U-net networks, one to segment nuclei and the other to detect their centroid to improve segmentation accuracy in embryo images [36]. Another network, SphEsNet, was also inspired by a 3D U-net architecture and added small centroid detection and radius estimation branches to predict the size of spherical nuclei. This algorithm was also able to improve detection accuracy over methods such as Mask R-CNN and a previous network proposed by the authors [37]. Finally, a 3D U-net inspired segmentation network was combined with a small regression-based centroid detection branch and obtained an F1-score of 82% in a high nuclei density 3D fluorescent image set, and 93% in a low nuclei density data set [38].
Although regression-based approaches have proven to be especially strong at cell-detection tasks in 2D images, few attempts have been made in 3D datasets. Additionally, when segmentation and detection approaches have been paired, the detection branches were usually composed of a few convolutional layers at most, and not similar to the highly accurate regression networks used in 2D [37], [38]. In this manuscript we propose a new segmentation-regression approach that can be applied in intact cardiac and brain tissues, which both include irregularly shaped and clustered nuclei. We also compare our own segmentation-regression approach to other previously explored methodology, such as the inclusion of a nuclei-boundary pixel class to improve nuclei centroid detection.

A. LEARNING TASK AND PROPOSED APPROACHES
Nuclei segmentation techniques have often been developed with the assumption that nuclei are spherical objects that do not significantly touch or overlap. In volumetric images of intact cardiac tissue, many nuclei appear to be touching because of the high tissue density and out-of-focus signal from cells at various depths. Additionally, nuclei of muscle cells tend to be elongated and elliptical in shape, while surrounding nuclei can take various shapes. As in other muscle cells [16], numerous small areas of high DAPI intensity are present within the nuclei, which complicates intensity-based thresholding and other segmentation algorithms that require homogeneous intensity. For these reasons, many existing cell segmentation and cell counting approaches lead to merged, split or missed nuclei, and inaccurate nuclei counts.
In this manuscript, five different approaches were tested to identify nuclei centroids in the DAPI stained embryonic heart wall:

1.
VRegNet: V-net with two classes followed by a regression neural network,

3.
Vnet-3W: V-net with three classes followed by 3D watershed and centroid calculation,

4.
Vnet-2W: V-net with two classes followed by 3D watershed and centroid calculation,

5.
Segmentation and nuclei labeling in CellProfiler followed by centroid calculation.
VRegNet is our proposed two-step approach where nuclei are first segmented from the background using V-net, followed by a regression network to directly identify centroids.
The other four methods are inspired by promising cell segmentation approaches found in the literature. Vnet-3 is a variation on the more traditional segmentation-based method, where in addition to the usual two classes used for segmentation (background vs nucleus), a third category (nucleus boundary) is introduced in an effort to separate touching nuclei. This approach has proven effective in 2D [20] and has been attempted in 3D [21], [27].
In Vnet-3W, Vnet-3 is combined to 3D watershed in post-processing to increase the separation between nuclei even more. To understand if the three-class approach provides any improvement over the more commonly used two-class approach (background vs nucleus), Vnet-2W functions as a stand-in for the many neural networks that have been proposed where nuclei are first segmented from the background, then separated using watershed [22], [25], [39]. A visual summary of VRegNet, Vnet-3W and Vnet-2W can be seen in Fig.1a.
Finally, nuclei segmentation is also attempted using CellProfiler 4.0, a commonly used open-source software for cell segmentation, and a non-deep learning alternative that requires no upfront training.

B. SEGMENTATION USING V-NET
V-net [26] is a fully convolutional network built for volumetric image segmentation, and is used as the main segmentation network in our first four approaches. It uses a Dice loss layer to prevent class imbalance, an important factor since the class "nucleus boundary" has the fewest voxels while having an essential role in separating overlapping nuclei. We found during training that V-net performed similarly or slightly better at segmenting cardiac nuclei than other common 3D segmentation networks such as 3D U-net. We thus decided to use V-net for all approaches to make comparisons easier. We used a MATLAB (MathWorks, Natick, MA, USA) implementation of V-net [40].

C. REGRESSION NETWORK
The regression network in VRegNet (Fig.1b) is a simple encoder-decoder network, inspired by Xie et al [33] with 2D convolution layers extended to 3D, and with added batch-normalization and dropout layers to regularize training and prevent overfitting. The convolution layers have a size of 5 × 5 × 5 voxels with a stride of 1 and zero-padding of 2 voxels around all edges. The transposed convolution layers also have a size of 5 × 5 × 5 voxels with a stride of 2. The result of each layer is cropped so that in each dimension the output is double the input in size. Weights were initialized using the Glorot initializer [41], and biases were initialized to zero.
The central task of the regression network is to determine the centroids' position from the nuclei image. However, contrary to examples from previous work where the regression network was applied directly to the DAPI image [33], the input of our network is the output of the two-class V-net before the final voxel classification layer (i.e., the probability of each voxel belonging to a nucleus). This first step homogenizes the appearance of the nuclei and simplifies the regression network's task.
To train the regression network, a centroid probability map is created by placing a small 3D Gaussian (e.g., standard deviation: 3 voxels) at the location of each centroid in the input volume to serve as the ground truth. During training, the loss is calculated as the mean square error between the ground-truth probability map and the predicted probability map.
In post-processing, the local maxima of the probability map are identified as the centroid locations.

D. CELLPROFILER
Nuclei segmentation was performed in CellProfiler 4.0, a widely used open-source software. "Optimal" settings were found through trial-and-error and visual assessment on each testing dataset. First, a morphological opening operation was performed on the DAPI image with a disk structuring element. This homogenized the appearance of each nucleus. Then Otsu thresholding was performed to separate the background from the nuclei. A 3D watershed transform was performed on the distance map of the binary image, and each cell was then labeled with a unique identifier. Centroid position for each of the labelled nuclei was then calculated in MATLAB.

A. SAMPLE PREPARATION AND DATA ACQUISITION
Quail hearts at embryonic development day 9 were surgically removed, fixed overnight in 4% paraformaldehyde at 4°C, then stained for 72h with DAPI (1mg/mL in 1% PBS). Optical clearing was performed at room temperature using LIMPID [42]. Three-dimensional image stacks of the posterior left ventricle wall were acquired on an inverted confocal microscope (SP8 with HyVolution 2, Leica Microsystems Inc., Buffalo Grove, IL, USA) using a 20x/ 0.75 NA objective set for glycerol immersion. Voxel size was 177.6 × 177.6 nm × ∼1.06 μm (x, y, z) and optical sectioning was 3.12μm.DAPIimageswereacquiredusing405nmexcitation and 416-490 nm emission. The excitation light intensity was progressively increased with depth to compensate for tissue absorption.
Brain stem from adult mice (C57BL/6) were surgically removed, fixed for two hours in 4% paraformaldehyde at room temperature, then vibratome sliced, stained with DAPI (10 ng/mL in 1% PBST), and cleared using LIMPID. Three-dimensional image stacks were acquired on the same microscope and objective as the quail heart sample, with a voxel size of 0.75 × 0.75 × 0.8 μm (x, y, z).
All procedures were performed in accordance with relevant guidelines and regulations under the approval of the Case Western Reserve University Institutional Animal Care and Use Committee (IACUC). IACUC approval was not required for the quail embryos as the Policy for use of Avian embryos at Case Western Reserve University states "if embryos will be sacrificed prior to 3 days before hatching, the research will not be subject to IACUC review."

B. GROUND TRUTH CREATION
The training dataset (400 × 200 × 129 voxels, approximately 200 nuclei) and validation dataset (200 × 200 × 129 voxels, ∼100 nuclei) were chosen from one embryonic quail heart and each nucleus was manually segmented and assigned a unique label in ITK-SNAP [43]. Three ground truth datasets were created: 1) a two-label (background/nuclei) volume, 2) a three-label (background/boundary/nuclei) volume where a three-voxel wide boundary surrounded each nucleus, and 3) a nucleiprobabilitymapwith3DGaussians(standard deviation 0.5 μm) located at each nuclei centroid.
Testing datasets were created from the same embryonic quail heart (heart #1, 700 × 700 × 150 voxels), a second quail heart of the same developmental stage but imaged in a separate session (heart #2, 390 × 390 × 150 voxels), and a mouse brain stem sample (brain #1, 390 × 390 × 265 voxels). Small training and validation sets from brain #1 (200 × 390 × 36 voxels each) were also manually segmented and annotated for fine-tuning of the brain stem data. Nuclei centroids were manually identified by one user using ITK-SNAP in all datasets and were considered as the ground truth. Two more users also annotated all centroids in heart #1 to estimate inter-user variability. It was determined that different users positioned centroids on average 0.54 ± 0.36 μm away from other users, with no significant bias in any directions.

C. DATA AUGMENTATION AND PATCH SELECTION
Small volume-patches were extracted randomly from the larger training volume every iteration. In order to expand the size of the training dataset and prevent overfitting, data augmentation was performed on-the-fly on each patch during training. One of following four operations was randomly assigned: reflection with respect to the x or y axis, volume rotation by 90° (counterclockwise) around the z-axis, or a combination of reflection followed by 90° rotation. One of these four operations was applied 50% of the time, and no augmentation was applied the other 50% of the time.

D. DEEP LEARNING EXPERIMENTS
All experiments were implemented in MATLAB 2020a. The computer had 256 GB of memory, a 24 core CPU at 2.2 GHz, and an 11 GB GeForce RTX 2080Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA).
Three networks in total were trained on the same training and validation dataset: 1) the regression network part of VRegNet, 2) the two-class V-net part of VRegNet and Vnet-2W, and 3) the three-class V-net part of Vnet-3 and Vnet-3W. The training parameter space was explored and the best parameters for each network were chosen based on the segmentation accuracy and centroid placement achieved on the validation set. The best network parameters found are listed in Table 1. The Adam optimizer [44] was used with default parameters. Training was stopped if the validation loss did not decrease for more than two epochs, or increased for more than one epoch. Since a random image patch was extracted from the training volume every iteration, the total iteration number is equal to the number of unique image volumes seen by the network during training.
The regression network can be difficult to train when most voxels in the ground-truth probability map are equal to zero, with few centroids corresponding to local peaks equal to 1 in intensity (see Fig.1a, VRegNet/centroids panel). To facilitate network convergence and avoid predicting every voxel to the background value of zero, the ground-truth probability map was multiplied by a factor of 10,000 during training, which forces the network to fit the Gaussians [33].

E. DATA POST-PROCESSING
In VRegNet, the nuclei were identified in the original DAPI volume using a two-class V-net one patch at a time. The whole output volume of the V-net was median filtered (filter size: 3 × 3 × 3 voxels) to smooth the transition between neighboring patches while preserving the nuclei feature, then passed onto the regression network. Once the output of the regression network was obtained, the local maxima in the centroid probability map were detected. First, an intensity threshold was applied to the probability map to set background values to zero. The volume was then filtered using a 3D gaussian kernel (standard deviation: 4 × 4 × 2 voxels), taking care to smooth each of the peaks without merging them with their neighbors. Local maxima within a 2 μm radius region were then identified as centroids and their locations recorded.
In Vnet-3W and Vnet-2W, the nuclei in the original DAPI volume were identified using the two-class or three-class V-net network. The voxels identified with a probability > 0.5 of belonging to a nucleus were separated from the background. The segmented volumes were smoothed using a 3D gaussian filter (standard deviation: 5 × 5 × 5 voxels) and resampled so that voxels were isometric (tricubic interpolation). A distance map was then calculated on each binary volume, and a 3D watershed transform was applied to label all nuclei. The centroid position was then calculated for each labeled nucleus.
Each of the five approaches, but especially VRegNet, did not perform optimally when nuclei were intercepted by the edges of the volume, since it is difficult to identify the centroid of an incomplete object. To solve this issue, a 30 × 30 × 5 voxel border was removed from each volume in post-processing, and nuclei in this region were not considered during analysis.

F. SUCCESS METRICS FOR CENTROID DETECTION
Predicted centroid locations were compared to the ground-truth to quantify the accuracy of each approach. Cardiac nuclei were approximately 2-4 μm in radius, while brain stem nuclei were approximately 3-6 μm in radius depending on the cell type and cell orientation. A centroid predicted by one of the five approaches was thus considered as a true positive (TP) if its position was within 3 μm of the ground-truth centroid. Ground-truth centroids with no corresponding predicted centroid were considered false negative (FN). Predicted centroids that were positioned further than 3 μm from any ground-truth centroids were considered false positive (FP). Additionally, if two or more centroids were predicted within the 3 μm boundary of the same ground-truth centroid, the closest would be considered as TP, and the others as FP.
The precision (P), recall (R), and F1-score were calculated as follow: When a centroid was determined to be a TP, the distance between the predicted and groundtruth centroid was calculated. The predicted cell count was obtained by adding the number of TP centroids to the FP number. The cell count error was calculated as the true cell count subtracted from the predicted cell count, divided by the true cell count.

A. COMPARISON OF CENTROID-FINDING NETWORKS IN CARDIAC TISSUE
The performance of the five different approaches was first evaluated in the intact, optically cleared, DAPI stained quail embryonic heart (heart #1). The testing volume was from the same heart as the volume used for training, and both were collected in the posterior left ventricular wall. A representative 2D slice and a 3D rendering of the test volume can be seen in Fig.2a, b. The test volume was acquired starting at approximately 5 μm in depth from the epicardium, to approximately 164 μm in depth. The location of the TP, FP and FN centroids predicted by VRegNet, Vnet-3W, Vnet-2W are marked for the same 2D slice and 3D volume in Fig.2c-h (see also supplementary videos 1-3). The 2D slice shown (Fig.2a, c, e, g) was acquired at 115 μm in depth. It can be seen that while VRegNet tends to miss a few nuclei (FN, magenta, Fig. 2c), Vnet-3W more often double-counts nuclei (FP, yellow, Fig. 2e) and Vnet-2W both misses some nuclei (FN, Fig. 2g) and falsely labels background voxels as centroids (FP, Fig. 2g).
The quantitative analysis of each approach's performance can be seen in Table 2 for heart #1. VRegNet had the highest F1-score and the most accurate cell count. However, Vnet-3W had the smallest distance between ground-truth and predicted centroids. Vnet-2W had a low F1-score compared to VRegNet and Vnet-3W. Both Vnet-3W and Vnet-2W over-counted the centroids, often by assigning more than one centroid per nucleus. On the other hand, Vnet-3 (no watershed) and the CellProfiler approaches were both very likely to merge nuclei and thus undercount them.
The performance of all five approaches was also evaluated on a second embryonic heart (heart #2), which was imaged on the same microscope with similar image settings, but on a different day as heart #1. This second sample tested how well the different neural networks performed on a new sample on which they had not been trained. The success metrics for heart #2 can be seen in Table 3. The F-1 score for VRegNet has decreased slightly, while it has increased slightly for Vnet-3W. The cell count is also closer to the ground truth in Vnet-3W than in VRegNet. The precision of Vnet-2W has increased; however, the network is now undercounting the nuclei.
The results from Table 3 were obtained without fine-tuning or transfer learning to adapt to the new dataset. This indicates generalization and robustness of both VRegNet and Vnet-3W. Optimally, one could analyze multiple samples of the same cohort for a larger study without requiring any additional training from one sample to another. Fine-tuning of the regression and V-net networks was attempted using a small 200 × 390 × 32 voxels training sub-volume from heart #2, but this did not lead to any noticeable improvement amongst the centroid detection metrics. We thus report the result without fine-tuning.

B. COMPARISON OF CENTROID-FINDING NETWORKS IN BRAIN STEM TISSUE
Ideally, networks trained for centroid detection in one tissue type would be robust and easily adaptable to different samples and tissue types without the need for significant retraining, which would require a large time commitment for manual ground-truth labeling. We thus tested each of the five approaches on a different tissue type: the adult mouse brain stem. The brain stem sample was optically cleared and DAPI stained similarly to the quail embryonic hearts; however, it was imaged with a different voxel size, on a different day, and by a different user who manually adjusted parameters such as excitation intensity. The brain stem cell nuclei are also more spherical and homogeneous in appearance. Additionally, the sample includes neuron nuclei, which are significantly larger with lower fluorescent intensity than the surrounding cells (see Fig. 3a, arrow). Many nuclei are also clumped, which can make them difficult to segment with traditional techniques. A 2D slice of the brain stem data and a 3D rendering of the test volume can be seen in Fig 3a, b.
As mentioned in section IV. B, small training and validation sets from the brain stem data were used for fine-tuning the regression and V-net networks. Without this fine-tuning, the F1-scores for VRegNet, Vnet-3W and Vnet-2W were 65.57%, 87.97% and 80.35% respectively. The test volume was kept separate from the training and validation subvolumes used for fine-tuning. To facilitate transfer learning, each patch from the brain stem data was resized during training in the x-y dimensions to three times its original size, to better match the apparent size of the cardiac nuclei. For the regression network, a learning rate of 5e-5, with a learning rate drop factor of 0.8 and a period of 5 epochs was chosen. It trained for 45 epochs. For the three-class V-net and two-class V-net, the learning rate for fine-tuning was 5e-5, with a drop factor of 0.8 and a period of 5 epochs. They trained for 10 epochs after which no further improvement in accuracy could be seen. The location of the TP, FP and FN centroids predicted by VRegNet, Vnet-3W and Vnet-2W after all networks were fine-tuned are marked for a 2D slice and 3D volume in Fig.3c-h.
VRegNet performed well with few FP and FN seen in Fig. 3c-d. Vnet-3W also performed well, but had particular difficulties identifying neuron nuclei (Fig. 3e, white arrow). Vnet-2W merged many close nuclei, which lead to many two FNs (magenta) surrounding one FP (yellow), as seen in Fig. 3g-h.
The quantitative evaluation of the different approaches in the brain stem volume can be seen in Table 4. For each approach, the F1-scores and cell count errors are similar to the ones obtained in heart #1, with VRegNet and Vnet-3W performing best. The distance between predicted and ground-truth centroids were also similar between approaches. Surprisingly, CellProfiler performed better on this sample than on both cardiac samples, possibly because the spherical shape of the nuclei was similar to the type of samples for which CellProfiler is optimized. However, it still undercounts nuclei by 20% and has the worst centroid distance of all methods.
Overall, both VRegNet and Vnet-3W consistently had high F1-scores, low centroid distance and accurate cell counts on all embryonic quail heart and adult mouse brain stem datasets, followed by much lower scores using Vnet-2W. These results were achieved with no fine tuning after the original network training in the heart data, or minimal fine tuning on a small volume when adapting the networks to a different tissue type. By being robust and flexible, both VRegNet and Vnet-3W have strong potential for routine image analysis in a biological research setting.

C. PROCESSING TIMES
Times required to detect nuclei in heart #1 (700 × 700 × 150 voxels, 993 nuclei) were compared for all methods. The results can be seen in Table 5, and are broken down in three sub-categories. Nuclei segmentation was accomplished using the two-class V-net (VRegNet and Vnet-2W) or the three-class V-net (Vnet-3W and Vnet-3). Instance detection was performed using the regression neural network in VRegNet to detect centroids, and the watershed transform in Vnet-3W and Vnet-2W to label individual nucleus. In Vnet-3, connected components were used to label individual nucleus. CellProfiler performed both nuclei segmentation and the watershed transform in one process. Finally, centroid detection was performed either by identifying local maxima for VRegNet, or calculating the centroid of each labeled nuclei for all other methods. All times shown are the mean of five trials.
As seen in Table 5, VRegNet is significantly faster than other methods by directly detecting the centroids of nuclei instead of relying on the watershed transform. It can accurately detect nearly a thousand nuclei in under a minute. However, VRegNet was mostly performed on a GPU, which accelerated the process compared to Vnet-3W, Vnet-2W and CellProfiler, which executed the watershed transform on a CPU.

D. SPATIAL STATISTICS OF 3D NUCLEI DISTRIBUTION
In this manuscript, we present different approaches that would replace manual cell identification and permit spatial statistical analysis in organs such as the developing heart and the brain stem. For those automated approaches to be most useful, they would need to replicate as closely as possible the spatial distribution of manually identified cells. We thus compared all five approaches to the manually segmented ground truth using three common spatial statistical functions: the F-function, the G-function and the H-function [45]. Each function was calculated in 3D using the Spatial Statistics 2D/3D plugin in ImageJ [9], [46], [47].
The F-function describes the cumulative probability for a randomly chosen location anywhere in the test volume to be within a certain distance of the nearest centroid as represented in Fig.4a. The G-function describes the cumulative probability of the average centroid to be within a certain distance of its nearest centroid neighbor as seen in Fig.4b. Finally, the H-function, which describes the cumulative probability of a centroid to be at a certain distance from any other centroids in the volume as can be seen in Fig.4c.
The F, G and H-functions for the ground-truth and all five approaches for the embryonic cardiac nuclei (heart #1) can be seen in Fig.4d-f, including the mean absolute error (MAE) between each approach and the ground truth. For each spatial statistics function, VRegNet most resembles the probability distribution of the ground-truth followed by Vnet-3W. In the F-function, curves are shifted to the right when a method undercounts cells, since it leads to more empty space in the image volume, and thus a larger distance on average between a random location and the nearest centroid in the volume. A decrease in the curve's slope would also indicate that regions without centroids have appeared in the volume more often than in the ground-truth because of errors in centroid identifications. In the G-function, curves are also shifted to the right when undercounting, since the average distance between nearest centroids is increased. A shift to the right could also indicate that centroids are more regularly spaced from each other, while a shift to the left could indicate clusters. In the H-function, both under-and overcounting leads to a shift to the right, since they are both more likely to increase the average distance between one centroid and all other centroids, than to decrease it.
The F, G and H-functions for the mouse brain stem data can be seen in Fig.4g-i, respectively. It can be seen from Fig. 3g, h that Vnet-2W tends to merge clustered nuclei, which affected the G-function by shifting it to the right without affecting the F-function. Vnet-3W also tends to merge clustered nuclei, but this effect is balanced by over-splitting larger neuron nuclei (Fig. 3e, white arrow), which creates incorrect centroid clusters. This may explain why Vnet-3W seems to perform well in the F and G-functions but fails to replicate the ground-truth distribution in the H-function.
When comparing the G-functions of the ground truth, we observed that 50% of nuclei were within 8.48 μm of their nearest neighbor in the heart, while 50%ofnuclei were within 10.34 μm of their nearest neighbor in the brain. This indicates that the cells in the embryonic quail heart might be more tightly packed than in the mouse brain stem. Those distances are 8.83 μm and 10.74 μm, respectively, when measured using the VRegNet G-functions, indicating that the difference between tissues is preserved.
In the cardiac embryonic dataset (heart #1), VRegNet most closely resembles the spatial pattern of the manually labeled ground truth in large part because it obtained the most accurate cell count, but also because it did so while maintaining the distribution of distances between centroids and their neighbors. VRegNet also performs well across all three functions in brain #1, even though it did not achieve the most accurate cell count. This demonstrates that the position of nuclei with respect to their neighbors, not just the total nuclei count, plays an important role when describing spatial distributions.

VI. CONCLUSION
In this manuscript, we present VRegNet, a new combination of nuclei-segmentation and centroid-regression networks to improve detection of nuclei in large 3D fluorescence datasets. This new approach was able to detect centroids with high accuracy in both intact quail embryonic hearts and the mouse brain stem, even though the tissue types included clustered nuclei of different shapes, sizes and fluorescent intensity. VRegNet was robust and maintained similar accuracy when tested on a heart sample that was not included in the training dataset, and it easily adapted to the brain stem images with minimal fine-tuning. VRegNet also maintained the spatial distribution of the nuclei found in the manually segmented ground-truth data, and thus could be used in further studies of spatial cell patterns.
This is an additional demonstration that regression-based approaches provide high accuracy nuclei detection, including in 3D datasets. Nuclei centroids in 3D tissue images cannot be accurately positioned by a 2D centroid detection algorithm working on one image slice at a time. It is thus important that algorithms that have proven effective in 2D [30]- [34] are fully adapted to 3D image processing.
A V-net approach with three different classes (nucleus, nucleus boundary and background) followed by a 3D watershed transform was also highly effective at separating nuclei and obtained a higher accuracy when calculating the nuclei position in 3D. If desired for analysis, the three-class V-net would also provide more information about each nucleus (e.g., size, shape and orientation especially for the ellipsoid-shaped cardiac nuclei). However, we estimated it took approximately 5 times longer for an expert user to manually segment nuclei for three-class V-net training compared to manually indicating centroid position for the same dataset. It also took nearly 18 times longer to identify centroids in testing data (heart #1) with Vnet-3W than with VRegNet (see Table 5). A regression approach would thus be less time-consuming at the initial training stage, when adapting the networks from one tissue type to another, and when detecting nuclei with a trained network.
VRegNet successfully detected nuclei in two hearts imaged on different days and with slightly different imaging parameters (such as excitation laser power which was manually adjusted every imaging session). However, the network might need fine-tuning if samples were acquired on multiple microscopes, with different staining and clearing protocol, or with different objectives. Nonetheless, as shown in the brain stem sample that was acquired with different imaging parameters and voxel size, the architecture of VRegNet is robust and high-quality results can be obtained after network fine-tuning to new samples.
As indicated in Höfener et al., it is likely that centroid detection would be improved by increasing the size of the training datasets [35]. Additionally, different post-processing steps to identify the local maxima of the centroid probability map could also improve our results without changing the regression network itself. We will explore those possibilities in future studies.
In conclusion, we present VRegNet, a novel combination of the segmentation architecture Vnet, and our own 3D fully convolutional regression network as a robust and accurate method to locate nuclei centroids. Our method performed similarly or better to segmentationbased methods combined to 3D watershed, a common post-processing step. Additionally, VRegNet most closely resembled the spatial centroid distribution obtained from manual nuclei identification making it appropriate for future studies of tissue organization.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.  Visual summary of methods and the network architecture. (a) Two-step process to segment nuclei and identify centroid locations using a combination of a two-class V-net and a regression network (first row), a three-class V-net followed by a distance transform and 3D watershed (second row), or a two-class V-net followed by a distance transform and 3D watershed (third row).    Spatial statistics analysis of nuclei centroid detection approaches. Graphical representation of the centroids (blue), with random locations (x) selected, and the distribution of distances (yellow) represented for the (a) F-function, (b), G-function, and (c) H-function. (d) Ffunction, (e), G-function, and (f) H-function calculated for each centroid-detection approach in the embryonic quail heart (heart #1), and in a mouse brain stem (brain #1) (g-i). MAE: mean absolute error, calculated between each approach and the ground truth curve. Training parameters for the centroid regression network, the two-class V-net (nucleus/background), identified V-net (2) and the three-class V-net (nucleus/nucleus boundary/background), identified V-net (3). Patch size is given in units of voxels.   Average times for each processing step in heart #1 (whole volume) over five timed trials for each method.
* Indicates processing performed on CPU instead of GPU.