Evaluating Registrations of Serial Sections With Distortions of the Ground Truths

Registration of histological serial sections is a challenging task. Serial sections exhibit distortions and damage from sectioning. Missing information on how the tissue looked before cutting makes a realistic validation of 2D registrations extremely difficult. This work proposes methods for ground-truth-based evaluation of registrations. Firstly, we present a methodology to generate test data for registrations. We distort an innately registered image stack in a manner similar to the cutting distortions of serial sections. Test cases are generated from existing 3D data sets, thus the ground truth is known. Secondly, our test case generation enables an evaluation of registrations against known ground truths. This evaluation methodology distinguishes our work from other approaches. Both under- and over-registration become evident in our evaluations. We also survey existing validation efforts. We present a full-series evaluation across six different registration methods applied to our distorted 3D data sets of animal lungs. Our distorted and ground truth data sets are made publicly available.


Introduction
Microscopy has a long tradition, and microscopic imaging is still one of the most frequently used and powerful tools in biomedical research. From light microscopic (LM) techniques, including conventional fluorescent stainings, to transmission and scanning electron microscopic (EM) methods, the last two decades have witnessed substantial methodological progress in terms of resolution, speed, and automation. Tissue clearing and super-resolution LM on one end, and serial block-face as well as focused ion beam scanning EM on the other, have paved the way for a three-dimensional visualization of biological specimens.
Still, the use of serial sections remains an essential and cost-efficient tool to gain 3D insight into specimens, for several reasons: Despite the progress in LM techniques, the penetration depth of staining solutions, in particular of fluorescent antibody stainings, is limited, which restricts the size of the sample that can be visualized. Genetically modified organisms, such as mice expressing fluorescent proteins under cell-specific promoters, are available; this approach, though, cannot be applied to human samples for obvious reasons. Thus, the use of serial sections in LM is often the method of choice for obtaining 3D information, both from conventional and fluorescent microscopic imaging, in particular when human samples are investigated.
However, microscopic sections are inherently two-dimensional (2D), and their 3D information has to be regained from 2D images. The manual or automated cutting of thin sections for microscopy induces, even in perfectly trained and experienced hands, varying distortions and deformations, such as stretching or compression. Positioning the sections on the glass slides further contributes to spatial distortion. This problem is especially evident in large sections. Further processing, such as antigen retrieval and staining, may further damage the section. Even the digitization of the sections can be faulty and create partially corrupt representations.
Alignment, or registration, is an important method in medical image processing. The absence of a ground truth is a major problem both during the development of new registration methods and during the fine-tuning of existing ones. While some simple synthetic data or phantoms can be generated, those would not adequately represent the problem. Firstly, some registration methods, for example those based on feature detection, thrive on the complexity of the input data. Such "sparse" methods might perform very well especially in lung tissue, where the fraction of empty space is very high. Secondly, the distortions in phantom data might not truly represent the distortions in serial sections. Thirdly, in most real cases, no ground truths from other modalities exist at the typical acquisition resolution of the serial sections. Most "real" 3D methods either do not reach the resolution of conventional LM (e.g., micro-CT) or typically cover much smaller spatial dimensions of the probe (e.g., nano-CT, EM). Light sheet (LS) microscopy and the aforementioned tissue clearing are possible palliatives in model animals, but all those methods are still too complex, too expensive, or require a radically different biological processing pipeline, which makes it impossible to apply both modalities to the same specimen. It is also much harder to apply these advanced methods to humans.

Contributions
In this paper we present a methodology to apply typical sectioning distortions to real data sets from other modalities. We digitally "mock up" the distortions from sectioning on real biological data. Arbitrary general-purpose images can be distorted (Fig. 1, Fig. 2) and registration methods can be applied to the distorted images. The results of the registrations can be immediately compared with ground truth data. Our benchmark is open to further registrations, new quality measures, and new images. As we present the method and not only the data, further data sets, even from additional modalities, can be produced by others. We focus our current presentation on animal lung images. However, our method is generic; it should be applicable to virtually any kind of innately 3D data of any organ from any species. Our goal is to enable the evaluation of registration methods for serial sections with a ground truth from real biological data. To this end, we mimic sectioning distortions in an artificial, but statistically meaningful and reproducible manner. We then proceed to evaluate some existing registrations with our method. Among other evaluations, we present a full-series evaluation.
In this paper, we consider possible distortions during sectioning and apply those to 2D series extracted from innately 3D data. Our data sets originate from further modalities in bioimaging. The data sets aim to bracket the scale of LM sections of the lung from both sides: we use CT and LS as the coarser scale and EM as the finer scale. As the original, non-distorted data are perfectly aligned, they serve as the ground truth.
The contributions of this paper are threefold. Firstly, we provide an overview of the field with an emphasis on validations of registration. In most such validations, the problem of an absent ground truth motivates the search for further methods; our exact approach has not been suggested before. Secondly, we suggest a technique to generate benchmark inputs from existing, inherently 3D data. This way we are, thirdly, able to compare registration methods on a common foundation by comparing the registered data with the ground truth. We perform an extensive image-based statistical evaluation of the full series.
The source code for this paper is available under https://github.com/olegl/distort; the distorted and ground truth data sets can be found under https://zenodo.org/record/4282448.

Paper structure
The remainder of the paper is organized as follows: In Section 2 we survey existing registration methods and discuss the approaches towards validation of registration. Section 3 elaborates on our approach for generating distortions. The same section also presents the registration methods we used in our benchmark. In Section 4 we present the data sets and evaluate the results of the registration benchmark; we compare the registered images with the ground truth there, presenting both image-based evaluations and statistical assessments of the results. Section 5 discusses possible limitations and further developments of our method. Section 6 concludes the manuscript.

Registration in general
There is a lot of research on registration, especially in the context of medical imaging. Brown (1992) and Zitová and Flusser (2003) provide early surveys; Maintz and Viergever (1998), Pluim et al. (2003), Khalifa et al. (2011), and Oliveira and Tavares (2014) focus more on medical topics. Viergever et al. (2016) and Pichat et al. (2018) are recent overviews of the field. Zitova (2019) gives a mathematical overview of the methods. Although some manual alignment (e.g., van Krieken et al., 1985) has been performed in the past, we ultimately focus on computational methods. Lester and Arridge (1999), Crum et al. (2004), and Sotiras et al. (2013) survey non-rigid registration methods. In the context of the registration of serial sections, non-rigid registration is definitely required, as distortions during sectioning are non-linear. Even if we currently do not represent tearings or foldings of the tissue, cutting-induced local distortions are present even in the most perfectly prepared sections. (Instead of tearings and foldings we can easily encompass missing parts of the images. Salvaging damaged or missing sections is a separate problem in our eyes.) Non-rigid registration methods include Rueckert et al. (1999); Schnabel et al. (2003); Chui et al. (2003); Hömke (2006); Zhang et al. (2015). Saalfeld et al. (2010) focused on as-rigid-as-possible registration for EM. Punithakumar et al. (2017) is a recent example of a GPU-accelerated registration. Crum et al. (2003, 2004) provide an overview of medical image registration; they highlight both the importance of validation and its difficulty. One of the popular software packages for registration is Elastix (Klein et al., 2010; Shamonin et al., 2014), with further developments around it (Marstal et al., 2016). Another popular package is ANTs (Avants et al., 2014). One frequently used idea is to work with images on multiple levels (see, e.g., Haber and Modersitzki, 2006; Bagci et al., 2012; Lorenz et al., 2012; Lobachev et al., 2017b). One kind of "sparse" method involves feature detection and description; Ma et al. (2020) present a recent survey of the field. The actual detectors and descriptors include SIFT (Lowe, 1999, 2004), SURF (Bay et al., 2006, 2008), and AKAZE (Alcantarilla et al., 2012, 2013). A basic "sparse" registration identifies distinctive regions of both input images and then computes a correspondence between them based on the correspondence of the regions alone. In such a rigid registration, RANSAC (Fischler and Bolles, 1981) is used. "Sparse" methods have been used to register medical images (Can et al., 2002; Urschler et al., 2006; Ruiz et al., 2009; Sargent et al., 2009; Han, 2010; Wu et al., 2012; Ulrich et al., 2014; Shojaii and Martel, 2016; Jamil and Saman, 2017; Lobachev et al., 2017b; Wang et al., 2020). Arganda-Carreras et al. (2010) is the origin of ImageJ's "Register virtual stack slices" implementation. (ImageJ (Schneider et al.) and Fiji have served as a basis for many registration and analysis approaches.) They focus strongly on various rigid approaches, although an elastic extension exists. Arganda-Carreras et al. use micro-CT for quality control and utilize the Hausdorff distance as a measure. The ImageJ plugin "TrakEM2" also utilizes feature detection; it not only performs registration, but also includes tools for 3D modeling, editing, and annotation. Ma et al. (2019) and Zhang et al. (2020a), for example, improve the correspondence of features ("matching"). Cieslewski et al. (2019) is an example of an alternative to feature descriptors. Registration of whole sections (e.g., Mueller et al., 2011) motivated the usage of feature detection in Ulrich et al. (2014).

Validation, image generation, and benchmarks
In a sense, this paper is dual to Cifor et al. (2011). They thought explicitly about possible distortions during sectioning and displaced the real section images in a way that would counter this distortion; a similar idea is behind most registration methods. Our distorted ground-truth data are the input of a registration benchmark. We validate multiple existing registration methods by comparing the (previously distorted and then) registered images both to the distorted, but not registered, images and to the (not distorted, perfectly aligned) original data. Sections 3.6 and 4.4.1 detail our evaluation methodology. Pluim et al. (2016) provide an overview of validations of medical registrations. Cunliffe et al. (2012) are concerned with changes to medical images effected by registration. Van Sint Jan et al. (2002) showcase a very special kind of registration that was validated with kinematics. The method by Delaby et al. (2010) is a more typical case, where the 3D reconstructions were validated by a different modality. Schnabel et al. (2003) discuss physically plausible distortions in breast MR data. Shojaii et al. (2011) use block-face images and fiducial markers for the validation of the registration. Kybic (2008) undertakes special efforts for the evaluation of registration accuracy in the absence of ground truth. In contrast to all those approaches, we suggest using multiple image-based quality measures for the evaluation of the registrations. The essence of the present manuscript is the availability of a ground truth, so no further modalities, manual interventions, or external markers are required.
Figure 2: Our method works also on general-purpose images. The test image is from the Japanese ITE data set of UHD images; we took a 1k × 1k pixel crop from the center of the "Ship" image (U10, 2K version). The many straight lines allow for good identification of distortions. The global rigid transformation is omitted for its simplicity. The distortion map is in line with our usual settings. The visualization in (i) compares PSNR between (b) and (h); it is thresholded at 20%.
Figure panel labels: (a) Before deformation; (b) Locally deformed; (c) Differences, PSNR, 5%; (d) Differences, PSNR, 20%.
The generation of further images is by far not new in medical image processing; to name a few examples, Xue et al. (2012) generate synthetic images for better T1 MRI, Duchateau et al. (2018) generate pathological cardiac images, and Grova et al. (2001) use computer-generated SPECT data to validate their MRI-SPECT registration. Image generation is connected to validation, because as long as we are able to generate images with given properties, these can be used to validate other image-based methods. Hamarneh et al. (2008) use all kinds of statistical and physically-based distortions, noise, and artifacts, but their method is focused on MRI and CT data. Even though distortions of 2D images are also possible with their framework, the method is focused on other modalities. Vlachopoulos et al. (2015) generate distorted CT images using landmarks and thin-plate splines. Their warping method was specially chosen to imitate aspiration. The images were used to evaluate registration methods in normal lungs and in organs with interstitial lung disease. The idea of the evaluation is similar to ours; however, we focus on histological serial sections and provide an elaborate methodology to generate such distorted images. Both the nature of the deformation and the method of its implementation are different in this work: we are concerned with sectioning and processing artifacts in removed tissue, and we do not use thin-plate splines and landmarks to compute distortions. Zhang et al. (2020b) both generate synthetic images and use real data to compare their global registration method to others, with promising results. Unlike the present work, they distort the images using quadric 2D polynomials and focus on either homography or distortions commonly found in digital cameras. These kinds of distortions differ a lot from our approach, since they are induced by the optical pathway in the camera and not by the physical sectioning of the specimen.
Related to the above methods are registration benchmarks and challenges. We would like to specially mention the EMPIRE10 challenge (Murphy et al., 2011). Borovec et al. (2018) benchmark registrations of differently stained serial sections (a rather distinctive kind of registration: Likar and Pernuš, 2001; Mueller et al., 2011; Song et al., 2014; Trahearn et al., 2017; Lobachev, 2020; Wodzinski and Müller). Basically, registration of differently stained serial sections is about transferring the distortions from one kind of staining to another. Co-registration across modalities is also a tangential topic to our work, e.g., CT to MRI (Bardinet et al., 2002; Tang et al., 2012; Gehrung et al., 2020), serial sections to MRI (Jacobs et al., 1999; du Bois d'Aische et al., 2005; Foster et al., 2017), or MRI to ultrasound (Guo et al., 2020). We distort images from other modalities (similarly to the distortions in serial sections) in order to obtain challenges for testing registration methods with a known ground truth. In this work, during each of the registration attempts we remain within a single selected modality.
The recent ANHIR challenge (Borovec et al., 2020) uses multiple data sets, where the ground truth is obtained by external markers on the histological series; their landmarks were placed by multiple human annotators. We use automatic image-based metrics in this work and do not use external markers; however, our approach is open to further measures. (It would be easiest to integrate further image-based measures, though.) We eschew external markers, as we have a ground truth, which distinguishes our work from histology-based challenges, where no direct ground truth is available. Further, the ability to fully automatically compute the "score" of a registration method from ground truths and registration output allows our approach to be used in automatic tests of registrations, such as continuous integration (Section 5.7).
The NIREP project (Christensen et al., 2006) evaluates specifically non-rigid registrations in the absence of a ground truth. This paper is about distorting a known ground truth for the evaluation of rigid and, mostly, non-rigid registrations; we thus circumvent the main problem of an unavailable ground truth.
To name further related papers: Pontré et al. (2017) present a cardiac perfusion MRI registration challenge focused on motion correction; Brock (2010) compares the accuracy of different deformable registration methods on MRI and CT data; West et al. (1997) and Hellier et al. (2003) are examples of evaluations of inter-subject registrations. Klein et al. (2009) and Ou et al. (2014) evaluate registration methods for inter-patient brain MRI.

Quality measures for registration
Image registration can be seen as an optimization problem. Similarity measures are key to both good registrations and their evaluation, as a similarity measure is basically the objective of the optimization. Mutual information is often used as such a measure (Viola and Wells III, 1997; Pluim et al., 2000, 2003; Déniz et al., 2015; Polfliet et al., 2018). A related problem is the selection of the reference in a series of histological sections (Bagci and Bai, 2010). Nanayakkara et al. (2009) introduce a metric for registration errors. Luo et al. (2020) discuss the relation between registration errors and uncertainty.
In this work, we choose image-based measures as the arbiter of registration quality. Such an approach allows not only for automatic generation of the inputs and automatic execution of the registration, but also for an automatic evaluation of the results. In our evaluation we use the standard measures by Jaccard (1912) and Wang et al. (2004), as well as dense optical flow (Farnebäck, 2003). We also use the Dice (1945) measure as a visualization of the Jaccard measure, as the formulation of Dice can be converted into the formulation of Jaccard. (Details on thresholding methods are in the supplementary material.) We use a PSNR-based visualization as well. Crum et al. (2006) discuss generalizations of overlap-based measures, but in this work we opted for the well-known measures. Rohlfing (2012) criticizes the usual image-based measures, but our method can use any measures for evaluation. Our method is not tied to the measures we use; hence extensions and further measures are possible. The core idea is to use (now-distorted) inherently 3D images to benchmark 2D registrations.
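For reference, the conversion between the two overlap measures follows from a standard identity (not specific to our implementation):

```latex
% For binary masks A and B:
%   J = |A \cap B| / |A \cup B|   (Jaccard),   D = 2 |A \cap B| / (|A| + |B|)   (Dice).
% Since |A| + |B| = |A \cap B| + |A \cup B|, the two are monotone transforms of each other:
\[
  D = \frac{2J}{1+J}, \qquad J = \frac{D}{2-D},
\]
% so a ranking of registrations by Dice coincides with a ranking by Jaccard.
```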

Machine learning
With modern deep learning methods, the measure can be implicitly learned, as Krebs et al. (2017) mention. Maier et al. (2019) provide an introduction to deep learning in medical imaging. Cheplygina et al. (2018) survey semi-supervised, multiple instance, and transfer learning techniques in medical imaging. Machine learning functions best with a lot of ground truth data from which the relations can be learned. In this context, our work might provide a way of generating the much-needed input-output data pairs. Kläser et al. (2018) generate synthetic CT images from MRI. Others use style transfer to generate images in the style of different vendors; such generated images enable better machine learning. For conventional registration, the style is less relevant.
We stress that our method generates distorted images without any use of machine learning. Thus, our method can be used to generate additional data sets or to augment machine learning input.

Lung in 3D
The methods, options, and research outcomes in 3D reconstructions of the lung, using modalities ranging from corrosion casts up to 3D EM methods (e.g., Schneider et al., 2021), are reviewed by Mühlfeld et al. (2018); practical applications include Mayhew et al. (2009) and Grothausmann et al. (2016). Although EM studies of the lung are popular and important (e.g., Ochs, 2010; Ochs et al., 2016; Schneider et al., 2019; Mühlfeld et al., 2021), serial sections for LM have their place in the investigation repertoire (e.g., Woodward and Maina, 2008; Mühlfeld et al., 2017; Pentinga et al., 2018). Putting stereology aside, a proper registration is paramount in any investigation of serial sections as a 3D data set.

Methods
The typical sectioning-induced distortion was found by Schormann et al. (1995) to be Rayleigh distributed; this basically means a normal distribution along each image axis. We showcase our method on a synthetic image (Fig. 1), on a standard test image (Fig. 2), and on real data, a 2D image from a micro-CT scan of an animal lung (Fig. 3). To simulate lost or damaged parts of the sections (which we recently learned how to repair: Lobachev, 2020), additional arbitrarily placed "damage" can be added to the image (Fig. 2h). Finally, a global rigid transformation is applied (Fig. 1d). The rationale behind this step is that only in very rare cases can the sections be placed on the glass slide while maintaining the exact orientation. We randomly apply a rotation and a translation to the images to simulate the uneven positioning. Such images then serve as inputs for the registration benchmark. Fig. 4 visualizes the core approach and the complete pipeline of this paper. We present a distortion generator that is, in a sense, dual to a registration. We derive an evaluation of a registration method from the original 3D stack and the registration results.
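For reference, the relationship between independent per-axis normal displacements and a Rayleigh-distributed displacement magnitude is the textbook one:

```latex
% If dx ~ N(0, sigma^2) and dy ~ N(0, sigma^2) are independent,
% the displacement magnitude r = sqrt(dx^2 + dy^2) follows a Rayleigh distribution:
\[
  p(r) = \frac{r}{\sigma^{2}} \exp\!\left( -\frac{r^{2}}{2\sigma^{2}} \right), \qquad r \ge 0 .
\]
```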

Local distortions
Generation of local, non-rigid distortions is of high importance for our method. At the heart of the local distortion lies the generation of normally distributed values. Two independent normally distributed random values form the x and y coordinates of a displacement, making the displacements Rayleigh-distributed (Schormann et al., 1995). The coordinates of the locations where the distortions are placed lie on a rectilinear grid.
Figure 4: Illustrating the steps in this paper. Individual normalization of benchmark inputs and the application of global normalization are not mandatory. Currently, we do not use damage generation and damage repair in our benchmark images, but we could also test the repair methods in this manner. The evaluation ensues from a comparison of the image series. The dotted boxes show two larger conceptual components, the distortion generator and the registration.
We generate multiple distortion "levels" using a classical multiscale approach. The distortions are stored as coordinates of "new" points in a matrix holding both the x and y coordinate as an element; this is a typical remap matrix of OpenCV. The distortions are blurred with a Gaussian kernel in each multiscale level to make the displacements smoother. This way we avoid undesirable and unrealistic foldings, as folds in real sections look different.
In the implementation, we used the Mersenne Twister pseudorandom generator (Matsumoto and Nishimura, 1998), the standard one in Python. The distortion maps are applied to the input images with the OpenCV remap function using bicubic interpolation. The distortion maps are saved for further analysis. We can generate those maps in a fully deterministic manner, if desired. This determinism contributes to reproducibility.
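A minimal sketch of such a distortion generator, assuming NumPy and OpenCV (the function names and parameter values are illustrative, not the actual implementation from the repository):

```python
import cv2
import numpy as np

def make_distortion_map(h, w, grid=64, sigma_px=8.0, levels=3, seed=0):
    """Build an OpenCV remap matrix with Rayleigh-distributed displacements.

    At each grid node, two independent normal values form the x and y
    displacement; several multiscale levels are summed and Gaussian-blurred
    so that neighbouring pixels move coherently.  A fixed seed makes the map
    deterministic and hence reproducible.
    """
    rng = np.random.RandomState(seed)   # NumPy's Mersenne Twister generator
    dx = np.zeros((h, w), np.float32)
    dy = np.zeros((h, w), np.float32)
    for level in range(levels):
        step = grid * 2 ** level                         # coarser grid per level
        gh, gw = h // step + 2, w // step + 2
        gx = rng.normal(0.0, sigma_px * 2 ** level, (gh, gw)).astype(np.float32)
        gy = rng.normal(0.0, sigma_px * 2 ** level, (gh, gw)).astype(np.float32)
        gx = cv2.resize(gx, (w, h), interpolation=cv2.INTER_CUBIC)
        gy = cv2.resize(gy, (w, h), interpolation=cv2.INTER_CUBIC)
        k = 2 * (step // 2) + 1                          # odd Gaussian kernel size
        dx += cv2.GaussianBlur(gx, (k, k), 0)            # smooth to avoid foldings
        dy += cv2.GaussianBlur(gy, (k, k), 0)
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    return xs + dx, ys + dy                              # remap coordinates

def distort(img, map_x, map_y):
    """Apply the distortion map with bicubic interpolation."""
    return cv2.remap(img, map_x, map_y, cv2.INTER_CUBIC)
```

Saving the seed together with the remap matrices is what makes a whole distorted series reproducible.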
In our application, the final distortion map is visualized (Fig. 1c, 2d) using HSV colorspace. The Cartesian coordinates of the displacements are mapped to a polar angle (associated with hue); vector magnitude basically codes the intensity.
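A possible implementation of this colour coding, following the common OpenCV recipe for optical-flow visualization (a sketch; the exact scaling in our code may differ):

```python
import cv2
import numpy as np

def displacement_to_hsv(dx, dy):
    """Colour-code a displacement field: hue = direction, value = magnitude."""
    mag, ang = cv2.cartToPolar(dx.astype(np.float32), dy.astype(np.float32),
                               angleInDegrees=True)
    hsv = np.zeros((*dx.shape, 3), np.uint8)
    hsv[..., 0] = (ang / 2).astype(np.uint8)    # OpenCV hue range is 0..179 degrees
    hsv[..., 1] = 255                           # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```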
Our distortion maps are a simple, reproducible, and well-defined way to add sectioning-inspired distortions to arbitrarily registered data. We aimed to define a stable and reproducible way to model such distortions using the statistical properties of the real-world distortions.
The reproducibility is ensured through multiple efforts. We have the initial, "ideal" state, the ground truth. We save the exact rigid transform, the non-rigid distortion field, and the distorted result. Through the use of pre-defined, deterministic states of the pseudorandom generator, we can basically save all the transformations in the form of the seed value (plus the original images, of course). The above measures are useful to ensure reproducibility, e.g., for automatic regression testing of registration methods.

Adding rigid transform
The locally distorted images are further processed. In our application we add an image-wide rigid transform: a random rotation and a translation (Fig. 1d). The rotation angle is uniformly distributed between −180 and +180 degrees. The translations are also uniformly distributed, but they are chosen in the range [−d/4, d/4], where d is the maximal image dimension, in order not to truncate too much of the image content. The rigid transform is recorded, as it is the ground truth for the first, rigid step of the registration.
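A sketch of this step with the stated parameter ranges (the helper name is hypothetical; the recorded tuple doubles as the rigid ground truth):

```python
import cv2

def random_rigid_transform(img, rng):
    """Apply a random rotation and translation; return the image and the parameters."""
    h, w = img.shape[:2]
    angle = rng.uniform(-180.0, 180.0)                # rotation angle in degrees
    d = max(h, w)
    tx = rng.uniform(-d / 4, d / 4)                   # translations bounded by d/4
    ty = rng.uniform(-d / 4, d / 4)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)   # 2x3 rigid matrix
    m[0, 2] += tx
    m[1, 2] += ty
    moved = cv2.warpAffine(img, m, (w, h), flags=cv2.INTER_CUBIC)
    return moved, (angle, tx, ty)                     # record as rigid ground truth
```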

The use of the distorted images
The distorted images serve as a starting point for the evaluation of registration methods. The registration should, basically, "undo" the distortions and transforms we applied to the original images. For the benchmark and evaluation purposes we suggest using innately registered data, i.e., original 3D images. The benefit of doing so is the available ground truth, the original undistorted 3D stack. Summarizing, we circumvent the problem of the missing real ground truth when comparing registrations of serial sections.

Damaging the images (optional)
An optional extension is to mimic the damage to which the sections are subjected during processing (Fig. 2h). Using normally distributed pseudorandom values for dimensions and placement, we iteratively generate an ImageMagick (The ImageMagick Development Team, 2020) script that deletes selected parts of an image. This is a rather crude approximation of the variety of possible kinds of damage to a section (Lobachev, 2020), from an unsharp region (due to a focus error) to a torn section. However, we would like to model missing parts of a section in an understandable and straightforward way. Our test subject, a registration method, would not necessarily care about the reason why some image regions are not matchable to other images. It is not our current goal to model the damaged sections realistically.
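Our implementation emits an ImageMagick script; a NumPy-only stand-in (hypothetical, not the original code) that blanks normally placed and sized rectangles could look like this:

```python
import numpy as np

def add_damage(img, rng, n_patches=5, mean_size=60, fill=0):
    """Blank a few normally placed and sized rectangles to mimic section damage."""
    out = img.copy()
    h, w = out.shape[:2]
    for _ in range(n_patches):
        cy, cx = rng.normal(h / 2, h / 4), rng.normal(w / 2, w / 4)  # placement
        ph = abs(rng.normal(mean_size, mean_size / 3))               # patch height
        pw = abs(rng.normal(mean_size, mean_size / 3))               # patch width
        y0, y1 = int(max(cy - ph / 2, 0)), int(min(cy + ph / 2, h))
        x0, x1 = int(max(cx - pw / 2, 0)), int(min(cx + pw / 2, w))
        out[y0:y1, x0:x1] = fill                                     # "missing" region
    return out
```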

Registration methods
In order to demonstrate our methodology for the evaluation of registrations, we apply the following registration methods to our distorted data sets:
1. "Rigid-SURF": feature-based rigid-only registration based on weighted RANSAC (Fischler and Bolles, 1981; Lobachev et al., 2017b) and the SURF feature detector (Bay et al., 2006);
2. "Rigid-SIFT": same as above, but with the SIFT feature detector (Lowe, 1999, 2004);
3. "Deform-SURF": feature-based deformable registration (Lobachev et al., 2017b), the first stage based on "Rigid-SURF", followed by multiple non-rigid stages using B-splines;
4. "Elastix": a generic Elastix (Klein et al., 2010; Shamonin et al., 2014) configuration; we used the rigidly registered result from "Rigid-SURF" as input. The parameter file is made available in the supplementary material. The non-rigid stage is not a feature-based method;
5. "GS": a registration method based on Gauss-Seidel optimization (Gaffling et al., 2015); we used the rigidly registered result from "Rigid-SURF" as input. However, the actual non-rigid deformation is based on different principles, among others, on gray-level co-occurrence matrices;
6. "Blending": a registration method based on blending rigid transforms in image regions (Kajihara et al., 2019).
The rigid methods work pair-wise on the images. We operated Elastix pair-wise on the images, hence the possible accumulation of "drift" over the series. "Deform-SURF" optimizes the whole stack at once in the non-rigid phase; "GS" does the same. The "Blending" method works backward and forward from a reference frame. In this case, the inputs were used "back and forth": for a series of images 1, ..., n, the input was n, ..., 1, 1, ..., n, essentially doubling the length of the series. Border interpolation was less of a concern for us; the images were padded before processing.
Why the rigid transformations? The rigid-only methods we used clearly cannot undo the non-rigid distortions. But the non-rigid distortions also make the search for correspondences between image features, needed for the rigid transformations, harder. Further, a rigid-only registration is not necessarily perfect in what it does; in other words, a rigidly transformed (Section 3.2) and then rigid-only registered series is different from the series before such transformations.

Result evaluation with quality measures
We propose the use of established image quality measures: structural similarity (SSIM, Wang et al., 2004), the Jaccard measure (Jaccard, 1912), often visualized here, with slight implementation differences, as a Dice measure (Dice, 1945), and a visualization of dense optical flow (Farnebäck, 2003). In the latter, color stands for the direction and intensity for the magnitude of the movement. We also use a PSNR visualization, as implemented in ImageMagick; there, red highlights a non-correspondence. For the Jaccard measure we use global thresholding. For the Dice measure visualization we use Otsu's method (Otsu, 1979) on blurred images, as implemented in OpenCV (Bradski, 2000; Kaehler and Bradski, 2014). Further details are in the supplementary material. The latter material also details the manner in which we crop the images for evaluation in order to eliminate border effects. In our visualizations, black means "in neither image", magenta and green mean "in one image, but not in the other", and white means "in both".
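A sketch of how such per-pair measures can be computed with scikit-image and OpenCV (the global Jaccard threshold of 100 and the Otsu-on-blurred-images scheme are the ones stated in this paper; the helper names, blur kernel size, and colour assignment are illustrative):

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def pair_measures(a, b, jaccard_threshold=100):
    """SSIM, Jaccard (global threshold) and PSNR for two 8-bit grayscale images."""
    ssim = structural_similarity(a, b)
    psnr = peak_signal_noise_ratio(a, b)        # identical crops yield +inf
    ma, mb = a > jaccard_threshold, b > jaccard_threshold
    union = np.logical_or(ma, mb).sum()
    jaccard = np.logical_and(ma, mb).sum() / union if union else 1.0
    return ssim, jaccard, psnr

def dice_visualization(a, b, blur=5):           # blur kernel size is illustrative
    """Otsu-thresholded overlap image: white = both, magenta/green = one only."""
    ta = cv2.threshold(cv2.GaussianBlur(a, (blur, blur), 0), 0, 255,
                       cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1] > 0
    tb = cv2.threshold(cv2.GaussianBlur(b, (blur, blur), 0), 0, 255,
                       cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1] > 0
    vis = np.zeros((*a.shape, 3), np.uint8)
    vis[ta & tb] = (255, 255, 255)              # tissue present in both images
    vis[ta & ~tb] = (255, 0, 255)               # only in the first image (magenta)
    vis[~ta & tb] = (0, 255, 0)                 # only in the second image (green)
    return vis
```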
To give an example, Figure 5 shows some of the quality measures applied to different registrations of the same data. The top row, (b)-(f), shows SSIM; the bottom row, (h)-(l), shows the optical flow visualization. We registered a full LS series with all methods. We used the distorted images that were also globally transformed as the input. Then, a 1k × 1k pixel crop from the full registration was used for evaluation to eschew border effects. Panels (a) and (g) of Fig. 5 show regions cropped from the original images. Panels 5(b) and (h) show locally distorted images, before the global rigid transform. For all registrations, the images were locally distorted and globally transformed. The images shown in panels (c) and (i) were then registered with "Rigid-SIFT"; this is the rigid transformation only. Panels (d) and (j) show the evaluation of images registered with "Deform-SURF", both rigidly and non-rigidly. The images shown in panels (e) and (k) were rigidly registered with "Rigid-SURF", then non-rigidly registered with "GS". Panels (f) and (l) show the evaluation of the original, non-distorted data, i.e., the ground truth.

Results
First, we describe our specimens and discuss the data processing (Section 4.1) and data preparation (Section 4.2). Our main result is a full-series evaluation with statistical means (Section 4.4). We also include some special cases (Section 4.3). A pair-wise evaluation is included in the supplementary material. Notice that the full images were always used for the registration. We mostly look at center crops for the consistency of the evaluation, but full images were processed beforehand.

Specimens
We applied our method to micro-CT, light sheet, and EM images of animal lungs (Fig. 6). The data was processed in the manner standard for each modality; for the purposes of this paper, we treated the data as already processed and ready-to-use 3D images. The images were extended to the sizes specified below in order not to lose data during the global movement phase. Specifically, we used:
• A rabbit lung acquired with micro-CT (Fig. 6a). The specimen was a New Zealand White rabbit that was artificially delivered 3 days early by cesarean section and that spent 7 days in hyperoxia (95 %); the lung was perfusion fixed. The sample comes from a project studying bronchopulmonary dysplasia in a hyperoxia preterm rabbit model, part of a larger study of bronchopulmonary dysplasia models (Appuhn et al., 2021). The sample was scanned with a micro-CT system (Bruker microCT, Kontich, Belgium). The X-ray source was set to a tube voltage of 80 kV and a tube current of 125.0 µA; the X-ray spectrum was filtered by 1 mm of aluminum prior to incidence onto the sample. We recorded a set of 2 stacked scans overlapping along the sample height; each stack was recorded with 488 projections of 3104 × 1091 pixels (2 projections stitched laterally), one every 0.4° over a 180° sample rotation. Every projection was exposed for 2247 ms; 5 projections were averaged to greatly reduce image noise. This resulted in a scan time of approximately 8 hours.
The projection images were then subsequently reconstructed into a 3D stack of images with NRecon (Version 1.7.4.2, Bruker microCT, Kontich, Belgium) using a ring artifact correction of 7. The whole process resulted in a data set of 1135 images with an isometric voxel size of 7.0 µm (see also Fig. 3). The images were pre-processed with a 2D anisotropic diffusion denoising filter based on lattice basis reduction (Fehrenbach and Mirebeau, 2014).
We extracted 600 images from the middle of the filtered data set for our benchmark and padded them (Sec. 4.2), yielding images of 3954 × 3954 pixels.
• An EM serial block-face (SBF-SEM) data set of an adult mouse lung (Fig. 6b).
The specimen was a 4-week-old C57BL/6 mouse; the lung was perfusion-fixed (Buchacker et al., 2019). The experiments were approved by the Regierungspräsidium Karlsruhe. Overall, 5246 sections with 80 nm thickness were cut in a Zeiss Merlin VP Compact SEM (Carl Zeiss Microscopy GmbH, Jena, Germany), using a Gatan 3View2XP system (Gatan Inc., Pleasanton, CA, USA). The block-face was captured with a view port of 525 × 525 µm, yielding 15k × 15k pixels with 0.5 µs dwell time, 3.0 kV acceleration voltage, and variable pressure mode at 30 Pa.
The benchmark uses a crop from the full data set with 1000 images at 1.5k × 1.5k pixels. The final resolution is 0.15 µm/voxel. The data set was denoised with gradient anisotropic diffusion using ITK before usage. We did not apply any registration in post-processing, but we individually normalized the images, as detailed below.
• A lung for the light sheet (LS) data set was obtained from a male 24-week-old Fisher 344 rat with a body weight of 320 g, which was part of a ventilation study approved by the LAVES in Oldenburg (animal experiment proposal number 17/2608). The lung was fixed in an inflated state with an airway pressure corresponding to 20 cm H₂O and perfusion fixed, compare Krischer et al. (2021). By means of a "tissue slicer" the lung was cut into slices of 2 mm thickness.
The image data was acquired with the UltraMicroscope II (LaVision BioTec GmbH, Bielefeld, Germany). The lung slices were pinned to a mandrel in the ethyl cinnamate-filled detection chamber and illuminated unidirectionally with 6 light sheets. An sCMOS camera detected the fluorescence light with a wavelength of 490 nm, which matches the tissue's autofluorescence, perpendicular to the illumination plane. Due to the large dimensions of the rat lung and the intention to depict the complete lung, the only zoom factor to choose was 0.63, corresponding to a 1.26-fold magnification. The series was acquired as 336 images with 5.16 × 5.16 × 15 µm/voxel (Fig. 6c).
For the benchmark we use 300 images at 3.9k × 3.9k pixels that were individually normalized, see below.
• As a reference, we also provide two serial sections of the same rabbit lung as in the micro-CT, stained with toluidine blue. Fig. 6d shows one of those sections. The images were acquired in transmitted LM with a Zeiss AxioScan.Z1 scanning microscope (Carl Zeiss Microscopy GmbH, Jena, Germany) at 0.22 µm/pixel (20× lens). The section thickness was 2 µm.

Data preparation, normalization, and availability
All benchmark images were extended to a larger square to reduce the loss of information during rotation and translation. Their bit depth was reduced to 8 bit; LS and EM images were normalized individually using ImageMagick and GNU parallel (Tange, 2020). The normalization of individual images is introduced to mimic a slightly varying image intensity (Bagci and Bai, 2010; Khan et al., 2014; Nadeem et al., 2020) due to the varying thickness of slices (Hanslovsky et al., 2015) or the varying penetration of the fixation and staining solutions (e.g., Bulmer, 1962; Lanir et al., 1984; Dullin et al., 2017).
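The effect of this per-image normalization can be approximated in a few lines (a sketch; we actually used ImageMagick's normalization driven by GNU parallel, and the clipping percentage here is illustrative):

```python
import numpy as np

def normalize_to_8bit(img, clip_percent=1.0):
    """Stretch one slice to the full 8-bit range, clipping extreme intensities."""
    lo, hi = np.percentile(img, [clip_percent, 100.0 - clip_percent])
    out = np.clip((img.astype(np.float32) - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
    return (out * 255).astype(np.uint8)
```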
We provide as supplementary data the original, undistorted images, the locally distorted images, the locally distorted and rigidly transformed images, the local distortions, and the values for the rigid transformations.

Special cases
For the pair-wise evaluation in the supplementary material we intentionally picked the center of a series and a clearly defined region. Here we would like to separately highlight some especially good or bad consecutive image pairs in Fig. 7. This is a subjective selection of image pairs, as contrasted with the next section.
In panel (a), a larger global shift in the non-linear distortion phase of the registration input is shown (CT data set, sections 95-96, Dice visualization). Panel (b) shows uncorrected global movement in the Elastix-based registration (same data set, sections 99-100, Dice visualization). There is a "floppy end", a movement, in the Elastix result (c), from the same data set, sections 305-306, Dice visualization; mostly the registration is good, but in the depicted region the offsets are much larger.
In panels (d), (e) the measures of an interesting image pair from "Deform-SURF" are depicted. We show Dice (d) and optical flow (e) visualizations from the same region. There is some movement, but is it all explainable by the "natural" differences between consecutive images?
Areas with little tissue (as found in our EM series near its end) are a greater challenge for the feature-based methods, as panels (f) and (g) show. We see there a failure of the feature-based SIFT method to find a correct rigid alignment. Full images from the EM data set, sections 702-703, are shown. The probable reason is the low number of viable key points found by the feature detection.
Subfigures (h)-(j) show a small evaluation of a particularly good GS-registered image pair, LS data set, images 161-162. We show there Dice visualization, PSNR, and SSIM, respectively, for the same region. Notice the low values and very few differences.

Full-series evaluation
After we have looked into the relations of two consecutive images from the middle of the series, a question arises naturally: how do the quality measures look throughout the series? To answer it, we computed the three numerical measures SSIM, Jaccard, and PSNR over the whole series and evaluated these values statistically. All Jaccard values below are computed with threshold 100.

Methodology
Our concept of the evaluation focuses on comparing consecutive sections from the registered series to the same consecutive sections from the ground truth. We decided against comparing the images from the registered series directly to the ground truth images: accumulated errors from the rigid transformations and non-rigid distortions impact such comparisons. We are more interested in how the now-registered series fits to itself in comparison to how it should have fit; a direct comparison would find a lot of mismatch that is of less interest in practice. To give a simple example, many methods might over-fit the registrations of their inputs to obtain better numerical quality values ("over-registration", the "banana problem"). With our ground truth series we are able to find such cases.
The measures were computed on a 500 × 500 pixel crop from the middle of the images, throughout the series. Lower values at the beginning and the end of the series can be explained by less tissue in the region. However, we advocate the use of the center crop as a simple-to-define and "fair" way of defining a region. As detailed in the supplementary material, it is impractical to use the full image for the evaluation. Also, some methods used additional padding, making the uniform and comparable use of general cropping offsets harder. The middle of the image should arguably have meaningful tissue content in most cases. "Drifting away" tissue, i.e., the case when different registrations accumulate the errors so differently that we obtain fully different image regions at the same offset, is rather an exception. This way, we have meaningful content for most of the series.
We present box plots of the appropriate values of the quality measures. In this case we decided against using violin plots: violin plots show outliers similarly, but the median and the shape of the inliers can be discerned with less clarity in most of our particular cases. (We still present some violin plots in Fig. 11, see also the supplementary material.) We also present a statistical evaluation. We performed an unpaired t-test allowing for different means and unequal variances, a Welch two-sample t-test, to be exact. We always compared a measure of the registration results to the same measure of the ground truth.
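The complete evaluation then reduces to a small loop plus SciPy's Welch test (a sketch: `measure` stands for any single per-pair quality function from Section 3.6, and the crop size is the one stated above; the helper names are illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind

def center_crop(img, size=500):
    """Central size x size crop used throughout the full-series evaluation."""
    h, w = img.shape[:2]
    y0, x0 = (h - size) // 2, (w - size) // 2
    return img[y0:y0 + size, x0:x0 + size]

def series_values(series, measure):
    """Apply a pair-wise measure (one scalar per pair, e.g. SSIM) to all
    consecutive section pairs of a series."""
    return np.array([measure(center_crop(a), center_crop(b))
                     for a, b in zip(series[:-1], series[1:])])

def compare_to_ground_truth(registered_values, ground_truth_values):
    """Welch two-sample t-test (unequal variances): a small p value means the
    mean per-pair quality of the registered series differs from the ground truth."""
    t_stat, p_value = ttest_ind(registered_values, ground_truth_values,
                                equal_var=False)
    return p_value
```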

CT
Consider Fig. 8, presenting box plots of the full-series evaluations. Overall, "Deform-SURF" and "GS" stand out for their good performance. Panel (a) shows SSIM values. Notice how the median of "GS" is higher than that of the ground truth (0.895 17 vs. 0.877 45). In panel (a), "GS" and "Deform-SURF" are close to the ground truth; there are a lot of outliers in "Deform-SURF", while "GS" seems to overshoot a bit but has some lower outliers. There are few outliers in the ground truth, too, but they are rather symmetric. The latter also holds for the Jaccard measure. In panel (b) there are some outliers with high Jaccard values in "GS". In (c) we see, again, values of "GS" and "Deform-SURF" similar to the ground truth, but now there are many outliers with high and very low PSNR values in "Deform-SURF". Notice also the shape of the ground truth box plot for PSNR: there are quite a few values above the median. Overall, Elastix has good results, but quite tall box plots, indicating high variance. A possible reason is that Elastix operated pair-wise on the image sequence in this case. Both "Deform-SURF" and "GS" operate on a full image stack at once. Table 1 shows a statistical evaluation. We aim to decide with a Welch t-test whether the quality measures of a registered series are similar to those of the ground truth. This is almost never the case. For "GS" with respect to the Jaccard measure (b) we see the largest similarity according to the test, but p is still quite low there, under 4.3 · 10⁻⁴. Sometimes, the maximal values of the quality measures for a registration method are higher than the maximal ground truth value for the same measure. We attribute this to a wider "spread" of the variance, induced by the registration. To give an example for the SSIM measure, the variance of "Deform-SURF" for the full series is 4.29 · 10⁻³, while the variance of the ground truth is 1.05 · 10⁻⁴.

EM
Figure 9 presents the box plots of the quality measures for the full EM series. The normalized, 8 bit ground truth was the actual input for our distortion method; its output was the input of the registrations. The 16 bit ground truth is the original data and is provided as a reference. Fig. 9 necessitated some adjustments. There were some very low SSIM values; we adjusted the scale of the y-axis in panel (a) to show more of the relevant details around the median values and to remove some outliers. Panel (b) was plotted without any adjustments. We had to filter the PSNR data to remove infinite values in panel 9c. Also, the quality values there for the 16 bit ground truth were much higher than for the other modalities; we adjusted the scale of the y-axis to show the results of the registrations in more detail.
We see in panel (a) that "GS" almost reaches the level of the normalized ground truth with respect to SSIM. The 16 bit ground truth has lower SSIM values because of more details. As before, "GS", "Deform-SURF", and Elastix look quite good in the box plots. The Jaccard values (b) were rather high, though; the probable reason is the amount of background in the EM data set. In both discussed measures there are some outliers on the lower side. PSNR (c) shows very high values for the 16 bit ground truth; we disregard them, as all other values are 8 bit. Concerning PSNR, we see some over-registration in "GS", less so in Elastix and "Deform-SURF": the median values and most of the box contents (the box represents 50% of the data around the median) are higher than in the ground truth. The box plot for the normalized ground truth shows some outliers towards larger PSNR values, however.
In the statistical evaluation (Table 2), we see a slightly larger p value for "GS" with respect to SSIM, but nothing extraordinary for this measure. The high values of 1.0 for SSIM originate from the region at the end of the series with few changes because of the low amount of tissue. This rather indicates a failure of the registration, as the measure is computed on a center crop. As mentioned above, the Jaccard values are rather high; again, "GS" manages to obtain a slightly higher p for its mean. Quite interesting is PSNR, where "Deform-SURF" manages a p ≤ 0.02713, but even more spectacularly, Elastix has p ≤ 0.7992. This value basically means that the mean PSNR for this data set, registered with Elastix, matches the mean of the normalized ground truth with a high probability. Such a match is an exception in our evaluations.

LS
Consider Fig. 10. The SSIM for the LS series shows quite good values for "Deform-SURF", "GS", and also for Elastix, but in this case with some outliers. Surprisingly, the plot for the normalized ground truth is less convincing; the original, 16 bit ground truth shows even less similarity. The variance is clearly much larger in the 16 bit data. Fig. 11 shows some further details.
Table 1 (displaced caption): The Welch two sample t-test for the quality measures on the CT data set. Similarly as above, the "locally distorted" values originate from applying only local distortions with our method. The registrations' input was also globally transformed. Concerning the appropriate measures' values, we compare the means from each of the registration results to the mean for the ground truth. The differences are statistically significant in almost all cases; we show the p value for "not equal". The hypothesis of equal means was almost always refuted with high confidence. We additionally show the maximal value of a measure ("Max" column). The "local" row is for the local distortions only, without global movements. The mean value marked with * is the one closest to the ground truth per t-test. The maximum values marked with † are larger than the maximum of the ground truth; this is a "crime" many good methods commit in our evaluation. The closer the mean is to the ground truth, the better.
Table 2 (displaced caption, beginning truncated): ... "D.-SURF" for Deform-SURF, "local" for only local distortions, "Gr. tr. n." for "ground truth, normalized", "Gr. tr. 16" for "ground truth, 16 bit". We compare the means from each of the registration results to the mean for the normalized, 8 bit ground truth. The differences are statistically significant in almost all cases; we show the p value for "not equal". The hypothesis of equal means was almost always refuted with high confidence. Notice Elastix with respect to PSNR with p < 0.7992; it matches the mean of the normalized ground truth PSNR up to −0.2 % relative error. The mean value marked with * is the one closest to the normalized ground truth per t-test. We also show the maximal value of a measure (the "Max" column). The maximum values marked with † are larger than the maximum of the normalized ground truth. Notice that the quality measures are unusually high for this data set. In contrast to Fig. 9, we operate on unfiltered data for SSIM and Jaccard. We still had to remove infinite values of PSNR for a meaningful analysis. The sole method where data was individually filtered in the above manner is marked with ¶. The closer the mean is to the ground truth, the better.
Displaced caption fragment (abbreviations, beginning truncated): ... for Rigid-SURF, "D.-SURF" for Deform-SURF, "local" for only local distortions, "Gr. tr. n." for "ground truth, normalized", "Gr. tr. 16" for "ground truth, 16 bit".
Figure 11 (displaced caption; panel (f): LS, PSNR, full violin plot): More detailed plots of LS quality measures. Abbreviations are the same as above. "Truncated" means that not the full range of the measure was plotted. In Jaccard, "filtered extreme values" means that numerous zero and 1.0 values were removed; they can be attributed to missing tissue or a completely full region, where no meaningful comparison can be made at the current threshold. We also show some violin plots where those might bring additional insight through their shape. Concerning the "Rigid-SURF" method, its quality was much lower. As this figure aims to provide a more detailed view, we scale the box plots to better visualize the differences between the other methods. Fig. 10 as well as the violin plot (f) show the full picture. As before, "R.-SIFT" stands for Rigid-SIFT, "R.-SURF" for Rigid-SURF, "D.-SURF" for Deform-SURF, "local" for only local distortions, "Gr. tr. n." for "ground truth, normalized", "Gr. tr. 16" for "ground truth, 16 bit".
Table 3 (displaced caption): We compare the means from each of the registration results to the mean for the normalized, 8 bit ground truth. The differences are statistically significant in almost all cases; we show the p value for "not equal". The hypothesis of equal means was almost always refuted with high confidence. The maximum of the normalized ground truth for SSIM was rather low. For Jaccard, "Deform-SURF" reaches p < 0.168, and for PSNR the same method reaches p < 0.7875. For the latter, it matches the mean of the ground truth up to 0.186 503 4 %; the 95 % confidence interval is −0.434 535 2 to 0.572 834 2. The mean value marked with * is the one closest to the normalized ground truth per t-test. We also show the maximal value of a measure (the "Max" column). The maximum values marked with † are larger than the maximum of the normalized ground truth. Notice that the quality measures are unusually high for this data set, similar to Fig. 10.
We can thus conclude that the above "good" registration methods over-register. Looking into Fig. 11a, we conclude that "GS", "Deform-SURF", and, less so, Elastix produce much higher SSIM values than they should in order to be similar to the normalized ground truth. The variance in the results of those registration methods is also lower than in the normalized ground truth. Panel (b) shows in a violin plot how the distribution of SSIM values changes between the methods. We see quite high values for the Jaccard measure in Fig. 10b, but also a lot of outliers in almost all methods. To study those further, we present a zoomed-in version in Fig. 11c. Even more interesting is panel 11d. There we have removed the values 0 and 1.0 from the evaluation. Basically, those extreme Jaccard values mean that either no correspondence at all or full correspondence was found; the latter can be caused by a too low threshold value or too little detail in the particular region. In both panels we see a superior performance of the "Blending" method compared to all other registrations. We notice also that our local distortions do not change the Jaccard index very much. The statistical analysis (below) does not support the superiority of "Blending", however.
As in the other modalities, the PSNR of the 16 bit ground truth is much larger than that of all other methods, which utilize 8 bit images (Fig. 10c). In the zoomed-in version in Fig. 11e we see that for PSNR the median of no registration method exceeds the median of the ground truth. This means that PSNR detects no over-registration in this case. The somewhat peculiar, uneven shapes of the PSNR distributions are visualized as violin plots in panel (f). Now, consider Table 3. Statistically, no registration method matches the mean of the SSIM of the normalized ground truth well. Most of the methods (namely, "Deform-SURF", "GS", Elastix) are well above, and all registrations also overshoot the maximum of the ground truth SSIM. In Jaccard, we have many 1.0 values (which were not removed in this case, as we want to contrast those statistics with the box plots). We see a match in the means with p < 0.168 in "Deform-SURF"; however, we would be quite cautious in this case because of some "invalid", too low or too high Jaccard values. With PSNR, "Deform-SURF" manages to match the mean with p < 0.7875. Some methods (e.g., "GS", Elastix) have even larger mean values of PSNR. We would deem those methods better than "Deform-SURF" in this case if we did not have the ground truth. We have to note, however, that all registrations overshoot the maximum of the PSNR in the ground truth, as evident from Table 3c. A repeated look at Fig. 11, panels (e), (f), shows that the distributions of the PSNR values for "GS" and "Deform-SURF" are much more similar to each other than to that of the normalized ground truth.

The results and the "banana problem"
We have mostly discussed the results in the previous section; still, there is an issue we would like to specially highlight. We have quite often seen that methods which produce better image-based metrics also seem to over-register the series. It would be very hard to find such an over-registration (a "banana problem") without a ground truth. CT scans of the specimens before sectioning might help, but, as mentioned above, they lack the resolution. Basically, our ground truth bounds from above the amount of correspondence between consecutive images. Challenges such as ours would help to develop better registrations that try to reach such a bound, but not to overstep it.
Occasionally, we have found the values of our quality measures for a particular registration method higher than for the ground truth. How is this possible? Our reasoning is that the ground truth does not constitute a perfect correspondence of the consecutive input images. It is merely their real correspondence drawn from inherently 3D data. This means that those registrations might create too much correspondence; they over-register their inputs. We have also seen an interesting effect where the correspondences after a registration were more heterogeneous than in the ground truth. It appears to us that some areas were over-registered and some areas were under-registered. Again, without a ground truth such observations would be impossible.
In other words, to obtain good values overall or on average (while high variance cannot be reduced), some registrations seem to attempt to shift all of the correspondences between the images towards their maxima. Naturally, this behavior forces the maximal values to exceed the maximum of the ground truth while the mean is still under the mean of the ground truth. This leads to the "banana problem".

Distortion mechanism
Our distortions "stretch" and "shrink" the images, but their positions are organized in a rectilinear grid, even if the distortions themselves are random and Rayleigh-distributed. It would also make sense to choose the distortion positions randomly, too. However, we opted for a grid for a better reproducibility of the appearance of the test images: a human would know where to look. Still, the actual distorted images do not show that much "bad" regularity, the above grid is not visible, so we argue that no problems arise from such a rectilinear grid placement. If our method is used to benchmark registrations using neural networks with a direct assignment of neurons to either pixels or distortions, the randomization of the distortion positions might be required.
We use a rigid transformation to model the inaccuracy in section placement. If a fully affine or an even more generic transform is needed, our code can easily be extended to incorporate it. Indeed, some registration methods (e.g., Cardona et al., 2012; Xu et al., 2015) use affine global transformations to model wedge-shaped sections.
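A sketch of such a global rigid transform follows, using scikit-image; the angle and offset ranges are assumed values for illustration, not our actual parameters.

```python
# Minimal sketch: a random global rigid (Euclidean) transform that mimics
# inexact placement of a section on the glass slide. Angle and offset
# ranges are illustrative assumptions.
import numpy as np
from skimage import data
from skimage.transform import EuclideanTransform, warp

rng = np.random.default_rng(1)
image = data.camera().astype(float)

angle = rng.uniform(-np.pi / 12, np.pi / 12)   # up to +-15 degrees
offset = rng.uniform(-20, 20, size=2)          # pixels
tform = EuclideanTransform(rotation=angle, translation=offset)

# warp() expects the inverse mapping (output -> input coordinates).
placed = warp(image, tform.inverse, preserve_range=True, order=1)
```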
To contrast phantoms with our approach: the argument about not fully representing the distortions might also hold for our distortion generation, but we still work with real data. Hence, the reasoning about lacking data complexity does not hold. This issue is especially prominent in methods based on feature detection.

Further ideas for the distortion modeling
Our method currently does not directly account for tissue folding and tearing. Such damaged areas can be represented with "holes" in the images, but they currently would not correlate with larger distortions in the connected areas. A realistic modeling of section damage was not our current goal. Methods to bridge section damage exist nowadays (Lobachev, 2020). Basically, any kind of damage can be addressed with masks for the repair, similar to the masks we use here to simulate the damage. Some further recent works address cracks and discontinuities (Aggrawal et al., 2020; Ng and Ebrahimi, 2020).
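As an illustration of mask-based damage simulation, the following sketch punches a circular "hole" into a slice; the hole position and size are arbitrary assumptions.

```python
# Minimal sketch: simulate section damage by masking out a region ("hole").
# The same mask could later drive a repair step. Position and radius are
# arbitrary illustrative values.
import numpy as np
from skimage import data
from skimage.draw import disk

image = data.camera().astype(float)
mask = np.zeros(image.shape, dtype=bool)
rr, cc = disk((300, 260), 40, shape=image.shape)  # center (row, col), radius
mask[rr, cc] = True

damaged = image.copy()
damaged[mask] = 0.0   # "hole" in the tissue image
```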
A convolutional neural network transforming "clean" images into damaged ones with some kind of style transfer (Gatys et al., 2016; Shaban et al., 2019; Liang et al., 2020; Khan et al., 2020) is an interesting idea. We sought a functional and well-defined image deformation that allowed for using the transformed images as a benchmark input. Neural-network-generated images might have drawbacks, unnoticeable to humans, that would obscure and throw off track some other (probably also deep-learning-based) registrations; the detection of adversarial examples is a separate problem. Our test images are produced through simple, robust, reproducible, and well-understood image transformations.

Impact of a 3D series
We use full-blown, inherently 3D images as a series of 2D images. Those 2D inputs are used for our benchmark for a reason: some registrations do not operate on image pairs, but optimize the spatial positions of the full image stack at once.
Next, the presence of already three-dimensional images as the starting point enables us to state how the final image stack should look. We digitally simulate the distortions from sectioning and further section handling. Then we apply a registration method (we evaluate multiple of them in this work). The discrepancy between the images of the distorted series is larger than in the registered series; this is the whole idea of registration. But, contrary to the usual 2D registrations of serial sections, we still have the initial starting point, the 3D images. They are in the same resolution as the registered series. We call those initial images the ground truth. We can compare the registered series to the ground truth and find out what was wrong with the registration method in question.

Challenges in project execution
It was quite hard for some methods to cope with large angles in rigid transformations; in those cases we used SURF-based rigid registrations as an initial phase. Many registrations are inherently tuned for the most common kinds of input data and modalities. Adaptation to further images is possible, but requires more or less tuning. In the best case, the tuning can be accomplished with parameter files, as with Elastix.
The present project was quite large. Decent automation of the workflow (we used Python, bash, and GNU Make) was key for fast and error-free processing. This was of special importance in the case of the evaluations.

Evaluations
The different normalization issues, especially in the EM data set, might also explain the observed variations in the measurements. This is a typical trade-off: a better normalization allows for a better registration, but a normalization also changes the data set, so a direct comparison with non-normalized data might be harder.
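To make the trade-off concrete, a minimal percentile-based normalization sketch follows; the chosen percentiles are an assumption, not the normalization actually used for our data sets.

```python
# Minimal sketch: percentile-based intensity normalization of a slice.
# Normalization eases registration, but it also alters the data, so the
# normalized stack is no longer directly comparable to the raw one.
import numpy as np
from skimage.exposure import rescale_intensity

def normalize_slice(img, low_pct=1.0, high_pct=99.0):
    lo, hi = np.percentile(img, [low_pct, high_pct])
    return rescale_intensity(img.astype(float), in_range=(lo, hi), out_range=(0.0, 1.0))
```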
The visualizations of optical flow we used as one of the measures are, on the one hand, a valuable tool. Those visualizations show issues less visible otherwise, the "hot spots". On the other hand, a direct comparison of such visualizations with each other in their present form might be misleading because of the individual magnitudes.
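A minimal sketch of such a flow-based "hot spot" map between two consecutive images follows, assuming the TV-L1 optical flow from scikit-image rather than the exact flow implementation used for our figures.

```python
# Minimal sketch: optical flow between two consecutive slices and its
# magnitude as a "hot spot" map. We assume scikit-image's TV-L1 flow;
# the actual flow method behind our visualizations may differ.
import numpy as np
from skimage.registration import optical_flow_tvl1

def flow_magnitude(slice_a, slice_b):
    v, u = optical_flow_tvl1(slice_a, slice_b)  # per-pixel displacement (rows, cols)
    return np.hypot(u, v)                       # large values correspond to "hot spots"
```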
One of the ideas for further improvement of our work is to compare not the images, but the deformation fields from the various methods. However, multiple implementation questions would arise. One of the issues is the registration "drift" that would differ between methods. Currently, our challenge is completely open: the participants need to obtain the input images and produce the result images; the registration method itself does not need to be adapted or changed, it can remain closed-source or even a commercial secret. If we wanted to compare the deformation fields, we would need to provide a consistent way to output comparable deformations across all the implementations, libraries, and programming languages the participants use. The code of the actual method would need to be changed, which means it should be available and human resources for the change need to be allocated.

Continuous integration
Our visualizations and measures are computed automatically. No human interaction whatsoever is needed: the distortions of the ground truth, the registrations, and the evaluations can happen fully automatically. Thus, our evaluation method is a gateway to wide-scale registration challenges and to regression testing of registration methods under development. Our method can be applied as part of a continuous integration workflow (Meyer, 2014). With our approach, better and more thorough regression testing of registrations becomes possible.
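As a sketch of such an automated regression check, the following compares consecutive slices of a stack with PSNR, SSIM, and a thresholded Jaccard index; the stack layout, the global threshold, and the function names are assumptions for this sketch.

```python
# Minimal sketch of an automated (CI-style) evaluation: compute PSNR, SSIM,
# and a thresholded Jaccard index over consecutive image pairs of a stack.
# Stack layout, the global threshold, and the summary format are assumptions.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def jaccard(a, b, threshold=100):
    ma, mb = a > threshold, b > threshold
    union = np.logical_or(ma, mb).sum()
    return np.logical_and(ma, mb).sum() / union if union else 1.0

def evaluate_stack(stack):
    """stack: ndarray (n_slices, height, width) of 8 bit images."""
    rows = []
    for a, b in zip(stack[:-1], stack[1:]):
        rows.append((jaccard(a, b),
                     peak_signal_noise_ratio(a, b, data_range=255),
                     structural_similarity(a, b, data_range=255)))
    return np.array(rows)  # one (Jaccard, PSNR, SSIM) triple per pair

# In a CI job one would compare, e.g., np.median(evaluate_stack(registered), axis=0)
# against the same statistic of the ground truth stack and fail on regressions.
```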
Basically, this paper shows a further path towards automatically testing different regularizations in image registrations. The goal would be to reduce the magnitude of the banana problem, while still maintaining good registration results. Such testing can be done with image-based metrics, as we do here, but any other metric would work too.

Conclusions
We introduce an approach to generate individual 2D distortions applied to existing 3D medical data. Those distortions have the basic statistical properties of the cutting-induced variations in serial sections. The distortions are computed in a straightforward, understandable, and reproducible manner. We also apply a global rigid transform to mimic the inexact placement of a section on a glass slide. Modeling damaged sections is also possible. Combined, we can simulate the transition from a tissue block to a set of serial sections. Thus, we are able to test registration methods on such simulated data originating from real tissues.
We provide an overview of the existing registration and evaluation efforts. To our knowledge, the approach we pursue in this work has not been undertaken previously.
The key contribution of this work is the utilization of the original, undistorted data for the evaluation. Previously, it was impossible to evaluate registrations of serial sections against a ground truth, as the tissue block is destroyed by sectioning. Micro-CT scans currently do not have sufficient resolution and can serve only as a coarse guide in a co-registration of micro-CT to real serial sections. Phantoms and synthetic images might not have the needed complexity. We use real, micrometer-scale lung images from other modalities in this work. By using real images of animal lungs we ensure that the kind of images used is comparable to real serial sections of the same tissue. Our method is, however, directly applicable to other organs or generic images (Figs. 1, 2).
In this work we evaluate six registration methods on three distorted data sets. In each of them, a ground truth is present. With the ground truth, we can compare the registrations not only with each other, but also with the inherent 3D data, in other words: with the original data, with how the 2D "slices" should have been aligned if no cutting had taken place. We address the quality of registrations with four visualizations of image metrics and three image-based quality measures. In this work, we both look at a specific image pair (in the supplementary material) and provide an evaluation over the full range of the series. Our method can be applied in a continuous integration workflow.
We make the source code and the data sets publicly available. Further contributions, both in form of additional registration results and further data sets, are welcome.

Future work
Utilization of our method, statistical analysis of real sections (similar to Schormann et al., 1995), and the introduction of better quality measures may lead to better registrations, both with and without deep learning. We definitely look forward to more comparisons of registration methods.
It would be very interesting to find a way to compare the registration result to the ground truth directly. (In this work we compare consecutive images within each of the results and evaluate the resulting measures.) Presently, the accumulation of registration "drift" and some global offsets make such a well-founded assessment more complicated than our present evaluation.

For the mask-based measures we compared the following ways of computing the binary masks:
• Otsu's method (Otsu, 1979);
• Otsu's method on blurred images;
• Matlab's activecontour function at 300 iterations, starting from a binary thresholded image.
To give an example, consider Fig. 15, which shows a region from the LS data set. The masks, such as (b), were computed using the activecontour function in Matlab with the threshold 150. We see that the Dice measure in (d) is roughly midway between (c) and (e). We cannot hope to reach values similar to (e) with the method used, as no non-rigid alignment happens in this case. Thus, the result of the rigid alignment is acceptable: it manages to align the sections rigidly, despite the additional non-rigid distortions induced by our generation of the input data. Figure 16 shows the same region as Fig. 5, the Rigid-SIFT method (d). Here, the same image pair is evaluated using different thresholding methods. We conclude that Otsu's method on blurred images produces the best visualizations, although little difference is seen compared to Otsu without blur. For our Dice measure visualizations in the main text we use Otsu's method on blurred images. For the Jaccard evaluation in the main text, we use global thresholding for its consistency. One of the reasons for these decisions: our input images are individually normalized (see Section Appendix F). Even if they are normalized back to a "common denominator", some differences may remain. We also use the input images (with all their discrepancies) in the evaluation.
Notice that the meaning of the colors white and black differs between Figures 15 and 16. We argue that using white for the overlap is better.
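A minimal sketch of the Otsu-on-blurred-images masking and the resulting Dice coefficient follows; the blur sigma is an assumed value for illustration.

```python
# Minimal sketch: binary masks via Otsu's threshold on blurred images and
# the Dice coefficient between two such masks. The blur sigma is an
# illustrative assumption.
import numpy as np
from skimage.filters import gaussian, threshold_otsu

def otsu_mask(img, sigma=2.0):
    blurred = gaussian(img.astype(float), sigma=sigma)
    return blurred > threshold_otsu(blurred)

def dice(mask_a, mask_b):
    inter = np.logical_and(mask_a, mask_b).sum()
    total = mask_a.sum() + mask_b.sum()
    return 2.0 * inter / total if total else 1.0
```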
We use two consecutive undistorted images from the middle of each series as a ground truth for the evaluation of multiple registration methods. We also use two consecutive images at the same position from the registrations' results. (An evaluation of the full series is in the main text.) The input to the registrations is the complete series after the application of the distortions presented in this paper; both local distortions and rigid transformations were applied. Two consecutive result images from the middle of the registered series are subjected to the quality measures Jaccard, PSNR, and SSIM, as outlined in the main text. We also utilize two consecutive locally distorted images to produce comparison values. In the image-based evaluation we show a 500 × 500 pixel crop from the image center to highlight details. The tables below show the Jaccard, PSNR, and SSIM values for the ground truth (undistorted images), as well as for the locally deformed images. In part, the values for the normalized ground truth images are also given, to give an idea of the impact of intensity variations on the ground truth measures. The same intensity variations as in the "normalized ground truths" were also present in the registrations' inputs. Hence, the effect of the non-rigid registrations can be read as the improvement between "locally distorted" and "normalized ground truths"; but there was also a global rigid transformation, and undoing it with a rigid registration might have led to a worse starting point for the non-rigid registration than "locally distorted".
Table (CT data set; columns: Method, Jaccard, PSNR, SSIM; cf. Figs. 17, 18): In the Jaccard measure, we used a global threshold of 100. Low quality values in blending transforms can be explained by residual shifts. Notice the differences between the rigid-only and non-rigid methods. Notice also the lower values in the ground truth when compared to Elastix. The "Locally distorted" line shows the local distortions only, without global rigid transforms. There was some larger movement (as seen in Fig. 18), and this fact affects the measures. The registrations' inputs, however, were also rigidly transformed beforehand. The larger the values, the better, up to the ground truth value.
Table (EM data set; columns: Method, Jaccard, PSNR, SSIM; cf. Figs. 19, 20): In the Jaccard measure, we use global thresholds of 100 and 150, as indicated with "J 100" and "J 150". "Locally distorted" means images resulting from our method applied to the normalized ground truth for local distortions, but no global rigid transformations were used in this case. The actual registrations operate on images that were also rigidly transformed. Notice the differences between the rigid-only and non-rigid methods. Notice also the differences between both rigid versions, as well as the impact of normalization and 8 bit discretization on the ground truth values. The larger the values, the better, up to the ground truth value.
Table (LS data set; columns: Method, Jaccard, PSNR, SSIM; cf. Figs. 21, 22): In the Jaccard measure, we used a global threshold of 100. Low values in "Blending" and in the rigid-only methods can be explained by residual shifts. Notice the improvements in the non-rigid methods. The shorthand s stands for "stretch". The larger the values, the better, up to the ground truth value.
Figure 22 (caption): Evaluation of the LS data set with image measures, part 2, mostly variations over "Deform-SURF", abbreviated here as "D.-SURF". We differentiate the "stretch" parameter of the non-rigid phase. The default value is the factor 5 · 10⁻³ of the image size. We also test "Deform-SURF" with the values 10⁻³ and 10⁻². The ground truth is normalized. Columns: image, Dice, PSNR, SSIM, optical flow. Continued from Fig. 21. All scale bars are 500 µm.

Appendix D.1. CT data set
Consider Fig. 17. Both rigid-only methods leave some incongruences; these are then greatly reduced by "Deform-SURF" (Lobachev et al., 2017b). "GS" (Gaffling et al., 2015) also works on rigidly pre-registered input, but produces a very different image of residual movements in the optical flow visualization. It also reaches very good values with respect to the other measures, e.g., SSIM. One possible explanation for the differences in the Elastix-based registration (Fig. 18) might be the varying thresholding methods. "Blending" has some residual movement, as seen in PSNR and the optical flow visualization. The goal of "Blending" was a coarse alignment, hence the lower quality measures. Notice, however, that it copes with the random global transformations on its own, without the initial "Rigid-SURF" stage. In "local only" we see some global movement, which is a residue of a local distortion at a larger scale. Not surprisingly, there are still some residues of it left in the rigid methods, as discussed above. Consider the ground truth as the "ideal", fully corresponding image pair. A small signal in PSNR means little change; it is also visible in the visualizations as less red. SSIM and optical flow visualizations show that these differences are local and evenly distributed across the region. In registered images there are typically more differences, and those are also more concentrated in a region. Such concentrated "change" means stronger local transformations.
The visualized differences in the ground truth appear to be less than in most of the registration results (all in Fig. 17), e.g., in the optical flow visualizations of the ground truth, "GS", and "Deform-SURF".
As for Elastix (Klein et al., 2010; Shamonin et al., 2014), there are fewer differences between two consecutive images than in the ground truth. The SSIM and optical flow visualizations of the Elastix results show the differing parts to be more concentrated in certain areas (e.g., the bottom part and the first third of the crop) when compared to the same areas of the ground truth. Such behavior hints at overfitting. In the ground truth the "change" is more evenly distributed across the image. (We can easily make statements about the localization of the movement or local differences with the visualization of optical flow. Optical flow highlights the movement across consecutive images. However, in its present form, it is rather hard to make statements about the magnitude of the movement across multiple visualizations. In this work we need to rely on other measures for a comparable assessment.) If we look at the numerical values, we observe that all three measures are lower for the ground truth than for Elastix. The verdict appears to be that Elastix (with our parameter file) over-registers the CT data set. (The over-registration or over-fitting of the registration is basically the "banana problem": too much correspondence is created.) However, there is still some noise in the data that is irrelevant for real-world tasks. The noise, however, still contributes to the measures and is also a subject of the registration, as we did not use a mask. Such "unnecessary" alignments of the noise might explain the too large values for Elastix. Still, the GS method is remarkable in how close it gets to the values of the ground truth quality measures.

Appendix D.2. EM data set
We applied a prior individual normalization in the EM data set (Buchacker et al., 2019); hence it might be less fair to compare the registered images to the original data, which additionally has 16 bit depth. Therefore, we also show the normalized ground truth images.
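For a fairer comparison, the 16 bit ground truth can be rescaled and discretized to 8 bit; a minimal sketch follows, where the min/max rescaling is an assumption and may differ from the normalization actually applied to the data set.

```python
# Minimal sketch: rescale a 16 bit ground truth slice and discretize it to
# 8 bit, so that the measures become comparable to the 8 bit registration
# results. The min/max rescaling is an assumption about the normalization.
import numpy as np
from skimage.exposure import rescale_intensity
from skimage.util import img_as_ubyte

def to_8bit(slice16):
    rescaled = rescale_intensity(slice16.astype(float), out_range=(0.0, 1.0))
    return img_as_ubyte(rescaled)  # 8 bit discretization
```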
The differences between the rigidly registered images with SIFT and SURF in Fig. 19 appear to be the effect of different normalizations: we were attempting to undo the individual normalization of the images in the inputs of the "Rigid-SURF" method. Again, we see residual incongruences with the rigid-only methods. Both "Deform-SURF" (Fig. 19) and Elastix (Fig. 20) have multiple small "movements" in the optical flow visualization, as opposed to a more or less uniform color of a global shift. "Blending" (Fig. 19) shows a somewhat uncompensated distortion in the middle of the image in the Dice, SSIM, and optical flow visualizations. It appears that in this method the effect of our local distortion (Fig. 20) has not been corrected completely, but the random global transformation was undone well.
From the visualizations, "GS" (Fig. 19) and the normalized ground truth have definitely less movement than, e.g., "Deform-SURF". The movement in "GS" seems more evenly distributed; the normalized ground truth has a "hot spot" in the optical flow visualization. Curiously, the original ground truth has both more details (because it is 16 bit data that has not been clamped down) and more detectable movement (compare the flows and SSIM of the normalized and original ground truths). At the same time, PSNR is higher for the original image pair.
Numerically, Elastix has the best Jaccard measures (both at the 100 and 150 thresholds, indicated with a postfix in the following), PSNR, and SSIM among all registrations. Especially with Jaccard 100, "Deform-SURF" is a very close follow-up, at 99.6 % of the performance of Elastix. Now, if we look at the measures' values of the objective ground truths (the luxury we did not have before in an evaluation of registrations of serial sections), an issue is apparent: both the Jaccard 100 and Jaccard 150 values for Elastix and (less so) for "Deform-SURF" are higher than for the normalized ground truth! This might indicate an over-optimization by those registrations.
To give some values, the Jaccard 150 measure for Elastix is 0.748 % higher than for the normalized ground truth; the same method also has a 7.34 % higher PSNR than the normalized ground truth; "Deform-SURF" reaches 95.7 % of the normalized ground truth SSIM. Notably, from the visualizations we would deem "GS" a very good registration, comparable to, if not beating, Elastix. But numerically, Elastix has much higher values.

Appendix D.3. LS data set
In the LS data set the individual normalization was also used for the challenge data; in the methods with SURF we also aimed to fix those discrepancies that we created in the challenge data, to ensure a better rigid registration. This explains the visual differences between the results. As we compare images from each of the series to each other and not across the series, this problem is less of an issue. Still, it highlights the importance of input normalization for good registration results. We also used this data set to study the effect of different distortion magnitudes in "Deform-SURF".
In Figure 21 we see a larger movement in the rigid method. It originates from our local distortions in the region of interest ("Local dist."), but cannot be undone with rigid methods alone. "Blending" still suffers from these distortions, but restores the coarse correspondence well. "GS" recovers very well from those local movements; however, it might over-register, based on a visual comparison with the normalized ground truth. Both in the optical flow visualizations for "GS" and for the ground truth we see multiple small "movements", originating from the fact that we compare two consecutive, non-equal images. However, "GS" shows less movement than the ground truth, hence our suspicion above. Numerical values provide more clarity; we will look at them next. As for Elastix, the Dice, PSNR, and SSIM visualizations look very good, similar to "GS". For Elastix, the visualization of optical flow appears in general to show less movement than the ground truth, but it also shows a "hot spot" at the top part of the image. However, as already mentioned, we should be careful with such comparisons between optical flow visualizations.
Figure 22 shows quite convincing results from "Deform-SURF", but "Rigid-SURF" is even worse than "Rigid-SIFT" (Fig. 21). We use three different values for the non-linear "stretch" in the non-rigid feature-based registration method "Deform-SURF". The "stretch" values we state are a factor of the image size in pixels, limiting the magnitude of the "movement" in the non-linear registration. When the deformation magnitude is apparently too small (10⁻³), the SSIM (and, less so, PSNR) visualizations are slightly worse, and the optical flow shows a problematic "hot spot" near the center of the image. The default magnitude of 5 · 10⁻³ is already much better and does not have the aforementioned problem. An even larger magnitude of 10⁻² also looks similar to the standard setting. Whether those two parameter settings of "Deform-SURF" over-register can be determined from the numerical values below. For now, we can say that the normalized ground truth is slightly better in PSNR, Dice, and SSIM, and appears more "noisy", yet also more "even", than "Deform-SURF". We attribute the variations in the intensity of the registered images in different methods to the normalization.
Table C.6 shows the numerical values. The Jaccard measure is computed with the threshold 100. "Rigid-SIFT" was very successful with respect to the Jaccard measure, but its PSNR and SSIM are lower than in a typical non-rigid registration. In contrast, "Rigid-SURF" is quite bad with respect to all three measures. Still, it manages to be a good input for the non-rigid methods "Deform-SURF", GS, and Elastix. Comparing different "stretch" values of "Deform-SURF" numerically, we see that the default value of 5 · 10⁻³ is slightly better than the larger value. A too small "stretch" decreases the measures slightly, but is still better than all rigid methods, for PSNR and SSIM with a larger margin. The "Blending" method has some problems, also evident in the visualizations. We discussed this issue above.
The methods "GS" and Elastix are very similar with respect to Jaccard measure. Interestingly, PSNR is slightly higher with Elastix, but SSIM is higher with "GS". Both "GS" and Elastix have better numerical values than all "Deform-SURF" methods tested here-but let us consider the ground truth! When compared to the (normalized) ground truth, Jaccard and PSNR measures in all registrations are lower. Still, the Jaccard measures for "GS" and Elastix are closer to the ground truth than in other methods. However, the SSIM value for "Deform-SURF" is less than 2 % larger than the actual SSIM for the normalized ground truth image pair. Still, "GS" and Elastix have even larger SSIM values than the ground truth: 2.31 % and 2.80 % correspondingly. Such larger SSIM values might signal over-registration.

Appendix E. Visual comparisons for LS
This section presents selected volume renderings of the full stacks. While 3D representations typically convey more information, a 3D overview can show only the coarsest incongruences. We demonstrate the volume renderings here, but base our actual evaluation on a sequence of objective measures, as presented above and also in the main paper.
Our volume renderings were produced with ImageVis3D (Fogal and Krüger, 2010). Fig. 23 shows frontal views of the LS data set. We used the same standard settings for all the images.