Whole Slide Image Quality in Digital Pathology: Review and Perspectives

With the advent of whole slide image (WSI) scanners, pathology is undergoing a digital revolution. Simultaneously, with the development of image analysis algorithms based on artificial intelligence tools, the application of computerized WSI analysis can now be expected. However, transferring such tools into clinical practice is very challenging as they must deal with many artifacts that can occur during sample preparation and digitization. Therefore, the quality of WSIs is of prime importance, and we propose a review of the state-of-the-art of computational approaches for quality control. In particular, we focus on WSI quality issues related to the presence of sample preparation artifacts, compression artifacts, color variations, and out-of-focus areas. An analysis of the monthly WSI clinical routine in a cytological laboratory confirms the importance of implementing quality control measures. Given this observation, we draw perspectives on how a computational quality process can be included in a computational pathology diagnosis pipeline.


I. INTRODUCTION
Digital Pathology (DP) is the crossroads between the pathology world and the digitization revolution. Pathology, from the Greek pathos (πάθος, ''suffering/experience'') and logia (-λογία, ''study of''), is a medical science that involves the study and diagnosis of disease through the examination of surgically removed organs, tissues, and bodily fluids. Because each pathology sub-domain (histology, cytology, etc.) requires specific expertise to analyze dedicated samples, DP was first seen as a way to compensate for the local shortage of health specialists. As stated at the First International Conference on Image Management and Communication in Patient Care [1]: ''static and dynamic imaging in pathology represent ways to address the problem of maldistribution of specialty pathology services and to provide primary pathology diagnostic services in rural areas. These technologies are also valuable for consultation and educational programs. Because of the high information density in pathology specimens as compared with radiograms, image storage and image transport in pathology represent special challenges to video and communications engineers.'' The first wave of changes affecting pathology began with the development of the Internet and the appearance of affordable scanners to digitize pathological samples. Whole slide images (WSIs) are digital objects designed to be displayed and shared through networks; see Preston et al. [2] for a vision from the 1980s. WSIs allow the screening of optical data at different magnifications and are described by a family of file formats that ease the automation of specific tasks (locating regions of interest, labeling a slide, etc.). The second wave, which established DP as a legitimate domain, was the increase in available annotated data in the early 2000s. Digital pixel data enriched with expert labels allowed DP to benefit from machine learning (ML) tools and deep learning (DL) by approximately 2010. The current definition of DP reflects these successive evolutions, as stated in ''Introduction to Digital Pathology and Computer-Aided Pathology'' [3]: ''DP, which initially delineated the process of digitizing WSIs using advanced slide scanning technology, is now a generic term that includes Artificial Intelligence (AI)-based approaches for detection, segmentation, diagnosis, and analysis of digitalized images''. The main goal of DP, through these different periods, has been to investigate and study a digital object with the same diagnostic properties as the original biological sample. DP was first used to ease the delivery of diagnoses by connecting specialists. It then evolved toward computer-aided diagnosis (CAD) to provide experts with additional information.
The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In either case, adding the digitization step to the standard pathology pipeline involves certain requirements for the subsequent stages. Is the quality requirement to diagnose a slide the same for an AI as for a human expert? Does digitization embed sufficient information to obtain an equivalent diagnosis for both a sample and its digital representation? What are the requirements to obtain acceptable digitization of a slide, and are these requirements the same for a human expert and an AI?
From a pragmatic standpoint, a laboratory entering the DP realm should expect an additional Quality Control (QC) step to ensure that each slide and its digital clone are well prepared. Janowczyk et al. [4] highlighted the need for QC in laboratories moving to DP: ''Manual review of glass and digital slides is laborious, qualitative, and subject to intra- and inter-reader variability. Therefore, there is a critical need for a reproducible automated approach to precisely localize artifacts to identify slides that need to be reproduced or regions that should be avoided during computational analysis''. Thus, when considering the issue of quality in DP, two main related topics emerge:
• QC in laboratories: computer-aided verification of failures in the digital pipeline to check whether each WSI meets the diagnostic requirements of its domain (histology, cytology, etc.) and of the laboratory (specific slide preparation machines and scanner).
• Quality of a representative corpus from a CAD perspective: this ranges from the versatility of learned models for machine learning engines to the ability to process and detect out-of-scope or ''never seen before'' manifestations (rare visual anomalies induced by any prior step of the laboratory pipeline).
Figure 1 depicts the trends in DP scientific publications that involve WSI processing. The same rise in publications can be observed for generic DP papers, WSI-related papers, and papers on QC in DP, highlighting the interest in this emerging field. It should be noted that pathology subdomains do not receive the same attention, with many more studies focusing on histology than on cytology. In addition, there is some confusion between pathology and histology: many DP articles use the term pathology instead of histology even when all the considered examples come from this subdomain.
Moreover, even within specific categories, the level of interest differs: focus detection and color adaptation are the two main tasks in which QC is involved.
This study focuses on QC issues in DP, driven by a laboratory point of view: how to guarantee the quality of WSIs or associated labeling information, how to detect a failure in a pipeline when analyzing a WSI, which decision to take when a quality dysfunction is detected, etc. are some of the questions we address here.
In the next sections, we go through the different stages of DP, from sampling to diagnosis. We elaborate on some technical aspects addressed in the literature, from focus, blur, or artifact detection to color deviation. We end this survey with a review of the quality issues observed in a one-month stream of WSIs in a DP laboratory, to confront the ideas presented earlier with reality. The last sections provide perspectives and a conclusion.

II. DIGITAL PATHOLOGY PIPELINES
To understand the challenges in DP, we begin with a description of the processes involved in pathology laboratories, with a digital accent, to understand the impact of QC at different stages.

A. LABORATORY POINT OF VIEW
To obtain the accreditation that allows them to operate, laboratories must follow guidelines certified by a dedicated consortium (European cooperation for Accreditation, International Laboratory Accreditation Cooperation, International Accreditation Forum, etc.). Accreditation bodies, relying on standards (e.g., NF EN ISO 15189) with specifications that depend on the country where they are applied, guarantee the technical and diagnostic reliability of the pathology examinations performed in a laboratory. These guidelines include pre-analytic, analytic, and post-analytic risk management and require laboratories to analyze each step of their processes according to: what has already happened (retrospective analysis), what is currently occurring (ongoing quality measures, user feedback), what could happen (risk analysis), and what has happened in other laboratories (bibliographical and clinical studies). As one example, here are some requirements for medical Pathology Services from the Australian Department of Health [5]: SB8.4 Acceptable test performance must be confirmed by the ongoing use of internal quality control material; SB8.8 Uncertainty measurement must be estimated for each test procedure where relevant and possible; SB8.9 The Medical Pathology Service must have evidence that its uncertainties measurements meet clinical requirements.
From sampling to diagnosis, the biological material goes through many different and entangled stages; the earlier stages prepare the material for the later ones, until an expert diagnoses the sample. These stages differ according to the pathological sub-domain involved. The pipelines are nevertheless broadly the same in histology and cytology, with samples collected, registered in the laboratory management system, prepared, deposited on a slide, stained, preserved, digitized, and analyzed. Significant differences do appear in sample preparation, however, and can lead to various quality and performance issues (discussed later in Section III).
Laboratories that have completed their digital transition can screen slides on monitors and be helped by various artificial intelligence assistance tools. Therefore, precautions should be taken by the DP laboratory. Indeed, the digitization, storage, and computation steps have minimal requirements to operate as intended and these should be controlled. Digitization is located at the end of the pipeline, and its outputs (pixel data in the WSI and other meta-data information) are directly used for the diagnosis of an AI or a pathologist. We describe the main steps in various DP pipelines to highlight the potential quality issues that impact diagnosis.

B. FROM SAMPLE TO SLIDE
Biological samples to be examined by pathologists must follow a specific sample preparation pipeline because of the requirements of brightfield microscopy. In particular, the requirements for a sample to be examined under the microscope are as follows: i) the sample is well preserved, ii) the sample is transparent so that light can pass through it, iii) the sample is thin so that a single layer of cells is present and iv) some components of the sample can be distinguished by different colors. Because of these requirements, the preparation pipelines differ in histology and cytology. In addition, these differences lead to specific quality issues according to the specificity of the step in which they occur. Figure 2 depicts the typical issues in each preparation step for both histology and cytology. Some errors due to the failure of the information system management are included; they have been addressed in the early ages of DP. Aspects that alter the final diagnosis are the main topics discussed here.
In histology, the following steps are performed [6], as depicted in the left column of Figure 2:
• Tissue collection: A specimen is collected by a surgeon who removes a piece of tissue from the body, and a sample is cut from it.
• Tissue processing: To preserve the cells of a sample, it is fixed with a chemical product, placed into a cassette, processed (dehydration, clearing), and then embedded in paraffin wax.
• Tissue sectioning: The paraffin block is cut into thin sections (3-10 microns) to allow visualization through a microscope and placed on glass slides.
• Tissue staining: Slides are stained to reveal the structural details of the tissue sample, and covered with a glass cover slip.
In cytology, the following steps are performed [7] (see Figure 2, right column):
• Fluid collection: A biofluid is extracted from a patient and placed into a tube.
• Fluid preparation: Cells are processed and displayed on a glass slide. This can include centrifugation or filtration steps to concentrate the cells in the tube.
• Fluid spreading and staining: Cells are distributed within a disk on a glass slide, and the slides are stained and covered with a glass cover slip.
Therefore, preparing a histological or cytological sample for microscopic examination is a technically complicated and rather sensitive step. In addition, it requires specialized equipment and expertise. Even if we consider a specific slide preparation pipeline (e.g., using either centrifugation or filtering with cytological samples), differences between laboratories can appear. Any problem that occurs during the preparation of the slides can have an impact on the quality of the prepared samples, which is of utmost importance for the final diagnosis made by the pathologist. In the next section, we focus on the heart of the DP pipeline: the digitization step and its related constraints.

C. DIGITIZATION OF SLIDES
Once the slides are prepared, they can be transformed into digital slides using WSI scanners. There are many different types of WSI scanners. Table 1 provides a (non-exhaustive) list of the most common ones. WSI scanners have two major components. The hardware components include a microscope with objective lenses, a light source (bright field or fluorescent), robotics to load and move glass slides, digital cameras for (line or tile) image capture, and a built-in computer. The software components include digital slide I/O, management, and visualization. Farahani et al. [8] and Patel et al. [9] reviewed the different properties of WSI scanners including slide capacity, image magnification and resolution, file format, and scan speed. Some scanners can digitize a slide at several focal planes, which can be of interest for cytological slides that are thicker than histological slides. With many different WSI scanners available using different technologies, the digitization of slides can produce very different results. Consequently, the parameters of each WSI scanner component can impact the quality of the WSI and therefore its further interpretation. In addition, as pointed out by Ogura et al. [10], the same slide repeatedly scanned with the same WSI scanner will not produce completely identical WSIs. As such, the FDA has designated WSI systems as class III (highest risk) medical devices. Before a WSI system is used, validation must be performed to ensure that the diagnostic performance based on digitized slides is at least equivalent to that of glass slides and light microscopy [11]. Pantanowitz et al. provided guidelines on this topic [12] and analyzed the impact of differences between digitized slides. WSI scanners can then be validated by measuring their stability with regard to the quality of the information captured in the WSI. Shi et al. [13] concluded that one should use 20× magnification scans for diagnostic work-ups and 40× for challenging cases.
In [14], after a review of different studies conducted on eight different WSI scanners, the mean diagnostic concordance of WSI and light microscopy (LM), weighted by the number of cases per study, was established at 92.4%. A similar finding was reported by Rajaganesan et al. in [15], with four different scanners: the diagnostic accuracy of LM was 95.44%, and that of WSI was 93.32%. They also reported rates from other studies in the literature, for which the diagnostic concordance rate was always > 96%. Interestingly, this study also estimated that the mean appearance rate of digital image artifacts (out of focus and stitching) was 6.8% (we will discuss this later in the paper). Given these studies, digitization produces digital slides that do not seem to affect diagnostic performance as long as their quality is sufficient. However, as the visual quality of the digital slides (in terms of color and introduced digital artifacts) can vary significantly among WSI scanners, this must be carefully monitored. In particular, only two scanners, from the Philips™ and Leica™ companies, received FDA approval for the review and interpretation of digital surgical pathology slides prepared from biopsied tissue. This shows that obtaining WSIs of good visual quality is still a challenge.
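To make the notion of a concordance rate ''weighted by the number of cases per study'' concrete, the following minimal Python sketch computes such a case-weighted mean, i.e., the sum of (cases × concordance) over all studies divided by the total number of cases. The function name and the example figures are hypothetical and serve only as an illustration; they are not the data pooled in [14].

```python
from typing import Sequence, Tuple


def weighted_mean_concordance(studies: Sequence[Tuple[int, float]]) -> float:
    """Return the case-weighted mean of per-study concordance rates.

    Each item is (number_of_cases, concordance_rate_in_percent).
    """
    total_cases = sum(n for n, _ in studies)
    if total_cases <= 0:
        raise ValueError("at least one study with a positive case count is required")
    # Larger studies contribute proportionally more to the pooled rate.
    return sum(n * rate for n, rate in studies) / total_cases


# Hypothetical figures for illustration only; they are NOT taken from [14].
example_studies = [(100, 90.0), (300, 94.0)]
print(weighted_mean_concordance(example_studies))
```

With such a weighting, a small study with an unusually low or high concordance has a limited influence on the pooled figure, which may be one reason for preferring it over a plain average of per-study rates.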

D. COMPUTATIONAL PATHOLOGY
With the advent of WSI scanners, the development of image analysis algorithms based on artificial intelligence tools, and an increase in the computational power of computers, the application of computerized image analysis to WSIs can now be expected. In 2011, Fuchs et al. introduced the term ''computational pathology'' (CP) [16] in a journal special issue dedicated to WSI processing [17]. This was the first appearance of this term and it encompasses all approaches that make use of AI methods on WSIs to analyze patient samples [3]. However, what gain can be provided to pathologists with an automated system using computational pathology? First, the samples analyzed by pathologists are most often benign and can be easily distinguished from cancerous ones. This is a potentially huge waste of time, and any system that could help pathologists in localizing cancerous areas in slides would be beneficial [18]. Therefore, computational pathology can be useful for computer-aided diagnosis. Second, it can be used to predict disease outcome and survival.
In particular, grading approaches have been established for prostate and breast cancers (Gleason and Elston/Ellis) which are correlated with patient outcome and long-term survival. Computational pathology can be valuable for providing such a grade estimation based on a quantitative analysis of the biological structures of the slide. From all these potential outcomes, we can roughly say that computational pathology [19] aims at using computational methods to analyze patient specimens (and especially WSIs) for the study of disease [3], [20]. Currently, these computational methods are often referred to as machine learning or artificial intelligence methods [21] and, in particular, deep learning methods based on neural networks. In [22], Janowczyk et al. showed in tutorial cases how deep learning can be used for computational pathology tasks: segmentation (nuclei, epithelium and tubule), detection (lymphocytes and mitosis), and classification (lymphoma sub-type). This paper is important to the community as it has shown how pathologists can benefit from such computational approaches. In [23], a similar approach was adopted to demonstrate the incorporation of AI and machine learning tools into clinical oncology. Many approaches have been proposed so far for the processing and analysis of WSIs. Recent approaches use Vision Transformers and Multiple Instance Learning (e.g., [24], [25], to quote a few). A review of such approaches is beyond the scope of this paper. We refer the reader to recent comprehensive reviews for more insights on AI and deep learning in computational pathology [26], [27], [28], [29], with specific reviews for histology [30], [31], [32], [33], [34] and cytology [35], [36], [37]. Although some ethical issues have appeared in the use of computational pathology in clinical routine [38], most pathologists are in favor of its use [39].
At this time, most of the proposed computational pathology tools are restricted to research use only (RUO) and can only be used to provide a complementary analysis to that performed by a pathologist. In [40] and [41], it was shown that the combination of computational pathology and human pathologists has the potential to improve accuracy and efficiency in gastric cancer diagnosis. However, with the use of computational pathology, remarkable progress has been made beyond RUO. For the Gleason grading of prostate cancer [42], [43], [44] and the Elston/Ellis grading of breast cancer [45], [46], [47], commercial solutions recently reached the market. Similarly, for cancer screening in cervical cytology, some commercial solutions have been proposed. Computational pathology thus makes it possible to address many tasks such as tissue detection, segmentation, and classification, with a very broad focus ranging from computer-aided decision (which aims to assist pathologists in routine diagnosis) to precision medicine (which aims to study the clinical outcome of a patient). However, this comes at the cost of collecting large sets of manually annotated WSIs to train deep learning models efficiently.

E. QUALITY CONTROL IN DIGITAL PATHOLOGY
With the development of DP and CP, it is now commonplace to acquire digital slides and to analyze them with AI tools. However, new problems have appeared that must be overcome for routine clinical practice. Indeed, slide preparation can introduce many artifacts that can impact the readability of the slide. This is also the case with slide digitization (e.g., with blurred areas). Both kinds of artifacts can hinder either a visual analysis by a pathologist or an automated analysis by an AI tool. As a consequence, some approaches have been developed to minimize the appearance of these artifacts or to cope with them by automated detection or correction. As previously mentioned, clinical laboratories have to follow guidelines for quality control and both DP and CP have to conform to these to be usable in practice. As a result, quality control is an emerging topic in DP and CP [26], [35], [48], [49], [50]. In the next sections, we review the approaches proposed to assess the quality of WSIs. As both slide preparation and slide digitization can have an impact on the final WSI quality, we analyze them separately. Finally, we present approaches for quality control at the slide level, and we provide recommendations for improving the quality of slides considered as unanalyzable by computational pathology pipelines.

III. QUALITY OF SLIDE PREPARATION
In a digital pathology flow for sample analysis and diagnosis, the first factor that can strongly impact the quality of a WSI (the digital item) is the quality of the slide preparation (the physical item). As slide preparation is very technical and sensitive, mistakes can be made and artifacts can appear. In histology and cytology, an artifact is a structure that should not be present in living samples. In some situations, the presence of an artifact can compromise an accurate diagnosis from an examination under a microscope [51]. Consequently, artifacts due to slide preparation can also cause mistakes in quantitative analysis involving the processing of WSIs. As the preparation of slides can be very heterogeneous between and within institutions, artifacts are inevitable. What types of artifacts can be encountered? There is no perfect answer to that question, as an artifact is in essence an item that was not expected to appear. In a very detailed technical note, Leica™ presented a list of artifacts that can hinder the interpretation of slides, whether digital or not. As several artifacts are very rare, we present the most common ones grouped into two categories.
• Domain-independent artifacts (in histology or cytology) can be generated during sample preparation: (i) the sample can be contaminated with an unexpected biological element (e.g., blood, bacteria, mucus) or a foreign (non-biological) object (e.g., surgical contaminant); (ii) the glass slide can be dusty or dirty, affecting its background; (iii) air bubbles can appear when the glass cover slip is applied; (iv) staining is a critical step that can cause severe color artifacts if it is not well monitored (e.g., by checking the validity of the staining chemicals). Artifacts can also be generated after sample preparation. For example, pathologists often use markers on glass slides to identify specific areas for analysis. These markers can mask biological components.
• Domain-dependent artifacts are due to the specific characteristics of the specimens and their preparation. In histology, thin tissues can fold on themselves [52]. Yagi et al. [53] showed that there is a correlation between the quality of slide preparation (the thinner the tissue, the better) and the quality of WSIs. In cytology, if the centrifugation is not well performed, many cell clusters will appear, preventing their interpretation [54]. In addition, in cytology, slides are usually thicker [55], which can lead to focus problems (discussed in the section on digitization).
Figure 3 illustrates such preparation artifacts that affect the quality of the slide.
Figure 3. Different types of artifacts: foreign object, marker, dust, tissue fold, air bubble, biological contaminant. Samples are from the dataset https://grand-challenge.org/algorithms/quality-assessment-of-wholeslide-images-through-a/ [56] and [57].
As we have just seen, artifacts that occur during slide preparation can have a significant impact on the accuracy of slide analysis [58], whether this analysis is performed by a human pathologist or a computer-aided decision system. This has received attention from the computational pathology community only recently [48], [59]. Recent works have proposed to perform quality control (automatic or not) of slide preparation to identify the type and/or severity of artifacts from WSIs. The output of such systems can then be used as additional input for computer-aided decision systems to establish a more accurate analysis of WSIs. We review some of these artifact detection methods below. At this level, we focus only on artifacts related to slide preparation. Digitization artifacts are considered later (even if some of the cited works also consider digitization artifacts such as out-of-focus areas). The reviewed methods are summarized in Table 2, which specifies the pathology domain, artifact types, and detection methods.
In [60], Avanaki et al. proposed automatic quality estimators by adapting image quality assessment (IQA) methods that were originally developed for natural images. In particular, they considered IL-NIQE, a no-reference IQA method, to detect artifacts and showed that the scores provided by this estimator discriminate artifacts in agreement with the ratings given by a pathologist. In [61], Kothari et al. proposed a method for identifying tissue-fold artifacts in histological WSIs. Assuming that tissue folds differ from the other areas of the slide with respect to color saturation and intensity, they considered the difference between the two. This difference is then thresholded at different values, and the distribution of the remaining connected components is studied to define two thresholds that enable the extraction of tissue-fold regions. A similar approach was proposed in [62]. In [63], Palokangas et al. proposed the extraction of folds using k-means clustering in a saturation-intensity feature space. In [4], Janowczyk et al. proposed HistoQC, an open-source quality control tool for digital pathology slides. This tool uses a combination of handcrafted features extracted from a digital slide (related to color, brightness, contrast, and edges) that can be fed to machine learning techniques for artifact extraction. The software can then be used to identify artifacts and artifact-free areas. In [64], Chen et al. used HistoQC for quality control of renal biopsy WSIs. They aimed at identifying the slides that were unsuitable for computational analysis because of the presence of artifacts. They used different modules of HistoQC to identify artifacts such as tissue folds, pen markers, air bubbles, and ink stain variations. They showed that HistoQC could identify batch effects in the slide cohorts of three pathology laboratories and concluded that a quantitative process is necessary for robust and reproducible quality control of digital slides. Kumar et al. [65] proposed a method for the identification of artifacts in WSIs of cytological cervical smears. Cells that are not epithelial (e.g., blood cells) are considered as artifacts. The cells were extracted using classical image processing methods. Features (describing intensity, shape, and texture) were extracted from the segmented cells, which were then classified as artifact or non-artifact cells with an SVM. They obtained an accuracy of 86.64%. In [66] and [67], Shakhawat et al. proposed the detection of artifacts (air bubbles and tissue folds) in histological WSIs at low resolution, as the considered artifacts can be seen at low magnification and have visual properties that are very different from the surrounding tissue. They extracted luminance and saturation features, and Haralick gray-level co-occurrence matrix texture features from 100 × 100 patches. Only relevant features were retained using sequential feature selection and fed to an SVM classifier. They obtained an accuracy of 98.98%. Smit et al. [56] proposed a multi-class deep learning model for the semantic segmentation of artifacts caused by tissue folds, ink, air bubbles, dust, and markers for histological WSIs of multiple tissues and staining types. The network architecture was an encoder-decoder network with EfficientNet-B2 as the encoder and DeepLabV3+ as the decoder. The semantic labeling was performed on 1024 × 1024 patches. They obtained an accuracy of 89.45%. In [68], Foucart et al. proposed a deep residual network for artifact detection in H&E and IHC stained WSIs. Their approach relies on rough annotations and works at a low resolution of the WSI pyramid to enable a fast analysis. Semantic labeling was performed using 128 × 128 patches. They also considered a specific data augmentation technique, as artifact areas are much less frequent than non-artifact areas. They obtained an accuracy of 89.77%.
They recently extended their results in [69] and showed that a deep learning approach to artifact segmentation can produce interesting results as long as learning strategies are adapted to dataset characteristics. In particular, artifacts in digital pathology slides are ill-defined objects, which makes them particularly challenging to annotate precisely, and their work addresses this aspect of imprecise annotations. Babaie et al. [70] proposed the use of a pre-trained DenseNet201 CNN as a feature extractor to characterize tissue folds, and fed an SVM classifier with these features. The DenseNet201 CNN was fine-tuned on a dataset of folded and fold-free 255 × 255 patches along with data augmentation. They obtained an accuracy of 96.7%. Ali et al. [71] proposed a fully automatic CNN for the classification and restoration of WSIs containing pen ink markers. First, a CNN detects tiles corrupted by a pen marker. Second, in these corrupted tiles, a YOLO CNN detects the bounding boxes of pen marker areas. Third, corrupted pixels are restored by a domain-adaptive cycle-consistent adversarial generative model. Their approach can produce visually coherent marker-free WSIs while enhancing their quality (as assessed using PSNR, SSIM, and VIF IQA measures). In [72], Zhang et al. proposed a method for assessing the staining quality of WSIs for Gram staining. They considered a MobileNet CNN to estimate the staining quality of the tiles of the slides. They obtained an accuracy of 86.8%. The results were shown to pathologists in the form of a stain-quality heat map. In [73] and [  network to determine the presence of artifacts on 228 × 228 patches at low resolution. From the patch-level artifact estimation statistics, they also provided a slide-level ''usability'' index that estimates whether the slide is appropriate for establishing a clinical diagnosis and an indication of the impact of artifacts on WSI quality. This can help pathologists determine whether the slide needs to be re-scanned or re-stained. They obtained an accuracy of 98.7%.
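The idea of deriving a slide-level ''usability'' indication from patch-level artifact estimates can be illustrated with the minimal Python sketch below. It is only one simple possibility and is not the index defined in the cited works: the function name, the patch probability threshold, and the maximum tolerated artifact fraction are all hypothetical choices made for the example.

```python
from typing import Dict, Sequence, Union


def slide_usability(
    patch_artifact_probs: Sequence[float],
    patch_threshold: float = 0.5,        # hypothetical: patch flagged above this
    max_artifact_fraction: float = 0.2,  # hypothetical: tolerated flagged share
) -> Dict[str, Union[float, bool]]:
    """Aggregate patch-level artifact probabilities into slide-level statistics."""
    if not patch_artifact_probs:
        raise ValueError("no patches provided")
    # Count the patches that a patch-level classifier considers corrupted.
    flagged = sum(1 for p in patch_artifact_probs if p >= patch_threshold)
    fraction = flagged / len(patch_artifact_probs)
    return {
        "artifact_fraction": fraction,
        "usable": fraction <= max_artifact_fraction,
    }


# Four patches, one of which is very likely corrupted by an artifact.
print(slide_usability([0.1, 0.9, 0.2, 0.3]))
```

In practice, the thresholds would need to be calibrated against pathologist judgments, and the fraction could be computed per artifact type to suggest whether re-scanning or re-staining is the appropriate action.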
From these representative studies on artifact detection in WSIs, we observed that this problem has rarely been addressed in the literature. Most recent studies rely on deep learning and obtain very good detection results. However, the integration of the artifact detection result within a computer-aided diagnosis system still needs to be explored.

IV. QUALITY OF SLIDE DIGITIZATION

A. QUALITY OF WSI FORMAT
Many commercial scanners have appeared on the market and each has introduced its proprietary file format. Consequently, there is no established file format for storing and exchanging WSIs produced by WSI scanners (see Table 3). However, these proprietary file formats can share similar properties. Indeed, given the flexibility of the TIFF file format, many have adopted it to store WSI data, often in the form of a tiled multi-resolution pyramid (Figure 4). Nevertheless, this abundance of proprietary file formats can be a strong barrier to their use even if some formats are very similar. Indeed, as we will point out later in this review, deep learning methods for computational pathology require large and diverse datasets from different centers potentially using different scanners. Therefore, the quality of the WSI file format is an important factor for its use in the digital computational pathology analysis pipeline. For instance, the macro-image (that represents a snapshot of the entire glass slide) provides a low-magnification overview of all of the tissue pieces. It can be used to guide the scanner's tissue detection system or for focus-point selection. As reported in [75] and [76], an incorrect macro-image can generate technical problems such as automatic tissue detector failure or poor scan coverage. Even though a standard specification for WSI data has been published by the Digital Imaging and Communications in Medicine (DICOM) Working Group [77], its adoption is rather limited. Consequently, reading and writing WSIs can be technically difficult, even if some open-source libraries have made significant progress [78], [79]. The quality of the WSI file format, in terms of interoperability, must be considered when choosing a WSI scanner. In addition, the quality of a digitized slide can be altered by many factors, which we review in the following sections.
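To make the notion of a tiled multi-resolution pyramid more concrete, the following minimal sketch enumerates the levels of such a pyramid and the number of tiles stored at each level. The function name, the tile size of 256 pixels, the downsampling factor of 2, and the stopping rule are illustrative assumptions and do not describe any specific vendor format.

```python
import math


def pyramid_levels(width, height, tile=256, factor=2, min_side=1024):
    """List (level, width, height, tiles_x, tiles_y) for a hypothetical pyramid.

    Level 0 is the full-resolution image; each further level divides both
    dimensions by `factor` until the longest side is at most `min_side`.
    """
    levels = []
    level, w, h = 0, width, height
    while True:
        tiles_x = math.ceil(w / tile)  # partially filled tiles still count
        tiles_y = math.ceil(h / tile)
        levels.append((level, w, h, tiles_x, tiles_y))
        if max(w, h) <= min_side:
            break
        w = max(1, w // factor)
        h = max(1, h // factor)
        level += 1
    return levels


if __name__ == "__main__":
    for entry in pyramid_levels(100_000, 60_000):
        print(entry)
```

Because each level holds roughly a quarter of the pixels of the previous one under these assumptions, the full-resolution level dominates the file size, which is one reason why compression settings matter.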

B. COMPRESSION
A parameter that can impact image quality is the use of compression. Indeed, all WSI scanners use lossy compression to ensure reasonable file sizes. Typically, image compression is measured using the quality factor (QF) or compression rate. An uncompressed image has a QF of 100. If the scanner uses a QF that is too low, compression artifacts will appear, which are mostly visible as block effects. Figure 5 presents examples of such compression artifacts on histological and cytological images. As shown, a QF that is too low can severely affect the image content making it more difficult to diagnose.
Some studies investigated the optimization of compression standards for WSI scanners. Sharma et al. [81] considered WSIs obtained from a scanner at different compression rates for 12 different stain types. Then, they determined the most suitable QF for pathologists to be able to perform a diagnosis on compressed images. They concluded that a QF of 50 was suitable for all stains. In [82], Bug et al. investigated Scalable High Efficiency Video Coding (SHVC) as a replacement for the JPEG and JPEG 2000 standards currently found in most WSI formats. They showed that SHVC can provide a gain in compression performance but introduces blurring artifacts. In [83], Helin et al. defined an optimized parameterization for JPEG 2000 image compression specifically used with histopathological WSIs. Their parameterization is based on allowing a very high degree of compression on the background part of the WSI while using a conventional amount of compression on the tissue-containing part of the image.
Although there is no consensus regarding acceptable image compression levels, JPEG is thought to allow a compression rate between 10:1 and 20:1, and JPEG 2000 between 30:1 and 50:1, without the loss of diagnostic information [77], [84]. Within these ranges, lossy compression is therefore not an issue for visual interpretation, because few visible compression artifacts that would make the interpretation difficult appear. However, many studies have used deep learning to analyze WSIs, and compression can be a problem with such methods. In [85], Zanjani et al. studied the impact of JPEG 2000 compression on a CNN for detecting tumor metastases in H&E-stained tissue sections. Their experiments showed that the CNN model is robust against a compression rate of up to 24:1 when trained on uncompressed images. In addition, they showed that when the CNN was trained on compressed images, the performance was only slightly affected, even at high compression rates. Chen et al. [86] also investigated the effect of compression on the performance of deep learning approaches for segmentation and detection tasks in WSIs. Their findings are similar: the images can be compressed by 85% while still maintaining the performance of the algorithms at 95% of what is achievable without any compression. In particular, they observed that the minimum acceptable QF for diagnosis by a pathologist corresponded to a decrease in the performance of the deep learning algorithms. These results are in line with the findings of Dodge et al. [87] for natural images. They studied the robustness of four state-of-the-art CNNs against five distortions. The CNNs were highly resilient to compression and it was only at very low QF that their performance began to decrease (less than 10 for JPEG and less than 30 for JPEG 2000).
From these works, we can conclude that the usual lossy compression employed by whole slide scanners does not have a strong impact on the image quality and its use in machine learning tasks, even at high compression rates. Therefore, fixing the compression rates at established values ensures the visual quality of WSIs [84].
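The compression figures quoted in this section are expressed either as ratios (e.g., 20:1) or as percentages. The short sketch below converts between the two, under the assumption that "compressed by 85%" denotes an 85% reduction in file size; the function names are illustrative.

```python
def ratio_to_reduction(ratio):
    """Fraction of the original size removed by an r:1 compression."""
    if ratio < 1:
        raise ValueError("a compression ratio r:1 requires r >= 1")
    return 1.0 - 1.0 / ratio


def reduction_to_ratio(reduction):
    """Compression ratio r (as in r:1) for a given fractional size reduction."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must lie in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    for r in (10, 20, 24, 30, 50):
        print(f"{r}:1 removes {100 * ratio_to_reduction(r):.1f}% of the data")
    print(f"an 85% reduction is about {reduction_to_ratio(0.85):.1f}:1")
```

Under this reading, an 85% size reduction corresponds to a ratio of roughly 6.7:1, which is below the 10:1 to 20:1 range quoted above for JPEG.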

C. COLOR
Assessing and enhancing the color quality of WSIs is probably the most widely addressed topic in the pre-processing of WSIs. Several steps in the preparation of digital slides can have a significant influence on color quality [88]. First, the slides are stained with chemical dyes to highlight the cellular structures and enable their interpretation. Although standardized staining protocols [89], [90] can help to reduce variations in staining results, many factors can affect the stain color in practice: the use of different staining equipment, dye brands, and staining protocols. Second, the slides are digitized using WSI scanners that can provide very different results depending on their electronic components and internal color calibration (if any). Consequently, it is inevitable for WSIs to have variations in their color appearance among different institutions, because they use different staining protocols and scanners (see Figure 6). The computational pathology community has embraced this problem and has tried to address it through two means: color calibration and color normalization. We provide a review of the most representative works in this section.

1) COLOR CALIBRATION
Color calibration is an established routine in the print and photography industries (to name a few) that has been adopted by most digital systems with the use of ICC (International Color Consortium) color profiles. The calibration process consists of comparing the known colors of a set of color patches with their digitization from a digital device. The difference between the two can then be used to define a color correction embedded in the ICC color profile. The US Food and Drug Administration has released recent guidance [91] stating the need to develop a method to control color reproduction throughout the digitization process in whole-slide imaging for primary diagnostic use. They stated that color control is essential in digital pathology and recommended the use of a target slide with spectral characteristics similar to those of stained biological components. The ICC Medical Imaging Working Group has started pooling resources to develop a calibration system for digital microscopes. 15 Unfortunately, this is still not standardized among WSI scanners' vendors, and even if calibration slides have appeared recently, 16,17 their use is not widespread. A general review of this color calibration problem in pathology can be found in [88].
To meet these requirements of color calibration, in [92] Yagi et al. were the first to propose a target slide containing nine filters with color patches selected for H&E-stained slides. This has been further explored by Bautista et al. [93], who proposed a color correction procedure based on the comparison between the spectral colors of the patches and their scanned colors. Using this procedure, they have shown that the color difference between two slides scanned with different scanners is significantly reduced after correction. In [94], Chen et al. eliminated the need for a color calibration slide by measuring the spectral transmittance of a reference biological tissue sample. However, this is difficult to use in practice. In [95], Shrestha et al. proposed an alternative to using a specific color calibration slide. They considered a standard IT8-target transmissive film on a slide and proposed a matrix-based calibration method compliant with the ICC standards, without the use of external color measurement devices such as colorimeters. They showed that with this method there is no visible difference between calibrated slides scanned by different scanners in terms of CIE-Delta2000 JND (Just Noticeable Difference). In [96], a color calibration slide was used to calibrate a WSI scanner for cytology. They showed that this reduces the color variation to less than 2·JND and maintains color fidelity. Therefore, color calibration minimizes system-to-system variability, producing repeatable color output regardless of system age or optical component variation.
15 https://tinyurl.com/ysxz6vjv
16 https://tinyurl.com/55xnd2as
17 https://ffei.ai
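Calibration quality is typically reported as a color difference between the known patch colors and their scanned values. As a hedged illustration, the sketch below computes the mean color difference over a set of patches in CIELAB. It uses the simpler CIE76 formula (Euclidean distance in CIELAB) as a stand-in for the CIEDE2000 formula used in the works above, and it assumes that colors have already been converted to CIELAB; the function names are hypothetical.

```python
import math


def delta_e76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between two (L, a, b) triplets."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))


def mean_patch_difference(reference, scanned):
    """Mean CIE76 difference between reference and scanned patch colors."""
    if len(reference) != len(scanned) or not reference:
        raise ValueError("both lists must be non-empty and of equal length")
    total = sum(delta_e76(r, s) for r, s in zip(reference, scanned))
    return total / len(reference)


if __name__ == "__main__":
    reference = [(50.0, 0.0, 0.0), (60.0, 10.0, 10.0)]
    scanned = [(50.0, 3.0, 4.0), (60.0, 10.0, 10.0)]
    print(mean_patch_difference(reference, scanned))
```

Comparing this value before and after applying a correction gives a simple check that the calibration actually reduced the scanner-induced color error.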
At this time, even though color calibration should be a requirement for any whole slide scanner, this is still not the case. A recent work by Ogura et al. [10] has shown that this can be a serious problem, as several digitizations of the same slide with the same scanner and the same parameters provide slightly non-identical images. Although color calibration from reference targets (such as MacBeth or IT8) is an established method in displays, print, and photography, the lack of such an established reference target slide is still a barrier to color calibration in digital pathology.

2) COLOR NORMALIZATION
Color normalization is the process of transforming the colors of one image so that they match those of another. There are various algorithms for color normalization, most of which were presented in [97] and [98]. Color normalization methods can be grouped into three categories, from the oldest to the most recent: global normalization, stain separation, and generative model-based approaches. All the methods are summarized in Table 4, including the pathology domain, the type of staining under consideration, and the category of the method for color normalization.

a: GLOBAL COLOR NORMALIZATION
Global color normalization methods apply either histogram matching or color transfer. In histogram matching, a normalized histogram is computed, its probability distribution function (PDF) is estimated, and the PDF of the image is matched to that of a reference image [99]. Because histogram matching on entire images ignores local differences in image content, colors associated with one stain may be matched to irrelevant colors. In color transfer, a statistical analysis is used to impose the characteristics of a reference image on other images. In [100], Reinhard et al. matched the means and standard deviations of reference and query images in Lab color space. Because images stained by multiple chemical dyes may have different color distributions, colors associated with different biological components may blend after color transfer. To address this problem, an image can be divided into different regions before the color transfer. Magee et al. [101] applied a Gaussian mixture probabilistic model for automatic segmentation at the pixel level into three classes, Hematoxylin (H), Eosin (E), and background (B), which was followed by color transfer on the corresponding classes from the reference to the query image. Nevertheless, color transfer methods cannot ensure that the structural features of biological components are preserved.
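The color transfer idea of matching per-channel means and standard deviations can be sketched as follows. This is a minimal illustration rather than the implementation of [100]: it assumes that both images have already been converted to the Lab color space and are given as three flat lists of channel values, and the function names are hypothetical.

```python
from statistics import fmean, pstdev


def match_channel(source, reference):
    """Shift and scale `source` so that its mean and std match `reference`."""
    mu_s, sd_s = fmean(source), pstdev(source)
    mu_r, sd_r = fmean(reference), pstdev(reference)
    # A constant source channel has no spread to rescale; only shift it.
    scale = sd_r / sd_s if sd_s > 0 else 1.0
    return [(v - mu_s) * scale + mu_r for v in source]


def color_transfer(source_lab, reference_lab):
    """Apply the statistics matching independently to the three channels."""
    return tuple(match_channel(s, r) for s, r in zip(source_lab, reference_lab))


if __name__ == "__main__":
    src = ([0.0, 10.0, 20.0], [1.0, 2.0, 3.0], [5.0, 5.0, 8.0])
    ref = ([40.0, 60.0, 80.0], [0.0, 4.0, 8.0], [1.0, 2.0, 3.0])
    for channel in color_transfer(src, ref):
        print([round(v, 3) for v in channel])
```

Because the statistics are computed over the whole image, pixels from different stains and from the background are mixed, which illustrates why region-wise variants such as [101] were proposed.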

b: STAIN SEPARATION
Stain separation methods aim to estimate the main stain vectors of an image, which can be used for stain intensity correction and stain replacement. Generally, these methods compute stain vectors using Singular Value Decomposition (SVD) or Non-negative Matrix Factorization (NMF), as the stain concentration cannot be negative. Both involve factorization of the optical density (OD) space matrix. In [102], Ruifrok et al. proposed that each pixel color can be represented as a linear combination of different stains using a stain matrix. They proposed a method called color deconvolution, which decomposes the optical densities of stain mixtures into stain-specific channel information. The method is supervised, as the OD values of the pure stains must be provided in the form of a stain matrix that describes how color is affected by stain concentration. In [103], Macenko et al. proposed an algorithm using plane fitting with SVD to determine the stain matrix from the stain vectors of an image. Prior knowledge of the stains can also be used in the plane fitting process as proposed in [104]. In [105] and [106], 19 CNNs were used to estimate stain vectors faster. Gupta et al. [107] 20 performed the color normalization in three steps: illumination correction, stain color vector correction from an alignment and a rotation of their SVD estimation, and stain quantity correction. Khan et al. [108] learned an image-specific stain color matrix from a color-based classifier using a stain color descriptor. A nonlinear mapping of the channel statistics obtained after color deconvolution enables the reconstruction of a normalized image. Kather et al. [109] 21 proposed an optimized variant of [102] for immunostained images.
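The supervised color deconvolution principle can be illustrated for a single pixel: RGB intensities are converted to optical densities, and the stain concentrations are obtained by solving a 3 × 3 linear system defined by the stain matrix. The sketch below is a minimal illustration under stated assumptions. The stain vectors are commonly quoted illustrative values for hematoxylin, eosin, and DAB and should be treated as assumptions to be replaced by values measured for the stains at hand; the background intensity of 255 and all function names are also assumptions.

```python
import math

# Assumed OD vectors (R, G, B) of the pure stains; rows: hematoxylin, eosin, DAB.
STAINS = [
    [0.65, 0.70, 0.29],
    [0.07, 0.99, 0.11],
    [0.27, 0.57, 0.78],
]


def to_od(rgb, i0=255.0):
    """Optical density per channel; intensities are clipped to >= 1 to avoid log(0)."""
    return [-math.log10(max(c, 1.0) / i0) for c in rgb]


def from_od(od, i0=255.0):
    """Inverse of to_od: transmitted intensity for a given optical density."""
    return [i0 * 10 ** (-d) for d in od]


def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def stain_concentrations(od, stains=STAINS):
    """Solve od[k] = sum_j c[j] * stains[j][k] for c using Cramer's rule."""
    a = [[stains[j][k] for j in range(3)] for k in range(3)]  # a[k][j]
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("stain vectors are linearly dependent")
    conc = []
    for j in range(3):
        # Replace column j of the system matrix by the measured OD vector.
        aj = [[od[k] if col == j else a[k][col] for col in range(3)]
              for k in range(3)]
        conc.append(det3(aj) / d)
    return conc


if __name__ == "__main__":
    pixel = (120, 60, 150)
    print(stain_concentrations(to_od(pixel)))
```

Negative concentrations can appear for pixels whose color is not well explained by the assumed stain vectors, which is one motivation for the non-negativity constraint of NMF-based methods.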
To avoid the supervision required to define the stain matrix in previous methods, unsupervised methods have been developed. In [110] [125] 18 contains the popular methods of [100], [103], [111].
image, the image illuminant is estimated and an NMF-based stain spectral estimation is performed with an initialization using saturation-weighted statistics to enable a better convergence. In [111] Vahadane et al. performed stain separation to estimate the stain matrix by an NMF that incorporates a sparseness constraint followed by a structure-preserving color normalization. Lei et al. [112] proposed an improvement to this method by using CNNs to estimate the stain matrix instead of a sparse NMF. These methods do not require prior information and preserve the structure of the original image. However, they do not preserve all the color information of the source images. Bejnordi et al. [113] used color and spatial information to classify the image pixels into different stain components. The chromatic and density distributions for each of the stain components in the hue-saturation-density color model were then aligned to match the corresponding distributions of a reference.

c: GENERATIVE MODEL-BASED
Generative model-based methods have recently attracted considerable attention. They use adversarial learning of deep neural networks to perform a style transfer to a WSI. In addition to being more adaptive than the global color normalization and stain separation approaches, they remove the need to select reference images to define the stain parameters. Indeed, generative adversarial network (GAN) [126] based methods consider the overall dataset of the target style as the template and approach the problem of color normalization as an image-to-image translation. GAN-based color normalization approaches can be divided into supervised and unsupervised methods. Supervised methods require paired images of different staining protocols and use L1 and adversarial losses to optimize the generative networks. For instance, in [114], 22 Salehi et al. proposed the use of conditional GANs where the generator is trained to generate restained images conditioned by input gray-scale images. However, paired images of different styles require multiple staining, which is difficult to perform in practice. In contrast, unsupervised methods do not require paired images and are much more appealing. Cho et al. [115] 23 proposed a stain-style transfer GAN composed of two transformations: gray-level normalization to have a laboratory-independent image, followed by a colorization that fits the stain-style of a chosen laboratory. The latter used a specific loss that combines reconstruction loss, conditional GAN loss, and feature-preserving loss to ensure the preservation of the extracted features (essential for subsequent analysis of the normalized image). In [116], 24 Shaban et al. proposed a method called StainGAN that uses Cycle-Consistent Adversarial Networks (CycleGAN) for one-to-one domain stain transfers. Cycle consistency allows the images to be mapped to different color models but preserves the structures of biological components. This method was also used by Runz et al. [127]. In [122], Lee et al. proposed an extension of [115], [116] that uses UNet as the generator for better structure preservation, and a Markovian discriminator with local receptive fields.
To improve the quality of the generated image, they also introduced a color classifier that provides feedback to the generator on the normalized color content. Comparisons were conducted for classification tasks on H&E breast cancer WSIs and the classification performance was better than that of the compared methods, i.e., [100], [103], [111], [115], [116]. In [117], 25 Kang et al. proposed a faster alternative to StainGAN using a 1 × 1 convolution. In [118], 26 Liang et al. proposed a method 27 that uses a different reconstruction loss (based on the use of a structural similarity index matrix and directional statistics-based color similarity index) to better preserve the texture, structure, and color of the biological components. In [119] 28 Shrivastava et al. proposed a self-attentive adversarial stain normalization approach for normalizing multiple stain appearances to a common domain. This enables many-to-one domain stain transfer and the feature preserving loss of [116] is replaced by a structural cycle consistency loss. Chen et al. [120] proposed to normalize an input image by style removal and reconstruction, as in [115]. Style removal generates a grayscale image using a color-encoding mask. For style reconstruction, the loss contains an intra-domain adversarial loss, an L1 penalty, and an inter-domain adversarial loss. Ren et al. [121] considered a Siamese network as the generator to regularize the normalization. In [123], Nazki et al. proposed an unsupervised adversarial network to normalize histological WSIs while preserving the structural features of the tissue. To that aim, a single generator is trained to normalize images. The preservation of the fine salient anatomical structures is performed using an auxiliary feature extraction network and a perceptual loss to minimize the perceptual distance between the normalized and the original images. Experiments were conducted for color normalization between different WSI scanners. 
Recently, some authors have proposed performing color normalization with self-supervised techniques. In [124], Zhao et al. first perform stain separation to estimate H and E dyes from H&E stained WSIs. Then, a UNet-based network learns how to re-stain the grayscale images in a self-supervised manner using a combination of adversarial, color, and staining losses. Their approach outperforms the methods [100], [103], [111], [116] in terms of image quality assessment measures (SSIM, PSNR, etc.), and of performance for segmentation and classification of H&E stained breast cancer WSIs. 25 https://github.com/khtao/StainNet 26 https://github.com/hanwen0529/DSCSI-GAN 27 https://github.com/hanwen0529/DSCSI-GAN 28 https://github.com/4m4n5/saasn-stain-normalization
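The cycle-consistency constraint used by CycleGAN-style approaches can be written compactly: an image translated to the other stain domain and back should be close to the original. The toy sketch below computes this L1 cycle loss with plain functions standing in for the two generator networks. All names are hypothetical, images are reduced to flat lists of values, and the adversarial terms of the full objective are deliberately omitted.

```python
def l1_mean(a, b):
    """Mean absolute difference between two equally sized flat images."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x_a, x_b, g_ab, g_ba):
    """L1 cycle loss for one image of each domain.

    g_ab maps domain A to domain B and g_ba maps B back to A; both are
    arbitrary callables here, standing in for trained generators.
    """
    forward = l1_mean(g_ba(g_ab(x_a)), x_a)   # A -> B -> A
    backward = l1_mean(g_ab(g_ba(x_b)), x_b)  # B -> A -> B
    return forward + backward


if __name__ == "__main__":
    def to_b(img):
        return [2.0 * v for v in img]      # toy "stain transfer"

    def to_a(img):
        return [0.5 * v + 0.1 for v in img]  # imperfect inverse

    print(cycle_consistency_loss([0.2, 0.4], [0.6, 0.8], to_b, to_a))
```

A loss of zero means that the two mappings invert each other on the given images, which is the property that encourages the preservation of tissue structure during stain transfer.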

3) INFLUENCE OF COLOR
We reviewed approaches for color calibration and normalization, but how important are these two steps in a computer-aided diagnosis? Some recent studies have attempted to answer that question by studying the influence of color normalization on decision systems. In [128], Leo et al. considered H&E stained histological prostate WSIs. They have shown that textural and structural features can become very unstable when scanned on different scanners. In [129], Jia et al. have also studied the influence of color on H&E stained WSIs by comparing the color spectra with an MDS embedding, which makes it easy to identify low staining quality and abnormal staining conditions. However, no solution is proposed on how to use this information for quality control of the staining. Liu et al. [130] evaluated the degree of color similarity between two images based on the volume of their color gamut. They compared the normalized images obtained using the methods of [100], [103], [111] with a reference obtained by a hyperspectral imaging microscopy system. They observed that the methods in [103] and [111] significantly reduced the color gamut and that the one in [100] better preserved the color gamut but was unable to fully preserve the color information. In [131], Ziaei et al. proposed an evaluation of the color normalization methods' ability to normalize images obtained from one scanner to match the color rendering from another scanner. The comparison was based on the CIE Delta E color difference. They compared the methods in [100], [103], [111], and [116]. Their experimental results showed that color normalization was effectively able to reduce color variation and that StainGAN [116] performed significantly better than the other global and stain separation approaches. However, these two works studied only the color variation of the color normalization methods and not their influence on the subsequent machine learning process. In [132], Aubreville et al. showed that the color domain shift introduced by different scanners strongly affects the performance of CNN-based mitosis detection on histological H&E slides. In [133], Nisar et al. have also shown that it is possible to detect and estimate the staining shift in digital histopathology. To that aim, they considered a domain shift metric measuring the differences between two domains' distributions using features extracted from pre-trained neural networks. As expected, this shift can impact the generalization performance. This advocates the use of color normalization. Pontalba et al. [134] evaluated the necessity and impact of color normalization for CNN nuclei segmentation methods on histological H&E slides. They considered the color normalization methods of [100], [103], [108], [116]. As in [131], they observed that color variations are less important with StainGAN [116]. For nuclei segmentation, the segmentation performance varied significantly depending on the applied color normalization method. Only the approach in [108] preserved the segmentation performance with respect to that obtained with un-normalized images. In [135], Swiderska-Chadaj et al. noticed that for prostate cancer classification, StainGAN normalization can improve CNN robustness as compared to classical color normalization. In [136], Bianconi et al. evaluated the effect of color normalization methods [100], [103], [108] on automated classification methods (based on either classical machine learning or deep learning techniques) of histological H&E slides of different types of cancers. Their results showed that in most cases color pre-processing did not improve the classification accuracy and could even result in a noticeable reduction in accuracy. Similar findings were reported by Gadermayr et al. [137]. 
They studied the combination of several color normalization methods [100], [103] with feature extraction methods (Fisher vectors, LBP, color histograms, and a VGG CNN) for the classification of patches in the glomerulus and non-glomerulus tissue with slides stained by alpha-smooth muscle actin or periodic acid Schiff staining. They observed that the use of color normalization always caused a loss in accuracy. Tellez et al. [138] studied the effect of color normalization [103], [113], [116] on the performance of several different CNN classification tasks on histological H&E slides. They also observed that color normalization was not necessary to achieve better performance. Hameed et al. [139] considered six intermediate layers of the pre-trained Xception model to extract features for the classification of histological H&E breast cancer images. None of the normalization methods they considered [100], [102], [103], [111] was able to outperform the results of the original un-normalized dataset. However, with GAN-based approaches, the classification results of the un-normalized and normalized images can be very similar. In [140], Ciompi et al. investigated the influence of color normalization methods [103], [113] on tissue classification of colorectal cancer tissue samples in H&E-stained images. In contrast to the previously mentioned studies, they reported a significant gain in performance with the color normalization of [113]. Finally, in [141], the influence of several color normalization algorithms [100], [103], [106], [108], [111], [113] has been studied for the classification of three different histological cancer H&E WSIs (ovarian, breast, and pleural) using a ResNet18 network. Their finding is that color normalization does not improve performance when WSIs are from the same center. However, when normalized datasets from several centers are used for learning, an increase in classification performance can be obtained.
In conclusion, color normalization is important for obtaining images of similar colors for analysis by pathologists. However, it has not yet been demonstrated that color normalization can enhance the performance of a computer-aided system, even if the recent GAN-based approaches for color normalization appear very promising [142], [143] for learning stain invariant features to improve the generalization of CNNs.

D. OUT-OF-FOCUS
Despite a controlled environment and an autofocus process, scanners can produce blurred WSIs, either globally, regionally, or locally (see Figure 7). This is mostly caused by poor focusing on the objects of interest [8], known as the out-of-focus (OOF) effect. During acquisition, several problems may appear and contribute to this effect, such as thermal variations, internal or external vibrations, errors in the focus determination at a focus point, or in the generation of the WSI focus map (interpolation from the focus points). Some of these errors are caused by preparation issues, such as tissue folds, bubbles, dirt, or the distribution of objects along the z-axis in liquid-based preparations. Specific tissue types, such as fat, may also affect slide sections.
FIGURE 7. Different types of blur (global, regional, local) that can be encountered in histology (first row) or cytology (second row).
The objective of automated focus quality assessment (FQA) is to: i) determine whether a slide must be rescanned, either locally, regionally, or globally, and ii) provide an FQA map for visual inspection and possible weighting of further processing steps (a kind of confidence map). To perform this, FQA methods locally estimate a focus class (or score) at the patch or tile level, with high magnification levels (usually 20× or 40×). The local scores are directly used to define the pixel values of an FQA map, usually in the form of a heatmap (as shown in Figure 8). Eventually, regional or global focus scores can be deduced from the local ones. Some works have also proposed to directly modify the acquisition process to perform learned auto-focusing in WSI scanners, either for histological [159] or cytological [160] samples. All the FQA methods are summarized in Table 5 including the pathology domain, the type of staining, the data used, the number of focus classes, the category of the method for OOF estimation (IP: Image Processing, HFC: Handcrafted Feature Classification, LFC: Learned Feature Classification), and possible transfer test with other stains or scanners.
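The step from local scores to an FQA map and a slide-level indicator can be sketched as follows. This is an illustrative assumption-laden example: tile scores are taken to lie in [0, 1] with higher values meaning sharper tiles, tiles without tissue are simply missing, and the threshold and function names are hypothetical.

```python
def build_fqa_map(tile_scores, n_rows, n_cols, missing=None):
    """Arrange per-tile focus scores into a 2D grid (the FQA map).

    tile_scores maps (row, col) to a score; unscored tiles keep `missing`.
    """
    grid = [[missing] * n_cols for _ in range(n_rows)]
    for (row, col), score in tile_scores.items():
        grid[row][col] = score
    return grid


def out_of_focus_fraction(grid, threshold):
    """Fraction of scored tiles whose score falls below `threshold`."""
    scored = [s for row in grid for s in row if s is not None]
    if not scored:
        return 0.0
    return sum(1 for s in scored if s < threshold) / len(scored)


if __name__ == "__main__":
    scores = {(0, 0): 0.9, (0, 1): 0.2, (1, 1): 0.8}
    fqa_map = build_fqa_map(scores, n_rows=2, n_cols=2)
    print(fqa_map)
    print(out_of_focus_fraction(fqa_map, threshold=0.5))
```

The grid can be rendered as a heatmap for visual inspection, while the fraction of out-of-focus tiles is one possible global score on which a rescan decision could be based.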
The FQA methods used either during or after digitization are essentially similar. They are inspired by auto-focusing techniques and both determine a local quality score. However, images after digitization usually contain some degradation due to post-processing, e.g., JPEG compression. This section focuses on FQA methods at the patch or the tile level once the WSI has been acquired.

1) IMAGE PROCESSING METHODS
The first OOF estimation works used classical image processing tools. In [161] Obviously it is very difficult to establish manual thresholds that will perform accurately in all focus configurations. As a consequence, works have rapidly shifted towards learned predictions to better adapt to image focus variations. However, this requires annotated datasets.

2) OOF DATASETS
To be able to learn to detect OOF areas, FQA research works have considered annotated datasets. The latter can be obtained by different techniques (resulting in different dataset types). The most commonly employed techniques to label the focus quality of patches in WSIs are:
• Fully manual annotation (MAN): patches are extracted from the WSI and labeled by human experts.
• Automatic z-stacks (STK): patches extracted from real z-stacks with an offset value from the in-focus plane, or an absolute z-level from this location.
• Semi-synthetic z-stacks (SYN) simulate the previous one by applying several degrees of synthetic blur to the in-focus patches.
The labels associated with the patches are usually binary (in-focus or out-of-focus), but in some works [153], [156], and [157] several levels of focus have been considered. FQA methods designed on subjectively labeled data (MAN) should provide a response closer to manual FQA according to the assessor cohort, which is usually able to distinguish a maximum of six OOF grades. Because the creation of MAN datasets is tedious, thus limiting their size, they are mainly used to evaluate the performances of FQA methods optimized with objectively labeled data (STK and SYN). The latter data, related to auto-focusing and z-stacking, are easier to create and overcome the limitations of the former (size and dependence on assessors). Note that STK and SYN are not completely objective (manually calibrated and checked), and SYN is an augmentation technique. While Gaussian blur is commonly used, Bokeh blur (2D Heaviside step function) is known to be closer to the perception of real OOF occurring in photography and microscopy. As observed in [156], the human perception of blur follows an exponential relationship with the OOF level rather than a linear one. In particular, Gaussian blur tends to underestimate strong OOF, which is less the case for the Bokeh blur model. Even though labeled datasets are mandatory to design a learned FQA method, few datasets are available. FocusPath 29 [147] and its extended version 30 contain 864 (resp. 8640) 1024 × 1024 pathological images with 16 absolute z-levels as scores (STK), acquired with a Huron TissueScope LE1.2 at 40×. TCGA@Focus 31 [157], selected from the Cancer Genome Atlas [163], contains 14 371 manually annotated patches (MAN, 2 classes) from 52 organ types. Both of them are restricted to histology.
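The two synthetic blur models mentioned above can be sketched as convolution kernels applied to an in-focus patch to build SYN-type data. The example below is a minimal pure-Python illustration: the Bokeh model is approximated by a uniform disk (an assumption consistent with the 2D step function description), the kernel sizes and border handling are arbitrary choices, and all function names are hypothetical.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Normalized 2D Gaussian kernel; radius defaults to about 3 sigma."""
    if radius is None:
        radius = max(1, math.ceil(3 * sigma))
    k = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
          for x in range(-radius, radius + 1)]
         for y in range(-radius, radius + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


def disk_kernel(radius):
    """Normalized uniform disk, used here as a simple Bokeh blur model."""
    r = math.ceil(radius)
    k = [[1.0 if x * x + y * y <= radius * radius else 0.0
          for x in range(-r, r + 1)]
         for y in range(-r, r + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


def blur(patch, kernel):
    """Filter a 2D grayscale patch with a symmetric kernel (borders replicated)."""
    h, w = len(patch), len(patch[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                yy = min(max(y + dy, 0), h - 1)
                for dx in range(-r, r + 1):
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + r][dx + r] * patch[yy][xx]
            out[y][x] = acc
    return out


if __name__ == "__main__":
    patch = [[0.0, 0.0, 100.0, 100.0] for _ in range(4)]
    print(blur(patch, disk_kernel(1))[0])
    print(blur(patch, gaussian_kernel(1.0))[0])
```

Applying kernels of increasing radius or standard deviation to the same in-focus patch yields a graded series of synthetic OOF levels that can serve as labels.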

3) HANDCRAFTED FEATURE CLASSIFICATION
In [150], Gao et al. proposed the first work on learned WSI FQA from handcrafted features. They used 44 standard measures of image quality as features (neighborhood contrasts, derivative-based, local intensity statistics, wavelet-based), which had already been identified as good candidates for characterizing the OOF effect in the context of microscopy and auto-focusing [164], [165]. An AdaBoost-based binary classifier is then used to estimate whether a patch is in or out of focus. They considered MAN and STK datasets [150], both at 20× and 40×, and trained their binary AdaBoost classifier based on the 44 features for the four datasets independently. For the MAN dataset, the classification accuracy was higher at 20×, with a large improvement (approximately 5%). For the STK datasets, it was higher at 40×, with a slight improvement (< 1%). Globally, STK achieved better performance than MAN (> 92.7% and just above 91%, respectively). The individual behavior of these handcrafted features was further analyzed by Moles Lopez et al. [152]. They observed that the discriminatory ability of these features varies from one biological structure to another for the same stain (most importantly for IHC), and from one stain to another. To cope with this, they considered several binary classifiers based on eight discriminatory features and decision trees (DT), trained either on a stain-specialized dataset (H&E or IHC) or on a mixed version, with subjective scores (MAN). All three learned models achieved an accuracy of at least 96%. To test the transfer ability of these three learned models, they tested them on an independent MAN dataset, and the performances dropped by up to 89%, particularly for classifiers that rely on IHC. 
As classification accuracy was reduced for the mixed version in any case, mainly owing to the presence of IHC patches, they retained the stain-specialized versions with a reduced number of four features (Haralick features, mean gradient magnitude, Tenenbaum gradient, and noise used in [149]) and reduced decision tree depths (approximately seven). In the context of liquid-based cervical cytology, Lahrmann et al. proposed a binary classifier for patches containing cells automatically selected using Otsu segmentation and HSV analysis [151]. The classification is based on an SVM and five handcrafted features expressing the mean quantity of edges and local variations, and the differences between sharpened, smoothed, and blurred versions of the patches. High accuracy and sensitivity (> 98%) were obtained for the MAN dataset. Inspired by previous studies, Campanella et al. [153] considered a Random Forest (RF) model with 13 features to predict a blur class (among six). Based on a feature selection on a Gaussian-SYN dataset, an RF regression model composed of 19 trees and 10 features was retained.
More recently, Hosseini et al. designed several sharpness measures (HSV-MaxPol [147] and FQPath [148]) based on a symmetric FIR kernel that mimics the ability of the human visual system to boost high-frequency domain magnitudes in a balanced manner. The kernel is defined as a superposition of multiple even-order derivative kernels, either fixed (HSV-MaxPol) or fitted to the inverse PSF of the scanner optics (using the Born & Wolf model) up to a threshold frequency (FQPath). Using the FocusPath dataset (STK) to tune the parameters, HSV-MaxPol and FQPath obtained good correlations with the FocusPath scores compared to several state-of-the-art measures, close to the behavior of Maximum Local Variations (MLV) [166], for a reduced computation time. Similar results were obtained for the binary classification [157] (ROC 0.94, PR 0.97). However, when a different dataset (MAN) was used to test the transfer ability (other scanners and stain-tissue types), the performances were considerably reduced and differed between HSV-MaxPol, FQPath, and MLV (0.56 < PR < 0.67). This confirms the observations made in previous studies on the transfer ability of handcrafted features.
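As an illustration of the kind of handcrafted features discussed in this subsection, the sketch below computes two derivative-based sharpness measures on a grey-level patch: the Tenenbaum gradient (mean squared Sobel response) and the mean gradient magnitude. It is a generic textbook formulation under assumed conventions (3 × 3 Sobel operator, reflect padding, averaging over the patch) and does not reproduce the exact feature definitions of the cited works.

```python
import numpy as np

def sobel_gradients(img):
    """Horizontal and vertical 3x3 Sobel responses with reflect padding."""
    img = np.asarray(img, dtype=float)
    p = np.pad(img, 1, mode="reflect")
    # Right column minus left column, weighted (1, 2, 1) along rows.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    # Bottom row minus top row, weighted (1, 2, 1) along columns.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return gx, gy

def tenengrad(img):
    """Tenenbaum gradient: mean of the squared Sobel gradient magnitude."""
    gx, gy = sobel_gradients(img)
    return float(np.mean(gx**2 + gy**2))

def mean_gradient_magnitude(img):
    """Mean of the Sobel gradient magnitude over the patch."""
    gx, gy = sobel_gradients(img)
    return float(np.mean(np.hypot(gx, gy)))
```

In a learned FQA setting, such values would form one entry each of the feature vector passed to a classifier (AdaBoost, decision tree, SVM, or random forest in the works above); a sharper patch is expected to yield larger values than a blurred version of the same content.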

4) LEARNED FEATURE CLASSIFICATION
During the past five years, several data-driven methods based on CNNs have been proposed for FQA of digital microscopy images [167] and WSIs to improve transfer ability. They can be divided into three main categories: i) standard architectures, possibly adapted with minor adjustments, ii) standard architectures truncated at a lower level, and iii) architectures developed specifically for WSI FQA. Standard architectures, which are usually pre-trained on natural images, are all re-trained for WSI FQA. Campanella et al. provided the first example with ResNet-18 [153], but restricted to grey-level patches for comparison with RF. In contrast, the CNN models considered in all the other studies take RGB color patches as input data.
Senaras et al. proposed DeepFocus [155], a binary classifier based on five convolutions and three max-pooling layers to extract the features, and two fully-connected layers followed by a softmax and a manual threshold to determine the OOF class. Trained on an STK dataset with categorical cross-entropy as the loss function, DeepFocus showed high classification accuracy (93.2%) on an independent dataset (from the same scanner and stains, H&E and IHC), and confirmed that H&E-stained tiles are much easier to classify than IHC ones (< 90% for two slides out of three). Kohlberger et al. proposed ConvFocus [156], a similar network that distinguishes 30 OOF levels. The network is a truncated Inception V3 architecture composed of six layers to extract the features (three convolutions + max pooling + average pooling + dropout), and a fully connected layer followed by a softmax. ConvFocus was trained using cross-entropy on two Bokeh-SYN datasets, coming from two different scanners, and augmented by various transformations (orientation, brightness, contrast, hue, saturation, translational jitter, Poisson noise, and JPEG compression). It was tested using two MAN datasets (similar scanners, same stains, six grades). The predictions were highly correlated with the expert scores, mainly for one scanner (SRCC approximately 0.8 and 0.93, respectively).
Wang et al. proposed FocusLiteNN [157], a shallow CNN that provides a sharpness score from a single convolution layer (with k filters for each color channel) and non-linear pooling as an activation function (linear combination of the extremum feature values). FocusLiteNN was compared to the knowledge-based methods discussed in [148], including FQPath, HSV-MaxPol [147], [148], and MLV [166], as well as five standard CNN architectures similar to those considered in previous works (DenseNet-13, EONSS, and different configurations of ResNet). All CNNs were trained on FocusPath using their initial loss function (PLCC for FocusLiteNN). Performances on the FocusPath and TCGA@Focus datasets were always higher for CNN-based models (ResNet-10 performed best), with interesting transfer ability (0.97 < PR < 0.99 for FocusPath and 0.87 < PR < 0.9 for TCGA@Focus). Differences between shallow and deeper CNN architectures were only slight, while the shallow CNN improved computation time by a large margin. Albuquerque et al. [158] performed a comparison using FocusPath to train seven other standard CNN architectures (MobileNet_V2, AlexNet, GoogleNet, ResNet18, VGG16, truncated ShuffleNet, and SqueezeNet) for a classification task (12 OOF levels). Contrary to Wang et al. [157], data augmentation (shifting, zooming, flipping, and rotation) was applied. As most previously tested models were trained with a non-ordinal loss function (typically cross-entropy), Albuquerque et al. [158] also considered five different ordinal loss functions (ordinal encoding, binomial unimodal, regularized cross-entropies, ordinal entropy). The performances on FocusPath with respect to classification accuracy, MAE, and Kendall's τ showed that cross-entropy was the second-best option. MobileNet_V2 with ordinal encoding was the best performer and improved the correlation (SRCC, PLCC) with the FocusPath scores previously obtained by Wang et al. [157] for all CNNs, with a reasonable computation time compared with FocusLiteNN.
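Since several of the comparisons above are reported as PLCC and SRCC between predicted and reference focus scores, a small sketch of both coefficients may help. It uses the standard definitions (Pearson correlation on the raw values, and Pearson correlation on average ranks for Spearman); the function names are illustrative, and tied scores, which are frequent with discrete z-levels, are handled by average ranking.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.shape[0], dtype=float)
    ranks[order] = np.arange(1, values.shape[0] + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return plcc(average_ranks(x), average_ranks(y))
```

A predictor that orders patches correctly but on a non-linear scale obtains an SRCC of 1 with a PLCC below 1, which is why both are usually reported together.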

5) DISCUSSION
To conclude this tour of WSI FQA, the best performances can be obtained by CNN-based methods trained on semi-synthetic or stack datasets. However, most methods are dedicated to specific stains and scanners. The results on transfer ability are encouraging for subjectively assessed data containing other stain-tissue types and acquired by different scanners, even for very shallow networks [157]. While shallow networks seem to provide a more appealing trade-off between computation time and classification accuracy than handcrafted feature-based methods, their behavior for specific stain-tissue types (e.g., IHC) and cytology images has rarely been studied. The datasets used for learning or evaluation are generally free of artifacts due to slide preparation or manipulation. Therefore, even if the quality of the images produced by autofocus and FQA during digitization keeps improving, FQA after digitization is still necessary. Another important issue concerns the decision at the slide level from OOF estimation at the patch level for automatic rescan decisions, which has only been addressed in a few studies, for example [144], [151], as described in the following section.

V. QUALITY DRIVEN BY DIAGNOSIS
With the advent of computational pathology, computer-aided diagnosis (CAD) systems are now able to assist pathologists in establishing their daily diagnoses, for example for tumor detection or cancer grading. However, although these systems can reach the performance of pathologists for some diagnostic tasks, very few can be considered directly applicable in clinical practice [168], [169]. Indeed, when such algorithms, which were developed and validated in pure research settings, are applied in routine diagnostics, they face many variations that may have occurred at each step of the digital slide preparation. As we have seen in the previous sections, both slide preparation and digitization can introduce severe artifacts. While these artifacts can be a nuisance for visual examination by a pathologist, they represent a potential failure case for computer-aided systems, which will never have learned to recognize them. For example, Wright et al. [170] showed that the quality issues of digital slides can have a strong impact on the performance of classification algorithms. Some recommendations have recently been released to facilitate the implementation of computational pathology workflows in pathology laboratories [171]. Standardizing the sample preparation in laboratories is of course a solution to such issues. The CAP NSH WSI Quality Improvement Program is an initiative toward this purpose. Labs can have the quality of their histological H&E WSIs estimated, and feedback is provided to help the lab prevent the appearance of preparation or digitization artifacts. However, standardization will not eliminate the appearance of artifacts, and computational pathology methods must be able to address them. As we have seen in the previous sections, methods have been developed to handle these problems of slide preparation artifacts, staining variations, and out-of-focus areas. 
These can be considered as the first steps toward quality control of WSIs. They should be integrated within any CAD pipeline in digital pathology to establish whether a slide is of sufficient quality to be analyzed and, if not, what countermeasures have to be taken. However, how can we establish the quality of a WSI with respect to its analysis? In this paper, we consider that a slide is analyzable, and can lead to a trusted CAD diagnosis, if:
• The sample is sufficiently representative according to the analysis requirements (sufficient tissue in histology, enough cells in cytology). For instance, in cytology, the Bethesda System for Reporting Cervical Cytology (TBS) imposes a minimum of 5000 analyzed cells to obtain a sample of sufficient quality. In histology, the Elston-Ellis breast cancer grading requires the number of mitotic figures per 10 consecutive high-power fields.
• No element of the sample is obfuscated, potentially leading to doubt in the diagnosis. In cytology, the TBS imposes that less than 50% of the cells can be obscured.
• If part of the sample is obfuscated but an abnormality is detected in a clear area, then the slide can lead to a diagnosis if no other clues are needed.
Indeed, even if a slide contains preparation artifacts such as air bubbles, or digitization artifacts such as out-of-focus areas, if the slide contains enough information of good visual quality to establish a diagnosis, the slide can be considered as ''analyzable''. The methods we reviewed for the detection of artifacts, the estimation of out-of-focus areas, and staining problems provide the results of their analysis in two main forms: i) a heatmap that assigns a quality score to patches of the WSI (Figure 8 presents a heatmap for the estimation of out-of-focus areas), and ii) a semantic segmentation map that assigns one of a set of predefined semantic labels to each pixel (or patch) of the WSI (Figure 9 presents such a semantic segmentation map for the labeling of artifact and non-artifact areas).
Both outputs can be used as inputs to computational pathology methods that can integrate this information into their analysis (e.g., by not analyzing identified artifacts). However, very few methods have been designed to assess the analyzability level of a whole slide: is it suitable for CAD analysis or should it be re-prepared, re-stained, or re-scanned? These methods are reviewed below (Table 6 provides a summary).
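The three analyzability criteria listed above can be written as a simple decision rule. The sketch below is one possible reading for cytology, using the TBS figures quoted in the text (at least 5000 cells, less than 50% obscured cells); the `SlideSummary` structure, its field names, and the way the third criterion overrides the second are assumptions made for illustration, not a published rule.

```python
from dataclasses import dataclass

@dataclass
class SlideSummary:
    """Hypothetical slide-level summary produced by upstream QC steps."""
    n_cells: int                     # number of cells available for analysis
    obscured_fraction: float         # fraction of cells obscured, in [0, 1]
    abnormality_in_clear_area: bool  # abnormal finding located in a clean region

def is_analyzable(s, min_cells=5000, max_obscured=0.5):
    """Return True if the slide can be passed on to CAD analysis.

    Assumed logic: the sample must be representative (criterion 1), and
    either not too obscured (criterion 2) or contain an abnormality
    detected in a clear area (criterion 3).
    """
    representative = s.n_cells >= min_cells
    clear_enough = s.obscured_fraction < max_obscured
    return representative and (clear_enough or s.abnormality_in_clear_area)
```

For example, `is_analyzable(SlideSummary(8000, 0.7, True))` accepts a heavily obscured slide because an abnormality was found in a clear area, whereas the same slide without that finding would be flagged for re-preparation or re-scanning.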
In [151], Lahrmann et al. proposed an approach for scoring the focal quality of cytological slides. The slide was divided into 16 low-resolution regions from which cells were extracted with Otsu thresholding. A total of 200 cells were randomly chosen and cropped at high resolution for each region. Each cell is described by sharpness image processing features that are fed to an SVM, which classifies each cell as in or out of focus. The percentages of in-focus cells of the regions are summed and divided by the number of regions. If this value is lower than a user-defined threshold, the slide has to be re-scanned. Ameisen et al. [144], [145] proposed a method to assess the focus quality of a WSI. Blank tiles were excluded based on their color saturation values. The remaining tiles were characterized using image processing methods in terms of sharpness, contrast, brightness, and color. A combination of in-house thresholds is applied to these features to determine whether a tile is of sufficient quality. Depending on the magnification in the tile pyramid, the image is considered sharp if 90% of the tiles are sharp at 2x magnification (70% at 10x magnification). Their results were consistent with those of pathologists for 100 WSIs with various blurred areas. Their work has recently been extended into a software tool that was presented in [172]. A similar approach was proposed in [173] and [174], but without providing a slide-level quality decision. In [153], Campanella et al. first performed background detection to avoid blank tiles. A set of 10 sharpness measures was extracted from the patches and fed to a random forest classifier that assigns a blur score to each patch. This method provides a blur heat map as output, but also a slide-level blur score defined as the percentage of the blurred surface of the WSI. Zhang et al. [72] proposed using a CNN to assess the quality of WSIs stained with Gram staining. 
They considered a MobileNet CNN to estimate the staining quality of the slide tiles. From the quality of the tiles, they generate assessments (good/average/low) of the slide quality from empirically selected thresholds in terms of staining, density, and presence of artifacts. In [66] and [67] In [73], Haghighat et al. proposed the pathProfiler tool for quality control in a large retrospective cohort of prostate WSIs. After extracting the tissue regions to avoid blank tiles, a multi-task deep neural network performs a quality estimation at 5x magnification. With this model, a tile is described by six inferred quality measures assessing usability, no artifact presence, staining artifacts, focus artifacts, tissue folding, and presence of other artifacts (dirt, ink, air bubble, etc.). The tile-level statistics are then aggregated and fed to a fully connected neural network to predict quality at the slide level. Three slide-level scores are provided to predict the WSI usability, focus, and staining qualities. They also used handcrafted features extracted from HistoQC [4] (this tool does not provide a slide-level analysis) and found that learned deep features perform much better. This work is interesting as it is the only one that: i) considers the usability of WSIs to assess whether the slide is appropriate for clinical diagnosis, and ii) simultaneously estimates artifact presence, staining problems, and out-of-focus areas. This study is probably the closest to an ideal quality control tool. Now, if we compare the number of approaches in Table 6 with those presented for artifact detection, staining quality, and focus quality, we can see that very few approaches propose a global slide-level quality analysis. Indeed, most state-of-the-art approaches we have seen in the previous sections provide only a quality heat map or a semantic segmentation map, and usually for only one quality criterion.
To conclude, the field of quality control of WSIs is still in its infancy. The main issues are known and are related to artifact detection, staining quality, and focus quality. Although some methods have been proposed separately for each quality issue (as described in the previous sections), they all have limitations in terms of accuracy and reproducibility on other datasets from other digital pathology centers. In addition, their output is at the moment limited to heat and segmentation maps that can be overlaid on the WSIs but do not evaluate the global slide quality. The two recent approaches of [56], [73] provide good directions towards what the community should converge to: a quality control based on artificial intelligence that is able to recognize all the quality problems and to decide what to do with the slide, from rejecting it, to advising a re-stain or rescan, to considering the slide as analyzable by a CAD system.
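The slide-level focus decisions reviewed in this section all reduce tile-level (or cell-level) outputs to a single rule. The following sketch collects three of them as described in the text: the percentage of blurred surface, the region-averaged percentage of in-focus cells compared with a user-defined threshold, and the magnification-dependent proportion of sharp tiles (90% at 2x, 70% at 10x). Function names and input formats are assumptions; the tile-level classifiers themselves are not reproduced.

```python
import numpy as np

def blurred_surface_percentage(tile_is_blurred):
    """Slide-level blur score: percentage of tiles flagged as blurred."""
    return 100.0 * float(np.mean(tile_is_blurred))

def needs_rescan_by_regions(in_focus_pct_per_region, threshold_pct):
    """Average the per-region percentages of in-focus cells and request
    a rescan when the average falls below a user-defined threshold."""
    return float(np.mean(in_focus_pct_per_region)) < threshold_pct

def slide_is_sharp(tile_is_sharp, magnification):
    """Proportion-of-sharp-tiles rule with the thresholds quoted in the
    text: 90% at 2x magnification and 70% at 10x magnification."""
    required = {2: 0.90, 10: 0.70}[magnification]
    return float(np.mean(tile_is_sharp)) >= required
```

Each function expects the tile-level decisions as a flat boolean (or percentage) array, so blank tiles are assumed to have been removed beforehand, as done in the reviewed methods.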

VI. LABELED DATA QUALITY AND QUANTITY: HOW TO LEARN THE UNEXPECTED
In the previous sections, we reviewed the most recent state-of-the-art approaches for WSI quality control in digital pathology and observed how they can be integrated within a slide-level quality analysis for a computational pathology process. Regardless of the QC to be performed (artifact and out-of-focus detection or stain normalization), the most recent and efficient state-of-the-art methods rely on deep learning approaches. Many recent reviews of the field of deep learning in computational pathology have been published, and we refer interested readers to these [22], [26], [27], [30], [31], [32], [35], [36], [47], [175]. Although deep learning approaches enable astonishing results that have advanced the field of computational pathology, these techniques require a large number of labeled examples to guarantee good generalization. Generalization is a well-known issue in machine learning. It refers to the ability of a learned algorithm to adapt properly to new and previously unseen data (hopefully drawn from the same distribution as that used to create the model). It has been studied from a theoretical point of view with the help of the VC dimension [176]. The VC dimension is a measure of the capacity (or complexity) of a set of functions that can be learned using a classification algorithm. Baum and Haussler [177] proposed that if a generalization level of 90% is desired, the number of training samples should be about 10 times the VC dimension [176]. It has recently been proved [178] that the VC dimension of a multi-layer perceptron of W weights and L layers is O(W · L · log(W)). 
For instance, a shallow MLP with an input vector of size 100 to be classified into 5 classes with , where W is the number of weights, L is the number of layers (both convolutional and fully-connected layers), β is the distance from initialization in the operator norm, λ is the margin, n is the number of sample data, and the bound holds with a probability of at least 1 − δ. Thus, with very deep neural networks, it can be expected that attaining good generalization will require a very huge number of examples if the number of weights and the depth of the CNN are large. As this is usually the case (with the popular ResNet 101, W is more than 44 million with L = 101 layers), a key issue of applying DNNs in computational pathology is therefore highly related to the availability of a huge and well-labeled learning dataset. The quantity and the quality of a learning dataset are also something that has to be taken into account when designing a computational pathology approach for CAD. Indeed, the quality of the dataset used for learning can potentially have a strong impact on the quality of the induced deep learning algorithm. We already addressed this partly when we reviewed the influence of color and focus of WSIs: once a DL algorithm has been trained, it is built only to work on data that are similar to those used to train it. Therefore, if one wants an algorithm to be as versatile as possible, it should have learned to recognize patterns in many different situations.
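To give an order of magnitude for the argument above, the sketch below combines the two rules quoted in the text: a VC dimension growing as W · L · log(W), and a training set of about 10 times the VC dimension. The hidden constant of the asymptotic bound is set to 1 and the logarithm is taken as natural; both are assumptions made only to obtain a rough figure, so the result should not be read as a precise requirement.

```python
import math

def vc_dimension_estimate(n_weights, n_layers):
    """Rough VC-dimension estimate W * L * ln(W).
    Assumption: constant factor 1 and natural logarithm."""
    return n_weights * n_layers * math.log(n_weights)

def samples_for_generalization(n_weights, n_layers, factor=10):
    """Rule of thumb quoted in the text: about `factor` times the VC dimension."""
    return factor * vc_dimension_estimate(n_weights, n_layers)

# ResNet-101 figures quoted in the text: about 44 million weights, 101 layers.
resnet101_samples = samples_for_generalization(44_000_000, 101)
print(f"{resnet101_samples:.2e} training samples under these assumptions")
```

Under these assumptions the estimate is several orders of magnitude larger than any annotated pathology dataset, which is the point made in the text: worst-case capacity arguments suggest that very deep networks need far more labeled data than is usually available.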
Therefore, the evaluation of machine learning algorithms for digital pathology can be very delicate [180]. In [181], Wahab et al. studied, for the classification of different breast cells in H&E WSIs, the quality of annotations in terms of completeness, exhaustiveness, diversity, and agreement. They concluded that standardization of annotation protocols is necessary and proposed a new one. Two direct consequences of annotation quality are that: i) datasets have to be very carefully labeled to obtain a high-quality training corpus (to cope with the lack of quality of labels), and ii) the DL algorithm can use specific learning strategies to make it more robust to variations (to cope with the lack in quantity of data). These two aspects are considered in the sequel.
• Artifacts should be considered as additional categories to avoid too many false positives.
• Annotations should be performed by several experts.
Unfortunately, very few datasets meet all these requirements (see [32] for a list of available cancer histology datasets). Recently, Hosseini et al. [183] proposed a new digital pathology dataset called the ''Atlas of Digital Pathology''. In particular, they demonstrated the quality of their image labels through pathologist validation and by training three state-of-the-art neural networks for tissue type classification. This stresses that obtaining annotations of large cohorts of representative WSIs is a very sensitive step, mainly because of the expertise required to generate quality labels and the limited availability of qualified experts. Therefore, pathologists require efficient annotation tools (preferably open-source) that easily enable them to label patterns in WSIs. Fortunately, such tools are available (see [184] for a complete review):
• Icy [185].
• QuPath [186].
• Cytomine [187].
• SlideRunner [188].
• Quick Annotator [189].
Regardless of the chosen annotation tool, the quality of the labeled dataset must be carefully monitored, because labeling is usually performed by a single pathologist. This can have a strong influence on the generalization quality of the learned model and introduce some bias, as recent studies have stressed [190], [191], [192], but this is a well-known issue of any deep learning based computer vision approach [193].

B. LACK IN THE QUANTITY OF DATA
When the objective of a method is to detect rare items in WSIs (such as artifacts), the collection of annotated data is all the more problematic. Indeed, because rare items are by definition not often encountered, building a representative dataset is almost impossible, and this is particularly true for artifact detection in quality control. Therefore, other strategies have been envisioned to address this lack of data. In computer vision, when a dataset is strongly imbalanced or its size is too limited to train a deep learning algorithm, typical strategies exist to alleviate the scarcity of annotated data [194]: data augmentation, transfer learning, domain adaptation, and weakly-supervised learning. Data augmentation [195] artificially generates synthetic data from the initial dataset to enlarge it and improve the performance of the model. Augmentation can be performed using many different transformations, such as geometric transformations or color shifts. Transfer learning [196] works by training a network on a large dataset such as ImageNet and then using those weights as the initial weights in a new classification task. Transfer learning is valid only if the data to be processed are sufficiently similar to the source data (e.g., natural images). Domain adaptation is a type of transfer learning method. A DL algorithm learns from a source domain with a large labeled dataset and aims at achieving comparable performance for the same task on a target domain with few labeled data [197]. Weakly supervised learning [198] consists in training the models with labels that are less expensive to collect than image-level annotations (e.g., grades at the slide level). Such labels are often easier to obtain with limited effort, and in large quantities. All these strategies have been explored in computational pathology, an excellent review of which can be found in [30]. We cite only some representative recent works for each strategy in the sequel to provide a rough overview.

1) DATA AUGMENTATION
Data augmentation is a frequently used solution to introduce a certain degree of invariance into a CNN and to tackle class imbalance by artificially enlarging the learning dataset. In [138], Tellez et al. conducted a study on different kinds of data augmentation (rotations, mirroring, scaling, elastic deformation, Gaussian blur and noise, brightness and contrast, Hue-Saturation-Value and Hematoxylin-Eosin-DAB color shifts) and showed that this can lead to performance gains. In particular, stain color augmentation is crucial for achieving the best performance. In [199], Teramoto et al. demonstrated a substantial gain in performance with data augmentation for the classification of benign and malignant cells in cytology with a VGG CNN. In [200], Annuscheit et al. investigated data augmentation techniques (color-based, geometric-based, filter-based transformations, and erasing) on different datasets using multiple network architectures (VGG, Inception, DenseNet). They observed that geometric-based techniques increase the model performance but that color-based augmentations have no significant effect. This result does not agree with those of [138], but their approach is highly sensitive to data augmentation hyperparameters. To address this problem, in [201] Faryna et al. proposed to use automated and computationally efficient data augmentation, as classical data augmentation requires extensive hyper-parameter tuning and can lead to sub-optimal generalization performance. Based on the RandAugment framework, they considered several domain-specific modifications relevant to histopathological images (based on those of [138]). They showed that this automated data augmentation could outperform the approach of Tellez et al. [138], where data augmentation was manually tuned.
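A minimal sketch of patch-level augmentation is given below, restricted to a few of the transformations listed above (rotation by multiples of 90°, mirroring, brightness and contrast jitter). The parameter ranges and function name are illustrative assumptions; stain-specific color augmentation (HSV or HED shifts), which the cited studies identify as important, is deliberately left out because it requires a color-space conversion not shown here.

```python
import numpy as np

def augment(patch, rng, brightness=0.1, contrast=0.1):
    """Randomly transform an H x W x 3 float patch with values in [0, 1].

    Geometric part: rotation by a random multiple of 90 degrees and a
    random horizontal flip. Photometric part: contrast gain around the
    patch mean and an additive brightness shift, then clipping to [0, 1].
    """
    out = np.rot90(patch, k=int(rng.integers(0, 4)), axes=(0, 1))
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    gain = 1.0 + rng.uniform(-contrast, contrast)
    bias = rng.uniform(-brightness, brightness)
    mean = out.mean()
    out = (out - mean) * gain + mean + bias
    return np.clip(out, 0.0, 1.0)
```

Calling `augment(patch, np.random.default_rng(0))` repeatedly yields different variants of the same patch; the sensitivity to `brightness` and `contrast` is the kind of hyperparameter dependence that automated schemes such as RandAugment aim to remove.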

2) TRANSFER LEARNING
Two strategies are considered in transfer learning. The first strategy consists in using off-the-shelf features extracted from a source pre-trained network, which are fed to a specific classifier for the target task. The second strategy consists in using a source network with pre-trained weights and fine-tuning the weights on the target domain. In [202], Sharma et al. demonstrated the ability of the pre-trained Xception model to perform breast cancer histopathological image classification in contrast to handcrafted approaches. Li et al. [203] showed that off-the-shelf features learned from natural images can be reused in computational pathology, but that the amount of information that could be transferred heavily depended on the complexity of pathology images. Some papers have studied this transfer ability from general models such as ImageNet in pathology. In [204], Sharma et al. compared the performance of features extracted from networks trained on ImageNet and on histopathology data. They demonstrated that specific encoders (ResNet) trained on multiple histopathology datasets yield better features than their ImageNet-trained counterparts and should be used for weight initialization in histopathology tasks. Aitazaz et al. have recently shown [205] that the vision transformer works better on histopathology images than CNN models pre-trained on ImageNet. In [206], Mormont et al. investigated various deep transfer learning strategies on histological and cytological datasets. They compared transfer strategies for seven different network architectures. They observed that fine-tuning always outperformed off-the-shelf features from the last layer of the network, regardless of the network. Similar conclusions were drawn by Jang et al. in [207], who studied transfer learning between different types of cancers. In [208], Mormont et al. investigated multi-task learning as a way of pre-training networks. 
They gathered 22 digital pathology datasets into a single dataset and used multi-task training. The features extracted from their model are superior to those of ImageNet. This shows that domain-specific pre-training can be an interesting alternative to fine-tuning.
Another approach is to assume that generalizable knowledge exists across different problems and to use weight distillation [209] for cross-knowledge transfer.
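The difference between the two transfer strategies described at the start of this subsection (off-the-shelf features versus fine-tuning) can be shown on a toy model. In the sketch below, a single tanh layer plays the role of the pre-trained backbone and a logistic layer plays the role of the target classifier; the random "pre-trained" weights, the network size, and the learning rate are all stand-ins chosen for illustration and bear no relation to the architectures of the cited works.

```python
import numpy as np

def forward(x, w_backbone, w_head):
    """Backbone features (tanh layer) followed by a logistic classifier."""
    features = np.tanh(x @ w_backbone)
    probs = 1.0 / (1.0 + np.exp(-(features @ w_head)))
    return features, probs

def cross_entropy(x, y, w_backbone, w_head):
    """Mean binary cross-entropy of the model on (x, y)."""
    probs = np.clip(forward(x, w_backbone, w_head)[1], 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(probs) + (1.0 - y) * np.log(1.0 - probs)))

def train(x, y, w_backbone, w_head, fine_tune, lr=0.5, steps=200):
    """Gradient descent on the cross-entropy.

    fine_tune=False: off-the-shelf features, only the head is updated.
    fine_tune=True:  the backbone weights are updated as well.
    """
    w_backbone, w_head = w_backbone.copy(), w_head.copy()
    n = x.shape[0]
    for _ in range(steps):
        features, probs = forward(x, w_backbone, w_head)
        err = (probs - y) / n                      # gradient w.r.t. the logits
        grad_head = features.T @ err
        if fine_tune:
            # Back-propagate through the tanh layer (uses the current head).
            grad_feat = (err @ w_head.T) * (1.0 - features**2)
            w_backbone -= lr * (x.T @ grad_feat)
        w_head -= lr * grad_head
    return w_backbone, w_head
```

With `fine_tune=False` the returned backbone is identical to the input one, which is what makes off-the-shelf features cheap; with `fine_tune=True` the backbone drifts toward the target task, at the cost of more parameters to fit on limited data.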

3) DOMAIN ADAPTATION
When transfer learning is considered, the source and target domains are assumed to follow similar distributions. Domain adaptation deals with cases where a model trained on a source distribution is used in the context of a different (but related) target distribution. This problem has recently been addressed by employing adversarial learning: a discriminator is trained to distinguish source and target data using features extracted from a deep neural network as inputs, while the deep neural network is tuned to confuse the discriminator. This helps to map the source and target data close to each other in the feature space. In [210], BenTaieb et al. proposed the use of GANs to learn dataset-specific staining properties in order to transfer stains across datasets. Fine-tuning the obtained stain transfer network with images from a new domain makes it possible to normalize training images with respect to the new domain distribution. Ren et al. [121], [211] proposed using unsupervised domain adaptation to transfer discriminative knowledge obtained from the source domain to the target domain. Adaptation is achieved through adversarial training to find an invariant feature space in the source domain, along with a Siamese network architecture on the target domain to enforce regularity. Shi et al. [212] also used stain style transfer, with CycleGAN, to translate the style of a small image dataset into that of a large dataset. An Inception CNN trained on the large dataset can then be applied to datasets from other centers. In [213], a different approach was proposed. The authors train a ''universal'' model to recognize diverse histological tissue types from a source domain dataset of healthy slides from various organs. They can then adapt the model to transfer diagnostically relevant labels for tissue and disease classification into target domains without any re-training or fine-tuning, by using prior histological knowledge.

4) WEAKLY SUPERVISED LEARNING
Building a large dataset of cell-level labels on WSIs is tedious and time-consuming. In contrast, slide-level labels (e.g., grading or diagnosis) are easier to obtain as they are readily available from diagnostic reports. This is the motivation of weakly supervised learning, where the slide-level diagnoses constitute weak labels. Recent reviews on that topic can be found in [214] and [215]. It is then assumed that if a WSI has a cancer diagnosis, at least one of its tiles must contain cancer cells. An application of this principle, referred to as multiple instance learning (MIL), was performed by Campanella et al. [168]. They used a dataset of 44,732 WSIs with only slide-level diagnoses as labels and obtained impressive results on test sets of prostate cancer, basal cell carcinoma, and breast cancer metastases. In [216], Kanavati et al. trained a CNN based on the EfficientNet-B3 architecture, using transfer learning and weakly supervised learning, to predict carcinoma in WSIs. They also compared fully supervised learning and weakly supervised learning and demonstrated that: i) fully supervised learning performs best when cell-level labels are available, and ii) when only slide-level diagnoses are available, weakly supervised learning can be performed but requires a much larger dataset of WSIs. In [217], Teramoto et al. developed a weakly supervised method for the classification of benign and malignant lung cells in cytological images using attention-based deep multiple instance learning (AD MIL). Images were divided into patch images and stored in bags. Each bag was then labeled as benign or malignant, and classification was conducted using AD MIL. Their weakly supervised learning with AD MIL was able to reach the accuracy obtained with supervised learning and, in addition, enables the visualization of the regions that contributed to the decision through the attention mechanism. In [218], Lu et al. proposed Clustering-constrained Attention Multiple instance learning (CLAM), which also only requires slide-level multi-class labels. CLAM uses attention-based learning to automatically identify sub-regions of high diagnostic value to accurately classify the whole slide, while also utilizing instance-level clustering over the representative regions identified to constrain and refine the feature space. In contrast to classical MIL, CLAM uses an attention-based pooling function to aggregate the patch-level features into slide-level representations for classification. This makes their approach much more efficient than classical MIL and diminishes the need for very large sets of slide-level labels. This last work showed that weakly supervised learning can be beneficial in computational pathology. This has been confirmed by the recent study [219] of Ghaffari Laleh et al., which compared weakly supervised deep learning pipelines for whole slide classification in computational pathology. In particular, they showed that a classical weakly supervised approach using Vision Transformers (which are very new in computational pathology) can outperform MIL and CLAM. As the field continuously evolves, there is also much attention now on self-supervised learning. Contrary to weakly supervised learning, which uses slide-level annotation, self-supervised learning aims at pre-training a model on an unlabeled set to obtain task-agnostic feature representations. The model can then be fine-tuned on a limited amount of labeled data to obtain task-specific features [220], [221], [222]. Although these models are appealing, they still need many computational resources to be trained.
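The difference between classical MIL aggregation and attention-based pooling can be illustrated with a minimal sketch. It is not the implementation of any of the cited methods: the function names are hypothetical, and the attention scores, which a real model would learn, are simply passed in as inputs here.

```python
import math

def max_pool_mil(tile_probs):
    """Classical MIL assumption: a slide is positive if at least one tile is,
    so the slide-level score is the highest tile-level probability."""
    return max(tile_probs)

def attention_pool(tile_features, tile_scores):
    """Attention-based pooling: softmax-weighted average of tile features.
    tile_scores are unnormalized attention scores (assumed given here)."""
    m = max(tile_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in tile_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tile_features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, tile_features))
              for d in range(dim)]
    return pooled, weights
```

The returned weights are what makes it possible to visualize which tiles contributed most to the slide-level decision.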

VII. IMPORTANCE OF QUALITY CONTROL IN A ROUTINE LABORATORY PROCESS
To support our conclusion that quality control is essential in digital pathology in a routine laboratory process, we consider the monthly WSI output of a laboratory for cytological slides. Indeed, with the large number of slides that have to be digitized daily in a laboratory, quality issues are inevitable. However, it might be beneficial to detect these quality problems automatically. To support this assumption, we analyze the obtained monthly set of cytology WSIs both qualitatively and quantitatively. First, we provide details on the corpus of WSIs. They were obtained from a single laboratory equipped with a P250 3DHistech scanner generating .mrxs files. The whole corpus was digitized consecutively in a 30-day window between October 10, 2020, and November 29, 2020 (i.e., during the 21 days on which slides were scanned). It represents 2093 WSIs, each scanned at different magnifications from 5x to 40x. At the highest resolution, the number of microns per pixel (mpp) for each slide is approximately 0.24; in other words, 1 micron of a glass slide needs 4-5 pixels to be stored digitally. The corpus represents approximately 2.85 terabytes of data. On this basis, we will analyze different aspects related to QC, each involving a perturbation of the diagnosis. The aim is to be as close as possible to reality when analyzing the needs of a laboratory moving to the digital world.
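The corpus figures above can be turned into a few orders of magnitude with simple arithmetic. The sketch below only uses the numbers quoted in the text; the decimal convention (1 TB = 1000 GB) is an assumption.

```python
MPP = 0.24          # microns per pixel at the highest resolution (from the text)
N_SLIDES = 2093     # number of WSIs in the corpus (from the text)
SCAN_DAYS = 21      # days on which slides were scanned (from the text)
CORPUS_TB = 2.85    # total corpus size in terabytes (from the text)

# Pixels needed along one axis to cover one micron of glass slide.
pixels_per_micron = 1.0 / MPP

# Average storage per slide, assuming 1 TB = 1000 GB.
avg_slide_gb = CORPUS_TB * 1000.0 / N_SLIDES

# Average number of slides digitized per scanning day.
slides_per_day = N_SLIDES / SCAN_DAYS

print(f"{pixels_per_micron:.2f} px/micron, "
      f"{avg_slide_gb:.2f} GB/slide, {slides_per_day:.0f} slides/day")
```

This gives slightly more than 4 pixels per micron, a little under 1.4 GB per slide, and about 100 slides per scanning day.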

A. SELECTED CRITERION
A digital slide is considered ''unanalyzable'' or ''unreadable'' (denoted as Read−) if there is doubt in the diagnosis according to all the pixel data available. For example, if some visible nuclei are noticeable but obfuscated by any factor (preparation issue, sample quality, out-of-focus area, etc.), doubt is allowed because these nuclei could potentially change the diagnosis for the whole slide. Several criteria have been defined for monitoring the different manifestations of these obfuscations. The labels used to annotate the corpus are described in Table 7 and are discretized into levels denoted by ''−'', ''+'', or ''++'', in ascending order. These labels were defined according to the background of the annotator in cytology diagnosis with or without AI assistance. They also correspond to the main issues that the literature has already addressed according to our survey. Two main categories of elements are considered:
• Issues that an automatic method can handle from a quality diagnostic perspective (Blur, Bubb)
• Elements that affect an automatic analysis and that should be considered when collecting a training corpus (Atro, Pauc, Bloo, Poly, Mucu, Bact)
The intuition behind this selection is that several concomitant elements that each slightly affect readability can, together, accentuate the perturbation in the establishment of a reliable diagnosis. All annotations were made by a single expert of the domain (a Cyto Technologist (IAC) involved in the diagnosis of both glass and digital slides and in collecting corpus cell examples to train machine learning based nuclei classification) with more than 10 years of experience.

B. DIACHRONIC AND QUANTITATIVE ANALYSIS
The 2093 slides were annotated according to the labels and their scale, at two or three different levels. WSIs are split into two main categories, ''readable'' (Read+) and ''unreadable'' (Read−) with their levels (''−'', ''+'' or ''++''). Each WSI was annotated with each category label and a level of readability. A synthesis of all annotations made is given in Table 8. Approximately 15% of WSIs have been labeled as ''hard to diagnose'' (Read−). One can see that, even if considered as readable, the slides always contained preparation (air bubbles, blood, mucus and bacteria) and digitization (blurred areas) artifacts. Therefore, artifacts are not always a problem in terms of the readability of the slide. However, as expected, some labels determined the readability of the slide more than others. WSIs with a sparse distribution of cells (Pauc +/++) often lead to the Read− classification.
The same effect appears in the presence of bubbles (Bubb +), blur (Blur +/++) or poly (Poly +/++). For the other labels, the correlation between a given label and readability is less noticeable. But one can notice that for approximately 90% of the ''hard to diagnose'' slides, there was a joint presence of several preparation artifacts such as blood, mucus and bacteria. Similar studies have been conducted for histological slides [223], [224] and reported a rate of around 20% of quality issues (mainly due to out-of-focus and stitching). Figure 10 presents the stream of slides according to their labels. Some labels and days were gathered for readability purposes when the amount of associated data in each was too small. Regardless of the label, approximately the same proportion of WSIs arrives at each time step. In addition, from the perspective of capturing examples to build a machine learning training dataset, some labels are scarcer than others. In this laboratory, it would take approximately a month to collect 100 paucicellular WSIs and a few days for atrophic WSIs. To complete this journey through the labels, we analyze the data to check the interdependence between the labels. In other words, we want to ensure that the chosen labels are good descriptors with few redundancies in the overall information they provide. The uncertainty coefficient (entropy coefficient or Theil's U) is a non-symmetric measure based on the conditional entropy between two phenomena A and B. Knowing phenomenon A, it checks if we can predict phenomenon B. We use this measure to check the predictability of labels according to others. Because the coefficient is non-symmetric, if knowing the degree of B for a WSI allows us to estimate its A feature, we cannot systematically infer that knowing its degree of A allows us to estimate how it is affected by feature B. Figure 11 shows that the Read class is mainly due to the presence of the labels Blur and Bubb.
Similarly, the presence of Blur and Bubb affects the readability status Read of the slide. Conversely, the Pauc label affects the Read class of the slide, but knowing the Read characteristic of the slide rarely implies a Pauc label. None of the other labels used were correlated according to the uncertainty coefficient. This confirms that the chosen labels form an orthogonal semantic space. Therefore, the annotations build a good foundation for describing digital slides from a diagnostic perspective.
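For reference, the uncertainty coefficient can be computed from two annotation columns as U(X|Y) = (H(X) − H(X|Y)) / H(X). The following is a minimal sketch using only the standard library; the function names are illustrative, and returning 1.0 when X has zero entropy is a convention chosen here, not something stated in the text.

```python
from collections import Counter
import math

def entropy(values):
    """Shannon entropy H(X) in nats."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """Conditional entropy H(X | Y) in nats."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum((c / n) * math.log(c / y_counts[yv])
                for (_, yv), c in xy_counts.items())

def theils_u(x, y):
    """U(X | Y): fraction of the uncertainty of X explained by knowing Y."""
    hx = entropy(x)
    if hx == 0.0:
        return 1.0  # convention: a constant X is trivially predictable
    return (hx - conditional_entropy(x, y)) / hx

# Toy annotations (made up): blur fully determines readability here,
# but readability does not fully determine the blur level.
read = ["-", "-", "+", "+", "+", "+"]
blur = ["++", "+", "-", "-", "-", "-"]
print(theils_u(read, blur), theils_u(blur, read))
```

On this toy data the two directions give different values, which is the asymmetry discussed above.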
As the last window opened on those data, we analyze the impact of concomitant labels on the WSI. If a single noise disturbs the interpretation of a WSI, the addition of a new perturbation can increase the difficulty of the diagnosis. Figure 12 shows the proportion of Read+ / Read− WSIs according to the number of different labels characterizing the WSI positively. For instance, we counted 3 for a WSI labeled as Bloo+, Atro+, and Mucu+. The number of labels and the readability characteristic of the WSI are correlated: the more a WSI is affected by disturbances, the less trustworthy its diagnosis is. This highlights that analyzing WSIs from different angles is a necessary strategy: taken individually, each label does not necessarily warn of a potential issue on the slide. A WSI can be categorized with regard to its analyzability only when analyzed holistically with orthogonal features. In addition, by defining each case that can be encountered in a ''divide and conquer'' approach, the detected cause can lead to adequate countermeasures if anticipated. In the case of multiple causes compromising the readability of a WSI, the best set of actions to take is the one with the lowest cost for the laboratory, both in terms of time and material resources.
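The counting used for this analysis can be sketched as follows. The data layout (a dictionary of label levels plus a readability flag per slide) and the function names are assumptions made for illustration; the actual annotation format is not described in the text.

```python
from collections import defaultdict

def count_positive(labels):
    """Number of labels whose level is '+' or '++'."""
    return sum(1 for level in labels.values() if level in ("+", "++"))

def unreadable_rate_by_count(slides):
    """Map each positive-label count to the share of Read- slides.
    slides is an iterable of (labels, readable) pairs."""
    totals = defaultdict(int)
    unreadable = defaultdict(int)
    for labels, readable in slides:
        k = count_positive(labels)
        totals[k] += 1
        if not readable:
            unreadable[k] += 1
    return {k: unreadable[k] / totals[k] for k in sorted(totals)}

# The example from the text: Bloo+, Atro+ and Mucu+ count as 3.
example = {"Bloo": "+", "Atro": "+", "Mucu": "+", "Blur": "-"}
print(count_positive(example))
```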

VIII. PERSPECTIVES
From this state of affairs, we can see that there are still many improvements to be expected for QC of WSIs for routine clinical diagnosis. In particular, although some recent methods [56], [73] have opened the way toward global quality assessment of WSIs, there is still much to do. First, the methods have to be able to detect different quality problems simultaneously and to globally assess the slide quality in terms of analyzability for a final diagnosis (by a human or a computer-aided diagnosis system). Second, quality control should be able to detect out-of-scope items to avoid their analysis by an AI. In particular, too little work has been done on artifact detection in WSIs. Third, the methods have to be robust to the different quality situations that can occur in real practice. In particular, most of the methods for quality control are developed on specific small-size datasets. Although they can work reasonably well on the datasets they were trained on, their ability to generalize to data coming from other centers (using different slide preparation and digitization protocols) is insufficiently explored. Only the recent work of [225] has explored how deep neural networks perform on corrupted images (with compression, focus, color, and artifact issues) and shown that such corruptions can severely decrease the prediction accuracy. As a result, none of the existing quality assessment tools is ready for clinical use. However, it can be expected that novel techniques using larger datasets will converge to models close to being deployed for clinical practice.
We propose to draw an outline of what a QC tool for whole slide images should do and how it could be included within a computational pathology pipeline. The pipeline is as generic as possible as it encapsulates any histology and cytology routines. A similar pipeline was proposed in [226]. This is illustrated in Figure 13. In this prototypical pipeline, quality control checks are performed on two nodes: after digitization and conjointly with the AI module used in this DP pipeline. After the preparation of the biological sample, a scan is performed by a WSI scanner. The latter must be calibrated using a calibration slide to ensure color fidelity and reproducibility. The slide can be rejected if it needs to be re-scanned (digitization problem), re-stained (staining preparation problems), re-prepared (does not contain sufficient material), or accepted and analyzed for CAD. Just after digitization, several checks configured with the laboratory constraints and material (protocol for staining, the chemical used, machine for several preparation steps) are made:

A. D TO D -RE-DIGITIZE
Check if the minimum requirements are met in terms of magnification, microns per pixel, and size of the scanned area. Automatic detection of glitches, blur, foreign bodies, and other features that are relatively pipeline-agnostic can lead to scanner reconfiguration or glass slide cleaning before generating a new WSI.
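A minimal sketch of such a requirements check is given below. The `ScanInfo` structure, the function name, and all threshold values are hypothetical; in practice they would be set from the laboratory constraints and the scanner metadata.

```python
from dataclasses import dataclass

@dataclass
class ScanInfo:
    magnification: float  # objective power, e.g. 40.0 for 40x
    mpp: float            # microns per pixel at the highest resolution
    area_mm2: float       # scanned area in square millimeters

def digitization_issues(scan, min_magnification=40.0, max_mpp=0.25,
                        min_area_mm2=100.0):
    """Return the unmet minimum requirements (thresholds are hypothetical)."""
    issues = []
    if scan.magnification < min_magnification:
        issues.append("magnification too low")
    if scan.mpp > max_mpp:
        issues.append("resolution too coarse")
    if scan.area_mm2 < min_area_mm2:
        issues.append("scanned area too small")
    return issues

# A non-empty list would trigger the re-digitize action.
print(digitization_issues(ScanInfo(20.0, 0.48, 250.0)))
```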

B. D TO P -RE-PREPARE
Check gamut deviation according to laboratory standards (depending on the protocol, chemicals, and machine used). If the stain preparation was incorrect, the colors can be too far from what is expected. Such a gamut test can be performed using the estimated stain vectors [105].
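One possible form of such a gamut test, sketched under assumptions, is to compare each estimated stain vector with a laboratory reference vector and flag the slide when the angle between them exceeds a tolerance. This is not necessarily the test used in [105]; the function names, the reference vectors, and the 10-degree tolerance are illustrative placeholders.

```python
import math

def angle_deg(u, v):
    """Angle in degrees between two stain vectors (e.g. in optical density space)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # clamp against rounding errors
    return math.degrees(math.acos(cos))

def stain_deviates(estimated, reference, max_angle_deg=10.0):
    """True if any estimated stain vector is too far from its reference."""
    return any(angle_deg(e, r) > max_angle_deg
               for e, r in zip(estimated, reference))

# Placeholder vectors, not measured values.
reference = [[0.6, 0.7, 0.3], [0.1, 0.9, 0.2]]
estimated = [[0.6, 0.7, 0.3], [0.7, 0.2, 0.6]]
print(stain_deviates(estimated, reference))
```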

C. D TO S -RE-SAMPLE
Some issues need another extraction from the initial sample e.g., when a large fold is detected in a histological WSI.
After these checks, the WSI can be used by a pathologist to perform their examination. With the addition of an AI module as a certified assistant for a pathologist, new constraints and new indicators can be introduced, each potentially leading to a backward action toward any prior step of the pipeline.

D. AI TO D -RE-DIGITIZE
The scanner needs to meet the requirements for creating WSIs in a compatible format and with enough information to allow the use of the AI module as intended.

E. AI TO P -RE-PREPARE
Sometimes, color deviation from the standard is not captured by the quality control performed in the prior steps. Based on a fine analysis of the training data and data augmentation, the AI module should be self-aware of its range of acceptability. Countermeasures can be applied with color/style transfer from the domain of the processed WSI to the domain described by the training dataset. In the same way, an AI can be self-aware of its weaknesses by incorporating a module for the detection of known perturbations (artifact, blur, glitch, or pipeline-specific disturbance such as a large presence of blood). Any error detection module listed above must indicate the backward action to take to bypass the issue encountered.

F. AI TO S -RE-SAMPLE
In the same way, automatically detected issues can lead to the preparation of a brand-new sample (lack of biological material or un-recoverable preparation error).
Backward actions do not have to be automatic. When issues are located precisely on a WSI, joining data analysis with a map of trusted/corrupted areas allows the pathologist to obtain the final word. WSI containing blurred areas but visible suspicious regions should not be filtered out blindly. Finally, if a set of causes leads to difficulties in establishing a diagnosis (as depicted in our corpus analysis, Section VII) the best set of actions to take is the one minimizing the number of steps that have to be redone. Any other phenomenon that does not fit this ontology should be considered by enhancing the AI module with this new type of object or changing the laboratory workflow with a preemptive procedure.
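The choice of a backward action when several issues are detected can be sketched as follows. The mapping from issues to actions follows the examples given in the text, but the issue names are hypothetical, as is the rule that restarting from the most upstream required step also redoes every downstream step in a single pass.

```python
# Backward actions ordered from the most upstream to the most downstream step.
STAGES = ["re-sample", "re-prepare", "re-digitize"]

# Hypothetical mapping from a detected issue to the action it requires.
ISSUE_TO_ACTION = {
    "blur": "re-digitize",
    "glitch": "re-digitize",
    "color_deviation": "re-prepare",
    "air_bubble": "re-prepare",
    "insufficient_material": "re-sample",
    "large_fold": "re-sample",
}

def backward_action(issues):
    """Return the single action covering all issues, or None if there are none."""
    if not issues:
        return None
    return min((ISSUE_TO_ACTION[i] for i in issues), key=STAGES.index)

def steps_to_redo(action):
    """Number of pipeline steps redone when restarting from this action."""
    return len(STAGES) - STAGES.index(action)

print(backward_action(["blur", "color_deviation"]))
```

An issue missing from the mapping raises a `KeyError` in this sketch, which corresponds to a phenomenon that does not yet fit the ontology and calls for extending it.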

IX. CONCLUSION
In this paper, we considered the quality of WSIs in digital pathology. Many factors in a digital pathology pipeline can have a strong influence on the quality, from slide preparation to slide digitization. In particular, we reviewed issues related to quality concerning the presence of sample preparation artifacts, compression artifacts, color variations, and out-of-focus areas. We have reviewed the state-of-the-art computational methods that have been proposed for their detection.
In the quality process driven by diagnosis, we have established the notion of the analyzability of a WSI. The latter can be obtained from the previous computational methods that analyze the quality of slides. They can be blended to assess whether a slide is of sufficient quality to be used by a computer-aided diagnosis system, or should be re-prepared (too many physical artifacts), re-stained (too much color deviation), or re-scanned (too many out-of-focus areas).
As the most recent competing methods rely on deep learning, we also addressed issues related to the quality of the data used for learning deep models, and ways to cope with this problem.
Finally, to illustrate the importance of quality control in the daily practice of a real laboratory, we have labeled and analyzed the quality issues of cytological WSIs digitized during one month. This analysis confirms that preparation artifacts (e.g., air bubbles) or digitization artifacts (e.g., out-of-focus areas) were an issue for at least 15% of the slides. The more artifacts are present, the less analyzable the slide is by a human expert, and therefore the more difficult it might also be to analyze using an AI-based system.
Based on this observation, we have drawn perspectives on how a computational quality process can be included in a computational diagnosis pipeline.
ROMAIN BRIXTEL received the Ph.D. degree in computer science, artificial intelligence and natural language processing from the University of Caen, France, in 2011.
From 2012 to 2016, he pursued his work with postdoctoral studies at the University of Caen and Lausanne, in various fields driven by machine learning perspectives such as multimodal and multiscale document studies, plagiarism detection and charisma analysis. He then joined Datexim, a company dedicated to medical image analysis, where he became CSO in 2019 and builds projects with digital pathology breakthroughs in mind. He was the Head of the MMI Department, from 2015 to 2021. Since 2022, he has been the Deputy Director of the GREYC CNRS Research Laboratory. Applications of his works include computational photography, computer-aided diagnosis, and computer vision. He is the author of two books, 11 book chapters, 50 articles in international journals, and more than 120 in international conferences. His research interests include graph-based signal processing, adaptive and multidimensional mathematical morphology, and machine learning. He is a member of SPIE, IAPR, and EURASIP. He was an Associate Editor of the IEEE TRANSACTIONS ON  From 2005 to 2016, he worked as a Cytotechnologist in a private laboratory. In 2016, he joined the Datexim Company, where he is still working as a Product Manager and a Clinical Expert. Passionate about both gynecological and non-gynecological cytology, as well as histology, he also fulfilled his interest in new technologies by collaborating in the development of the first CE-marked software for cervical cancer screening. He received the International Academy of Cytology Certification, in 2011.

SÉBASTIEN BOUGLEUX
BENOÎT LEMOINE received the M.Sc. degree in machine learning and image processing from Ensicaen, Caen Normandy, France, in 2020.
He then joined the Datexim Research and Development Team, as a Junior Developer, working to extend the first CE-marked software for cervical cancer screening. As a Machine Learning Enthusiast, he stays in touch with the state of the art to propose innovative tools for digital pathology.
MATHIEU FONTAINE received the Ph.D. degree in computer science, on constraint satisfaction problem from the University of Caen, France, in 2013.
After spending two years working on facial recognition, he joined Datexim, as a Research and Development Engineer to work on image similarities detection then on artificial intelligence in digital pathology, more specifically on cell nucleus detection and classification for cervical cancer screening, in 2014.
DALAL NEBATI received the M.Sc. degree in microbiology from Oran University, Algeria, in 2015.
From 2011 to 2017, she worked as a Research and Development Engineer and participated in the conception of clinical trials and in the development of the first CE-marked software for bladder cancer. In 2016, she successfully completed training in cervico-vaginal cytopathology. In 2022, she joined Datexim Company, as a Cytology Engineer involved in various research projects on cervical cancer. She is passionate about digital pathology and artificial intelligence and has spent years infusing her expert knowledge in this field of oncology and pathology into innovative digital applications.
ARNAUD RENOUF received the Ph.D. degree in computer sciences with a specialization in image processing and artificial intelligence from the University of Caen, France, in 2011.
He founded Datexim, in 2011, where he started as the COO and the CSO of the company. In 2018, he became the President of Datexim, building the international image of the company, leading the collaboration with other international industrial and academic partners, and bringing his expertise to the customers worldwide through his business development activities. From a science background, he quickly improved his skills in business development and company management thanks to the HEC Paris Training Program (Challenge+) and another training at EM Lyon.