Designing an ML Auditing Criteria Catalog as Starting Point for the Development of a Framework

Although AI algorithms and applications are becoming more and more popular in the healthcare sector, only a few institutions have an operational AI strategy. Identifying the processes best suited for ML algorithm implementation and adoption is a big challenge. Also, raising human confidence in AI systems is fundamental to building trustworthy, socially beneficial and responsible AI. A commonly agreed AI auditing framework that provides best practices and tools could help speed up the adoption process. In this paper, we first highlight important concepts in the field of AI auditing and then restructure and subsume them into an ML auditing core criteria catalog. We conducted a scoping study in which we analyzed sources associated with the term “Auditable AI” in a qualitative way, utilizing best practices from Mayring (2000), Miles and Huberman (1994), and Bortz and Döring (2006). Based on referrals, additional relevant white papers and sources in the field of AI auditing were also included. The literature base was compared using inductively constructed categories. Afterwards, the findings were reflected on and synthesized into a resulting ML auditing core criteria catalog. The catalog is grouped into the categories Conceptual Basics, Data & Algorithm Design and Assessment Metrics. As a practical guide, it consists of 30 questions developed to cover the mentioned categories and to guide ML implementation teams. Our consensus-based ML auditing criteria catalog is intended as a starting point for the development of evaluation strategies by specific stakeholders. We believe it will be beneficial to healthcare organizations that have been or will start implementing ML algorithms, not only to help them prepare for any upcoming legally required audit activities, but also to create better, well-perceived and accepted products. Potential limitations could be overcome by utilizing the proposed catalog in practice on real use cases to expose gaps and further improve the catalog. Thus, this paper is seen as a starting point towards the development of a framework, where essential technical components can be specified.

Artificial Intelligence (AI), especially Machine Learning (ML) algorithms, are becoming more and more popular in the healthcare market. In the U.S., only 7% of hospitals have a fully operational AI and automation strategy, even though 90% have started a draft [4, p. 4]. Identifying the best suited processes for ML algorithm implementation is one of the biggest challenges according to the study by Sage Growth Partners and Olive [4, p. 4]. It is not easy to answer this ex ante. To do so, the algorithm needs to be successfully audited, with robust evidence available that proves the algorithm is able to perform the task within defined quality control metrics.
Von Twickel et al. [5, p. 2] highlight that a generally agreed upon framework for auditing AI systems is required. This is in line with Wiegand et al. [6, p. 10], who state that ''[t]here is no agreed framework for assessing or reporting the results of health AI models.'' Brundage et al. [7, p. 11] mention that AI systems are intransparent and often closed source. This is crucial, as raising human confidence in AI systems is fundamental to building trustworthy, socially beneficial and responsible AI [8, p. 83]. According to the SAI of Finland [9, p. 11], for ''a well-functioning public sector'', personal data protection, decision explanation and bias are some of the main challenges of ML algorithm implementation. If not tackled well, those challenges could lead to ''obscured inefficiency . . . [and] damaged trust.'' The goal of this paper is to contribute to designing an approach for auditing machine learning algorithms. To do so, the focus is set on carving out relevant auditing aspects in the form of a core criteria catalog.
A commonly agreed AI auditing framework that provides best practices and tools could help speed up the ML adoption process in the healthcare market.
Before describing ways to perform audits and what pillars a supporting criteria catalog may be made of, it is important to first get a good understanding of the term Auditable AI (AAI). In this context, Dengel et al. [10, p. 91] mention AI systems that ''should be able to answer questions asked by humans and interact with them in an understandable way.'' To operationalize this definition, Benchekroun et al. [11, p. 2] state that ''[a]uditability . . . ensur[es] that the AI model behaves as expected.'' Next to this definition, Dengel et al. [10, p. 103] also provide an idea of how this could be achieved: by ''[querying] AI systems . . . externally with hypothetical cases'', whereas those cases can be based on real world data or be artificially created.
Explainability, on the other hand, which is often abbreviated by the term Explainable AI (XAI), makes sure that humans are able to derive a sufficient understanding of the model's inner workings [12, p. 7]. Ideally, the explanations are exhaustive and adjusted to the individual [13]. Lastly, ''Interpretability refers to the observation and representation of cause and effect within a system'' [10, p. 94]. Russell and Norvig [13, p. 729] accentuate that the focus is on comprehending the input/output behavior of an ML model, without necessarily opening the black box. Many authors argue that XAI is a necessary, but not always sufficient, prerequisite to achieve human comprehension of an AI system [12, pp. 7-8]. (White box models already contain eXplanation by Design (XbD), meaning the model itself is inherently explainable [14, p. 23]; see section III-B.)
In the first part of this paper, we explore important artifacts in the field of AAI using inductively constructed categories. In the second part, we further group and synthesize the findings into an ML auditing core criteria catalog.

II. METHODS
In order to achieve the aforementioned goal, it is necessary to get a good understanding of ongoing research in the field of AAI. Thus, we first conducted a scoping study, focusing on identifying commonalities and differences among themes, artifacts and patterns [15, p. 408]. Our method of choice was a qualitative content analysis. Here we utilized best practices from Mayring [1, pp. 3-6], Miles and Huberman [2, pp. 245-287] as well as Bortz and Döring [3, pp. 149-154].
In our search we used the exact term ''Auditable AI'' in Google Scholar, leading to 40 hits that translated into 41 sources. Out of the initial 41 sources, 24 were assessed as relevant and 13 as not relevant. Four sources were not accessible.
In addition to the 24 relevant sources from the initial literature research, we added seven relevant white papers in the field of AI auditing to the literature base. Furthermore, we included 10 additional sources based on expert recommendations.
We performed a detailed text analysis on the overall body of summarized literature, revealing the concepts and ideas behind AAI. Those 41 sources, which can be seen in table 1, aim to provide a concept, methodology, framework or use case for how to audit ML algorithms. The table also shows that the field of AAI is very new and rapidly changing, as the oldest publications are from 2018.
For the text analysis, we applied the step model process from Mayring [1, p. 4]. The outcome of the process are artifacts that are grouped around inductively created categories.
The results, structured as an AAI mind map, consisted of 12 main categories with a total of 800 artifacts. The next main step was to identify relevant artifacts and reflect, using our own experience, on how they could be synthesized into a resulting criteria catalog. We first created five categories containing 34 artifacts. Finally, we restructured the artifacts and put them into a logical connection, establishing an ML auditing core criteria catalog consisting of 30 questions. Those questions are grouped around the final categories Conceptual Basics, Data & Algorithm Design and Assessment Metrics. The final categories and their artifacts are presented in detail in the results section. A summary of the described method can be seen in figure 1.

III. RESULTS

A. CONCEPTUAL BASICS
AI might provide many Opportunities, for example, to increase economic productivity or to allow the development of new technologies/products. According to Clarke [16, pp. 429-430], ''AI's purpose is to extend human capabilities.'' In that sense, products that require repetitive tasks with very high precision and replicability, where humans usually perform poorly, could gain leverage. Application areas like decision support systems (DSS) are on the rise, where ML algorithms provide a human with the briefings necessary for a decision.
On the other hand, as with every new technology, there are Risks involved. A lack of incentives for industry to evaluate and steer threats, political manipulation and bias or missing human control in decision (support) systems are among the most often mentioned risks [17, pp. 3-4]. AI algorithms may even have purposely embedded bias or disinformation so that certain stakeholders can achieve their goals [18]. Often, the benefit resulting from a decision and the accountability/liability in case of harm do not lie with the same natural or legal person [17, pp. 3-4]. Last but not least, the famous phrase ''garbage in, garbage out'' also applies to AI [19, p. 130].
Thus, it is very important to establish good Risk Management. Clarke [20] categorizes risk management strategies into ''proactive'', ''reactive'' and ''non-reactive''. Tamboli [21, pp. 91-92, 98-100] suggests the use of ''red teams'' (ethical hacker teams) who challenge AI products and produce problematic inputs of natural or malicious type. For any ''residual risks'' remaining after all proactive measures were taken, the accountability among stakeholders should be clarified and options like utilizing ''AI insurances'' taken into account [21, pp. 105-106, 108-111]. Von Twickel et al. [5, p. 5] propose to ''transfer the concepts and methods from the classical IT domain to the AI domain''. In the context of risk management, they point to the ''IEC 61508 functional safety standard [life cycle]''. It is a technical framework for ''electrical, electronic and programmable electronic devices (E/E/PE)'', which ''sets requirements for the avoidance and [damage] control of systematic faults'' [22, pp. 7-9].
Before working on the ML algorithm itself, as with any well managed project, it is recommended to align on Methodology. Groza and Marian [23, pp. 4-5] suggest adapting the ''CRISP-DM'' concept. It was drafted in 1999 for the area of data mining and contains the following six process steps [24, p. 7]:
1) Business understanding
2) Data understanding
3) Data preparation
4) Modeling
5) Evaluation
6) Deployment
In the business understanding phase, the focus is on identifying the correct variables/features representing the business model. In step two, the correlation among features as well as the data distribution is examined. In step three, the focus is set on standardization and harmonization of the data. In step four, modeling starts, always keeping Occam's razor principle as well as GDPR's ''right to explanation'' in mind (for details see below and section III-B). In the evaluation step, the ML algorithm is checked against acceptance criteria metrics as well as for occurrence of bias. During deployment, questions of retraining and updating the model need to be discussed, to comply with the ''learning'' aspect of the ML algorithm.
Another concept is the ''five-step-cAI'' life cycle from von Twickel, Samek and Fliehe [5, p. 20]. It consists of five steps (phases):
1) Planning
2) Data
3) Training
4) Evaluation
5) Operation
It is possible to repeatedly jump back and forth between phases. In the planning phase, the developer sets the boundary conditions. This leads to the AI model characteristics like model family, dataset, algorithm and (hyper-)parameters. In the next three phases, the mostly experience driven data processing and design work takes place. The developer needs to closely supervise the training, test intermediate results and adjust (hyper-)parameters. Once the agreed development criteria are met (e.g., performance), the model is moved into operation.
The last methodological concept presented here comes from the FDA [26], [27], dealing with ''AI/ML as Software as a Medical Device (SaMD)''. It brings ''retraining'' and ''learning'' of an ML algorithm into focus. Questions like how developers can automatically deploy an updated version are discussed. The FDA's goal is to establish good machine learning practices (GMLP). They do premarket reviews, safety and performance monitoring and publish reports. Two core elements are given: ''SaMD pre-specifications (SPS)'' and ''algorithm change protocol (ACP)''. The first addresses ''what'' an algorithm becomes as it learns (with new data or updates). The ACP deals with methods to control the risks of modifications to the SaMD (including GMLP, good data governance, a retraining description or a performance evaluation update procedure).
Before going into concrete project activities, it is necessary to be aware of relevant Legal Acts/Policies and Standards. From the ones identified in the literature, which can be seen in table 2, two selected legal acts/policies are briefly presented.
The EU AI Act aims at specifying an audit process including conformity assessments in areas like ''data and data governance, documentation and record keeping, transparency and provision of information to users, human oversight, robustness, accuracy and security'' [38]. For ''unacceptable'' and ''high risk'' applications, the legislative draft mandates a description of the type of information AI producers have to provide, the form of the information and the addressees of the ML algorithm in order to allow a pre-market assessment [12, p. 4]. In June 2023, the European Parliament found consensus on a negotiating position that is being presented to the EU member states in the EU Council.
The EU General Data Protection Regulation (GDPR) has transparency as its overarching principle, which shall be achieved by transparent data processing, explaining what data is being processed and providing access to it. It also demands explanations of decisions and requires informed consent before processing any data [12, p. 5]. According to Chatila et al. [14, p. 22], the most relevant point regarding AI auditing is that it dictates the right to explanation in Recital 71 in combination with Article 22: ''obtain an explanation of the decision.'' Analogously, the existence of automated decision making, as well as the logic and consequences of data processing, needs to be made clear to the user [14, p. 22].
There are tools and methods of ML Documentation that can help fulfill those legal requirements. First, there are ''data sheets'' that provide information about data collection and data properties in a tabular, summarized overview [39, p. 283]. They are strongly reminiscent of the descriptive patient cohort statistics tables often used in the life sciences. Then there are ''AI model cards'' that provide characteristics capturing the model's effectiveness [14, p. 21]. ''Model fact labels'' are similar, but include additional metrics [39, p. 283]. Furthermore, there are ''AI fact sheets'' that are supposed to standardize the capturing and representation of facts about the whole AI model [14, p. 21]. Lastly, Chatila et al. [14, p. 21] introduced the term ''Care Labels'', providing guidance on how to use and ''treat'' the algorithm.
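To make these documentation artifacts more concrete, the following minimal sketch shows what a machine-readable AI model card might contain. The schema and all field names/values are our own illustrative assumptions, not a standardized format from the cited sources.

```python
# Illustrative sketch of a machine-readable "AI model card".
# All field names and values are hypothetical, not a standardized schema.
model_card = {
    "model_details": {
        "name": "credit-risk-classifier",   # hypothetical example model
        "version": "1.2.0",
        "type": "gradient boosted trees",
    },
    "intended_use": "decision support for credit applications; "
                    "not for fully automated decisions",
    "training_data": {
        # pointer to the corresponding data sheet for collection details
        "source": "internal loan applications 2018-2022",
        "size": 50_000,
    },
    # effectiveness characteristics, as captured by model cards
    "metrics": {"roc_auc": 0.87, "f1": 0.74},
    "ethical_considerations": "disparate impact checked per protected group",
    # "care label" style usage guidance
    "care_label": "retrain quarterly; monitor input drift in production",
}
```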
Of course, similar to reviewing a company's bookkeeping on a yearly basis, it is appropriate to also implement an Audit Process for the ML project implementation/operation. What does such a process typically look like? SAI of Finland et al. [9, p. 6] suggest these steps:
1) Documentation review
2) Code review
3) Reproduction of model training, testing, performance measures and perturbation tests
4) Alternative models development
The ''SMACTR auditing method'' from Google stands for Scoping, Mapping, Artifact collection, Testing and Reflection [23, pp. 3-4]. Following Groza and Marian [23], scoping looks at the motives and intended impact, as well as concepts for development. Mapping examines the data and how decisions are made. Artifact collection refers to documentation (e.g., in the form of model cards and data sheets). In the testing step, the developer engages thoroughly with the artifact in order to test for contradictions, biases or breakdowns. Lastly, during reflection, a risk analysis and mitigation strategy is set up and open questions are discussed (e.g., how to deal with inconsistencies among subgroups). It is important, following Chatila et al. [14, p. 30], to ''[include] external oversight by independent, competent and properly resourced regulatory authorities with appropriate powers of investigation and enforcement.''
One aspect that has gotten more and more attention recently is the consideration of Ethical AI Design. A synonym is ''ethics by design'', highlighting that ethical concepts are not designed around an ML model, but rather that the ML model is designed around ethical concepts [40, p. 2]. Making sure that human rights are not violated is the essence of those concepts.
Transparency, especially for ML algorithms applied in medicine, health care or life science, is of great importance when advocating the use of ML products. Kiseleva et al. [12, p. 8] mention three central functions: accountability, ensuring safety/quality and allowing informed decisions. The first is geared towards the vendor of the ML algorithm, making sure the physician acquires all necessary information about a ''black box model'' (see section III-B). The second implies continuous testing, debugging and auditing due to the high stakes in medicine. Only if the vendor provides enough information can physician and patient make a joint decision about the use of the ML algorithm in the current medical case, considering the patient's state, risk and potential outcome.
Lo [41] argues in his theory about the Paradoxical Transparency of XAI that with the increasing usage of deep neural networks, the asymmetry of knowledge between developer and user diminishes [41, pp. 6-9]. The ''code around the black box'' would be relatively easy to write and to audit [41, pp. 9-12]. The focus should therefore be set on increasing transparency in the pre-modeling stage by auditing the data collection, cleansing and governance of ML algorithm producers, as well as in the post-modeling phase by documented deployment and monitoring [41, pp. 9-12]. Lo [41, pp. 9-12] mentions that big companies (Facebook, Apple, Amazon, Netflix, Google) would be ''OK'' with external audits of that kind.
For AAI, as with any classical IT software program, the concept of Verification and Validation is important. The first asks if ''specifications of systems [are satisfied]'' and ''internal structural correctness of systems'' is given; the latter ''compares the system to the needs of stakeholders'' [42, p. 18]. Verification and validation have a long history in classical software engineering. However, according to von Twickel et al. [5, p. 8], when designing AI algorithms, a formal verification is only possible for a controlled hypothesis space and adequately understood structures.
Trusted AI, as described by Knowles and Richards [45, p. 2], contains the idea of establishing ''AI-as-an-institution.'' They explain that by using elements of signification (symbols of trust), legitimation (norms and values) and domination (allocative and authoritative resources), a society of systemic trust, which properly uses AI technology, can be established. An example of an ''element of signification'' is a public repository containing good cases of AI applications that successfully underwent an AI Certification [45, p. 7]. Such a certification process would make use of the three pillars of an AI Assurance Case, which consists of ''objectives & constraints'', ''argument'' and ''evidence'' [28, pp. 33-34].
Trust is also captured by the ''Assessment List for Trustworthy AI (ALTAI)'', which was developed by the High Level Expert Group on AI (HLEG) of the EU as part of their ''ethics guidelines'' [46, p. 3]. There, trustworthy AI is described by seven requirements, for example, ''1. human agency and oversight'', ''4. transparency'' or ''7. accountability.'' For each requirement, the European Commission and Directorate-General for Communications Networks, Content and Technology [46, pp. 7-22] provide a series of questions for self-assessment by developers, users or other third-party persons involved with the AI product (implementation).
When a credit application is rejected by an AI agent, questions about Fairness and Bias often arise. The first can be characterized by ''distributive'', ''procedural'' and ''interactional justice'', as well as ''personal principles of fairness'' [47, pp. 39-40]. For the second, light is shed on differential prediction (e.g., improper treatment of minority groups) or intentional discrimination (e.g., no proper representation of minority groups) [47, pp. 40-42].

B. DATA & ALGORITHM DESIGN
Switching to technical aspects, it is vital to first look at the ''driver of growth and change [of this century]'', Data itself [48]. The Assumptions made when capturing data that is used to train, test or validate an ML model are very important. The encoded occurrences of values of feature and target variables, their correlation and other statistical properties are created by an underlying data generation process (DGP) that should be well understood. Individual sample records should always come from that same DGP, be independent of each other and have identically, randomly distributed errors. If the DGP is not well understood, there is a chance that ''unknown unknowns'' in the form of mediators or confounders are left out in the ML model design phase. Following Breiman [49, p. 204], the idea is to include all those ''unmeasured variables'' in the ''model box'' (the ML algorithm) in order to ''emulate[] nature's box'' (the DGP under scrutiny). This exercise is not a one-way path: after having acquired the model's accuracy (see section III-C), and thus a measure of how well (or not) the DGP is captured by the model, it is common to work again on the data assumptions. Otherwise, the model's performance will be poor or biased.
Speaking about bias, it is also very critical that data taken for training, test or validation represents the relevant characteristics of the population in scope for the model. Clarke [16, p. 428] speaks here of the ''inferencing process's applicability to the particular problem.'' A prerequisite for any successful ML product design is well-structured data of high quality. Following Loshin [50, pp. 89-93], relevant Data Quality Dimensions in this paper's context are: ''accuracy'', ''consistency'', ''completeness'' and ''currency''. Accuracy describes whether the information adequately represents the underlying real-world object and consistency indicates whether the piece of information is not contradicted by another source. Completeness deals with missing and implausible values and currency asks whether the data collection time represents the latest possible state.
The ''art of data preparation'' that makes use of the four data quality dimensions is very important before starting with the model design itself. It is dangerous to work with uncleansed and unconsolidated data, especially with missing values and data coming from multiple sources [16, p. 428]. Each source has to be well understood, so that it is clear what data type is contained in a field, what the scale (nominal, ordinal, interval or ratio) and the reference range is.
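As a practical illustration of such preparation checks, the following minimal sketch (assuming the data arrives as a pandas DataFrame; the function and column names are our own) computes simple indicators for the completeness and currency dimensions. Accuracy and consistency require comparison against the real-world object or a second source and are therefore not covered here.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame, date_col: str) -> dict:
    """Sketch of pre-modeling data quality checks (completeness, currency).

    Accuracy and consistency cannot be checked from the table alone;
    they need a comparison against the real-world object or a second source.
    """
    return {
        # completeness: share of missing values per column
        "missing_ratio": df.isna().mean().to_dict(),
        # currency: how recent is the newest record?
        "latest_record": df[date_col].max(),
        # basic type overview per field (helps spot wrong scales)
        "dtypes": df.dtypes.astype(str).to_dict(),
        # duplicates, e.g., from merging multiple sources
        "duplicate_rows": int(df.duplicated().sum()),
    }
```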
After the origin and assumptions of the data have been clarified, it can be used to train an ML algorithm. Before doing so, it is imperative to take a look at the most important ML Algorithm Properties. The terms ''causality'' and ''correlation'' need to be clearly distinguished. The first describes the relationship of a cause to an effect, which ideally can be naturally derived. Imagine a person standing on rollerblades pushing against a wall. The person exerts a force in the direction of the wall (''actio'') and as a result experiences an oppositely directed force of equal strength (''reactio''). This is a typical example of a causal relationship that can be easily determined by a third-party observer. However, when dealing with a complex machine learning model that predicts the creditworthiness of a customer, the concept of causality is difficult to prove. There is no formula following the laws of physics stating that people of a certain age, income and education will not be able to pay a credit back. Instead, within the domain of statistics, the concept of ''correlation'' can be used. It describes how strongly one variable is related to another variable (influenced by or dependent on it). It can be among features (X) or between features and outcome (Y). A typical measure is ''Pearson's Correlation Coefficient''.
A value of 1 indicates a perfect relation between X and Y, 0 indicates no relation and −1 a perfect inverse relation. The normalization allows a comparison of the relationship among different variables (features and target) with different scales. It is important to note that a deterministic (linear) causal relationship implies perfect correlation, but a correlation coefficient of 1 does not automatically prove a causal relationship.
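The coefficient can be computed directly from the covariance and the standard deviations, as equation (1) in appendix A shows; a minimal sketch with illustrative data follows (numpy's corrcoef gives the same result).

```python
import numpy as np

# Minimal sketch: Pearson's r between a feature x and a target y.
# Values near +1/-1 indicate a strong (inverse) linear relation, 0 none.
x = np.array([23, 35, 47, 52, 61], dtype=float)  # illustrative ages
y = np.array([1.0, 1.5, 2.1, 2.4, 3.0])          # illustrative outcome

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # covariance
r = cov_xy / (x.std() * y.std())                   # normalize by std devs
print(r)  # equal to np.corrcoef(x, y)[0, 1]
```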
Looking at ML design options, there are two different Types of ML Algorithms: ''black box'' models and ''white box'' models. The first describes an entity that conceals its inner workings from the user or even the developer. Data goes in and a result comes out, without giving the user the possibility to comprehend the mechanics or rules of decision making. The most prominent example of this ML model type is a neural network, which is often used for natural language processing (NLP) or image recognition.
Traditionally, statisticians have been working with parametric models to directly design a formal relationship between input and output [49, p. 202]. Those ''white box models'' allow the user, within reasonable effort, to get an understanding of the inner workings. The user would be able to break down the decision points that determine the result. Typical examples of this type are generalized linear models (GLM) like linear or logistic regression, as well as decision trees.
In his well-cited ''The Two Cultures'', Breiman [49, pp. 209-214] argues that black box models are superior to white box models in terms of predictive power, as well as in terms of understanding the DGP (the internal mechanics).
He uses examples to show that a random forest can achieve a better prediction (lower misclassification rate) than logistic regression while correctly identifying the variables that majorly influence the decision (feature importance). His main point is that statisticians should work together with computer scientists to solve a problem using both black box and white box models rather than just assuming they can directly derive a DGP by establishing a parametric model.
However, according to ''Occam's razor principle'', given that a black box and a white box model have the same accuracy (performance) and both are able to sufficiently explain a hypothesis, the less complex white box model should be preferred [11, pp. 1-2].
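The following sketch illustrates this trade-off on a synthetic binary classification task (all data and parameters are illustrative stand-ins, not taken from the cited sources): if the logistic regression (white box) matches the random forest (black box) in accuracy, Occam's razor favors the former.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real use case (illustrative only).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

white_box = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
black_box = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

acc_white = accuracy_score(y_te, white_box.predict(X_te))
acc_black = accuracy_score(y_te, black_box.predict(X_te))

# Occam's razor: if both reach comparable accuracy, keep the
# inherently explainable white box model.
print(f"white box: {acc_white:.3f}, black box: {acc_black:.3f}")
```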
At the stage of ML algorithm development, before going into feature selection/engineering and model type selection, the Specifications need to be clear among the stakeholders. This includes a correctly formulated hypothesis (a narrowly described problem situation with expected algorithm behavior). Once the algorithm starts generating results, statistical testing should be done. Only then can the question be answered whether the right solution was built and whether it was properly designed. Additionally, when writing the specifications, acceptance criteria and metrics of success have to be defined. In terms of product safety, a ''kill switch'' should be integrated, allowing a human operator to shut down the ML algorithm in case of misbehavior (e.g., the autopilot of a plane). The implementation of the ML algorithm is usually a composition of classical IT (cIT) components and AI components.

C. ASSESSMENT METRICS
The next aspect that plays an important role for an AI auditing criteria catalog are assessment metrics. In principle, their nature is either Qualitative or Quantitative. The first utilizes descriptions or narratives, which are often given from a rather subjective perspective. That is in contrast to the latter, where an objective measurement is possible, often with subsequent mathematical calculations.
For the qualitative area, adequate model assumptions are very important. Those assumptions are constantly being made during the modeling process and might state how to deal with missing data, the engineering of model features, the model type itself and the chosen hyperparameters. Those are all choices that should be justified. Methods like sensitivity analysis or model validation can be used to ''assess the appropriateness of a particular model specification and to appreciate the strength of the conclusions being drawn from such a model'' [51, p. 263]. In the literature, the dimension ''auditability'' of an AI product is closely related to ''audit trails'', a ''traceable log'' that ''cover[s] all steps of the AI development process'' [7, p. 24]. Only if such a log exists can an appropriate level of human oversight be exercised, which is also required by article 14 of the proposed EU AI Act [38]. The aspects of transparency and trust already described in section III-A can be used to perform a textual assessment of a given AI model and use case. Lastly, for fairness, Becker et al. [28, p. 36] suggest using Acceptance Test-Driven Development (ATDD).

Coming to Performance Indicators, the ''Area Under the Curve (AUC)'' or ''F1-Score'' using ''Receiver Operating Characteristics (ROC)'' are commonly utilized. Using the example of assessing the creditworthiness of a bank customer, the first step in measuring the performance of the used ML product is to create a ''Confusion Matrix'', as can be seen in table 3.
The ML product is used on N credit applications and for every run the manual decision of the clerk is compared to the automated prediction of the algorithm. In case the algorithm accepts an application that the clerk also accepted, it is a ''true positive (TP).'' In case only the algorithm rejects the application, it is a ''false negative (FN).'' If both the clerk and the algorithm reject the application, it is a ''true negative (TN)'', and in case only the algorithm accepts the application, it is a ''false positive (FP).'' Afterwards, confusion metrics like sensitivity or fallout can be calculated.
A ''ROC graph'' is then generated by plotting the sensitivity on the Y-axis and the fallout on the X-axis. A perfect classifier would be at the top left corner (0,1), having no false negatives (FN) or false positives (FP), and a random classifier would lie on a diagonal from the origin (0,0) to the top right (1,1) [53, pp. 862-863]. The AUC, measuring the area (integral) underneath the plotted line, would be maximal (1) in the first case and 0.5 in the second case.
A widely used metric for accuracy is the ''F1-Score'', where a value of < 0.5 is considered a bad classifier. Alternatively, the ''Matthews Correlation Coefficient (MCC)'' can be used, where values > 0.7 indicate a strong correlation and thus a good classifier.
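A minimal sketch of how these performance indicators could be computed with scikit-learn follows; the labels and scores are illustrative stand-ins for the clerk's decisions and the algorithm's predictions from the credit example.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             matthews_corrcoef, roc_auc_score)

# Illustrative labels: clerk decision vs. algorithm prediction
# (1 = application accepted, 0 = rejected).
y_clerk = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_algo  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3, 0.95, 0.2]  # for ROC

tn, fp, fn, tp = confusion_matrix(y_clerk, y_algo).ravel()
sensitivity = tp / (tp + fn)  # true positive rate (Y-axis of the ROC graph)
fallout = fp / (fp + tn)      # false positive rate (X-axis of the ROC graph)

print("AUC:", roc_auc_score(y_clerk, y_score))    # 1.0 = perfect, 0.5 = random
print("F1 :", f1_score(y_clerk, y_algo))          # < 0.5: bad classifier
print("MCC:", matthews_corrcoef(y_clerk, y_algo)) # > 0.7: strong correlation
```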
When looking at metrics to evaluate potential discrimination, according to Bhaumik et al. [52, pp. 5-6], ''Statistical Parity (SP)'' or ''Disparate Impact (DI)'' can be used. As an illustration, the DI is used in U.S. labor law to investigate potential discrimination against women compared to men in employment settings [47, p. 40]. There, an industry convention has formed saying it should be ≥ 0.8, after accounting for the disturbance term; otherwise one might suspect discrimination [55].
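A sketch of how SP and DI might be computed from binary predictions and a group attribute follows (see also equations (8) and (9) in appendix A); the group coding and sample values are our own illustrative assumptions.

```python
import numpy as np

def fairness_metrics(y_pred, group):
    """Sketch: statistical parity difference and disparate impact ratio.

    y_pred: binary predictions (1 = favorable outcome)
    group:  1 = privileged group, 0 = unprivileged group (illustrative coding)
    """
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    p_priv = y_pred[group == 1].mean()    # P(Y_hat=1 | privileged)
    p_unpriv = y_pred[group == 0].mean()  # P(Y_hat=1 | unprivileged)
    sp = p_unpriv - p_priv                # statistical parity difference
    di = p_unpriv / p_priv                # disparate impact ratio
    return sp, di

sp, di = fairness_metrics([1, 0, 1, 1, 0, 1, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0])
# Convention mentioned in the text: DI >= 0.8, else suspect discrimination.
print(f"SP={sp:.2f}, DI={di:.2f}")
```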

D. TOOLS & METHODS
The assessed literature provides an abundance of tools and methods related to AI auditing or XAI. Those tools can be used complementarily when working with the ML auditing core criteria catalog presented in section III-E. Audit-related Tools or Methods can be further subgrouped into more conceptual/theoretic and more hands-on/practical ones, as can be seen in table 4. In the next paragraphs, a few exemplary tools and methods are chosen and further explained.
Looking at the hands-on tools, there are audit checklists like the one from Yakobovitch [59] or the one from SAI of Finland [9], which look more at the auditability of the solution itself. The first takes care of aspects like ''AI business outcome'', ''data source'' or ''privacy of data''.
The second comprises an audit catalog consisting of the six ''CRISP-DM'' steps, which were already presented in section III-A, but adding the seventh step ''operation of the model and performance in production'' [9, p. 6].
Additionally, there exist tool kits of different scope. ''Credo AI Lens'' looks at responsible AI and is supposed to help companies manage risks to achieve fair, compliant and auditable ML products [62]. It provides a technical evaluation of the data set and ML model, and guides developers with critical questions during the development process.
XAI Tools or Methods are usually grouped by the ML algorithm development stage when they are applied.An overview is provided in table 5.
There are also ''meta frameworks'' that contain collections of XAI tools for all development stages. Examples are the ''Censius Platform'' and the ''Tag Tool & Deep Learning Sandbox''. The first contains 23 tools for MLOps, as well as AI observability, monitoring and explainability tools, and contains an enterprise AI audit checklist [66]. The latter provides a web interface for testing purposes, mainly focusing on capabilities and performance metrics [10, p. 97].
Next to those company products, not-for-profit projects like ''OpenML'' have also been established, where ML algorithms and datasets can be shared among researchers, developers and users in order ''to work more effectively and collaborate on a global scale'' [67, p. 4].
It becomes clear that before modeling starts, the ''method'' is rather to have a good strategy for how to achieve a representative, high-quality training data set. The ''tools'' are mainly descriptive statistics that look at correlation and data distributions. During modeling, the focus is on the already described ''Occam's razor'' principle (see section III-B), meaning to prefer a ''white box'' model over a ''black box'' model, given that the relevant metrics are identical. Most tools look at the post-modeling stage. Well cited are ''LIME'' and ''SHAP''. The first, a model-agnostic method, fits linear regressions to ''simulate'' local behavior for concrete data input, as well as providing global explanations [11]. The second uses Shapley additive explanations to figure out the average marginal contribution of features [11].
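As an illustration of such post-modeling tooling, the following sketch applies SHAP's tree explainer to a random forest fitted on synthetic data. It is only one possible usage pattern; exact API details may vary across shap versions, and the data is purely illustrative.

```python
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative synthetic data and a fitted tree-based black box model.
X, y = make_regression(n_samples=500, n_features=6, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # model-specific explainer for trees
shap_values = explainer.shap_values(X)  # average marginal feature contributions
shap.summary_plot(shap_values, X)       # global view of feature importance
```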

E. ML AUDITING CRITERIA CATALOG
After having identified and described the artifacts relevant for auditing machine learning algorithms, the next step is to subsume and connect them into an ML auditing core criteria catalog. As a methodology, the authors follow the guidance from Gawande [68, p. 26], who lists requirements and procedures for creating a good checklist. The objective of the catalog is given in figure 2.
With this in mind, the resulting ML auditing core criteria catalog consists of 30 questions that are grouped around the categories ''Conceptual Basics'', ''Data & Algorithm Design'' and ''Assessment Metrics''. The questions are meant to guide the ML development team.

1) CONCEPTUAL BASICS
a: AI OPPORTUNITIES VS. AI RISKS
⃝ Is the expected benefit (benefit · probability) of a successful ML use case implementation greater than the expected damage (damage · probability) in case of failure?
⃝ Do you expect a productivity gain, improved quality or a new functionality compared to the current manual/non-ML process?
b: RISK MANAGEMENT
⃝ Are the roles and responsibilities (RACI) and liabilities before, during and after the implementation clearly defined? (RACI: when working in teams it needs to be clear who is Responsible for a given task, who is Accountable especially if something goes wrong, who needs to be Consulted for advice and who must be Informed about the progress.)
⃝ Do you have a proactive, reactive and/or non-reactive risk management strategy in place? For example, have you planned to implement a ''kill switch'' with measures to (temporarily) go back to the old process?
c: METHODOLOGY
⃝ Have you aligned and agreed on the methodology with all project stakeholders (e.g., CRISP-DM for implementation and SMACTR for internal audit)?
⃝ Are the implications in case the ML use case falls in the ''high risk'' category of the EU AI Act understood?
⃝ Do you plan to make use of Data Sheets to describe the data collection process as well as the data properties?
⃝ Do you plan to create AI Model Cards/AI Fact Sheets to describe the model characteristics?
⃝ Do you plan to prepare AI Care Labels to instruct internal stakeholders how to use and ''treat'' the algorithm?
d: AUDIT PROCESS
⃝ Have you established an internal advisory committee consisting of senior IT governance specialists and business/medical specialists who critically accompany the implementation (e.g., watch for sufficient documentation and methodology adherence)?
⃝ Do you ensure the ML implementation is not violating ethical concepts (''ethics by design'' is considered)?
⃝ Do you have protocols in place that allow independent, external auditors to critically review the ML use case implementation?
e: QUALITY ASSURANCE (QA)
⃝ Did you perform a verification of the ML output behavior using a set of expected, representative inputs from productive usage?
⃝ Did you perform a validation whether the project's specification and stakeholders' needs are met?

IV. DISCUSSION
The goal of our paper was twofold: first, we wanted to explore and highlight the most important auditable-AI-related artifacts existing in the literature; second, we wanted to provide valuable input for auditing ML algorithms in the form of a core criteria catalog.
Our main outcome is a 30-question, consensus-based ML auditing core criteria catalog that is intended as a starting point for the development of evaluation strategies by specific stakeholders.
The catalog is organized in the categories Conceptual Basics, Data & Algorithm Design and Assessment Metrics. Section III (results) presents the referenced artifacts in the same logical sequence as they appear in the catalog, to simplify its usage.
We grouped artifacts dealing with AI opportunities/risks, risk management, methodology, audit process and quality assurance (QA) under the category Conceptual Basics (section III-A). This term was chosen because our recommendations there mostly deal with business, process and people related topics. Those need to be taken care of before going into the Data & Algorithm Design (section III-B), where the main artifacts describe data properties and algorithm design. In Assessment Metrics (section III-C), we distinguished qualitative assessments from quantitative assessments and provided detailed instructions on how the latter can be calculated.

The contents of the catalog are the result of the ''auditable AI'' qualitative content analysis (see section II). We used many iterations to sort out irrelevant artifacts and synthesize relevant ones into new ideas and categories. We believe that this catalog will be beneficial to healthcare organizations that have been or will start implementing ML algorithms, not only to help them prepare for any upcoming legally required audit activities, but also to create better, well-perceived and accepted products for patients/customers.
With section III-D (tools & methods), even though it is not contained in the catalog, our intention is to provide a summary of programs and smaller helpers that could be used alongside the ML auditing criteria catalog. We discovered those during the literature study.
We had to decide on a trade-off between breadth and depth for our ML auditing criteria catalog. Also, for practical reasons, we had to limit the scope and time period of the qualitative content analysis in the first place. Therefore, it is possible that some AAI aspects are missing and/or have not been dealt with in the necessary detail.
Those limitations can be overcome by utilizing the proposed catalog in practice on real use cases or hypothetical use cases with artificially generated data. In doing so, potential gaps could be exposed and the catalog further improved. Thus, this paper is seen as a starting point towards the development of a framework, where essential technical components can be specified.

V. CONCLUSION
Our motivation for this paper was the existing lack of commonly agreed procedures and content for auditing ML algorithms, which we pointed out based on different sources. We opened with a Sage Growth Partners and Olive study showing a lack of AI adoption in the healthcare sector. In order to help speed up this adoption, we provided a core criteria catalog as a starting point for further developments.
As an additional valuable input for the new and rapidly evolving field of ''Auditable AI (AAI)'', we included a definition of the term and contrasted it to ''eXplainable AI'' and ''Interpretability''.
The 30 questions of the core criteria catalog are contained within the categories Conceptual Basics, Data & Algorithm Design and Assessment Metrics. The level of detail is purposely kept rather compact, to focus on the most relevant artifacts that are valid for most stakeholders who want to implement an ML algorithm.

APPENDIX A DIFFERENT REPRESENTATIONS OF AAI MINDMAP AND CRITERIA CATALOG
The most relevant artifacts are explored in an AAI mind map that is shown in figure 3 (top level view). It consists of 12 main categories with a total of 800 artifacts. The category Tools/Methods contains the most artifacts (203), followed by Definitions (159), Dimensions (82) and Philosophy/Ethics (82).
Figure 4 is a representation of the AAI mind map as a word cloud.
In figure 5 the ML auditing core criteria catalog is displayed as a word cloud with the same parameters as the AAI mind map.
The complete mind map file, as well as auxiliary files of the core criteria auditing catalog can be accessed via OSF at: https://osf.io/tdr3p.
Pearson's Correlation Coefficient is given in equation (1):

$$r_{XY} = \frac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y} = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\,\sigma_Y} \tag{1}$$

$\mu_Y$ and $\mu_X$ are the variable means and $E$ calculates the expected value (weighted average of $(x_i-\bar{x})(y_i-\bar{y})$). $\sigma_X$ and $\sigma_Y$ are the respective standard deviations. The value and direction ($-\infty \le \operatorname{cov} \le +\infty$) of the relationship is already determined by the covariance and then normalized by the standard deviations, leading to values $-1 \le r \le 1$.

The Variance Inflation Factor (VIF) measures the degree of collinearity (linear correlation between features). It is given in equation (2):

$$\mathrm{VIF} = \frac{1}{1-R^2}, \qquad R^2 = 1 - \frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2} \tag{2}$$

where $y_i$ is the $i$-th observed outcome value of a data set, $\hat{y}_i$ the $i$-th predicted outcome value and $\bar{y}$ the mean of all observed outcome values.
The Shapiro-Wilk Test (SWT) examines normality (normally distributed values of each feature). As shown in equation (3), for the SWT of a sample of size $n$ of a given variable $x$, $x_{(i)}$ is the $i$-th order statistic (smallest to largest value) and $\bar{x}$ the sample mean:

$$W = \frac{\left(\sum_{i=1}^{n} a_i\, x_{(i)}\right)^2}{\sum_{i=1}^{n} \left(x_i-\bar{x}\right)^2} \tag{3}$$

The coefficients are $(a_1,\ldots,a_n) = \frac{m^T V^{-1}}{C}$, where $m = (m_1,\ldots,m_n)^T$ is created from the expected values of the order statistics, $V$ is the $n \times n$ covariance matrix of the order statistics and $C = \left(m^T V^{-1} V^{-1} m\right)^{\frac{1}{2}}$.
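Rather than implementing equations (2) and (3) by hand, both checks are available in common Python libraries; a minimal sketch on synthetic, illustrative data follows.

```python
import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative feature matrix (columns = features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)  # induce collinearity

# VIF per feature: values well above ~5-10 signal collinearity.
for i in range(X.shape[1]):
    print(f"VIF feature {i}: {variance_inflation_factor(X, i):.2f}")

# Shapiro-Wilk: p-value < 0.05 rejects normality of a feature.
stat, p = shapiro(X[:, 0])
print(f"SWT W={stat:.3f}, p={p:.3f}")
```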

Algorithm 1 Breusch-Pagan Test (BPT)
Assumption:
Null hypothesis $H_0$: homoscedasticity is present (equally distributed variance of residuals)
Alternative hypothesis $H_1$: heteroscedasticity is present (unequally distributed variance of residuals)
Step 1: Fit a regression model so that $\hat{y}_i = f(X_i, \beta)$
Step 2: Calculate the squared residuals $\hat{e}_i^2 = (y_i - \hat{y}_i)^2$
Step 3: Fit an auxiliary regression of the squared residuals $\hat{e}_i^2$ on the features $X_i$
Step 4: Obtain the coefficient of determination $R'^2$ of the auxiliary regression
Step 5: Perform the Chi-Square test $X^2 = nR'^2$
Step 6: Derive the p-value using $X^2$ and DF according to the $\chi^2$ distribution
Result:
if p-value ≤ 0.05 then $H_0$ can be rejected and $H_1$ applies: heteroscedasticity is present
else $H_0$ cannot be rejected: homoscedasticity is present
end if

The Breusch-Pagan Test (BPT) checks for heteroscedasticity (non-constant variance of the residuals). The sequence of the BPT with explanation is given in algorithm 1.
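The BPT is likewise available off the shelf; the following sketch (synthetic, illustrative data) runs steps 1-6 via statsmodels.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Illustrative regression with heteroscedastic noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = 2.0 * x + rng.normal(scale=0.5 * x)        # variance grows with x

exog = sm.add_constant(x)                      # Step 1: fit y_hat = f(X, beta)
resid = sm.OLS(y, exog).fit().resid            # Step 2: residuals

# Steps 3-6 (auxiliary regression, chi-square test) are done internally.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, exog)
print(f"LM={lm_stat:.2f}, p={lm_pvalue:.4f}")  # p <= 0.05: heteroscedasticity
```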
Typical confusion metrics are Sensitivity $= \frac{TP}{TP+FN}$, Fallout $= \frac{FP}{FP+TN}$ and Precision $= \frac{TP}{TP+FP}$. The ''F1-Score'' is given in equation (4):

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}} \tag{4}$$

It is the harmonic mean of sensitivity and precision, also lying in $[0, 1]$, and is not of a symmetric nature. That means if the rejected and accepted group members are interchanged (swapping of class labels), with everything else constant, the F1-Score will be different.
The ''Total Sobol's Variance Ratio (TSVR)'' explains the total effect (incl. interactions) of an input variable $X_i$ on the variance of the output $Y$ and is given in equation (6):

$$S_{T_i} = \frac{E_{X_{\sim i}}\left[\operatorname{Var}_{X_i}\left(Y \mid X_{\sim i}\right)\right]}{\operatorname{Var}(Y)} \tag{6}$$

$X_{\sim i}$ denotes the set of all input variables except $X_i$. The larger $S_{T_i}$, the larger the contribution of $X_i$ to the variance of $Y$.

The ''Cosine Similarity Vector Pairs (CSVP)'' is used in the context of binary classification models in order to ''find . . . the number of vector pairs that are very similar in values but have been predicted to be in different classes.'' The formula is given in equation (7):

$$S_C\left(\vec{A}, \vec{B}\right) = \frac{\vec{A} \cdot \vec{B}}{\lVert\vec{A}\rVert \, \lVert\vec{B}\rVert} \tag{7}$$

The Statistical Parity (SP) is shown in equation (8) and the Disparate Impact (DI) in equation (9):

$$\mathrm{SP} = P\left(\hat{Y}=1 \mid D=\text{unprivileged}\right) - P\left(\hat{Y}=1 \mid D=\text{privileged}\right) \tag{8}$$

$$\mathrm{DI} = \frac{P\left(\hat{Y}=1 \mid D=\text{unprivileged}\right)}{P\left(\hat{Y}=1 \mid D=\text{privileged}\right)} \tag{9}$$

They describe whether an ML model tends to predict the positive result more often for members of a privileged group than for members of a minority group, even though the attributes or characteristics of the privileged group are not supposed to act as features in the model. Ideally, members of the privileged group have the same Sensitivity and Fallout as members of the non-privileged group, and the observed differences should only result from the disturbance term (being close to 0).

FIGURE 1. Main process steps of the method.

FIGURE 3. Iteratively created categories of the ''Auditable AI'' qualitative content analysis (top level view). The first # behind each node gives the number of children, and the second # the sum of descendants. The total number of artifacts is 800.

FIGURE 4. Word cloud of the ''Auditable AI'' mind map file. Each text attribute of the 800 artifacts (nodes) is split into words and then displayed according to the calculated rank. Words with less than two characters are ignored.

FIGURE 5. Word cloud of the 30-question ML auditing core criteria catalog. Each sentence is split into words and then displayed according to the calculated rank. Words with less than two characters are ignored.

$\vec{A}$ might contain the input variables (features) of a customer whose credit request got approved and $\vec{B}$ might contain the input variables of a customer who got rejected. An $S_C(\vec{A}, \vec{B})$ value close to 1 indicates that the customers are very similar and a value close to −1 indicates the customers are very different.

Overview of analyzed literature.

AI auditing related legal acts/policies or standards.

Confusion matrix for binary credit worthiness classifier.

Overview of AI auditing related tools or methods.

Overview of explainable AI (XAI) related tools or methods.

FIGURE 2. Objective of the ML auditing core criteria catalog.
⃝ Do you think the ML model would pass an external AI Certification/AI Assurance Case fulfilling the six components of trust: predictability, dependability, faith, consistency, utility and understanding?
⃝ Are the model assumptions (e.g., how to deal with missing data, model type, hyperparameters) transparently described?