From Hume to Wuhan: An Epistemological Journey on the Problem of Induction in COVID-19 Machine Learning Models and its Impact Upon Medical Research

Advances in computer science have transformed the way artificial intelligence is employed in academia, with Machine Learning (ML) methods easily available to researchers from diverse areas thanks to intuitive frameworks that yield extraordinary results. Notwithstanding, current trends in the mainstream ML community tend to emphasise wins over knowledge, putting the scientific method aside and focusing on maximising metrics of interest. Methodological flaws lead to poor justification of method choice, which in turn leads to disregard for the limitations of the methods employed, ultimately putting at risk the translation of solutions into real-world clinical settings. This work exemplifies the impact of the problem of induction in medical research, studying the methodological issues of recent solutions for computer-aided diagnosis of COVID-19 from chest X-ray images.


I. INTRODUCTION
To respond to the overwhelming needs arising from the COVID-19 pandemic, considerable effort has been put into building computer-aided diagnosis solutions using machine learning methods, in the hope of speeding up the early detection of this novel coronavirus. This work aims to raise awareness of the risks of building models for computer-aided diagnosis without appropriate methodologies to justify the methods employed, in particular the countless solutions for computer-aided diagnosis of COVID-19 from chest radiograph (CXR) images that are not suited for clinical use (see Table 1). The recent literature on COVID-19 solutions has already brought to light transversal issues that extend beyond the problem tackled in such works, questioning the methods employed [1] and highlighting the poor quality of the datasets [2]. In contrast, this work focuses on the methodological flaws derived from the lack of domain knowledge, which affect how the problems are formulated in the first place and how methods are justified.
The associate editor coordinating the review of this manuscript and approving it for publication was Derek Abbott.
This work is directed at a general audience, regardless of field of expertise, and tackles the topic horizontally, from different angles such as Machine Learning (ML), epistemology, probability and pathology. Consequently, it does not intend to be technically exhaustive with regard to the aforementioned fields of science. More generally, the issues described in this paper concern the scientific method, with knowledge as the ultimate goal of science. In this sense, the current mainstream ML community shows a trend to emphasise wins over knowledge, especially within the challenge culture [3]. Such competitions have an increasing scientific impact in areas such as biomedical image analysis, but lack sufficient quality control for the translation of solutions into clinical practice [4], raising ethical and legal concerns regarding the diffusion of responsibility and liability [5]-[7].
The lessons and critique of this work can be extended to similar problems and challenges of translational medicine beyond the scope of COVID-19. The most direct consequence of the issues addressed in this paper relates to the poor transferability of computer-aided diagnosis solutions into hospitals and the lack of rigorous review and approval (e.g. by regulators) of solutions in which research institutions have invested their efforts, budget and resources with questionable results [8]-[10].
The rest of the paper is structured as follows: The remainder of this section provides some context on the problem of induction, an example of its limitations, and how they affect our case at issue. Section II describes how conventional ML solutions suffer from previously described problems. Finally, Section III discusses the impact on medical research.

A. CONTEXT
In 1739, still under the shadow of the bubonic plague in Europe, David Hume published A Treatise of Human Nature, presumably without knowing that his work would not only continue to be debated more than 200 years later, but also remain remarkably relevant to the technological advances of our time. In the problem of induction, Hume argues that we cannot make a causal inference by a priori means alone, and poses the question of how we can infer from the observed to the unobserved. For instance, we can certainly assert that every morning the sun rises approximately in the east thanks to our everyday observations, but that is not enough to explain why it happens in such a way; for that we would need a bit of domain knowledge [11, Ch. VI].
In ancient Rome, andabatae used to fight blindfolded in the arena, unaware of the beasts around them. Regardless of their training, they were doomed to fail. Similarly, we are often told that ML models are black boxes whose inside cannot be inspected, but equally dangerous is the fact that models cannot see what surrounds them. A model's best hope is that its target data resembles the data from its training period. In this sense, ML models can be regarded as inductive machines performing inductive inferences based on previous observations. The key performance indicator (KPI) of an ML model is its generalisation performance, measured by how well it will generalise and perform on novel data [12, Ch. 9.4, p. 100]. For the ML model to perform well on novel data, it is often assumed that novel data will resemble past data. Hume refers to this assumption as the Principle of Uniformity of Nature.
If reason determined us, it would proceed upon that principle, that instances, of which we have had no experience, must resemble those, of which we have had experience, and that the course of nature continues always uniformly the same.
(A Treatise of Human Nature [13], T. 1.3.6.4)
But the course of nature does not necessarily continue always uniformly the same, especially when nature does not refer to the whole universe but to the particular realm where an ML model is employed. Before proceeding with an example of this issue (see § I-B), we have to tackle two other issues.
Suppose we aim to predict whether the next president of the United States of America will be a woman or not. If we rely solely on the gender of previous presidents, by induction we will predict a zero chance. But by understanding how a person becomes a presidential candidate, and how they previously became a candidate for their party, we can take into account the network of people involved in the process and recalculate our forecast with higher precision. In this case the rules are clearly defined in the law. Pouring these bits of domain knowledge into our model will show that the chances are increasing over time. Encoding the rules behind the data greatly increases the robustness and precision of our model. Thanks to these rules our inference becomes deductive rather than inductive, since the conclusion necessarily follows from the premises, and as long as the premises are true the conclusion will also be true.
We can identify two issues in the first approach of our example. First, partial data can misrepresent the underlying phenomena that shape the data, producing a model that does not resemble the real world. This is especially notable in the case of bias and confounders, which are further aggravated by the lack of domain knowledge in designing the solutions. The second issue relates to induction. Contrary to deduction, where the truth of the premises guarantees the truth of the conclusion, inductive inferences are ampliative, since their conclusions go beyond what is contained in their premises, and those conclusions could be totally wrong even if infinitely many examples confirm them [14]. This ampliative factor also has an amplifying effect on the partial data from which we infer a conclusion. In this case, considering only the final results of the elections amplified the bias derived from a partial collection of the data, reducing the chances of a woman being predicted as president to zero. We will later discuss other amplifying effects derived from induction, such as the abuse of the outcome space (see § II-A).
These two issues put at risk the transferability of the solutions to real-world clinical settings. There will be no translational solutions without embedding medical knowledge into their development process. The challenges of transferring a solution to the clinical environment condition how data must be collected and curated. These steps are often skipped, with researchers rushing to develop solutions with whatever data is available, regardless of its quality, coverage or suitability. Translational medicine requires robust and adaptable solutions, developed and designed upon methodologies that allow for the aforementioned qualities.

B. THE CASE AT ISSUE
Now consider an ML model trained to predict different respiratory diseases (e.g. [15]-[17]) such as tuberculosis (TB), asthma, pneumonia, etc. Then, in December 2019, Dr. Zhang Jixian started to treat a pneumonia case of unknown cause [18]. What kind of inference and data led scientists to think it was caused by a new virus (see § II-B)? A chest computed tomography (CT) scan showed unusual changes in the lungs which were different from those of any known viral pneumonia. Later on, genetic sequencing related the new virus to coronaviruses that circulate in bats, including SARS [19]. In February 2020 the virus causing COVID-19 (Coronavirus disease 2019) was named SARS-CoV-2 (Severe acute respiratory syndrome coronavirus 2).
The aforementioned ML models, created before the pandemic, were trained for a particular subset of the vast hypothesis space for the prediction of human respiratory diseases. Naturally, they were not trained for this novel disease, so the now outdated biases and weights of the model neurons cannot recognise it, just as doctors who were not aware of this novel disease could not provide an accurate diagnosis prior to its discovery, mistaking the patient's symptoms for those of the common cold [20] or measles [21]. In fact, the pathological findings shown in CXR and CT scans are non-specific and overlap with those of other viral infections (e.g. H1N1, SARS, MERS) [23]. Further on, we will see how this problem affects conventional classifiers.

II. CONVENTIONAL MACHINE LEARNING SOLUTIONS
Right after the first outbreaks, information and data about COVID-19 flooded the internet, including several datasets and repositories with CXR and CT images of COVID-19 patients [24]. Researchers rushed to develop models using data from such repositories [25]-[27], claiming that such models could be a ''very helpful tool for clinical practitioners and radiologists to aid them in diagnosis'' [28]. Exhaustive systematic reviews of models and datasets for COVID-19 detection from CXR and CT scans can be found in [1], [2], respectively, addressing the poor suitability of such models for clinical use. Most of these solutions are based on conventional classifiers such as binary and multi-class classification methods. Table 1 presents some examples from the literature. Below we discuss the limitations of this type of model.

A. BINARY AND MULTI-CLASS CLASSIFICATION
The problem of classifying a given input into one class out of several classes needs some compromises but also requires some conditions. Typically, these classifiers return a probability for each class, and the most probable class is chosen as prediction. Of course, the sum of these probabilities must be one (see Eq. 1), but reducing the outcome space from the vast sample space to a limited event space of a couple of classes artificially increases the probability of the chosen classes. No matter how many classes are considered, the probability sum of such classes will be one in the training and test sets, but nothing ensures that such events follow the same probability in the real world where more classes exist.
This abuse of the outcome space entails dangerous consequences if a classifier is deployed in a clinical environment different from its training and test sets. Even if similar, a real clinical environment is prone to change: a new disease may appear, and the model is then forced to choose among the set of classes it was trained to classify. Simply put, a model of this kind cannot say ''I don't know'', and therefore reducing its outcome space necessarily increases the probability that cases from the remaining classes are mispredicted by the model.
Kolmogorov's second axiom. The probability of the entire outcome space $\Omega$ is 1:
$P(\Omega) = 1 \qquad (1)$
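The effect of a closed outcome space can be illustrated with a minimal Python sketch. It assumes a softmax output layer, which is a common choice for such classifiers but is not named in the text; the class names and logits are invented for illustration only.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that always sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical closed outcome space of a three-class classifier.
CLASSES = ["normal", "pneumonia", "COVID-19"]


def predict(logits):
    """Return the most probable class and its probability."""
    probs = softmax(logits)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best], probs[best]


# Made-up logits for an image of a condition outside the three classes:
# the model still has to answer with one of the classes it knows.
unseen_logits = [0.2, 1.1, 0.9]
label, prob = predict(unseen_logits)

# Dropping a class from the outcome space raises the probability
# assigned to the classes that remain, without any new evidence.
three_classes = softmax([0.2, 1.1, 0.9])
two_classes = softmax([0.2, 1.1])
```

Whatever the input, the probabilities are normalised over the classes the model knows, so there is no output that corresponds to ''none of the above''.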
Several solutions for diagnosis of COVID-19 [33], [40] limit their outcome space to a couple of diseases, disregarding many other possible lung diseases such as tuberculosis or asthma. Furthermore, these solutions assume that the different lung diseases of the model outcome space are mutually exclusive events (see Figure 1), while in fact many lung diseases can co-exist (e.g. COVID-19 and TB [41]-[43]) and often share common abnormalities (e.g. consolidations, opacities). Diseases are not found in nature as entities per se; they refer to a definable deviation from a normal phenotype evident via symptoms and/or signs. The different sets of symptoms, pathologies and signs are grouped into diseases, and likewise diseases are grouped into categories, all of them organised into disease taxonomies (e.g. International Classification of Diseases). Thus, one disease can have more than one etiology, and one etiology can lead to more than one disease [44]. Lung diseases in particular produce a spectrum of lung pathologies that evolves over time and whose diagnosis requires a combination of tests (e.g. radiology, pulmonary function, blood, sputum, etc.). Importantly, some diseases of the respiratory system (e.g. those of the pulmonary vasculature) can be associated with a normal CXR [45]-[47], requiring CT scans and further tests to clarify the diagnosis and prognosis of such cases. Moreover, ML methods do not necessarily capture model predictive uncertainty, adding another source of risk for their predictions in real-world settings [48]. Bayesian methods can help quantify the uncertainty caused by the model structure or the use of limited samples (i.e. epistemic uncertainty) [49].
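The mutual-exclusivity assumption can be made visible with a short sketch. It contrasts a softmax output, which forces classes to compete, with independent per-disease sigmoid outputs, a standard multi-label alternative that the text does not itself propose. The logits are invented and stand for a patient with strong evidence of two co-existing diseases.

```python
import math


def softmax(logits):
    """Probabilities over mutually exclusive classes (sum to 1)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent probability for a single yes/no question."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical scores: strong evidence for both COVID-19 and TB.
logits = {"COVID-19": 3.0, "TB": 3.0, "asthma": -2.0}

# Mutually exclusive reading: the two diseases split the probability mass.
exclusive = dict(zip(logits, softmax(list(logits.values()))))

# Multi-label reading: each disease is assessed on its own.
independent = {disease: sigmoid(z) for disease, z in logits.items()}
```

Under the exclusive reading neither disease can exceed a probability of 0.5 here, although the evidence for each is strong; under the independent reading both can be reported together.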
In this light, binary and multiclass solutions for the prediction of lung diseases based on CXR images cannot be translated to the clinical practice, regardless of their accuracy results, since they are based on unrealistic assumptions about the nature of what they predict, and employ predictors not suited for the problem at hand.

B. DIAGNOSIS AND MONOTONICITY
The issue previously described is also related to the concept of monotonicity. In its epistemic sense, monotonicity expresses the fact that adding more premises to an argument allows you to derive all the same conclusions as you could with fewer [50]. Specifically, under monotonic reasoning, if a conclusion p follows from a set of premises A (denoted as A ⊢ p), adding another set of premises B does not alter the conclusion (i.e. A ∧ B ⊢ p also holds) [51], [52]. As stated by Pearl: ''The problem of monotonic logic lies not in the hardness of its truth values, but rather in its inability to process context-dependent information'' [53, Ch. 1.5, p. 24]. Therefore, reasoning is non-monotonic when a conclusion supported by a set of premises can be retracted in the light of new information. In other words, we can infer certain conclusions from a subset of a set S of premises which cannot be inferred from S as a whole. Medical diagnosis fits this definition very well. In the case at issue of this work, the presence of more abnormalities could imply a different etiology, and thus a different disease.
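The contrast between the two kinds of reasoning can be written compactly as follows. The turnstile $\vdash$ for classical consequence and the symbol $\mathrel{|\!\sim}$ for defeasible consequence are notational assumptions borrowed from the non-monotonic logic literature, not symbols taken from the text.

```latex
% Monotonic consequence: added premises never retract a conclusion.
\[
  A \vdash p \;\Longrightarrow\; A \wedge B \vdash p
  \qquad \text{for every set of premises } B .
\]
% Non-monotonic consequence: the same step is not guaranteed.
\[
  A \mathrel{|\!\sim} p
  \quad \text{does not guarantee} \quad
  A \wedge B \mathrel{|\!\sim} p .
\]
```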
Section I-B left open the question about what kind of inference led scientists to conclude that the abnormal cases of pneumonia treated by Dr. Zhang were caused by a novel coronavirus. Defeasible reasoning deals with tentative relationships between premises and conclusions, which can be defeated by additional information, allowing for the retraction of inferences. For instance, while we may infer that Tweety flies based on the information that Tweety is a bird and the domain knowledge that birds generally fly, we can retract this inference when we learn that Tweety is a penguin. Tweety is indeed a bird but it cannot fly.
Defeasible reasoning is not exempt from limitations either, requiring causal information to properly derive conclusions under certain scenarios [54], [55]. Consider, for example, this problem posed by Pearl: if the sprinkler is on, then normally the sidewalk is wet, and, if the sidewalk is wet, then normally it is raining. However, we should not infer that it is raining from the fact that the sprinkler is on [53, Ch. 1.5, p. 24].
Conflicts may arise between hard facts and defeasible conclusions. For instance, both arguments in Figure 2, Penguin ⇒ Bird → flies and Penguin → ¬flies, finish with a defeasible inference. The transitivity rule (a → b, b → c) ⇒ a → c cannot be applied to the first argument. In this case, we can give priority to the argument with the more specific antecedent, but resolution is not always this trivial, and complex conflicts can remain unresolved. Over the last decades, non-monotonic logic, defeasible reasoning and causal reasoning have been investigated in Artificial Intelligence (AI) for the medical field [56], [57]. However, the methodologies and methods associated with such concepts have not yet been incorporated into the mainstream ML community. Challenges promoting wins over knowledge do not help incorporate more complex methods into the mainstream tools, limiting the assessment of a solution's success to the KPIs of interest.
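The specificity-based priority between the two penguin arguments can be sketched in a few lines of Python. This is a deliberately crude illustration under stated assumptions: a single-inheritance hierarchy and one default per kind, with all names (`IS_A`, `DEFAULTS`, `infer`) invented here. It does not address the complex conflicts that, as noted above, can remain unresolved.

```python
# Strict facts: each kind and its direct superclass (Penguin => Bird).
IS_A = {"penguin": "bird"}

# Defeasible rules: what is normally true of each kind.
DEFAULTS = {
    "bird": ("flies", True),      # Bird -> flies
    "penguin": ("flies", False),  # Penguin -> not flies
}


def ancestors(kind):
    """Yield the kind itself, then its superclasses, most specific first."""
    while kind is not None:
        yield kind
        kind = IS_A.get(kind)


def infer(kind, prop):
    """Return the value given by the most specific applicable default."""
    for k in ancestors(kind):
        rule = DEFAULTS.get(k)
        if rule is not None and rule[0] == prop:
            return rule[1]
    return None  # no default applies, so nothing is concluded


tweety_as_bird = infer("bird", "flies")        # before learning more
tweety_as_penguin = infer("penguin", "flies")  # conclusion retracted
```

Learning that Tweety is a penguin adds a premise and reverses the conclusion, which is exactly the behaviour a monotonic system cannot reproduce.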

C. CIRCUMVENTING THE ISSUES
In Section II-A, two main issues were identified in the conventional solutions from the literature. First, mapping an image to a single disease (denoted as I → D) is partial and imprecise, considering that diseases can co-exist and are not mutually exclusive events. Second, diseases can share pathologies, and the pathologies of a particular disease can manifest differently and evolve over time. To work around these issues, we can follow a process similar to the radiologist's diagnosis, which should involve at least two steps. First, a set of pathologies is derived from a given image, I → P(P), where P(P) denotes the powerset² of the pathologies set P, i.e. the set of all subsets of P, including the empty set and P itself. Second, once the pathologies have been derived from a given image I, they can be mapped to diseases from the set of diseases D, P(P) → P(D). Figure 3 depicts a comparison of the previously described processes.
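The two-step process can be sketched as follows. Everything concrete in this example is hypothetical: the pathology list, the disease patterns and the stand-in `detect_pathologies`, which reads precomputed findings from a dictionary instead of analysing an image. Real associations would have to come from radiologists and pathology knowledge.

```python
from itertools import chain, combinations

PATHOLOGIES = ["consolidation", "opacity", "cavitation"]  # hypothetical set P

# Hypothetical patterns, for illustration only.
DISEASE_PATTERNS = {
    "COVID-19": {"opacity", "consolidation"},
    "TB": {"cavitation", "consolidation"},
}


def powerset(items):
    """All subsets of items, including the empty set and the full set."""
    items = list(items)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def detect_pathologies(image):
    """Step 1, I -> P(P): return the subset of pathologies that is present."""
    return frozenset(p for p in PATHOLOGIES if image.get(p, False))


def candidate_diseases(found):
    """Step 2, P(P) -> P(D): diseases whose whole pattern was found."""
    return frozenset(
        disease
        for disease, pattern in DISEASE_PATTERNS.items()
        if pattern <= found
    )


image = {"opacity": True, "consolidation": True, "cavitation": True}
found = detect_pathologies(image)
diseases = candidate_diseases(found)
```

Because both steps return subsets, the result can contain several co-existing diseases or none at all, which a mapping I → D onto a single class cannot express.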
Note that the second step P(P) → P(D) could be enriched with extra information such as preconditions, tests results etc. for an increased precision. Moreover, radiologists often refer to the set of abnormalities found in the images as patterns, in this vein, each disease could alternatively be provided with a corresponding function as denoted in Equation 2, providing a matching value for a set of pathologies taken as argument.
$\forall d \in F,\quad f_d : \mathcal{P}(P) \to m_d \qquad (2)$
Let F be the event space of diseases $d_1, d_2, \ldots, d_n$.
To dig a little deeper into the example, a dedicated function could be defined for each disease d in a form similar to Equation 3. This function would take into account the different abnormalities detected in the CXR and express the combination of pathologies that matches the disease definition. Let $\Omega$ be the sample space of all pathologies $p_1, \ldots, p_n$ and F the event space of diseases $d_1, \ldots, d_n$, with $\#p$ as the number of occurrences of a given pathology p, the total area of the pathology p over all its occurrences, and $w_p$ as the pathology relevance.
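One possible shape for such a per-disease function is sketched below. It is an assumption made for illustration, not the form of Equation 3: it simply sums, over the detected pathologies, the product of relevance, number of occurrences and total area. The `Finding` type, the weights and the area unit (fraction of the lung field) are all invented here.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    pathology: str
    count: int    # number of occurrences of the pathology
    area: float   # total area over all occurrences (fraction of lung field)


# Hypothetical relevance weights for one disease.
TB_WEIGHTS = {"cavitation": 0.7, "consolidation": 0.3}


def match(findings, weights):
    """Weighted sum over findings; a larger value means a closer match."""
    return sum(
        weights.get(f.pathology, 0.0) * f.count * f.area for f in findings
    )


findings = [
    Finding("cavitation", count=2, area=0.10),
    Finding("nodule", count=5, area=0.02),  # not relevant to this pattern
]
score = match(findings, TB_WEIGHTS)
```

Pathologies without a weight for the disease contribute nothing, so the same list of findings can be scored against every disease function in turn.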
Likewise, the example in Equation 3 could be (and certainly should be) enriched with additional multi-modal information derived from other tests (e.g. blood, sputum, etc.). This extra information can have a different degree of relevance in the diagnosis, and can even override the rest of the factors in the equation, requiring more complex functions than the above examples. Such equations are precisely where the pathology knowledge should be embedded, encoding the relevant parameters in the form of a formula together with additional information relevant for the diagnosis. For instance, a positive culture of Mycobacterium tuberculosis can suffice to diagnose TB, even with a normal CXR. On the other hand, a negative culture can cancel the rest of the parameters derived from an abnormal CXR, at least for this particular disease [58].
² Also written as $2^P$ in set theory.
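The overriding role of a test result can be sketched as a wrapper around an imaging score. The function name, the score range and the hard 0 and 1 outcomes are simplifying assumptions; the text only says that a culture result can suffice or can cancel the imaging parameters, not that it always does.

```python
from typing import Optional


def tb_match(cxr_score: float, culture_positive: Optional[bool]) -> float:
    """Combine an imaging score in [0, 1] with a culture result.

    culture_positive is True, False, or None when no culture was taken.
    """
    if culture_positive is True:
        return 1.0  # a positive culture suffices, even with a normal CXR
    if culture_positive is False:
        return 0.0  # a negative culture cancels the imaging evidence for TB
    return cxr_score  # no test available: fall back to imaging alone


normal_cxr_positive_culture = tb_match(0.0, True)
abnormal_cxr_negative_culture = tb_match(0.9, False)
imaging_only = tb_match(0.4, None)
```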
The previous examples are not final but merely indicative of a different method to detect diseases without incurring the issues described in Section II-A. These methods are not novel and are often used in different areas. For instance, using an ML equivalent, a one-class classifier (OCC) could be defined for each disease, receiving pathologies as input. OCCs are useful when data from other classes is difficult to obtain. In this case, such methods would allow a corresponding equation to be defined for each disease, encoding its pathology particularities. The respective functions of the diseases could be updated as the knowledge of the disease pathology evolves, but the model used to extract the pathologies from CXR would not change in that case (unless new types of pathologies are found), in the same way that instruments for medical tests are rarely changed when new knowledge of a disease is learnt.
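A one-class classifier can be as simple as the centroid-based sketch below, which learns from examples of a single disease only and rejects anything too far from them. The class name, the two-dimensional feature vectors (for instance, areas of two pathologies) and the training values are hypothetical; a practical OCC would need a more careful choice of features and threshold.

```python
import math


class CentroidOneClass:
    """Accept a sample if it lies within a radius of the training centroid."""

    def fit(self, samples):
        n = len(samples)
        dim = len(samples[0])
        self.centroid = [sum(s[i] for s in samples) / n for i in range(dim)]
        # The radius is set by the farthest training sample.
        self.radius = max(self._distance(s) for s in samples)
        return self

    def _distance(self, sample):
        return math.dist(sample, self.centroid)

    def accepts(self, sample):
        return self._distance(sample) <= self.radius


# Hypothetical pathology vectors from confirmed cases of one disease.
disease_cases = [(0.30, 0.10), (0.35, 0.15), (0.25, 0.12)]
classifier = CentroidOneClass().fit(disease_cases)

similar_case = classifier.accepts((0.31, 0.12))
unrelated_case = classifier.accepts((0.0, 0.0))
```

With one such classifier per disease, a sample can be accepted by several of them or by none, so ''no known disease matches'' becomes a possible answer.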

III. DISCUSSION
This section discusses the limitations of inductive methods addressed before and how domain knowledge becomes essential to ease and direct its impact.

A. ON INDUCTION
Whether to reject induction as a justification method or not is still an open debate and not the aim of this work, but at least we should agree that induction, while useful, is limited. Such limits must be taken into account, especially when the problem requires non-monotonic means because the inferences of the model are tentative and defeasible.
VOLUME 9, 2021
If we visualise the data as points in a plane, every finite set of points belongs to infinitely many functions or curves. The problem of induction, therefore, consists in establishing criteria that allow us to say that the finite series of data confirms only one of the functions or, less dramatically but just as problematically, that one is more confirmed than the others [59].
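A small numerical example makes the point tangible. The data points and functions below are invented: any multiple of x(x - 1)(x - 2) can be added to the straight line without changing its values at the three observed points.

```python
# Three observed points, all lying on the line y = x.
DATA = [(0, 0), (1, 1), (2, 2)]


def line(x):
    return x


def make_curve(c):
    """Return a cubic that also passes through every point in DATA."""

    def curve(x):
        # The added term vanishes at x = 0, 1 and 2.
        return x + c * x * (x - 1) * (x - 2)

    return curve


cubic = make_curve(1)
steeper = make_curve(10)

# All three functions are equally confirmed by the observations...
fits = [
    all(f(x) == y for x, y in DATA) for f in (line, cubic, steeper)
]

# ...and yet they disagree about the first unobserved point.
predictions_at_3 = [line(3), cubic(3), steeper(3)]
```

Since the constant c is arbitrary, the observations alone cannot select one of these functions over the others.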
The confirmation of a hypothesis, or in this case the candidate model, is often considered to increase as the number of favourable test findings grows, but the increase in confirmation produced by one new favourable instance will generally become smaller as the number of previously established favourable instances grows [60, Ch. 4, p. 34]. Many researchers blindly rely on the dogma ''the more data, the merrier'', but the addition of one more favourable finding raises the confirmation of the hypothesis only a little [61]. The confirmation of a hypothesis depends not only on the quantity of the favourable evidence available but also on its variety. To overcome this first error, naïve researchers could quickly pour other datasets into the pot, but even such a well-intended decision can have unexpected consequences, as the model can boost the features that make the datasets different instead of their commonalities. For example, in the context of COVID-19 detection from CXR scans, Maguolo and Nanni showed how models can learn to predict features that depend on the source of the dataset rather than on the relevant medical information [62]. Again, domain knowledge becomes essential to weigh the data and direct the model on which features are relevant and which are confounding. Therefore, careful data curation is crucial to prevent the risk of bias in the models.
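The diminishing return of each new favourable instance, described at the start of the paragraph above, can be illustrated numerically. The sketch assumes Laplace's rule of succession as the measure of confirmation; this is only one of many possible quantifications and is chosen here for its simplicity, whereas the claim in the text is qualitative.

```python
from fractions import Fraction


def confirmation(n):
    """Rule of succession after n favourable and no unfavourable findings."""
    return Fraction(n + 1, n + 2)


def gain(n):
    """Increase in confirmation brought by the (n + 1)-th favourable finding."""
    return confirmation(n + 1) - confirmation(n)


first_gain = gain(0)      # going from no findings to one
later_gain = gain(10)     # going from ten findings to eleven
much_later_gain = gain(1000)
```

Under this assumption the gain equals 1 / ((n + 2)(n + 3)), so it shrinks roughly with the square of the number of findings already collected.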

B. ON DOMAIN KNOWLEDGE
Diagnosis is intrinsically multimodal and often requires tests of different nature to draw a conclusion, comprising multiple areas of medicine. Seeking to diagnose lung diseases with just CXR images is unnecessarily biased, and yet again, an example of how problems are often forced to fit the available data instead of fed with the data they need to become solvable. Owing to the defeasible nature of diagnosis, causal information regarding the diseases is crucial for building ML models and computer-aided diagnosis solutions.
Even though the prediction performance of ML solutions can be convincing, the lack of explicit models can make ML solutions difficult to directly relate to existing biological knowledge [63]. In this sense, Bayesian approaches may be used to embed appropriate priors from domain knowledge to better assess predictions' confidence, which ultimately increases model robustness. Consequently, translational medicine must be bidirectional, and more effort has to be put into bringing medical knowledge into the data and model design. In the case of CXR and CT images, curation by radiologists could be immensely enlightening for the construction of better models that detect lung lesions or abnormalities. Then, pathology knowledge can pave the way to embed causal relationships between pathologies and diseases into the models.

IV. CONCLUSION
This work attempts to provide some perspective to researchers of multiple areas regarding the current trend from the mainstream machine learning community to address significant challenges with careless solutions. Such solutions are oversold by meaningless KPIs unable to discern the suitability of a model for real-world clinical settings.
Several domains, ranging from machine learning and epistemology to logic and pathology, have been superficially tackled in this work, with a special focus on their impact on the conventional solutions developed for the automatic detection of COVID-19 from CXR images. This work focuses on the automatic detection of lung diseases from CXR images as a goal, with transferability to real clinical settings as a requirement. The methodological flaws of such solutions are masked by KPIs stressing the high accuracy and precision of the models on their training and test datasets, but such solutions hide dangerous risks that may arise when transferring them into real-world clinical settings.
The epistemic issues addressed in this work concerning induction and monotonicity condition the means by which the goals are achieved and how methods are justified. The methods of such solutions were chosen by convention, disregarding the particularities of the problem at issue and failing to consider knowledge from the domain at hand; for instance, that lung diseases are not necessarily mutually exclusive events and that diseases can share pathologies. Ignoring these facts, and ultimately ignoring domain knowledge, conditions the methods employed to achieve the aforementioned goals. The relationships between diseases and pathologies are not in the data but do exist in reality, of which the data is merely a blurry shadow, similar to the shadows of Plato's cave. The scientific community is already responding to the many issues affecting the reproducibility, interpretability and transferability of ML solutions. Efforts are being made on different fronts, establishing guidelines for datasets [2], methods [1], ethics and community challenges [64], which aim to make ML solutions more suitable for translation into real-world clinical settings. We hope this work will continue to raise awareness of this topic and help researchers develop better solutions, and ultimately unveil knowledge.

ACKNOWLEDGMENT
The author would like to thank Prof. Dr. Reinhard Schneider³, Dr. Wei Gu³ (Bioinformatics Core Group), Asst. Prof. Dr. Enrico Glaab³ (Biomedical Data Science Group), Prof. Dr. Michel Mittelbronn³,⁴,⁵,⁶ and Prof. Dr. Jorge Goncalves³ (Systems Control Group) for their support and helpful advice during the internal review process, which contributed to improving this work. Special thanks to Beatriz García³,⁷ (Interventional Neuroscience Group), who guided and inspired this work during our lockdown talks.
³ Luxembourg Centre for Systems Biomedicine
⁴ Luxembourg Centre of Neuropathology
⁵ Luxembourg Institute of Health
⁶ Laboratoire National de Sante
⁷ Centre Hospitalier de Luxembourg