Overcoming Confounding Bias in Causal Discovery Using Minimum Redundancy and Maximum Relevancy Constraint

Causal discovery is the process of modeling cause-and-effect relationships among features. Unlike traditional model-based approaches, which rely on fitting data to models, methods of causal discovery determine the causal structure from the data. In clinical and EHR data analysis, causal discovery is used to identify dependencies among features that are difficult to estimate using model-based approaches. The resultant structures are represented as Directed Acyclic Graphs (DAGs) consisting of nodes and arcs, where the direction of an arc indicates the influence of one feature over another. These dependencies are fundamental to the discovery of novel insights from data. However, causal discovery relies solely on establishing feature dependencies from their conditional dependencies, which can lead to inaccurate inferences brought about by confounding bias. Our contribution in this work is ‘Non-Confounding Causal Discovery’ (NCCD), a framework aimed at overcoming confounding bias by leveraging maximum relevancy and minimum redundancy between features using the concepts of information theory. The framework uses threshold values to decide which features in the graphical structure are connected to one another. Validation was carried out on three clinical trial benchmark datasets, and the results were compared against the previously known Naïve Bayes (NB) and Tree Augmented Naïve Bayes (TAN) algorithms. We observe a reduction in the complexity of the graph, evidenced by a decrease in the number of arcs. Notably, the graphs generated through NCCD exhibited a capacity to eliminate confounding dependencies while preserving the overall score of the network.


I. INTRODUCTION
Mining multi-dimensional clinical trials data has gained research significance to elucidate knowledge about treatment interventions [1], [2]. Traditionally, researchers rely on data-driven statistical tests to objectively understand the relationship between features and their ability to alter the independent treatment feature across the different treatment groups in a clinical trial [3], [4]. These statistical tests become challenging to administer and interpret as these datasets differ based on subjective assumptions brought about during the design of the trial. These assumptions cause inherent biases, resulting in features that are highly correlated with each other.
Causal Learning (CL) [5], as an alternative approach, is used to study cause-and-effect relationships [6] using the dependencies that exist between features rather than relying purely on correlation [7] to identify relationships. Techniques of CL are further categorized into Causal Discovery (CD) and Causal Inference (CI), respectively [8]. CD identifies causal relationships using the dependencies between features in observational data. Unlike controlled experiments, observational data [9] from clinical trials is collected without any forced intervention or manipulation. CI goes a step further and estimates the genuineness of higher-order dependencies based on the causal effect between two features. Higher-order dependencies involve multiple features influencing each other. Existing approaches are iterative in nature and exhaustively capture all higher-order dependencies between causally inferred features. However, not all inferred dependencies are relevant, leading to irrelevant or at times redundant dependencies among features in CD. These irrelevant or redundant dependencies lead to complex, meaningless causal structures that are further exacerbated by the inclusion of a large number of features, questioning the interpretability and scalability of CD approaches for larger datasets.
Furthermore, related works [10], [11], [12], [13] have shown that irrelevant dependencies discovered between features lead to confounding bias. Confounding bias in CD [14] arises when one feature, acting as a confounder, influences both the independent and other dependent features, causing a spurious association. It is essential to address confounding bias when a confounder plays a significant role in influencing both the treatment and the class within a causal triangle.
The goal of this work is to propose an algorithmic approach to overcome confounding bias in CD. We aim to evaluate existing algorithms [15] of CD that leverage information theory to establish relationships between features, and to use the measure of relevancy to filter any redundant or irrelevant dependencies that lead to confounding bias. Exploiting the iterative inclusion of features in CD approaches, we define our measure of relevancy as the degree of contribution of an edge to the entire causal structure with respect to the class label. We eliminate the direct causal dependency between the treatment and the class when the treatment does not have a direct effect on the class beyond what can be explained by the confounder. We hypothesize that there are changes in the relevancy of established relationships between features of a causal structure with the iterative inclusion of new features during its construction. These changes are typically overlooked, thereby leading to confounding bias.
In our work, we propose using conditional mutual information among features relative to the class to add edges to the graphical structure, ensuring relevance is maintained throughout the graph. Further, we normalize these dependencies and define a threshold to incrementally filter redundant edges among the features. We formalize our hypothesis using the following propositions.
Proposition 1: Conditional dependencies exist between a pair of features that can be modeled with respect to the class label to build a causal structure.
In the existing Bayesian Network (BN) techniques for CL, mutual information among the features is used to select the features and order them as nodes in the model. However, it is widely accepted that selected features should be relevant and should not be redundant, referred to as the 'relevancy-redundancy' trade-off.
Proposition 2: To ensure maximum relevancy and minimum redundancy, the measure of Joint Entropy is used to eliminate confounding relations, thereby identifying confounders.
We propose two algorithms, namely Feature Based Causal Discovery (FBCD) and Non-Confounding Causal Discovery (NCCD). These algorithms incorporate conditional checks on information theory metrics such as entropy and mutual information while developing a graph. We try to answer the following research questions:
RQ1: Does the proposed algorithm establish a graph that is consistent with other related works?
RQ2: Is there a relationship between relevancy and the number of features?
RQ3: Does the removal of redundant edges contribute to the reduction of confounding bias?
The two algorithms differ significantly from each other in their use of different measures to remove confounding features from the selected features, which inherently increases the chances of identifying the confounding effect. We compared the performance of our graphs against the existing NB and TAN techniques [16], [17] stated above using the scoring function. The rest of this paper is structured as follows. In Section II, we provide some background information and outline the challenges involved in the related work. Section III explains our definitions, framework, the datasets we used, and our proposed algorithms. In Section IV, we show our results and how we validated them through experiments. Then, in Section V, we discuss what these findings mean. Finally, in Section VI, we conclude by summarizing our key points.

II. BACKGROUND
In recent times, CL techniques have been used to extract implicit yet valuable patterns that traditional model-fitting techniques fail to capture. Initially, when causal relations were identified, an algorithm [18] was presented which infers a causal relation using certain independence relations between three features. Further, the Embedded Y-structure [19] identified causal relations between four features without any background knowledge. This has led to several enhancements over time, resulting in graphical models that have manifested as trees with heuristic searches, and in the following broad categories of algorithms. Constraint-based algorithms estimate whether certain conditional independences between features hold true [20]. These estimations are performed using statistical or information-theoretic measures. These algorithms output a single graph with clear semantics [21]. However, they give no clear indication of the relative confidence in the model and are sensitive to error [22]. The Incremental Association Markov Blanket (IAMB) algorithm is an example of a constraint-based algorithm. It discovers the Markov Blanket of the target feature [23], where the Markov Blanket of a node is defined as ''the knowledge needed to predict the behavior of that node'' [24]. In the Bayesian setting, the Probabilistic Graphical Model (PGM) is a statistical model that is well known in the medical community for its ability to represent the dependencies of features in the form of graphs [25], [26]. PGMs exploit the probability distribution of features (vertices) to establish the edges between them. The use of the probability distribution allows PGMs to be robust to the uncertainty brought about by random errors and measurement errors. These errors are prevalent in clinical data.
Learning a PGM is an NP-hard problem, as it deals with the task of finding the graph which best fits the given data [17], [27]. The Peter-Clark (PC) algorithm [13], a constraint-based causal discovery method, identifies causal relations when all relevant features are observed without sample selection, and outputs a DAG. However, difficulty in observing all relevant features and sample selection can be a problem. Building on PC, the Fast Causal Inference (FCI) algorithm [13] produces a Partial Ancestral Graph (PAG) in the presence of confounding. A procedure to remove the existing ancestral relations, which can be confounding, was conjectured [28].
Score-based algorithms, on the contrary, maximize a score function of a graph given the input data [21]. Examples of the score function are log-likelihood and posterior probability, used by Hill Climbing (HC) [29] and Tabu [30] search. Examples also include the NB and TAN [16], [17] algorithms, which make use of the conditional dependencies and probability distributions between the features to build a network.
The NB model is a zero-dependency model where the class label (or target feature) θ is inferred from a Bayesian model of an n-dimensional input vector X, which is conditionally independent given the class label θ [6]. The NB model calculates the conditional probability of the features given the class label [31]. It is the simplest of all BNs and is computationally inexpensive [31]. However, it assumes conditional independence among features, which limits its ability to capture latent confounders.
As an extension of the NB model, TAN is a one-dependence model which weakens the conditional independence assumption of the NB model. It allows dependencies between features by permitting each feature to have its most correlated feature as its confounder [6]. This constraint limits the ability to observe multiple confounders between features. These models, however, do not scale well computationally to large multi-dimensional datasets. More recent enhancements of BN models [32], [33] have been flexible in incorporating feature selection techniques to reduce the number of dimensions in large datasets [34], [35].
Similarly, hybrid algorithms combine the ideas of both constraint- and score-based learning techniques. They first learn the structure of the graph using causal discovery, as in the case of the Min-Max Hill Climbing algorithm (MMHC) [27]. They then assign directions to the edges of the graph using Bayesian scoring. These techniques are commonly referred to as BN models. The K-dependence Bayesian Classifier (KDB) approach adopts a greedy technique [6]. It allows each feature to have K related confounders, using mutual information (MI). Conditional mutual information (CMI) between the features and the K confounders given the class label is used to build the model. It should be noted that KDB uses MI between features and fails to capture higher-order interactions that can exist between the confounders themselves. The Flexible K-dependence Bayesian Classifier (FKBN) technique was developed by Wang et al. [36]. Unlike KDB, FKBN considers all the features of the model to be related to each other. It uses CMI to establish confounders with respect to the class label. This way it ensures that all confounders related to features are included in the model [6]. However, the complexity of FKBN models increases due to their ability to capture higher-order interactions between confounders.
Confounding behavior of features was the major challenge across this study. Confounding bias, which is a result of existing selection bias, causes unexpected correlations [37]. Selection bias is the systematic error that hinders causal effect estimation [38]. Extreme confounding is also termed Simpson's Paradox [39]. Therefore, controlling confounding at each level has been of major importance. To effectively address confounding bias in our work, we propose a measure of relevancy to filter any irrelevant dependencies. We further discuss how the resultant DAG from NCCD omits a few relations which can be confounding.

III. METHODOLOGY
Here we provide a detailed overview of the proposed methodology and the specific aims of this work. Beginning with the concepts of information theory that formalize our methodology, we then discuss the description of the datasets used, followed by a detailed explanation of the developed algorithms. The main objective of our work is to construct a graph using a combination of maximum relevancy and minimum redundancy criteria of the features to overcome confounding bias. Figure 1 describes the framework of our methodology.
We define the terminology based on the concepts of information theory.
Definition 1 (Entropy): It denotes the uncertainty of a feature. The entropy of a random feature X with probability mass function p(x) is given in (1):

H(X) = −Σ_x p(x) log p(x)    (1)

If the values of the feature X are biased towards one value, the entropy or uncertainty is said to be low.
Definition 2 (Conditional Entropy): The conditional entropy of a feature X conditioned on Y is defined as the uncertainty of feature X given feature Y, as in (2):

H(X | Y) = −Σ_{x,y} p(x, y) log p(x | y)    (2)
Definition 3 (Joint Entropy): The joint entropy of two features X and Y is defined as the uncertainty measure of the pairwise occurrence of the features, as in (3):

H(X, Y) = −Σ_{x,y} p(x, y) log p(x, y)    (3)

Definition 4 (Mutual Information (MI)): The MI of two features X and Y is defined as the information shared by the two features, as in (4):

I(X; Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]    (4)

We introduce the threshold variable Tau (τ) as the average of normalized MI values, serving as a threshold for deciding whether the MI between a parent and a child node (features) is significant enough to include an edge between them.
Similarly, the threshold variable Gamma (γ) is the average of normalized MI values, representing a threshold for determining whether the MI between a feature and the class label is significant enough to include an edge between them.
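To make Definitions 1, 3, and 4 and the threshold averages concrete, the following is a minimal sketch (not the paper's implementation) of computing pairwise MI and taking the τ and γ averages; the helper names and the use of plain, un-normalized MI values here are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def entropy(xs):
    """Shannon entropy (in bits) of a discrete sequence, per Definition 1."""
    n = len(xs)
    return -sum((c / n) * np.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), combining Definitions 1, 3, and 4."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mi_matrix(features):
    """i x i matrix of pairwise MI values (Definition 6)."""
    k = len(features)
    m = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            m[i, j] = mutual_information(features[i], features[j])
    return m

def thresholds(features, label):
    """tau from the average feature-feature MI (off-diagonal entries),
    gamma from the average feature-class MI."""
    m = mi_matrix(features)
    off_diag = m[~np.eye(len(features), dtype=bool)]
    tau = off_diag.mean()
    gamma = np.mean([mutual_information(f, label) for f in features])
    return tau, gamma
```

The diagonal is excluded when averaging for τ because a feature always has maximal MI with itself, which would otherwise inflate the threshold.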
Definition 5 (Conditional Mutual Information (CMI)): The CMI of two features X and Y given a third feature θ is defined as the information shared by both features conditioned on the third, as in (5):

I(X; Y | θ) = Σ_{x,y,t} p(x, y, t) log [ p(x, y | t) / (p(x | t) p(y | t)) ]    (5)
In the construction of the graph, edges are added based on maximum CMI values, emphasizing connections that capture the most meaningful conditional relationships between features and the class label. We choose the feature with the minimum CMI to find the most independent parent for the child feature, ensuring that the identified parent is relevant and contributes non-redundant information to the graph.
Definition 6 (MI Matrix): An i × i matrix, where i denotes the number of features in the system. The values of the MI matrix are calculated using (4).
Definition 7 (CMI Matrix): An i × i matrix, where i denotes the number of features in the system. The values of the CMI matrix are calculated using (5).

Definition 8 (Directed Arc):
The arcs E_i, also known as directed edges, are links used to connect the features with each other. Arcs are directed either towards a node or away from a node. The arcs signify the conditional dependencies inferred by CMI and the overall associations detected by MI.
Definition 9 (Node): A node X i is a feature having either discrete or continuous values.
Based on the direction of the arcs, a node can be of two types: parent or child. If a node has an arc pointed towards it, we say that the node is a child. If the node has an arc pointing away from it, it is a parent node. A node can act as a parent and a child at the same time. With reference to Figure 2, node A is the parent of node B. Node B acts as the parent of C and as a child of node A at the same time.
In the specific context of causal analysis and confounding, three fundamental types of nodes are introduced within the graph structure: the Class Node (θ) represents the class feature, which may depend on other nodes within the graph; the Treatment Node (T) signifies the feature under investigation, often regarded as the cause or influential factor in relation to the class; and the Confounder Node (C) represents a feature that acts as a confounder, influencing both the treatment and class nodes.
Definition 10 (Causal Graph): A directed graph G = (X_i, E_i) is a collection of nodes (or features) (X_i) and arcs (E_i). Figure 2 depicts an example causal graph.
Definition 11 (Confounding Bias): Confounding bias occurs when a third feature, denoted as 'C', simultaneously influences both the treatment feature (T) and the class feature (θ), leading to a spurious association between T and θ. Here, 'C' acts as a confounder, introducing distortion into the causal dependencies. This scenario is explained in Figure 3.
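As an illustration of Definition 11 (not an experiment from the paper), the following simulation constructs a confounder C that drives both T and θ. The unconditional MI between T and θ is then substantial, while the CMI given C is near zero, exposing the T-θ association as spurious. The data-generating rules are invented for this sketch.

```python
import numpy as np
from collections import Counter

def _H(*seqs):
    """Joint Shannon entropy (bits) of one or more discrete sequences."""
    tuples = list(zip(*seqs))
    n = len(tuples)
    return -sum((c / n) * np.log2(c / n) for c in Counter(tuples).values())

def mi(xs, ys):
    """I(X;Y) via the entropy identity."""
    return _H(xs) + _H(ys) - _H(xs, ys)

def cmi(xs, ys, zs):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return _H(xs, zs) + _H(ys, zs) - _H(xs, ys, zs) - _H(zs)

# C determines both T and theta (up to a little deterministic noise);
# T has no direct effect on theta.
C = [0] * 50 + [1] * 50
T = [c ^ (i % 10 == 0) for i, c in enumerate(C)]
theta = [c ^ (i % 7 == 0) for i, c in enumerate(C)]

print(mi(T, theta))      # substantial: spurious association via C
print(cmi(T, theta, C))  # near zero: no dependence once C is controlled
```

Removing the direct T-θ edge when the conditional dependence vanishes given C is exactly the confounding pattern NCCD targets.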

A. DATASETS
For our study, we considered three clinical trial datasets: the ANN Thyroid, Carcinoma, and AIDS datasets. The choice of these datasets reflects common clinical settings. Our choice is three-fold, namely: (a) to assess the ability of the model to handle different types of diagnostic features, (b) to examine the ability of the model to address diversity in dependent features, and lastly (c) to test the ability of the model to adapt to a clinical setting where the treatment option varies with time.

1) ANN THYROID DATASET
The ANN Thyroid dataset is commonly used in machine learning and medical research for thyroid disease diagnosis. It is available in the UCI Machine Learning Repository [40]. It contains information about patients who have undergone thyroid function tests. The dataset consists of 7200 instances (samples) with 21 attributes. These attributes include measurements of various thyroid hormones and related factors, such as 'T3 resin uptake', 'total serum thyroxine (T4) level', 'free thyroxine (T4) level', 'thyroid stimulating hormone (TSH) level', and others. Each sample in the dataset is labeled as either normal or abnormal, indicating the presence of a thyroid disease. The continuous features are discretized using the binning technique. This dataset is referred to as the thyroid dataset in the rest of our work.

2) CARCINOMA OF THE OROPHARYNX DATA
This data was obtained from a large clinical trial carried out by the Radiation Therapy Oncology Group in the United States [41]. The data includes patients with squamous carcinoma of three sites in the mouth and throat. Patients entering the study were assigned to one of the two available treatment groups. The dataset had 195 observations and 12 features with one class label. The class label was binary, with the two possible outcomes being death or survival. The features of case number, institution code, date of entry into the clinical trial, and time of trial follow-up are not significant to our analysis and are hence removed. The continuous feature of age is discretized using binning. This dataset is henceforth referred to as the carcinoma dataset. Data was obtained from the University of Massachusetts, Amherst website [42].

3) AIDS CLINICAL TRIAL DATA
This data was obtained from a placebo-controlled trial aimed at comparing two treatment groups [41]. The data includes the observations of 1,151 patients enrolled in the trial and 16 features, with the class label being either death or survival of the patient. Unnecessary features like 'id', 'time', 'censor', 'time_d', 'tx_grp' are removed, and the continuous features 'age', 'cd4', and 'priorzdv' are discretized using binning. This dataset is referred to as the AIDS dataset in the rest of our work. Data was obtained from the University of Massachusetts, Amherst website [42].

B. DATA PRE-PROCESSING AND VALIDATION
The pre-processing adopted as part of our methodology includes the removal of unnecessary features: all three datasets had features like 'id', 'case number', 'institution code', 'date when first diagnosed', 'time for follow up', etc., which are not significant for our analysis and were removed.
We then checked for missing values; the datasets did not contain any, and hence no further action was required in this direction. We then discretized continuous features like age using the binning technique. Also, all features of the data were checked for normal distribution.
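As an illustration of the binning step (the bin count, the example ages, and the equal-width edges are assumptions for this sketch, not the paper's settings), discretizing a continuous feature such as age might look as follows:

```python
import numpy as np

def discretize(values, n_bins=4):
    """Equal-width binning of a continuous feature: an illustrative
    stand-in for the binning used in pre-processing."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # digitize against the interior edges; clip keeps every value in 0..n_bins-1
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

ages = [23, 35, 47, 52, 61, 70, 78, 85]
print(discretize(ages, n_bins=4))  # → [0 0 1 1 2 3 3 3]
```

Equal-width bins are the simplest choice; equal-frequency (quantile) bins are a common alternative when the feature distribution is skewed.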
For the purpose of validation, the network scores of the developed graphs and the graph metrics are compared with those of existing algorithms such as NB and TAN.

C. PROPOSED ALGORITHMS
To eliminate confounding bias by minimizing redundant and maximizing relevant causal relations, we propose two algorithms: FBCD and NCCD. The algorithms make use of the concepts of information theory defined previously to develop a causal graph.
Equivalent features are treated as related features by placing an edge between them. Since a feature always has the highest mutual information with itself, this leads to graphs that are densely connected with redundant edges.
The proposed Algorithm 1, known as FBCD, constructs a graph from a set of features and a class label. It uses the CMI, MI1, and MI2 values between all features (nodes) and with the class label to establish the dependencies. The algorithm starts by initializing the nodes and setting a threshold τ based on the MI2 values between feature pairs. To overcome redundant edges, FBCD imposes the threshold τ on the parent node. The value of τ is determined as a percentage of the average of the mutual information matrix I(X_i; X_j). This prevents the connection of two features with high mutual information. In each iteration, the algorithm selects a root feature with the highest MI1 with the class label, adds it to the graph, and connects it to the class label. The algorithm then explores the root's dependencies with other features, determining child nodes using the maximum CMI. The algorithm establishes dependencies within the selected nodes by evaluating the minimum CMI of a feature with respect to the child node. Additionally, it ensures that the MI2 between the feature and the child node is below the specified threshold. In such cases, the feature is identified as the parent of the corresponding child node. This process continues until all the features are considered, resulting in a graph encapsulating the informative dependencies between the features and the class label. To find the optimal value of τ, we increased it in increments of twenty percent. This aligns closely with Proposition 1 in that the class label feature remains connected to all other features, resulting in redundancy.
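A simplified sketch of our reading of FBCD follows; the matrix layout, tie-breaking, and helper structure are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def fbcd(mi_feat, mi_class, cmi, tau):
    """
    mi_feat : (n, n) MI matrix between features (MI2)
    mi_class: length-n MI of each feature with the class label (MI1)
    cmi     : (n, n) CMI matrix between features given the class
    tau     : redundancy threshold on parent-child MI
    Returns a list of directed edges; 'theta' denotes the class node.
    """
    n = len(mi_class)
    given = set(range(n))
    selected = []
    edges = []
    # root: the feature most informative about the class label
    root = int(np.argmax(mi_class))
    given.remove(root)
    selected.append(root)
    edges.append((root, 'theta'))
    while given:
        # child: remaining feature with maximum CMI to any selected node
        child = max(given, key=lambda j: max(cmi[i][j] for i in selected))
        # parent: selected node with minimum CMI to the child, provided
        # the plain MI between them stays below tau (redundancy filter)
        candidates = [i for i in selected if mi_feat[i][child] < tau]
        if candidates:
            parent = min(candidates, key=lambda i: cmi[i][child])
            edges.append((parent, child))
        edges.append((child, 'theta'))  # FBCD keeps every feature tied to theta
        given.remove(child)
        selected.append(child)
    return edges
```

Note how every feature ends up connected to 'theta', reflecting the redundancy that NCCD later removes via the γ threshold.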
Algorithm 2, named NCCD, by contrast aligns with Proposition 2. We normalize mutual information values with their respective entropies. This results in the removal of redundant edges with respect to the class label and thereby mitigates confounding bias. The formulas for computing the entropy and joint entropy of the features are provided by Eq. (1) and Eq. (3), respectively.
We make use of the CMI between features with respect to the class label for adding edges to the graphical structure. This constraint ensures that relevancy is maintained while building the graph. The assumption made by FBCD that all features affect the class label is relaxed in NCCD. This is brought about by introducing a threshold γ. Here, we ensure that only those features with a certain minimum value of mutual information with the class label are linked. The value of γ is calculated as the average of the normalized I(X_i; θ). We have the threshold τ, calculated as the average of the normalized I(X_i; X_j), conditioned on which the parents of a node are added. Similar to FBCD, we establish the parent-child dependencies based on CMI and MI2. To find the values of γ and τ which give the best network, both values were incremented in steps of 20 percent.
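The normalization step can be sketched as follows: MI values are divided by the corresponding joint entropies before the τ and γ averages are taken. Helper names are illustrative, not the paper's code.

```python
import numpy as np
from collections import Counter

def _H(*seqs):
    """Joint Shannon entropy (bits) of one or more discrete sequences."""
    tuples = list(zip(*seqs))
    n = len(tuples)
    return -sum((c / n) * np.log2(c / n) for c in Counter(tuples).values())

def nmi(xs, ys):
    """MI(X;Y) normalized by the joint entropy H(X,Y); lies in [0, 1]."""
    h_joint = _H(xs, ys)
    if h_joint == 0:
        return 0.0
    return (_H(xs) + _H(ys) - h_joint) / h_joint

def nccd_thresholds(features, label):
    """gamma from feature-class NMI (NMI1), tau from feature-feature
    NMI (NMI2), each averaged as in NCCD."""
    gamma = np.mean([nmi(f, label) for f in features])
    pairs = [nmi(features[i], features[j])
             for i in range(len(features))
             for j in range(len(features)) if i != j]
    tau = np.mean(pairs)
    return tau, gamma
```

Normalizing by the joint entropy bounds every value in [0, 1], which makes the averaged thresholds comparable across features with very different numbers of states.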
The computational complexity of both FBCD and NCCD is O(n²). Both leverage the CMI, NMI1, and NMI2 matrices, whose dimensions depend on the number of features, so a scan over a matrix row has an asymptotic complexity of O(n). The while loop iterates until the given nodes are empty and internally checks both the 'given' and 'selected' lists, resulting in a complexity of O(n²). The addition and removal of elements from these lists have an asymptotic complexity of O(1).

IV. RESULTS AND VALIDATION
This section provides an overview of five experiments conducted and their corresponding results to validate the performance of proposed algorithms.

A. DEVELOPING INFORMATION THEORY-BASED CAUSAL GRAPH
This experiment is aimed at building structural graphs that capture the causal dependencies among the features.
1) The values of I(X_i; X_j), I(X_i; θ), and I(X_i; X_j | θ), where X_i denotes a feature of the dataset and θ denotes the class label, are calculated using (4) and (5).
2) The average of the matrix generated using I(X_i; X_j) is calculated.
3) The value of the threshold τ is a percentage relative to the value calculated above.
4) Incrementing the value of τ by a step of 20% each time, we develop four networks for each dataset using FBCD.

B. SELECTING OPTIMUM THRESHOLD VALUES
This experiment is aimed at choosing the optimum value of the threshold, which will strengthen the minimum redundancy and maximum relevancy claim of the first experiment.
For selecting the optimum value of the threshold, the graphs developed at each value of the threshold are compared against each other using scoring functions. Scoring functions are metrics used to decide the best fit of a network and are also called network scores. Given the data and a scoring function, we must be able to develop a causal graph which maximizes the scoring function. Algorithm 2 (NCCD) takes as input the nodes X_1, ..., X_n (representing the n features), the matrices CMI ← I(X_i; X_j | θ), NMI1 ← I(X_i; θ) normalized with the entropy H(X_i, θ), and NMI2 ← I(X_i; X_j) normalized with the entropy H(X_i, X_j) for all features, and the class label θ, for all i, j = 1, ..., n. It initializes τ ← avg(NMI2) over all i ≠ j and γ ← avg(NMI1), then iterates while the list of given nodes is non-empty, adding arcs G.addArc(X_parent, child) that satisfy the thresholds, and outputs the graph G.

1) LOGLIK SCORING FUNCTION
This scoring function makes use of information theory concepts to compute the score of a network. Given a network G, the score is related to the compression that can be achieved over the data D with an optimal code induced by the network [43]. The score is the logarithm of the likelihood of the data D given the network G, so that log P_G(D) = −L(D | G), where L(D | G) is the length of the optimal encoding of D under G. The loglik score is computed as:

LL(G | D) = Σ_{i=1}^{n} Σ_{j=1}^{q_i} Σ_{k=1}^{r_i} N_ijk log (N_ijk / N_ij)

where N_ijk is the number of instances in D in which feature X_i takes its k-th value and its parents take their j-th configuration, and N_ij = Σ_k N_ijk.
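A minimal version of the LogLik computation, with maximum-likelihood parameters estimated from counts, might look as follows; the data layout (a list of per-instance dicts) is an assumption for illustration.

```python
import math
from collections import Counter

def loglik_score(data, parents):
    """
    data    : list of dicts, one per instance, mapping node -> discrete value
    parents : dict mapping each node to a tuple of its parent nodes
    Returns the sum over nodes and parent configurations of
    N_ijk * log(N_ijk / N_ij), i.e. the log-likelihood of the data
    under maximum-likelihood parameters.
    """
    score = 0.0
    for node, pas in parents.items():
        # N_ijk: joint counts of (parent configuration, node value)
        joint = Counter((tuple(d[p] for p in pas), d[node]) for d in data)
        # N_ij: counts of each parent configuration
        marg = Counter(tuple(d[p] for p in pas) for d in data)
        for (pa_val, x_val), n in joint.items():
            score += n * math.log(n / marg[pa_val])
    return score
```

The score is always non-positive and increases (towards zero) as the structure explains the data better, which is why richer structures need a complexity penalty or a Bayesian score for fair comparison.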

2) BAYESIAN DIRICHLET (BDE) SCORING FUNCTION
Proposed by Heckerman et al. [44]. Given a directed acyclic graph, BDe makes the assumptions of parameter independence, parameter modularity, uniformity of prior distributions, and absence of missing values [43], [44]. The equation below denotes the BD score function, where Γ denotes the gamma function, P(G) the prior probability of the network, and N'_ij the hyper-parameters of the network.

BD(G, D) = log P(G) + Σ_{i=1}^{n} Σ_{j=1}^{q_i} [ log ( Γ(N'_ij) / Γ(N_ij + N'_ij) ) + Σ_{k=1}^{r_i} log ( Γ(N_ijk + N'_ijk) / Γ(N'_ijk) ) ]

Since the N'_ijk are quite difficult to compute, an additional assumption of likelihood equivalence is considered, resulting in the BDe scoring function, in which the hyper-parameters are set as N'_ijk = N' × P(X_i = x_ik, Π_i = w_ij | G) for an equivalent sample size N'.

3) K2 SCORING FUNCTION
This is one of the first Bayesian scoring functions, proposed by Cooper and Herskovits [34]. It is a particular case of the Bayesian Dirichlet score with the uninformative assignment N'_ijk = 1, which corresponds to zero pseudo-counts:

K2(G, D) = log P(G) + Σ_{i=1}^{n} Σ_{j=1}^{q_i} [ log ( (r_i − 1)! / (N_ij + r_i − 1)! ) + Σ_{k=1}^{r_i} log N_ijk! ]

Here Γ(c) = (c − 1)! for an integer c, where Γ denotes the gamma function [43].
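The K2 term for a single node can be sketched with log-gamma functions to avoid factorial overflow; the data layout mirrors the LogLik sketch and the helper name is illustrative.

```python
import math
from collections import Counter

def k2_node_score(data, node, pas, r):
    """
    data : list of dicts, one per instance, mapping node -> discrete value
    pas  : tuple of parent nodes of `node`; r : number of states of `node`.
    Returns sum_j [ log (r-1)! - log (N_ij + r - 1)! + sum_k log N_ijk! ],
    using lgamma(c) = log((c-1)!) for integer c.
    """
    joint = Counter((tuple(d[p] for p in pas), d[node]) for d in data)
    marg = Counter(tuple(d[p] for p in pas) for d in data)
    score = 0.0
    for pa_val, n_ij in marg.items():
        score += math.lgamma(r) - math.lgamma(n_ij + r)
        for x_val in set(d[node] for d in data):
            n_ijk = joint.get((pa_val, x_val), 0)
            score += math.lgamma(n_ijk + 1)
    return score
```

Summing `k2_node_score` over all nodes (plus log P(G), often taken uniform) gives the network's K2 score; the log-gamma form stays numerically stable even for thousands of instances.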
The first experiment was conducted on the AIDS and Carcinoma datasets, i.e., the values of I(X_i; X_j), I(X_i; θ), and I(X_i; X_j | θ) were calculated. The average of the matrix generated using I(X_i; X_j) is used to determine the value of the threshold τ, which is increased in steps of 20%. For every increment of τ, a graph is developed, resulting in four graphs.
Table 1 shows the scores of the carcinoma dataset graphs obtained by varying the threshold value τ of FBCD. Refer to Figure 3. All the graphs are equivalent and hence have the same scores. Similarly, Table 2 shows the scores for the graphs of the AIDS dataset obtained by varying the threshold value τ in FBCD. Refer to Figure 6.
We were unable to obtain the network scores for the thyroid dataset due to the higher number of dependencies. We opted to analyze the graph using the average threshold. Refer to Figure 8.

C. NORMALIZING RELEVANCY AND REDUNDANCY
This experiment is aimed at normalizing the relevancy and redundancy constraints of the first experiment through the following steps:
1) The values of I(X_i; X_j), I(X_i; θ), I(X_i; X_j | θ), H(X_i, θ), and H(X_i, X_j), where X_i denotes a feature of the dataset and θ denotes the class label, are calculated using (3), (4), and (5).
2) The obtained values of I(X_i; X_j) and I(X_i; θ) are normalized using H(X_i, X_j) and H(X_i, θ), respectively.
3) The average of the matrix generated after normalizing I(X_i; θ) is calculated and used to determine γ.
4) The average of the matrix generated after normalizing I(X_i; X_j) is calculated and used to determine τ.
5) The threshold values of γ and τ are percentages of the values calculated in step 3 and step 4.
6) Increasing the values of γ and τ by a step of 20% each time, we develop sixteen graphs, one for each combination of the two thresholds.
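The sixteen-combination sweep in step 6 can be sketched as a simple grid over 20% increments of the two base averages; the numeric base values and the `build_nccd_graph` name are hypothetical placeholders.

```python
# Hypothetical base averages from steps 3 and 4 (illustrative values only).
base_gamma, base_tau = 0.31, 0.27

# gamma and tau are each taken at 20%, 40%, 60%, and 80% of their base
# averages, giving sixteen (gamma, tau) combinations -- one graph each.
grid = [(base_gamma * p, base_tau * q)
        for p in (0.2, 0.4, 0.6, 0.8)
        for q in (0.2, 0.4, 0.6, 0.8)]
print(len(grid))  # 16

# for gamma, tau in grid:
#     G = build_nccd_graph(data, gamma, tau)  # hypothetical Algorithm 2 call
```

Each of the sixteen graphs is then scored (next subsection), and the best-scoring combination is kept.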

D. COMPARISON OF THRESHOLDS
This experiment is aimed at comparing the various redundancy and relevancy thresholds generated by the above experiment and selecting the optimal graph. Here, the Loglik, BDe, and K2 scores are calculated for the sixteen graphs generated for every dataset, and the graph with the best score values is selected. The values of I(X_i; X_j), I(X_i; θ), I(X_i; X_j | θ), H(X_i, θ), and H(X_i, X_j), where X_i denotes a feature of the dataset and θ denotes the class label, are calculated using (3), (4), and (5), respectively, and the obtained values of I(X_i; X_j) and I(X_i; θ) are normalized using H(X_i, X_j) and H(X_i, θ), respectively.
The average of the matrix generated after normalizing I(X_i; θ) is calculated and used to determine γ, and the average of the matrix generated after normalizing I(X_i; X_j) is calculated and used to determine τ. Incrementing the values of γ and τ by 20% each time, we developed 16 graphs. Table 3 shows the network scores of the thyroid dataset graphs obtained by NCCD. Graphs 10, 11, 12 and 14, 15, 16 have the best BDe and K2 scores, and graph 13 has the best Loglik score. Refer to Figure 9.
Table 4 shows the network scores for the AIDS dataset with NCCD graphs. Graphs 13, 14, 15 and 16 give the best Loglik, BDe, and K2 scores. Refer to Figure 7.
Table 5 shows the network scores for the Carcinoma dataset with NCCD graphs. The scores exhibit uniformity across all the graphs. Refer to Figure 5.

E. VALIDATION
This experiment performs validation on the data sets using the graphical results of the previous experiments and compares the scores with the networks of NB and TAN.
This experiment involved the analysis of the AIDS, thyroid, and carcinoma data sets. Tables 6, 8, and 10 display the comparison of network scores for the respective data sets. Network scores, including the Loglik, BDe, and K2 scores, serve as quantitative measures of the quality and fit of the networks generated by the different algorithms. Tables 7, 9, and 11 provide insight into the structure of the networks generated by the algorithms, specifically highlighting the number of edges and the degree of the class label (θ) within these networks.
Remarkably, the network scores achieved by the NCCD algorithm are highly competitive with those of the established NB and TAN algorithms, as demonstrated in Tables 6, 8, and 10. The Loglik score indicates that our model fits the data well, capturing dependencies; the BDe score reflects the network's ability to capture underlying complex causal dependencies; and the K2 score was employed to evaluate the conditional independence along the edges of the network. An especially noteworthy observation, however, is the reduction in the number of edges directly connected to the class label when using the NCCD algorithm. This reduction is attributed to the information-theoretic measures included in the NCCD algorithm: the incremental adjustment of the thresholds optimizes the trade-off between preserving relevant dependencies and minimizing redundant edges.
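As a rough illustration of one of these measures, the Loglik score of a discrete network is the log-likelihood of the data under the DAG, summing each node's log-conditional probability given its parents. The sketch below assumes fully observed discrete data and maximum-likelihood parameter estimates; the names `loglik`, `data`, and `parents` are illustrative, not taken from the paper.

```python
import math
from collections import Counter

def loglik(data, parents):
    """Log-likelihood of a fully observed discrete data set under a DAG.
    `data` maps a variable name to its list of observed values; `parents`
    maps a variable name to the list of its parent variable names."""
    n = len(next(iter(data.values())))
    total = 0.0
    for var, pars in parents.items():
        # Empirical counts of (parent configuration, value) pairs and of
        # parent configurations alone.
        joint = Counter((tuple(data[p][i] for p in pars), data[var][i])
                        for i in range(n))
        marg = Counter(tuple(data[p][i] for p in pars) for i in range(n))
        # Each observation contributes log P(value | parent configuration),
        # with the probability estimated from the same data.
        total += sum(c * math.log(c / marg[cfg]) for (cfg, _), c in joint.items())
    return total
```

Denser graphs can never decrease this score on the training data, which is why Loglik is read alongside the complexity-penalizing BDe and K2 scores when comparing networks.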

V. DISCUSSION
Each data set was chosen to align with a specific objective of our data set selection. We used the thyroid data set, which contains 16 diagnostic features representing the patient's current thyroid condition, to verify objective (a). For objective (b) we used the carcinoma data set with 12 features. Features like 'Site', 'T_stage', and 'N_stage' have more than one dependent feature, describing tumor characteristics across four stages of disease progression at three different sites (the faucial arch, tonsillar fossa, and posterior pillar), and four features describe the state of the tumor node. This data set tests the model on precision and verifies (b). The AIDS data set was selected to verify (c); it contains 7 response features collected from the subjects under different treatment options at varying times.
Using FBCD, when the value of the threshold τ is increased in steps of 20%, all the data sets yield similar Loglik, BDe, and K2 scores. Hence, it can be inferred that the choice of threshold value did not significantly impact the performance of the algorithm in terms of these scores. Moreover, the Loglik score consistently outperformed both BDe and K2 across all the analyzed data sets. We compared our scores to those of existing algorithms in accordance with RQ1. By incorporating principles from information theory, our algorithm leverages the same fundamental concepts that have proven effective in related work.
Based on our causal graph analysis using NCCD, we observed variations in network scores across the different data sets. Notably, scores for the carcinoma data set remained largely consistent, as evident from Table 5. This observation aligns with RQ2, as the carcinoma data set comprises a comparatively limited number of features; it implies that the features of a small data set maintain their relevance without significant changes in their interactions. The scenario changed with the thyroid and AIDS data sets, which are characterized by higher dimensionality and increased complexity. In these cases, the network scores exhibited fluctuations, reflecting the challenges posed by data sets with a greater number of features.

FIGURE 5.
Graph with the best Loglik, BDe and K2 scores created using NCCD for the carcinoma dataset (refer to Table 5).
Figures 4 and 5 show the graphs of FBCD and NCCD with the best BDe, K2, and Loglik scores for the carcinoma data set. The NCCD graph has fewer arcs with respect to the class label than the FBCD graph. In Figure 4 (FBCD), the presence of 'sex' as a direct cause of both 'treatment (tx)' and 'class' could introduce confounding bias into our analysis, as it may unintentionally mediate the relationship between treatment and class. The structural refinement in Figure 5, in which 'sex' influences 'treatment' and 'treatment' directly impacts 'class', was algorithmically derived using NCCD to mitigate this potential confounding bias.
Figures 6 and 7 show the graphs of FBCD and NCCD with the best BDe, K2, and Loglik scores for the AIDS data set. The NCCD graph has fewer arcs with respect to the class label than the FBCD graph. We also observed a significant shift in the causal relationships between 'treatment (tx)', 'race', and 'class'.

FIGURE 7.
Graph with the best Loglik, BDe and K2 scores created using NCCD for the AIDS dataset (refer to Table 4).
In Figure 6 (FBCD), 'treatment' and 'race' both directly influenced 'class', potentially introducing confounding bias as they could mediate the relationship between 'treatment' and 'class'. Using NCCD (Figure 7), this structure was algorithmically adjusted: 'treatment' now influences 'race', and 'race' subsequently affects 'class'. Similarly, the causality between 'treatment' and 'priorzdv' was reconfigured. This modification mitigates confounding bias by creating a clearer and more direct pathway from 'treatment' to 'class'.
Figures 8 and 9 show the graphs of FBCD and NCCD with the best BDe and K2 scores for the thyroid data set. In Figure 8 (FBCD), 'on_thyroxine' is primarily associated with 'query_hyperthyroid', 'query_hypothyroid', and 'class', reflecting a focused relationship between medication usage and the thyroid conditions of interest. The NCCD reconfiguration in Figure 9 instead includes various features that may impact the use of thyroid medication. By removing certain edges, including the edge between 'on_thyroxine' and 'class', and incorporating these additional features into the associations, we mitigated confounding bias.
In our causal graph analysis, we successfully identified and eliminated redundant edges between the treatment features and the class feature. This refinement ensures that only the most appropriate relationships between the treatment feature, the class feature, and other relevant features remain, mitigating the potential for confounding bias to distort the causal dependencies and addressing RQ3.

VI. CONCLUSION AND FUTURE WORK
In this work, we hypothesized that incorporating new features changes the relevancy of established dependencies within the causal structure, leading to inaccurate dependencies. We proposed two algorithms that incorporate threshold values based on which features are connected to one another in a graph.
The first algorithm, FBCD, assumes that all features influence the outcome. The second algorithm relaxes this assumption by introducing a threshold that a feature must satisfy to be added as a parent of the outcome node. As a result, the graph has fewer arcs and is less complex. Graphs developed by NCCD were also able to eliminate certain features while maintaining the score of the network. This implies that features of less significance in predicting the class label have been eliminated, thus reducing the redundancy of the network.
By eliminating redundant connections, we have refined our causal graph to a more concise and relevant representation and have effectively minimized the potential for confounding bias to distort our results.
Our future work focuses on extending the methodology to make predictions with the treatment option as the class label by including more features in the analysis. In this way, we aim to develop a cause-and-effect model for predictions. Moreover, the temporal aspect of features such as age can also be included.

FIGURE 1 .
FIGURE 1. Proposed framework. The diagram illustrates the step-by-step process from dataset acquisition to final inferences and conclusions. The dotted line connects back from inferences and conclusions to each preceding step, emphasizing the iterative nature of the framework.

FIGURE 2 .
FIGURE 2. An example of a causal graph (G). Each directed arc represents a causal link, and the nodes represent different features.

FIGURE 3 .
FIGURE 3. An illustration of confounding dependencies. The unidirectional arrows represent the relationships in which T influences θ and C influences θ. The two-way arrow represents influence between T and C, which may occur in either direction, capturing their dependence.

FIGURE 4 .
FIGURE 4. Graph with the best Loglik, BDe and K2 scores created using FBCD for the carcinoma dataset (refer to Table 1).

FIGURE 6 .
FIGURE 6. Graph with the best Loglik, BDe and K2 scores created using FBCD for the AIDS dataset (refer to Table 2).

FIGURE 8 .
FIGURE 8. Graph created using FBCD for the thyroid dataset. (Network scores for the graph could not be obtained due to the higher number of dependencies.)

FIGURE 9 .
FIGURE 9. Graph with the best BDe and K2 scores created using NCCD for the thyroid dataset (refer to Table 3).
PUJIT PAVAN ETHA received the bachelor's degree in electronics and communication from Jawaharlal Nehru Technological University, Kakinada, India, in 2018. He is currently pursuing the Ph.D. degree in computational analysis and modeling (CAM) with Louisiana Tech University, USA. He is also a Graduate Research Assistant with the Data Mining and Machine Learning (DMML) Laboratory. His current research interests include machine learning in modeling complex, relational structures, graphs, and networks.

PRADEEP CHOWRIAPPA received the Ph.D. degree in computational analysis and modeling (CAM), in 2008. He is currently an Assistant Professor in computer science with Louisiana Tech University and the Manager and a Mentor of the Data Mining and Machine Learning (DMML) Laboratory. His research and educational interests include information and intelligent systems, with a focus on creating newer models for data-driven discovery. His current research interest includes the use of semi-supervised or active learning models to build context-aware clinical decision support systems with concept drift.

TABLE 1 .
Network scores for carcinoma data set with FBCD graphs.

TABLE 2 .
Network scores for AIDS data set with FBCD graphs.

TABLE 3 .
Network scores for thyroid data set with NCCD graphs.

TABLE 4 .
Network scores for AIDS data set with NCCD graphs.

TABLE 5 .
Network scores for carcinoma data set with NCCD graphs.

TABLE 6 .
Comparison of network scores -AIDS data set.

TABLE 7 .
Comparison of graphs -AIDS data set.

TABLE 8 .
Comparison of network scores -thyroid data set.

TABLE 9 .
Comparison of graphs -thyroid data set.

TABLE 10 .
Comparison of network scores -carcinoma data set.

TABLE 11 .
Comparison of graphs -carcinoma data set.