Identification of HIV-1 Vif Protein Attributes Associated With CD4 T Cell Numbers and Viral Loads Using Artificial Intelligence Algorithms

The Human Immunodeficiency Virus (HIV) Viral Infectivity Factor (Vif) is a 192-amino acid accessory protein essential to viral replication which counteracts host APOBEC3 proteins. APOBEC3 proteins interfere with the replication of HIV, hepatitis C virus, hepatitis B virus and retrotransposons. Vif is a recent candidate target for therapeutic and preventative interventions in HIV/AIDS, yet little is known about its clinical relevance. We describe the results of applying different machine learning algorithms (Apriori, Multifactor Dimensionality Reduction, C4.5, Artificial Neural Networks and ID3) to the search for associations between HIV-1 Vif protein attributes and clinical endpoints. Final iterations showed that mutations in BC Boxes, APOBEC motifs and Cullin5 binding motifs were together associated with higher initial CD4 T cell numbers, while mutations of specific APOBEC motifs coupled with the conservation of other APOBEC motifs were associated with lower historic CD4 T cell numbers. Conservation of specific APOBEC motifs and BC boxes was linked to lower initial viral loads, while different combinations of mutations in the Nuclear Localisation Inhibition Signal and BC Boxes were associated with higher historic viral loads. Further scrutiny of these combinations through traditional statistical methods revealed striking differences in both CD4 T cell numbers and viral loads between patients stratified by the presence or absence of these combinations. While artificial intelligence algorithms do not replace traditional statistical methods, our Artificial Intelligence (AI)-based approach highlights their usefulness for reducing the dimensionality of large and complex datasets and for proposing novel, otherwise unanticipated associations of biological patterns with functional relevance or clinical roles.


I. INTRODUCTION
The Human Immunodeficiency Virus (HIV) Viral Infectivity Factor (Vif) is a 192-amino acid (23 kDa) accessory protein essential to viral replication. Vif proteins counteract host proteins exhibiting anti-viral activity of the APOlipoprotein B messenger RNA Editing enzyme, Catalytic polypeptide-like (APOBEC3) family. APOBEC3 proteins are zinc-dependent deaminases responsible for nucleic acid editing (mutating cytidine to uridine in both viral DNA and RNA molecules). The APOBEC3 family has seven members (APOBEC3A, APOBEC3B, APOBEC3C, APOBEC3D, APOBEC3F, APOBEC3G and APOBEC3H). APOBEC3 proteins interfere with the replication and propagation of HIV, hepatitis C virus, hepatitis B virus and retrotransposons in humans [1]. APOBEC3G, discovered in 2002, is the best characterised member of the family; it forms stable complexes with the viral core which are ultimately encapsidated into budding virions [2], [3]. APOBEC3G hypermutates HIV DNA during the second round of viral replication, leading to non-functional virions. Additional APOBEC3 antiviral activities include plus-strand transfer interference, reverse transcription blockage, inhibition of viral DNA replication and priming, inhibition of viral DNA elongation and inhibition of proviral integration [3]. Vif binding of APOBEC3G recruits the elonginB (EloB)-elonginC (EloC)-Cullin5 (Cul5) E3 ligase complex, which in turn induces proteasomal degradation of the complex [3]. In addition, Vif blocks APOBEC3G catalytic activity, inhibits APOBEC3G incorporation into budding virions and interferes with APOBEC3G translation [4]. Moreover, recent evidence has demonstrated that certain Vif alleles derived from specific HIV-1 strains can modulate the host cell cycle to induce G2/M cell cycle arrest. While the exact way in which Vif proteins hijack the cell cycle has not been elucidated, both Vif-induced cell cycle arrest and APOBEC3G degradation seem to involve the same host factors: the cullin-5 (CUL5) E3 ubiquitin ligase, elongin B and C, as well as the core binding factor beta (CBFβ) [5]-[7]. As such, Vif allows HIV to evade host innate mechanisms that would otherwise protect cells.

The associate editor coordinating the review of this manuscript and approving it for publication was Hiram Ponce.
In recent years, the Vif accessory protein has become a candidate target for both therapeutic and preventative interventions in HIV/AIDS. Nonetheless, little is known about the clinical relevance of Vif protein features and diversity.
Current strategies for exploring the effect that viral polymorphisms have on clinical endpoints rely either on hypothesis-driven techniques (whereby attributes inferred to have functional implications are tested) or on exploratory studies searching for statistical associations which might be indicative of true interactions (which must then be confirmed). When information on the biological role of viral attributes (phenotypic or genotypic in nature) is scarce or unclear, exploratory studies allow novel or interesting associations to be discovered. However, the use of traditional statistical strategies in these exploratory studies (e.g., contingency tables with the χ² or Fisher's exact test) involves testing for the effect of numerous attributes, which implies the need for statistical corrections for multiple testing. This is particularly important in genome-wide association studies, where the number of variables to be tested gives rise to multiple opportunities for spurious associations to arise [8].
Machine learning is a sub-field of AI used in classification or regression problems and is particularly well suited to the detection of complex non-linear interactions in datasets having multiple independent input variables (e.g. attributes) as well as dependent outputs (e.g. clinical endpoint classes) [9], [10]. Some of the most important machine learning algorithms for use in classification problems include Artificial Neural Network (ANN), Support Vector Machine (SVM), Bayesian methods, Decision trees, Apriori and Multifactor Dimensionality Reduction (MDR) algorithms, among others [10]-[18]. ANNs are currently regarded as state-of-the-art algorithms for multi-dimensional dataset exploration and classification of cases. Unfortunately, both ANN and SVM are characterised by a "black-box" behaviour in which the underlying patterns of interactions remain invisible and largely unexplorable to the operator. In contrast, Apriori, MDR and Decision trees include mechanisms that help assess how informative the attributes are. In this study we describe the results of applying five different artificial intelligence algorithms (Apriori, MDR, C4.5, ANN and ID3) to the search for genetic associations between HIV-1 Vif protein attributes and clinical endpoints.

II. MATERIALS AND METHODS
A. STUDY COHORT
Seventy-seven proviral DNA Vif sequences derived from an archived cohort of HIV-infected antiretroviral therapy (ARV)-naive Mexican mestizo patients were included in this study, the protein features of which have been described previously [19]. Patient samples were referred to our laboratory by the state's public HIV/AIDS clinic "Centro Ambulatorio de Prevención y Atención en SIDA e ITS" from 2009 to 2014; no RNA samples were available for these patients. CD4 T cell numbers were assessed using a FACScan flow cytometer (Becton, Dickinson and Company, Franklin Lakes, NJ, USA) while HIV viral loads were determined with a COBAS Amplicor HIV-1 Monitor assay (version 1.5 Ultrasensitive, F. Hoffmann-La Roche Ltd., Basel, Switzerland) by the state reference laboratory (Departamento Estatal de Prevención y Control de VIH/SIDA, Servicios de Salud del Estado de San Luis Potosí). An in-depth description of the clinical features of this study cohort is provided in a previous publication [20]. Ethics approval for the study was granted by the corresponding Institutional Review Boards (Facultad de Medicina UASLP and the state's public health authority "Servicios de Salud del Estado de San Luis Potosí").

Figure 1 summarises the different protein substitutions present in the 77 sequences (n = 77). Analysis of Vif protein substitutions focused on functionally relevant regions and domains known to interact with other proteins relevant to the biological role of Vif. Substitutions outside of these regions were not considered, so as to facilitate interpretation of results and avoid inferences which have not been substantiated through functional or site-directed mutagenesis studies. In Figure 1, the Vif protein domains and functionally relevant regions included in our analysis are shown in bold type.
These include 17 attributes (a = 17): eight APOBEC3-protein binding domains (APOBEC-1 through -8), the nuclear localization inhibitory signal (NLIS), the two Core Binding Factor interaction sites (CBFβ-1 and -2), the three Cullin-5 binding domains (Cul5-1, -2 and -3) and the three Elongin B/C box sites (BCbox-1, -2 and -3). Other sites, including the tryptophans (W) involved in APOBEC3G binding, the MAPK phosphorylation sites, the Zn++-binding motifs, the protease processing site, additional phosphorylation sites and the dimerization sites, were not included in our analysis.

FIGURE 1. Vif protein attributes. Vif protein attributes (a = 17) included in the study are shown in bold type and include APOBEC-1 through -8, the nuclear localisation inhibitory signal (NLIS), the Core Binding Factor interaction sites (CBFβ-1 and -2), the Cullin-5 binding domains (Cul5-1, -2 and -3) and the Elongin B/C box sites (BCbox-1, -2 and -3). The number of sequences bearing non-synonymous substitutions at each of these sites is shown below the HXB2 reference sequence. Substitutions observed outside of these functional domains and regions are not shown for clarity.

The 17 Vif protein attributes present in each sequence were arbitrarily encoded as either Conserved (Cons) or Mutant (Mut) after comparing the physico-chemical nature of each substitution to the HXB2 Vif reference sequence. Conserved and Mutant attribute statuses were encoded as 0 and 1 in our working database, respectively. Conserved status was assigned when none of the sites within a region had a non-conservative substitution (with regard to HXB2), while Mutant status was assigned when at least one non-conservative substitution was present in that region.
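The Conserved/Mutant encoding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `REGIONS` map and the input format (a set of HXB2 positions already judged to carry a non-conservative substitution) are hypothetical, and only the two motifs whose coordinates are given in the text (APOBEC-2, residues 22-26, and BCbox-1, residues 144-149) are listed.

```python
# Hypothetical region map: attribute name -> HXB2 residue positions (1-based).
# Only motifs whose coordinates are stated in the text are included here.
REGIONS = {
    "APOBEC-2": range(22, 27),    # 22-KSLVK-26
    "BCbox-1": range(144, 150),   # 144-SLQ(Y/F)LA-149
}

def encode_attributes(nonconservative_sites, regions=REGIONS):
    """Encode one sequence as {attribute: 0 (Conserved) or 1 (Mutant)}.

    nonconservative_sites: positions (HXB2 numbering) at which the sequence
    carries a non-conservative substitution; how "non-conservative" is
    decided is outside the scope of this sketch.
    """
    return {
        name: int(any(pos in span for pos in nonconservative_sites))
        for name, span in regions.items()
    }

# A sequence with a non-conservative change at position 22 only
print(encode_attributes({22}))
```

A region is flagged Mutant as soon as a single qualifying substitution falls inside it, which mirrors the "at least one non-conservative substitution" criterion.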

C. ATTRIBUTES AND DATABASE COMPILATION
Clinical information for each of the patients included in this study, along with their corresponding Vif protein attributes, was compiled into a database. The patient's CD4 T cell numbers and viral loads (VL) assessed at the time of initial medical examination are designated herein as initial CD4 and VL. Median values of each patient's CD4 T cell numbers and viral loads assessed every three months during follow-up were calculated (after confirming that they were not normally distributed) and are designated herein as historic CD4 and VL. Patient-derived sequences (S1 through S77 in Table 1) were stratified into <500 or ≥500 CD4 T cells/µL and ≥10,000 or <10,000 cp/mL viral load groups for each of the four categorical clinical endpoint classes (initial CD4, historic CD4, initial VL and historic VL) based on established criteria [21]. CD4 classes were encoded as 1 if CD4 T cells <500 cells/µL and VL classes as 1 if VL >10,000 cp/mL; both were encoded as 0 otherwise (see Table 1).
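As a minimal sketch of the endpoint encoding just described, the following Python function derives the four binary classes for one patient. The function and key names are hypothetical, and the handling of a viral load of exactly 10,000 cp/mL (treated here as class 0) is an assumption, since the text uses both ">" and "≥" for this cut-off.

```python
from statistics import median

CD4_CUTOFF = 500      # cells/µL
VL_CUTOFF = 10_000    # cp/mL

def encode_patient(initial_cd4, initial_vl, followup_cd4, followup_vl):
    """Return the four binary clinical endpoint classes for one patient.

    Historic values are the medians of the follow-up measurements.
    Class 1 means <500 CD4 T cells/µL or >10,000 cp/mL (boundary
    handling for the viral load is an assumption of this sketch).
    """
    historic_cd4 = median(followup_cd4)
    historic_vl = median(followup_vl)
    return {
        "initial_cd4": int(initial_cd4 < CD4_CUTOFF),
        "historic_cd4": int(historic_cd4 < CD4_CUTOFF),
        "initial_vl": int(initial_vl > VL_CUTOFF),
        "historic_vl": int(historic_vl > VL_CUTOFF),
    }

# Illustrative (invented) measurements for a single patient
print(encode_patient(450, 25_000, [520, 610, 480], [9_000, 12_000, 400]))
```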

D. ARTIFICIAL INTELLIGENCE ALGORITHMS
Here, we introduce our AI-based approach for the identification of the best Vif attribute combinations associated with each of the four clinical classes (see Figure 2). Individual clinical endpoint class databases (initial CD4, historic CD4, initial VL and historic VL) were screened through three AI algorithms (Apriori, MDR and C4.5) to enhance our identification of Vif attributes repeatedly associated with a clinical class. The Apriori, MDR and C4.5 algorithms identified rules, models or decision trees, respectively [15], [22], [23]. The Apriori algorithm was implemented on the Waikato Environment for Knowledge Analysis (WEKA) workbench v3.6 to generate rules associated with each clinical class [24]. Rules include a body (a string of Vif attributes) associated with a head (clinical endpoint class). Apriori is very computationally expensive and is not suited to high-dimensional datasets. The inclusion of only 17 attributes in Apriori produces around 1.8 million rules.

TABLE 1. The conserved or mutated state of the 17 Vif protein attributes present in regions of interest (APOBEC-1 through BCbox-3) of the 77 patient (Pt) sequences was encoded as 0 or 1, respectively. Clinical endpoint classes given in the far-right columns were encoded as 1 when <500 CD4 T cells/µL or >10,000 cp/mL, or 0 otherwise.

FIGURE 2. AI-based approach. Selection of Vif protein attributes through artificial intelligence algorithms first required establishing a baseline classification using ANN and subsequently selecting the most informative attributes using the Apriori algorithm, MDR and C4.5 to determine which of these improved classification performance in a second-round analysis with ANN. Vif protein attributes selected through this procedure were then used as input for inducing decision trees with ID3 to further select Vif attributes and their status for final testing through traditional statistical tests.

The MDR algorithm detects, and allows the user to visualise, non-additive combinations and interactions of attributes influencing a clinical class. MDR is currently regarded as a non-parametric, model-free alternative to traditional statistical techniques [25], [26]. MDR is also very computationally expensive and was therefore limited to the generation of only six models having from 1 to a maximum of 6 Vif protein attributes. As model overfitting is common to most AI algorithms, a model's suitability for generalisation was estimated through 10-fold Cross-Validation (CV). Accuracy is a measure of a model's capacity to correctly identify true-positive and true-negative cases against the total number of cases available. Balanced Accuracy (BA), in contrast, is calculated by adding the fractions of correctly identified cases per class and dividing by the number of classes, and is therefore less affected by data imbalance. The best MDR models were therefore selected on the basis of CV and BA.

The C4.5 algorithm used for decision tree induction was also implemented on WEKA, where it is designated J48. This algorithm produces decision trees in which different Vif attributes are hierarchically arranged and related to their clinical endpoint class. Trees were generated by modifying the algorithm's pruning parameter (the confidence factor, CF) in 0.01 increments from 0.01 to 0.51, thereby producing 51 possible trees. The hierarchical relevance of each attribute is automatically established based on the information gain ratio during tree induction. Non-redundant and informative trees (i.e. those not having a single branch) were considered for further analysis.
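The difference between accuracy and balanced accuracy described for MDR model selection can be illustrated with a short Python sketch. The function names and the toy labels are hypothetical; the example simply shows why a classifier that always predicts the majority class can look acceptable by accuracy while being uninformative by BA.

```python
def accuracy(y_true, y_pred):
    """Fraction of all cases that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class fractions of correctly classified cases."""
    per_class = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        per_class.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(per_class) / len(per_class)

# Imbalanced toy data: 8 cases of class 1, 2 cases of class 0,
# and a classifier that always predicts class 1.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A BA of 0.5 corresponds to chance level for two classes, which is in line with discarding models that do not exceed a 50% BA minimum.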
ANN (Multi-Layer Perceptron) was also implemented on WEKA, and baseline classification accuracy was determined by including all 17 Vif protein attributes and clinical classes during the training process [27]. Once top-ranking rules, models and trees had been generated by the individual AI algorithms (Apriori, MDR and C4.5), ANN classification performance was assessed using the attributes from these as input. Vif protein attributes that were consistently present in the best Apriori-ANN, MDR-ANN and C4.5-ANN results were then selected as input for further processing using the ID3 algorithm. ID3, also implemented on WEKA, generates un-pruned decision trees displaying the hierarchy of attributes and their status (conserved or mutated) as well as their relationship with each clinical class [23], [28]. Finally, the Vif protein attribute-status combinations thus identified by AI algorithms were tested through traditional statistical tests (see Figure 2). Full nucleotide and amino acid sequences for each patient were not used directly in our analysis, given the computational expense implied and to avoid drawing inferences about polymorphisms which have to date not been shown to be relevant to the biological role of Vif.
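ID3 ranks attributes by information gain when inducing a tree. The sketch below shows that criterion on binary attribute and class vectors; it is a generic illustration with invented toy data, not the WEKA implementation used in the study, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Reduction in class entropy obtained by splitting on one attribute."""
    n = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: 0 = Conserved, 1 = Mutant; class 1 = <500 CD4 T cells/µL
clinical_class = [1, 1, 0, 0]
informative = [1, 1, 0, 0]     # splits the classes perfectly
uninformative = [1, 0, 1, 0]   # tells nothing about the class

print(information_gain(informative, clinical_class))
print(information_gain(uninformative, clinical_class))
```

At each node, the attribute with the highest gain is placed first, which is what produces the attribute hierarchy displayed by the resulting trees.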

E. TRADITIONAL STATISTICAL ANALYSIS
Frequencies of Vif protein mutations were calculated by direct counting of attributes (mutant/conserved sites) and expressed as the percentage of the sequences bearing each. Statistical significance of attribute frequency differences between clinical endpoint groups relied on two-sided Fisher's exact test and binary logistic regression tests for independence using IBM SPSS Statistics (version 21, IBM Corporation, USA). Covariates used in logistic regression analysis included all attributes found statistically significant.
Significance was established at p < 0.05. Correction for multiple tests employed the Benjamini-Hochberg step-up procedure [29]. Comparison of non-stratified (real) CD4+ T cell numbers and viral loads between groups having or lacking attributes, attribute status or attribute combinations relied on either the t-test with Welch's correction or the Kolmogorov-Smirnov test with two-tailed p values, depending on the normality of their distribution, using GraphPad Prism 6 (GraphPad Software, Inc., USA).
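For readers unfamiliar with the Benjamini-Hochberg step-up procedure, a minimal Python sketch follows. It is a generic implementation with invented p-values, not the SPSS routine used in the study, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Step-up rule: find the largest rank k (p-values sorted ascending) with
    p_(k) <= (k / m) * alpha, then reject all hypotheses ranked 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

# Invented p-values: only the largest one meets its own threshold,
# yet the step-up rule then rejects every hypothesis ranked below it.
print(benjamini_hochberg([0.02, 0.03, 0.04, 0.045]))
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]))
```

With m = 17 tests, one per Vif attribute, the strictest threshold applied by this procedure is 0.05/17 ≈ 0.0029.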

III. RESULTS
The N-terminal APOBEC3 binding site (¹⁴DRMR¹⁷ in Figure 1) was highly conserved and therefore excluded from subsequent analysis. Premature stop codons prevented two patient sequences from providing information for some Vif attributes. Apriori identified a total of 511,552 rules associated with clinical classes: 135,598 for initial CD4 T cell numbers (129,903 associated with <500 CD4 T cells/µL), 138,062 for historic CD4s (133,151 associated with <500 CD4 T cells/µL), 112,926 for initial VLs (85,255 associated with >10,000 cp/mL) and 124,966 for historic VLs (124,966 associated with >10,000 cp/mL). An example of the output is shown in Table 4 in the Appendix.
MDR produced 6 different models having combinations of one to a maximum of six Vif attributes. The three top models associating Vif attributes with initial CD4 T lymphocyte numbers had 6, 2 and 1 attributes, exhibited BAs of 56.45%, 56.45% and 59.58% and CV consistencies of 5/10, 6/10 and 8/10, respectively. MDR models for historic CD4 T lymphocyte numbers had 3, 2 and 1 attributes, exhibited BAs of 55.17%, 55.84% and 56.66% and CV consistencies of 5/10, 6/10 and 6/10, respectively. For initial VLs, the models had 5, 3 and 4 attributes, BAs of 57.65%, 60.46% and 61.73% and CV consistencies of 5/10, 8/10 and 6/10, respectively. For historic VLs, only two models were considered, as the third best did not exceed the 50% BA minimum. These models had only 2 and 1 attributes, BAs of 57.58% and 62.88% and CV consistencies of 7/10 and 10/10, respectively. See Table 5 in the Appendix for detailed results.
Of the 51 possible trees generated by C4.5, only four unique trees were identified which associated Vif attributes with the initial CD4 class, two trees for historic CD4s, four for initial VLs and two for historic VLs. The remaining trees were either non-informative or redundant; see Table 6 in the Appendix for detailed results.
ANN baseline accuracy using the 17 Vif protein attributes for the classification of patients into each of the four clinical classes (initial and historic CD4 T lymphocytes and initial and historic VLs) was 76.62%, 71.43%, 64.94% and 55.84%, respectively. ANN classification performance exceeded this baseline for eight Apriori rules, three MDR models and ten C4.5 trees in the initial CD4 T lymphocyte analysis; for 19 rules, three models and 13 trees in the historic CD4 T lymphocyte analysis; for one rule, three models and ten trees in the initial VL analysis; and for 12 rules, two models and 11 trees in the historic VL analysis. The specific contribution of each Vif protein attribute to each of these results is shown in Table 2. Average ANN classification performance using the attributes suggested by the Apriori, MDR and C4.5 algorithms in the initial CD4 T cell analysis was 77.3, 75.8 and 78, respectively. Those for the historic CD4 T cell group were 77.4, 79.2 and 79.3; those for the initial viral load group were 60.8, 60.6 and 65.42; and those for the historic viral load group were 56.9, 63.6 and 62.5, respectively. Vif attribute occurrence in each of the individual algorithms' results, as well as their contribution to the ANN classification performance, was weighted arbitrarily to compensate for the fact that most rules produced by Apriori were ignored (as only the top 10 for each class were used). As such, the attribute's occurrence was increased 2-fold for Apriori, 6-fold for MDR and 8-fold for C4.5 (far-right columns in Table 2). Ultimately, the three attributes with the highest-ranking weighted contribution were selected for inclusion as input attributes for ID3. Only the two highest-ranking attributes in the historic CD4 T lymphocyte group were selected, as the third-highest weighted contribution (that of APOBEC-7) was below half of that of the highest (APOBEC-2).
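The weighting and selection step can be sketched as follows. This is a hedged illustration only: the occurrence counts are invented, the function names are hypothetical, and treating the "below half of the highest" exclusion as a general rule is an assumption, since the text reports it only for the historic CD4 group. The exact way occurrence and ANN contribution were combined in Table 2 is not reproduced here.

```python
# Fold increases stated in the text for each algorithm's attribute occurrence
WEIGHTS = {"Apriori": 2, "MDR": 6, "C4.5": 8}

def weighted_contribution(occurrences):
    """occurrences: {attribute: {algorithm: count}} -> {attribute: weighted score}."""
    return {
        attr: sum(WEIGHTS[alg] * n for alg, n in counts.items())
        for attr, counts in occurrences.items()
    }

def select_attributes(scores, top_n=3, min_fraction=0.5):
    """Keep the top_n attributes, dropping any scoring below
    min_fraction of the best score (assumed general rule)."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    best = scores[ranked[0]]
    return [a for a in ranked if scores[a] >= min_fraction * best]

# Invented occurrence counts, for illustration only
occurrences = {
    "APOBEC-2": {"Apriori": 5, "MDR": 3, "C4.5": 4},
    "APOBEC-3": {"Apriori": 4, "MDR": 2, "C4.5": 3},
    "APOBEC-7": {"Apriori": 2, "MDR": 1, "C4.5": 2},
    "NLIS": {"Apriori": 1, "MDR": 0, "C4.5": 1},
}
scores = weighted_contribution(occurrences)
print(scores)
print(select_attributes(scores))
```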
ID3 produced a single tree for each of the clinical endpoints using these AI-suggested Vif protein attributes. The tree produced for the initial CD4 T lymphocyte group had three levels, six nodes and seven branches (see Figure 3a). That for the historic CD4 T lymphocyte group had two levels, three nodes and four branches (see Figure 3b), whereas the trees produced for both the initial and historic viral load groups had three levels, five nodes and six branches (see Figure 3c and Figure 3d, respectively). Each of the branches (attribute-status combinations) indicated by these trees was then used to manually re-encode our original database, stratifying each patient's sequence into groups having or not having these combinations for further traditional statistical analysis. Table 3 summarises the attribute-status combinations which proved to be statistically significant for each of the clinical endpoints. Two combinations were found to be significant for the initial CD4 T lymphocyte group, both suggesting a protective effect against having fewer than 500 CD4 cells/µL. Two combinations were significant for the historic CD4 T lymphocyte group, one acting as a possible risk factor for having fewer than 500 CD4 T cells/µL and the other showing a protective effect. Two combinations were significant for initial VLs, the first protecting from viral loads in excess of 10,000 cp/mL and the second acting as a risk factor for these high viral titres. The single combination having statistical significance in the historic VL group was found to act as a risk factor for high viral loads.
When the real (un-stratified) CD4 T lymphocyte numbers and viral loads of each patient were analysed after grouping patients into those having these attribute combinations and those lacking them, striking differences were observed for both initial and historic CD4 T lymphocyte numbers and for initial VL, but not for historic VL (see Figure 4). Mean initial CD4 T lymphocyte numbers among patients having BCbox-3 Mut, APOBEC-4 Mut, Cul5-3 Mut (n = 4) were 649.8 ± 35.46 Standard Error of the Mean (SEM) and ± 70.91 Standard Deviation (SD) cells/µL, while those lacking this attribute combination (n = 56) had 256.6 ± 17.73 SEM and ± 135.0 SD cells/µL, p = 0.0003 (see Figure 4a). Mean historic CD4 T lymphocyte numbers among patients having APOBEC-2 Cons, APOBEC-3 Cons (n = 13) were 617.5 ± 27.28 SEM and ± 94.52 SD cells/µL, while those lacking this attribute combination (n = 29) had 335.5 ± 23.6 SEM and ± 127.1 SD cells/µL, p < 0.0001 (see Figure 4b). Likewise, mean initial viral loads among patients having APOBEC-2 Cons, BCbox-1 Cons, BCbox-2 Cons (n = 11) were 1368 ± 616.8 SEM and ± 1950 SD cp/mL, while those lacking this attribute combination (n = 40) had 193,666 ± 43,341 SEM and ± 274,115 SD cp/mL, p < 0.0001 (see Figure 4c; note that the figure uses a logarithmic scale for visualization). No Vif protein attribute alone proved to be associated with significant differences in CD4 T cell numbers or VL in either the initial or historic groups.

For comparison's sake, the effect that each of the 17 Vif attributes had on the four clinical endpoints was tested through traditional statistical methods. The frequency of BCbox-2 mutations was higher among patients having ≥500 initial CD4 T cells/µL (81.3%) than in patients having <500 cells/µL (49.2%), p = 0.025, suggesting a protective effect. This difference became even more marked when all BCbox mutations (BCbox-1 through -3) were considered as a single attribute (93.8% versus 59.3%, respectively, p = 0.014). This last effect remained significant after logistic regression (odds ratio OR = 0.097, 95% CI 0.012-0.786, p = 0.029). With regard to historic CD4 T cells, only APOBEC-2 mutations were found to be detrimental. The frequency of APOBEC-2 mutations was higher among patients with <500 cells/µL (42.6%) than among those having ≥500 cells/µL (12.5%), OR = 5.21 (95% CI 1.08-25.0), p = 0.039. Interestingly, APOBEC-2 mutations were also associated with higher initial viral loads, as they were more frequent among patients having ≥10,000 cp/mL (44.9%) than among patients having <10,000 cp/mL (21.4%), OR = 2.988 (95% CI 1.031-8.657), p = 0.049. Mutations of the NLIS were the only attributes associated with greater historic viral loads. Mutated NLIS were present in 48.5% of patients having ≥10,000 cp/mL historic VL and in only 23.3% of those having <10,000 cp/mL, OR = 3.106 (95% CI 1.162-8.302), p = 0.029. All of the previously mentioned statistically significant findings lost significance after correcting for multiple tests, which lowered the alpha level from 0.05 to 0.003 and 0.002, respectively.
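The odds ratios and confidence intervals reported above can be checked with a short Python sketch based on a 2×2 table and a Wald interval. The function name is hypothetical and the cell counts are an assumption: they are reconstructed from the reported NLIS percentages (16 of 33 patients with high historic VL and 10 of 43 with low historic VL carrying a mutated NLIS), not taken from the source tables. The study itself used logistic regression in SPSS.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a, b: mutated / conserved counts in the high-VL group
    c, d: mutated / conserved counts in the low-VL group
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Assumed counts reconstructed from 48.5% (16/33) and 23.3% (10/43)
or_, lo, hi = odds_ratio_ci(a=16, b=17, c=10, d=33)
print(round(or_, 3), round(lo, 3), round(hi, 3))
```

Under these assumed counts the sketch gives values that agree, to three decimal places, with the OR and 95% CI reported for NLIS mutations and historic viral load.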

IV. DISCUSSION
Initial evaluations of HIV sequences, such as the screening for antiretroviral drug resistance mutations in a treatment-naive patient, are best performed using plasma-derived viral RNA at a time when the patient exhibits high viral loads and before initiating antiretroviral therapy. As of 2013, all Mexican HIV-infected patients are immediately prescribed antiretroviral drugs on diagnosis in accordance with World Health Organisation recommendations and irrespective of their CD4 T cell counts or clinical features (active tuberculosis, hepatitis B infection and/or pregnancy) [30]. Whereas plasma-derived viral RNA sequences provide information on the most replication-competent viral species present at the time of sampling, the use of proviral DNA provides information on archived viral sequences which have been present since the time of HIV integration. As such, the presence of premature stop codons in our Vif sequences highlights the fact that some features may represent archived, genetically defective and replication-incompetent genomes. Nevertheless, proviral DNA has been shown to be an alternative source of viral nucleic acids for molecular studies such as genotyping, genotypic tropism testing and phylogenetic studies in patients having low to undetectable viral loads [31]-[35].
Our analysis focused on assessing the clinical relevance of Vif substitutions at two distinct clinical phases of HIV infection: 1) at the time of initial medical evaluation, a point at which patients had not yet been subjected to antiretroviral therapy, and 2) during follow-up, after being exposed to the effect of antiretroviral drugs. Our identification of Vif substitutions associated with clinical variables at the time of initial evaluation suggests that Vif polymorphism might have an effect on the progression of unchecked HIV infections. On the other hand, the identification of associations later in the medical follow-up of patients suggests that these effects might still be present in spite of current antiretroviral therapy. We developed an AI-based attribute combination discovery approach which, when combined with traditional statistical methods, is capable of identifying associations between sequence traits and clinical endpoints in HIV/AIDS. Our results highlight the capacity of AI algorithms to guide traditional statistical methods in the study of the biological role and clinical relevance of factors for which hypothesis-driven techniques would otherwise be unsuccessful or laborious. Our AI-based approach is a complex analysis pipeline which allows us to identify Vif attributes repeatedly flagged as important by different AI algorithms, so as to further refine the selection of those attributes that are later explored through traditional statistics. Examination of the results generated by individual AI algorithms (see Tables 4-6 in the Appendix) provides evidence that no single algorithm was capable of identifying all attributes found to be significant in the final iteration. In addition, this application demonstrates the way AI algorithms can condense data with little human intervention.
To our knowledge, this represents the first report associating complex multi-dimensional combinations of Vif protein attributes with two of the most important clinical follow-up parameters in HIV/AIDS: CD4 T lymphocyte numbers and viral loads. Strict adherence to the principles of testing and correction for this simple dataset comprising 77 combinations of 17 attributes and 4 different outputs would have limited the statistical power of most findings. The results produced by AI algorithms alone were congruent with those produced through traditional methods. BCbox mutations were associated with high initial CD4 T cell numbers in univariate analysis but were also part of the final AI attribute combination associated with these. These results are in agreement with those published previously describing the epistatic effects of some pairs of amino acids encompassing the BCbox regions of Vif proteins on low CD4 T cell counts [36]. The importance of BCbox attributes highlighted in our results is also in agreement with previous findings regarding their functional role. Vif hijacks the E3 ligase using the BCbox region that interacts with ElonginC and a zinc finger motif that interacts with Cullin5. Vif recognition and binding of APOBEC3 through Cul5 involves forming a complex with EloB and EloC, which in turn recruits CBFβ. The interaction between Vif and EloC is mediated by the ¹⁴⁴SLQ(Y/F)LA¹⁴⁹ motif (BCbox-1) present in the viral Elongin B/C-box [37], [38]. This domain is perhaps the most critical Vif region determining APOBEC3 protein suppression. Previous reports have shown that the short side chain of Ala 149 plays a crucial role in EloC binding. As both the Vif-induced G2/M arrest and APOBEC3G degradation effects involve interactions with virtually the same host ubiquitin ligase machinery, including Cul5, EloB and EloC as well as CBFβ, the exact biological role through which Vif exerts the observed effects in our study cohort cannot be ascertained. Nevertheless, Vif's capacity to induce G2/M arrest has been observed in HIV viruses derived from clinical samples, and this capacity has been shown to be associated with increased viral replication in T cell cultures in vitro [6], [39].
Similarly, APOBEC-2 mutations were associated with the risk of low historic CD4 T cell numbers and high initial viral loads in univariate analysis but also formed part of the AI combinations associated with the risk of low historic CD4 T cell numbers and high historic viral loads. Interestingly, the conservation of APOBEC-2 was shown to be associated with lower initial viral loads by AI algorithms. That APOBEC3 binding sites are among the most repeatedly encountered Vif protein regions associated with both CD4 T cell numbers and VL is not surprising. Different motifs are used selectively by Vif to bind different APOBEC3 family members [37]. For Vif to exert its action it must bind APOBEC3 to subsequently act as the substrate binding subunit of a cullin RING ligase-5 (CRL5) E3 ligase complex [37]. Previous authors have demonstrated that the single most important factor governing Vif functionality, and therefore clinical relevance, corresponds to the APOBEC3 binding regions. The importance of the APOBEC-2 motif ²²KSLVK²⁶ was first established in a cohort of Brazilian treatment-naïve patients, where the K22H mutation was shown to be associated with lower CD4 T cell numbers and higher viral loads [40]. With regard to the relevance of the APOBEC-4 motif, previous studies have also highlighted the importance of positions 39 and 48 for Vif to counteract the influence of APOBEC3H proteins [41], [42]. It is also worth noting that the APOBEC-2 and -3 motifs, both of which were determined to be clinically relevant in our study, are located in the N-terminal region of the Vif protein and in sites that have previously been shown to be under positive selection, whereas the C-terminal APOBEC-8 motif did not seem to be associated with clinical classes [36]. In contrast, NLIS status was the only attribute exhibiting contradictory associations between traditional and AI methods.
Artificial intelligence is the capability of machines to imitate intelligent human behaviour: once trained through mathematical and statistical techniques, they can predict previously unseen patterns without having been explicitly programmed to do so. The ability of AI algorithms to analyse datasets and detect patterns in an n-dimensional feature space allows them to suggest attribute combinations that would be intractable for humans to find. This capacity has been illustrated in our results and further substantiated through traditional statistical techniques. While AI does not completely phase out traditional statistical methods, the AI-based approach proposed herein highlights its usefulness in reducing the dimensionality of large and complex datasets and in proposing novel, otherwise unimaginable, associations of biological patterns with functional relevance or clinical roles. Determining how well AI applications generalise to real-world patient management is critical to the development of a truly successful implementation strategy.
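The idea that a combination of attributes can be more strongly associated with a clinical class than any single attribute can be sketched in a few lines. The example below is a minimal illustration only: it enumerates attribute combinations by brute force and reports support and confidence in the spirit of association rule mining, and it is not the Apriori, MDR or C4.5 implementation used in this study. The patient records, attribute names (`APOBEC-2_mut`, `BCbox_mut`, `NLIS_mut`), class labels and the function `rule_stats` are all hypothetical and invented for the illustration.

```python
from itertools import combinations

# Hypothetical toy records: each patient is a set of binary Vif attributes
# plus a clinical class label. These values are invented for illustration
# and are NOT data from the study.
patients = [
    ({"APOBEC-2_mut", "BCbox_mut"}, "low_CD4"),
    ({"APOBEC-2_mut", "BCbox_mut", "NLIS_mut"}, "low_CD4"),
    ({"APOBEC-2_mut"}, "high_CD4"),
    ({"BCbox_mut"}, "high_CD4"),
    ({"NLIS_mut"}, "high_CD4"),
    ({"APOBEC-2_mut", "BCbox_mut"}, "low_CD4"),
]


def rule_stats(patients, target, max_size=2):
    """Return {combination: (support, confidence)} for rules 'combination -> target'.

    support    = fraction of all patients carrying the combination AND the target class
    confidence = fraction of carriers of the combination that belong to the target class
    """
    attributes = sorted(set().union(*(attrs for attrs, _ in patients)))
    n = len(patients)
    stats = {}
    for size in range(1, max_size + 1):
        # Brute-force enumeration; Apriori would prune low-support candidates instead.
        for combo in combinations(attributes, size):
            carriers = [label for attrs, label in patients if set(combo) <= attrs]
            if not carriers:
                continue
            hits = sum(label == target for label in carriers)
            stats[combo] = (hits / n, hits / len(carriers))
    return stats


stats = rule_stats(patients, target="low_CD4")
for combo, (support, confidence) in sorted(stats.items(), key=lambda kv: -kv[1][1]):
    print(combo, f"support={support:.2f}", f"confidence={confidence:.2f}")
```

In this toy dataset each of the first two attributes alone predicts the target class with a confidence of 0.75, whereas their combination reaches 1.0 at the same support. Any candidate combination flagged this way would still need to be tested with conventional statistical methods, as was done in this work.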

V. CONCLUSIONS
This paper proposes an AI-based approach that was shown to be capable of identifying associations between HIV sequence traits and clinical endpoints in AIDS. These results highlight the capacity of AI algorithms to guide traditional statistical methods in the search for novel interactions of virus and host genes and proteins in the absence of hypothesis-driven techniques. To the best of our knowledge, this is the first report describing such novel and complex multi-dimensional associations of Vif protein attributes with CD4 T lymphocyte numbers and viral loads in HIV/AIDS. While this study focused on the genetically distinct Mexican mestizo population, we envisage that future applications of this AI-based approach are very likely to prove beneficial and informative for other HIV-infected human groups. These results open the possibility of incorporating the study of novel genetic marker combinations into the routine clinical management algorithms currently in use, further complementing the arsenal of molecular tools available for people living with HIV and AIDS.
Although most of our findings have previously been reported independently, the way in which their combinations interact and modulate clinical endpoints has not been explored before. While we are aware of the limitations imposed by the use of proviral DNA and by the size of our dataset, our sequential approach, which employs artificial intelligence algorithms followed by traditional statistical methods, was capable of identifying Vif sequence attributes associated with clinical endpoints. The low number of patients enrolled in this pilot study limits the statistical power, and hence the clinical significance, of the discovered associations. This does not, however, undermine the need to investigate these findings further in larger and different study cohorts. Our findings support the notion that similar approaches have great utility in guiding conventional association-discovery approaches in the biomedical sciences.

Tables 4-6 correspond to supplementary material which complements Tables 2-3 and Figures 3-4 (see Section III). Table 4 summarises the results obtained with the Apriori algorithm, Table 5 those obtained with MDR and Table 6 those obtained with the C4.5 algorithm.