Performance Enhancement of Predictive Analytics for Health Informatics Using Dimensionality Reduction Techniques and Fusion Frameworks

Predictive analytics has become an essential area of research in health informatics. The availability of multi-source and multi-modal data in healthcare has made disease prediction, diagnosis, and medication more effective and reliable. However, analysis and decision making become challenging tasks, particularly when data are in multiple formats and come from different sources. In this study, different frameworks are proposed to handle multi-nature data at different levels for predictive analytics. Dimensionality reduction techniques are applied to extract relevant features and enhance the analysis. To improve the performance of predictive analytics at different fusion levels, the potential benefits of multi-modal data are discussed. Moreover, notable improvements in prediction accuracy have been observed through experimental evaluation of the proposed frameworks. Furthermore, the issues encountered during dimensionality reduction and fusion are also highlighted.


I. INTRODUCTION
Over the past few decades, the evolution of Information and Communication Technology in healthcare has brought revolutionary changes in data collection tools and techniques. Healthcare is a data-intensive field generating a massive volume of data every day. Health informatics systems are making a significant contribution to transforming healthcare from a data-intensive into a data-rich field. A sharp increase in the size and dimensionality of the data used in different applications demands novel approaches that make effective use of the data to enhance the performance of these systems [1], [2].
Health informatics systems play an essential role in hospitals and healthcare centers. These systems need to be modernized, as they face many issues in acquiring relevant and complete data [3], [4]. Various issues in health informatics data, such as uncertainty, imperfection, heterogeneity, inconsistency, redundancy, high dimensionality, and representation, create problems for the fusion process [5]. The performance of disease management and prediction systems is often adversely affected by such data quality issues [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Sedat Akleylek .
To overcome these issues, it becomes essential to remove discrepancies from the data to discover valuable hidden information for analysis and decision making.
Data belonging to different sources and modalities have different features that provide different information about the disease and the patient's condition. Thus, feature fusion of multi-source and multi-modal data can improve the accuracy and reliability of results compared to a single modality [7]. Moreover, relying on data from a single source or modality makes it difficult to discriminate between different diseases and to classify severity levels. The fusion of multi-source and multi-modal data has become common in recent years. Multi-modal systems not only consider different aspects of disease management but also allow for missing-data imputation, quality-aware fusion, and an improved perceived experience.
A variety of techniques can be used to analyze current and historical information to make accurate predictions. Predictive analytics approaches allow actions to be maintained based on changes in the parameters and factors that affect disease diagnosis. However, the quality of input data has a significant impact on the performance of the prediction system [8], [9]. In some situations, hundreds or even millions of measurements belonging to one entity may be collected to diagnose a disease; in such situations, the issue of High-Dimensional Data (HDD) arises. HDD often holds redundant, sparse, missing, noisy, and irrelevant features [10]-[12]. Moreover, the analysis of HDD and data of diverse nature is challenging for decision making and disease prediction; only relevant features are required.
FIGURE 1. Architecture of the classical uni-modal approach for predictive analytics. Here, predictive analytics is performed separately for the data x_i retrieved from each source S_i. First, data from each source S_i is preprocessed to clean x_i. Then a DRT transforms x_i into a lower-dimensional representation y_i. Finally, a predictive analytics model is applied according to the type of data to obtain the results.
Many Dimensionality Reduction Techniques (DRTs) have been successfully applied in the literature to extract only precise and relevant features for analysis and disease prediction. Dimensionality Reduction (DR) can be performed via feature selection or feature extraction [13]. Feature selection creates a subset of features using correlation analysis or weighting methods [14]. Feature extraction transforms the original HDD into a low-dimensional representation by eliminating redundant features [15]. Feature extraction can be performed using linear approaches [16], such as Principal Component Analysis (PCA) and Independent Component Analysis (ICA), or nonlinear approaches [17], such as Kernel PCA (KPCA), Isomap, and Self-Organizing Maps (SOM). Different DRTs can be applied to different types of data (e.g., image, text, signals) [11]. Moreover, different researchers have applied various machine and deep learning models, such as Support Vector Machine (SVM), Decision Tree (DT), Logistic Regression (LR), Random Forest (RF) [18], [19], clustering [5], [20], and Neural Networks (NN) [21], for disease classification and prediction [3], [22].
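As a minimal sketch of the distinction between linear and nonlinear DRTs mentioned above, the following illustrative example applies PCA and KPCA (via scikit-learn) to a synthetic dataset; the dataset and every parameter here are assumptions for demonstration only, not the data or settings used in this study.

```python
# Linear (PCA) vs. nonlinear (Kernel PCA) dimensionality reduction
# on a toy two-class dataset with nonlinear structure.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)            # linear projection

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
X_kpca = kpca.fit_transform(X)          # nonlinear (kernel) projection

print(X_pca.shape, X_kpca.shape)        # (200, 2) (200, 2)
```

On data with nonlinear class structure such as concentric circles, the kernel projection tends to separate the classes where the linear projection cannot, which is the practical motivation for choosing a DRT according to the nature of the data.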
At present, most of the frameworks proposed in the literature use a single modality for predictive analytics (e.g., the prediction of disease) (Figure 1). For example, Electrocardiogram (ECG) signals have been used for arrhythmia detection [23]. The problem with this approach is that ECG signals and ECG reports may not be available simultaneously. However, other information, such as heart disease symptoms, risk factor information, and demographic data, may be available and can be helpful for disease prediction. An ECG-based prediction system that uses ECG signals only has a limited scope and needs modification in such a situation. Similar issues have been found with other disease prediction and management systems developed to work with one modality at a time [2], [24]. Some studies proposed fusion frameworks, such as for readmission prediction [25] and disease prediction, that are specific to the fusion of either multi-modal or multi-source data at a time [18], [26]. These systems offer better results than uni-modal systems. There is a need for a mechanism that can utilize multi-source, multi-modal, and multi-nature data simultaneously to develop dynamic and efficient disease prediction and management systems.
Several techniques have been proposed in recent years to combine data from multiple sources and modalities, which may hold similar or different data formats, for predictive analytics [18], [21], [26]-[28]. These techniques include data integration, data fusion, feature fusion, knowledge fusion, and multi-sensor fusion [4], [21], [26]. There is a need to develop a system that can fuse multi-source, multi-nature, and multi-modal data to improve the performance of disease diagnosis and prediction systems [5]. Multi-source fusion increases the reliability and availability of data. However, different sources may hold data in different modalities (e.g., formats). When the objective is to fuse (combine) diverse multi-source data, it is more suitable to apply multi-source fusion. However, multiple sources may contain redundant features, so there is a need to fuse features belonging to different modalities. Feature fusion models can combine feature sets belonging to two or more modalities to acquire distinct, relevant, and precise features. Many studies have explored different levels of fusion [3], [29], [30]. Selecting the required fusion level needs a clearly defined objective for combining various features belonging to different sources in the form of different modalities, and vice versa. The resulting combination of features belonging to multiple sources and modalities is expected to yield improved prediction performance compared to an individual modality or data belonging to a single source [5].
This study proposes different fusion frameworks to combine data from multiple sources and modalities for predictive analytics and to overcome the issues mentioned above. Moreover, knowledge and decision fusion frameworks are introduced to improve decision making. It is challenging to use a single DRT or a common classifier or Machine Learning (ML) model for multi-nature and multi-modal data [11], [19], [31]. To resolve this challenge, suitable DRTs can be applied to extract the most relevant, reliable, and precise features, improving the performance of the classifier or ML model and, ultimately, enhancing the performance of predictive analytics for health informatics.
This study makes the following contributions:
• Multi-modal, multi-source data, feature, knowledge, and decision fusion frameworks are proposed to combine data from multiple sources and modalities to improve predictive analytics.
• To assess the contribution of different DRTs and ML models, we explored various combinations of DRTs and ML models for the proposed fusion frameworks.
• Key issues concerning DRTs and fusion frameworks for predictive analytics are also highlighted.
The present study is organized as follows: Section II presents the related work. Fusion frameworks are proposed in Section III. Section IV discusses the results of the proposed frameworks. Section V explores the key challenges that limit the application of DRTs and fusion approaches. Finally, Section VI concludes the study.

II. REVIEW OF LITERATURE
Several multi-source, multi-modal, and knowledge fusion frameworks have been introduced in the literature to improve the performance of disease prediction systems. This section provides a brief review of the fusion frameworks suggested in the literature for handling high-dimensional, multi-source, and multi-nature health informatics data. According to the reviewed literature, three common fusion levels can be identified. Data fusion approaches, such as integration and fusion, have been used to fuse multi-source or multi-modal data. Feature fusion approaches have been defined to combine features from multiple sources and modalities; feature selection and extraction methods can be applied for feature fusion (e.g., to combine features belonging to different sources and modalities). Decision fusion approaches aim to combine the results of different classifiers using majority voting, min-max score normalization, etc.

A. MULTI-MODAL DATA FUSION
In the literature, most of the proposed fusion frameworks perform multi-modal fusion of medical signals (e.g., ECG, EEG) [7] or imaging modalities (e.g., MRI, PET, CT, or CXR) [32], [33]. To resolve the issues of multi-source data fusion, a genetic algorithm was proposed to select data from the source with optimal information [34]. In a recent study, Muhammad et al. [7] conducted a brief review of various studies from 2014 to 2020 that perform multi-modal fusion of medical signals to develop efficient health management systems. A multi-modal data fusion strategy was proposed to detect the progression of Alzheimer's disease [35]. The authors applied Spatial Group ICA to reduce the dimensionality of the imaging modalities and Canonical Correlation Analysis (CCA) to fuse the features of fMRI and sMRI, while the disease classification task was performed with an SVM and a recurrent convolutional network.
In [36], the authors proposed a multi-modal fusion framework to combine image and speech features for smart health monitoring. A multi-sensor data fusion methodology was developed for a blood pressure assessment system for CVD management in ambulatory care [37]. Sensor data fusion frameworks for different health informatics applications have been proposed in several studies [1], [38], [39].
To collect multi-source data from different medical sensors for a personalized health management system, Korzun and Meigal introduced a fusion framework based on semantic links that combines data from various sensors to improve performance [1]. Nweke et al. reviewed different data fusion and feature fusion strategies [31] and developed a taxonomy of multi-sensor and multi-view feature and data fusion for human activity recognition. Both classical DRTs and automatic deep learning methods were used for multi-sensor and feature fusion based on multi-view fusion approaches.
In another study, for the prediction of human personality, Kampman et al. [8] proposed a multi-modal fusion framework using a majority voting (ensemble) method to combine text, audio, and video data, which can be used in healthcare. Majumder and Pratihar proposed multi-sensor fusion via fuzzy clustering for the prediction of heart diseases [21]. Vijayasarveswari et al. introduced a multi-phase feature selection approach for cancer prediction [26]. Zhang et al. introduced a multi-modal fusion framework based on Local Linear Projection (LLP) [40]. Khan et al. proposed a feature fusion strategy to improve the prediction and classification accuracy for Coronary Artery Disease (CAD) [41].

B. MULTI-MODAL FEATURE FUSION
During multi-modal feature fusion, data belonging to different modalities, sources, and sensors, whether homogeneous or heterogeneous, are collected and combined. For instance, to enhance a heart disease management system via multi-modal fusion, [42] applied PCA to reduce dimensionality and a genetic algorithm for classification. For the prediction of heart disease at an early stage, a multi-sensory data and feature fusion framework was proposed by Muzammal et al. [18]. They extracted time-domain and frequency-domain features after preprocessing the collected data, and fused the multi-sensor and multi-modal features to develop a decision support system for heart patients deployed on a fog-cloud-based architecture. A review of fusion implementations in health informatics can be found in [43], in which the authors discussed the different fusion levels used to fuse imaging and Electronic Health Records (EHRs) data using deep learning techniques. For the automatic diagnosis of myocardial infarction from ECG data, Wang et al. [19] introduced a multi-feature fusion approach that performs feature fusion based on DRTs, statistical features, and entropy features. The extracted features were classified using RF, and the approach achieved higher disease classification accuracy than state-of-the-art approaches.
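The time-domain and frequency-domain feature extraction used in studies such as [18] can be sketched as follows; the specific feature set, the sampling rate, and the synthetic signal are illustrative assumptions, not the features those authors computed.

```python
# Toy time- and frequency-domain features for one sensor channel.
import numpy as np

def signal_features(x, fs):
    """Return a small dict of illustrative features for signal x at rate fs."""
    time_feats = {
        "mean": float(np.mean(x)),
        "std": float(np.std(x)),
        "rms": float(np.sqrt(np.mean(x ** 2))),
    }
    spectrum = np.abs(np.fft.rfft(x)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    freq_feats = {
        "dominant_freq": float(freqs[np.argmax(spectrum)]),
        "spectral_energy": float(np.sum(spectrum)),
    }
    return {**time_feats, **freq_feats}

fs = 250                                # e.g., a typical ECG sampling rate
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 1.2 * t)         # synthetic ~1.2 Hz oscillation
print(signal_features(x, fs))
```

Feature vectors computed per sensor in this way can then be concatenated across sensors and modalities before classification, which is the essence of the feature-level fusion described above.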
In clinical practice, the collection of multi-modal medical imaging data has become common. To extract relevant, reduced features and to overcome the issue of data sparseness, different DRTs, such as Sparse PCA, PCA, ICA, and CCA, were applied by Yang et al. at the preprocessing phase [44]. After preprocessing, they fused the extracted features for brain disease diagnosis. The authors of [45] proposed a multi-modal feature fusion framework for the detection and classification of arrhythmia and Myocardial Infarction (MI) from ECG signals collected from the MIT-BIH dataset, using CNN layers for feature extraction and an SVM as the classifier. For the detection of Atrial Fibrillation (AF), Shi et al. proposed a feature fusion framework in which discriminative CCA was applied to reduce the dimensionality of CMRI data [46].
To enhance the accuracy of heart disease prediction systems for smart health monitoring, Ali et al. proposed a feature fusion framework that combines structured and unstructured data using deep learning approaches [47]. In another study, Ali et al. [4] proposed an analytical engine for efficient disease prediction based on multi-source data. They applied different data mining strategies for context-aware dimensionality reduction and trained and tested different ML models for disease prediction.

C. DECISION FUSION
Decision fusion is considered the highest fusion level and has been implemented in various health informatics applications. For the development of an efficient disease detection system, the authors in [55] preprocessed two different input modalities separately and then fused them at the classifier level (i.e., fused the results of different classifiers). When fusing multi-modal data and HDD, the completeness and quality of the data have a significant impact on fusion [5]. Deng et al. introduced a classifier (decision) fusion framework based on feature selection to overcome the issues of data incompleteness and the fusion of high-dimensional, multi-modal medical imaging modalities [14], applied to the classification of Alzheimer's disease. The proposed feature-selection-based classifier fusion approach achieved higher accuracy on incomplete datasets.
Smirnov and Levashova presented a review of the knowledge fusion patterns reported in different studies [30]. According to this study, three kinds of fusion were identified: knowledge fusion for knowledge stored in repositories (e.g., accumulation of knowledge, problem solving, multi-source fusion, searching, and concept, attribute, or domain fusion); knowledge fusion among knowledge workers (e.g., knowledge sharing, problem solving, decision making, or distribution of knowledge); and knowledge fusion between knowledge workers and knowledge repositories (e.g., analysis and problem solving). In a recent study, Tariq et al. proposed a COVID-19 disease prediction and resource management system in which decision-level fusion of multiple classifiers was performed [56]. To enhance the classification accuracy for COVID-19 patients, a decision fusion method was introduced in [54].
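Decision (classifier-level) fusion by majority voting, as used in several of the reviewed studies, can be sketched as follows; the predictions are invented purely for illustration.

```python
# Fuse the hard decisions of several classifiers by majority voting.
import numpy as np

def majority_vote(predictions):
    """predictions: (n_classifiers, n_samples) array of class labels.

    Returns the per-sample label chosen by the most classifiers.
    """
    preds = np.asarray(predictions)
    fused = []
    for col in preds.T:                           # votes for one sample
        labels, counts = np.unique(col, return_counts=True)
        fused.append(labels[np.argmax(counts)])
    return np.array(fused)

# Three classifiers disagree on some samples; the majority wins.
p = [[1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 0, 1, 1]]
print(majority_vote(p))   # [1 0 1 1]
```

More elaborate schemes replace the simple vote with score-level rules (e.g., min-max normalized scores or weighted sums), but the principle of combining independent classifier outputs into one collective decision is the same.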

D. MULTI-LEVEL FUSION
Some studies performed fusion at different stages/levels to develop dynamic and efficient systems for health informatics. For instance, for the efficient diagnosis of heart disease, Hassan et al. [49] introduced multi-stage fusion (e.g., feature and decision fusion) using a generative model and a multi-variate process control method. The proposed multi-stage fusion architecture used shared and separate ICA to overcome the issues of high dimensionality, incompleteness, and heterogeneity caused by multi-modal and non-normalized features in the data. A brief review of the different data fusion levels for healthcare enhancement was provided in [57].
For the identification of different diseases (Alzheimer's and cancer) based on multi-modal clinical data, Viswanath et al. [48] proposed a fusion framework for knowledge representation of multi-modal data via decision, kernel, and low-dimensional representations. Knowledge fusion based on weighting criteria was performed using direct fusion, co-association matrix fusion, and structural fusion, and results were presented by combining multi-modal interpreters. DRTs were used for data and feature fusion at different levels. To explore the issues of the aging population and the spectrum of pandemics in the modern age, Cai et al. [50] presented a survey of the top 10 countries publishing research articles on data-driven health management systems. They explored various approaches and applications used for multi-modal data and knowledge fusion in the data-intensive healthcare domain. A fusion framework for two modalities (image and text data) has been presented to perform data and knowledge fusion for a Clinical Decision Support System (CDSS); the SMOTE method was used for feature fusion. In another study, Zhang et al. discussed multi-view data fusion and feature fusion strategies using feature selection approaches [58]. For monitoring blood pressure using ECG sensors, Smirnov and Levashova proposed a multi-level fusion model that can perform multi-sensor fusion and information fusion for the development of a predictive model [30].
To improve COVID-19 diagnostic accuracy, [53] proposed a feature fusion and decision fusion methodology. Multiple CNN architectures were used to extract features, and feature-level fusion was performed to combine them. Different classifiers were trained and tested on the fused feature set, and the majority voting method was used for decision fusion to obtain optimal results for medical recommendations to control the pandemic. Attallah et al. proposed a diagnostic system for distinguishing COVID-19 from non-COVID-19 cases [51]. The system was trained and tested using CT images, where the image features were extracted with four pretrained deep CNN models and then fused to train an SVM classifier. The authors experimented with different fusion strategies to investigate the impact of feature fusion on diagnostic performance.
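A rough sketch of feature-level fusion feeding an SVM, loosely in the spirit of [51], follows; here two synthetic feature matrices stand in for the outputs of different (e.g., pretrained CNN) feature extractors, and all data and parameters are assumptions for illustration.

```python
# Concatenate two feature "views" of the same samples and train an SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-ins for features produced by two different extractors.
X_a, y = make_classification(n_samples=300, n_features=30, random_state=0)
rng = np.random.default_rng(1)
X_b = X_a[:, :10] + 0.5 * rng.normal(size=(300, 10))   # correlated second view

X_fused = np.hstack([X_a, X_b])                        # feature-level fusion
X_tr, X_te, y_tr, y_te = train_test_split(X_fused, y, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))
```

In practice the fused vector would be the concatenation of each backbone's embedding for the same image; whether fusion helps depends on how complementary the views are, which is exactly what [51] investigated empirically.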
To overcome the challenges of uni-modal data for emotion recognition, Jiang et al. proposed a multi-modal information fusion approach [59]. The proposed approach was used to extract relevant features from multi-modal data (e.g., ECG signals, visual, audio, text data) and fuse them for emotion recognition and health monitoring of patients. For dimensionality reduction and feature extraction, classical DRTs, such as Linear Discriminant Analysis (LDA) and some deep learning approaches, such as CNN, RNN, and DBN, were applied.
They used both feature and decision fusion: DRTs were applied for feature fusion, and for decision fusion the results of different classifiers were combined via a weighted sum to obtain the collective decision. Other studies have introduced multi-level fusion, such as fusion techniques for health monitoring via wearable sensors [60], fusion of text and images for COVID-19 detection [52], decision and feature fusion for COVID-19 datasets [53], and feature and data fusion [61]. Table 1 summarizes the reviewed literature along with key findings concerning the various fusion frameworks and the significance of DRTs in improving the performance of various ML models. Most of the reviewed studies focus on one or two types of fusion (e.g., multi-modal feature and/or knowledge fusion) for a particular situation, which limits their scope. According to the reviewed literature and to the best of our knowledge, no study has proposed fusion frameworks that support multi-source, multi-modal, multi-nature data and feature fusion to combine reduced feature sets, together with decision or knowledge fusion frameworks to combine the results of different ML models for predictive analytics. As a further contribution of this study, we applied various combinations of DRTs and ML models to select the best approach for each fusion framework.

III. PROPOSED FUSION FRAMEWORKS FOR PREDICTIVE ANALYTICS
This section illustrates the structure and functionality of the proposed fusion frameworks in detail. The steps of the uni-modal data processing pipeline are used as a baseline for the development of disease prediction models and systems (Figure 1). According to the existing studies, a uni-modal data processing pipeline can work with only one modality or one type of data at a time and cannot handle heterogeneous, high-dimensional health informatics data for predictive analytics (e.g., disease prediction). In this study, different fusion frameworks are proposed to process heterogeneous (e.g., multi-nature, multi-source, and multi-modal) and high-dimensional health informatics data to develop an efficient and reliable disease prediction system (Figures 2-6).
These steps include data acquisition from multiple sources belonging to multiple modalities with different data types (e.g., medical images, text, signals, and tabular data). Various operations are performed at the data preprocessing and exploration phase to prepare the data for further processing. During the dimensionality reduction phase, different DRTs are applied to reduce the dimensionality of the data and extract precise features for analysis. The reduced feature set is then forwarded to the predictive analytics phase, where different ML models are applied to predict disease efficiently. Finally, medical treatments can be recommended based on the prediction and classification of the disease.
Moreover, due to variation in the significance of multiple features for solving the specific issue, not all the features of the data collected from multiple sources are useful for analysis and decision making. To extract relevant features from different modalities, DRTs according to the type and nature of data can be applied. Extracted features belonging to different sources and modalities can be combined (fused) for analysis. These reduced feature sets hold more precise and relevant features for disease prediction, hence, improving the performance of the disease prediction system.
For example, data may be collected from multiple sources (e.g., patient conversations with physicians, medical diagnostic devices, lab test reports, other online and offline sources, medical research, and surveys). The data X belonging to the different sources S is combined in a uniform format for analysis using the fusion frameworks. Each dataset x_i can have features f_k, where 1 ≤ k ≤ d. The fusion of multi-nature datasets may result in duplication of data and features. The data X is represented in the form of a matrix Z, whose rows and columns are obtained from x_i and f_k, respectively.
In this study, a collection of DRTs is applied to eliminate redundant features. The new collective source obtained after fusion offers a uniform representation of the data. Data from the multiple modalities M is denoted by m_j, where j = 1, 2, ..., q. Modalities indicate the different formats of data found in healthcare, such as structured (e.g., patient demographic information, hospital visit history, and billing information), semi-structured (e.g., clinical notes), and unstructured (e.g., signals, images). In this study, we categorize these modalities (data types) as tabular/structured, signals, images, and text. These modalities can hold information such as patient demographics, risk factors, and medical reports (e.g., lab tests and ECG reports). Each modality m_j stores data in the form of features represented in the matrix Z. The preprocessing phase P for each modality m_j is p_j, which develops a standardized representation of the features. These features may comprise patient information gathered from multiple sources S and modalities M. Next, DRTs are applied to transform the HDD Z ∈ R^(l×d), having d dimensions and l rows, into a lower-dimensional representation Y ∈ R^(l×r), where r < l < d in an ideal case.
Features extracted for analysis via DRTs are represented as f_t, where 1 ≤ t ≤ r. Moreover, the reduced feature set for each modality m_j is represented by h_u, where 1 ≤ u ≤ r. The reduced representations obtained after applying DRTs can be quite different from the original data. For example, after applying PCA to a dataset, the reduced data is presented in the form of Principal Components (PCs), based on the function used to calculate these components (e.g., covariance or variance). The reduced feature sets h_u belonging to the multiple modalities m_j can be fused and sent to the selected ML models/classifiers c_w. The results of different ML models can be combined to make a collective decision, which is called decision fusion. The frameworks proposed in the following subsections represent the fusion of data and features at different phases of a standard data processing pipeline for predictive analytics.
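The notation above (per-modality reduction of Z_j to a reduced set h_u, fusion, then a classifier c_w) can be sketched as follows under simplifying assumptions: two synthetic "modalities", PCA as the DRT, and logistic regression as the classifier; every dataset, dimension, and model choice here is illustrative, not the configuration evaluated in this study.

```python
# Per-modality DRT, feature-level fusion of the reduced sets, then a classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
# Two synthetic modalities with different dimensionalities d_j,
# each carrying some class-related signal.
Z1 = rng.normal(size=(n, 40)) + y[:, None]            # e.g., tabular features
Z2 = rng.normal(size=(n, 60)) + 0.5 * y[:, None]      # e.g., signal features

# DRT per modality: Z_j (l x d_j) -> h_j (l x r_j), with r_j << d_j.
h1 = PCA(n_components=5).fit_transform(Z1)
h2 = PCA(n_components=5).fit_transform(Z2)
fused = np.hstack([h1, h2])                           # fused reduced feature set

X_tr, X_te, y_tr, y_te = train_test_split(fused, y, random_state=0)
c_w = LogisticRegression().fit(X_tr, y_tr)
print(round(c_w.score(X_te, y_te), 2))
```

Reducing each modality before fusion keeps the fused vector compact (here 10 features instead of 100), which is the practical benefit of applying the DRT per modality rather than after concatenation.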
A. MULTI-SOURCE DATA FUSION FRAMEWORK
Data belonging to a single source may not hold sufficient information for effective medical decision making [62]. To overcome this issue, data from multiple sources can be acquired and combined using a multi-source data fusion framework (Figure 2). The main objective of the multi-source data fusion framework is to increase the accuracy and reliability of the data and thereby improve the accuracy of the results. As shown in Figure 2, the proposed multi-source data fusion framework combines data from multiple sources and preserves them in EHRs. During the data acquisition phase, the datasets x_i belonging to the multiple sources S_i are specified for analysis,
where x_i ∈ S_i. At the preprocessing phase P, various operations, such as missing value imputation, outlier detection and removal, and class balancing, are performed to remove anomalies. The preprocessed data X_p is obtained by imputing missing values and balancing the classes; its size may be greater than or equal to that of the original data.
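The preprocessing phase P can be sketched as follows; the concrete choices here (mean imputation, percentile clipping for outliers, and minority upsampling for class balance) are illustrative assumptions, not the exact operations used in the experiments.

```python
# Illustrative preprocessing: imputation, outlier clipping, class balancing.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan          # ~10% missing values
y = np.array([0] * 80 + [1] * 20)              # imbalanced classes

X_imp = SimpleImputer(strategy="mean").fit_transform(X)   # imputation

# Clip mild outliers to the 1st/99th percentiles of each feature.
lo, hi = np.percentile(X_imp, [1, 99], axis=0)
X_clip = np.clip(X_imp, lo, hi)

# Balance the classes by upsampling the minority class with replacement.
X_min, y_min = resample(X_clip[y == 1], y[y == 1],
                        n_samples=80, random_state=0)
X_bal = np.vstack([X_clip[y == 0], X_min])
y_bal = np.concatenate([y[y == 0], y_min])
print(X_bal.shape, np.bincount(y_bal))         # (160, 4) [80 80]
```

Note how upsampling grows the dataset, consistent with the observation above that the preprocessed data may be larger than the original.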
The dimensionality reduction phase reduces the dimensionality of the data and extracts the relevant features for analysis. DRTs are applied according to the type and nature of the data to obtain the reduced data Y. In this study, a series of methods, including graph-based, decomposition, clustering, and embedding methods, is used to generate low-dimensional representations Y of linear and nonlinear data, which can improve the overall performance of predictive analytics. The reduced feature sets belonging to the different modalities are used as input to the ML models (classifiers) C for disease prediction and classification tasks (Figure 2).
A multi-source data fusion framework can collect and integrate data and information from multiple sources S_i to understand the phenomenon of interest. For example, data collected from multiple sources, such as a patient's demographic information, lab test reports, medical images (e.g., X-rays, CT scans, ECG reports), and signal data (e.g., ECG recordings, cough or breathing sounds, and other real-time monitoring systems), can help in the prediction of disease. Instead of making decisions manually for each patient, the proposed multi-source fusion framework can fuse (combine) and process the data X from multiple sources S_i for efficient disease prediction. Such systems can assist physicians in efficient decision making for medical recommendations.
FIGURE 2. Proposed paradigm of the multi-source data fusion framework. First, data x_i from different sources S_i are combined and stored in the EHR. Then, the data X is processed using different preprocessing techniques P. Next, a DRT is applied to the preprocessed data Z. Finally, a predictive analytics model is applied to the data Y to obtain the results.
The main problem with this framework is the diversity of data that needs different strategies for preprocessing, dimensionality reduction, and ML models for the efficient utilization of the most relevant features belonging to multiple sources S i and modalities M.

B. MULTI-MODAL FEATURE FUSION FRAMEWORK
Uni-modal approaches can only process one type of modality at a time, which restricts their use in the presence of multimodal data. Multi-source fusion approaches offer an opportunity to work with data belonging to different sources S i . It is difficult to apply any specific data preprocessing mechanism to data collected from multiple sources S i belonging to various modalities M and data types. To overcome the limitations of the multi-source fusion framework (Figure 2), we proposed a multi-modal feature fusion framework to improve predictive analytics (Figure 3).
Feature-level fusion of multiple modalities M is a challenging task, as each modality m_j can have various types of issues, such as missing values, high dimensionality, sparse datasets, data redundancy, noise, missing or lost signals, poor signal quality, lingual issues, lexical and semantic problems, and class imbalance. To overcome these issues, separate preprocessing p_j of the data belonging to each modality m_j and data type is essential before applying any analysis technique. The proposed multi-modal data and feature fusion framework comprises the standard data processing pipeline, including the acquisition of data from multiple sources S_i and modalities m_j, and categorizes them as tabular/structured data, signals, medical images, and clinical notes (Figure 3).
Each data item x_i ∈ X is derived from a source S_i and has a modality m_j. At the preprocessing phase p_j, different preprocessing and data exploration strategies are applied separately to each data type belonging to multiple modalities and sources to normalize and standardize the data for further processing.
Then, relevant features belonging to the various types of data and information collected from multiple sources and modalities are fused in a common space at the feature fusion phase to understand the phenomena of interest.
Next, a DRT is applied to obtain the reduced dimensions of the data X_{p,j}. The reduced feature set(s) Y_{p,j} are forwarded to the predictive analytics phase, where ML models are applied for disease prediction and classification tasks. Finally, a classifier C is employed on the reduced dataset to obtain the results. The use of suitable DRTs can improve the performance of the ML models, and standard evaluation techniques can be applied to evaluate the results.
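A minimal sketch of this multi-modal feature fusion sequence follows, with synthetic stand-ins for three modalities; each modality m_j gets its own preprocessing p_j before the features are fused in a common space, reduced, and classified. The data, modality names, and the particular DRT and classifier are illustrative assumptions.

```python
# Sketch of the multi-modal feature fusion framework: per-modality
# preprocessing, feature fusion in a common space, a DRT, then a
# classifier. All data is synthetic and illustrative.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(1)
n = 150
tabular = rng.normal(size=(n, 6))    # structured patient data
images = rng.random(size=(n, 64))    # flattened image features
signals = rng.normal(size=(n, 32))   # e.g., ECG-derived features
y = (tabular[:, 0] > 0).astype(int)  # synthetic label

# Separate preprocessing p_j for each modality m_j, then feature fusion
Z = np.hstack([
    StandardScaler().fit_transform(tabular),  # z-score tabular data
    MinMaxScaler().fit_transform(images),     # scale pixel features to [0, 1]
    StandardScaler().fit_transform(signals),  # normalize signal features
])

# DRT on the fused feature space, then a classifier C on the reduced set
Y_red = TruncatedSVD(n_components=10, random_state=1).fit_transform(Z)
clf = KNeighborsClassifier().fit(Y_red, y)
acc = clf.score(Y_red, y)
```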

C. KNOWLEDGE FUSION FRAMEWORK
The downside of feature fusion is that features fused in a common space may not all be processed further with common DRTs and ML models to enhance performance. Moreover, the fundamental challenge in the fusion (combining) of disparate modalities lies in reducing the dimensionality of the data and exploring potential features. Simple methods just concatenate the feature vectors to fuse multiple images. However, these methods do not consider the varying data types (e.g., image, text, signals), which need further consideration to improve the quality of multi-source and multi-nature data. In this regard, a knowledge fusion framework is proposed in this study (Figure 4).

FIGURE 3. Structure of the proposed multi-modal data fusion framework. In this paradigm, data x_{i,j} from different sources S_i and modalities m_j are processed using different preprocessing techniques p_i according to the type of data. Then, a DRT is employed on the fused preprocessed data to acquire the reduced feature set Y. The results are obtained after applying an appropriate predictive analytics approach.
The proposed knowledge fusion framework comprises the various phases of the standard data processing pipeline (Figure 4). It starts by acquiring data from multiple sources S and modalities M and categorizes them according to data type, such as tabular/structured data, signals, medical images, and clinical notes. At the preprocessing phase P, different preprocessing and data exploration strategies are applied (as shown in Figure 3) separately to each data type belonging to multiple modalities and sources, to normalize and standardize the data before further processing. Instead of using a common feature space for features F belonging to different modalities M and data types, separate steps are performed for each modality m_j.
To reduce the dimensionality of the data and extract relevant features F belonging to the various types of data and information collected from multiple sources S and modalities M, a collection of DRTs, chosen according to the type and nature of the data, is applied to each data type separately. Next, the DRTs are applied to obtain the reduced data Y.
DRTs transform the input data X into a low-dimensional representation Y while preserving the original context of the information and the pairwise relationships between data points. Typically, these pairwise relationships can be preserved via similarity measures or distance-based approaches when combining data from multiple datasets x_i [30]. However, different quantities, such as variance and covariance, are considered for ICA, PCA, and other DRTs depending on their functionality [11]. Similarly, different DRTs are selected according to the type and nature of the data to enhance the performance of the ML models used for predictive analytics. Then, reduced and relevant feature sets y_{i,j} belonging to each data type from multiple sources S and modalities M are generated.
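The distance-preservation property mentioned above can be checked directly: projecting with all principal components is a rotation (plus centering) and leaves pairwise distances unchanged, while truncating to fewer components distorts them. The sketch below, on synthetic data, is illustrative.

```python
# Compare pairwise distances before and after PCA: a full-rank projection
# preserves them exactly; a truncated one only approximately.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))

D_orig = pairwise_distances(X)
D_full = pairwise_distances(PCA(n_components=8).fit_transform(X))
D_trunc = pairwise_distances(PCA(n_components=2).fit_transform(X))

full_err = np.abs(D_orig - D_full).max()    # rotation preserves distances
trunc_err = np.abs(D_orig - D_trunc).max()  # truncation distorts them
```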
The reduced feature set(s) y_{i,j} belonging to each data type are forwarded to the predictive analytics phase, where ML models are applied for disease prediction and classification tasks for each data type separately. Finally, the results of the different classifiers are analyzed; individual classifiers or a combination of classifiers can be fused at the knowledge fusion phase.
Most disease prediction models are constructed using a single ML model. However, the selection of an appropriate ML model is often problematic [63]. Furthermore, the absence of standardized techniques for evaluating classifier performance also complicates the decision-making process. In addition, multi-modal features, when used as input, may lead to different combinations of features and ML models, which further increases the complexity of the problem. Different ML models may decode different information, so one should select a more reliable model by maximizing the utilization of the attained information rather than selecting a single optimal classifier from those available.
A suitable combination of DRTs and ML models improves the accuracy of the results generated by the different ML models. The results generated by different ML models for different data types are combined in the knowledge fusion phase, which improves the reliability of the knowledge gained for predictive analytics and decision-making. The proposed knowledge fusion framework combines the decisions of multiple ML models and utilizes this information for effective decision-making. However, the fusion (mixing) of multi-source, multi-nature, and multi-modal data of high dimensionality is also a big challenge, and it depends on the approaches used for fusion (combining data). Despite all the challenges, the knowledge fusion framework can offer better results than uni-modal and multi-source fusion frameworks for processing the HDD found in the health informatics domain.
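The knowledge fusion phase can be sketched as follows: each modality gets its own DRT and classifier, and the per-modality decisions are fused by a majority vote. The synthetic modalities, dimensionalities, and model assignments below are illustrative assumptions.

```python
# Sketch of knowledge fusion: per-modality DRT + classifier, then
# majority voting over the per-modality predictions. Synthetic data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 120
y = rng.integers(0, 2, size=n)
# Each synthetic modality carries a noisy copy of the label signal
modalities = {name: y[:, None] + rng.normal(scale=2.0, size=(n, d))
              for name, d in [("tabular", 8), ("image", 40), ("text", 20)]}
models = {"tabular": KNeighborsClassifier(), "image": SVC(),
          "text": DecisionTreeClassifier(random_state=3)}

preds = []
for name, Xm in modalities.items():
    Z = PCA(n_components=4).fit_transform(Xm)   # per-modality DRT
    preds.append(models[name].fit(Z, y).predict(Z))

# Majority vote over three binary voters: class 1 wins with >= 2 votes
fused = (np.vstack(preds).sum(axis=0) >= 2).astype(int)
fused_acc = (fused == y).mean()
```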

D. REDUCED FEATURE SETS FUSION FRAMEWORK
For the efficient utilization of the multi-source, multi-modal, and multi-nature data X found in different data types, a reduced feature sets fusion framework is proposed (Figure 5). It comprises the phases of the standard data processing pipeline, starting with the acquisition of data from multiple sources S and modalities M. Similar to the multi-modal feature and knowledge fusion frameworks, this framework also applies separate data preprocessing and exploration operations to each data type at the preprocessing phase P. After preprocessing and exploring the data and features belonging to different modalities and data types, there are multiple possibilities for further processing the data to obtain the reduced representation Y of the multi-nature data.
For each x_{i,j}, preprocessing is performed. There are then two possibilities: apply DRTs first, or combine (fuse) the features first, according to the problem statement. In the first case, a DRT is applied to the preprocessed features belonging to s_i ∈ m_j to obtain the reduced dimensions of the data Y as h_u. In the second case, DRTs are applied to the fused feature set F. Multiple classifiers C_w, where w = 1 to b, can be employed on the different reduced datasets belonging to different modalities to obtain results, and the results of the different classifiers can be combined using the majority voting method.
Features of different modalities can be combined in different ways. One possibility is to combine the features of multiple modalities found in different data types (e.g., tabular and signal data) (Figure 5); this representation is similar to the multi-modal feature fusion framework (Figure 3). A DRT, or a combination of DRTs, is then applied to reduce the dimensionality of the data, and the result is fused at the reduced feature set fusion phase. Another way is to apply different DRTs to reduce the dimensionality of data of different data types (e.g., medical images and clinical notes), as shown in Figure 5, and then fuse the results at the reduced feature set fusion phase. The feature sets reduced in both paths are fused (combined) at the reduced feature set fusion phase and forwarded to the predictive analytics phase for prediction and classification tasks. For instance, if the feature sets of some modalities cannot be combined before applying DRTs, then the same DRT cannot be applied to that specific data type. Depending on the problem under consideration, different DRTs may be required for different modalities and data types to reduce the dimensionality of the selected data, or two or more modalities may be combined before the DRTs.
Moreover, this framework offers better representations of features in lower dimensions if different DRTs are applied separately to each type of data, or to combinations of two or more modalities, according to the problem under consideration. The proposed reduced feature set fusion framework performs the preprocessing for all modalities separately to achieve this objective. Then, the features of two or more modalities having different data types (e.g., image and text data) can be combined before applying DRTs.

FIGURE 5. Proposed reduced feature set fusion framework. Data x_{i,j} can be preprocessed using different preprocessing techniques p_i depending on the modality of the data. Reduced feature sets can be obtained and fused before or after applying DRTs, depending on the modality m_j of the data. Finally, a machine learning model is employed to obtain the results.
The proposed reduced feature set fusion framework offers an opportunity to fuse data before or after applying DRTs and to combine the reduced feature sets to explore the best results. The assumption is that an ML model applied to an optimal (reduced) feature set can perform better. This fusion framework needs extra attention when exploring the data: data comprising multiple modalities, sources, or data types can be combined before or after applying DRTs to obtain the optimal feature sets for analysis and predictive analytics. Moreover, a common ML model applied to multi-source, multi-modal data of different data types and reduced representations may not improve the performance of the specific ML model. Despite the improvements and flexibility in developing reduced feature sets and their fusion, this framework faces challenges in the fusion of multiple reduced feature sets, their representations, and the selection of a common ML model for the different reduced feature sets.
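The two fusion paths described above can be sketched side by side: (a) fuse the modalities first and apply one DRT, or (b) apply a separate DRT per modality and then fuse the reduced sets. The synthetic modalities and the choice of PCA/TruncatedSVD below are illustrative assumptions.

```python
# Sketch of the two reduced-feature-set fusion paths: fuse-then-reduce
# versus reduce-then-fuse. Synthetic, illustrative data.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(4)
tab = rng.normal(size=(100, 10))  # tabular modality
img = rng.random(size=(100, 50))  # image-feature modality

# Path (a): fuse the modalities, then apply one common DRT
fused_first = PCA(n_components=8).fit_transform(np.hstack([tab, img]))

# Path (b): apply a DRT suited to each modality, then fuse the reduced sets
tab_red = PCA(n_components=4).fit_transform(tab)
img_red = TruncatedSVD(n_components=4, random_state=4).fit_transform(img)
reduced_first = np.hstack([tab_red, img_red])
```

Both paths yield an 8-dimensional representation here, but the reduced-first path lets each modality keep a DRT matched to its nature, which is the point of this framework.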

E. DECISION FUSION FRAMEWORK BASED ON A HYBRID APPROACH
A decision fusion framework based on a hybrid fusion approach is proposed in this study to overcome the issues and challenges found in the fusion of multi-source, multi-modal, multi-nature health informatics data of different data types (Figure 6). Hybrid fusion increases flexibility by offering an opportunity to dynamically apply DRTs to different data types and to combine the features of multi-source, multi-modal HDD before and after applying DRTs. The reduced feature sets are forwarded to the predictive analytics phase, where the most efficient ML models are applied to the reduced feature sets of the different data types to derive effective decisions.
Decisions generated by the different ML models are fused to make a collective decision (in some situations, this may resemble the knowledge fusion framework of Figure 4). However, the multiple combinations of data X and features F belonging to multi-source, multi-modal, multi-nature HDD require a mechanism for selecting the most suitable and compatible techniques to improve the accuracy of the decisions. The fusion of decisions generated by multiple ML models needs human interaction to reach a common decision and improve the accuracy of decisions for disease prediction and management. Multiple ML models are typically used when combining data from multiple sources S to obtain a uniform representation in a common decision space. The proposed decision fusion framework can represent the results generated by the different ML models (Figure 6).
Next, a DRT is applied to obtain the reduced data Y. There are two possibilities: apply DRTs first, or combine (fuse) the features first, according to the problem statement. In the first case, a DRT is applied to the preprocessed features belonging to s_i ∈ m_j to obtain the reduced dimensions of the data Y as h_u. In the second case, DRTs are applied to the fused feature set F. Individual or multiple classifiers, C or C_w, can be employed on the different reduced datasets belonging to different modalities to obtain results, and the results of the different classifiers can be combined using the majority voting method.
The results generated by the different ML models using fused features or independent data sources are fused (combined) to improve the accuracy and reliability of the decisions. The decision fusion framework offers an opportunity to diagnose disease based on the preserved information and to detect new patients with similar symptoms in less time. For this purpose, the computation of probabilities decides how different imaging (medical imaging) and non-imaging (patient demographic information and diagnostic history) modalities M can be used in decision fusion to improve performance.
However, selecting the right fusion level depends on the fusion strategy and the data complexity required to achieve the desired results. A mechanism is needed to ensure dynamic updates when new data arrives for analysis. DRTs work well to eliminate redundant and non-significant features before and after the fusion of multi-modal, multi-source, multi-nature data.

IV. EXPERIMENTAL EVALUATIONS OF THE FUSION FRAMEWORKS AND DISCUSSION
In this study, we explore the efficiency of the proposed fusion frameworks for the detection of COVID-19 cases. Experimental work was performed using the COVID-19 Chest X-ray dataset [64], which holds 178 normal and 95 COVID-19 cases. The dataset comprises three modalities, i.e., X-ray images, patient information as tabular data, and clinical notes, which are helpful for the prediction of COVID-19 cases. The tabular, image, and text data each hold different information for the detection of COVID-19. To explore the significance of each modality (data type) for disease detection, different combinations of these modalities were used according to the proposed fusion frameworks.
During the preprocessing phase P, discrepancies in the data X are removed for each modality m_j to ensure the completeness of the data. This step is necessary before applying any analysis technique to achieve optimal outcomes (results); moreover, it improves the efficiency of the ML models used for classification and prediction. In this phase, we applied a filter to identify missing values and used the majority vote method [65] to impute them. To overcome the imbalanced class issue, we eliminated classes having five or fewer samples. Inconsistencies caused by multiple labeling were identified and converted into a uniform format; for example, categorical data were converted into numeric data to prepare them for analysis. Then we split the whole dataset into train and test sets. Figure 7 represents the significance of the features: the features are ranked according to their significance for disease detection, and highly correlated and least significant features are eliminated using DRTs.
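The tabular preprocessing steps above can be sketched on a toy frame: mode-based (majority-vote) imputation, removal of underpopulated classes, categorical-to-numeric conversion, and a train/test split. The column names and records are invented for illustration; the class-size threshold here is lowered to fit the toy data, whereas the study used a threshold of five samples.

```python
# Sketch of the preprocessing phase P on an illustrative tabular frame.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age": [34, 51, np.nan, 62, 45, 29, 58, 41],
    "sex": ["F", "M", "F", None, "M", "F", "M", "F"],
    "label": ["covid", "normal", "covid", "normal",
              "covid", "normal", "rare", "covid"],
})

# Majority-vote (mode) imputation of missing values, per column
for col in ["age", "sex"]:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Drop underpopulated classes (the paper's threshold was five samples;
# here the toy frame only supports dropping singleton classes)
counts = df["label"].value_counts()
df = df[df["label"].isin(counts[counts > 1].index)]

# Categorical-to-numeric conversion, then a train/test split
df["sex"] = df["sex"].map({"F": 0, "M": 1})
X = df[["age", "sex"]].to_numpy()
y = (df["label"] == "covid").astype(int).to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```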
A collection of DRTs, including PCA, SVD, LSA, PP, ICA, LPP, LDA, KPCA, LLE, SOM, LVQ, and t-SNE, is applied to reduce the dimensionality of the data. This also helps to explore the most suitable DRTs according to the nature and type of the data. Different classifiers and ML models are trained and tested on the reduced feature sets. The prediction accuracy of each ML model indicates its performance and reveals the best combination of DRTs and ML models for prediction and classification tasks. Moreover, different ML models were trained and tested to develop an efficient disease prediction model. The compatibility of DRTs and ML models should be kept in view when selecting a combination to achieve the highest performance accuracy. The accuracy of a model is calculated as the number of correct predictions divided by the total number of predictions. Table 2 presents the results where the collection of DRTs was used to reduce the dimensionality of the various data types and the reduced feature sets were passed to KNN (the ML model). PCA achieved the highest accuracy score of 97% for Tab data, Isomap achieved an accuracy score of 96% for Img data, and LSA achieved 95% accuracy for Txt data. Similarly, the bold values indicate the highest accuracy for the different fusion levels and modalities. These results were obtained on the normalized dataset. However, the results of the DRTs and the performance of the ML model can change according to the problem statement (e.g., the number of reduced dimensions used for analysis) or the type and nature of the data.
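The procedure behind Table 2, comparing a collection of DRTs for one ML model (KNN), can be sketched as a small loop on synthetic data. The DRT subset, dimensionalities, and scores below are illustrative assumptions and do not reproduce the paper's numbers.

```python
# Sketch of comparing several DRTs for one classifier (KNN): reduce the
# same data with each DRT, record cross-validated accuracy, pick the best.
import numpy as np
from sklearn.decomposition import PCA, FastICA, TruncatedSVD
from sklearn.manifold import Isomap
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 30))
y = (X[:, 0] > 0).astype(int)  # synthetic label

drts = {
    "PCA": PCA(n_components=5),
    "SVD": TruncatedSVD(n_components=5, random_state=5),
    "ICA": FastICA(n_components=5, random_state=5),
    "Isomap": Isomap(n_components=5),
}

scores = {}
for name, drt in drts.items():
    Z = drt.fit_transform(X)  # reduced feature set for this DRT
    scores[name] = cross_val_score(KNeighborsClassifier(), Z, y, cv=3).mean()

best = max(scores, key=scores.get)  # best DRT + KNN combination
```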
Table 3 shows the different accuracies SVM attained for the different fusion levels, modalities, and DRTs. In the uni-modal setting, SVM with PCA achieved the highest accuracy of 94% for tab data, Isomap achieved the highest accuracy of 96% for img data, and LSA achieved an accuracy of 92% for txt data. Similarly, the accuracies for the different modalities and fusion levels after applying the DRTs with SVM are shown, with bold values indicating the highest accuracy. Table 4 presents the accuracies the CNN model achieved after applying the collection of DRTs for the different modalities. In the uni-modal setting, CNN with PCA achieved the highest accuracy of 95% for tab data, t-SNE achieved the highest accuracy of 97% for img data, and LSA achieved an accuracy of 95% for txt data. The accuracies attained for the different modalities and fusion levels after applying the DRTs with CNN are shown in Table 4, with bold values representing the highest accuracy.
Table 5 indicates the accuracies of RF for the different modalities and fusion levels after applying the collection of DRTs. RF with PCA attained the highest accuracy of 94% for tab data, t-SNE attained the highest accuracy of 97% for img data, and LSA attained an accuracy of 95% for txt data. In the same way, the accuracies for the different modalities and fusion levels after applying the DRTs with RF are shown, with bold values indicating the highest accuracy.
The performance accuracy of DT is presented in Table 6, which shows the different accuracy levels of the DRT and ML model combinations for the specified fusion levels. For tab data, DT achieved the highest accuracy of 94% with PCA; Isomap achieved the highest accuracy of 97% for img data; and LSA achieved an accuracy of 95% for txt data. Similarly, the accuracies for the different modalities and fusion levels after applying the DRTs with DT are shown, with bold values indicating the highest accuracy.
Another purpose of these tables is to present the different combinations of DRTs and ML models for the proposed fusion frameworks. The accuracies of the different modalities and fusion levels after applying the DRTs for each selected model help in choosing the best combination.
In the knowledge fusion framework, a combination of multiple ML models, including SVM, RF, CNN, KNN, and DT, was analyzed. Grid search and cross-validation techniques were applied to tune the hyperparameters of the DRTs and ML models. This combination helps to explore actionable insights from the high-dimensional, multi-nature, multi-modal, and multi-source health informatics data for efficient predictive analytics (disease prediction and management systems).
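The joint tuning of DRT and ML-model hyperparameters by grid search with cross-validation can be sketched as a single pipeline search. The grids below are small illustrative examples, not the search space actually used in the study.

```python
# Sketch of grid search + cross-validation over a joint DRT + classifier
# pipeline, tuning both the DRT and the model hyperparameters together.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic label

pipe = Pipeline([("drt", PCA()), ("clf", SVC())])
grid = {
    "drt__n_components": [2, 5, 10],  # DRT hyperparameter
    "clf__C": [0.1, 1, 10],           # classifier hyperparameter
}
search = GridSearchCV(pipe, grid, cv=3).fit(X, y)
best_params, best_score = search.best_params_, search.best_score_
```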
For decision fusion, the best results of the different ML models and DRTs are combined to ensure the accuracy of decisions for disease prediction and management. In this study, we follow the majority voting method [66] to combine the results of the different ML models (KNN, SVM, CNN, RF, DT). The parameters of each individual ML model were decided using a grid search approach. To the best of our knowledge, only a few studies have proposed decision fusion frameworks for disease prediction; in this study, we propose a decision fusion framework to enhance the accuracy of the prediction system. According to [67], the structures of different classifiers may generate ambivalent results. The limitations of any single conceptual framework, together with Dietterich's reasons (statistical, computational, and representational), indicate the need for novel mechanisms to improve the accuracy of disease prediction systems [66]. In this study, we show how multiple classifiers may offer better decisions than a single classifier.
With the aim of improving decisions for disease prediction, we proposed a decision fusion framework and evaluated the advantage of using multiple classifiers for decision fusion over a single-classifier approach for disease detection. The proposed framework is a proof of concept that, for multi-source and multi-modal HDD, no single classifier can be recommended: different classifiers generate varying results and have different capabilities, which can best be utilized via decision fusion to enhance the performance and reliability of disease prediction systems. For this purpose, we obtained results for both scenarios: applying the five classifiers (KNN, SVM, CNN, DT, and RF) individually to the different modalities after applying the collection of DRTs, and fusing the results of the five classifiers via the majority voting method. The fused decision is more accurate and reliable for disease prediction. Moreover, the results are based on the reduced feature sets attained after applying the DRTs. In this study, the prediction accuracy of each model is used to weight its vote; the accuracy of each ML model was calculated using sensitivity, specificity, the Area Under the Curve (AUC), and Receiver Operating Characteristics (ROC) [68], and a decision's weight in the vote increases with high sensitivity and specificity scores. For Img data, the highest accuracy of an individual classifier with Isomap is 92%. For Txt data, the highest accuracies of the individual classifiers are 92% for KNN and SVM with LSA, 95% for CNN with LSA, 88% for RF with Isomap, and 90% for DT with MDS. According to the majority voting method, KNN with PCA generates the best result for tab data, CNN with t-SNE achieves the highest score for image data, and CNN with LSA generates the best result for text data. The decisions of these classifiers are considered the most significant for decision fusion.
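The accuracy-weighted majority vote described above can be sketched on illustrative predictions. The study weights votes by performance scores such as sensitivity and specificity; plain accuracy is used here for brevity, and all predictions below are invented for illustration.

```python
# Sketch of decision fusion by a weighted majority vote: each of five
# classifiers votes, weighted by its own accuracy on the labels.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
preds = {                                   # illustrative outputs
    "KNN": np.array([1, 0, 1, 1, 0, 1, 0, 1]),
    "SVM": np.array([1, 0, 0, 1, 0, 1, 0, 0]),
    "CNN": np.array([1, 0, 1, 1, 1, 1, 0, 0]),
    "RF":  np.array([1, 1, 1, 0, 0, 1, 0, 0]),
    "DT":  np.array([0, 0, 1, 1, 0, 0, 0, 0]),
}

# Weight each classifier's vote by its accuracy
weights = {name: (p == y_true).mean() for name, p in preds.items()}

votes_for_1 = sum(w * preds[n] for n, w in weights.items())
total_weight = sum(weights.values())
fused = (votes_for_1 > total_weight / 2).astype(int)  # weighted majority
fused_acc = (fused == y_true).mean()
```

On these illustrative predictions, each individual classifier makes one or two errors, but the weighted majority vote corrects all of them, which is the behavior the decision fusion framework relies on.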
The knowledge and decision fusion frameworks are practically useful when the data comprise highly heterogeneous, multi-dimensional, and high-dimensional healthcare data. The proposed fusion frameworks can serve as a foundation for developing dynamic disease prediction and management systems that utilize multi-modal, multi-source, multi-nature, and high-dimensional health informatics data.

V. OPPORTUNITIES AND CHALLENGES OF DIMENSIONALITY REDUCTION AND FUSION APPROACHES USED FOR PREDICTIVE ANALYTICS IN HEALTH INFORMATICS
Despite the achievements and successful implementations of DRTs for dimensionality reduction, and of fusion approaches for combining multi-modal, multi-source, high-dimensional healthcare data for predictive analytics, these techniques still face many issues in their adoption and implementation. This section highlights the opportunities and challenges of DRTs and the fusion approaches used for health informatics data. These challenges indicate new opportunities for improvement to research communities.

A. HEALTHCARE DATA REPRESENTATION AND TRANSFORMATION
Data fusion approaches can select the most relevant features from the most effective modality having an appropriate number of features for disease prediction. In the real world, most healthcare data collected from multiple sources is found in different modalities (formats/data types) and has high dimensionality [5], [40]. DRTs can reduce the dimensionality of the data and present it in lower dimensions, but many DRTs work only with a specific type of data, e.g., images, text, or signals. Similarly, when fusing data from these diverse kinds of modalities, several issues arise, such as converting data types, measurement scales, and formats (qualitative and quantitative). In such situations, the fusion and processing of multi-nature data becomes a challenging task [1], [3]. Humans can often perceive such variations in data easily and make decisions accordingly, while relying on DRTs and data fusion tools for the same situation may require considerable computation and coding even to solve a simple problem. Despite such issues, the development of ML models that can automatically perform the fusion and integration of multi-modal data has become the need of the hour: it saves the time of manual interpretation each time new data arrives, and it also reduces the error rate.

B. MANAGEMENT OF STREAMING DATA
Another key challenge for data fusion approaches is dealing with multi-nature health-related data [59]. In the healthcare domain, a massive volume of data originates at a rapid pace every hour. The main benefit of this evolving data is that medical experts can get assistance from automatic disease prediction systems. Healthcare data is collected from different devices and sensors, and its integration has become a challenging issue [32]. Fusion approaches should be capable of dynamically combining data belonging to multiple formats and sources [69]. Although some fusion approaches have been proposed in the literature to manage continuous streaming data, challenging and unsettled issues remain for the effective analysis of large-scale and continuously growing data, including missing values, blind sources, and noise.

C. COMPLEXITY AND COMPUTATION
The increase in healthcare data and the diversity found in that data have become a big challenge. Similarly, the fusion process itself is complex and computationally expensive [3]. To overcome such issues, DRTs offer an effective solution by providing a low-dimensional representation of the data. However, DRTs also introduce data interpretation issues after transforming the data into lower dimensions. In some situations, a simple classifier is not enough to meet the requirements of multi-nature data and needs modification; there is a need to improve ML algorithms to handle complex healthcare data of varying nature.

D. FUSION STRATEGY
Data fusion approaches are often affected by the preceding phases, such as data gathering, preprocessing, and dimensionality reduction. The advantages and disadvantages of fusion may also depend on the strategy used for fusion. Similarly, environmental factors matter a lot, especially in healthcare [21]. A paper-based approach is still in practice in some healthcare organizations, due to which required data may be left out of the analysis. Likewise, the selection of a suitable fusion strategy for the given requirements is also a big challenge because of the varying needs of healthcare data as well as variations in treatment and diagnostic processes [5], [31].

E. HEALTHCARE DATA ANALYTICS
To obtain compact and precise healthcare data for analysis and decision making, selecting reliable data sources with few errors, little noise, and few missing values is a big challenge. For predictive analytics in healthcare, the time frame, correctness, and reliability of health status measuring devices (e.g., glucometers, ECG devices, blood pressure measuring instruments, etc.) and correct recording are essential. In particular, health informatics data collection, storage, processing, and transmission have become challenging and may affect the reliability, consistency, and accuracy of the data [70].

F. MEASUREMENT AND EVALUATION OF PERFORMANCE
For the fusion of data belonging to multiple sources and modalities, the measurement and evaluation of performance after applying fusion strategies (feature and knowledge fusion) is another challenging area to consider [59]. Greater dependency on IoT devices and technology has made performance evaluation more complex. In healthcare, there are many ways to measure performance, based on improved accuracy, reduced computation time, reduced error rate, reduced response time, and the significance and reliability of results; other measures include the F-measure, Area Under the Curve (AUC), Mean Squared Error (MSE), and Correlation Coefficient (COR) [68]. The selection of a suitable measuring approach to improve performance is also a challenging task.
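Several of the measures listed above can be computed with standard library functions; the sketch below uses invented predictions and probability scores purely to illustrate the calculations.

```python
# Sketch of common performance measures on illustrative predictions:
# accuracy, F-measure, AUC, and mean squared error.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             mean_squared_error, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])                     # hard labels
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3])   # probabilities

acc = accuracy_score(y_true, y_pred)       # fraction of correct predictions
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision/recall
auc = roc_auc_score(y_true, y_score)       # area under the ROC curve
mse = mean_squared_error(y_true, y_score)  # error of score-valued outputs
```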

G. VARIATION IN TERMS AND CONCEPTS
When combining data from different sources, variations are often found in terms, abbreviations, and concepts [71]. Terms such as "physician" and "doctor" represent the same entity, while similar words may be used for different purposes (contexts), such as "fly" as a noun and "fly" as a verb. Such variations are common in health informatics data and need modification according to the concepts and their use. Although Natural Language Processing (NLP) approaches play an influential role in overcoming this issue, it remains a big problem, especially when combining data from different modalities and sources [5].

H. DIMENSIONALITY REDUCTION AND FUSION OF REDUCED REPRESENTATIONS
When applying DRTs for dimensionality reduction, it may be challenging to answer the key questions: how far can the dimensionality of a given dataset be reduced? How many components should be used for analysis and decision making? Similarly, how many features or components should be eliminated to avoid noisy data? [72]. Most of the time, the answers vary according to the problem statement and the nature of the data, and it is complicated and time-consuming to evaluate the effects of the reduction. Variation in feature selection and extraction methods also affects the data representation and the aspects used to measure performance [31]. Moreover, different DRTs produce different reduced representations; for example, PCA represents the reduced feature set as PCs, while ICA represents it as ICs. In such situations, fusing the reduced representations becomes a challenge.
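One common, if partial, answer to the "how many components?" question is to keep the smallest number of principal components whose cumulative explained-variance ratio passes a chosen threshold. The sketch below uses synthetic low-rank data; the 90% threshold is an illustrative assumption, not a rule from this study.

```python
# Sketch of choosing the number of PCA components by cumulative
# explained variance, on synthetic data with a few latent directions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Most variance lies in three latent directions, plus small noise
latent = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 20))
X = latent + 0.1 * rng.normal(size=(200, 20))

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.90) + 1)  # components to keep for >= 90%
```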

VI. CONCLUSION
To combine multi-source, multi-modal healthcare data for predictive analytics, we proposed multi-modal, multi-source, knowledge, and decision fusion frameworks in this study.
We applied a collection of DRTs to reduce the dimensionality of the data while preserving the original aspects of the HDD, and different ML models were used for disease prediction. The proposed frameworks offer more reliable and authentic results, as they predict disease after combining the features of the data and the results (outcomes) at different phases. They can be helpful for the development of effective and reliable systems for health informatics. Moreover, the fusion approaches with reduced feature sets presented in this work can provide considerable support for developing CDSS, prediction systems, and intelligent health monitoring systems. The proposed fusion frameworks offer a clear representation of the different fusion levels, and the experimental results show that the proposed methodology achieved better accuracy than individual modalities and state-of-the-art fusion approaches. An apt problem for future study is to explore a comprehensive approach to identify how the relationships among features influence data and feature fusion, so that novel approaches can be explored in the future.