Introduction
In healthcare, the generation of vast amounts of patient data is aimed at improving care quality and cost reduction. However, effectively analyzing this data presents a significant challenge in identifying trends and patterns for problem-solving [1]. The advancement of information and communication technologies has facilitated the sharing of health information, enabling more sophisticated analysis of big data [2]. Big data analytics is recognized as a valuable tool for knowledge discovery from centralized and distributed databases. Leveraging advanced statistical models and machine learning algorithms, healthcare organizations can develop accurate and personalized treatments while identifying cost-saving opportunities. Additionally, big data analytics can optimize operational efficiency, reducing patient wait times and enhancing the effectiveness of medical staff. Ensuring data privacy and security is paramount as the adoption of big data analytics expands. Healthcare organizations can mitigate risks associated with data breaches and cyberattacks by implementing robust data governance frameworks and adhering to best practices for data handling [3].
The availability of vast amounts of medical big data has brought about a revolution in the healthcare industry. Leveraging big data analytics to share, analyze, and process this data holds immense potential for identifying treatment patterns and reducing healthcare costs. However, it is of utmost importance to prioritize the privacy and security of medical data, implementing robust measures and regulatory frameworks to safeguard patients’ information [4]. Any delay in treatment resulting from concerns about privacy or security can have life-threatening consequences, underscoring the need for healthcare providers to establish reliable infrastructure for systematic big data analytics. The integration of big data analytics in healthcare represents a significant advancement in improving public health, empowering healthcare providers to devise more effective treatment plans and enhance patient outcomes.
The healthcare industry is constantly confronted with vast amounts of data originating from diverse sources such as smart devices, electronic health records, medical imaging, and genomics [5]. Managing and analyzing this data is challenging because of its unstructured and complex nature. Figure 1 illustrates the general flow of data within the big data analytics ecosystem in healthcare. Technological advancements, including cloud computing, machine learning, and natural language processing, have significantly improved the management of healthcare big data by enabling efficient processing and analysis. The COVID-19 pandemic has further highlighted the importance of digitizing medical records and employing telemedicine, resulting in a substantial increase in data volume and placing additional pressure on the healthcare industry to manage and secure this data effectively. Consequently, efforts to develop robust security measures and privacy protocols have intensified to safeguard sensitive patient information. Blockchain technology presents a potential solution by offering a decentralized and secure approach to data storage and sharing. Furthermore, techniques such as differential privacy can be employed to anonymize data, preserving patient privacy while allowing meaningful analysis of big data [6]. As the healthcare sector continues to embrace big data technologies, ethical considerations such as data ownership, informed consent, and transparency must also be addressed. Establishing clear guidelines and regulations for big data usage in healthcare ensures its ethical and responsible application, benefiting patients and the healthcare industry as a whole.
The general characteristics of healthcare big data analytics with respect to the sources of data.
A. Motivation and Scope
The healthcare sector is grappling with various challenges, including rising expenses, growing demand for quality healthcare solutions, increasing patient expectations, and a shortfall of professionals [7]. The complexity of healthcare data arises from the fact that it is generated by diverse sources and systems such as electronic health records, medical imaging devices, and wearables, each with its own data structure and format. In addition, the data must comply with privacy regulations such as HIPAA and GDPR, which further complicates data integration and analysis [8]. The increasing availability of healthcare data and advancements in technology have paved the way for harnessing big data analytics in healthcare. However, there is a pressing need for a comprehensive review that synthesizes the existing frameworks, explores the implications, examines the diverse applications, and assesses the overall impacts of big data analytics in healthcare.
The research is motivated by the need to establish a comprehensive view of the frameworks in use. Such a view empowers researchers and practitioners to make informed decisions and adopt appropriate approaches. Additionally, exploring the implications can guide stakeholders in understanding the potential benefits and challenges associated with implementing big data analytics in healthcare. By conducting a Systematic Literature Review (SLR) on this topic, we aim to provide valuable insights and contribute to the understanding and advancement of this field. By identifying successful applications and assessing their impact, researchers can drive further development and exploration. The implementation of big data analytics (BDA) in healthcare requires a multidisciplinary approach, involving experts from fields such as computer science, statistics, and healthcare. The use of specialized technologies such as BDA, machine learning algorithms, and natural language processing can aid in the processing and analysis of healthcare data, enabling healthcare providers to gain insights into patient care and outcomes, disease patterns, and healthcare utilization. Despite the challenges, the potential benefits of BDA in healthcare are enormous, including improved patient outcomes, reduced healthcare costs, and more efficient healthcare delivery.
In this SLR, our primary goal is to provide a comprehensive exploration of the essential elements for harnessing the power of BDA within the healthcare domain. To achieve this, we strategically structured our paper to ensure a clear progression of topics that collectively paint a holistic picture of the intersection between advanced analytics and healthcare. We examined healthcare applications of big data, including how data science, machine learning, natural language processing, and deep learning can be harnessed to address real-world healthcare challenges. It was our intention to provide concrete examples and insights into how these tools can be leveraged to enhance vital signs monitoring, predict diseases, optimize patient management, and improve hospital operations.
The main contributions of this paper are mentioned below:
An in-depth analysis of the different components of BDA and how they interact with each other. This helps readers understand the complexities of BDA in healthcare and how it can be utilized effectively.
A detailed analysis of the promising application areas of BDA in healthcare. The paper discusses successful implementations of BDA in various healthcare areas and how they have improved healthcare outcomes and reduced costs. This information is valuable for healthcare practitioners and researchers who are interested in implementing BDA in their organizations.
A presentation of the challenges and limitations of BDA in healthcare. By highlighting these challenges, the paper helps readers understand the potential barriers to implementing BDA in healthcare and how they can be overcome.
A list of reliable and authentic sources of healthcare analytics that researchers and practitioners can use.
The paper also provides an inventory of modelling tools, techniques, and deployed solutions.
Finally, the paper highlights the advantages of using BDA in healthcare through various use cases. By presenting these use cases, the paper demonstrates the potential impact of BDA on healthcare outcomes and costs.
To the best of our knowledge, this review is among the very few comprehensive studies that address the above-mentioned contributions. This review is a valuable resource for anyone interested in the potential benefits and challenges of BDA in healthcare.
B. Organization of the Review
The rest of the paper is organized as follows. Section II reviews the published surveys in the same area and their limitations. Section III describes the overall SLR method used for this paper. Section IV describes the big data ecosystem employed in healthcare and answers RQ-1. Section V discusses big data applications in healthcare and answers RQ-2. Section VI presents a detailed answer to RQ-3 in its subsections. Section VII presents the open research challenges identified during the course of this research and answers RQ-4. Section VIII provides a discussion and implications related to the research questions. Finally, Section IX concludes the study. Figure 2 depicts the section layout of the review paper. Table 1 provides a list of acronyms used in this study.
Existing Surveys
While preparing our systematic literature review, we noticed that several surveys in the extant literature investigate the prospects and challenges associated with big data analytics in the healthcare domain. We also observed that current surveys remain focused on foundational basics and challenges in big data healthcare. For example, the study by Nambiar et al. [9] primarily examined the challenges and prospects associated with the use of big data analytics within the healthcare sector. The authors further discussed big data growth expectations for the year 2015, presented statistics on spending by geography, and, lastly, shed light on healthcare infrastructure. Raghupathi et al. [10] presented a comprehensive examination of the key attributes of big data, explored an architectural framework, and described several application possibilities within the healthcare domain. Andreu-Perez et al. [11] performed a systematic literature review (SLR) spanning the years 2008 to 2015. The objective of their study was to offer a thorough examination of advancements in the field of biomedical and health informatics within big data. Luo et al. [12] conducted a comprehensive examination of the recent progress made in the use of big data in several healthcare domains. The authors emphasized the substantial expansion observed within the last five years.
Islam et al. [13] conducted a systematic literature review (SLR) spanning the years 2005 to 2016, with a particular focus on the potential of healthcare analytics through data mining and big data. Bahri et al. [14] examined the difficulties and possibilities associated with the use of big data analytics within the healthcare sector. Galetsi et al. [15] undertook an SLR to investigate the potential of data-driven methods for enhancing the effectiveness of public health and healthcare organizations. Tandon et al. [16] conducted an SLR examining the use of blockchain technology in the healthcare sector. In a similar vein, Imran et al. [17] published a thorough study spanning two decades, offering valuable insights into the application of big data analytics in the field of healthcare. Their work serves as a roadmap for future research and development in this domain.
Khanra et al. [18] conducted a systematic literature review (SLR) spanning the years 2013 to 2019. The authors identified and analyzed five distinct viewpoints pertaining to the application of big data analytics in the healthcare domain. Ikegwu et al. [19] conducted an SLR on big data analytics in data-driven industries, aiming to explore the current state of knowledge in this area. In a similar vein, Zhang et al. [20] investigated the primary technologies employed in the rapidly expanding virtual world sector, commonly referred to as the Metaverse. Additionally, they examined the application of big data technology in crucial domains including e-health, transportation, commerce, and finance. After reviewing the extensive literature on big data healthcare analytics for the target period of 2013 to 2023, we concluded that the existing review papers lack a holistic focus on both healthcare and big data. Existing studies present thoughts on either big data or healthcare alone rather than discussing them together. In this review paper, we succinctly summarize the main issue and set the stage for further research. Compared with the existing surveys, our study offers a more extensive analysis of the current research deficiencies in the domain of big data in healthcare. This paper examines the ecosystem of big data in the healthcare sector, focusing on the issues that arise within this context. Additionally, a comprehensive analysis of various uses of big data in healthcare is conducted, and a compilation of reliable resources for dataset collection is presented. Finally, we analyze the benefits and applications of big data in the healthcare industry, thereby deepening our overall understanding of the subject matter.
Systematic Literature Review - Materials and Methods
This section delves into the methodology employed to meticulously craft the overarching framework of our Systematic Literature Review (SLR). The purpose of this section is to present a concise and transparent overview of the methodology and search strategy that was utilized in this review to identify and assess relevant research. The primary aim was to undertake an extensive comparative analysis of previous and ongoing initiatives, critically evaluate the research findings, and uncover any deficiencies or restrictions in current knowledge with regard to addressing the research questions.
A. Methodology
In line with established best practices, this SLR rigorously follows the methodology outlined by the widely recognized PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines [21], [22]. The thoroughly designed methodology of this study unfolds in a systematic manner across three distinct stages: Input, Process, and Output, each intricately interlinked to ensure a comprehensive and transparent approach to our review, as illustrated in Figure 3. For clarity, the sub-steps within each of these stages are further illuminated in this Figure, which follows a three-stage review process inspired by Kitchenham et al. [23]. This structured methodology not only guides our systematic examination of the literature but also enhances the reproducibility and reliability of our findings.
SLR Methodology and inclusion/exclusion process for selecting relevant paper in the study.
Firstly, we developed our research questions, conducted an initial scoping review of the literature, and consulted with experts in the field. We identified gaps in the literature and areas where further research was needed, which led to the development of our research questions. The research questions were designed to explore the big data ecosystem in healthcare, review the top applications, and provide a comprehensive list of authentic data sources and modelling tools. Furthermore, the motivation behind each individual research question is mentioned in Table 3.
To achieve these objectives, a systematic and rigorous search of prominent academic databases, including Google Scholar, PubMed, Scopus, and Web of Science, was conducted. The search was confined to scholarly articles published within the last ten years, specifically from 2013 to the current year. We queried these databases with specific keywords related to our research topic to ensure a comprehensive search. The keywords used in the search strategy were carefully selected based on their relevance to the research questions and to big data analytics in healthcare. We cross-checked all scientific studies against the original source of publication to ensure accuracy and completeness.
After collecting the initial set of papers, we performed a title and abstract screening to remove irrelevant studies and then conducted a full-text screening to select papers that met our inclusion criteria. We recorded and reported the number of papers screened, assessed for eligibility, and included in the final review. To ensure data accuracy and reliability, we performed a data extraction process using a standardized form that captured essential information from each included paper, such as study design, sample size, data source, and key findings. Lastly, we conducted a quality assessment of the included studies to evaluate the risk of bias and the overall methodological quality of each study. Two reviewers performed the quality assessment independently, and any discrepancies were resolved through discussion and consensus.
B. Research Questions
In this subsection, we introduce our research questions (RQs), which seek to investigate a particular facet within the wider context of our study. By constructing a well-defined research inquiry, it is possible to conduct a more comprehensive investigation into the topic at hand, therefore yielding significant and informative findings. Through the investigation of these research questions, our objective is to make novel contributions as listed in the contribution subsection. These RQs provide a comprehensive examination of the research topic, encompassing the underlying rationales, the methods utilized to address the issue, and the potential ramifications of the results. Our objective is to enhance the body of knowledge and offer significant perspectives to stakeholders in the healthcare industry through a comprehensive response to each research question.
RQ-1: What are the components of the big data ecosystem in healthcare, and how do they interact with each other? What are the main challenges and limitations of this ecosystem?
RQ-2: What are the most promising application areas of big data analytics in healthcare, and what are some examples of successful implementations in each area? How can these applications improve healthcare outcomes and reduce costs?
RQ-3: What are the most reliable and authentic sources of healthcare data, and what are some commonly used modelling tools, techniques, and commercial solutions (presenting use cases) in big data analytics for healthcare?
RQ-4: What are the open research challenges reported in the literature in the last three years, with a focus on solutions for the advancement of healthcare?
C. Search Strings
Our investigation started by searching for the keywords “big data,” “big data analytics,” “healthcare,” “clinical applications,” and “healthcare data.” We then augmented this set of keywords with the terms “survey,” “review,” and “literature.” In addition, our search query encompassed the terms “multimodal big data,” “natural language processing (NLP),” “blockchain,” “security,” “privacy,” and “electronic health records (EHR).” Logical operators such as “AND” and “OR” were used in conjunction with the search strings as necessary. We searched on Scopus because it is considered a standard for retrieving results from credible databases.
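As an illustration only, the combination of keywords with logical operators can be sketched as follows. The grouping of terms into three sets and the helper name `build_query` are assumptions made for this sketch; the exact strings submitted to each database are not reproduced here.

```python
from __future__ import annotations

# Illustrative sketch: OR joins the terms inside a group, AND joins the groups.
# The grouping below is an assumption, not the exact query used in the review.


def quote(term: str) -> str:
    """Wrap a phrase in double quotes so it is searched as an exact phrase."""
    return f'"{term}"'


def build_query(groups: list[list[str]]) -> str:
    """Join terms within each group by OR, then join the groups by AND."""
    clauses = ["(" + " OR ".join(quote(t) for t in group) + ")" for group in groups]
    return " AND ".join(clauses)


core_terms = ["big data", "big data analytics"]
domain_terms = ["healthcare", "clinical applications", "healthcare data"]
study_terms = ["survey", "review", "literature"]

query = build_query([core_terms, domain_terms, study_terms])
print(query)
```

Keeping each keyword set in its own list makes it straightforward to add the further terms (for example “blockchain” or “privacy”) as an extra group when a narrower query is needed.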
RQ-1 seeks to investigate the big data ecosystem within the healthcare domain and its interrelationships, including the challenges and constraints inherent in this ecosystem. The search query encompasses terms such as “big data ecosystem” and “healthcare ecosystem,” in conjunction with keywords such as “components,” “challenges,” and “limitations.” Furthermore, it encompasses pertinent topics such as the “life cycle of big data,” “search tools for big data,” and “characteristics of big data,” as well as the “opportunities” and “challenges” associated with big data in the healthcare domain. This search query assists in identifying scholarly publications and research projects that explore different facets of the big data ecosystem in the healthcare industry and provide insights into the issues and constraints encountered by the existing system.
The objective of the search query for RQ-2 is to ascertain the most promising domains for the use of big data analytics within the healthcare sector. This involves finding use cases where such analytics have been effectively implemented, evaluating the resulting achievements within each domain, and assessing the prospective advantages in terms of healthcare outcomes and cost savings. The query integrates key phrases such as “promising application areas in healthcare,” “utilization of big data analytics in the healthcare sector,” and “healthcare implementations.” These are combined with terms such as “healthcare outcomes” and “reduction of costs for healthcare.” Moreover, it encompasses distinct application domains such as “standardized documentation in healthcare,” “analysis of multimodal data,” “implementation of digital systems in healthcare,” “utilization of blockchain technology in healthcare,” and “advancement of healthcare education,” within the framework of big data in the healthcare sector. This search query aids in identifying pertinent scholarly publications.
The primary focus of the search query for RQ-3 is to find credible sources of healthcare data and the prevalent modelling tools, approaches, and commercial solutions used in big data analytics for healthcare. Additionally, it seeks to explore the factors associated with data governance and ethics in this domain. The search encompasses concepts such as “trustworthy data sources,” “credible data sources,” “tools for modelling,” “techniques for modelling,” “commercially available solutions,” “analytics for large datasets,” “management of data,” and “ethical implications.” By integrating the specified keywords, the search results include scholarly articles and research studies that examine reliable sources of healthcare data, commonly used modelling tools and techniques, commercially available solutions in the field of big data analytics, as well as strategies to tackle data governance and ethical concerns within the healthcare domain.
The search query for RQ-4 primarily aims to uncover open research challenges that have been documented in the literature in recent years. Additionally, it seeks to investigate prospective solutions that might contribute to the advancement of healthcare. The search encompasses key terms such as “big data healthcare open research challenges,” “big data healthcare future research directions,” and “advancement of healthcare.” Furthermore, it incorporates potential avenues for future research within the realm of big data healthcare.
The word cloud, as illustrated in Figure 4, was utilized to condense and visually represent the predominant search phrases employed in this study for the purpose of identifying pertinent research. Therefore, it can be inferred that our search strings exhibit a strong correlation with the specific research inquiries and the studies that have been chosen.
D. Inclusion and Exclusion Criteria
In the early stage of our research, we conducted a comprehensive search using specific keywords, which resulted in thousands of records. To ensure the relevance and quality of the studies that we analyzed, we implemented a detailed methodology that is depicted in Figure 3. Our methodology included several inclusion and exclusion criteria, such as selecting studies that were published within a specific time frame, written in the English language, and fully available to readers. All other papers that did not meet these criteria were excluded from our analysis. By following this rigorous methodology, we were able to focus on a specific subset of studies that met our pre-defined standards for inclusion in our research.
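The screening criteria named above (publication window, English language, full-text availability) can be expressed as a simple filter. This is a minimal sketch, not the tooling used in the review: the `Record` fields, the sample records, and the 2013 to 2023 window (taken from the target period stated elsewhere in this paper) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One retrieved search result (hypothetical fields for illustration)."""
    title: str
    year: int
    language: str
    full_text_available: bool


# Assumed publication window, based on the stated target period.
START_YEAR, END_YEAR = 2013, 2023


def meets_criteria(record: Record) -> bool:
    """Return True only if the record satisfies all three inclusion criteria."""
    return (
        START_YEAR <= record.year <= END_YEAR
        and record.language.lower() == "english"
        and record.full_text_available
    )


# Hypothetical records, one failing each criterion and one passing all.
records = [
    Record("Hypothetical study A", 2019, "English", True),
    Record("Hypothetical study B", 2011, "English", True),
    Record("Hypothetical study C", 2021, "German", True),
    Record("Hypothetical study D", 2022, "English", False),
]

included = [r for r in records if meets_criteria(r)]
print([r.title for r in included])
```

Encoding the criteria as one predicate keeps the exclusion decision reproducible: a record is dropped as soon as any single criterion fails.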
E. Data Extraction and Synthesis
The present study involved a comprehensive review of the selected research papers listed in the bibliography. We followed a systematic approach to extract and synthesize the relevant data, based on the key attributes specified in Table 4. These attributes included paper ID, study title, author names, publication date, open database access, publication source, research context, document type, the topic addressed, and citation count. We recorded the data in an Excel sheet and synthesized it in a way that enabled us to effectively manage and evaluate the research data.
To draw relevant conclusions from the data, we performed various analyses to extract insights such as the distribution of publications by year, publisher, reference type, and research study type. First, we present a classification of publications by year, as depicted in Figure 5; this chart shows an increasing trend in the number of selected studies and the spread of publication years. Second, we examined the distribution of publications by publisher, as shown in Figure 6; this chart shows that the diversity of our study selection is aligned with the inclusion and exclusion criteria presented in Figure 3 and that we tried to represent all major publishers. Further, we present the distribution of papers by reference type, as shown in Figure 7; the objective of this chart is to show the diversity of reference types (journal article, conference paper, etc.) included in this survey, and it also highlights the importance of the journal article category in the domain of this manuscript. Finally, we present an overview of the distribution of study types, as shown in Figure 8, which complements the previous chart. This chart shows the growing interest in the article types published within the scope of this manuscript, namely experimental papers and review articles such as our survey. These analyses helped us to identify trends, patterns, and insights that are relevant to our research questions.
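The extraction record and the distribution analyses described in this subsection can be sketched in a few lines. The field names mirror the attributes listed for Table 4, but the class name, the use of a publication year instead of a full date, and all sample values are hypothetical and serve only to illustrate the tallying step.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ExtractedPaper:
    """One row of the extraction sheet (field names are illustrative)."""
    paper_id: str
    title: str
    authors: str
    publication_year: int  # simplification of the publication date attribute
    open_access: bool
    publication_source: str
    research_context: str
    document_type: str
    topic: str
    citation_count: int


# Hypothetical sample rows; these are not data from the review.
papers = [
    ExtractedPaper("P1", "Hypothetical study A", "Author A", 2019, True,
                   "Publisher X", "Healthcare analytics", "Journal Article", "EHR analysis", 12),
    ExtractedPaper("P2", "Hypothetical study B", "Author B", 2021, False,
                   "Publisher Y", "Healthcare analytics", "Conference Paper", "Medical imaging", 3),
    ExtractedPaper("P3", "Hypothetical study C", "Author C", 2021, True,
                   "Publisher X", "Healthcare analytics", "Journal Article", "Wearables", 7),
]

# Tallies corresponding to the by-year, by-publisher, and by-type distributions.
by_year = Counter(p.publication_year for p in papers)
by_publisher = Counter(p.publication_source for p in papers)
by_type = Counter(p.document_type for p in papers)

print(sorted(by_year.items()))
print(by_publisher.most_common())
print(by_type.most_common())
```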
F. Quality Evaluation Question
In this subsection, we present four quality questions that we applied to evaluate the final selected papers with Quality Evaluation of Evidence (QEQ) scores.
Does the article explicitly discuss the big data and data analysis methods used?
Does the article discuss advantages and challenges related to the topic?
Does the article present or discuss potential applications of the topic in healthcare?
Are the outcomes presented in the article valid and aligned with the utilized methodology and topic of interest?
Table 5 summarizes the paper quality based on the designed QEQ. The score is calculated as follows: if the answer to a particular question is affirmative, a Y is counted, and each Y is mapped to 2.5 points, so that four Y answers together are mapped to 10 points, as defined in Eq. 1. If a selected study scored between 6 and 10, we counted it as a most relevant study and set its class category to “High”. For a paper to be in the “Medium” class, it must score between 3 and 5, and for a paper to be in the “Low” class, its score should be at least 2.5. In our analyses, very few studies fell into the low category because of our rigorous methodology; all papers had passed through our inclusion and exclusion criteria. All the referenced studies are used in our manuscript except those that fall below the “Low” class in our QEQ table. For example, a study that does not pass any of the quality questions is categorized as extremely low and directly excluded from consideration; such studies are not recorded in this table.
\begin{equation*} \mathrm{TotalScore\_Yes} = \left(\sum_{i=1}^{4} Y_{Q_i}\right) \times 2.5 \tag{1}\end{equation*}
\begin{equation*} \mathrm{TotalScore\_Other} = 10 - \left(\sum_{i=1}^{4} N_{Q_i}\right) \times 2.5 \tag{2}\end{equation*}
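The scoring rule in Eq. 1 and the class thresholds stated above can be sketched as a small function. The function names are hypothetical, and the sketch assumes that each of the four answers is recorded as a boolean (True for Y) and that N in Eq. 2 counts the remaining non-Y answers.

```python
POINTS_PER_YES = 2.5  # each affirmative answer is worth 2.5 points (Eq. 1)


def qeq_score(answers):
    """Total score for four quality questions; answers are booleans (True = Y)."""
    if len(answers) != 4:
        raise ValueError("exactly four quality questions are expected")
    return sum(answers) * POINTS_PER_YES


def qeq_class(score):
    """Map a score to the class labels described in the text."""
    if 6 <= score <= 10:
        return "High"
    if 3 <= score <= 5:
        return "Medium"
    if score == 2.5:
        return "Low"
    return "Excluded"  # no question passed; such studies are not tabulated


for answers in ([True, True, True, True], [True, True, False, False],
                [True, False, False, False], [False, False, False, False]):
    score = qeq_score(answers)
    print(answers, score, qeq_class(score))
```

Because every answer contributes either 0 or 2.5 points, the only attainable scores are 0, 2.5, 5, 7.5, and 10, so in practice “High” corresponds to three or four Y answers, “Medium” to two, and “Low” to one. Under the stated assumption, Eq. 2 yields the same total, since 10 minus 2.5 points per non-Y answer equals 2.5 points per Y answer.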
This QEQ table should be considered a subset of the full QEQ study. We applied the QEQ filter to 256 studies, as mentioned in Figure 3; here we report only a subset to keep the SLR methodology effective, avoid multi-page tables, and reduce the length of the manuscript. While our QEQ questions provide a useful framework for evaluating the quality of the papers, they are just one aspect of our comprehensive evaluation process. Other factors may also influence the quality of the papers, including but not limited to the study design, sample size, statistical analysis, potential biases, and the overall relevance and contribution of the research to the field.
G. Scheme to Answer Research Questions
This study aims to provide answers to the research questions developed based on an extensive review of existing surveys. Table 3 lists the proposed research questions, their motivation, and the direction of the answers provided in the study. Additionally, a research question-answer mapping Table 6 is included, which provides an overview of the references used for specific answers. Furthermore, the study includes a research gap Table 2, which compares various parameters used in the review with existing surveys. This approach provides a comprehensive analysis of the research questions and the existing literature, contributing to a better understanding of the research gap and providing insights for future research.
Ecosystem for Big Data Employed in Healthcare - Answer to RQ-1
The big data ecosystem/platforms can potentially improve the applicability of clinical research studies in real-world scenarios, which traditionally have been hindered by the diversity of the populations being studied. In addition, it provides an opportunity to perform patient stratification, which is necessary for successful and precise medical treatment [50].
The deployment of an ecosystem in the healthcare sector has a significant and transformative history, commonly referred to as Healthcare Informatics and Analytics (HCI&A) [87]. HCI&A encompasses the integration of information technology and data analytics in healthcare to improve patient care, decision-making, and operational efficiency. HCI&A has evolved over time, with distinct stages known as HCI&A 1.0, HCI&A 2.0, and HCI&A 3.0. Each stage represents advancements in technology and approaches to healthcare informatics. HCI&A 1.0 focused on the implementation of electronic health records (EHRs) and basic data analysis. HCI&A 2.0 introduced more sophisticated analytics techniques, such as predictive modeling and machine learning, to extract insights from large healthcare datasets. HCI&A 3.0 aims to leverage emerging technologies, such as artificial intelligence and the Internet of Things, to enable more personalized and proactive healthcare [185].
These revolutions in HCI&A are closely intertwined with the evolution of the World Wide Web. The utilization of Web 2.0, characterized by user-generated content and social media platforms, has played a significant role in facilitating collaboration, information sharing, and patient engagement in healthcare. However, as technology continues to advance, the healthcare industry is transitioning toward the development and optimization of Web 3.0. Web 3.0, also known as the Semantic Web, emphasizes the use of linked data and semantic technologies to enhance interoperability, knowledge representation, and intelligent decision support in healthcare.
In this section, we discuss the big data ecosystem that has been deployed in healthcare. To create our proposed ecosystem, we have adopted the most relevant aspects from the literature and have developed an optimized version of these characteristics, which is depicted in Figure 1. The ecosystem comprises various components that are responsible for the efficient processing and management of big data in healthcare. By implementing this optimized big data ecosystem, healthcare organizations can improve their operations and enhance patient care.
A. Healthcare - Big Data Life-Cycle
The term life cycle describes the sequence of stages that data passes through in a particular domain, as depicted in Figure 9. As in other domains, the healthcare data life-cycle typically involves the following stages, although the steps may change according to requirements. The life-cycle [13], [15], [18], [19], [39], [50], [78], [115], [128] usually starts with the data collection stage, followed by processing to transform and clean the data. Data analysis then involves applying techniques such as statistical or machine learning methods to identify patterns or insights in the data. The next step, data interpretation, involves drawing conclusions and making decisions based on the analysis. Finally, the data delivery stage involves presenting the findings to the end-users or stakeholders in a useful and actionable format.
Data Collection:- This stage involves collecting various (multimodal) data types, including patient data, medical records, laboratory data, and other information from relevant sources.
Data Storage:- Once the data has been collected, it needs to be stored in a way that is easily accessible and searchable. This may involve the use of cloud storage or other forms of secure data storage.
Data Processing:- Data processing involves using various tools and techniques to analyze the data and extract useful insights from it. Common methods include machine learning algorithms, statistical analysis, and other advanced data processing techniques.
Data Analysis:- Once the data has been processed, it can be analyzed to identify meaningful trends and patterns. Data visualization tools may also be applied to help make sense of the data and identify important insights; various tools are listed in Table 12.
Data Interpretation:- The final stage of the big data life-cycle in healthcare is data interpretation. This involves taking the insights gained from the data analysis and using them to make informed decisions about patient care, treatment plans, and other healthcare decisions.
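The stages listed above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record fields, the glucose values, and the 140 mg/dL cut-off are hypothetical and carry no clinical meaning, and an in-memory dictionary stands in for secure storage.

```python
from statistics import mean


def collect():
    # Hypothetical raw records from heterogeneous sources (illustrative values only).
    return [
        {"patient_id": "p1", "source": "lab", "glucose_mg_dl": "110"},
        {"patient_id": "p2", "source": "lab", "glucose_mg_dl": ""},  # missing value
        {"patient_id": "p3", "source": "ehr", "glucose_mg_dl": "182"},
    ]


def store(records, database):
    # Stand-in for secure storage: index records by patient identifier.
    for rec in records:
        database[rec["patient_id"]] = rec
    return database


def process(database):
    # Clean and transform: drop records with missing values, cast text to numbers.
    return {
        pid: float(rec["glucose_mg_dl"])
        for pid, rec in database.items()
        if rec["glucose_mg_dl"]
    }


def analyze(values):
    # Summarize the cleaned data.
    return {"n": len(values), "mean_glucose": mean(values.values())}


def interpret(summary, threshold=140.0):
    # The threshold is an arbitrary illustrative cut-off, not clinical guidance.
    return "review" if summary["mean_glucose"] > threshold else "no action"


summary = analyze(process(store(collect(), {})))
decision = interpret(summary)
print(summary, decision)
```

Keeping each stage as a separate function mirrors the life-cycle view: a stage can be replaced (for example, swapping the dictionary for a real database) without touching the others.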
B. Characteristics of Big Data/Health Records
The term “big data” has gained widespread usage in recent years, as the volume, velocity, and variety of data being generated and collected has increased exponentially. This has led to a need for advanced analytical methods that can extract valuable insights from this data. Big Data Analytics (BDA) is a collection of techniques and technologies designed to address these challenges, including machine learning, data mining, and predictive analytics. These tools enable researchers and analysts to identify patterns and relationships in large and complex datasets and to use this information to make better decisions and predictions.
Defining big data precisely can be challenging, as it often depends on the perspective of domain experts. Rather than a specific definition, experts typically look at the “V’s” of data, which represent the characteristics of data. The number of V’s has increased over time, with different experts defining them in their own way. In the literature, the number of V’s associated with big data has expanded from the fundamental 3-V’s to 5-V’s, 7-V’s, 40-V’s, and even 51-V’s, as reported in scientific material [44], [76], [108].
In the context of healthcare, the literature commonly refers to six V’s:
Volume: Refers to the size of data, indicating the vast amount of information generated and collected in healthcare.
Variety: Represents the complexity of data, encompassing different data types and formats, such as structured, unstructured, and semi-structured data.
Velocity: Signifies the speed at which data is generated, transmitted, and processed, highlighting the real-time nature of healthcare data.
Veracity: Refers to the quality and reliability of data, considering factors such as accuracy, consistency, and trustworthiness.
Value: Represents the knowledge and insights that can be derived from data, emphasizing the potential benefits and actionable information that data analysis can provide in healthcare.
Variability: Reflects the dynamic nature of data, accounting for fluctuations and changes in data patterns and characteristics over time.
C. Big Data Search Tools
In the context of Big Data, traditional search engines cannot handle the vast amount of unstructured data. Thus, Big Data search tools have been developed to enable efficient and effective searching and analysis of massive datasets. During our literature review, we found a GitHub repository [118] that lists two Big Data search tools, i) Apache Lucene and ii) Apache Solr, as summarized in Table 7. The two tools differ in terms of their key applications. Solr has gained more popularity among developers and researchers due to its ease of use, scalability, and ability to handle large amounts of unstructured data. Solr has also been widely adopted in the industry, with companies such as Netflix, eBay, and Instagram using it for their search applications [117], [118]. Our review suggests that these tools can be vital in managing and searching Big Data in various domains, including healthcare.
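As a minimal sketch of how such a search tool is queried, the snippet below builds a request URL for Solr's standard `select` handler using the common `q`, `rows`, and `wt` parameters. The host, the collection name `clinical_notes`, and the field name `note_text` are hypothetical; no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_solr_query(base_url, collection, text, rows=10):
    # q: the query, rows: maximum number of hits, wt: response format.
    params = {"q": f"note_text:{text}", "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical local Solr instance and collection.
url = build_solr_query("http://localhost:8983", "clinical_notes", "diabetes", rows=5)
parsed = parse_qs(urlsplit(url).query)
print(url)
```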
D. The Role of Big Data in Healthcare
Enabling EHRs opens up exciting prospects for enhancing clinical decision-making infrastructure. There are, nevertheless, significant challenges to overcome. In this section, we discuss the four major components of big data in terms of healthcare data: volume (size), variety (diversity), variability (temporal resolution), and value (quality). Then we discuss the sequential flow of descriptive, predictive, and prescriptive analytics, and how each plays a key role in clinical decision-making.
Currently, conventional primary healthcare models depend on various disconnected systems and information sources, many of which are now obsolete. The new digital healthcare paradigm is moving toward an inherent capability to exchange information between systems in a way that is both seamless and secure. EHRs are massively heterogeneous and multimodal by nature. Preserving clinical data is a key premise underlying all medical information management systems. The aim is not only to preserve the data’s originality but also to keep it as secure, private, and de-identified as possible. Big Data is characterized by its size, speed, diversity, accuracy, and significance, and in healthcare its volume is very large. In 2013, global healthcare data was predicted to amount to 153 exabytes, whereas 2314 exabytes were reported for 2020 by [123]. An 11000% change in data volume is also reported by [123]. Taken together, the two volume figures correspond to roughly a fifteen-fold increase over seven years, or compound growth of about 47% per year. Healthcare is a genuine big data sector [59], [124]. It is reported that 30% of the world’s stored data is health sector data [59].
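As a quick arithmetic check of the growth implied by the two volume figures quoted above (153 exabytes for 2013 and 2314 exabytes for 2020), the following sketch computes the overall multiplier, the compound annual growth rate, and the corresponding doubling time. It assumes smooth compound growth between the two years, which is a simplification.

```python
import math

start_eb, end_eb = 153.0, 2314.0  # exabytes, as quoted in the text for 2013 and 2020
years = 2020 - 2013

growth_factor = end_eb / start_eb                  # overall multiplier
percent_increase = (growth_factor - 1) * 100       # overall percentage increase
annual_rate = growth_factor ** (1 / years) - 1     # compound annual growth rate
doubling_time = math.log(2) / math.log(1 + annual_rate)  # years needed to double

print(f"x{growth_factor:.1f} overall, {percent_increase:.0f}% increase")
print(f"{annual_rate:.1%} per year, doubling every {doubling_time:.1f} years")
```

Under this assumption the volume grows about fifteen-fold, at roughly 47% per year, which corresponds to doubling a little less often than every two years.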
Velocity indicates how rapidly information is being generated, stored, or transmitted. Every year, a patient generates approximately 80 megabytes of data in the EHR [124]. By 2025 the growth of healthcare sector data will be around 36% of the global data [125], [126].
The value or quality of the data is determined by how well it can be used to generate and evaluate hypotheses. It is also important to know whether the collected data can help predict what will happen in the future; if so, we can act early to improve outcomes. Viability [107] is also a quality dimension that shows whether the data are useful for the use case. Given suitable data, data mining and artificial intelligence, together with sub-techniques including but not limited to machine learning, deep learning, and natural language processing, can be applied effectively, allowing us to understand more about clinical decision-making systems.
Big data in healthcare can be viewed through a conceptual framework in which artificial intelligence follows a path through descriptive, diagnostic, predictive, and prescriptive analytics. Understanding historical data is the goal of descriptive analytics, which employs methods such as data aggregation, data mining, and user-friendly visualizations. Reports that answer questions such as “How many patients were admitted to a hospital last year?”, “How many patients did not survive within the last 30 days?”, or “How many people became infected while being treated?” are typical examples of descriptive analytics. Descriptive analytics provides simple methods for summarizing data, with histograms and graphs displaying the attributes of data distributions. Linking datasets is typically necessary to acquire substantial insight for optimizing healthcare delivery while reducing costs. In other words, it is preferable to combine facts from various sources, which in practice means coordinating efforts throughout a hospital to share patient data. Because it is based on a single moment in the past, descriptive analytics is restricted in its potential to inform decision-making. It is helpful, but it may not be predictive of what happens next.
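The descriptive questions above reduce to simple counts and aggregations. The sketch below answers them over a handful of hypothetical admission records; the field names and all values are invented for illustration only.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical admission records (illustrative values only).
admissions = [
    {"patient_id": "p1", "admitted": date(2023, 3, 14), "died": None, "infected": False},
    {"patient_id": "p2", "admitted": date(2023, 3, 20), "died": date(2023, 4, 2), "infected": True},
    {"patient_id": "p3", "admitted": date(2023, 11, 5), "died": None, "infected": True},
    {"patient_id": "p4", "admitted": date(2024, 1, 9), "died": date(2024, 1, 25), "infected": False},
]


def admissions_in_year(records, year):
    # "How many patients were admitted last year?"
    return sum(1 for r in records if r["admitted"].year == year)


def deaths_within(records, as_of, days=30):
    # "How many patients did not survive within the last 30 days?"
    window_start = as_of - timedelta(days=days)
    return sum(
        1 for r in records
        if r["died"] is not None and window_start <= r["died"] <= as_of
    )


def infections_during_treatment(records):
    # "How many people became infected while being treated?"
    return sum(1 for r in records if r["infected"])


def admissions_per_month(records, year):
    # Monthly counts, i.e. the data behind a simple histogram.
    return Counter(r["admitted"].month for r in records if r["admitted"].year == year)


print(admissions_in_year(admissions, 2023))
print(deaths_within(admissions, as_of=date(2024, 2, 1)))
print(infections_during_treatment(admissions))
print(dict(admissions_per_month(admissions, 2023)))
```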
On the other hand, diagnostic analytics aims to determine the reason behind a phenomenon by analyzing the collected data. Diagnostic analytics could include correlation techniques that find links between clinical variables, treatments, and drugs. Predictive analytics, in turn, allows us to determine what will happen and how likely it is to happen. We might want to know, for example, how likely a patient is to die, how long they will be in the hospital, or how likely they are to get an infection. Predictive analytics uses the data’s past values to give useful information about important future events. Predictive analytics is in demand because healthcare professionals rely on evidence-based systems to predict and avoid adverse effects. In addition, predictive analytics facilitates early detection, saving lives and improving patients’ quality of life. Lastly, prescriptive analytics optimizes decisions by using all available information to select the best action. Predictive analytics helps us evaluate clinical interventions and examine the system’s usefulness. Furthermore, prescriptive analytics predicts what will happen and the reasoning behind why it will happen. Prescriptive analytics helps turn a prediction model into a decision model.
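One simple way to picture the step from a prediction model to a decision model is an expected-cost rule: given a predicted probability of an adverse event, choose the action with the lowest expected cost. The sketch below is only an illustration of that idea, not a method taken from the reviewed literature; the actions, costs, and risk reductions are hypothetical numbers.

```python
def expected_cost(p_event, cost_if_event, action):
    # Cost of the action itself plus the cost of the event, weighted by the
    # probability that the event still occurs after the action is taken.
    residual_risk = p_event * (1 - action["risk_reduction"])
    return action["cost"] + residual_risk * cost_if_event


def choose_action(p_event, actions, cost_if_event):
    # Prescriptive step: pick the action that minimizes expected cost.
    costs = {name: expected_cost(p_event, cost_if_event, a) for name, a in actions.items()}
    best = min(costs, key=costs.get)
    return best, costs


# Hypothetical actions and costs (arbitrary units).
actions = {
    "no_intervention": {"cost": 0.0, "risk_reduction": 0.0},
    "prophylaxis": {"cost": 200.0, "risk_reduction": 0.5},
}

low_risk_choice, _ = choose_action(0.02, actions, cost_if_event=10000.0)
high_risk_choice, _ = choose_action(0.30, actions, cost_if_event=10000.0)
print(low_risk_choice, high_risk_choice)
```

With these numbers the rule recommends no intervention for the low-risk patient and prophylaxis for the high-risk one, showing how the same predicted probability feeds a different decision as the risk changes.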
The availability of large-scale healthcare data offers several benefits but also poses several significant challenges. The first of these is interoperability, followed by privacy and security. With such a diversified healthcare system, which comprises continuous data sources and stakeholders such as healthcare providers, physicians, government agencies, and wearable technology, it is necessary to implement a centralized data repository. Maintaining the high level of interoperability required for efficient information sharing at the right times is a significant challenge.
The lack of standards in the healthcare field makes the situation even worse. Patient privacy and safety must be considered during the interoperability design phase. A lack of interoperability, for instance, could lead to medical errors and put patients at high risk. Having timely access to data is also crucial for ensuring patient safety. Patient data should be shared in real-time in response to a valid request, but care must be taken to protect patient confidentiality. This adds a new level of complexity to healthcare administration. A further difficulty with big data in healthcare is that the underlying facts change quickly.
Big data in healthcare is an invaluable resource that can be described by its size, variety, speed, veracity, and value. Clinical decision support systems use the information in this data by following a path from descriptive analytics to predictive analytics to prescriptive analytics.
E. The Need of Big Data Analytics in Healthcare
The number of BDA applications in healthcare is gradually expanding due to the increasing volume of big data in this area. Big data in healthcare may arrive from various sources, including diverse and multi-spectral observations on patients, such as their demographics, treatment histories, and diagnostic results. Data can be structured (e.g., genotype, phenotype, or genomics data) or unstructured (e.g., a collection of observations such as clinical notes, prescriptions, or medical imaging). When it comes to applying data in healthcare, it is frequently necessary to generate and gather high-quality real-time data. Decision-makers in the healthcare industry can take meaningful action based on significant insights gained from large amounts of information. The enormous rise in data acquired via EHRs, registries, or wearable sensors has brought a big data revolution to the health sector. The large volume of available data offers several benefits, such as improved quality of life, disease diagnosis, treatment, and healthcare service delivery. Big data generated in healthcare is massive, heterogeneous, and fast. In addition to handling non-uniform data, big data in healthcare requires real-time data analysis. Big data is evolving rapidly, and healthcare organizations are deploying technology to keep themselves updated.
According to Hardy Carter [51] and Dimitrov [38], the demand for big data in healthcare can be classified into potential advantages, as listed in IV-E. Real-time applications of big data in healthcare can also be grouped into three subcategories, namely: a) improving patient care, b) enhancing doctors’ experience, and c) reducing organizational efforts.
Predictive modeling to identify patient-centered conditions
Early detection and prevention of diseases or medical conditions
Extensive research and development to cure diseases
Promotion and use of Electronic Health Records (EHRs)
Patient engagement and empowerment through data-driven insights
Predictive analytics to identify and mitigate risks for patients
Alert generation for instant care
Health data analysis for strategic planning and resource allocation
Fraud reduction and data security enhancement
Reducing unnecessary hospital visits or emergency room visits
Integration of medical imaging and other diagnostic tools for better healthcare outcomes
Smart and better staff management
Continued education and development opportunities for medical professionals
Self-harm and suicide prevention using predictive analytics and intervention
Help to develop new inventions
Reducing administrative and managerial costs through data-driven decision-making
The list of potential applications of Big Data Analytics in healthcare presented above is not exhaustive. However, it provides a comprehensive overview of some of the most important applications. Achieving these benefits will enable big data-driven health organizations to outperform their peers in their daily operations. Figure 10 depicts the Big Data-driven health organization performance framework, which includes sub-health-domains. It is noteworthy that the successful implementation of big data analytics in healthcare requires the collaboration of various stakeholders, including clinicians, data scientists, and policymakers. This collaboration enables the integration of data from multiple sources and the development of algorithms that can provide actionable insights. It is also important to ensure that ethical considerations are taken into account when using patient data for research purposes. This can be achieved through the establishment of clear guidelines and protocols for data sharing and informed consent [101].
F. Opportunities for BDA in Healthcare
Following the discussion in the previous section, we prepared a comprehensive list of possible tasks or opportunities, which can be summarized in the following four categories.
Any of the opportunities mentioned below may be considered a BDA healthcare application for future use of this review. BDA can also be used to analyze patient data, such as symptoms, medical history, and lab results, to help healthcare providers diagnose more accurately.
Medical Diagnosis - Medical diagnosis can be multimodal in nature, drawing on data such as medical images (X-rays, MRIs), raw clinical text, time series, or tabular data.
Community Healthcare - BDA can improve community healthcare by analyzing population health data, identifying risks and trends, and developing targeted prevention and intervention programs. BDA can identify high-risk populations for diseases like diabetes and heart disease and develop outreach programs to promote healthy behaviors and prevent disease.
Hospital Monitoring - BDA may be used to monitor and enhance hospital operations, including patient flow, resource allocation, and quality of treatment. For example, BDA may be used to measure patient wait times, detect bottlenecks in the treatment process, and optimize resource allocation to enhance efficiency and save costs. BDA may also be used to evaluate patient safety and quality of treatment by assessing patient data, such as medication mistakes and adverse events.
Patient Care - BDA may be used to improve patient care by evaluating patient data, monitoring patient progress, and delivering individualized treatment suggestions. For example, BDA can be used to assess patient vital signs, examine pharmaceutical efficacy, and forecast patient outcomes. Moreover, BDA may be utilized to create individualized treatment plans predicated on patient data and medical history, allowing doctors to give more precise and efficient care.
G. Challenges of BDA in Healthcare
The benefits of big data analytics (BDA) in healthcare are significant, but implementing BDA poses various challenges. Our review of relevant literature [25], [32] has identified three main categories of challenges: data, process, and management challenges. To help visualize these challenges, we have created Figure 11. Note that the figure only lists a few challenges for each category, as the list of potential challenges is extensive. In this section, we briefly describe the most commonly reported challenges.
Data privacy and security: Healthcare data is sensitive and contains personal information that must be protected to comply with regulations such as HIPAA. Implementing Big Data in healthcare requires a robust security infrastructure to protect patient information from unauthorized access or theft.
Data quality: The quality of healthcare data [35] is crucial to ensure accurate analysis and predictions. However, healthcare data is often incomplete, inaccurate, or inconsistent due to various factors such as human error, outdated systems, or inadequate data management practices.
Data integration: Healthcare data is often stored in multiple disparate systems, making it difficult to integrate and analyze effectively. Integrating data from different sources requires a standardized data format and a robust data integration infrastructure.
Resource Constraints: Implementing BDA in healthcare requires significant resource investment, including hardware, software, and personnel. Lack of resources may hinder the implementation of BDA in healthcare.
Data governance: Effective healthcare data governance is crucial to ensure compliance with regulations, maintain data quality, and protect patient privacy. This requires a clear definition of roles and responsibilities, policies and procedures for data management, and a framework for data sharing.
Skills and expertise: Implementing Big Data in healthcare requires skills and expertise in various areas such as data analytics, data science, machine learning, and software development. Healthcare organizations may need to invest in training or hire new talent to build the necessary capabilities.
Cost: Implementing Big Data in healthcare can be costly due to the need for infrastructure, hardware, software, and human resources. Healthcare organizations may need to invest significant resources to implement a robust Big Data infrastructure.
Resistance to change: Healthcare is a highly regulated and conservative industry, which can lead to resistance to change. Implementing Big Data in healthcare requires a culture shift towards data-driven decision-making and a willingness to adopt new technologies and practices.
Ethical considerations: The use of Big Data in healthcare presents ethical questions, such as the use of patient data for research or business, the possibility of discrimination or bias, and the need to tell patients how their data is being used.
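To make the data quality challenge in the list above more concrete, the following sketch profiles a small set of records for missing values, out-of-range values, and duplicate identifiers. The field names, the plausibility range for heart rate, and the records themselves are hypothetical and chosen only for illustration.

```python
def quality_report(records, required, valid_ranges):
    # Count empty or absent values for each required field.
    missing = {
        f: sum(1 for r in records if r.get(f) in (None, ""))
        for f in required
    }
    # Count present values that fall outside a plausible range.
    out_of_range = {
        f: sum(
            1 for r in records
            if r.get(f) not in (None, "") and not (lo <= r[f] <= hi)
        )
        for f, (lo, hi) in valid_ranges.items()
    }
    # Count repeated patient identifiers.
    ids = [r.get("patient_id") for r in records]
    duplicates = len(ids) - len(set(ids))
    return {"missing": missing, "out_of_range": out_of_range, "duplicate_ids": duplicates}


# Hypothetical records containing typical defects.
records = [
    {"patient_id": "p1", "heart_rate": 72, "sex": "F"},
    {"patient_id": "p2", "heart_rate": 400, "sex": ""},     # implausible value, empty field
    {"patient_id": "p2", "heart_rate": None, "sex": "M"},   # duplicate id, missing value
]

report = quality_report(
    records,
    required=["heart_rate", "sex"],
    valid_ranges={"heart_rate": (20, 250)},  # illustrative range, not clinical guidance
)
print(report)
```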
Addressing these challenges requires a strategic approach that considers the unique characteristics of the healthcare industry and the specific needs of patients and providers. By addressing these challenges, Big Data can significantly impact healthcare, leading to better outcomes, improved efficiency, and reduced costs [18], [37], [43], [45], [46], [65], [68], [71]. Each of the above-mentioned challenges could be described in detail, but here we focus only on security and privacy concerns, as discussed in the following subsection. The list of challenges cannot be finalized, as we came across various interchangeable terms. Individual researchers [55], [57], [58], [64], [68], [80], [127] have made efforts to list some of them, and we have likewise presented a short but effective list of the challenges faced in BDA healthcare.
H. Big Data Security and Privacy in Healthcare
Security and privacy [60], [61] in the context of big data are crucial considerations. However, the two terms are often mistakenly treated as the same, although they refer to distinct concepts that can be difficult to differentiate.
Security is the confidentiality, integrity, and availability of data, while privacy is the appropriate use of a user’s information. For security, various techniques such as encryption and firewalls are used to prevent data compromise arising from technology or from vulnerabilities in an organization’s network. To maintain privacy, an organization cannot sell its patients’ or users’ information to a third party without their prior consent. Security may provide confidentiality or protect an enterprise or agency, while privacy concerns a patient’s right to safeguard their information from any other parties. Security offers the ability to be confident that decisions are respected, while privacy is the ability to decide what information about an individual goes where. In short, security focuses on data protection, while privacy concerns the appropriate use of a user’s information [47], [54], [129].
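The distinction can be illustrated with two small checks. The first is a security control: a keyed hash (HMAC) that detects whether a record has been altered. The second is a privacy control: a consent gate that decides whether a record may be shared with a given recipient. This is a minimal sketch under stated assumptions; the key, the record layout, and the `consented_recipients` field are hypothetical, and a real deployment would need proper key management and encryption.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only


def sign(record):
    # Security (integrity): keyed hash over a canonical serialization of the record.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def is_untampered(record, signature):
    # Constant-time comparison of the expected and supplied signatures.
    return hmac.compare_digest(sign(record), signature)


def may_share(record, recipient):
    # Privacy (appropriate use): share only with recipients the patient consented to.
    return recipient in record.get("consented_recipients", [])


record = {
    "patient_id": "p1",
    "diagnosis": "hypertension",
    "consented_recipients": ["treating_physician", "research_lab"],
}
signature = sign(record)
tampered = dict(record, diagnosis="none")

print(is_untampered(record, signature), is_untampered(tampered, signature))
print(may_share(record, "research_lab"), may_share(record, "advertiser"))
```

A record can pass the integrity check and still fail the consent check, which is the sense in which security alone does not guarantee privacy.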
Security is often defined as the prevention of illegal access; various definitions also include the preservation of data integrity and availability, among other things. It is primarily concerned with protecting data against malicious attempts and against theft for financial gain. Although security is critical for data protection, it is insufficient on its own for securing personal information. A fuller understanding and implementation would require dealing with the related sub-areas, which we leave uncovered as they are outside our coverage.
Privacy is frequently characterized as protecting sensitive information, such as personally identifiable healthcare information, from being disclosed to unauthorized parties. In particular, it focuses on the use and control of an individual’s personal data, including the development of rules and the establishment of authorization criteria to guarantee that personal information about patients is gathered, disseminated, and handled appropriately. The main sub-domains remain open for further exploration; we do not cover them here, as they are outside the domain of this study.
Big Data Applications in Healthcare - Answer to RQ2
The healthcare industry’s sources of Big Data include hospital records, patient medical records, test results, and Internet of Things (IoT) devices. Biomedical research generates Big Data that is used in public healthcare. By integrating biological and healthcare data, modern healthcare organizations can modify medical therapy and even personalize medicine, as noted by Dash et al. [74]. Healthcare has become a vital component of people’s lives, resulting in an explosion of medical big data. Healthcare practitioners are now utilizing IoT-based wearable technology to expedite diagnosis and treatment. The Internet has recently connected billions of sensors, devices, and automobiles [38]. Remote patient monitoring is one such technique used currently in inpatient treatment. Despite the benefits of these technologies, they also raise significant concerns regarding the privacy and security of data during transit and logging. The delay in treatment could risk the patient’s life.
A. Health Standards Documentations
The development of the health platform has taken a significant amount of time. The Clinical Document Architecture (CDA R1) was first defined in May 2005 [134], and it became an American National Standards Institute (ANSI) approved HL7 standard [135], [136], which became the specification for the Reference Information Model (RIM) [190]. Although it has been disseminated worldwide, its implementation is still not as widespread as it should be. The Continuity of Care Document (CCD) is an HL7 CDA implementation of the Continuity of Care Record (CCR). A summary of the patient’s health state, including issues, drugs, and allergies, is included in the CCR data set. This summary includes fundamental information regarding the patient’s care plan, documentation, and health insurance [62], [137]. Further, the literature shows a recent and rapid migration from the HL7 document structure to FHIR (Fast Healthcare Interoperability Resources). FHIR is HL7’s latest and most popular standard for electronic healthcare data exchange. It builds on HL7 Version 2 and HL7 Version 3 while offering more contemporary and flexible interoperability, using current web technologies such as RESTful APIs and JSON for structured data exchange. Healthcare providers, suppliers, standards development groups, and individual contributors collaborated on FHIR. HL7 encouraged conversations, consensus-building, and field testing to ensure FHIR met stakeholder interoperability needs, and input from the healthcare community helped to provide a more current and adaptable healthcare interoperability standard [191], [192].
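To illustrate what FHIR's JSON-based exchange looks like in practice, the sketch below builds a minimal Patient resource and the URL for a RESTful read of it. The element names (`resourceType`, `id`, `name`, `gender`, `birthDate`) follow the FHIR Patient resource, and the read pattern is `[base]/[type]/[id]`; the server base URL and the patient details are hypothetical, and no request is sent.

```python
import json

FHIR_BASE = "https://fhir.example.org/baseR4"  # hypothetical server base URL

# Minimal Patient resource with invented demographic details.
patient = {
    "resourceType": "Patient",
    "id": "example-001",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1980-04-12",
}


def read_url(base, resource):
    # RESTful read interaction: GET [base]/[type]/[id]
    return f"{base}/{resource['resourceType']}/{resource['id']}"


payload = json.dumps(patient, indent=2)  # body as it would travel over HTTP
print(read_url(FHIR_BASE, patient))
print(payload)
```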
B. Multimodal Big Data Analytics
The term “multimodality” refers to the use of a wide range of data types along with various modes of representation. Data in healthcare is multimodal by nature and is becoming more so day by day. Emerging technologies allow people to interact with a system in different ways and to combine different types of information simultaneously. With the help of AI, multimodal analysis seeks insights from different types of data parameters by making connections between them [193]. In fields such as biology, medicine, and health, it can help analyze connections between different biological processes, health indicators, and outcomes. It can also be used to create models for understanding and explaining these relationships [90], [104], [113].
In high-performance computational sciences such as big data analytics and processing, multimodality is a relatively new concept that aims to integrate multiple data streams in various formats, such as text, image, video, and audio, to enhance the precision of information extraction and inference, reduce bias, and generate an overall better representation of the physical, medical, or societal processes that are described by the data. Incorporating multimodality into the processing of multidimensional data sets in mission-critical domains such as health and medicine can help to design better decision support systems, yielding inherently better health analytics and improved prediction, diagnosis, risk-factor assessment, and patient follow-up. These systems are to be used by health professionals and policymakers.
Multimodal data encompasses information derived from multiple sources or modalities, which unveil essential characteristics of real-world domains, including clinical applications. Currently, in the clinical domain, machine learning models for disease severity, disease diagnosis, and mortality prediction require multimodal data to achieve better results than conventional approaches. Missing data is commonly reported in multimodal data [86], [113]. Different types of examinations are conducted for individual patients, and missing data may arise from mishandling or data corruption errors in several components, such as demographic data collection, lab test data, and clinical notes [113].
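One common and simple way to cope with missing values before modeling is mean imputation combined with a missingness indicator, so that a downstream model can still see which values were absent. The sketch below is an illustration of that general technique rather than a method from the cited studies; the features and values are hypothetical, and it assumes each feature is observed for at least one patient.

```python
from statistics import mean


def impute_with_indicator(rows, features):
    # Per-feature mean over observed values
    # (raises StatisticsError if a feature is never observed).
    means = {f: mean(r[f] for r in rows if r.get(f) is not None) for f in features}
    completed = []
    for r in rows:
        out = {}
        for f in features:
            missing = r.get(f) is None
            out[f] = means[f] if missing else r[f]
            out[f + "_missing"] = int(missing)  # 1 where the value was imputed
        completed.append(out)
    return completed


# Hypothetical patients with values missing from different components.
rows = [
    {"age": 60, "lactate": 2.0},
    {"age": 70, "lactate": None},   # lab test not performed
    {"age": None, "lactate": 4.0},  # demographic field not recorded
]
completed = impute_with_indicator(rows, ["age", "lactate"])
for row in completed:
    print(row)
```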
Recently, COVID-19 became a reason for adopting a multimodal approach, because the literature shows that COVID-19 data was multimodal in nature. This increased the demand for tools and techniques to predict, prevent, and manage diseases at a large scale, from a single patient to the whole sample [84], [90]. The extensive use of multimodal approaches has also been reported in several studies in other areas of biomedicine and health, such as chronic disease surveillance, screening and assessing child mental health, oncology, emotion detection, ophthalmology, and detecting dementia [138]. The article by Baltrusaitis et al. [56] provides a comprehensive review of recent developments in multimodal machine learning.
The Mask Adherence Estimation Tool (MAET) is an application-based study presented by Gupta and Srivastava [104]. MAET is a robust system that detects patterns of public mask-wearing using the pre-trained model YOLOv5 and integrates YOLOv5 with explainability to help the user understand results at an individual and aggregate level. Another study [100] presented applied research for predicting the ICU-admission ratio using a factor graph-based model. Another study [89] presented a semantic network analysis of a pandemic patients’ vaccine text dataset from Reddit. Zadorozhny et al. [111] suggested a set of practical evaluations and tasks to consider when selecting the best method for detecting out-of-distribution (OOD) samples for a particular medical dataset. During the pandemic, users shared their stories and experiences, and governments conveyed precautionary messages, on social media channels including but not limited to Reddit, Twitter, and Facebook. This rich information became a useful source for researchers to collect and analyze multimodal data. Rohan Bhambhoria et al. presented a naïve named entity recognition (NER) based paper for clinical insights on COVID-19 Twitter data.
In study [139], the authors introduced a Python-based library known as PyHealth. It is a complete Python healthcare AI toolkit developed for ML researchers and healthcare professionals. PyHealth accepts a wide range of healthcare data, including longitudinal EHRs, continuous signals (ECG, EEG), and clinical notes (to be added), and supports deep learning and other advanced machine learning algorithms. PyHealth has five key benefits. First, predictive health algorithms such as XGBoost and auto-encoders are included, as well as current deep learning architectures such as convolutional and adversarial models. Second, PyHealth has broad coverage with models for sequence, visual data, physiological signals, and unstructured text data. Third, PyHealth provides a consistent API, thorough documentation, and interactive examples for all methods, making complicated deep learning models simple to use. Fourth, most PyHealth models have cross-platform unit testing with continuous integration, code coverage, and maintainability checks. Fifth, fast GPU computing for deep learning models is supported through PyTorch, and parallelization is enabled in select modules (data preprocessing). The PyHealth library comprises a collection of 30 AI-based models. For a comprehensive list of these healthcare AI models available in PyHealth, interested readers are encouraged to refer to Table-1 in [139].
In a recent study Joshi et al. [106] state that data sharing and collaborative model training are promising ways to improve the quality of healthcare models. However, it is usually difficult to implement such settings in practice due to data privacy concerns and relative regulations such as the GDPR and HIPAA.
Shah et al. [66] presents a tutorial study on big data and predictive analytics. They presented four major barriers to useful risk prediction:
Data quality and heterogeneity
User trust, transparency, and commercial interests
Statistical prediction
Thoughtful identification of risk-sensitive decisions
Big data and predictive analytics have significant potential to promote better and more efficient treatment, and there have been important recent developments, particularly in image analytics, including in the following sub-domains [66]:
Clinical diagnosis and research
Disease transmission and prevention
General Healthcare
Health insurance
Service delivery system
C. The Concept of Data Fusion in Multimodal Data
Data fusion can be considered the study of data sets from different sources interacting with each other [33], [95]. These studies further suggest that incorporating data fusion into data analysis improves the performance of a given framework, methodology, or algorithm. Fusion is generally classified into two types: model-agnostic approaches and model-based approaches. Fusion strategies are further classified into three sub-types: i) early, ii) late, and iii) hybrid data fusion. Data fusion, which uses ML and DL techniques to combine data from different sources, is becoming increasingly important in medicine and is widely used in the research community as a suitable method for multimodal data analysis [86]. Since the inception of the fusion concept, several methods [33], [56], [96], [137], [161], [193], [194] have been proposed to deal with the fusion of multiple data types; we therefore prepared a comprehensive taxonomy of fusion handling techniques, presented in Figure 12.
Integrating multiple sources of data is a challenging task. Data fusion techniques and levels aim to provide data integration for multimodal data in cases where a single modality is not sufficient. However, data fusion is challenged by noisy and irrelevant data, which can lead to weak models and degraded performance [95]. Furthermore, data fusion steps such as combining and normalizing data require high computational power, which is a severe challenge for multimodal data fusion [88]. A final challenge is that no “off-the-shelf” technique is available that always works for any combination of data types, and fusion is not guaranteed to improve results over a single modality. Nevertheless, algorithms such as generalized low-rank modelling (GLRM) could be considered to combine different types of data and to develop better prediction models.
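The difference between early and late fusion can be illustrated with a minimal sketch. It assumes two hypothetical modalities (lab values and imaging-derived features) and fixed, untrained logistic weights chosen only for illustration; none of the names or numbers come from the cited studies. Early fusion concatenates the features before a single model, whereas late fusion scores each modality separately and then combines the scores. A hybrid scheme would mix both.

```python
import math


def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_score(features, weights, bias=0.0):
    """Logistic score of one feature vector under a fixed linear model."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)


def early_fusion(modalities, weights, bias=0.0):
    """Concatenate the features of all modalities, then apply one model."""
    fused = [value for features in modalities for value in features]
    return linear_score(fused, weights, bias)


def late_fusion(modalities, per_modality_weights, mixing=None):
    """Score each modality with its own model, then combine the scores."""
    scores = [
        linear_score(features, weights)
        for features, weights in zip(modalities, per_modality_weights)
    ]
    # Default to a plain average when no mixing coefficients are given.
    mixing = mixing or [1.0 / len(scores)] * len(scores)
    return sum(m * s for m, s in zip(mixing, scores))


# Hypothetical toy patient: two lab values and three imaging-derived features.
labs = [0.8, -1.2]
imaging = [0.1, 0.4, -0.3]

p_early = early_fusion([labs, imaging], weights=[0.5, -0.25, 1.0, 0.2, -0.7])
p_late = late_fusion([labs, imaging], [[0.5, -0.25], [1.0, 0.2, -0.7]])
```

In this sketch the early-fusion model can in principle learn interactions across modalities, while the late-fusion variant still produces a score when one modality-specific model is retrained or replaced independently.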
D. Current Digital Healthcare System
The term “digital health” is understood as advanced analytics based on multimodal data. It aims to maximize the use of IoT-based sensors so that clinicians can access the right information at the right time. Digital health systems also make it possible to collaborate with specialists across borders. The implementation of digital healthcare systems has revolutionized the healthcare industry by automating routine laboratory work and essential procedures, thereby enabling clinicians to allocate their time and attention to critical cases. The integration of AI and DL platforms within these systems further enhances their capabilities, allowing them to perform necessary actions and support clinicians in delivering efficient and effective diagnoses and treatments for patients. Furthermore, digital systems help automate billing and other documentation work. Finally, digital healthcare systems make it possible to provide care to a single patient while also providing care to thousands of patients at the same time [63].
The demand for digital and internet-based healthcare systems is growing rapidly. It is anticipated that by the end of 2050 the number of older people aged around 60 will reach 200 million, and that 80% [140] of them will be from developing countries, for whom a healthcare system is a major concern.
The authors of [137] propose the Tianxia120 digital medical health system, which offers a “one-step service” to both patients and hospitals. The system is intended to shift the service relationship between doctors and patients from a “passive mode” to a “proactive mode” and to provide an online service that resembles offline medical treatment scenarios. There are separate terminals for patients and doctors. The authors further claim that the system is fully functional and rich in data, and that major security concerns have already been addressed.
Ferrari et al. [103] presented a review-based study of data-driven and AI-based clinical practices. They reviewed several recent studies and compiled a list of recurrent research issues, including but not limited to i) data imbalance, ii) data inconsistency, and iii) data sparsity, followed by three real-world case studies (“My Smart Age with HIV”, “Covid–19, predicting respiratory failure”, and “Covid–19, predicting oxygen therapy states”) that support the list of challenges.
In current healthcare innovation, the concept of the four P’s (P4) is gaining popularity: preventive, predictive, personalized (individually tailored), and participatory healthcare can not only make people’s lives much better but also save a lot of money and make healthcare more efficient. One such study focused on age-related disorders and investigated the potential of a data-driven method to forecast the wellness states of aging persons, as opposed to the knowledge-driven approach that depends on easy-to-interpret measures routinely supplied by clinical specialists. The results show that the data-driven method is better at making predictions. The authors also show that a post hoc inference procedure can be used to explain the predictive models in a meaningful way, opening the door to new kinds of personally tailored and preventative care [85], [96].
E. Big Data and MDL for Healthcare
Analysis of big data by machine or deep learning (MDL) offers considerable benefits in evaluating large and complex healthcare corpora [70], [141]. However, MDL poses several challenges that need consideration, as mentioned above in Section IV-G. One advantage of MDL in healthcare is its flexibility and scalability compared to traditional bio-statistical methods. It can be used for various tasks, including risk stratification, diagnosis and classification, and survival prediction. MDL can also analyze diverse data types, including demographic data, laboratory findings, imaging data, and doctors’ free-text notes, and incorporate them into predictions of disease risk, diagnosis, prognosis, and appropriate treatments. The application of MDL in healthcare also presents unique challenges: data pre-processing, model training, and refinement of the system with respect to the actual clinical problem are all crucial. Additionally, ethical considerations, such as medico-legal implications, doctors’ understanding of machine and deep learning tools, and data privacy and security, must be taken into account [49], [56], [94], [97].
F. Natural Language Processing in Healthcare
The study [91] proposed MedCAT, an open-source toolkit for the annotation of medical concepts. It uses a self-supervised machine learning algorithm to extract concepts using any standard concept vocabulary, such as UMLS and/or SNOMED-CT, and also provides a customizable information extraction interface. MedCAT achieved an improved F1 score in comparison to the available benchmark (F1: 0.448–0.738 vs. 0.429–0.650). MedCAT is an open-source Named Entity Recognition + Linking (NER+L) and contextualization library. It is based on CogStack [142], an application framework that extracts data from unstructured data, and integration with the CogStack ecosystem makes MedCAT easy to deploy in health systems. The annotation tool MedCATtrainer [143] lets clinicians inspect, correct, and improve the extracted concepts through a web interface designed for training MedCAT information extraction pipelines.
The authors in [93] introduce a fast and accurate fully automated method for detecting COVID-19 in a patient’s chest CT scan, also termed HRCT lung scans. They introduced their own dataset of 48,260 CT scan images from 282 healthy people and 15,589 images from 95 people with COVID-19 infections. They also proposed a novel image processing algorithm that quickly analyzes the status of the lungs in order to discard non-suspicious images from the complete input dataset; this helps to reduce preprocessing time and the false detection ratio. Further, the study combined the ResNet50V2 model with a new feature pyramid network optimized for classification, allowing the model to explore images at varying resolutions without losing information on fine details; this improves classification performance significantly because COVID-19 infections come in numerous sizes, including microscopic ones. The authors claim to be the first to evaluate their algorithm on Xception and ResNet50V2. In single-image classification, this approach achieved 98.49% precision, and in a real-time system it correctly classified 234 out of 245 input images. Some of the studies from the literature that target multimodal data using NLP are listed in Table 8, which also summarizes NLP in healthcare from applications through to resources.
G. Health Education Promotion
Health education [27] is gaining importance and is considered a crucial topic in healthcare discussions. Hsu et al. presented a gamification-based health education promotion study named KABAN. The study focused on the health literacy and knowledge of older adults through game-based learning. The study was conducted by pre-trained instructors, and, based on the instructors’ feedback, KABAN strongly supported the idea that older adults’ health literacy can be enhanced through effective gamified learning designs. The authors explained their choice of older adults as the target age group: existing studies [105] indicate that, around the globe, this age group is less educated than today’s young people and teenagers. Older people are motivated and interested in learning more about their health through a game-based health intervention [83]. In [105], the authors also discussed how important instructors were to the success of the study and how helpful their feedback on the design of the intervention was. The shortcomings of this study are the selective age group and the selection of participants from socially active communities. A good understanding of health makes it easier for older people to live in a way that is good for their health, and initiatives such as KABAN can help people learn about health. The authors look toward future interventions in this area of healthcare.
Healthcare organizations are searching for appropriate technology that will simplify resources to improve the patient experience and the organization’s overall performance. The authors of [16] and [18] suggest that healthcare can be thought of as a system with three basic parts: the patient, the provider, and the system. Providers comprise core medical care service providers, such as physicians, nurses, technicians, and hospital administrators. The system comprises critical services related to medical care, such as medical research and health insurance. Patients comprise the recipients of medical care services, such as patients and the general public.
H. Block-Chain and Healthcare
In study [75], the authors proposed using blockchain to secure the management and analysis of healthcare big data. Blockchain technology is prohibitively expensive for most resource-constrained IoT devices destined for smart cities, because it requires substantial bandwidth and computational power, so using blockchain with IoT devices presents several challenges. To address these issues, the authors present a novel architecture of modified blockchain models suitable for IoT devices. The additional privacy and security features of their model are based on advanced cryptographic primitives, and the approach uses a blockchain-based network to make IoT data and transactions more secure and anonymous.
In a recent study [102], the authors discussed the potential challenges associated with personal health records and the vulnerability of centralized healthcare systems. To secure the data and address these challenges, they proposed a blockchain-based architecture that allows patients to manage their health information securely. They further claim that their architecture can handle these issues and provides better privacy and security for patients’ records. Another study [112] proposed blockchain-based privacy preservation for healthcare data in the cloud and discussed the potential benefits of cloud-based electronic healthcare systems. We also list several software solutions targeting cloud-based healthcare in Table 14. Furthermore, blockchain technology can give structure and security to healthcare data, as discussed in another study [109], which also examines the difficulties of implementing such information in a web-based context. Finally, study [99] presented a comprehensive overview of the benefits of integrating IoT and blockchain in healthcare applications; the authors claim that their survey will serve as a baseline for future researchers targeting IoT and blockchain in healthcare.
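The tamper-evidence property that blockchain-based record keeping relies on can be illustrated with a minimal hash-chain sketch. It is not the architecture of [75], [102], or [112]: there is no network, consensus protocol, encryption, or access control, and the record fields and helper names (`append_block`, `is_valid`) are hypothetical. It only shows that each block stores the hash of its predecessor, so that altering an earlier record invalidates the chain.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder predecessor hash for the first block


def block_hash(index, record, previous_hash):
    """Hash the block contents deterministically with SHA-256."""
    payload = json.dumps(
        {"index": index, "record": record, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, record):
    """Append a record, linking it to the hash of the previous block."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    block = {
        "index": index,
        "record": record,
        "previous_hash": previous_hash,
        "hash": block_hash(index, record, previous_hash),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Recompute every hash and check every link to the predecessor."""
    expected_previous = GENESIS_HASH
    for index, block in enumerate(chain):
        if block["index"] != index or block["previous_hash"] != expected_previous:
            return False
        if block["hash"] != block_hash(index, block["record"], block["previous_hash"]):
            return False
        expected_previous = block["hash"]
    return True


# Hypothetical audit trail for one patient record.
ledger = []
append_block(ledger, {"patient": "P-001", "event": "lab result added"})
append_block(ledger, {"patient": "P-001", "event": "access granted to clinic A"})
```

Calling `is_valid(ledger)` on the untouched ledger succeeds, whereas editing any stored record afterwards makes the check fail, because the recomputed hash no longer matches the stored one.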
I. Electronic Health Record (EHR)
Electronic health records (EHRs) are currently gaining attention in both the public and private sectors, across hospitals, clinics, and medical service centers. An EHR system is an additional component that doctors, physicians, assistants, and other medical staff must deal with. Before EHRs, doctors wrote clinical notes and maintained records manually [7], [67], and the literature reports that medical consultants have experienced difficulties when using EHR systems. Researchers have conducted many studies to highlight how medical consultants’ interactions with EHR systems, for example using a keyboard or mouse and gazing at the computer, may affect patient communication [67]. Studies have also examined the implications of EHR use for physician-patient communication and the possible implications for the quality of healthcare [28], [30]. One study [28] claimed that patient-centered interaction suffers because the physician’s focus is diverted to the EHR rather than the patient. About half a decade ago, study [36] presented LAB-IN-A-BOX, a novel semi-automatic framework for tracking activities during physician-patient interaction; it gained attention at that time, and the authors claimed that their approach has the potential to uncover important insights.
Electronic health records contain brief information about patients that can be followed over time, including the patient’s medical and medication history, symptoms, complaints, therapy, procedures and tests, final diagnosis, discharge medications, and treatment or referral notes. They provide experts with a large amount of data for review, or a starting point when a new consultant takes over a case. Collecting and maintaining data for future use is a new challenge for hospital data centers, as a new record is inserted every second. In the healthcare domain, EHR data plays a vital role in the development of artificial intelligence, machine learning, and big data analytics systems. Much work has been proposed on prediction, analysis, and natural language inference, yet many open challenges remain. Some open data sources offer wide scope for future work, such as the Medical Information Mart for Intensive Care (MIMIC) dataset; the MIMIC database is the largest EHR database that is freely available to the public and may be used to test various machine learning methods. EHRs produce data in three forms, namely structured, unstructured, and semi-structured; lab results, doctors’ medications, and clinical notes are examples [42], [77]. Despite these developments, access to medical data to improve patient care remains a significant barrier [92].
J. Patient Centric Healthcare
After a thorough review, we highlight patient-centric care, one of the progressing areas [116], [145] that is being considered as a novel solution in combination with other fields of healthcare, such as personalized medicine or personalized clinical prediction. Patient-centric solutions have the potential to significantly improve healthcare outcomes by focusing on the needs and preferences of individual patients. These solutions can help forecast disease outbreaks, prevent diseases, improve patient outcomes, and reduce healthcare costs. By developing clinical prediction models tailored to patients’ specific needs, healthcare providers can improve the accuracy of diagnoses and treatments, ultimately leading to better patient outcomes. Patient-centric solutions can also enhance patient engagement and satisfaction by involving patients in their own care and providing them with personalized treatment plans [195]. We summarize the existing patient-centric solutions and frameworks in Table 10.
Healthcare Datasets, Modeling Tools and Techniques - Answer to RQ3
The most reliable and authentic sources of healthcare data include electronic health records (EHRs), claims and billing data, health registries and surveillance systems, health surveys, and wearable devices and remote monitoring. These sources provide valuable insights into patient health information, healthcare utilization, and population-level data. In terms of modeling tools, techniques, and commercial solutions in big data analytics for healthcare, machine learning and predictive modeling, NLP, data visualization and dashboards, and commercial analytics platforms are commonly used. These approaches enable the analysis of healthcare data, prediction of patient outcomes, extraction of information from unstructured data, and presentation of data in a user-friendly format. To address data governance and ethical considerations, strategies such as ensuring data privacy and security, obtaining informed consent, mitigating biases, and implementing ethical review and oversight processes are essential. In this section, we will discuss modelling tools and techniques, datasets, and solutions used in healthcare within the context of big data.
A. Modeling Tools and Techniques
Every year, hundreds of prediction models [81], [101] are published in scientific publications, many of which use datasets that are too small in terms of the total number of participants or events. Riley et al. [98] addressed this issue by proposing a new methodology for sample size calculations for new studies, demonstrating how to calculate the sample size needed to build a clinical prediction model. Several past studies have used various modelling techniques in the healthcare domain.
Jayanthi et al. [52] conducted a comprehensive survey of predictive modelling tools for diabetes prediction. Building on it and on several other studies, we compiled a collection of general healthcare predictive tools in Table 11. For example, the authors of [110] used an MTL RNN model for predictive modelling and found that multitask models using MTL and RNNs outperformed single-task models in terms of individual-level predictions. Another study [77] used a neural network (NN) for predictive modelling of the top 10% of diagnoses from a raw clinical text dataset. Study [148] employed a genetic algorithm, LSWT-GA, to schedule medical treatments; it adopts a survival analysis strategy that uses heuristic knowledge to predict an effective schedule. Study [89] presents a detailed discussion of the mental health impacts of COVID-19 using a Reddit dataset, modelled with NLP and computing technologies. Table 11 is structured by predictive method type and the name of the particular method.
In Table 12, we provide a comprehensive overview of frequently utilized big data technologies. The primary objective of the table is to give a concise description of each item along with its benefits, key factors, and official source. A detailed comparison of these tools in terms of core services, architectural level, and outcomes has not been addressed in our review. Additionally, big data tools and techniques can be summarized in several categories: i) distributed computing frameworks, such as Apache Spark and Hadoop; ii) streaming platforms, such as Apache Kafka; iii) machine learning libraries, such as Apache Mahout and TensorFlow, which provide algorithms and tools for training models and assessing predictive performance for data-driven decisions; iv) data visualization tools, such as Tableau and Power BI, which help to present complex data in an easily understandable format; and v) NoSQL databases, such as MongoDB and Cassandra, which are powerful and commonly used tools for storing and managing unstructured and semi-structured data. By utilizing these widely used tools and technologies, companies and researchers or academic practitioners can obtain useful insights that serve user requirements and scientific reasoning, respectively.
In [110], a Multitask Learning (MTL) model is proposed to estimate bladder pressure from time series data. When modeling population-level time series data, MTL may achieve higher accuracy than ordinary neural networks. It exploits the variety of data present in the population by treating the prediction for each individual participant as a separate task. The authors use this approach to forecast bladder pressure, and then bladder contractions, from an external urethral sphincter electromyography (EUS EMG) signal, which measures the activity of the muscle that controls the urethral sphincter. They concluded that multitask models are superior to single-task models for individual-level predictions. The MTL RNN model performed significantly better than the other models at predicting intra- and inter-individual differences in bladder contraction.
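To give a sense of how such calculations look in practice, the sketch below implements one standard ingredient: the number of participants needed to estimate an overall binary outcome proportion within a chosen margin of error, n = (z / δ)² · φ(1 − φ). This is a textbook margin-of-error formula and is not presented here as the full procedure of [98], which considers further criteria; the function name and default values are illustrative assumptions.

```python
import math


def sample_size_for_proportion(outcome_proportion, margin_of_error, z=1.96):
    """Participants needed to estimate a proportion within +/- margin_of_error.

    Uses n = (z / delta)^2 * phi * (1 - phi), rounded up, where z = 1.96
    corresponds to a 95% confidence interval.
    """
    if not 0.0 < outcome_proportion < 1.0:
        raise ValueError("outcome_proportion must lie strictly between 0 and 1")
    if margin_of_error <= 0.0:
        raise ValueError("margin_of_error must be positive")
    n = (z / margin_of_error) ** 2 * outcome_proportion * (1.0 - outcome_proportion)
    return math.ceil(n)


# Worst case (proportion 0.5) with a 5 percentage point margin of error.
n_worst_case = sample_size_for_proportion(0.5, 0.05)   # 385
# A rarer outcome (proportion 0.1) needs fewer participants for the same margin.
n_rare = sample_size_for_proportion(0.1, 0.05)          # 139
```

A calculation of this kind addresses only the precision of the overall outcome estimate; it says nothing about overfitting or the number of candidate predictors, which need separate consideration.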
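The idea of treating each participant as a separate task on top of a shared model can be sketched as follows. This is a minimal illustration of hard parameter sharing and not the architecture of [110]: the one-unit recurrent layer, the weights, the subject names, and the toy signal are all hypothetical, and no training is shown. Every task reuses the same recurrent encoder, while each individual has their own small output head.

```python
import math


def shared_rnn(signal, w_in, w_rec):
    """Run a one-unit Elman-style recurrent layer shared by every task."""
    hidden, states = 0.0, []
    for x in signal:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states


def predict(signal, shared, head):
    """Shared recurrent encoder followed by a task-specific linear head."""
    states = shared_rnn(signal, shared["w_in"], shared["w_rec"])
    return [head["scale"] * h + head["offset"] for h in states]


# Parameters shared across all individuals (hypothetical, untrained values).
shared = {"w_in": 0.9, "w_rec": 0.5}

# One output head per individual, i.e. one task per participant.
heads = {
    "subject_a": {"scale": 20.0, "offset": 5.0},
    "subject_b": {"scale": 12.0, "offset": 8.0},
}

# Toy normalized EMG-like input signal.
emg = [0.0, 0.2, 0.6, 0.9, 0.4]

pressure = {name: predict(emg, shared, head) for name, head in heads.items()}
```

In a trained model the shared parameters would be fitted on data from all participants, which is where the population-level information enters, while the heads capture inter-individual differences.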
The healthcare industry has become a promising area for data mining and machine learning due to the availability of large volumes of data. However, the lack of publicly available benchmark datasets poses a significant challenge in quantifying progress in machine learning for healthcare research. This issue has led to various initiatives that facilitate the sharing of and access to healthcare data, such as the Medical Information Mart for Intensive Care (MIMIC) and PubMed Central (PMC), the open-access database of the National Institutes of Health (NIH) National Library of Medicine. To address the benchmarking gap, Harutyunyan et al. [155] present four clinical prediction benchmarks based on the MIMIC-III database: predicting mortality, predicting length of stay, recognizing physiologic decline, and phenotype classification.
Applications in clinical healthcare, natural language processing, speech recognition, and computer vision can all benefit from deep learning models (also known as deep neural networks). However, few studies have compared the performance of deep learning models with current machine learning models and prognostic scoring systems using publicly available healthcare datasets, because few studies have used deep learning models. The authors of [156] investigate how well deep learning models, ensembles of machine learning models (the Super Learner approach), and SAPS II and SOFA scores perform at predicting mortality, length of stay, and ICD-9 code group. The publicly available MIMIC-III (v1.4) dataset was used for the benchmarking tasks. This dataset includes all of the patients who were hospitalized in an intensive care unit at Beth Israel Deaconess Medical Center between 2001 and 2012. Overall, deep learning models perform better than any other strategy, particularly when utilizing ‘raw’ clinical time series data as input attributes.
B. Big Data Datasets
Obtaining authentic and recent datasets is a critical task in the healthcare domain. In the literature, we identified several challenges regarding datasets, some of which are listed below. These are only a few of the open challenges in releasing a safe and reliable dataset.
Dataset authentic source
Dataset access and availability
Dataset quality and completeness
Dataset bias
Dataset guides
Dataset formats availability
Dataset privacy and security
Dataset volume and scalability
Dataset heterogeneity
To address these challenges, we compiled a list of authentic sources (to the best of our knowledge) that list healthcare domain data, which can be found in Table 13. While there are numerous data sources available, we have focused on the most frequent and vital datasets that have been reported in the literature. These listed authentic sources also adhere to rules and regulations guided by HIPAA [106].
After conducting an extensive review of the literature [27], [35], [63], [101], [105], we have found that there has been an increasing focus on exploring the healthcare domain in the areas of system infrastructure and operation, quality of healthcare data, digital health education, and medical image analysis. These areas have emerged as key categories where most of the healthcare datasets can be classified, and they represent important research areas for the development and implementation of AI, ML, and DL in healthcare.
C. Current Big Data Analytics Healthcare Solutions
In the age of digitization, the volume of data and research publications is growing at an unprecedented rate. Consequently, new big data analytics solutions are being proposed almost every day [74], [82]. In this subsection, we present a table of current solutions from the literature to shed light on recent developments. Table 14 provides a list of current healthcare solutions deployed on a large and medium scale [53], with a focus on AI, ML, DL, and NLP-based solutions relevant to the application areas we discussed. Although many other solutions exist, we have customized the list to fit our specific domain.
D. The Big Data Advantage in Healthcare - Use Cases
With the digitization of Electronic Medical Records (EMR), Electronic Health Records (EHR), medical imaging, laboratory results, insurance data, and prescriptions, healthcare has generated a massive amount of data known as Big Data. Analysis of this Big Data can potentially improve the quality of medical and healthcare services by providing meaningful insights that help in informed decision-making, disease surveillance, and other healthcare and medical services. This can benefit patients, physicians, healthcare organizations, pharmaceutical companies, policymakers, and other stakeholders. Big Data applications can include individual and population health surveillance, predicting health issues, calculating medical complications and risks associated with a patient, analyzing suitable treatments, and evaluating the effectiveness of current treatment strategies [149]. Big Data can inform patients about their current and future health states, empowering them to make better-informed decisions. Integrating Big Data and healthcare makes it possible to scale the quality and accountability of health services, which offers numerous benefits, including improving the accuracy, timeliness, and effectiveness of healthcare services [150]. One of the most important benefits is improved patient outcomes. With the ability to collect and analyze vast amounts of patient data, healthcare providers can identify patterns and trends in patient care and adjust treatment plans accordingly. This can result in more accurate diagnoses, better treatment outcomes, and improved patient satisfaction. Some of the most common use cases for big data in healthcare include:
Reducing healthcare cost
Reducing hospital re-admissions
Optimized workforce and workflows
Real-time alerting
Analysing Electronic Health Records (EHRs)
Control data for public health research
Efficient medical practices
Efficient strategic planning
Improving safety practices
Better patient engagement
Preventing unnecessary hospital and ER visits
One health system tried various technologies from various vendors to reduce its debt and accurately predict patients’ propensity to pay [151]. Healthcare.AI is one of the prominent solutions that provide a range of services for healthcare systems. These services include prediction models based on previous payment behavior, payment balance, credit scores, and previous interactions, among others. Unlike other companies that rely on a single feature, Healthcare.AI offers a multi-faceted approach, enabling better analysis of healthcare data. The system has been found to be effective in increasing revenue, with one anonymous company achieving a $2M revenue increase with the Healthcare.AI multi-feature predictive model. Further reported use cases include:
Better strategic planning with Healthcare.AI
Improved resource optimization
Early detection of acute myocardial infarction (AMI) mortality
Use of digital tools boosting the efficiency of African healthcare systems
AI helping to identify preventable health emergency [154]
The utilization of big data has the potential to transform the healthcare industry and create new opportunities for healthcare providers to improve patient outcomes, optimize operations, and reduce costs. Based on the research conducted by Groves et al. [40], it can be concluded that big data has opened up new pathways in healthcare, namely: i) Right Living, ii) Right Care, iii) Right Provider, iv) Right Value, and v) Right Innovation. These pathways represent a plethora of use cases that can be achieved through the application of big data. The emergence of these pathways demonstrates that big data is not just a buzzword, but rather a critical component in the advancement of healthcare.
Open Research Challenges - Answer to RQ4
This section presents an overview of existing research gaps from a comprehensive standpoint, intending to provide future researchers with a valuable starting point for new investigations. Additionally, our objective is to elucidate the open research challenges that have emerged in recent years, with a focus on addressing the necessary solutions for the progression of healthcare.
A. Multimodality in Healthcare
Since the emergence of various types of data, the concept of multimodality has garnered significant attention [160]. Multimodal data offers a complementary and comprehensive source of information that cannot be adequately captured by a single modality alone. The utilization of multiple modalities has shown promising results in tasks such as natural language understanding, computer vision, audio processing, sentiment analysis, machine translation, and more [161]. The fusion of diverse modalities holds the potential for improved performance, robustness, and contextual understanding in numerous applications, including healthcare, multimedia analysis, autonomous driving, virtual reality, and human-computer interaction.
While the concept of multimodality has been explored from various perspectives, the application of multimodal approaches specifically within the healthcare domain [96], [138] is an important area for future research. In particular, multimodality combined with multitasking, along with the challenges of imbalanced multimodal data, requires attention. Imbalanced data [162] deserves particular emphasis in multimodal healthcare because handling it still lacks a standardized implementation. Additionally, ethical considerations underlie every healthcare application or solution and warrant further exploration as part of the open research challenges in this field.
B. Data Mining in Healthcare
Data mining is a promising area in healthcare that remains fraught with difficulties [107], [163]. Integrating and analyzing data from several disparate sources is a substantial obstacle in healthcare data mining, as we discussed in the multimodality challenge. The creation of scalable and effective algorithms for processing large-scale healthcare datasets is another barrier that must be overcome. The volume, velocity, and variety of data in the healthcare industry continue to expand rapidly, which creates issues for processing and scalability. To process and evaluate healthcare data in a timely way, data mining algorithms need to manage large amounts of data efficiently and make use of distributed and parallel computing frameworks.
The authors of [164], [165] identified issues in data mining that include assessing the similarity of diagnostic and treatment records. The authors of [164] emphasize the need for similarity metrics to examine diagnostic and therapeutic data, which entails assessing data type, granularity, and patient record features. Next, they extract typical diagnostic and treatment patterns from EMRs and explain how to derive these patterns from clustering results. Clustering analysis helps find common patterns and trends in patients’ diagnostic records, and common treatment patterns are extracted from the clustering data to identify recurring treatment techniques. The next step is forecasting typical diagnostic patterns: data mining and prediction models are used to forecast diagnostic trends based on patient data, so that healthcare providers can forecast patient diagnoses using past data and patterns. Additionally, they evaluate and recommend typical therapy regimens. Using the typical patterns, the authors evaluate the effectiveness and appropriateness of different treatment strategies. They also investigate the possibility of recommending treatment strategies to doctors based on patient features and historical data.
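The similarity, clustering, and pattern-extraction steps described above can be illustrated with a minimal sketch. It is not the method of [164] or [165]: it assumes each record is reduced to a set of diagnosis codes, uses Jaccard similarity as a stand-in similarity metric, and applies a simple greedy seed-based clustering. The function names, the example patient identifiers and codes, and the threshold values are all hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity between two sets of codes (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_records(records, threshold=0.5):
    """Greedy clustering (an assumed, simplified scheme): a record joins the
    first cluster whose seed record is at least `threshold` similar,
    otherwise it starts a new cluster."""
    clusters = []
    for patient_id, codes in records.items():
        for cluster in clusters:
            seed_codes = records[cluster[0]]
            if jaccard(codes, seed_codes) >= threshold:
                cluster.append(patient_id)
                break
        else:
            clusters.append([patient_id])
    return clusters

def typical_pattern(records, cluster, min_support=0.6):
    """Codes that appear in at least `min_support` of the cluster's records."""
    counts = Counter(code for pid in cluster for code in records[pid])
    return sorted(c for c, n in counts.items() if n / len(cluster) >= min_support)

# Hypothetical toy records: patient id -> set of diagnosis codes.
records = {
    "p1": {"I21", "E11", "I10"},
    "p2": {"I21", "I10"},
    "p3": {"J45", "J30"},
    "p4": {"J45", "J30", "L20"},
    "p5": {"I21", "E11", "I10", "N18"},
}

clusters = cluster_records(records)
patterns = [typical_pattern(records, c) for c in clusters]
print(clusters)  # [['p1', 'p2', 'p5'], ['p3', 'p4']]
print(patterns)  # [['E11', 'I10', 'I21'], ['J30', 'J45']]
```

A production pipeline would likely replace the greedy scheme with an established clustering algorithm and a similarity measure that accounts for data type and granularity, as the cited studies emphasize.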
Another important difficulty in the field of healthcare data mining involves the interpretability and explainability of the models [166], [167]. It is necessary to have models that are both transparent and interpretable in order to make important decisions in healthcare. These models must also be able to give explanations that are easy to grasp for any forecasts or recommendations they make. Increasing trust in the healthcare system, lowering barriers to clinical adoption, and enhancing decision-making are all possible outcomes of developing interpretable data mining models.
C. Precision Medicine
Precision medicine, also known as personalized medicine and discussed in subsection V-J (patient-centric healthcare application), has been recorded as an open challenge for more than a century [168]. Its fundamental principle, customizing healthcare interventions based on an individual’s genetic, behavioral, and environmental characteristics, is not new, but it remains a problem and an ongoing subject of study.
The 2015 US Precision Medicine Initiative [169] popularized “precision medicine”. This effort promoted precision medicine via research, technology, and data exchange. Since then, genetics and other aspects of healthcare have been better understood.
Precision medicine can further be understood as an application of computational predictive modelling, as defined in Table 11, in which researchers aim to develop predictive software for medicine. For example, the authors of [170] applied deep learning to prediction for COVID-19 medicine, using the BRICS countries as a case study. However, practical implementation of precision medicine remains difficult. The causes, which we summarize from the studies [171], [172], [173], [174], include the following:
Precision medicine requires the integration and analysis of genetic, clinical, lifestyle, and environmental data. Integrating and harmonizing numerous data sources and building effective analytical procedures to gain insights are difficult.
Understanding the genetics of illnesses and treatment response has advanced, but applying these discoveries to clinical practice is difficult. Validation, standardization, and standards for genetic and molecular interpretation and clinical use are needed.
Precision medicine uses personal and genetic data. Ensuring patient privacy and informed consent, and resolving the ethical and legal problems of data sharing and usage, are crucial.
Implementing precision medicine across varied populations and healthcare settings raises concerns about access, cost, and healthcare inequities.
Healthcare practitioners need the right skills to incorporate precision medicine into clinical practice. Training programs and educational activities must educate healthcare practitioners about sophisticated technology and individualized methods.
Precision medicine has immense potential to improve healthcare, but it takes research, cooperation, and innovation to make it widely available. The discipline is evolving to overcome these limitations and fully utilize precision medicine to improve patient outcomes.
D. Ethical Considerations and Bias Mitigation
The analysis of multimodal healthcare data raises ethical concerns, particularly in relation to bias and discrimination. Biases can arise due to unbalanced data, under-representation of certain data categories, and biases introduced during data collection and labeling processes. Consequently, it is crucial for researchers to proactively address these ethical concerns and develop tools that can detect and mitigate biases in multimodal healthcare analytics, thereby promoting fair and equitable outcomes.
To ensure the integrity and fairness of multimodal healthcare data analysis, researchers must focus on several key areas. Firstly, it is essential to employ robust methodologies to identify biases present in the data, critically evaluating the data collection process and implementing appropriate measures to address and mitigate biases. Additionally, researchers should strive for transparency and explainability in their analyses, documenting data sources, preprocessing techniques, and modelling decisions, and providing clear explanations for the outputs of algorithms. By prioritizing ethical considerations and employing bias detection and mitigation techniques, researchers can contribute to the development of unbiased multimodal healthcare analytics, fostering trust and promoting equitable outcomes for all individuals involved.
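As one hedged illustration of what a bias-detection check might look like, the sketch below computes two simple indicators for a set of model predictions: each group's share of the dataset (to expose under-representation) and the gap in positive-prediction rates between groups (often called the demographic parity difference). The group labels, predictions, and function names are hypothetical, and this is only one of many possible fairness measures, not a technique prescribed by the surveyed literature.

```python
from collections import defaultdict

def group_rates(groups, predictions):
    """Return {group: (share of dataset, positive-prediction rate)}."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for group, prediction in zip(groups, predictions):
        counts[group] += 1
        positives[group] += int(prediction)
    total = len(groups)
    return {g: (counts[g] / total, positives[g] / counts[g]) for g in counts}

def demographic_parity_gap(groups, predictions):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = [rate for _, rate in group_rates(groups, predictions).values()]
    return max(rates) - min(rates)

# Hypothetical example: group B is under-represented and never flagged.
groups = ["A", "A", "A", "A", "A", "A", "B", "B"]
predictions = [1, 1, 1, 0, 0, 0, 0, 0]

rates = group_rates(groups, predictions)
gap = demographic_parity_gap(groups, predictions)
print(rates)  # {'A': (0.75, 0.5), 'B': (0.25, 0.0)}
print(gap)    # 0.5
```

A large gap or a very small group share does not prove unfairness on its own, but it flags where the data collection and labeling process deserves closer scrutiny.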
E. Limitation of Pre-Trained Models for Multimodal Healthcare
The adoption of NLP applications across various technological domains has witnessed significant growth, and the emergence of pre-trained models for healthcare [175] represents a recent and highly trending topic within the healthcare sector. The concept of pre-trained or internet-trained models is derived from transfer learning [176]. While several pre-trained models have been developed for either text or image-based data, there is a noticeable absence of pre-trained models that cater specifically to the current needs of multimodal healthcare data. To emphasize the significance of these models, we have compiled a comprehensive list of existing pre-trained healthcare models. However, it is evident that there exists a considerable gap for future researchers to develop pre-trained models that can effectively handle multiple types of healthcare data. Although Table 15 showcases an example of the top five models observed in this domain, it is important to note that numerous other models based on standard architectures have been designed and developed.
Addressing the challenge of developing pre-trained models for multimodal healthcare represents one of several open research challenges that warrant attention in the future. By advancing the field of pre-trained models, researchers can greatly contribute to the effective analysis and utilization of multimodal healthcare data, thereby enhancing healthcare outcomes and driving innovation in the domain.
F. Exploration on Big Data Ecosystem
The exploration of the Big Data Ecosystem stands as a pressing challenge that necessitates the attention of future researchers [183], [184]. In our investigation, we have specifically addressed this challenge through our first research question in section IV, where we have devised an elaborate framework for data-driven health organizations. This framework encompasses an examination of performance metrics, characteristics, search tools, data quality, and privacy challenges, as well as the demands and opportunities within the healthcare domain. However, we acknowledge that this research challenge merits further exploration in the future, primarily due to the increasing complexity of the ecosystem over time and the ongoing advancements in the field. It is imperative to delve deeper into this subject to enhance our understanding and uncover novel insights for the advancement of healthcare.
G. Other Open Research Challenges
While conducting our review, it became evident that there are numerous additional research challenges within the field of healthcare analytics. We have compiled a list of these unexplored challenges, which can serve as valuable avenues for future researchers to explore and expand upon, advancing knowledge and innovations in the field. The following list highlights some of the most pressing and in-demand issues that require attention:
Trustworthy Healthcare: Trustworthy healthcare comprises various dimensions of healthcare delivery that improve confidence and ethical behavior among patients, healthcare professionals, and other relevant parties. It also faces several challenges in today’s complex healthcare landscape, such as maintaining data privacy and security, addressing biases and inequalities, supporting shared decision-making, promoting patient engagement, and ensuring transparency and accountability. Each challenge can also be considered with and without the incorporation of trustworthy healthcare.
Scalability and Computational Efficiency: Developing scalable and computationally efficient algorithms and architectures to handle the increasing volume and complexity of healthcare data.
Data Privacy-Preserving Techniques: Designing techniques and frameworks that protect the privacy and confidentiality of sensitive healthcare data during analysis and sharing.
Explainability and Interpretability: Ensuring transparency and interpretability of analytics models to provide clinicians and stakeholders with understandable insights and justifications.
Predictive Analytics and Early Detection: Leveraging advanced analytics to predict and detect healthcare events, diseases, or conditions at an early stage for timely intervention and improved outcomes.
Bias and Fairness in Analytics: Addressing biases and ensuring fairness in healthcare analytics to avoid discriminatory outcomes and ensure equitable healthcare delivery.
Validation and Generalization of Models: Validating and generalizing analytics models across diverse healthcare settings to ensure their effectiveness and reliability in real-world applications.
Real-world Data Challenges: Overcoming challenges related to the quality, heterogeneity, and integration of real-world healthcare data from multiple sources.
Adoption of Healthcare Solutions by Healthcare Professionals: Exploring factors influencing the adoption and integration of healthcare analytics solutions by healthcare professionals, promoting their effective utilization in clinical practice.
Data-driven Clinical Guidelines and Protocols: Developing data-driven approaches to inform the creation and update of clinical guidelines and protocols, ensuring evidence-based and personalized healthcare decision-making.
These open research challenges offer promising opportunities for future investigations, where researchers can contribute to the advancement of healthcare analytics and pave the way for improved healthcare delivery and patient outcomes.
Discussion and Implications of RQs
In this section, we present a summary of the study results, highlighting the implications of each research question addressed in the survey.
Our research questions (RQs) contribute to several aspects of research on Big Data Healthcare. First, each research question gives the systematic literature review a clear focus, providing a guide for examining specific topics in detail. Our RQs also align with the research framework, methodology, and analysis methods, ensuring a cohesive and rigorous study design. Moreover, by addressing knowledge gaps and adding to existing theories or frameworks, our RQs establish the relevance and importance of our research. The results obtained from answering these RQs enable researchers to evaluate and understand the study data, facilitating the development of relevant conclusions. Additionally, the implications of our RQs extend to generating new knowledge and insights, contributing to the expansion of understanding in the field. Overall, our RQs shape the entire research process, from the emphasis and organization of the study to the discoveries and contributions made.
The goal of RQ-1 is to address the gap observed in previous studies by synthesizing the extensive data ecosystem, focusing explicitly on every component, including but not limited to the healthcare life-cycle, the characteristics of big data, the search tools, the role of and need for big data in healthcare, and the opportunities and challenges of BDA in healthcare. Each component is described above in its own section or subsection. We developed and presented a typical life-cycle deployed in healthcare in Figure 9. Further, we developed and delivered a potential classification of BDA challenges in healthcare in Figure 11. Lastly, the literature helped us design an extensive data-driven health organization framework, depicted in Figure 10.
RQ-2 and RQ-3 aim to address the existing gap by highlighting the potential of promising application areas of BDA in healthcare and the authentic source of data, tools, and techniques, respectively.
To further elaborate on the discussion of RQ-2, after covering the history of health standard documentation, we also delved into the most promising application areas for BDA in healthcare, such as multimodal data analysis and fusion. Multimodal data analysis combines data from multiple sources, such as medical imaging, electronic health records, and genomics data, to better understand a patient’s health status. The fusion concept refers to integrating data from different modalities, such as combining imaging and genomics data to improve diagnostic accuracy and treatment decisions.
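The fusion concept can be made concrete with a small sketch contrasting two common strategies: feature-level (early) fusion, which concatenates per-modality feature vectors before modelling, and decision-level (late) fusion, which combines per-modality model outputs. This is an illustrative simplification under stated assumptions, not a method from the surveyed studies; the modality names, feature values, scores, and weights are hypothetical.

```python
def early_fusion(modalities, order):
    """Feature-level fusion: concatenate each modality's feature vector
    in a fixed, explicit order so every patient yields the same layout."""
    fused = []
    for name in order:
        fused.extend(modalities[name])
    return fused

def late_fusion(scores, weights):
    """Decision-level fusion: weighted average of per-modality risk scores."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical per-patient features from three modalities.
patient = {
    "ehr": [120.0, 80.0],      # e.g. two clinical measurements
    "genomics": [0.0, 1.0],    # e.g. presence/absence of two variants
    "imaging": [0.2, 0.8],     # e.g. two image-derived features
}
fused = early_fusion(patient, order=["ehr", "genomics", "imaging"])
print(fused)  # [120.0, 80.0, 0.0, 1.0, 0.2, 0.8]

# Hypothetical scores produced by separate per-modality models.
risk = late_fusion(
    scores={"imaging": 0.8, "ehr": 0.6},
    weights={"imaging": 1.0, "ehr": 1.0},
)
print(round(risk, 3))
```

In practice, features would usually be normalized across the cohort before concatenation, and the late-fusion weights would be tuned on validation data rather than fixed by hand.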
We also discussed the benefits of natural language processing (NLP) in healthcare, such as extracting valuable information from unstructured clinical notes and text-based sources, enabling more accurate diagnosis and treatment decisions. Furthermore, we delved into the application of electronic health records (EHRs), which have become a critical data source for BDA in healthcare. We highlighted the potential benefits of using EHR data, such as improved patient outcomes, reduced costs, and increased efficiency of healthcare delivery.
Moving on to RQ-3, we discussed the different data sources that can be used for BDA in healthcare, such as clinical data from EHRs, medical imaging data, and sensor data from wearable devices. We also highlighted the different tools and techniques that can be used for BDA in healthcare, such as machine learning, data mining, and predictive analytics.
In addition to addressing these research questions, our survey comprehensively examined the open research challenges associated with BDA in healthcare. These challenges encompassed aspects such as data quality, privacy concerns, the need for interoperability and standardization, as well as the scarcity of skilled professionals. Furthermore, we discussed the potential opportunities and benefits of applying BDA in healthcare, including improved patient outcomes, personalized medicine, and the potential for cost savings.
The response to Research Question 4 (RQ4) brings attention to many ongoing research challenges within the realm of healthcare analytics. These problems cover wider areas such as multimodality, ethical considerations, bias mitigation, limitation of pre-trained healthcare models, the need for exploration of the Big Data Ecosystem, and other pertinent aspects. These listed challenges present significant opportunities for future scholars to investigate and make meaningful contributions to the progress of healthcare.
By summarizing the implications of each research question, this study provides valuable insights into the role of BDA in the healthcare domain. These findings serve as a bridge for the above-listed contributions and a foundation for future research endeavors, fostering the advancement of knowledge and innovations in this field.
A. RQs’ Findings and Takeaways
In this particular subsection, we provide statistics on the key findings and takeaways from this systematic literature review.
Table 16 presents a statistical analysis of the references used in our study. It indicates a substantial volume of scholarly work on several facets of big data in the healthcare domain. We present this table for the convenience of future researchers, who can easily navigate the cited articles by sub-domain. The allocation of references among the research questions underscores the extensive range of this discipline and its varied domains of exploration. RQ1, which centers on the life cycle, features, and search tools of Big Data, exhibits a considerable volume of references. This highlights the significance of comprehending the fundamental elements of Big Data and the instruments required for proficiently handling and scrutinizing healthcare data. RQ2, which investigates the many uses of big data in the healthcare sector, also has a substantial volume of references. This reflects the increasing inclination to utilize big data to enhance healthcare procedures and achieve better outcomes. The references encompass a diverse array of applications, including health standards documentation, multimodal data analytics, natural language processing in healthcare, integration of blockchain technology, electronic health records, and patient-centric healthcare. RQ3, on the other hand, examines the technologies, datasets, and analytics employed in big data healthcare and exhibits a moderate quantity of references in our study, which may be due to limitations of our search strategy or a bias towards the chosen keywords. The smaller number of citations also suggests a substantial gap that future researchers can fill.
This implies that, although investigation in this field is ongoing, further advancement of modelling tools and methodologies, datasets, and analytics is still required to fully exploit the potential of Big Data in the healthcare sector. Lastly, RQ4 received an average number of references relative to the other research questions, since it explores emerging trends and difficulties in big data healthcare. This suggests that further investigation is required in areas such as multimodality in healthcare, data mining, precision medicine, ethical issues and bias, limitations of pre-trained models, the study of the big data ecosystem, and other open research problems. Finally, we refer to the yearly Google Trends plot in Figure 13, which covers the decade 2013 to 2023 for the “Big Data and Healthcare” search. This plot shows the strong interest in, and evolving popularity of, the topic over the years. We derived the data from Google Trends web analytics and aggregated the monthly trend values into year-wise sums. The lowest sum of monthly values is observed for 2013, while the peak was observed in 2022 with a value of 892. For the nine months of the current year up to the compilation of this manuscript (September 2023), the monthly values sum to 621 normalized searches, indicating a sustained level of interest and suggesting further exploration of the limitations mentioned in the subsequent subsection.
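The year-wise aggregation of monthly trend values described above amounts to a simple group-and-sum. The sketch below shows one way it could be computed; the input format (a mapping from "YYYY-MM" strings to normalized interest values) and the function name are assumptions, and the numbers are placeholder values, not the data behind Figure 13.

```python
from collections import defaultdict

def yearly_sums(monthly):
    """Sum monthly interest values per year.
    `monthly` maps 'YYYY-MM' strings to normalized interest values."""
    totals = defaultdict(int)
    for month, value in monthly.items():
        year = int(month.split("-")[0])
        totals[year] += value
    return dict(sorted(totals.items()))

# Placeholder values for illustration only.
monthly = {
    "2021-01": 50, "2021-02": 60,
    "2022-01": 70, "2022-02": 80, "2022-03": 90,
}
totals = yearly_sums(monthly)
peak_year = max(totals, key=totals.get)
print(totals)     # {2021: 110, 2022: 240}
print(peak_year)  # 2022
```

Note that a partial year, such as nine months of data, yields a sum that is not directly comparable with full-year totals.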
In summary, the takeaway from this quantitative analysis of references is to serve as an indicator of the dynamic research environment within the field of Big Data in Healthcare. This underpins the significance of comprehending the underlying principles of Big Data, investigating its wide-ranging applications, creating efficient tools and analytics, and tackling the rising trends and difficulties within this domain. Ongoing research and collaborative efforts in these domains will play a significant role in harnessing the whole potential of Big Data for enhancing healthcare practices and optimizing patient outcomes.
B. Limitations of Our Study
We acknowledge that our study does not review pivotal healthcare components such as vital signs, diseases, patient-specific problems, and hospital manageability. We feel that a future review paper is needed to cover Big Data Healthcare perspectives on these areas. Further, we acknowledge that this study lacks credible non-academic sources, such as industry reports, government agency reports on healthcare statistics, and other published reports. We cover only scholarly articles published in the last decade (2013 to 2023). We suggest that future researchers address these considerations to give the study a new orientation.
Conclusion
The use of big data analytics has become increasingly relevant in healthcare over the past decade, as it offers the potential to revolutionize how medical professionals deliver care and manage health systems. Our comprehensive review study, which covered a wide range of published articles from 2013 to 2023, aimed to investigate the applications, implications, and impacts of big data frameworks in healthcare. Through this research, we identified novel research questions and conducted a thorough review to shed light on this important area of study.
Our findings demonstrate that the large-scale and complex nature of healthcare data presents significant challenges to big data analytics in healthcare. The data is often high-dimensional, noisy, and unstructured, which can make it difficult to draw meaningful conclusions. To overcome these challenges, it is necessary to develop reliable and trustworthy big data healthcare frameworks that prioritize patient privacy and data security. Furthermore, our study has highlighted the need to optimize big data frameworks to enhance patient outcomes, reduce costs, and improve overall quality of life. While there are still challenges to overcome, our study has provided valuable insights into the applications and implications of big data in healthcare. We believe that our research can guide healthcare professionals and researchers in developing effective and efficient big data frameworks that leverage the full potential of this technology. Additionally, our study has identified several opportunities for future research in this field, which could lead to further advancements and improvements in healthcare delivery and management.
Declaration of Interest
The authors of this manuscript declare no conflict of interest.