Challenges and Solutions for Processing Real-Time Big Data Stream: A Systematic Literature Review

Contribution: Real-time data warehousing (DWH) and big data streaming have recently become ubiquitous as many business organizations seek competitive advantage. The capability to organize big data efficiently in order to reach a business decision empowers data warehousing in terms of real-time stream processing. This paper presents a systematic literature review of real-time stream processing systems that rigorously examines their recent developments and challenges and can serve as a guide for implementing a real-time stream processing framework for all shapes of data streams.
Background: Published surveys and reviews cover either papers on stream analysis in applications other than real-time DWH or the extraction, transformation, loading (ETL) challenges of traditional DWH. This systematic review attempts to answer four specific research questions.
Research Questions: 1) Which are the relevant publication channels for real-time stream processing research? 2) Which challenges have been faced during implementation of real-time stream processing? 3) Which approaches/tools have been reported to address challenges introduced at the ETL stage while processing real-time streams for real-time DWH? 4) What evidence has been reported while addressing different challenges for processing real-time streams?
Methodology: A systematic literature review was conducted to compile studies related to publication channels targeting real-time stream processing/join challenges and developments. Following a formal protocol, semi-automatic and manual searches were performed for work from 2011 to 2020, excluding research in traditional data warehousing. Of 679,547 papers selected for data extraction, 74 were retained after quality assessment.
Findings: This systematic literature review highlights implementation challenges along with the approaches developed for real-time DWH and big data stream processing systems, and provides their comparisons. The study found that various algorithms exist for implementing real-time join processing at the ETL stage for structured data, whereas less work on unstructured data is found in this subject matter.


I. INTRODUCTION
Real-time analytics is becoming ubiquitous in application scenarios where well-timed business decisions are extremely important. Processing continuous big data streams for real-time analytics is very challenging when implementing the ETL stage for data warehousing (DWH) or other big data applications, owing to the nature of big data with respect to volume, variety, velocity, volatility, variability, veracity, and value [1], [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Zhe Xiao.
A continuous supply of big data is referred to as a stream. A stream of data may be generated from a single or multiple big data sources, as shown in figure 1. A broad category of applications participates in the continuous generation of massive data. Analyzing these streams is a big challenge because the gathered data is heterogeneous and can be of any shape or nature, i.e., structured, semi-structured or unstructured, symmetrical or skewed. A large portion of the massive data resulting from real-time streams needs real-time processing and analysis, as the value of data lies in its freshness. Real-time stream processing (also referred to as in-memory processing of massive data) is generally required in two types of application domains: first, where organising data to reach a decision is required (real-time DWH), and second, where generating a certain reaction on a real-time basis is essential, particularly with low latency. A few applications of the second type are also listed in figure 1. Before being loaded into these applications, streams need to be processed to ensure data quality. This necessitates real-time stream processing and analysis. Several operations are required for stream processing, including data cleaning, query processing, stream-stream join, stream-disk join, and data transformation. A broad category of approaches, tools and technologies has been developed so far to overcome the challenges of stream processing. These approaches may deal with multiple shapes and storage models of data and apply several operations to these streams.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To address real-time stream processing challenges, various approaches have been developed so far: stream-stream join algorithms, stream-disk join algorithms with reduced data latency/skewed data, distributed streaming ETL, Mesa DWH, streaming processing frameworks, distributed join processing, sensor networks, object tracking and monitoring, and multi-join query processing in cloud DWH. In addition, processed output must be provided accurately, with low latency, with limited resources and within seconds to make real-time reactions and decisions possible. The depth of the challenges in real-time stream processing has produced so much research that a systematic analysis of the proposed solutions is required.
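As an illustration of the first approach in this list, a stream-stream join can be sketched as a symmetric hash join over a sliding time window. The sketch below is a minimal example under assumptions that are not taken from any reviewed study: tuples carry a timestamp, a join key and a value, they arrive in non-decreasing timestamp order, and the class name `SymmetricWindowJoin`, the window length and the eviction policy are hypothetical.

```python
from collections import defaultdict, deque


class SymmetricWindowJoin:
    """Illustrative symmetric hash join of two streams over a sliding time window.

    Assumption: tuples arrive in non-decreasing timestamp order.
    """

    def __init__(self, window):
        self.window = window
        # One hash table per stream: key -> deque of (timestamp, value)
        self.tables = {"left": defaultdict(deque), "right": defaultdict(deque)}

    def _evict(self, table, key, now):
        # Drop tuples for this key that have fallen out of the window.
        # Simplification: only the probed key is cleaned, not the whole table.
        bucket = table[key]
        while bucket and now - bucket[0][0] > self.window:
            bucket.popleft()

    def insert(self, side, timestamp, key, value):
        """Store a tuple from `side` and return its matches from the other stream."""
        other = "right" if side == "left" else "left"
        self._evict(self.tables[other], key, timestamp)
        matches = []
        for _, other_value in self.tables[other][key]:
            pair = (value, other_value) if side == "left" else (other_value, value)
            matches.append((key, *pair))
        self.tables[side][key].append((timestamp, value))
        return matches
```

Each arriving tuple first probes the hash table of the opposite stream and is then stored in its own table, so a matching pair is emitted exactly once, whichever side arrives first. Bounding the tables by a time window is one simple way to keep memory limited for unbounded streams.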
The focus of this study is to present an extensive systematic literature review (SLR) that gathers different approaches to real-time stream processing for all possible application domains, specifically real-time DWH. We have finalized 74 studies out of 667,414 total papers for this review based on quality assessment criteria. The novelty of our SLR is that it provides new classification criteria, the publication channels targeted by real-time stream processing research, real-time DWH/big data streaming challenges, and approaches to address these challenges, after validating studies empirically. This paper is organized as follows: Existing reviews related to stream processing are presented in section II. The research methodology followed to conduct this survey is discussed in section III, including the objectives, quality assessment criteria and research questions. Assessment and discussion of the research questions are presented in section IV. Concluding discussion and future directions are presented in section V.

II. RELATED WORK
It was found that most of the existing surveys and systematic reviews do not cover the publication channels, approaches, challenges and solutions of the real-time stream processing research needed in business intelligence; they focus mainly on tools used for big data analytics and on DWH design approaches from social media. A recent systematic literature review of big data stream analysis is presented by [6]. The authors reviewed key issues in big data stream analysis and the tools and technologies employed to address them. However, that study did not focus on research and challenges in the real-time stream analysis and real-time DWH domains.
The authors of [3] reviewed the fundamental use of big data analytics in various businesses and industries in terms of supporting and maintaining their resources, using the Scopus digital repository to search for relevant articles. Big data technologies/platforms, their services, applications, programming languages, data aggregation tools, databases and DWHs were also reviewed in this study. The study listed the following platforms for examining both unstructured and structured big data sets for big data analytics: Hadoop, GridGain, MapReduce, HPCC and Apache Storm. It also highlighted the following database/DWH tools: Cassandra, MongoDB, CouchDB, Terrastore, Hibari, Hypertable, Hive, Infinispan, HBase, Neo4j, OrientDB, etc. However, this study does not focus on the challenges and approaches developed in the field of real-time DWH and big data streaming.
Likewise, [4] conducted a study centered on the competitive analysis of social media data and its transformation into knowledge, and on DWH design approaches from social media. Social media data is massive, dynamic and unstructured, which makes it more challenging for companies to use, analyze and store. This study classified DWH design approaches from social media under two heads: incorporation of sentiment analysis in the DWH schema, and behaviour analysis. Nevertheless, the focus of this study is not on developments addressing structured/unstructured stream processing approaches or on optimization of the ETL stage for real-time DWH.
Another recent comparative study of data stream analytics frameworks is presented by [5]. Different data stream processing engines were evaluated in this study based on their partitioning, state management, message delivery and fault tolerance features. The review included Storm, Spark Streaming, Flink, Kafka Streams and IBM Streams as data stream processing engines. The focus of this survey is not on extracting knowledge and identifying important data stream components, which is a basic requirement for real-time stream analytics.
Our review distinguishes itself from the above reviews by focusing on the publication channels in real-time stream processing and big data streaming, closely examining the ETL implementation challenges, and identifying the developments that have been reported in the join operation for real-time DWH. In addition, we employ a more rigorous approach than all of the above reviews by following strict criteria and quality assessment scoring. Table 1 gives an aspect-wise comparison of existing surveys with our survey. There seems to be no existing survey that covers all features related to real-time stream processing, and none considers the real-time DWH literature.

III. RESEARCH METHODOLOGY
Our survey follows the guidelines for systematic reviews in software engineering research provided by [7] and [8]. According to these guidelines, we have included three main phases in our research methodology: planning, conducting and reporting the review. Figure 2 shows the research methodology, which covers the search process for relevant research activities, the definition of a categorization scheme, and the mapping of articles.

FIGURE 2. Research methodology.
A highly structured process has been followed in this review, with the following objectives:
a) To develop a library of articles related to developments in real-time stream processing during the ETL phase or others, and to make this dataset available to other researchers.
b) To identify the more significant work that provides direction for investigating challenges in real-time stream processing, ETL and real-time DWH.
c) To distinguish research gaps for ETL, real-time stream processing and DWH in recent studies.
d) To characterise existing approaches and solutions for the challenges of implementing real-time stream processing for heterogeneous, structured/unstructured data, and to clarify the similarities and differences between them.

2) RESEARCH QUESTIONS
It is important to formulate the primary RQs in order to conduct this SLR. These RQs are developed to identify relevant publication channels, challenges/developments and evidence of approaches for real-time stream processing systems, as listed in table 2.

B. REVIEW CONDUCT
To strengthen the reliability and validity of our search results, two reviewers participated in the inclusion of papers. The review was conducted in four steps. In the first step, relevant primary studies were searched for in the most commonly used digital libraries. In the second step, studies were selected based on pre-defined inclusion/exclusion criteria. In the third step, quality assessment criteria were designed and applied to further enhance the quality of our review. In the fourth and final step, backward snowballing was performed to extract important candidate papers.

1) SEMI-AUTOMATED SEARCH IN DIGITAL LIBRARIES
A systematic search was carried out to filter out irrelevant studies and extract appropriate information. Semi-automatic and manual search techniques were therefore followed while exploring the search terms. The semi-automated search was conducted in seven digital libraries. Some further digital libraries were also explored but not included due to accessibility constraints. The objective of the manual search was to collect more literature relevant to real-time stream processing and DWH. Because extracted information is more relevant when the search terms are limited, the following conditions were applied to limit our search terms:
• Primary keywords were determined based on the formulated RQs.
• Secondary keywords and synonyms were identified as additional keywords.
• 'AND' and 'OR' Boolean operators were incorporated with the keywords to develop a search string.
Possible arrangements of the search string can be seen in figure 3. Primary keywords were selected as key identifiers for research on real-time stream processing. Primary keywords were chosen along with any of the secondary or additional keywords. The combination of keywords, Boolean operators and wildcards produced the final search string:
(real time OR real time data warehous*) AND (stream OR semi-stream OR ETL OR challenges OR (extract*,transform*,load*)) AND (join OR process* OR analy* OR problem*)
Table 3 shows the final search strings used to search the seven digital libraries. The semi-automatic search was limited to titles only for ACM journals, IEEE Xplore and ScienceDirect. Because IEEE Xplore limits a search to five wildcard characters, the search string had to be changed slightly for this library. Irrelevant hits were reduced by this setting. The other digital libraries were explored with the "all fields" setting, as they do not allow a more specific search configuration. Being too restrictive, the search string failed to find relevant articles in the IGI Global digital library; the final search string designed for this library therefore contains fewer keywords, as shown in table 3. The final search string also failed when applied to the Hindawi digital library, so the search was conducted with a few keywords, which resulted in some relevant hits.
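The structure of the final search string (three AND-ed groups of OR-ed terms, with `*` as a trailing wildcard) can be sketched as a local title filter. This is only an illustrative sketch: each digital library has its own query syntax and matching rules, the function names are hypothetical, and two readings are assumptions made here, namely that "real time" should also match "real-time" and that "(extract*,transform*,load*)" requires all three terms together.

```python
import re

# Each inner list is one OR-group; the groups are combined with AND.
# A tuple inside a group means "all of these terms together" (an assumption
# about how "(extract*,transform*,load*)" is meant to be read).
SEARCH_GROUPS = [
    ["real time", "real time data warehous*"],
    ["stream", "semi-stream", "ETL", "challenges",
     ("extract*", "transform*", "load*")],
    ["join", "process*", "analy*", "problem*"],
]


def term_to_regex(term):
    """Compile one search term; a trailing '*' matches any word ending."""
    is_prefix = term.endswith("*")
    words = term.rstrip("*").split()
    # Words of a phrase may be separated by whitespace or a hyphen.
    body = r"[\s-]+".join(re.escape(word) for word in words)
    tail = r"\w*" if is_prefix else r"\b"
    return re.compile(r"\b" + body + tail, re.IGNORECASE)


def group_matches(group, text):
    """True if at least one term (or all terms of a tuple) occurs in text."""
    for term in group:
        if isinstance(term, tuple):
            if all(term_to_regex(t).search(text) for t in term):
                return True
        elif term_to_regex(term).search(text):
            return True
    return False


def title_matches(title):
    """True if the title satisfies every AND-ed group of the search string."""
    return all(group_matches(group, title) for group in SEARCH_GROUPS)
```

Because `stream` carries no wildcard, this sketch does not match "streaming" on that term alone; a real library search engine may apply stemming and behave differently.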

1) Inclusion criteria:
a) Papers included in the review must be in the domain of real-time stream processing.
b) Papers must target the RQs.
c) Papers published in journals, conferences or workshops are included in the review.
d) Papers discussing developments and applications of real-time stream processing are included.
2) Exclusion criteria:
a) Remove papers not written in English.
b) Remove papers that do not discuss real-time stream processing in the DWH or big data domain.
c) Remove papers published before 2011.
d) Remove papers discussing simulation domains or traditional DWH.
e) Remove papers that were written by the same research group with the same data (the most recent was kept in this case).
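The mechanical part of these criteria can be sketched as a simple filter. This is a minimal sketch, not the tooling used in the review: the `Paper` record and its field names are hypothetical, and the criteria that need human judgement (topic, relevance to the RQs, traditional or simulation focus) are modelled as flags set by a reviewer.

```python
from dataclasses import dataclass

ACCEPTED_VENUES = {"journal", "conference", "workshop"}


@dataclass(frozen=True)
class Paper:
    # All field names are hypothetical, chosen for this illustration only.
    title: str
    year: int
    language: str
    venue_type: str
    on_topic: bool                    # reviewer judgement: real-time stream processing
    targets_rqs: bool                 # reviewer judgement: addresses the RQs
    traditional_or_simulation: bool   # reviewer judgement: exclusion criterion d)
    group_dataset: str                # id for "same research group, same data"


def passes_criteria(paper):
    """Apply the inclusion criteria and exclusion criteria a) to d)."""
    return (
        paper.language == "english"
        and paper.year >= 2011
        and paper.venue_type in ACCEPTED_VENUES
        and paper.on_topic
        and paper.targets_rqs
        and not paper.traditional_or_simulation
    )


def select(papers):
    """Filter papers, keeping only the most recent per group/dataset (criterion e)."""
    kept = {}
    for paper in filter(passes_criteria, papers):
        current = kept.get(paper.group_dataset)
        if current is None or paper.year > current.year:
            kept[paper.group_dataset] = paper
    return sorted(kept.values(), key=lambda p: p.title)
```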

3) SELECTION BASED ON QUALITY ASSESSMENT
Selection of relevant studies on the basis of quality assessment (QA) is considered the most important step in conducting any review. As the primary studies vary in design, the quantitative, qualitative, and mixed-method critical appraisal tools used by [9] and [10] are followed to perform QA in our review. To enhance our study, we carried out QA by designing a questionnaire to evaluate the quality of the selected articles. The QA was conducted by two reviewers, and each study was scored on the following criteria:
a) The study is awarded score (1) if it contributes towards real-time stream processing or continuous data loading in DWH, otherwise (0).
b) If clear solutions to the challenges of implementing real-time stream processing or DWH are provided by the study, the possible scores are "Yes (2)", "Limited (1)", and "No (0)".
c) Score (1) is awarded to studies that present empirical results, otherwise (0).
d) The studies were rated by taking computer science conference rankings [11] and the journal and country ranking lists [12] into account. Possible scores for publications from recognized and stable sources are shown in table 4.
A final score (an integer between 0 and 8) was calculated for each study by adding the scores of the above questions. Articles achieving a score of 3 or more were included in the finalized results.
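The scoring scheme can be written down as a small function. This is a sketch under one stated assumption: table 4 is not reproduced here, so the range of criterion (d) is taken to be 0 to 4, which is inferred from the stated maximum of 8 and the maxima of criteria (a) to (c). The function and parameter names are hypothetical.

```python
SOLUTION_SCORES = {"yes": 2, "limited": 1, "no": 0}
INCLUSION_THRESHOLD = 3


def qa_score(contributes, solution_clarity, has_empirical_results, source_score):
    """Sum the four QA criteria into a final score between 0 and 8.

    source_score is the value from the source-ranking table; its 0..4 range
    is an assumption inferred from the stated maximum total of 8.
    """
    if solution_clarity not in SOLUTION_SCORES:
        raise ValueError("solution_clarity must be 'yes', 'limited' or 'no'")
    if not 0 <= source_score <= 4:
        raise ValueError("source_score is assumed to lie between 0 and 4")
    return (
        int(contributes)                     # criterion (a): 0 or 1
        + SOLUTION_SCORES[solution_clarity]  # criterion (b): 0, 1 or 2
        + int(has_empirical_results)         # criterion (c): 0 or 1
        + source_score                       # criterion (d)
    )


def is_included(score):
    """Studies scoring 3 or more are kept in the finalized results."""
    return score >= INCLUSION_THRESHOLD
```

For example, a study that contributes to the field, offers a limited solution, reports empirical results and comes from an unranked source scores 1 + 1 + 1 + 0 = 3 and is therefore included.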

4) SELECTION BASED ON SNOWBALLING
After performing quality assessment, we conducted backward snowballing [13] through the reference list of each finalized study to extract further papers. Only those candidate papers that passed the inclusion/exclusion criteria were selected. Once a paper was found, its inclusion or exclusion was decided after reading its abstract and then other parts of the paper. After thoroughly examining the selected papers, we identified one more study [5], bringing the total to 74 primary studies.

C. REVIEW REPORT
Overview of selected studies is provided in this section.

1) OVERVIEW OF THE INTERMEDIATE SELECTION PROCESS OUTCOME
ETL challenges, real-time stream processing and DWH are correlated and extremely active fields in business intelligence; our review methodology therefore had to draw relevant studies empirically and systematically from all related digital libraries. The next stage of our systematic review was to select the papers that form the knowledge base for this review. About 667,414 papers remained after removing papers older than 2011.
After building a knowledge base from seven digital publishers in computer science, the authors examined the title, the abstract and, if required, the corresponding full paper of each search result. Papers less than four pages long and irrelevant papers were eliminated in this process.
To ascertain the relevance and contribution, accepted publications have been read thoroughly during inspection phase. To achieve the core goal of this study, we build a systematic knowledge base of articles based on their contributions.

2) OVERVIEW OF THE SELECTED STUDIES
Significant results of the primary search, filtering and inspection phases, covering ten digital libraries, are presented in table 5. The search resulted in a very large number of papers (667,414), while the filtering and inspection phases reduced this number to 74 articles.

IV. ASSESSMENT AND DISCUSSION OF RESEARCH QUESTIONS
This section presents the results of our study and provides a descriptive evaluation of each study in tabular format. We analyze the 74 finalized studies based on our RQs in this section.
A. ASSESSMENT OF RQ1: WHICH ARE THE RELEVANT PUBLICATION CHANNELS FOR REAL-TIME STREAM PROCESSING RESEARCH?
Analyzing existing developments and challenges in real-time stream processing is a key challenge for researchers developing business intelligence technologies. For this purpose, identification of high-quality publication venues and a scientometric analysis based on meta information in the area of real-time stream processing are required. This section presents the publication sources, types and years of the selected studies, together with their geographical and publication-channel-wise distribution. The number of selected papers per year is shown in figure 4. Note that the highest number of selected papers was published last year, indicating a growing need for research in the field of real-time stream processing and DWH. Figure 5 shows the percentage of studies selected from journals, conferences and workshops. Journal publications, especially those with a high impact factor, are generally considered superior; accordingly, 56% of the studies in our SLR are journal publications, all published in Q1-Q4 quartile journals. On the other hand, conference articles are as valuable as journal publications in terms of measuring the performance of scientific publications; therefore, 42% of the selected studies are articles from well-ranked conferences. Figure 6 presents the geographical distribution of the selected research papers in percentages. Researchers from Asia and New Zealand contributed most towards developments in real-time stream processing, indicating an increasing need to shorten the time lag between data acquisition and decision making in these regions.
The overall quality assessment score of the finalized studies, with details of the overall classification result, is given in table 6. Selected papers were classified based on four factors: research type, empirical validation, applied approach and application of the study. We have categorized the types of research as SLR, solution proposal, evaluation research or experience paper. It is calculated from table 6 that 97% of the selected studies score more than 3 and that 82% of the final studies have empirically validated their approaches through experiments and were awarded score 1 under category (c) of the quality assessment criteria. Studies scoring less than 3 have been excluded from this SLR.
Major application domains identified during the analysis of the selected studies are: real-time DWH, streaming big data for social media and sensor networks, distributed join and stream processing, and real-time stream processing. In addition, only eighteen papers out of seventy-four score zero for category (d) of the quality assessment criteria, indicating unstable or unrecognized publication sources; the rest score higher, indicating competent sources. We have included these eighteen studies in our survey because of their relevance: they appeared to make an important contribution to the domain. Table 7 lists all the publication sources/channels, the number of articles per publication source and their percentage contribution to this study. It is noted that articles related to real-time stream processing applications and techniques have not been published in any particular sources. Along with domain-specific sources, various open access sources also welcome stream processing related articles. About 5% of the finalized studies were published in the Q1-ranked journal "IEEE Access" and another 5% in the "International Conference on Digital Information Management".
This section presents the issues encountered while implementing real-time stream processing. Two major application domains of real-time stream processing are distinguished: ETL/real-time DWH (discussed during the assessment of RQ3), and applications other than DWH. The requirements/challenges and developments for real-time stream processing applications other than real-time ETL are discussed separately in the subsequent subsections:

1) REQUIREMENTS/CHALLENGES FOR STREAMING BIG DATA IN APPLICATIONS OTHER THAN DWH
Various studies have highlighted and addressed many requirements and challenges in the field of streaming big data. In-memory computing can significantly reduce execution time when the input fits entirely into memory or when multiple iterations over that input are required. An experimental comparison of a recent real-time processing system (Apache Spark) with Hadoop is presented in [82]. Spark outperforms Hadoop in two experiments, one where the input is on disk and one where the input is entirely cached in RAM, owing to the in-memory processing feature of Spark.
Due to the inherently dynamic characteristics of big data, it is difficult to apply existing data mining tools and technologies. Pre-processing of big data streams, effective resource allocation strategies and parallelization are the issues identified by [6], [38]. Open source tools and technologies for big data analytics such as Spark Streaming, Apache Storm, Splunk Stream, Yahoo! S4, NoSQL and Apache Samza are highlighted in the former study. Big data analysis platforms and tools such as Hadoop, GridGain, MapReduce, HPCC and Apache Storm have been reviewed along with their applications in [3]. Difficulties in selecting the right stream processing framework for different use cases while developing a streaming analytics infrastructure were identified and addressed by [5]. That study presents a critical review of key features of several stream processing engines, including Storm, Spark Streaming, Flink, Kafka Streams and IBM Streams. The authors concluded that Kafka Streams and IBM Streams are good options for time-critical applications.
A competitive real-time intelligent data processing system named Stream Cube has been implemented in a recent study [54] to handle real-time big data and to bring powerful AI tools into the data processing field. Two studies [32], [37] have proposed a method and solution for a distributed, replicated and highly available data processing, storage and query system for structured data, named Mesa. Mesa is built using common Google infrastructure and services, including BigTable and Colossus.
To enhance column store indexes and in-memory tables, [33] proposed a solution that significantly improves performance on hybrid workloads. Efficient lookups and the column store scan operator are also addressed in this study. Furthermore, the need for approximate computing techniques for real-time data streams, such as energy-aware approximation, approximation with heterogeneous resources, intelligent data processing and pricing model approximation, has been reviewed in another recent study [42].
Moreover, the information extraction (IE) techniques required by the rapid growth of multifaceted (also called multidimensional) unstructured data are explored in a survey by [51]. The limitations of IE across all data types are classified there as task-dependent and task-independent. Another study [53] proposed a stream processing framework along with a Column Access-aware Instream Data Cache (CAIDC) that supports low response time while maintaining data consistency when migrating from RDBMS to NoSQL. Low latency is required while supporting log-based triggers in the presence of updates, in order to maintain data consistency and to support heavy-hitter queries in the stream processing framework.
Better resource utilization and real-time scheduling are the key challenges identified and addressed in study [55] for real-time processing of streaming big data. The authors proposed a hybrid clustering multiprocessor real-time scheduling algorithm and designed a real-time streaming big data (RT-SBD) processing engine to address these challenges. Their experimental results show that the proposed solution outperforms the Storm engine in terms of tuple latency, proportional deadline miss ratio, and system throughput.
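To make the evaluation metrics named above concrete, the following sketch computes mean tuple latency, a deadline miss ratio and throughput from per-tuple timing records. The definitions used are common textbook ones and are assumptions here: they are not necessarily the exact definitions used in [55], and in particular the plain miss ratio below is not claimed to equal that study's proportional deadline miss ratio. The record type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TupleRecord:
    arrival: float      # time the tuple entered the engine
    completion: float   # time its processing finished
    deadline: float     # absolute time by which it should have finished


def summarize(records):
    """Return mean latency, deadline miss ratio and throughput for a run."""
    if not records:
        raise ValueError("at least one record is required")
    count = len(records)
    mean_latency = sum(r.completion - r.arrival for r in records) / count
    # Fraction of tuples that finished after their deadline
    miss_ratio = sum(r.completion > r.deadline for r in records) / count
    # Tuples processed per time unit over the whole observed span
    span = max(r.completion for r in records) - min(r.arrival for r in records)
    throughput = count / span if span > 0 else float("inf")
    return {
        "mean_latency": mean_latency,
        "deadline_miss_ratio": miss_ratio,
        "throughput": throughput,
    }
```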
Avoiding dynamic memory allocation, lock-free concurrent updates and online pattern detection are the key features required for optimized real-time processing in applications involving large sets of moving objects. These challenges are addressed by the multi-layered grid join (MLG-join) algorithm developed by [56] and by a parallel algorithm for timely detection of spatial clusters developed by [58].

a: IoT
To provide real-time services to users in an internet of things (IoT) based smart transportation environment, an architecture has been proposed and implemented by [57]. This framework is implemented on Spark with Hadoop and the MapReduce technique, and processes and handles huge amounts of data in real time. In addition, another IoT-based framework was developed in a study [60] to analyze students' performance on a real-time basis using sensor and screen activity data. The authors applied visual attention techniques for their analysis, including top-down visual attention, visual saliency/bottom-up attention, saliency using natural statistics, and Boolean map based saliency. Likewise, understanding sensor data and distributed stream engines are the constraints highlighted by [62] and [71].
Processing of geographically distributed data without shifting whole datasets to a single location has been surveyed in a study [19]. To address the challenge of scalable processing and low latency for the IoT cloud, a robotic application was developed by [72]. In another study, [73] set up a real-time system for processing heterogeneous sensor streams from multiple sources with low latency, in which Apache Storm is responsible for distributed real-time sensor data processing. The authors of [74] proposed a framework to address the challenge of continuously growing massive data streams in a smart city network. This framework consists of three layers, of which the second is responsible for real-time stream processing and data filtering, making real-time decisions possible. The proposed framework was then tested with authentic datasets on the Hadoop ecosystem, showing it to be an improved smart city architecture. Additionally, the challenge of density-based clustering in real time is addressed by [75] through an algorithm that obtains high-quality results with low computation time.
A real-time stream processing pipeline and current research activities in the real-time spatiotemporal data domain are highlighted and compared by [81] and [80] respectively. Apache Storm, Apache Kafka and a GeoMQTT broker are used as core tools in the former study to develop a pipeline architecture capable of real-time processing of spatiotemporal data streams. The challenge of event processing capabilities in IoT geospatial architectures is highlighted in the latter study. Inconsistency between traditional data access methods and event-driven approaches, and heterogeneous approaches to defining event patterns, are a few key issues identified by this study that need to be tackled to take full advantage of eventing in GI Science. Esper, Apache Storm, Apache Kafka, ESRI GeoEvent Server and public cloud platforms (Cloud Pub/Sub, AWS IoT Core) are the relevant IoT event processing tools identified in this study.

b: SOCIAL MEDIA
The authors of [76] proposed a method to analyze and process a data stream fetched from Twitter using Hadoop. After analyzing processing time with Hive and Pig on Twitter data, this study concluded that Pig was more efficient than Hive in terms of execution time and support for semi-structured data. Real-time processing of geolocated data from social media apps using Hadoop was performed in a case study by [77], which implemented a k-NN model to investigate the power of machine learning algorithms on unstructured big data. The possibility of real-time analysis of huge multimedia streams from online social networks is highlighted in studies [61], [79]. To overcome the difficulty of dealing with the details of distributed computing and low latency, a framework has been introduced in this study that hides platform details and provides a simple interface to the programmer. This study provided an experimental comparison of three big data stream processing applications: Spark Streaming, Storm and Flink. Storm appeared to be slightly faster than Flink, whereas Spark performed worst of all in experiments on automatic license plate recognition datasets.
Many researchers have looked into the challenges of relational/structured data stream processing and proposed various solutions. Tools and technologies developed for real-time stream processing solutions include: Hadoop, Apache Spark, Apache Storm, Splunk Stream, Yahoo! S4, Apache Samza, GridGain, MapReduce, HPCC, Flink, Kafka Streams, IBM Streams, Mesa, Stream Cube, CAIDC, RT-SBD, MLG-join, etc. The assessment of the selected studies shows that little attention has been directed towards unstructured real-time stream processing. More attention needs to be given to identifying the challenges faced when implementing unstructured data stream processing for all application domains. These challenges create opportunities for applying new processing technologies that are better suited to unstructured big data streams.
Due to the complex and dynamic nature of streaming data, the analysis of stream processing approaches has become difficult and challenging. A rigorous comparative analysis has been performed of the forty-two (42) selected studies that scored between 3 and 8 in the quality assessment and that address real-time stream processing for DWH. The requirements/challenges and the approaches developed to address the identified problems are discussed separately in the subsequent subsections:
• efficient loading of data streams consisting of complex events (concatenation of simple events)
• dealing with repeated data streams
• maintaining low memory budget for growing streams
• processing varying attributes of the stream such as data distribution and arrival rate
• loading strategy of disk-based relational data blocks
• managing different access rates while joining growing streams with disk-based relations
• maintaining regular wait for the join of each stream
• heterogeneous data source integration, data source overload, master data overload, schema-less databases

Methods and techniques that have been employed in analysing real-time streams are outlined in Table 8 in oldest-to-newest order. This comparative analysis is based on five main factors: 1) methodology adopted by each study, 2) challenges identified and addressed in each study, 3) specific tool/data structure used or developed in the design of each study, 4) supported shape of data and 5) evidence for the proposed approach. Table 8 clearly shows that various join algorithms and ETL tools have been developed to date for improving the efficiency of stream processing (joins/queries) for relational databases/streams, addressing the challenges identified during assessment of RQ3. We have classified these algorithms/approaches into the following categories:
• stream-disk join for structured data
• stream-stream join
• SQL query decomposition
• multi-join query processing in cloud DWHs
• survey of design approaches from distributed systems, social media and real-time ETL tools
• architecture/framework for supporting distributed streaming ETL and data integration in real-time DWH
• development of stream ETL engine
• distributed on-demand ETL framework
• code-based real-time ETL tools

Another emerging concept related to near real-time ETL has been addressed recently in [78]. The authors identified the problem of distributed on-demand ETL, proposed a solution, and developed a stream processing framework based on Kafka, Beam and Spark Streaming. This tool is reported to execute workloads 10 times faster than other stream processing frameworks for near real-time ETL while maintaining horizontal scalability and fault tolerance.
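To make the stream-disk join category above more concrete, the following is a minimal sketch of the general idea only, not the algorithm of any particular surveyed study. It assumes tuples are plain Python dictionaries, that the disk-based relation is represented by an in-memory list read in fixed-size blocks, and that the names `semi_stream_join`, `buffer_size` and `block_size` are hypothetical choices made for illustration.

```python
from collections import defaultdict, deque


def semi_stream_join(stream, disk_relation, key, buffer_size=4, block_size=2):
    """Join stream tuples with a disk-resident relation, block by block.

    Illustrative assumptions: `stream` is a finite iterable of dicts,
    `disk_relation` is a list of dicts standing in for disk blocks.
    """
    results = []
    queue = deque(stream)  # stand-in for an unbounded input stream
    while queue:
        # 1) Buffer up to buffer_size stream tuples in a hash table on the join key.
        table = defaultdict(list)
        for _ in range(min(buffer_size, len(queue))):
            tup = queue.popleft()
            table[tup[key]].append(tup)
        # 2) Read the relation one block at a time and probe the hash table.
        for start in range(0, len(disk_relation), block_size):
            block = disk_relation[start:start + block_size]
            for row in block:
                for tup in table.get(row[key], []):
                    results.append({**row, **tup})
    return results


if __name__ == "__main__":
    sales = [{"cust_id": 1, "amount": 10}, {"cust_id": 2, "amount": 4}]
    customers = [{"cust_id": 1, "name": "A"}, {"cust_id": 2, "name": "B"}]
    for joined in semi_stream_join(sales, customers, key="cust_id"):
        print(joined)
```

In this sketch the buffer size bounds the memory held for stream tuples, while the block size controls how much of the relation is read per step. These two parameters correspond loosely to the memory-budget and block-loading challenges listed above.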
Many researchers have implemented algorithms based on hash tables/maps as the core data structure, with database implementations using MySQL, as shown in Table 8. Identifying technical implementation details, such as methodology and data structure, will help researchers further optimize existing approaches. However, little attention has been directed towards implementing real-time ETL/DWH models/tools/architectures for structured/semi-structured/unstructured data streams. This survey found that 51 out of 74 selected studies contained empirical results. No publicly accepted performance benchmark for real-time stream data processing systems has been observed so far; however, we have identified a few performance benchmarks adopted by the selected studies. Table 8 shows that most of the selected studies verified their proposed approaches through experimental evaluation using a synthetic dataset, a real-life dataset, or both. A standard benchmark dataset for real-time streaming analytics has not been widely adopted. However, the few studies that used standardized benchmarking are briefly discussed in the remainder of this section.
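Since hash tables/maps are reported above as the most common core data structure, the following sketch shows how a hash-based stream-stream join can keep its memory bounded by a sliding time window. It is an illustration under stated assumptions rather than a reproduction of any surveyed algorithm: the class name `WindowedHashJoin`, its `window` parameter and the `insert` method are hypothetical, and timestamps are assumed to arrive in non-decreasing order across both streams.

```python
from collections import defaultdict, deque


class WindowedHashJoin:
    """Symmetric hash join of two streams over a sliding time window."""

    def __init__(self, window):
        self.window = window
        # One hash table per stream: join key -> deque of (timestamp, payload).
        self.tables = (defaultdict(deque), defaultdict(deque))
        # Arrival order per stream, used to evict expired tuples.
        self.arrivals = (deque(), deque())

    def _evict(self, side, now):
        table, arrivals = self.tables[side], self.arrivals[side]
        while arrivals and now - arrivals[0][0] > self.window:
            _, key = arrivals.popleft()
            table[key].popleft()  # oldest tuple for this key is the expired one
            if not table[key]:
                del table[key]

    def insert(self, side, key, timestamp, payload):
        """Add a tuple from stream 0 or 1 and return its (left, right) matches."""
        self._evict(0, timestamp)
        self._evict(1, timestamp)
        other = self.tables[1 - side]
        matches = [
            (payload, p) if side == 0 else (p, payload)
            for _, p in other.get(key, ())
        ]
        self.tables[side][key].append((timestamp, payload))
        self.arrivals[side].append((timestamp, key))
        return matches
```

Each arriving tuple first evicts everything older than the window, then probes the opposite stream's table, and is finally stored in its own table. Memory therefore grows with the number of tuples inside the window, not with the total length of the streams.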
The authors of [16], [22], [28] validated their approaches using the TPC-H benchmark, whereas the newer TPC-DS benchmark was used in the experiments of [31]. In [33], the data was collected using a single Intel Haswell CPU with 12 cores at 2.30 GHz and hyper-threading disabled to analyse query performance. The Yahoo! Cloud Serving Benchmark was used to test the performance of the work in study [53]. Some real-world trajectory datasets (a fleet of trucks and city buses) were adopted by [56] and [58] for experimental evaluation of the proposed methodologies, while synthetic datasets generated using the benchmark data generator were also used during evaluation by [56]. The semi-stream join algorithms developed by [20], [23], [43], [69], [70] were tested using both synthetic and real-life datasets; these studies also analyzed memory and time requirements. In addition, a modified version of the well-known Star Schema Benchmark (SSB), called SSB-RT, which embeds real-time features, was used alongside the TPC-H benchmark during experimental evaluation of the rewrite/merge framework by [44].
While validating optimized algorithms addressing common challenges for IoT stream processing, one of the selected studies [75] used the KDD CUP99 Network Intrusion Detection dataset, which derives from the 1998 DARPA Intrusion Detection Evaluation, as a real-life dataset along with 3 synthetic datasets.

V. CONCLUDED DISCUSSION
The objective of this survey is to provide guidance for researchers in the subject of real-time stream analysis for DWH and big data applications. For this purpose, we have investigated applications, developments and challenges in terms of methodology, data structure used and shape of data. Our exploration highlights the main target publication channels for real-time stream processing research in the real-time DWH domain and in other big data applications such as IoT, social media, Google, etc. Our literature review further highlights implementation challenges along with the developed approaches/tools and evaluation evidence for real-time stream processing in all mentioned application domains.

A. GENERAL OBSERVATIONS
Observations from the literature reveal that various algorithms exist for implementing real-time join processing at the ETL stage for structured data, whereas fewer tools and technologies offer real-time join processing for unstructured data. It is further observed that existing studies pay little attention to discussing the data structures used in developing their algorithms; such discussion, where available, might help researchers address research gaps in this subject domain. Research efforts should be geared towards advancing processing approaches that are suitable for all existing/important types of streaming data. In addition, it is rare to find a specific algorithm/approach that collectively addresses all identified challenges. Many researchers have looked into the features and suitability of existing stream processing engines; however, there is still a need for stream engines that can be flexibly modified according to business needs.

B. FUTURE DIRECTIONS
Fitting DWH into a cloud architecture is a pressing business need, as cloud storage is economical and scalable. If DWH is integrated with cloud computing, it can handle relational and non-relational data and can be offered as a service. More tools and technologies could be developed for implementing the cloud DWH concept. Moreover, purpose-built ETL tools that are ready to connect to open data sources, or that can delay the transformation phase until it is needed, are becoming more effective from the industry's viewpoint.
ERUM MEHMOOD was born in Pakistan. She received the M.Phil. degree in computer science from NCBAE, Lahore, Pakistan, in 2017. Her M.Phil. dissertation is in the area of stream processing for real-time data warehousing. She is currently pursuing the Ph.D. degree with the University of Management and Technology, Lahore, under the supervision of Dr. Tayyaba Anees.
She is currently working as a Lecturer of computer science with the Government Degree College, Lahore. Her research interests include big data analytics, stream processing, ETL, and real-time data warehousing.
TAYYABA ANEES was born in Pakistan. She received the Ph.D. degree from the Vienna University of Technology, Vienna, Austria, in 2012. Her Ph.D. dissertation is in the area of service-oriented architecture and web services availability domain. She has worked as the Project Assistant at the Vienna University of Technology for four years. She is currently working as the Director Software Engineering Program/Assistant Professor at the Software Engineering Department, University of Management and Technology, Lahore. Her research interests include service-oriented architecture, web services, software availability, software safety, software engineering, software fault tolerance, and real-time data warehousing. VOLUME 8, 2020