A Comprehensive Review on Big Data for Industries: Challenges and Opportunities

Technological advancements in large industries such as power, minerals, and manufacturing generate massive data every second. Big data techniques have opened up numerous opportunities to utilize these massive datasets in effective ways to improve the efficacy of the related industries. This paper presents a review of big data technologies used in the power, mineral, and manufacturing industries for various purposes. We analyze the meta-data of the collected papers, then review and select papers by applying selection criteria and a paper quality assessment strategy. We then propose a taxonomy of big data application areas in the power, mineral, and manufacturing industries. We study current big data architectures and techniques implemented in these industry sectors and uncover the big data research gaps that remain. To address the gaps, we point out relevant research questions and, to answer them, make future research recommendations that may yield interesting research ideas for building a big data-driven industry. Since the careful use of big data benefits every industry sector, supportive big data frameworks need to be developed to speed up the big data analysis process. Proper multi-dimensional big data assessment is also needed to support effective data analysis tasks. Industry automation is likewise heavily influenced by the proper utilization of big data: an intelligent agent can handle many processes and heavy production loads in the manufacturing industry, and can also work efficiently in risky environments such as mines. Big data can be used to train such agents for working in a specific environment.


I. INTRODUCTION
The headway of information communication technologies (ICTs), the internet of things (IoT), and the age of industry 4.0 have brought all industries very near to automation [1]. With these advancements, a massive amount of data is generated every second, and it becomes clear that data is the most significant factor in the age of big data [2]. Big data
can be called intellectual petroleum for all socio-technical sectors [3]. Revolutionary changes have resulted over the last few years with the adoption of big data technologies in leading industries such as power, mineral, and manufacturing.
Through the utilization of this invaluable data, many industries are reframing their operations and reshaping their business models [1]. According to a 2012 survey by International Data Corporation (IDC) Energy, 70% of US oil companies were unconcerned about the application of big data techniques in the oil and gas industry [4]. A more recent survey by General Electric and Accenture found that 81% of oil and gas companies considered big data a top priority [5]. Thus, big data applications are increasing in the energy sector, creating significant opportunities in energy conservation, energy management, environmental protection, and the analysis of consumption and production data [4].
Nowadays, large industries benefit from big data techniques, integrating IoT applications, machine learning (ML), and data mining algorithms to learn about their consumers, markets, and business trends from operational data (e.g., transaction prices, electricity sales, electricity consumption, and customer data) [6]. Many smartphone applications [7], with the help of integrated sensors, collect continuous household power consumption data from customers. Customers' consumption patterns help companies make significant business decisions, offer beneficial consumption policies to customers, and plan the load distribution of an area. Production data, e.g., power generation and voltage stability data in the power industry, allows continuous monitoring of the system and detection of faults or anomalies. Production data in the manufacturing industry can likewise identify defective products, machines, and tools.
However, as industries produce highly unstructured and high-dimensional data from diverse sources, the accumulation of such massive data into a unified structure and its utilization are very challenging. For instance, oil and gas companies collect 2D, 3D, and 4D geophysical seismic data, which is unstructured and complex, with the help of data-gathering sensors in subterranean wells to monitor operational resources. By developing region-wise data sharing models, oil and gas industries accumulate environmental data such as geographical properties and marine life details on the seabed. These data-centric models may help to develop precise environmental models combining ML with predictive analytics, and such environmental data-sharing models can further be considered for drilling fluid selection [8]. Besides, environmental models can diminish the current risks of inefficient drilling fluids, which increase drilling time and costs and harm the environment. Thus, oil and gas organizations can select the optimum drilling fluid for a region. Environmental data-sharing models can also create incentives for oil and gas operators. Given the current drop in oil and gas prices, these models may help operators reduce the cost of decommissioning wells using the insights they provide [9].
Moreover, large industries must place tremendous emphasis on controlled conditions [10] for operation and production. Though big data techniques are being adopted vigorously in most large industries through successful projects, improved operational efficiency and process optimization are still in demand. Big data techniques such as data mining, system training, and interpretation through predictive analysis applying neural networks, classification, and clustering algorithms can enhance productivity and efficiency [11]. Moreover, big companies must focus on reducing costs with increased profit margins and on reducing the natural dangers (associated with power generation and mineral extraction) by using different data management techniques [12]. In this review paper, we try to find answers to the following questions:
• Which research topics on big data technologies related to the industry have been addressed so far?
• What are the big data research gaps that exist in the industry sectors?
• What are the open research questions that come up with potential solutions to the gaps?
• What would be the future directions to bridge the gaps?
Therefore, the motivation of this research study is to discover the big data research gaps that exist in the industry sectors. The main contributions of this research study can be summarized as follows:
• We study the state-of-the-art big data technologies that have been applied to the development of the power, mineral, and manufacturing industries.
• We investigate existing research gaps in the power, mineral, and manufacturing industries.
• We propose some open research questions to bridge existing big data research gaps in the power, mineral, and manufacturing industries.
• We recommend some potential future research directions to eliminate the gaps and promote a big-data-driven industry.
The remainder of the paper is organized as follows. Section II describes the review methodology. In Section III, a taxonomy of big data application areas in the power, mineral, and manufacturing industries is proposed. The state-of-the-art research in these industries is discussed in Sections IV-A, IV-B, and IV-C, respectively. Open research challenges and future directions are discussed in Section V. Finally, the last section concludes the paper.

II. REVIEW METHODOLOGY OF BIG DATA APPROACHES FOR INDUSTRY
In this section, we discuss the process of conducting a review of research papers on big data for the industry. For searching and collecting research papers, we analyze the meta-data of research papers. The process of research papers' meta-data analysis is illustrated in Fig. 1.
We divide the entire process into four steps: identification, screening, assessment, and inclusion. In the identification step, we search papers using keywords following the search strategy discussed in Section II-A. At first, we selected some good sources of high-quality papers such as the IEEE digital library, SpringerLink, the ACM digital library, Science Direct, etc. We searched the selected sources separately and found a total of 2033 papers from IEEE (843), SpringerLink (358), ACM (117), Science Direct (430), and others (285). We apply the paper selection criteria discussed in Section II-B and remove 17 repeated papers found in multiple databases; the remaining 2016 research papers go to the screening step. In this step, we go through the papers' titles and abstracts to examine whether they are written in English and related to big data for the industry. Thus, we remove 1708 papers, and the remaining 308 papers go to the quality assessment step, where they are filtered by the process discussed in Section II-C and demonstrated in Fig. 3. Finally, we obtain 132 papers for review.

A. SEARCH STRATEGY
We search for research papers on big data for industry sectors published before 2022 in popular high-quality research paper databases such as the IEEE digital library, the ACM digital library, SpringerLink, Elsevier, the Multidisciplinary Digital Publishing Institute (MDPI), Google Scholar, Wiley, etc. We search each database using specific keywords such as ((big data <AND> industry) <OR> (power big data) <OR> (minerals big data) <OR> (process industry data)).

B. STUDY SELECTION AND INCLUSION CRITERIA
To filter good-quality papers from the collected papers, we set some inclusion and exclusion criteria, which are shown in Fig. 2. The selection and inclusion criteria are as follows:
• The research paper was published before 2022.
• The research topic is relevant to big data for industry.
• The research paper is written in English.
• The research paper contributes to addressing its formulated research questions.
• The research paper obtained a quality score greater than or equal to three in the paper quality assessment process demonstrated in Fig. 3.

C. COLLECTED PAPER QUALITY ASSESSMENT
We set some quality assessment questions to evaluate the quality of a research paper. Each question is worth one point. If a paper answers a quality assessment question completely, it gains one point; if it answers partially, it gains 0.5 points; otherwise, it gains zero points. Finally, all the points achieved by a paper are summed up to give the quality score of the paper. In this process, the quality of each paper is evaluated. The quality assessment process is demonstrated in Fig. 3. The paper quality assessment questions are as follows:
• Does the paper clearly state its aims?
• Does the paper make a substantial novel contribution(s) in the field of big data for the industry?
• Does the paper discuss big data challenge(s)?
• Is the paper able to answer the formulated research question(s)?
In Table 1, the year-wise distribution of the finally selected papers is listed. The year-wise frequency of papers is demonstrated in Fig. 5. We distributed the papers into six groups based on year ranges. As big data is a currently emerging topic, we found fewer papers before 2011, so we listed the older papers, up to 2014, in a single group. After that, we found a large number of publications every year.
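The scoring scheme above can be sketched in a few lines. The question identifiers and example answers below are hypothetical; only the point values (1, 0.5, 0) and the inclusion threshold of three come from the text.

```python
# Sketch of the paper quality scoring described above: a full answer to a
# question earns 1 point, a partial answer 0.5, no answer 0; a paper is
# retained if its total score is >= 3.

def quality_score(answers):
    """answers: dict mapping question id -> 'full' | 'partial' | 'none'."""
    points = {"full": 1.0, "partial": 0.5, "none": 0.0}
    return sum(points[a] for a in answers.values())

def is_included(answers, threshold=3.0):
    return quality_score(answers) >= threshold

# Hypothetical answers for one paper against the four questions above.
paper = {"aims": "full", "novelty": "full", "challenges": "partial", "rq": "full"}
print(quality_score(paper))   # 3.5
print(is_included(paper))     # True
```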

III. TAXONOMY OF BIG DATA APPROACHES FOR INDUSTRY
From the literature review, several research papers on big data technologies implemented in the industry sectors have been found. To review the finally selected papers in an organized manner, we propose the taxonomy shown in Fig. 4. After collecting the industry-related papers, we realized that, based on industry domains, the existing research papers on big data for the industry can be categorized into three main categories: the power industry, the minerals industry, and the manufacturing industry. Research papers that discuss big data technologies for handling power data, and various methods and algorithms for improving the performance of smart grid systems, are included in the power industry category. The minerals industry category includes research papers that discuss oil and gas field data accumulation, seismic data digitization and visualization, geophysical pattern recognition, drilling field identification, and so on. The manufacturing industry category includes production big data handling techniques, product and machine fault detection methods, product quality assurance methods, and so on. Each category is divided into multiple sub-categories depending on the application sub-areas in each industry domain.
The power industry is sub-categorized into (i) power data quality assessment; (ii) power data fusion and cleaning; (iii) distributed power data mining; (iv) power data communication, privacy, and security; and (v) power data analytics. Among these, power data analytics is the most diverse sub-category. Therefore, we divide power data analytics into renewable energy prediction, system monitoring, fault detection, and user and business analytics. The minerals industry is sub-categorized into (i) minerals data storage and resource management; (ii) minerals data processing; and (iii) minerals data analytics. Like power data analytics, the implementation of minerals data analytics is also diverse, so it is divided into exploration, drilling and completion, reservoir management, production engineering, pipeline monitoring, and maintenance. The manufacturing industry is sub-categorized into (i) manufacturing data processing, security, and transmission; (ii) process state monitoring; (iii) product quality assessment; and (iv) manufacturing data analytics. Based on the implementation, manufacturing data analytics is divided into production management, product anomaly detection, and supply chain management. Though we classify the applied methods into different categories to assist our study, the techniques applied in one industry domain can, to an extent, also be applied in another domain, except in some special cases. So, at the end of the study, we emphasize industry domain-independent big data techniques and discuss the overall challenges and future research opportunities in the industry sectors.
In Table 2, the techniques and methods applied to big data in the power, mineral, and manufacturing industries are listed. The market tools used in big data processing and visualization are listed in Table 3. Fig. 6 and Fig. 7 depict the percentages of big data techniques used to acquire and process industry data and of ML and data mining techniques used to analyze and apply the data, respectively. High-quality data is the most essential requirement for the adoption of big data technologies and data analytics [39]. The performance and accuracy of any method heavily depend on the quality of big data and can be severely affected by poor-quality data.

IV. LITERATURE REVIEW
1) POWER DATA QUALITY ASSESSMENT
A commonly adopted approach to power data quality assessment is to frame assessment techniques that collect and clean the dataset using big data frameworks. A power system big data quality assessment system that handles real-time and historical electric power data separately can be efficient; such a system is proposed in [2]. In this scenario, the power grid system is separated into two sub-parts: headquarters and the provincial power grid. The power grid data can be acquired from the provincial grid at headquarters using Kafka, and real-time data can then be stored using HBase. Historical data can be collected over a socket connection established between headquarters and the provincial grid; headquarters can thus access the FTP server of the provincial grid, and historical data can be downloaded and stored in HBase and HDFS. Besides, an integrated database environment helps to store the data, including a relational database (Oracle), a NoSQL database (HBase), and a distributed file system (HDFS). Before determining the type of assessment, redundant blank lines and blanks are eliminated. The data quality assessment strategy determines whether the data needs to be cleaned. For data cleaning, there are various methods, such as outlier detection. The assessment strategy can vary depending on the assessment type, such as subjective or objective. However, an ideal assessment method must include techniques to convert unstructured and semi-structured data to structured data and apply advanced techniques for the extraction of meaningful features from the data [113].
In addition, specific evaluation criteria for the power data quality assessment method must be defined. A clear process for selecting a quality assessment strategy for a particular domain has also been ignored so far. The proficiency of such techniques must be proven through quality evaluation of power data in practical experimental assessments of power grid systems.
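The outlier-detection cleaning step mentioned above can be sketched with a robust median-based rule; real pipelines would run the equivalent logic at scale over HBase/HDFS. The voltage readings below are synthetic, and the cutoff factor `k` is an illustrative choice.

```python
# Minimal sketch of outlier-based data cleaning: drop readings whose
# deviation from the median exceeds k times the median absolute
# deviation (MAD), a robust alternative to mean/stddev cutoffs.
from statistics import median

def clean_outliers(readings, k=10.0):
    med = median(readings)
    mad = median(abs(x - med) for x in readings)
    if mad == 0:               # all points identical: nothing to drop
        return list(readings)
    return [x for x in readings if abs(x - med) <= k * mad]

voltages = [229.8, 230.1, 230.4, 229.9, 230.2, 950.0]  # one sensor glitch
cleaned = clean_outliers(voltages)  # the 950.0 glitch is removed
```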

2) DISTRIBUTED POWER DATA MINING
Efficient big data analysis systems must have powerful data processing and mining tools, algorithms, and platforms. Conventional data processing systems have been developed around OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing) systems, which lack scalable operational models. Hence, they have proven insufficient, taking a long time to process big data and delivering unsatisfactory performance [126].
Enterprises are taking the help of data lakes to ingest data from on-premises, cloud, or edge-computing systems; to maintain full fidelity while storing and processing data; and to analyze data using any language and application. Distributed processing systems such as Hadoop and Spark have brought a solution to the problems of traditional data processing systems and can incorporate deep learning frameworks. However, general-purpose mining algorithms do not serve well in discovering and utilizing special-purpose data such as industry data. So, building mining platforms for domain-specific industry data (e.g., [91], [132]) has become a recent research focus.
Besides, for collecting and managing large power data virtually, IoT, cloud computing, and fog computing have proven highly efficient [104]. Stergiou and Psannis [104] proposed a framework for the energy consumption of industrial data centers across different heterogeneous machines. They embraced emerging reinforcement learning and federated learning techniques. Recent research trends use advanced distributed data processing frameworks such as Spark, Hadoop, and YARN together with ML or deep learning models; e.g., [126] embraced distributed deep learning algorithms such as LeNet-5 and LSTM networks and implemented a Directed Acyclic Graph (DAG) to reduce operational complexity and facilitate the reuse of components. However, how feature extraction, model learning, optimization, and insight generation can best be utilized in the power industry is yet to be explored.
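The distributed mining idea above can be sketched as a MapReduce-style job: each data partition (e.g., one provincial grid) is mapped to partial per-feeder consumption totals, and a reduce step merges them. Spark or Hadoop would distribute the same two phases across a cluster; the feeder names and records below are hypothetical.

```python
# MapReduce-style aggregation sketch: map each partition to partial
# totals, then reduce (merge) the partial results.
from collections import Counter
from functools import reduce

def map_partition(records):
    """records: iterable of (feeder_id, kwh) pairs -> partial totals."""
    totals = Counter()
    for feeder, kwh in records:
        totals[feeder] += kwh
    return totals

def reduce_totals(a, b):
    return a + b  # Counter addition merges partial sums

partitions = [
    [("feeder-A", 10.0), ("feeder-B", 5.0)],   # e.g., province 1
    [("feeder-A", 2.5), ("feeder-C", 7.0)],    # e.g., province 2
]
totals = reduce(reduce_totals, map(map_partition, partitions))
```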

3) POWER DATA FUSION AND CLEANING
Data fusion and cleaning is the foremost step in big data processing and analytics. Though data fusion and cleaning are very challenging because of the heterogeneity of big data, they are implemented by many big data platforms, such as [122], where homogeneous data has been fused. Clustering and fuzzy methods are among the efficient techniques for high-order heterogeneous data cleaning and fusion, as proposed in [109]. However, most of these methods are associated with a loss of data in the cleaning process, which is not conducive to mining the data in subsequent state assessments. To solve this problem, Lv et al. [84] proposed a data fusion method that can fuse multi-source heterogeneous grid data in text file format and store manual records of the grid in a unified data file format. They also proposed an ML-based data cleaning method using a support vector machine (SVM), radial basis function (RBF) networks, neural networks (NN), random forests (RF), and a multi-layer perceptron (MLP).
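The core of multi-source fusion, mapping records from heterogeneous formats into one unified schema before cleaning, can be sketched as follows. The two source formats, field names, and sample values below are hypothetical, not the actual formats used in [84].

```python
# Sketch of fusing heterogeneous grid records into a unified schema:
# one parser per source, all emitting the same dictionary layout.

def from_scada(row):
    """Parse a semicolon-delimited SCADA line, e.g. 'S-101;230.1;49.98'."""
    sid, v, f = row.split(";")
    return {"sensor": sid, "voltage": float(v), "freq": float(f)}

def from_manual_log(entry):
    """Parse a manually recorded entry; frequency is not recorded."""
    return {"sensor": entry["id"],
            "voltage": float(entry["volt_reading"]),
            "freq": None}

unified = [from_scada("S-101;230.1;49.98"),
           from_manual_log({"id": "S-102", "volt_reading": "229.7"})]
```

Downstream cleaning and mining then only ever sees the one unified record shape.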

4) POWER DATA COMMUNICATION, PRIVACY, AND SECURITY
The unrestricted nature of the smart grid enables interconnection between the power grid and users [70], [80]. Access by huge numbers of users, extensive use of intelligent data acquisition, and wireless network transmission yield massive amounts of data. In power system transmission, each information level produces huge datasets. With the surge of noisy data, conventional clustering algorithms are not efficient for widespread noisy data [80]. Moreover, the secure storage of data is one of the crucial challenges in applying big data [80]. Beyond power system data, the privacy and security of user behavior data should also be ensured. Besides, as the smart grid is an intelligent network that connects energy users' actions, reduces energy consumption and cost, and increases reliability by using advanced communications technology, suitable demand management is required to generate and transmit energy [79].
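One common safeguard for the user behavior data discussed above is to pseudonymize customer identifiers with a salted hash before storage, so analytics can still group records by user without exposing identities. This is a generic sketch, not a method from the cited works; the salt value and record layout are illustrative only.

```python
# Pseudonymization sketch: replace a raw meter/user ID with a salted
# SHA-256 digest so the stored analytics data cannot be trivially
# linked back to the customer.
import hashlib

def pseudonymize(user_id, salt="grid-secret-salt"):  # salt is illustrative
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]

record = {"user": pseudonymize("meter-0042"), "kwh": 3.2}
```

The same input always maps to the same pseudonym, so per-user aggregation still works; in practice the salt would be kept secret and managed like a key.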

1) Renewable energy prediction
Integration of smart grids with distributed renewable energy sources such as wind and solar poses increased challenges to the power industry. Though the operational capability of the power system to feed renewable energy into the grid has increased over time, the inconsistent nature of renewable energy may negatively impact the power supply [14]. So, the detection of possible renewable energy sources and the prediction of energy production are of vital importance. For prediction, artificial neural networks outperform most earlier methods. For instance, [14] produced prediction intervals using two separate methods to assess confidence in the predictions. In the first method, multilayer perceptron neural networks are trained with a multi-objective genetic algorithm; in the second, an extreme learning machine is trained in combination with a k-d tree. Short-term wind prediction on a real dataset of hourly wind speed measurements also provides insight into the energy resources that will be required. However, though wind energy is one of the frontrunners in technological breakthroughs leading towards more efficient power production systems, its drawbacks cannot be ignored. The establishment and maintenance of turbines and wind facilities are highly costly, an expense directly related to still-maturing technology. Through technological innovations, reliability and energy output can be increased and system expenses reduced.
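The idea of a prediction interval can be illustrated without the full neural-network machinery of [14]: given a point forecast and held-out forecast residuals, empirical residual quantiles yield a (1 - alpha) interval. The wind-speed numbers below are synthetic and the simple index-based quantile is an assumption of this sketch.

```python
# Empirical prediction interval sketch: shift the point forecast by the
# lower and upper residual quantiles observed on held-out data.

def empirical_interval(forecast, residuals, alpha=0.2):
    r = sorted(residuals)
    lo = r[int(len(r) * alpha / 2)]          # lower alpha/2 quantile
    hi = r[int(len(r) * (1 - alpha / 2)) - 1]  # upper alpha/2 quantile
    return forecast + lo, forecast + hi

# Synthetic residuals (actual minus forecast wind speed, m/s).
residuals = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
lo, hi = empirical_interval(12.0, residuals, alpha=0.2)  # 80% interval
```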

2) Power System Monitoring and Fault Detection
Smart grid systems require continuous monitoring of the power systems [87], and big data mining and ML algorithms facilitate early predictions and forecasting regarding the states of power systems. Fault detection in power systems is of great importance to avoid sudden power failures. The advancement of sensor technology allows collecting real-time information on power system health, operational status, and so on.
In addition to sensor-collected data, other data such as meteorological information has proved useful for forecasting power failures [58]. Fault detection in power systems is therefore challenging, as it involves multi-source heterogeneous data. Fault detection methods are generally facilitated by clustering and classification techniques, depending on the purpose. Smart grid fault detection can be considered a one-class classification problem if there is little variety in the smart grid big data; otherwise, it can be treated as multiclass classification. Classification can be done using various techniques: the fuzzy method and decision tree algorithms are mostly used for linear data, while neural network models can be used for non-linear data. In the fuzzy method, a value is associated with every specific feature, and depending on a threshold, the data can be classified. In the decision tree algorithm, decisions are made on data instances based on feature values; thus grid data can be classified as faulty or anomalous. De Santis et al. [45] and [59] considered it a one-class classification problem and proposed a combined method of dissimilarity measure learning by evolutionary optimization and clustering techniques. They then analyzed the results using a fuzzy set-based decision rule. They used power system operational data, spatio-temporal data, physical component state data such as currents and voltages, weather conditions, etc. In [92], faults due to power swing were detected within half a cycle using a decision tree. A decision tree algorithm was also applied to develop intelligent relaying for a transmission system by Jena and Samantaray [57]. They extracted 21 features from phasor measurement unit (PMU) data by applying Kalman filtering and fed them into decision tree algorithms to provide the transmission line relaying decision. Papadopoulos et al. [93] applied a decision tree incorporating hierarchical clustering to detect the dynamic behavior of power systems after an interruption and detected unstable non-synchronous groups.
Another purpose of fault detection is developing early warning systems. Reference [112] applied the extreme learning machine (ELM) algorithm to develop an intelligent early-warning system for reliable online detection of risky events in the power system. Reference [96] analyzed the factors that affect transmission line galloping and proposed a bi-level classifier applying SVM and AdaBoost. In power transmission systems, a crucial operational requirement is maintaining voltage stability. Rapid decision-making has become challenging with the massive amounts of data collected from geographically distributed locations by traditional SCADA systems. A subset of ML, active learning, has been proposed by Malbasa et al. [87] to predict the voltage stability of power systems. Their data-driven model supports online updates and offline training. Through active learning, unlabeled datasets are labeled automatically, and data processing is done efficiently. By collecting system operation data and meteorological conditions of the power system, Sheng et al. [81] adopted traditional association rule mining algorithms such as Apriori, AprioriTid, and AprioriHybrid, incorporating a probabilistic graphical model to monitor transformer state and predict failure.
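The threshold-based (fuzzy-style) classification described above can be illustrated with a toy rule: each feature gets a membership score in [0, 1] for "healthy", and a sample whose weighted score falls below a threshold is labeled faulty. The features, nominal values, tolerances, and weights below are hypothetical, not taken from the cited works.

```python
# Toy fuzzy-style fault classifier: linear membership functions per
# feature, combined by a weighted sum and compared to a threshold.

def membership(value, nominal, tolerance):
    """1.0 at nominal, decaying linearly to 0 at nominal +/- tolerance."""
    return max(0.0, 1.0 - abs(value - nominal) / tolerance)

def is_faulty(sample, threshold=0.5):
    healthy = (0.6 * membership(sample["voltage"], 230.0, 30.0)
               + 0.4 * membership(sample["freq"], 50.0, 1.0))
    return healthy < threshold

print(is_faulty({"voltage": 231.0, "freq": 50.01}))  # False (near nominal)
print(is_faulty({"voltage": 180.0, "freq": 48.5}))   # True
```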

3) Load forecasting
Load forecasting is vital for energy management, market demand analysis, and power system operation [60]. Various techniques have been applied to load forecasting. However, centralized forecasting is challenging for a large power system due to weather diversity and load variation across regions. Liu et al. [58] proposed a distributed short-term load forecasting technique based on local weather data.
The power system network is divided into subnets based on optimal region partitioning, and the similarity between vectors of influencing factors is calculated using cosine distance to select representative samples as training datasets. They used neural network, autoregressive integrated moving average, and autoregressive moving average models for the separate subnets. Li et al. [85] proposed an extreme learning machine (ELM) with a wavelet-based ensemble scheme and integrated the Levenberg-Marquardt method to improve the model's performance. For feature selection and input variable selection, they used conditional mutual information, and partial least squares regression was implemented to combine the individual forecasting results. However, feature selection or extraction remains one of the most challenging tasks in ML models.
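The cosine-similarity sample selection mentioned above can be sketched directly: rank historical days by how similar their influencing-factor vectors (e.g., temperature, humidity, weekday flag) are to the target day, and train on the most similar ones. The factor choices and values below are synthetic illustrations, not the actual features of [58].

```python
# Cosine-similarity selection sketch for representative training samples.
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical influencing factors: [temperature C, humidity, weekday flag]
target = [28.0, 0.70, 1.0]
history = {"day1": [27.5, 0.68, 1.0],
           "day2": [5.0, 0.30, 0.0],    # a cold winter weekend: dissimilar
           "day3": [29.0, 0.75, 1.0]}

# Most similar historical days first; these become the training set.
ranked = sorted(history, key=lambda d: cosine_sim(history[d], target),
                reverse=True)
```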

4) Power User and Business Analytics
Due to the improvement of living standards, the number of high-quality power appliances in homes is increasing. This creates extra load demand during peak hours on the smart grid and is a challenge for safe operation [6]. Power load demand varies widely across times of day and locations. Therefore, to keep track of customer demand, power generation companies need to analyze load demand and ensure an uninterrupted power supply. Big data technology can help store residential power consumption data and discover user consumption behavior using data mining and ML algorithms. As a demonstration, Wu and Tan [6] proposed a big data storage framework for user power data and analyzed user behavior using the Apriori algorithm. Besides, Zhang et al. [111] emphasized optimization of load demand response to manage home energy systems. This user data analytics assists power companies in making business decisions and updating policies for power users [26].
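The Apriori-style analysis in [6] can be illustrated with a toy example: find appliance sets that frequently co-occur in household peak-hour usage records. This sketch shows only the first two Apriori passes (frequent single items, then frequent pairs); the appliance names and usage records are synthetic.

```python
# Toy Apriori sketch: frequent 1-itemsets, then candidate pairs built
# only from frequent items (the Apriori pruning step), filtered by
# minimum support.
from itertools import combinations

baskets = [                      # appliances on during peak hours
    {"ac", "tv", "oven"},
    {"ac", "tv"},
    {"ac", "oven"},
    {"tv", "heater"},
]
min_support = 2                  # itemset must appear in >= 2 households

items = {i for b in baskets for i in b}
freq1 = {i for i in items if sum(i in b for b in baskets) >= min_support}
candidates = combinations(sorted(freq1), 2)
freq2 = {p for p in candidates
         if sum(set(p) <= b for b in baskets) >= min_support}
```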
The complete process of power big data acquisition, processing, and application, including the corresponding technologies, is depicted in Fig. 8. A summary of selected papers related to the power industry that applied big data technologies is given in Tables 4 and 5, listing the major contributions, datasets, implementation and evaluation details, and limitations of the selected papers. A category-wise list of previous works in the power industry that embraced big data techniques is given in Table 6.

B. BIG DATA IN MINERALS INDUSTRY
Big data has spread throughout diverse sub-areas of the minerals industry and has become one of the driving forces of the global economy and business. Minerals big data may help to discover suitable oil and gas fields with minimal destruction of the environment; establish efficient mineral extraction systems; explore relationships with customers; and identify markets and opportunities. Therefore, research on the storage, analysis, and visualization of unstructured and semi-structured data has become the center of research interest and big data innovation in the minerals industry [130].

1) MINERALS DATA STORAGE AND RESOURCE MANAGEMENT
Typically, mineral systems involve an extensive number of reports, geographical location maps, queries, official documents, and so on. Real-time monitoring and management of exploration, mining, drilling, reservoir changes, and the utilization of associated management rights are severely impeded by inefficient data collection, storage, and management systems [33]. The problems of inadequate natural energy resources relative to demand, inconsistent prices, environmental risks, and competition among other energy sources can be addressed by proper utilization of data accumulated from diverse mineral sources [20].
Big data storage and management frameworks have demonstrated the ability to efficiently support large amounts of mineral data storage and management. Li et al. [33] investigated the architecture of a mineral resource management system and proposed a resource management system integrating big data and GIS technology. The proposed system also included various rights management functions such as mineral mining management, exploration management, GIS map management, geological exploration management, resource reserve management, resource database management, statistical analysis management of data, etc. [33]. In their proposed system, mineral assets data is stored on the server side of the responsible government monitoring department and can be accessed over an internal local area network, while mineral business enterprises can access the resources over the internet. However, the security of the system is a crucial issue that has been ignored; to protect invaluable mineral resource data from hacking, security systems need to be strengthened. Bello et al. [47] developed a big data architecture for the automation of distributed production and downhole sensing data transmission, management, and visualization through real-time reservoir and well monitoring applications. They further explored a web-based framework for data exchange between industries and business enterprises, monitoring, and interpretation [47]. Another integrated architectural model combining big data business analytics and transaction data has been proposed by Alguliyev et al. [20].

2) MINERALS DATA PROCESSING
Due to progress in the petroleum and IoT sectors, petroleum geoscience data has been growing rapidly in recent years and has reached the petabyte, or even zettabyte, scale. Petroleum data includes seismic data (85%), well log data (6%), petroleum engineering data (5%), rock physics data (2%), and others (2%) [50]. It has therefore become very challenging to handle such large amounts of petroleum geoscience data promptly using CPUs alone [89]. Han et al. applied a hybrid CPU/GPU system to process petroleum geoscience big data, using MPI and CUDA parallel technology to reduce computational cost. They used eight GPUs to process 48 GB of actual seismic data in 3.1 hours [50]. However, to run parallel computational tasks smoothly, machine efficiency is a crucial factor: MPI and CUDA computation over big data requires high-performance systems, which are extremely costly. Therefore, alternative efficient computing technologies need to be proposed.
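The core pattern behind such hybrid processing is data partitioning: a survey is split into independent traces and distributed across workers (MPI ranks or GPU devices in Han et al.'s setup). The following minimal sketch illustrates only the partitioning pattern in plain Python, with threads standing in for devices and a trivial DC-offset removal standing in for a real seismic kernel; both are our own illustrative stand-ins, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def process_trace(trace):
    """Toy per-trace kernel: remove the DC offset (stand-in for real seismic processing)."""
    mean = sum(trace) / len(trace)
    return [s - mean for s in trace]

def process_volume(traces, workers=8):
    """Partition traces across workers, mirroring how MPI ranks or GPUs split a survey."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_trace, traces))

traces = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]
result = process_volume(traces)  # each trace processed independently
```

In a real deployment, each worker would be a GPU running a CUDA kernel over its shard, and the per-trace independence is what makes the workload scale.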

1) Exploration through Seismic Pattern
Due to advances in geophones, a large amount of data is generated and captured at every moment of a fracture job [48]. Interpretation of seismic data is essential for visual understanding. Many well-known oil and gas companies like ExxonMobil have utilized seismic visualization to detect or predict the distribution of fractures in tight reservoirs, which improves streamline simulation and well placement. Various sensors, e.g., hydrophones and geophones, capture data from low-frequency waves caused by tectonic activities, and the well-developed oil and gas industry now builds data-driven 3D visualization frameworks for topography, speed display, and depth imaging [8].
Geological and geophysical data aggregated from oil- and gas-bearing basins and fields contribute a huge amount of data [11]. Olneva et al. [11] proposed a Hadoop-based framework to analyze seismic datasets, extract significant geological features, characterize reservoirs, and identify geological issues [48]. Multiple high-performance computers and advanced data analysis algorithms are required to interpret seismic data.
As an illustration, Roden proposed principal component analysis (PCA) and self-organizing maps (SOMs) to interpret patterns from large amounts of seismic data [30]. As PCA and SOMs are unsupervised ML algorithms, they can extract unknown hidden features from seismic data and help in understanding the geology; through visualization, hidden geological features become easily visible on a 2D color map. As a practical realization, Olneva et al. [11] applied big data techniques to the database of the West Siberia petroleum basin, a vast store of diverse, uncommon characteristics and hydrocarbon accumulations, to discover new patterns in field distribution and evolve novel exploration techniques. The authors (i) applied k-means clustering to 1D data and multivariate regression to drilling results from 5,000 wells to demonstrate the regional structure of West Siberia through visualization of regional maps and charts, and (ii) built a training image sample of seismic events in the Achimov sequence from 3D seismic data and geological patterns covering more than 40,000 sq. km of 3D regional data. Alfaleh et al. [19] applied topological data analysis (TDA) to analyze the shape of complex data and discover clusters and their statistical importance for examining reservoir connectivity and compartmentalization. TDA enables high-accuracy forecasting, new growth plans, efficiency measurement, and optimization. The Norne model [133] was used to simulate the case study, with inverted 4D time-lapse seismic data generated by reservoir simulations.
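To make the PCA step concrete, the sketch below computes the first principal component of two-attribute samples (e.g., two seismic attributes per trace) using the closed-form eigen-decomposition of the 2x2 covariance matrix. This is a minimal pedagogical version, not Roden's multi-attribute workflow; the sample data is invented.

```python
import math

def pca_first_component(points):
    """First principal component of 2-D samples via the closed-form
    eigen-decomposition of the 2x2 covariance matrix [[a, b], [b, c]]."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    a = sum(x * x for x, _ in centered) / n
    c = sum(y * y for _, y in centered) / n
    b = sum(x * y for x, y in centered) / n
    # largest eigenvalue of the covariance matrix
    lam = (a + c + math.sqrt((a - c) ** 2 + 4 * b * b)) / 2
    if abs(b) < 1e-12:                       # already axis-aligned
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        vx, vy = b, lam - a                  # eigenvector for lam
    norm = math.hypot(vx, vy)
    v = (vx / norm, vy / norm)
    scores = [x * v[0] + y * v[1] for x, y in centered]  # projections
    return v, scores
```

In a real seismic workflow the same idea extends to dozens of attributes, where iterative eigensolvers replace the closed form, and the projection scores feed the SOM or the 2D color map described above.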

2) Drilling and Completion
Analysis of the huge amount of data generated during drilling has a great impact on structuring the pipeline, detecting failures, and ensuring safety during drilling operations. Manual processing of large amounts of real-time data limits production potential; this process can be advanced by developing data-driven models that intensify resource utilization and increase production at wellheads [5]. Big data analytics and ML have become a current endeavor to improve the added value of drilling [40]. By utilizing well data collected by sensors, drilling models can be evaluated and geologic estimation of drilling procedures becomes possible [65]. Besides, early detection of anomalies affects penetration and helps avoid undesired events such as kicks and blowouts. The quality of the generated data must also be evaluated to prevent misuse of drilling data and avoid future calamities resulting from wrong decisions [83]. Through optimization, drilling operations can be integrated with logging [34]. Duffy et al. identified and standardized best practices by incorporating an automated drilling-state detection and monitoring service that speeds up production and boosts rig performance. Rig activity data can be transformed into meaningful crew performance patterns after classifying the data and integrating the rigs' operational data; once the most efficient crews are identified, their working procedures can be observed and adopted by the other crews [65].
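Automated drilling-state detection of the kind Duffy et al. describe is, at its simplest, a classifier over surface sensor channels. The sketch below is a toy rule-based version with invented states and thresholds (real systems learn these from labeled rig data), followed by the per-crew aggregation step that turns classified samples into performance patterns.

```python
from collections import Counter

def drilling_state(rpm, flow_gpm, bit_on_bottom):
    """Toy rule-based drilling-state detector (states and thresholds are illustrative)."""
    if rpm > 0 and flow_gpm > 0 and bit_on_bottom:
        return "rotary drilling"
    if flow_gpm > 0 and not bit_on_bottom:
        return "circulating"
    if rpm == 0 and flow_gpm == 0:
        return "idle"
    return "other"

def crew_activity_summary(samples):
    """Aggregate per-sample states into time-in-state counts for each crew,
    the kind of pattern used to compare crew performance."""
    summary = {}
    for crew, rpm, flow, on_bottom in samples:
        summary.setdefault(crew, Counter())[drilling_state(rpm, flow, on_bottom)] += 1
    return summary
```

Given one sample per time interval, the resulting counts approximate how each crew spends rig time, which is the basis for identifying and propagating best practices.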

3) Reservoir Management
Big data provides an opportunity for reservoir engineers to monitor and sort reservoir simulation results. The explanation, system design, and prediction of reservoir simulation parameters rely heavily on the analysis of stratigraphic rock [30], [115]. However, it is very difficult to digitalize 3D seismic data and estimate relative permeability parameters and bottom-hole pressure, which restricts the monitoring task of reservoir engineers [131]. Integration of cloud computing with big data enables real-time optimization of reservoir parameters, e.g., gas lift, water injection formation, water displacement spacing, and pattern [75]. Xiao and Sun applied a big data application model to predict reservoir dynamics by assimilating all the data in the reservoir and production system, allowing reservoir engineers to monitor continuously; they then established relationships between different systems by combining connected nodes and boundary conditions [131]. However, as there are so many reservoir parameters and varied patterns, more efficient ways of finding related features and boundary conditions should be investigated. To understand induced and natural fracture features in the subsurface and their effect on fluid flow and transport, Udegbe et al. [123] adopted a face-detection approach using the cascade AdaBoost algorithm to discover patterns of fracture properties, classifying shale gas production data with pattern recognition applied to vectorized 1D image data. The performance of the proposed method was evaluated on hydraulically fractured wells. However, as ML models need to be trained with features extracted from production data, it is difficult to process and extract significant features to train the model in real time.

4) Production Engineering
Production in the oil and gas industries is heavily impacted by various kinds of damage, e.g., downhole casing damage, water injection issues, and the life cycle of oil, gas, and water wells. Song and Zhou [78] adopted a method to predict casing damage by applying PCA for dimensionality reduction and then a gradient boosting decision tree for supervised classification. They proposed a three-node Spark big data platform to collect test datasets from an oil field in mid-east China containing data on 446 wells, of which 352 were undamaged and 94 had casing damage. In the data extraction process, they selected the 10 parameters most responsible for casing damage, such as casing outer diameter, wall thickness, density of the flow path between the near reservoir and the wellbore, sand layer bottom, and sand layer top. The dimensionality of the parameters was then reduced to 5 by PCA, and the risk of casing damage was assessed using the gradient boosting decision tree algorithm. However, the experiments indicated that deficiencies in the collected dataset, such as absent significant parameters and missing values, negatively impacted model performance. Ockree et al. [98] discussed many ML and data mining algorithms and their performance in classifying and predicting well production. To demonstrate the working procedure of an ML model, they applied a random forest (RF) algorithm to a well-production dataset. For data sampling in the preprocessing step, they used a bootstrapping algorithm that randomly sampled data without regard to duplicates; the bootstrapping algorithm was used to replicate sample data, and in this work the authors utilized approximately 1,000 replicate wells' data. Cadei et al. developed a model that can forecast H2S trespassing events and provide a rapid and broad solution by analyzing the root cause for early troubleshooting of the fault [17].
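The gradient boosting step in such a pipeline can be illustrated with a deliberately minimal version: decision stumps fitted sequentially to the residuals of a squared-loss objective over a single feature. This is a toy sketch of the boosting idea only, with invented data; it is not Song and Zhou's implementation, which operated on PCA-reduced multi-parameter well data on Spark.

```python
def fit_stump(x, residuals):
    """Best threshold split of a single feature minimizing squared error."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    return best[1], best[2], best[3]

def gbdt_fit(x, y, rounds=10, lr=0.5):
    """Gradient boosting on squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lmean, rmean = fit_stump(x, residuals)
        stumps.append((t, lmean, rmean))
        pred = [p + lr * (lmean if xi <= t else rmean)
                for p, xi in zip(pred, x)]
    return base, lr, stumps

def gbdt_predict(model, xi):
    base, lr, stumps = model
    return base + sum(lr * (lm if xi <= t else rm) for t, lm, rm in stumps)
```

With binary labels (0 = undamaged, 1 = damaged), thresholding the prediction at 0.5 gives a classifier; production libraries extend this with multi-feature trees, regularization, and shrinkage schedules.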
They also discussed the effectiveness of various ML models as binary classifiers, such as logistic regression, decision trees, and neural networks, for forecasting H2S trespassing events, finding that neural networks achieve high accuracy while logistic regression and decision trees improve the transparency of the forecast. For feature extraction and model training, they used three types of abnormal events isolated from raw data: plant shutdown, gas sweetener shutdown, and PI recording failure.

5) Pipeline monitoring and maintenance
The pipeline is one of the most important components of the oil and gas industries, as a huge percentage of petroleum is transported through pipelines [82]. According to the Canadian Energy Pipeline Association (CEPA) [134], pipelines transport 97% of natural gas and crude oil in Canada. Despite being considered one of the most reliable and economical ways of transporting oil and gas, pipelines are frequently affected by anomalies such as corrosion, cracks, and dents. Various ML techniques have been used to detect such anomalies in many studies. Layouni et al. [82] discussed numerical and non-numerical techniques for determining the length, depth, and location of metal-loss defects in oil and gas pipelines, and proposed a metal-loss detection method that applies pattern-adapted wavelets and two ML algorithms, i.e., artificial neural networks and linear regression, to magnetic flux leakage data collected from pipeline scans. From 1,300 data items they selected maximum magnitude, peak-to-peak distance, mean average, standard deviation, and integral of the normalized signal, as defect depth heavily depends on these features.
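The five features named above are straightforward to compute from a raw magnetic-flux-leakage trace. The sketch below is one plausible reading of those feature definitions (in particular, "peak-to-peak distance" is interpreted here as the index gap between the maximum and minimum samples), not Layouni et al.'s exact code; the sample trace is invented.

```python
import statistics

def mfl_features(signal, dt=1.0):
    """Hand-crafted features of a magnetic-flux-leakage trace for
    defect-depth estimation (feature definitions are our interpretation)."""
    max_magnitude = max(abs(s) for s in signal)
    # index gap between the largest and smallest samples
    p2p = abs(signal.index(max(signal)) - signal.index(min(signal)))
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    # integral of the amplitude-normalized signal (trapezoid-free Riemann sum)
    integral = sum(s / max_magnitude for s in signal) * dt
    return {"max_magnitude": max_magnitude, "peak_to_peak": p2p,
            "mean": mean, "std": std, "integral": integral}
```

Such a feature vector per scan window is what then feeds the neural network or linear regression for depth estimation.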
A summary of selected papers related to the minerals industry that applied big data technologies is given in Tables 7 and 8. The complete process of mineral big data acquisition, processing, and application, including the corresponding technologies, is depicted in Fig. 9. A category-wise list of previous works in the minerals industry that embraced big data techniques is given in Table 9.

C. STATE-OF-THE-ART BIG DATA TECHNOLOGIES IN THE MANUFACTURING INDUSTRY
Highly available, efficient, and diverse sensors have aided the revolutionary transformation towards automation by supervising process functionality, tools, machines, quality evaluation, fault prediction, and so on [1]. Through the effective use of sensors and data fusion techniques, significant knowledge can be extracted and further used for business growth.

1) MANUFACTURE DATA PROCESSING, SECURITY, AND TRANSMISSION
Manufacturing big data generally includes machine- and tool-related data, operational data, business enterprise data, and external data. Wei et al. [71] developed the architecture of a service-oriented manufacturing data access model and discussed its functionalities. The platform provides interfaces for various purposes such as web services, database resetting, file upload, online data entry, video access, and web crawlers. Those responsible for data access configure the access mode based on the requirements. Structured, unstructured, and semi-structured data can be transferred between the data center and the service center with the help of a data service interface. The service center comprises three modules, i.e., data collection units, data service units, and data distribution units, and is connected to a distributed big data storage system through a data bus channel. Liu et al. proposed a manufacturing model for the hydrostatic bearing system incorporating cloud and big data techniques [64]. The architecture and process scheduling techniques of the manufacturing industry on a cloud service platform have been investigated in the context of big data analytics in [51].
It is crucial to ensure data security when capturing and transmitting data to and from the cloud computing environment. Data access should be controlled by authentication mechanisms so that data collected from one plant is not visible to people at other plants [69], and transmission should be secured with strong encryption algorithms.
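A minimal building block for such authenticated transmission is a keyed message authentication code: each plant signs its records with a shared secret so the cloud side can verify both origin and integrity. The sketch below is illustrative only (the key name and record format are invented); a full design would add per-plant key management, encryption of the payload, and replay protection.

```python
import hashlib
import hmac

PLANT_KEY = b"plant-a-shared-secret"  # hypothetical per-plant key

def sign_record(plant_id, payload):
    """Tag a sensor record so the receiver can verify origin and integrity."""
    msg = plant_id.encode() + b"|" + payload
    tag = hmac.new(PLANT_KEY, msg, hashlib.sha256).hexdigest()
    return msg, tag

def verify_record(msg, tag):
    """Recompute the HMAC and compare in constant time."""
    expected = hmac.new(PLANT_KEY, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)
```

Because verification requires the plant's key, records from one plant cannot be forged or silently altered by another, which is the access-isolation property the paragraph above calls for.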

2) MANUFACTURE STATE MONITORING
As different machinery varies in internal structure and working process, diverse techniques have been adopted to monitor machine states and performance and to detect faults [124]. Kumar et al. developed a cutting-tool health-state monitoring and estimation method to aid automatic diagnostics, maintenance, and prediction using a polynomial regression model; the model was built on sequential clustering and applied to unlabeled time-series sensor signals [56], [72]. An inspection-replacement policy has been proposed by Zhang et al. [127] as a maintenance strategy for heterogeneous populations: inspections are conducted at an early stage to find and replace defective products, and at a later stage preventive replacement is performed to avoid wear-out failures. Wang [18] applied hidden Markov models (HMMs) to tool state detection in machining processes. To tune the proposed model, feature vectors were extracted from signal variations using a codebook built for vector quantization of the extracted features. Baek et al. developed a system to monitor the operational states of systems and detect faults through the identification of significant sensor signals using statistical variance and the Fisher criterion [52]. Wang et al. [13] emphasized the crucial issue of a cost-effective interval between the critical level and monitoring in condition-based maintenance. The authors investigated the relationship between the critical level and the monitoring interval to reduce cost and downtime, and proposed a regression model based on the random coefficient growth model, assuming that the coefficients follow probability density distributions.
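HMM-based tool-state detection of the kind Wang describes reduces, at inference time, to decoding the most likely hidden state sequence from quantized sensor observations, i.e., the Viterbi algorithm. The sketch below is a generic Viterbi decoder with a two-state "sharp"/"worn" tool model whose probabilities are entirely invented for illustration; Wang's model, codebook, and parameters differ.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state path for a sequence of discretized observations."""
    trellis = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            prob, prev = max(
                (trellis[-1][p][0] * trans_p[p][s] * emit_p[s][o], p)
                for p in states)
            row[s] = (prob, prev)
        trellis.append(row)
    state = max(trellis[-1], key=lambda s: trellis[-1][s][0])
    path = [state]
    for row in reversed(trellis[1:]):   # backtrack through stored predecessors
        state = row[state][1]
        path.append(state)
    return list(reversed(path))

# Invented two-state tool model: vibration level is vector-quantized to "low"/"high".
states = ["sharp", "worn"]
start_p = {"sharp": 0.9, "worn": 0.1}
trans_p = {"sharp": {"sharp": 0.8, "worn": 0.2},
           "worn":  {"sharp": 0.05, "worn": 0.95}}
emit_p = {"sharp": {"low": 0.9, "high": 0.1},
          "worn":  {"low": 0.2, "high": 0.8}}
path = viterbi(["low", "low", "high", "high"], states, start_p, trans_p, emit_p)
```

The near-absorbing "worn" state encodes that tool wear is effectively irreversible, so once high-vibration observations dominate, the decoder commits to the worn state.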
Besides, in the manufacturing industry, manual human effort is being reduced by intelligent agents. For example, Liu et al. [135] proposed a hierarchical-structure-model welding manufacturing system that introduced a leader-following multi-agent robot for intelligent welding manufacturing. Big data collected using IoT and sensor technologies helps to train the agent via reinforcement learning.

3) PRODUCT QUALITY ASSESSMENT
Batch processing plays a vital role in various production processes, and a massive amount of data is generated from it. Because measurements of each process variable are collected over time for every batch, the data forms a three-dimensional array (batch × variable × time). Assessing product quality in individual batches is important to find faults in the processing systems, evaluate product quality, and ultimately ensure sound production growth.
The assessment of product quality by multivariate classification of collected data has been investigated by Garcia-Munoz [41]. To reduce the dimensionality of the data, they used PCA; after removing 12 outliers, a final two-component PCA model was found that showed separate spectra for good- and bad-quality products. MacGregor et al. [95] also used latent-variable regression models such as PCA and projection to latent structures/partial least squares (PLS) for analysis, monitoring, optimization, and control in batch processes of the process industry. By projecting process data into a low-dimensional latent-variable space, these dimensionality reduction methods can handle highly correlated multivariate process data. Latent variables were also used to control batch products in [53], for product analysis and design in [49], for monitoring, testing, and performance measurement of products in [23], and for optimization in [28].

1) Production Management
In the manufacturing industry, enhancing system efficiency to increase production growth and optimizing processes are the most crucial challenges in the age of globalization. Li et al. [51] followed a data analytics approach to discover an optimal strategy for workload management and proposed a novel scheduling algorithm based on a cloud manufacturing service platform [94]. Tao et al. [121] also proposed a similar cloud manufacturing service system based on IoT and cloud computing.

2) Product Anomaly Detection
To increase successful production in the manufacturing industry, anomaly or fault detection in products is indispensable, and ML and DL techniques are now widely used for it. Jiang et al. [136] proposed a supervised anomaly detection model using YOLOv3, in which anomalous products are identified from balanced image data; they also proposed a semi-supervised anomaly detection model using Fast-AnoGAN, in which new images are generated by a trained WGAN-GP model. Zhang et al. [99] detected anomalies in product quality inspection using a Gaussian restricted Boltzmann machine, which can handle high-dimensional and highly imbalanced distributions of product data.
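The common thread in these methods is fitting a model of "normal" production data and flagging deviations. As a minimal baseline illustrating that idea (far simpler than the deep models above, and not any cited paper's method), one can fit a per-sensor Gaussian from in-control measurements and flag samples beyond k standard deviations; the measurement values here are invented.

```python
import statistics

def fit_gaussian(measurements):
    """Fit a per-sensor Gaussian from in-control inspection measurements."""
    return statistics.fmean(measurements), statistics.pstdev(measurements)

def is_anomalous(x, mu, sigma, k=3.0):
    """Flag a product measurement more than k standard deviations from the mean."""
    return abs(x - mu) > k * sigma

# Hypothetical in-control readings of one quality dimension (e.g., part width in mm)
mu, sigma = fit_gaussian([10.0, 10.1, 9.9, 10.05, 9.95])
```

GAN- and RBM-based detectors generalize this by learning a much richer density over high-dimensional (e.g., image) data, but the decision rule is conceptually the same: low likelihood under the normal model means anomaly.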

3) Supply Chain Management
Performance measurement plays a significant role in manufacturing supply chain management for measuring the efficiency of a system. With rapidly changing goals and limited personnel, measuring individual performance in a short time is very challenging for organizations. High-level networked sensors such as RFID-enabled sensors offer real-time data acquisition for production logistics control in supply chain management. Supply chain visibility can enhance predictive quality, inventory monitoring, and customer service by tracking lot sizes and production distribution [67]. Large companies like Amazon and Walmart mine their clients' data for product promotion [61], visualizing supply and demand signals between retail stores and suppliers and optimizing supply chain decisions.
A summary of some selected papers related to the manufacturing industry that applied big data technologies is listed in Tables 10 and 11. The complete process of manufacturing big data acquisition, processing, and application, including the corresponding technologies, is depicted in Fig. 9. A category-wise list of the previous works in the manufacturing industry that embraced big data techniques is listed in Table 12.

V. BIG DATA OPEN RESEARCH CHALLENGES IN THE INDUSTRY

A. INDUSTRY DATA QUALITY ASSESSMENT
The quality of collected data heavily impacts data analysis tasks. Hence, assessing the quality of captured data is crucial, which in turn calls for data cleaning and filtering of missing or invalid values. Precise evaluation metrics for big data quality assessment have not yet been defined, and proper guidance is lacking; the current concept of big data does not define its quality criteria. Big data quality depends on data type, format, features, domain, and so on [39], [113]. Moreover, big data frameworks that support rapid integration of big data from multiple industries are yet to be developed, and rapid quality assessment frameworks for massive amounts of industry data are needed. Therefore, the research questions addressing the existing challenges in the quality assessment of industry data are as follows:
• How to define a big data quality assessment model including proper evaluation criteria?
• Can we develop a big data framework for the acquisition, storage, and processing of mixed structured data?
• How to examine the applicability of an assessment model for a particular industry?
• How can we develop appropriate big data infrastructure to accelerate the quality assessment process of the high-velocity industry data?
• How can we develop big data frameworks that would integrate and assess data quality across multiple industries when necessary?
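One direction for the first question above is to operationalize quality as a weighted combination of measurable indicators. The sketch below scores a batch of records on two illustrative indicators, completeness and validity; the field names, ranges, and weights are invented, and a real assessment model would add further indicators (timeliness, consistency, accuracy) per the criteria discussed here.

```python
def quality_score(records, required_fields, valid_ranges, weights):
    """Weighted combination of two illustrative indicators: completeness
    (all required fields present) and validity (values within range)."""
    n = len(records)
    completeness = sum(
        all(r.get(f) is not None for f in required_fields) for r in records) / n
    validity = sum(
        all(lo <= r[f] <= hi
            for f, (lo, hi) in valid_ranges.items() if r.get(f) is not None)
        for r in records) / n
    return weights["completeness"] * completeness + weights["validity"] * validity

score = quality_score(
    [{"voltage": 230.0}, {"voltage": None}, {"voltage": 9000.0}],
    required_fields=["voltage"],
    valid_ranges={"voltage": (0.0, 500.0)},
    weights={"completeness": 0.5, "validity": 0.5},
)
```

Defining a weight coefficient per indicator, as recommended later in this paper, lets the same scoring skeleton be tuned to each industry domain.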

B. INDUSTRY DATA FUSION AND CLEANING
Depending on the industry domain, the type and dimensions of collected or stored datasets vary widely. Therefore, data fusion frameworks need to be flexible in both type and dimension. Besides, many works such as [84] focused on frameworks to store the data but lacked a clear discussion of how ML techniques perform data cleaning and preparation for data mining; the performance of the ML models has not been evaluated or optimized, and the effectiveness of heterogeneous data fusion and storage solutions for practical industry systems has not been demonstrated. Therefore, the research questions addressing the existing challenges in the fusion and cleaning of industry data are as follows:
• How can we develop a big data fusion framework that is flexible with respect to data type and storage format?
• How can model performance be evaluated to facilitate further optimization?
• How to utilize the cleaned and fused heterogeneous multi-source industry data in solving current problems to prove the efficiency and accuracy of the data cleaning framework?

C. INDUSTRY DATA COMMUNICATION, PRIVACY, AND SECURITY
One of the great challenges in the communication and transmission of industry data is noise. Previous research on noise detection using traditional clustering algorithms is not efficient enough for big data. Ensuring data security for users and systems is another major problem: when users' data is utilized for data analytics, user privacy must be ensured. Therefore, the research questions addressing the existing challenges of communication, privacy, and security of industry data are as follows:
• Can we develop a more efficient noise-protective big data architecture?
• How to detect noisy data using advanced data mining or ML algorithms?
• How can an advanced big data architecture identify and manage user demand?
• How can big data infrastructure ensure the privacy of systems and users when data is used for other purposes?
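A simple privacy-preserving building block relevant to the last question is pseudonymization: replacing identifiers with stable, salted hashes so records can still be joined for analytics without exposing real identities. The sketch below is a minimal illustration with an invented salt; deterministic pseudonyms remain linkable across records, so stronger guarantees (k-anonymity, differential privacy) would be needed in practice.

```python
import hashlib

OPERATOR_SALT = b"operator-held-secret-salt"  # hypothetical secret kept by the data owner

def pseudonymize(user_id):
    """Replace a user/meter identifier with a stable pseudonym so records can be
    joined for analytics without exposing the real identity."""
    return hashlib.sha256(OPERATOR_SALT + user_id.encode()).hexdigest()[:16]
```

Keeping the salt secret prevents an outside analyst from reversing pseudonyms by hashing guessed identifiers, which is the main weakness of unsalted hashing.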

D. DISTRIBUTED INDUSTRY DATA MINING
Industry big data varies greatly from one industry to another. For instance, the power and manufacturing industries produce lower-dimensional sequence data compared to the high-dimensional geo-location maps used in the minerals industry. Therefore, generic distributed mining platforms will not serve industry-specific purposes. In addition, as ML and deep learning models are black-box models, their decision-making process is very difficult to understand; as a result, optimization of the models becomes hard and is limited to changes of neural models' hyperparameters.
Therefore, the research questions addressing the existing challenges in distributed mining of industry data are as follows:
• How to develop a distributed, domain-specific industry data mining platform using big data technologies?
• How can we interpret neural models to ensure reusability and optimization?

E. PRODUCT QUALITY ASSESSMENT
Industry data is becoming highly non-linear and high-dimensional. Linear dimensionality reduction methods like PCA can present only the few most relevant components for analysis, so they have become inadequate for extracting rich emergent behavior from massive datasets. Besides, consumers' experiences with a product are very significant for measuring its current quality and guiding future improvements. Therefore, the research questions addressing the existing challenges in product quality assessment from industry data are as follows:
• Which big data are significant for understanding the dynamic behavior of the manufacturing process, and how can we capture the non-linear relationships among parameters?
• How can we develop a pipeline between product consumers' demands and experiences on the one hand, and product assessment and quality enhancement on the other?

1) INDUSTRY SYSTEM MONITORING AND FAULT DETECTION
Industry systems need continuous monitoring to detect faults, and this can be done through big data analysis. One of the most important and challenging tasks of data analysis is feature extraction. In almost all previous research, feature extraction from industry data has been done extensively by applying ML or DL models. However, the causal correlations between features remain hidden; these latent relationships may explain the cause of a fault or anomaly. Therefore, the research question addressing the existing challenges in monitoring and identifying faults in systems is as follows:
• How to discover correlation/causation rules from industry data?

2) MINERALS EXPLORATION AND DRILLING
To date, only limited monitoring of fracture jobs has been performed using seismic events. The time lag between visualization and interpretation of micro-seismic events restricts the ability to respond in real time. Because of the narrow utilization of vast amounts of data, significant patterns remain unexplored and are hardly used for future task modeling and decision-making. Besides, practical implementation of most of these proposed ideas is yet to occur in the mineral industry due to natural risk and huge cost. Therefore, the research questions addressing the existing challenges in the monitoring of seismic events for exploration and drilling are as follows:
• How to enhance technology for continuous monitoring of seismic events?
• How can we develop rapid visualization and interpretation of micro-seismic events and take real-time emergency measures?
• How to increase the use of seismic data to explore more latent features and thus offer an efficient data-driven solution to exploration?
• How can we implement the proposed drilling ideas while avoiding environmental harm and estimating cost?

3) RESERVOIR AND PRODUCTION MANAGEMENT
Datasets are a crucial component in the training of data mining and ML models, and improper or insufficient data hinders data-driven solutions. Therefore, more data should be collected using big data technologies to advance mineral data analytics, and a proper framework should be developed for collecting good-quality data and assessing mineral industry data quality. Besides, data analytics requires ML models to be trained on production data with extracted features. As system monitoring is a continuous real-time process, it is very challenging to collect data, pre-process it, extract significant features, and train ML models in a real-time industry environment. In addition, due to a lack of data, some authors [98] replicated small datasets. However, replication, repetition, and synthesis of datasets have severe drawbacks for the training of ML models: models learn the same features repeatedly, which unnecessarily increases training time and memory use. We understand that models need a sufficient amount of training data, so duplicating an insufficient dataset can be justified; still, the negative effect of duplication cannot be ignored. More data also adds more varieties of patterns and can make trained models more efficient. Therefore, more original operational and production data from the oil and gas industries needs to be collected for data analytics.
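The duplication drawback is easy to see with a plain bootstrap: resampling a dataset with replacement to its original size leaves only about 63% of the items unique in expectation, so roughly a third of the "new" data is repeated. The sketch below demonstrates this on an invented list of well identifiers; it illustrates the general resampling idea, not Ockree et al.'s specific replication procedure.

```python
import random

def bootstrap_sample(wells, seed=0):
    """Resample with replacement to the original size; duplicates are expected,
    which is exactly the drawback of replicating a small dataset noted above."""
    rng = random.Random(seed)
    return [rng.choice(wells) for _ in wells]

original = [f"well-{i}" for i in range(100)]
replica = bootstrap_sample(original)
unique_fraction = len(set(replica)) / len(original)  # about 0.63 in expectation
```

The expected unique fraction follows from each item being missed with probability (1 - 1/n)^n ≈ 1/e, underscoring why replicated data adds volume but little new pattern variety.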
Forecasting or classification using neural network models is a black-box method that lacks transparency and interpretability, while rapid and clear interpretation and visualization of results are crucial for analysis. Therefore, the research questions addressing the existing challenges in monitoring mineral production are as follows:
• How can we get enough good-quality datasets for various big data analytics purposes?
• How can we extract features and discover dynamically latent relationships between features, reservoir, and production system?

VI. DISCUSSION AND FUTURE RECOMMENDATIONS
The industry sectors have been revolutionized by big data technologies, creating unprecedented socio-economic development. In this review, we discussed the leading big data techniques in industry and explored the underlying challenges to draw the attention of big data researchers and data analysts. We also recommend possible solutions to the big data research challenges so that future researchers can build on them.
• Big data architectures have been proposed to gather industry data and utilize it for several purposes. Missing values, noisy data, etc. in collected data negatively impact the later processing and use of this invaluable data; therefore, data quality assessment is an urgent necessity. Currently proposed data quality assessment models are not precise and demand more guidance. A big data quality assessment model should define (i) data type, format, and domain; (ii) the dimensions of data used for mapping quality; (iii) the quality metrics to consider; (iv) attribute or feature evaluation techniques; (v) data sampling strategies; and (vi) assessment techniques. Non-relational databases, e.g., NoSQL, may be used to store heterogeneous data, and empirical big data quality assessment methods can be investigated for individual industry domains. A weight coefficient must be defined for each assessment indicator. With big data frameworks like Spark and Storm collecting high-volume data, high-speed processing memory and algorithms have to be developed. State-of-the-art big data integration frameworks can be designed to collect data from multiple industries so that a single assessment model can be applied.
• Data cleaning and pre-processing are closely related to data quality assessment. Common data cleaning algorithms are used for typical industry data but do not consider high-dimensional data such as 3D or 4D seismic data, geographical location maps, and so on; an advanced big data framework for high-dimensional mineral data storage and processing therefore needs to be developed. The performance of proposed data fusion and cleaning algorithms needs to be measured in real-time industry environments. Advanced data cleaning algorithms may help to clean noisy data during transmission in the power industry, as power data is susceptible to noise. Semi-supervised self-learning clustering algorithms can be developed to assist in the global clustering of power system data. Besides, researchers should consider efficient noise-protective big data frameworks, including big data sampling techniques, to remove noise from smart-grid-transmitted datasets.
• APIs and middleware can transfer energy data between the smart grid and users [70]. By analyzing such data, users' demands can be identified and power systems can be operated to serve increased demand. However, the privacy of system and user data is a crucial challenge. Advanced multi-layer secured big data frameworks may help prevent potential network attacks, and anonymity algorithms may help hide the identities of individual users while their electricity consumption data is in use.
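To make the anonymity suggestion concrete, one simple building block is pseudonymisation: replacing user identifiers with salted one-way hashes so that usage patterns remain analysable without exposing identities. This is only a sketch under simplifying assumptions; a production system would manage the salt as a secret and would additionally consider stronger guarantees such as k-anonymity or differential privacy.

```python
# Illustrative pseudonymisation of consumption records: identifiers
# are replaced by salted one-way hashes. Salt handling here is a
# simplification; real systems must store salts/keys securely.
import hashlib

def pseudonymise(records, salt):
    out = []
    for r in records:
        token = hashlib.sha256((salt + r["user_id"]).encode()).hexdigest()[:16]
        out.append({"user": token, "kwh": r["kwh"]})
    return out

records = [{"user_id": "alice", "kwh": 12.4}, {"user_id": "alice", "kwh": 9.8}]
anon = pseudonymise(records, salt="s3cr3t")
print(anon[0]["user"] == anon[1]["user"])  # same user maps to the same token
```

The same-token property is what keeps per-user demand analysis possible after the real identity has been removed.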
• Higher-dimensional, multi-source industry data poses another major challenge. Recent advances in deep learning offer dimensionality reduction methods for nonlinear data, e.g., the autoencoder. Autoencoders are unsupervised neural networks that can discover reduced-dimension feature representations, and they can also be trained for forecasting. Moreover, correlations in the data can be explored by analyzing decision rule conditions and their outcomes, measuring relevance scores, layer-wise propagation, and so on. Therefore, to apply ML algorithms to big data, existing big data frameworks must be extended with a supportive processing system so that algorithms can be trained on huge volumes of data and learn its patterns.
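The autoencoder idea can be illustrated at toy scale: a single linear hidden unit trained by gradient descent learns to compress 2-D points lying on a line into one dimension and reconstruct them. This is a deliberately minimal plain-Python sketch on made-up data; real industry pipelines would use a deep learning library and nonlinear, multi-layer autoencoders.

```python
# Toy linear autoencoder: 2-D inputs on the line (x, 2x) are encoded
# to one hidden value h and decoded back. Data and hyperparameters
# are illustrative assumptions, not from any industry dataset.
data = [(x, 2 * x) for x in [i / 10 for i in range(-10, 11)]]

w1, w2, v1, v2 = 0.1, -0.1, 0.1, 0.1   # encoder and decoder weights
lr = 0.05
for _ in range(1000):
    for x1, x2 in data:
        h = w1 * x1 + w2 * x2           # encode: 2-D -> 1-D
        r1, r2 = v1 * h, v2 * h         # decode: 1-D -> 2-D
        e1, e2 = r1 - x1, r2 - x2       # reconstruction errors
        gh = e1 * v1 + e2 * v2          # backprop through the decoder
        v1 -= lr * e1 * h
        v2 -= lr * e2 * h
        w1 -= lr * gh * x1
        w2 -= lr * gh * x2

mse = sum((v1 * (w1 * x1 + w2 * x2) - x1) ** 2 +
          (v2 * (w1 * x1 + w2 * x2) - x2) ** 2
          for x1, x2 in data) / len(data)
print(round(mse, 4))  # near zero: the 1-D code preserves the data
```

The reduced code h plays the role of the compressed feature the bullet describes; with nonlinear activations and more layers the same training loop generalizes to the high-dimensional industry data discussed above.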
• The scarcity of datasets limits many big data innovation efforts. Big data analytics depends on large, good-quality, and sometimes labeled, datasets; such datasets greatly assist analytics methods in predicting or forecasting events. Data mining and ML algorithms are capable of real-time monitoring of seismic events from seismic data, so large open-source datasets must be built to advance mineral data analytics research. Through drilling simulation, virtual and physical data and parameters can be compared and uncertainty can be determined [43]; as a result, cost optimization is possible. Therefore, a data-driven simulator should be developed.
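The virtual-versus-physical comparison mentioned above can be summarised numerically: the discrepancy between simulated and measured parameter series gives an RMSE-style uncertainty estimate. The torque values below are made up purely for demonstration and do not come from [43].

```python
# Illustrative comparison of virtual (simulated) and physical drilling
# parameters; the discrepancy is summarised as an RMSE uncertainty.
import math

def uncertainty(virtual, physical):
    """Root-mean-square discrepancy between simulated and measured values."""
    n = len(virtual)
    return math.sqrt(sum((v - p) ** 2 for v, p in zip(virtual, physical)) / n)

virtual_torque  = [10.0, 10.5, 11.0, 11.4]   # simulator output (hypothetical)
physical_torque = [10.2, 10.4, 11.3, 11.1]   # sensor readings (hypothetical)
print(round(uncertainty(virtual_torque, physical_torque), 3))
```

A data-driven simulator would compute such discrepancies continuously and use them to recalibrate its models, which is where the cost-optimization benefit arises.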
• Along with industry data, social media reviews are a primary source of consumer data and play a vital role in user and business analytics. Using advanced big data frameworks, users' experiences, comments, etc. can be collected, and ML algorithms can efficiently discover how users assess a product. The manufacturing industry should pay close attention to consumer reviews and to collecting demand information, which may help increase production growth.
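A very small lexicon-based scorer conveys the idea of mining product assessments from review text. The tiny word lists are illustrative stand-ins for a trained sentiment model, which is what a real analytics pipeline would use.

```python
# Minimal lexicon-based sentiment sketch for consumer reviews.
# The word lists are hypothetical placeholders for a trained model.
POSITIVE = {"good", "great", "reliable", "durable"}
NEGATIVE = {"bad", "broken", "faulty", "poor"}

def review_score(text):
    words = text.lower().replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = ["Great product, very durable", "Arrived broken, poor quality"]
scores = [review_score(r) for r in reviews]
print(scores)  # positive review scores 2, negative review scores -2
```

Aggregating such scores over millions of collected reviews is exactly the kind of workload the big data frameworks discussed in this survey are meant to support.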
• Finally, there is an urgent need to build an open-source database of industry data. The integration of ML algorithms with big data technologies can then provide efficient and dynamic means of capturing latent features of reservoir parameters and data, and the correlations between them. Advanced ML techniques such as PCA, latent Dirichlet allocation (LDA), and various types of autoencoder offer not only feature extraction and covariance analysis but also dimensionality reduction and data labeling facilities. Explainable AI tools, e.g., local interpretable model-agnostic explanations (LIME), Shapley additive explanations (SHAP), Facets, and the What-If Tool, can help properly explain predictions and assist the process of model optimization.
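As a taste of the SHAP-style explanations mentioned above: for a linear model with independent features, the exact Shapley value of feature i reduces to w_i * (x_i - mean_i), so each feature's contribution to a prediction can be read off directly. The coefficients and data below are hypothetical; real applications would use the SHAP library on trained models.

```python
# Sketch of a SHAP-style explanation for a linear model: with
# independent features, feature i contributes w_i * (x_i - mean_i).
# The model weights, means, and instance are hypothetical.

def shapley_linear(weights, x, feature_means):
    return [w * (xi - m) for w, xi, m in zip(weights, x, feature_means)]

weights = [0.8, -0.5]   # hypothetical linear model coefficients
means   = [2.0, 4.0]    # background (training-set) feature means
x       = [3.0, 3.0]    # instance to explain
contrib = shapley_linear(weights, x, means)
base = sum(w * m for w, m in zip(weights, means))
# the contributions sum to (prediction - baseline), a key SHAP property
print(contrib, base + sum(contrib))
```

The additivity check in the last line is the property that makes such explanations trustworthy for model optimization: every prediction is fully accounted for by its per-feature contributions.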

VII. CONCLUDING REMARKS
This paper presented a review of papers related to big data technologies implemented in the power, mineral, and manufacturing industries. We also discussed the paper collection, selection, and assessment criteria applied before the review.
For high-quality paper collection, we searched the IEEE digital library, ACM digital library, SpringerLink, Elsevier, the Multidisciplinary Digital Publishing Institute (MDPI), Google Scholar, Wiley, etc., and filtered only good-quality papers for review based on our selection criteria. We proposed a taxonomy of applications of big data technologies in the power, mineral, and manufacturing industries, demonstrated on three levels. Along with this, we presented the year-wise distribution and frequency of the reviewed papers and the big data technologies used to acquire and process the massive amount of industry data. We also showed the frequency of ML and data mining techniques used for industry data processing. Then, we discussed state-of-the-art big data technologies that have been proposed to collect, store, manage, and analyze power, mineral, and manufacturing data. We investigated the existing big data research gaps in these industry sectors and tried to bridge them by recommending data-driven industry approaches. Big data quality assessment frameworks need to be upgraded for multi-dimensional big data to improve later processing and storage across industries. Moreover, a noise-cancelling big data framework may help avoid noise during the collection of transmitted data, which would further improve data quality. Besides, multi-layer security is required to protect customer and business data during business data analysis. While ML and data mining techniques have proved highly effective for industry data analysis, these tasks demand high-performance processing systems; therefore, existing big data frameworks need to be re-framed so that ML and data mining models can be trained on big data.
For industry automation, big data can play a vital role by training intelligent agents, thus reducing human effort, cost, and production time. Introducing intelligent agents is crucially significant, especially in hazardous industrial environments, to reduce the risk to human life.