Big Data Life Cycle in Shop-Floor–Trends and Challenges

Big data is defined as a large set of data that may be structured or unstructured. On the manufacturing shop-floor, big data incorporates data collected at every stage of the production process. This includes data from machines, connecting devices, and even manufacturing operators. The large size of the data available on the manufacturing shop-floor creates a need for tools and techniques, along with associated best practices, to leverage data-driven performance improvement and optimization. There is also a need for a better understanding of the approaches and techniques at the various stages of the data life cycle. In this work, the data life cycle in the shop-floor is studied with a focus on each of its components: data sources, collection, transmission, storage, processing, and visualization. A narrative literature review driven by two research questions is provided to study trends and challenges in the field. The selection of papers is supported by an analysis of n-grams. These are used to comprehensively characterize the main technological and methodological aspects and as a starting point to discuss potential future research directions. A detailed review of the current trends in the different data life cycle stages is provided. Finally, the existing challenges are discussed.

The associate editor coordinating the review of this manuscript and approving it for publication was Chong Leong Gan.

I. INTRODUCTION
The evolution of data storage and analysis has been a key factor in the development of manufacturing processes. Before the industrial revolution, low quantities of data were stored and were mostly transmitted verbally, which led to low production volumes and low-quality products. Thereafter, during the first industrial revolution, two kinds of data were recorded, i.e. machine and worker data. Worker data (attendance and performance) and machine data helped to improve productivity and maintenance, respectively. The mass production model introduced in the second industrial revolution also shifted the job of data processing to educated managers. Scientific methods and statistical models helped in all stages of manufacturing, from production planning to inventory management [1]. With the introduction of IT in manufacturing, computer systems, such as CAM and FEA, and information systems, such as MES and ERP, helped in product creation, process optimization, and management. The merging of data and manufacturing in the information age has helped in the shift from dedicated production to flexible production. The extension of IT with unified communication, i.e. ICT, further enhanced the role of data in manufacturing. The concept of SM has emerged as a new paradigm focused on responding in real time to constantly changing demands and conditions in factories, supply networks, and customer needs [2]. Three key SM technologies are: (i) CPS (physical assets integrated with computational capabilities), (ii) IoT (highly connected devices with embedded sensors), and (iii) big data [3]. The big data age has arisen with the massive use of mobile and smart devices, the great availability of IoT devices, and cloud computing, when traditional methods were no longer sufficient for adequate information processing [4].
In general, big data refers to the storage and analysis of data sets that are characterized by large volume and variety of sources, high velocity of generation and processing, and value generation from its analysis [5].
In the age of big data technologies, various data sources generate manufacturing data, which are collected from connected software solutions, sensors, and IoT devices. On a high level, manufacturing data may be categorized into management, equipment, user, product, operational, and process data [1], [6]. On a low level, manufacturing data may be categorized into structured, semi-structured, and unstructured data [7]. Structured data have clear relationships between their attributes and are the simplest data type to store and organize, usually represented as tables. Unstructured data comprise most manufacturing data, have no associated data model, and cannot be organized using tables or spreadsheets. Examples of unstructured data include images, audio, text, and video. Semi-structured data do not reside in relational databases but have an organizational structure that makes them easier to analyze. Examples of semi-structured data include XML, JSON, and HTML.
The collection and processing of data in the shop-floor is critical, as most manufacturing operations are carried out there. The advent of IoT and new industrial protocols has supported the acquisition of information from manufacturing cells, products, transport systems, and people [8]. Thus, many data-driven SM applications have emerged recently, e.g., smart design, smart planning and process optimization, material distribution and tracking, process monitoring, quality control, and smart equipment maintenance [1]. These SM applications rely on transforming primary data into information, making manufacturing processes more intelligent. Examples of shop-floor data include energy consumption, quality test, equipment status, equipment parameters, resource loading, delivery time, and material data [1]. However, despite the benefits foreseen from the usage and processing of data in the shop-floor, challenges in SM need to be considered.
The 5Vs characteristics of big data are widely acknowledged as challenges of big data in manufacturing: (i) volume (level of data size), (ii) velocity (ingesting or processing big data in streams or batches, in real time or non-real time), (iii) variety (dealing with complex big data formats, schemas, semantic models and information), (iv) value (analysing data to deliver added value to some events), and (v) veracity (validating data consistency and trustworthiness) [9]. In addition, cybersecurity is an important aspect in manufacturing. Since big data platforms connect physical spaces with cyber spaces, the consequences of neglecting cybersecurity might swiftly spread to the physical parts of manufacturing systems [7].
The influx of big data generated by a multitude of production systems (data sources) on the shop-floor complicates decision making. In addition to the multiple data sources, the varied transmission protocols and storage requirements of production systems on the shop-floor further complicate decision making. As such, the increasing size of data on the shop-floor requires accurately classifying data for reliable decision making. This study aims to develop a homogeneous approach to gathering and utilizing data on the shop-floor in manufacturing environments, based on influences and insights of a literature review. Therefore, the complete data life cycle is reviewed.
A need for reviewing the data life cycle in the shop-floor is identified, as research in this field has focused on other aspects of big data, i.e. applications, manufacturing systems and processes, decision making, economics, supply chain, business management, and product life cycle (see Table 1). This work focuses mainly on the big data life cycle in the shop-floor, where the increasing complexity of data life cycle management requires a detailed review. The effective use of data sources for generating big data for objective completion is studied. Needs, requirements, and methods for data collection and data transmission are also reviewed. Special focus is given to homogenising the acquired data, as multiple production systems operate on several protocols and technologies, generating heterogeneous data. Thereafter, data storage, data processing, and data visualisation applied to the shop-floor in manufacturing are reviewed. Finally, the review builds on the aforementioned aspects of the data life cycle to elaborate on data application.
This contribution leverages the data life cycle for capturing big data in shop-floor. Specifically, the suitability and adaptation of big data life cycle to shop-floor in manufacturing is the main goal in this contribution. This study, addressing the need for big data on shop-floor, establishes the approach for data acquisition, processing, and utilisation for decision making. Challenges towards real-time data-driven manufacturing are also elaborated.
The rest of this paper is organized as follows. Section II presents the data life cycle to establish a uniform terminology for big data in the shop-floor. Section III presents the methodology used to understand current trends and future challenges of big data in the shop-floor. Section IV presents the results of the literature review, based on the data life cycle presented in Section II. Section V presents a discussion of the results and existing challenges in implementing big data in the shop-floor. Finally, Section VI presents the conclusions, as well as an outlook on future work.

II. DATA LIFE CYCLE
Big data, and data in general, must be structured into specific content formats and contexts to be useful for users [16]. Big data is useful for automating processes in manufacturing, as it enables machines to communicate among themselves and enables users to extract information and knowledge. As such, research has focused on the data life cycle and how to extract knowledge from varied, heterogeneous data sources, enabling informed decision making. In this context, the data life cycle in the shop-floor is presented as consisting of seven stages. This list was developed considering similar works (Table 2) to have a simplified uniform terminology. Furthermore, Figure 1 presents a visual representation of the seven stages of the data life cycle.
1) Data sources: Data sources generate big quantities of data across the whole manufacturing value chain and product life cycle, bringing the concept of big data to the shop-floor. According to Demchenko et al. [9], big data is characterized by the 5Vs model, i.e. high volume (big quantities of data), variety (data have different formats and sources), velocity (data is rapidly generated), veracity (data must be consistent and trustworthy), and value (data has value, which needs to be extracted and analyzed). In this regard, the 5Vs model applies to big data sources in the shop-floor. Data sources include manufacturing information systems, industrial IoT technologies, internet sources (e.g. e-commerce platforms and social networks), smart products, and governmental public data [1].

2) Data collection: After data sources generate data, data collection is performed. The collection is performed mainly by IoT technologies, by means of smart sensor nodes equipped with sensing devices, such as accelerometers and temperature sensors, and the data is then transmitted using standardized communication protocols [23]. Data collection may be performed at different frequencies, referred to as sampling frequency or sampling rate, based on the processing power of sensor nodes and the requirements of the variables being measured. In addition to shop-floor data sources, other data collection sources, such as third-party application program interfaces or web crawling of internet sources, may be used to collect data, further enriching and expanding the context of data collected during the process.

3) Data transmission: Data transmission maintains the communications between the elements involved in the data life cycle, e.g. manufacturing systems and manufacturing resources. Standardized means of transmission, i.e. communication and application protocols, define how the elements communicate data among each other, for example in terms of data transmission rate and communication range, ensuring real-time, secure, and scalable data transmission [24]. As with data collection, data may be transmitted at different frequencies, based on the requirements of the monitoring strategy, such as real-time data transmission or batch data transmission.

4) Data storage: Data obtained during data collection must be stored securely and integrally. Nevertheless, data sources have different formats and may be structured, semi-structured, or unstructured [25]. As stated in [26], the second design principle of knowledge discovery in big data is that one size does not fit all, and several different storage types must be considered. Besides structured data storage, object-based storage provides a flexible solution for storing semi-structured and unstructured data, thus covering the integrity requirement of data storage. In addition, by means of cloud computing, data storage may achieve cost effectiveness and high processing power, as well as security, scalability and heterogeneity.

5) Data processing: Data processing builds upon data storage and refers to the operations required to extract information, i.e. knowledge, from heterogeneous data sources. By processing raw data, hidden information and patterns may be revealed, providing stakeholders with valuable information for decision making. Different processing techniques and tools may be used depending on the analysis to be done on the data. Big data may be processed efficiently by means of data cleaning, data reduction, data analysis, and data mining techniques, owing to advances in artificial intelligence, cloud computing and IoT [1]. As such, the first design principle of knowledge discovery in big data is that data processing should be supported by a variety of data processing methods and analysis environments [26].

6) Data visualization: Data visualization provides the means to visually understand the information extracted during data processing. Data may be visualized in dashboards, including statements, charts, graphs and augmented reality [27], and data may be queried in real time or on demand, based on the users' needs, enabling decision making based on historical or real-time data. In addition, data visualization should be accessible and easy to understand, as stated in the third design principle of knowledge discovery in big data [26]. As such, popular open standards and lightweight architectures should be used for presenting results, and the results should be exposed through application program interfaces for third-party software integration.

7) Data application: Data application refers to data analytics performed during the entire product life cycle, providing stakeholders with tools for decision making. Data analytics may be applied during the design phase, translating customer needs into product features and quality requirements [28]. Thereafter, during production, data analytics monitor the production process and lead to informed decision making regarding the manufacturing process, improving product quality and reducing production costs [10]. Finally, during product operation and maintenance, data analytics may be used to predict possible faults and to provide preventive maintenance, extending the life cycle of the product and improving relationships with customers [10].
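To make the flow between the seven stages concrete, the following is a minimal sketch in Python that chains one toy function per stage. Everything in it is an illustrative assumption rather than part of the reviewed works: the synthetic temperature values, the function names, the downsampling step, and the threshold used for the final decision.

```python
from statistics import mean

def source(n):
    """Stage 1 (data sources): a simulated temperature sensor with n synthetic readings."""
    return [20.0 + 0.5 * i for i in range(n)]

def collect(readings, every):
    """Stage 2 (collection): keep every `every`-th reading, i.e. sample at a lower rate."""
    return readings[::every]

def transmit(samples, batch_size):
    """Stage 3 (transmission): group samples into batches, as in batch transmission."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

def store(batches):
    """Stage 4 (storage): flatten received batches into one in-memory store."""
    return [value for batch in batches for value in batch]

def process(stored):
    """Stage 5 (processing): reduce stored values to summary information."""
    return {"count": len(stored), "mean": mean(stored), "max": max(stored)}

def visualize(summary):
    """Stage 6 (visualization): render the summary as a one-line text 'dashboard'."""
    return f"n={summary['count']} mean={summary['mean']:.2f} max={summary['max']:.2f}"

def decide(summary, limit):
    """Stage 7 (application): turn the information into a decision (hypothetical rule)."""
    return "inspect machine" if summary["max"] > limit else "continue"

def run_pipeline(n=12, every=3, batch_size=2, limit=24.0):
    """Run all seven stages in order and return the dashboard text and the decision."""
    summary = process(store(transmit(collect(source(n), every), batch_size)))
    return visualize(summary), decide(summary, limit)

if __name__ == "__main__":
    print(run_pipeline())
```

In a real shop-floor system each stage would be a separate component (sensor node, protocol stack, database, analytics service, dashboard), but the order of the hand-offs would follow the same pattern.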

III. OBJECTIVE AND METHOD
The work focuses on understanding the trends and challenges in implementing big data in shop-floor applications, emphasizing their data life cycle. For this purpose, a narrative literature review was carried out, supported by the extraction of n-grams that allow the preliminary exploration of related trending research. The following research questions guide the development of this review. Narrative reviews contemplate the identification of several key studies that describe a problem of interest to obtain a general overview of a field [29]. Although this approach is less rigorous than a systematic one, in this paper we support the selection of references of interest by extracting monograms, bigrams, trigrams, and quatrograms related to the main RQs and the objective of the work.
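The n-gram extraction described above can be sketched as follows. This is only an illustration of the general technique, not the tooling used by the authors: the tokenizer, the absence of stop-word removal, and the three sample abstracts are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (hyphenated words kept whole)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(documents, n, k=3):
    """Count n-grams over all documents and return the k most frequent ones."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(tokenize(doc), n))
    return counts.most_common(k)

# Made-up abstracts, used only to exercise the functions.
abstracts = [
    "Predictive maintenance with big data on the shop floor",
    "Edge computing for predictive maintenance of machine tools",
    "Big data and digital twin models for the shop floor",
]

if __name__ == "__main__":
    for n in (1, 2, 3, 4):  # monograms, bigrams, trigrams, quatrograms
        print(n, top_ngrams(abstracts, n))
```

Frequent bigrams such as "predictive maintenance" surface directly from the counts; in practice, stop words would usually be filtered before counting so that function words do not dominate the monograms.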

B. OBJECTIVES
This review's objectives are twofold and are aligned with the research questions presented above.
• In terms of RQ1: What are the recent trends in big data life cycle in shop-floor? The main interest is to briefly characterize technologies and approaches, considering the seven stages of the data life cycle presented in Section II of this review.

In the end, a total of 61 articles were chosen and further analyzed to answer each of the RQs. Fig. 2 presents the methodology used in this review. Fig. 3 presents the resulting monograms, bigrams, trigrams, and quatrograms from the set of papers collected. In general, we highlight the presence of technological enablers, such as internet of things, cyber-physical systems, artificial intelligence (neural networks), digital twin models, cloud computing, and others, which support the implementation of big data in the shop-floor. Other representative keywords are related to specific applications, e.g. predictive maintenance, energy optimization, product quality, process monitoring, anomaly detection, and decision making process. From another perspective, cloud computing and edge computing are also highlighted as the computation infrastructure used to treat the data. Several of these properties are used as a baseline to characterize the shop-floor data life cycle in the next section of this review.

IV. RECENT TRENDS IN SHOP-FLOOR BIG DATA LIFE CYCLE
This section explains the results of the narrative review of publications related to big data life cycle. The section is divided into different stages of data life cycle. The results presented are a collective overview of studies presented in the last decade on each of these stages related to big data in manufacturing shop-floor.
A. DATA SOURCES
Different applications in the context of smart manufacturing require different data sources (Figure 4). They are mostly based on the utilization of IoT devices, i.e. sensors that collect data from machines, shop-floor, products, people and environmental variables. Other important data sources are the ones provided by heterogeneous product requirements, especially in product-driven manufacturing applications.
For decision making activities, examples of data sources include customer requirement documents, datasets, and CAD models. These sources are multi-modal with different forms and, hence, require separate processing methods. Another example is information embedded in CAD models. In this context, Collada can be used as the data format to describe CAD models. If CAD models are designed in CATIA V5, then converters from CATIA V5 to Collada can be used to obtain Collada models [30].
Devices used to monitor energy in shop-floor include smart meters, current and voltage clamps, and machine-integrated devices that provide out-of-the-box instantaneous power consumption [31]. Industrial robots, for example, can provide power consumption for each joint of the robot directly from a robot controller [32]. Experimental data regarding actuation torques and servo drive voltages, used directly to derive input power of plants, can be captured with energy sensors, such as clamps [33]. Alternatively, single-phase and 3-phase smart plugs have become popular for monitoring the energy consumption of manufacturing equipment on the shop-floor [34].
Human data can also provide additional context information to current shop-floor situations. These data provide a better user experience for operators, improving productivity and decision quality. Human data can be divided into human attribute data and state data. Human attribute data comprise demographic and characteristic information that does not change or changes sporadically (e.g. age, profession, education status, and skills). These data may be used for ''user modelling'' to deliver information or services according, for instance, to the proficiency, skills, and interest of the user. Human state data refer to a collection of all kinds of data that may be used to model abstract human characteristics, such as behaviour and comfort [35]. Traditional IoT devices may acquire data about the state of operators (e.g. current position and vital functions). For instance, wearable trackers measure human performance under stressful or difficult conditions, analyzing the data and sending warnings when needed [36]. Furthermore, operators can use portable smart devices (e.g. smartphone, smartwatch, and tablet) with NFC readers to check into a location and receive information about relevant parts of the production system equipped with NFC or RFID tags [36]. Behaviour can also be inferred from the interactions that users have with machines or applications, captured with plugins or applications such as Google Analytics and Matomo. Acquired data can be uploaded to cloud services using IoT technology, where they are processed and analyzed to deliver personalized information to operators and supervisors, informing them about potential issues.
Most applications for data-driven automation rely on optimal decision making, considering the status of machines and conveyors (availability) [37]. Smart sensors, e.g. RFID tags, have been used to track equipment and people [38], [39], [40], [41]. Smart sensors have also been used to monitor the best operating conditions of machines, e.g., in terms of temperature [42]. In addition, image information (quality control) has been used as a decision factor for autonomous reconfiguration and adaptation processes [43].
Sensors that have been used for data-based maintenance in the literature include vibration [44], [45], acoustic emission [44], [45], temperature [40], [44], current [44], [45], velocity [40], pressure [40], and force [46] sensors, implemented in various parts of the machine. The sensors may be built into the machine [47] or may be installed as add-on sensors, depending on the application. PLC controllers provide process-related data, such as cutting speed, feed, and depth of cut [44]. Application-specific data sources also contribute to monitoring and maintenance activities. For example, 3D laser scanners have been used to evaluate tool flank wear [45]. Other works have used device status (such as alarms and logs) [47] and historical failure data [48] logged after quality inspection, aiding in identifying product failure patterns. RFID tags have also been used to identify defective products by comparison with the failure data [40].
The accuracy and quality of data play a vital role in the successful implementation of intelligent systems and depend on the effectiveness of the data sources. However, data gaps and incompatibilities between system applications may be found. To overcome them, proper calibration of data sources is needed. Data sources consist of automation system resources (such as sensors, actuators, PLC, SCADA, DCS, and CNC systems), identification systems (such as RFID, AutoID, barcodes, and vision systems), and communication standards between production resources (such as fieldbus and wired and wireless communication), with accompanying data exchange standards (such as OPC-UA, MTConnect, and MQTT).
Automation technologies allow a significant reduction of human participation on the shop-floor during production operation. On the one hand, there are processes that may not be automated, mainly because automation is not economically feasible. Specific production processes may involve manual work to be carried out in different manners. The employee carrying out the work may enter the information into a management support system. Nevertheless, the information accumulated from employees through this approach is highly unreliable and cannot be used for machine adaptation. On the other hand, production systems may perform automated data acquisition without human intervention. Data accumulated in this manner can be used for decision making. However, interfaces and processing of the data may be necessary. The most common data sources in automated production systems for machine adaptation have been identified to be control and measurement devices, measurement instruments (such as sensors and transducers), PLCs (and other control mechanisms), and robots.

B. DATA COLLECTION
The data collection techniques for decision making depend on the data sources. In the case of customer requirements, natural language processing techniques, such as named entity recognition [49], relation extraction [50], and attribute extraction [51], have been used. If data come from datasets, deep learning techniques and sampling techniques have been used to collect data [52].
There are mainly two types of data collection techniques: manual data acquisition and automatic data acquisition. In manual data acquisition, data are employee dependent and are gathered through a manufacturing support system. However, such data are highly inconsistent and unreliable [53]. Automated data collection is performed by automated systems, such as sensors and measuring and control devices, that respond to changes in physical processes [54].
Data collection in the shop-floor depends on the nature of the data, i.e. structured and unstructured [55]. Multiple frameworks are in place that incorporate data collection strategies for structured and unstructured data [55]. Data collection for machine adaptation is a six-step process involving initialisation, configuration, capturing, analysing, and focusing [56]. Cui et al. [7] stated that almost half of big data collection applications were distributed between monitoring (25%) and predictive applications (24%), characterized by real-time and non-real-time processes, respectively. Real-time process data analysis in manufacturing refers to methods where data from production lines are acquired, processed, and delivered to operators. Thus, it is possible to detect anomalies in a timely manner or to quickly know the status of the shop-floor, production, machines, and personnel [57]. This is one of the basic needs of operators on the shop-floor, who require a synthesized and centralized view of multiple data sources, which could be highly dispersed. In contrast, predictive applications do not necessarily require real-time data collection and focus on extracting patterns and trends based on historical process data for optimization and management innovation [57].
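As an illustration of how a real-time monitoring application might detect anomalies in a timely manner, the sketch below flags readings that deviate strongly from a trailing window of recent values. The rolling z-score rule, the window length, and the threshold `k` are assumptions chosen for the example; the reviewed works do not prescribe this particular method.

```python
from collections import deque
from statistics import mean, pstdev

def detect_anomalies(stream, window=5, k=3.0):
    """Return the indices of readings that lie more than k standard deviations
    away from the mean of the previous `window` readings."""
    history = deque(maxlen=window)  # trailing window of the most recent readings
    flagged = []
    for index, value in enumerate(stream):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Skip the check when the window is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) > k * sigma:
                flagged.append(index)
        history.append(value)
    return flagged

if __name__ == "__main__":
    # Synthetic spindle temperature readings with one injected spike.
    readings = [10.0, 10.1, 9.9, 10.0, 10.1, 10.0, 15.0, 10.0]
    print(detect_anomalies(readings))
```

Because each reading is evaluated as it arrives, the same loop could sit directly behind a data collection interface; a predictive application would instead run comparable statistics over stored historical data.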
Although real-time data collection is preferred, in practice, it is seldom the case for maintenance-related data. Add-on sensors, such as temperature, vibration, pressure, and force sensors, and process data from PLC controllers (cutting speed, feed, and depth of cut) may provide near real-time data. Device status and logs have been periodically collected and stored [47]. Wear information has been collected after a predefined amount of time to accurately analyze the wear (e.g. tool wear is measured every 20 min in [44]). Process parameters and performance metrics (historic data), such as maintenance history and failure records [48], have been collected after each production run or shift [40]. Almost all data relevant for monitoring or maintenance are time series, being assigned time stamps during collection. Data collection techniques (Figure 5) include support for RESTful/configurable application layer protocols, OPC unified architectures, and distributed data acquisition (e.g. Flume [47]).
Automation activities rely on event-driven data collection techniques, e.g. time-driven, quantity-driven, and operation-driven collection [58]. Event-driven approaches allow the storage of manufacturing information after a specific time interval. These techniques are also useful to query manufacturing services for process automation purposes. Optimal decision making usually requires the storage of historical data and its comparison with data collected through real-time monitoring [40].
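The three trigger types named above can be sketched as a small collector that stores a snapshot whenever one of its triggers fires. The class, its method names, and the trigger parameters are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field

@dataclass
class Collector:
    """Stores a (time, reason) snapshot whenever one of three triggers fires."""
    interval_s: float             # time-driven: store once per `interval_s` seconds
    batch_qty: int                # quantity-driven: store once per `batch_qty` finished parts
    last_time_store: float = 0.0
    parts_since_store: int = 0
    stored: list = field(default_factory=list)

    def on_tick(self, t):
        """Time-driven trigger: fires when the interval has elapsed."""
        if t - self.last_time_store >= self.interval_s:
            self.stored.append((t, "time"))
            self.last_time_store = t

    def on_part_finished(self, t):
        """Quantity-driven trigger: fires after every `batch_qty` finished parts."""
        self.parts_since_store += 1
        if self.parts_since_store >= self.batch_qty:
            self.stored.append((t, "quantity"))
            self.parts_since_store = 0

    def on_operation(self, t, name):
        """Operation-driven trigger: fires on every named operation event."""
        self.stored.append((t, f"operation:{name}"))

if __name__ == "__main__":
    c = Collector(interval_s=10.0, batch_qty=2)
    c.on_tick(5.0)                      # interval not yet elapsed, nothing stored
    c.on_tick(10.0)                     # time trigger fires
    c.on_part_finished(11.0)
    c.on_part_finished(12.0)            # quantity trigger fires
    c.on_operation(13.0, "tool_change")
    print(c.stored)
```

Keeping the triggers independent, as done here, is one possible design; a collector could equally reset all counters whenever any trigger stores a snapshot.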
For time-driven data collection, energy data from manufacturing equipment has been studied. Energy is usually monitored at given time intervals, such as every 15 minutes, recording the total energy consumption. However, some applications, such as profiling robotic motions and understanding the parameters affecting energy consumption, require real-time energy data sampled every few milliseconds [59].
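The relation between the two sampling regimes can be made explicit: the energy over an interval is the time integral of the instantaneous power, $E = \int P(t)\,dt$, so finely sampled power data can always be aggregated into interval totals, but not the other way round. The sketch below applies the trapezoidal rule to power samples; the function name and the sample values are assumptions for illustration.

```python
def energy_kwh(times_s, power_w):
    """Integrate power samples (in W) over time (in s) with the trapezoidal rule.

    Returns the consumed energy in kWh (1 kWh = 3.6e6 J).
    """
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    joules = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        joules += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return joules / 3.6e6

if __name__ == "__main__":
    # A machine drawing a constant 2 kW for 15 minutes consumes 0.5 kWh.
    print(energy_kwh([0, 450, 900], [2000, 2000, 2000]))
    # Millisecond-scale samples of a short robot motion (synthetic values).
    print(energy_kwh([0.000, 0.004, 0.008], [100, 300, 100]))
```

The second call shows why motion profiling needs millisecond sampling: the power peak at 4 ms would be invisible in a 15-minute total.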

C. DATA TRANSMISSION
Data transmission protocols include sockets, OPC-UA, MQTT, TCP/IP (such as PLC simulator), and other communication protocols (Figure 6). Data transmission protocols depend on the application domain and may be dynamically chosen. Data transmission is used as the communication channel between different devices, including IoT devices, workstations, and digital twins. When workstations in manufacturing environments use different operating systems, OPC-UA is a suggested solution. Cloud-based systems have also been recommended, as they promote modularity among the components of the pipeline [60].
The transmission of data for further processing depends on the logging frequency of the data. High-frequency data may be stored first in storage devices of monitoring solutions. Thereafter, collected data are transmitted manually in batch to processing computers via Ethernet connections. Some monitoring solutions also offer transmitting data via WiFi. Transmitting energy data via WiFi has the benefit of transport flexibility and high transmission distance. However, WiFi comes with shortcomings, such as high latency and transmission unreliability. Hence, industrial standards such as Modbus and Profinet have been used for mission-critical applications [59], [61].
Process automation may require connecting manufacturing resources to the Internet. Generally, the connection has been done by Ethernet [37] and wireless communications [38]. Data transmission has also been implemented using industrial standards with higher reliability, such as OPC-UA, Modbus, and Profibus [62]. IoT communication has been used to perform data transmission using publish/subscribe messaging, e.g. MQTT protocol [43], for event-driven process automation purposes.
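The publish/subscribe pattern behind MQTT can be sketched with a small in-memory broker. This is not an MQTT client and opens no network connection; it only mimics how subscribers register topic filters and receive matching messages. The matching function follows the usual MQTT wildcard semantics (`+` for exactly one topic level, `#` for all remaining levels) in simplified form, and the topic names are hypothetical.

```python
def topic_matches(topic_filter, topic):
    """Return True if `topic` matches `topic_filter` using MQTT-style wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True            # multi-level wildcard: matches everything below
        if i >= len(topic_levels):
            return False           # filter is longer than the topic
        if level != "+" and level != topic_levels[i]:
            return False           # literal level does not match
    return len(filter_levels) == len(topic_levels)

class Broker:
    """Toy in-memory broker that forwards published messages to matching subscribers."""

    def __init__(self):
        self._subscriptions = []   # list of (topic_filter, callback) pairs

    def subscribe(self, topic_filter, callback):
        self._subscriptions.append((topic_filter, callback))

    def publish(self, topic, payload):
        for topic_filter, callback in self._subscriptions:
            if topic_matches(topic_filter, topic):
                callback(topic, payload)

if __name__ == "__main__":
    broker = Broker()
    broker.subscribe("shopfloor/+/temperature", lambda t, p: print("received", t, p))
    broker.publish("shopfloor/cnc1/temperature", 71.5)   # delivered
    broker.publish("shopfloor/cnc1/vibration", 0.3)      # no matching subscriber
```

The publisher never addresses a specific consumer, which is what makes the pattern suitable for event-driven automation: new consumers can be added by subscribing, without changing the devices that publish.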
Real-time data may be transmitted using WiFi, Zigbee, and 4G through the Internet and using VPNs. Non-real-time data may be transmitted through technologies or applications such as Apache Sqoop and Data/X [40]. High-frequency production and sensor data have been transmitted through Ethernet to a local server and then, after feature extraction, sent to cloud servers on the Internet using WiFi [44].
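The local feature extraction step mentioned above reduces the volume of data that must travel over the slower link: instead of every raw sample, only a few summary values per window are sent onward. The sketch below computes three common time-domain features per window. The choice of features (mean, RMS, peak) and the window length are assumptions; the cited work may use different ones.

```python
import math

def window_features(samples):
    """Compute simple time-domain features for one window of raw samples."""
    n = len(samples)
    return {
        "mean": sum(samples) / n,
        "rms": math.sqrt(sum(x * x for x in samples) / n),
        "peak": max(abs(x) for x in samples),
    }

def extract(signal, window):
    """Split the signal into non-overlapping windows and extract features from each.

    Trailing samples that do not fill a complete window are dropped.
    """
    return [
        window_features(signal[i:i + window])
        for i in range(0, len(signal) - window + 1, window)
    ]

if __name__ == "__main__":
    # Synthetic vibration signal: 8 raw samples become 2 feature records.
    raw = [3.0, -3.0, 3.0, -3.0, 1.0, -1.0, 1.0, -1.0]
    for features in extract(raw, window=4):
        print(features)
```

With a window of 1000 samples, for example, each upload would carry three numbers instead of a thousand, at the cost of losing the raw waveform on the cloud side.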
The introduction of IoT in the shop-floor has increased the transmission of low-frequency sensor information directly from various sources through WiFi. This has also had an impact on the latency of the system response. Data transmission rates play a vital role and depend on the manufacturing application. To incorporate multiple data formats, standards, and needs for machine adaptation, a combination of technologies is proposed in this study to assist in data transmission. To this end, a data transmission framework is necessary to improve data transmission across the production domain.

D. DATA STORAGE
Common data formats to store machine information are XML and JSON files [38]. Different data types include structured (formatted as tables), semi-structured (such as XML, JSON, and HTML) and unstructured data (such as documents, images, audio, video, text, and emails) [58]. Table 3 presents data storage types and technologies used in the manufacturing shop-floor. Unstructured data are first processed internally to extract relevant information before being stored in databases. For example, tool wear information has been extracted from wear images using image processing software and converted into flank/crater wear values along with their time stamps [45].
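As a small illustration of the two semi-structured formats, the sketch below serializes one extracted wear measurement to both JSON and XML using only the Python standard library. The field names, the element name `reading`, and the values are hypothetical and are not taken from the cited works.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: a flank wear value extracted from a wear image, with its time stamp.
record = {
    "machine_id": "cnc-01",
    "timestamp": "2023-01-01T08:00:00Z",
    "flank_wear_mm": 0.12,
}

def to_json(rec):
    """Serialize the record as a JSON object with keys in a stable order."""
    return json.dumps(rec, sort_keys=True)

def to_xml(rec):
    """Serialize the record as a flat XML element with one child per field."""
    root = ET.Element("reading")
    for key, value in rec.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(to_json(record))
    print(to_xml(record))
```

Both outputs carry the same information; JSON preserves the numeric type of the wear value, whereas the XML variant stores every field as text unless a schema says otherwise.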
Depending on the data type, data may be stored using several techniques. Traditionally, RDBMS and DDBS have been used for structured data. RDBMS are characterized by well-defined schemas and relationships. For example, basic user information may be stored in traditional database systems such as MySQL, PostgreSQL, and SQLite. RDBMS have also been used for user interaction data storage. For instance, Matomo, a user analytics platform, captures user interaction streams (e.g. clicks and page views) in MySQL and MariaDB databases. However, RDBMS offer limited scalability.
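A minimal sketch of structured storage with a well-defined schema and a relationship is given below, using an in-memory SQLite database through Python's built-in `sqlite3` module. The table and column names are invented for the example and do not reproduce the schema of Matomo or of any reviewed system.

```python
import sqlite3

def build_db():
    """Create an in-memory database with operators and their interaction events."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE operator (
            id          INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            skill_level TEXT NOT NULL
        );
        CREATE TABLE interaction (
            id          INTEGER PRIMARY KEY,
            operator_id INTEGER NOT NULL REFERENCES operator(id),
            event       TEXT NOT NULL,
            ts          TEXT NOT NULL
        );
        """
    )
    conn.execute(
        "INSERT INTO operator (id, name, skill_level) VALUES (?, ?, ?)",
        (1, "Operator A", "expert"),
    )
    conn.executemany(
        "INSERT INTO interaction (operator_id, event, ts) VALUES (?, ?, ?)",
        [
            (1, "click", "2023-01-01T08:00:00Z"),
            (1, "page_view", "2023-01-01T08:00:05Z"),
        ],
    )
    conn.commit()
    return conn

def events_per_operator(conn):
    """Join both tables and count the interaction events of each operator."""
    query = """
        SELECT o.name, COUNT(*)
        FROM interaction AS i
        JOIN operator AS o ON o.id = i.operator_id
        GROUP BY o.id, o.name
        ORDER BY o.name
    """
    return conn.execute(query).fetchall()

if __name__ == "__main__":
    print(events_per_operator(build_db()))
```

The fixed schema is what makes the join straightforward, and it is also the source of the rigidity noted above: adding a new kind of attribute requires a schema change rather than just a new field in a document.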
NoSQL databases (e.g. MongoDB and Cassandra) have proven to be better approaches for semi-structured (JSON, XML) and unstructured (audio, video) data. In addition, XML has been used to transform structured data to semi-structured data [40]. HDFS may also be used for dealing with unstructured data. Some examples of these kinds of databases include:
• Cassandra to store event data of automation controllers.
• MongoDB (document NoSQL database) to store machine data.
• TSDBS, such as OpenTSDB and InfluxDB, to store and access sensor time-series data.
Data models are also used to represent manufacturing data. Data models are comprised of two parts: (i) run time conditions (process knowledge and time-sensitive dimension) and (ii) process model (production requirements of products). Once data models are defined, knowledge graphs may be used to store data. There are two main types of storage for knowledge graphs: RDF-based storage and graph-based storage. An important design principle of RDF-based storage is the ease of data distribution and sharing, while graph-based storage focuses on efficient graph queries and search. The Neo4j system is a widely used graph database [63]. It has an active community, and the system itself is efficient in querying. However, it lacks support for quasi-distribution.
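To make the idea of RDF-style storage concrete, a knowledge graph can be viewed as a set of (subject, predicate, object) triples that are queried by pattern matching. The sketch below is a toy in-memory store in plain Python; it uses neither RDF tooling nor Neo4j, and the entity and predicate names are invented for illustration.

```python
class TripleStore:
    """Toy in-memory store for (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        """Insert one triple; duplicates are ignored because a set is used."""
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern, sorted; None is a wildcard."""
        return sorted(
            t
            for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


# Hypothetical shop-floor facts.
store = TripleStore()
store.add("machine_1", "hasSensor", "vibration_1")
store.add("machine_1", "locatedIn", "cell_A")
store.add("machine_2", "locatedIn", "cell_A")
```

For example, `store.match(predicate="locatedIn", obj="cell_A")` lists every machine recorded in the hypothetical cell.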
Smart manufacturing applications have used distributed file systems (for data-at-rest) and databases (for data-at-motion) for storage [37]. Historical data are ingested from databases to predict production planning performance, safety critical aspects, and network designs. In addition, Hadoop and MapReduce techniques may be used to reduce the storage space required for big data.
Production and sensor data from the machines have been initially stored in industrial computers connected to the machines and then processed internally using feature extraction to understand the states of the machines. Thereafter, the data have been sent to cloud servers for management and storage in a database, with the cloud acting as a remote server for data storage [44].
Automation applications relying on storage of manufacturing information, as well as services, have increased the responsiveness and interoperability of the shop-floor and thus, the automation capacity. The choice of storage solutions greatly affects the application. High-frequency big-data files require special solutions such as Hadoop and Spark that can deal with the high volume property of big data. Data have been recorded in regular time intervals, resulting in time-series data [64]. To this end, special database solutions for storing time-series data, such as InfluxDB, may be used. Also, relational database methods have been used for their reliability. Furthermore, some monitoring solutions have stored the collected energy data in storage devices using CSV files.

E. DATA PROCESSING
When data are collected and transformed into usable form, data processing takes place. Data processing must be done appropriately to avoid a detrimental impact on the final product, or data output. It is typically performed by data scientists or teams of data scientists. Different techniques can be used for data processing. Figure 7 presents the traditional data processing process performed in the shop-floor. Data processing is a computationally intensive task. First, data should be resampled to match the recorded timestamps. Resampling methods such as averaging, forward filling, or backward filling have been used in literature [65]. The averaging method takes an average value within a pre-defined time interval and replaces the missing values in the data with that average. In forward- and backward-filling methods, missing timestamps are filled with the values before and after the missing timestamp, respectively. Once data have been processed, they are fed into application-dependent algorithms such as ARIMA, Seasonal ARIMA, Bayesian Optimization, clustering, neural networks [66], genetic algorithms [67] and parameter identification methods [68].
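The three resampling methods described above can be sketched in plain Python as follows. This is a minimal illustration, assuming an evenly spaced series in which missing samples are marked with `None` and the averaging interval is a fixed number of samples; it is not the implementation used in [65].

```python
def forward_fill(values):
    """Fill each missing value (None) with the last observed value before it.

    Leading gaps stay None because no earlier observation exists.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def backward_fill(values):
    """Fill each missing value (None) with the next observed value after it."""
    return forward_fill(values[::-1])[::-1]


def fill_with_window_average(values, window):
    """Fill missing values with the mean of the observed values in their window.

    The series is cut into consecutive windows of `window` samples; a window
    without any observed value is left unchanged.
    """
    filled = list(values)
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        observed = [v for v in chunk if v is not None]
        if not observed:
            continue
        average = sum(observed) / len(observed)
        for i in range(start, start + len(chunk)):
            if filled[i] is None:
                filled[i] = average
    return filled


series = [1.0, None, 3.0, None, None, 6.0]
```

With this series, forward filling repeats 1.0 and 3.0 into the gaps, backward filling pulls 3.0 and 6.0 back into them, and window averaging with `window=3` uses the mean of each three-sample block.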
Several approaches exist for data processing in decision making. Several studies have used a method based on multi-neural collaboration to extract knowledge and the extracted knowledge has been classified according to labels. An ontology model and schema layer of the knowledge graph has been defined and the knowledge has been represented with fuzzy comprehensive evaluation [69]. Knowledge has been directly described as production rules [70] and as knowledge graph [71]. Owing to the wide range of knowledge sources, the knowledge base that has been constructed according to the two steps above has high redundancy. To this end, latent semantic analysis, similarity calculations and attribute weighting may be used to eliminate redundancy in the knowledge. First, the entity triples in the preliminary knowledge base have been mapped with the Protege ontology library, and then the semantic web rule language (SWRL) has been used to represent the empirical rule knowledge. Finally, the data layer has been instantiated to construct the final knowledge base [72].
As for the data processing in HMI, in addition to using several data mining and machine learning techniques, the development of analytic solutions requires selecting the right strategy according to diverse scenarios. Streaming, large-batch, and small-batch analytics are the three main processing strategies for big data [81]. Streaming is a processing technique for real-time analysis of data streams, particularly necessary when data arrives at high velocity. Large-batch processing is the most traditional form of processing, where big data volumes representing long periods of time (e.g. hours, days, weeks) are collected and analysed with complex machine learning models. For batch processing, real-time data processing is not a priority. Small-batch processing (also known as micro-batch) processes small accumulations of data over a short time window (e.g. seconds, minutes).
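Small-batch processing can be illustrated by grouping timestamped events into fixed-length windows and computing one aggregate per window. The sketch below is plain Python under simplifying assumptions (events already held in memory, timestamps in seconds, tumbling windows); it does not use Spark or any other platform discussed in [81].

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, value) events into consecutive fixed-length windows.

    Returns a dict mapping each window start time to the values that fell
    into that window, ordered by window start.
    """
    batches = defaultdict(list)
    for ts, value in events:
        window_start = int(ts // window_seconds) * window_seconds
        batches[window_start].append(value)
    return dict(sorted(batches.items()))


def batch_means(events, window_seconds):
    """Compute the mean value of every micro-batch."""
    return {
        start: sum(values) / len(values)
        for start, values in micro_batches(events, window_seconds).items()
    }


# Hypothetical sensor events: (seconds since start, measured value).
events = [(0.5, 10.0), (1.2, 20.0), (4.9, 30.0), (5.0, 40.0), (11.0, 50.0)]
```

With a 5-second window, the first three events form one batch, while the events at 5.0 s and 11.0 s each fall into their own batch.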
Data processing can also be used for automation. Intelligent decision making for process automation and self-organization requires the analysis of machine status and energy consumption. This necessitates the use of machine learning techniques. Some examples for process automation include neural networks, support vector machines, and k-nearest neighbours [42]. Negotiation-based approaches with machine learning have been used for choosing proper routing and transportation of products, e.g. for storing or scrapping [43]. Genetic algorithms have also been used under the scope of ML. For process automation, genetic algorithms find optimal production resources, e.g. the ones with minimum energy consumption or the ones that require less production time. In general, classical machine learning techniques are enough for this type of application.
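As a minimal sketch of one of the classical techniques named above, the following k-nearest-neighbours classifier assigns a machine status to a new observation by majority vote among its closest labelled examples. The two features (mean power and vibration RMS), the labels, and the training points are hypothetical and do not come from [42].

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_tuple, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (mean power in kW, vibration RMS) -> machine status.
training_data = [
    ((1.0, 0.20), "idle"),
    ((1.1, 0.30), "idle"),
    ((0.9, 0.25), "idle"),
    ((5.0, 2.00), "cutting"),
    ((5.2, 2.10), "cutting"),
    ((4.8, 1.90), "cutting"),
]
```

An odd `k` is chosen here so that a two-class vote cannot end in a tie.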
In the maintenance sector, feature extraction of the time-series data from sensors such as vibration or force sensors includes both time-domain and frequency-domain feature extraction. Time-domain features include RMS, peak, mean, standard deviation, skewness, kurtosis, and crest factor [44]. Frequency-domain features include main frequency, harmonics, and frequency band energy percentage. Before feature extraction of high-frequency data, noise reduction should be performed on the signal. Data and pattern mining models for maintenance (e.g. Apriori [40] or FPGrowth [48]) could be used for knowledge and rules generation. Generated knowledge along with production data could aid in fault diagnosis and prediction. Correlation analysis has provided internal relationships between device and faults [47].
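The time-domain features listed above can be computed directly from a sampled signal. The sketch below uses the common population (biased) definitions, with kurtosis reported as the non-excess fourth standardized moment; these conventions are assumptions, since the reviewed works may use other normalizations.

```python
import math


def time_domain_features(signal):
    """Compute common time-domain features of a non-empty list of samples."""
    n = len(signal)
    mean = sum(signal) / n
    # Central moments with population (1/n) normalization.
    m2 = sum((x - mean) ** 2 for x in signal) / n
    m3 = sum((x - mean) ** 3 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    std = math.sqrt(m2)
    rms = math.sqrt(sum(x * x for x in signal) / n)
    peak = max(abs(x) for x in signal)
    return {
        "mean": mean,
        "std": std,
        "rms": rms,
        "peak": peak,
        # Standardized moments are undefined for a constant signal; 0.0 is returned.
        "skewness": m3 / std ** 3 if std > 0 else 0.0,
        "kurtosis": m4 / std ** 4 if std > 0 else 0.0,
        # Crest factor: ratio of the peak amplitude to the RMS value.
        "crest_factor": peak / rms if rms > 0 else 0.0,
    }
```

For instance, the signal `[0.0, 0.0, 0.0, 4.0]` has an RMS of 2.0 and a peak of 4.0, giving a crest factor of 2.0, which reflects its impulsive shape.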
Traditional and deep machine learning techniques have been used for data analytics. Clustering algorithms have been identified to be the most common machine learning algorithms for preliminary grouping of sensor data and for creating labels according to their process state [44], [48]. Clustering algorithms have been followed by classification algorithms based on traditional machine learning (e.g. k-means in [48]) or deep learning (e.g. CNN in [46]). Technologies that have been used for data analysis in maintenance include STORM [40] (distributed computing), STORM cluster [47] (resource scheduling), and Hadoop [40] (offline prediction, considering both current status and historical information).
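The preliminary grouping of sensor readings into process states can be illustrated with a one-dimensional k-means sketch. The deterministic initialization (centroids spread evenly between the minimum and maximum reading) and the sample values are assumptions made for reproducibility; they are not the procedure of [44] or [48].

```python
def kmeans_1d(values, k=2, iterations=20):
    """Cluster scalar readings with k-means; returns (labels, centroids).

    Assumption: centroids start evenly spaced between min and max, which makes
    the result deterministic (random or k-means++ starts are common in practice).
    """
    lo, hi = min(values), max(values)
    if k > 1:
        centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    else:
        centroids = [sum(values) / len(values)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each reading joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Hypothetical vibration levels: three low readings and three high readings.
readings = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]
```

The resulting cluster indices can serve as provisional state labels (for example, low versus high vibration) for a subsequent classifier.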
The collected data need to be processed to generate insights. Primary steps in data processing involve cleaning the data to remove noise and formatting errors. Streaming (Flink, Storm), micro-batching (Spark) and batch data processing (MapReduce) provide technologies to clean and process big data volumes. Manufacturing applications such as complex event processing with Storm, deviation detection with Flink, and prediction and quality control with MapReduce are some examples where these technologies are used to process manufacturing data. Knowledge can be generated by applying big data technologies to the collected big data. Apache Hive-Mind based platforms have aided knowledge generation for predictive maintenance. Hadoop and OWL technologies can manage knowledge of intelligent applications for smart manufacturing applications.

F. DATA VISUALIZATION
Data visualization is an integral part of data analysis which concentrates on the use of tables and graphs for presenting quantitative and qualitative information, and as a way for users to communicate with the data [83]. However, few state-of-the-art works describe methods for data visualization in the context of smart manufacturing automation and big data. Data visualization is usually implemented in the form of dashboards, a type of graphical user interface that consolidates a large amount of data (i.e. sensor, operational, and maintenance data). Dashboards are used to monitor and access production status or in some cases as a direct interface between the customer and the shop floor. They are often interactive and users can filter and query data, zoom in/out, and scroll. Many of the visualizations contained in dashboards show changes over time and are updated as new data is released, thus displaying real-time data updated every few seconds or minutes. In general, data visualization can include [84]:
• Different types of charts and graphs, tables, time trends, etc.
• Interactive widgets (i.e. knobs, dimmers, keypads, etc.) used to interact with CPS, IoT devices and applications, based on current data analysis.
• Visualization of geo-referenced data (machines in different locations, operators location tracking, external sensors).
From the technological perspective, in research, scholars prefer the use of the Python programming language to develop machine learning models. Therefore, for data visualization, Python libraries such as Seaborn or Matplotlib are chosen to develop charts and graphs. Reference [85] used Matplotlib to visualize a heat map to find the correlation between the variables involved in milling tool wear. Depending on the tools and technology used (e.g. SQL databases, graph databases), visualisation methods integrated into the development environment can be used [63].
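The data behind a correlation heat map is a matrix of pairwise correlation coefficients. The following sketch computes a Pearson correlation matrix in plain Python so that it stays self-contained; the variable names and values are hypothetical and are not the milling data of [85]. The resulting matrix could then be passed to a plotting routine such as Matplotlib's `imshow` or Seaborn's `heatmap`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict with the pairwise Pearson correlation of all columns."""
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names} for a in names
    }


# Hypothetical process variables recorded over four cuts.
process_data = {
    "cutting_speed": [1.0, 2.0, 3.0, 4.0],
    "tool_wear": [2.0, 4.0, 6.0, 8.0],
    "coolant_flow": [4.0, 3.0, 2.0, 1.0],
}
```

In this toy data set, tool wear rises in step with cutting speed while coolant flow falls, so the matrix contains values close to 1 and -1 off the diagonal.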
However, these options are not intuitive or designed for end-users. At the moment, multiple platforms and frameworks can produce analytics applications and visualizations easily with very aesthetically pleasing results. Grafana is one of the most popular open-source platforms for interactive data visualization. Reference [79] used Grafana to create a dashboard for visualising energy data at the workstation level to show operational KPI and power consumption trends. Similarly, [81] developed dashboards using Grafana and Amazon QuickSight for its compatibility with Spark to display the results of small-batch processing for the detection of anomalies on CNC Machines. Other similar products include Qlikview, Tableau, Kibana, and Splunk.
Even though these platforms are praised for their ease of use, the target users are data scientists and engineers, business analysts, or DevOps engineers. For end-users (i.e. customers, operators, supervisors), customized applications accessible through mobile devices or web interfaces using browsers [62] are the best option. In [44], a Web and iOS-based user interface is used in real-time for decision-making on the assessment of health. In [47], the processed manufacturing data is sent to backstage supporters and the diagnosis or prognosis reports are visualized on large screens through a web application (Single View integrated failure map pattern monitoring and cause [48]) or sent to mobile devices of the maintenance personnel. These kinds of applications will require some sort of software development. JavaScript is the de facto web standard for reactive applications, with multiple frameworks and runtimes such as React, AngularJS, and Node.js. There are specific JavaScript libraries that allow the development of interactive visualizations, such as CanvasJS or ChartJS. Reference [86] developed a web application for historical analysis and real-time tracking of the assembly line performance. The web application was created with a combination of HTML5, CSS, JavaScript, the JavaScript Data-Driven Documents (D3) library, Three.js, and several JavaScript framework and utility libraries including Underscore.js, Backbone.js, and jQuery. Table 5 describes some of the platforms, software, and libraries found in the literature for visualisation.
From the user perspective, it is important to consider that manufacturing processes involve different types of users, among whom multiple variables intervene (e.g. expertise, role, age). Therefore, users will have different perceptions of visual data presentation and interactive data analysis [57]. User-centered design as a methodology can help to understand the requirements and needs of specific roles in the industry.
Table 4 provides a brief overview of the relevant references for big data shop-floor reviewed in this work. Different manufacturing applications require different data sources. Data sources comprise mostly smart sensors and IoT devices that convert physical variables into digitized measurable units. Smart decision making in product-driven manufacturing applications relies on specifications of production requirements. Manufacturing automation concepts are based on logic-based or negotiation-based approaches. In particular, it has been identified that data-driven automation has received less attention, making it an opportunity for future research.

V. DISCUSSION ON THE RECENT TRENDS AND CHALLENGES
Some applications rely fundamentally on data acquisition and the number of sensors placed in shop-floor machines and resources. Two examples are maintenance and energy optimization. On the one hand, maintenance has relied on acoustic emissions, temperature, velocity, pressure, and other variables to understand the health status of machines. On the other hand, energy optimization applications have relied mostly on measurement of electrical variables, e.g. with smart meters, current and voltage clamps, and single-phase and 3-phase smart plugs. With the advent of human-centred manufacturing applications, the acquisition of data from operators has become a trend in current research, especially data used to model human characteristics, such as behaviour and comfort. Wearable trackers can measure human performance under stressful or difficult conditions. Consideration should be given to data sources containing data that should not be used due to regulations such as the General Data Protection Regulation.
Data collection may be performed with either manual or automatic data acquisition. The main trade-offs concern the form, consistency, and reliability of the data. Data collection is dependent on the type of data source and comes from sources, such as IoT devices, evaluations, simulations, and predictions, in structured or unstructured formats. Data collection has usually been accompanied by an underlying framework that leverages step-wise processes to gather desired data for decision-making. Predictive maintenance, monitoring, energy consumption, and event-driven automation applications require data to be collected as per specific requirements. These requirements include real-time, time-driven, and periodic data collection, as well as application-specific criteria.
Data transmission may be performed with sockets, OPC-UA, MQTT, TCP/IP (such as PLC simulator), or other communication protocols depending on the application domain, and the protocol can be chosen dynamically. Data transmission is the middleware between digital twins and the shop-floor. Moreover, it is the communication channel between devices in digital twins and their physical counterparts. The introduction of IoT on the shop floor has increased the transmission of low-frequency sensor information directly from sources through wireless communication. This has had an impact on the latency of the response of the system. Industrial wireless communication devices include industrial switches, industrial routers, and wireless access points.
As manufacturers become increasingly reliant on sensors and various data sources, data storage has become an increasingly important concern. In particular, the ability to store big data has been given special attention. A trend has been identified among manufacturers of moving from traditional RDBMS databases to NoSQL and NewSQL databases when considering scalability. Moreover, a need has been identified to develop techniques to not only store data in a structured manner but also filter redundant data and delete data which is no longer relevant. This could greatly reduce storage costs and complexity. However, it has been recognized that there are few studies considering this aspect.
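Two simple ways to filter redundant data and delete data that is no longer relevant are a deadband filter and an age-based retention rule. The sketch below illustrates both in plain Python; the threshold, the maximum age, and the sample readings are hypothetical, and neither function is drawn from the reviewed literature.

```python
def deadband_filter(readings, threshold):
    """Drop near-duplicate readings.

    A (timestamp, value) pair is kept only if its value differs from the last
    kept value by more than `threshold`; the first reading is always kept.
    """
    kept = []
    for ts, value in readings:
        if not kept or abs(value - kept[-1][1]) > threshold:
            kept.append((ts, value))
    return kept


def apply_retention(readings, now, max_age):
    """Keep only the readings that are at most `max_age` old at time `now`."""
    return [(ts, value) for ts, value in readings if now - ts <= max_age]


# Hypothetical temperature readings: (time in seconds, degrees Celsius).
temperature = [
    (0, 20.0), (1, 20.05), (2, 20.1), (3, 21.0), (4, 21.02), (5, 19.0),
]
```

With a threshold of 0.5, only three of the six readings are retained, which shows how much slowly varying signals can be compacted before storage.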
Data processing techniques have been widely used in manufacturing. With the development of IoT, 5G and 6G, and cloud computing technologies, the data quantity from manufacturing systems has increased rapidly. With industrial big data, achievements beyond expectations have been made in product design, manufacturing, and maintenance processes. Data processing has been a core technology to empower intelligent manufacturing systems.
Finally, visualization has been identified as a usually neglected aspect in research. As presented in the results, multiple scholars prefer Python libraries for simple static visualization. However, to provide adequate commercial implementations of big data applications, visualization is as essential as the other stages. The capability of applications to further exploit data from user behaviour, improving the visualization aspect in manufacturing, needs further research. Furthermore, there is a lack of standardization that requires researchers and engineers to identify generic abstractions for industrial data and understand different user groups. Thus, new frameworks for visualization applications may be developed.
Challenges: Challenges found in literature have been compiled in this study from the results and discussion of the review process. Although some of the challenges below are application-specific, they were found quite often in the reviewed literature.
• Data measurement solutions usually come with inherent measurement errors. Although these errors are relatively small, transferability has been affected. For instance, the same sensor for the same equipment performing the same application can yield different energy consumption values. Noisy and non-deterministic measurement values challenge data-processing and decision-making algorithms.
• Frequency of collected data is identified as another challenge in literature. Sampling at a high rate produces big data that is difficult to transmit and process in real-time. However, some applications require high-frequency data, such as energy parameter profiling applications. Therefore, trade-offs should be considered in data collection on the shop-floor.
• Data acquisition systems, incorporating all information gathered during the production process, are needed to collect data, discover knowledge, and share it among all stakeholders.
• Real-time processing, analysis, production reporting, and monitoring of data-driven sources must be implemented for real-time analysis of sensor data.
• Reliable data and valuable knowledge are needed to support optimized decision-making in product life-cycle management.
• Data heterogeneity must be processed in shop-floor systems comprised of multi-source heterogeneous data and complex processes, such as fault prediction using traditional signal processing techniques considering the 5V challenges posed by industrial big data.
• Data visualization design should be improved for human interaction. Visual and task complexity must be considered in data visualization, for example in complex dashboards and unorganized big data. In addition, a high number of steps to complete a task may cause mistakes and reduce the performance of operators.
• The lack of implementations of cybersecurity and data privacy remains a challenge in shop-floor systems, in particular for big data analytics.
• Governance of big data handles data integrity, quality, provenance, retention, processing, and analysis in the full data life cycle. Governance of industrial big data should consider the issues of cybersecurity and data privacy as well.
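The sampling-frequency trade-off listed among the challenges above can be made concrete with a short calculation of the raw data rate, which is the product of sampling rate, number of channels, and bytes per sample. The figures used below (10 kHz, three channels, 4-byte samples) are illustrative assumptions, not values reported in the reviewed literature.

```python
SECONDS_PER_DAY = 86_400


def raw_data_rate(sampling_hz, channels, bytes_per_sample):
    """Uncompressed data rate in bytes per second."""
    return sampling_hz * channels * bytes_per_sample


def daily_volume_gb(sampling_hz, channels, bytes_per_sample):
    """Uncompressed data volume per day in gigabytes (1 GB = 1e9 bytes)."""
    rate = raw_data_rate(sampling_hz, channels, bytes_per_sample)
    return rate * SECONDS_PER_DAY / 1e9


# Hypothetical three-channel sensor sampled at 10 kHz with 4-byte samples.
high_rate = raw_data_rate(10_000, 3, 4)
high_volume = daily_volume_gb(10_000, 3, 4)
# The same sensor sampled once per second.
low_volume = daily_volume_gb(1, 3, 4)
```

Under these assumptions, the 10 kHz configuration produces 120 000 bytes per second, or about 10.4 GB per day for a single device, whereas sampling at 1 Hz produces roughly 1 MB per day.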

VI. CONCLUSION
In this work, a basis for the development of a homogeneous approach to gather and use big data on the shop-floor in manufacturing environments has been presented. A literature review of research regarding big data in manufacturing has been performed, targeting the complete data life cycle. In this regard, the needs, requirements and methods for the seven stages of the big-data life cycle in manufacturing have been presented and discussed. Therefore, approaches for data acquisition, processing and utilisation for decision making on the shop-floor in manufacturing have been established and challenges in each stage have been elaborated.
As a result of this study, approaches have been identified in each stage of the big-data life cycle in manufacturing, focusing on maintenance, automation, quality, decision making, energy optimization, user interaction, and adaptability. Data sources, such as sensors, documents and models, have been identified and elaborated, detailing their usage and benefits, as well as possible drawbacks. Thereupon, data collection techniques have been presented, i.e. manual data acquisition and automatic data acquisition, describing the benefits and drawbacks of each. Furthermore, a separation between monitoring and predictive applications has been described, highlighting the effect that the intended application has on data collection. Having presented data collection techniques, data transmission protocols and techniques have been studied. Techniques and protocols for data transmission have been presented, as well as the cases in which each may be used. Subsequently, data storage possibilities have been presented. Since data may be structured, semi-structured and unstructured, storage options have been discussed for each type of data structure, as well as the methods to integrate data in different formats and from different sources. In the context of data processing, several approaches towards data processing have been presented, as well as leading technologies for big data processing. In general, artificial intelligence and statistical approaches have been identified as the main contributors in this stage. Finally, data visualization methods, an integral part of data analysis, have been described in the context of smart manufacturing automation and big data. Several platforms and frameworks for data visualization have been reviewed and programming languages suitable for creating dashboards and visualization applications have been described.
A discussion of the trends and challenges obtained from the review process has been presented. It has been identified that the primary data sources include smart sensors and IoT devices. Nevertheless, human-centered manufacturing applications have included data acquisition from operators, allowing modelling of behaviour and comfort. An important consideration that has been highlighted, regardless of the source of the data, is data privacy and restrictions that may apply due to regulations.
Regarding data transmission, several protocols have been identified and their usage will depend on the technologies being used and the application. Data format, data size, transmission distance and transmission rates have a determining effect on which protocols to use and how to integrate the data being sent. In data storage, moving from traditional structured data storage, such as RDBMS, to unstructured and semi-structured data storage, such as NoSQL and NewSQL, has been identified as the leading trend. In addition, it has been identified that there is a lack of focus on irrelevant data filtering and deletion, which might help to reduce cost and processing power in applications where there are economical or storage constraints.
In general, this research has identified several challenges in literature. Challenges involve possible errors in the collected data, which may lead to inaccurate measurements, as well as the handling of varied sampling frequencies and their impact on the transmission technologies used. Furthermore, challenges regarding heterogeneity of data have been identified, where the integration of varied data sources could represent a challenge during data storage, processing, and visualization, resulting in incorrect analysis of data or complexity in understanding the data obtained during the data life cycle. Finally, cybersecurity and data privacy have been identified as important challenges, as several studies have lacked attention in this regard.
Future work will focus on developing a consolidated framework and methodology for the big-data life cycle. Based on the findings of this review, it is expected that this work will serve as a basis for future frameworks for the big-data life cycle on the shop floor.
JOSÉ JOAQUÍN PERALTA ABADÍA received the B.Sc. degree in computer science and the M.B.A. degree from Universidad de Costa Rica, Costa Rica, and the M.Sc. degree in artificial intelligence from Universidad Politécnica de Madrid, Spain. He is currently pursuing the Ph.D. degree with Mondragon Unibertsitatea, Spain, funded by the H2020 DIMAND Project. He is also a researcher in artificial intelligence, computer science, and business administration. His research interests include artificial intelligence applied in manufacturing for process optimization.
ANGELA CARRERA-RIVERA received the B.Sc. degree in information systems from Escuela Superior Politecnica del Litoral, Ecuador, and the M.Sc. degree in information technology engineering from the University of Melbourne, Australia. She is currently pursuing the Ph.D. degree with the Faculty of Computer Science, Mondragon University.
AGAJAN TORAYEV received the Diploma degree in applied mathematics and informatics from Magtymguly Turkmen State University, and the Master of Science degree in computer, specializing in intelligent systems, machine learning, and deep learning from the University of Bonn. He is currently pursuing the Ph.D. degree in manufacturing engineering with the University of Nottingham. He is a Researcher with the University of Nottingham on the H2020 DIMAND Project. He is also a researcher in applied mathematics, informatics, and computer science. His current research interests include optimal manufacturing configurations selection for rapidly changing requirements.
HAMOOD UR REHMAN received the B.E. degree in industrial and manufacturing engineering from the NED University of Engineering and Technology, Pakistan, and the M.Sc. degree in mechanical engineering modeling from the Budapest University of Technology and Economics. He is currently pursuing the Ph.D. degree with the University of Nottingham, U.K. He was with various organizations focused on manufacturing, before joining the master's degree. He was a Lecturer with NED University, after his master's studies. He is with TQC Ltd. (Automation and Test Solutions) as a Robotics and Control Systems Engineer. His research interests include robotics, automation, digital manufacturing, and self-configuration in smart manufacturing systems.
FAN MO received the bachelor's degree in engineering from Tongji University, Shanghai, China, and the master's degree from the University of Stuttgart, Germany. He is currently pursuing the Ph.D. degree with the Institute of Advanced Manufacturing, University of Nottingham, U.K. He can speak English and German fluently. During and after his studies, he gained practical experience through internships and full-time positions with BMW, Daimler, and Volkswagen in Germany and China. He is currently a Marie Curie Researcher supported by the Horizon 2020 DiManD Project funded by the European Union. His research interests include robotics, knowledge graph, artificial intelligence, and multiagent programming.
SANAZ NIKGHADAM-HOJJATI (Member, IEEE) received the Ph.D. degree in information technology management (business intelligence) from I.A.U., in 2017. She was a Postdoctoral Researcher with the Nova School of Science and Technology, Nova University of Lisbon, from 2018 to 2019, where she is a Senior Researcher with UNINOVA Institute. In addition, she has worked as a university-invited professor, and also she is the Director of the Women in Science, Technology, Engineering, and Mathematics (WoSTEM) Program, UNINOVA. She has published several books and academic articles in a number of peer-reviewed journals and presented various academic papers at conferences. Her research interests include computational creativity, affective computing, business intelligence, human behavior, emerging technologies, ICT, and innovation management. She has led and participated in several European Union projects, and Portuguese and Iranian National projects.
JOSÉ BARATA (Member, IEEE) received the Ph.D. degree in robotics and integrated manufacturing from the NOVA University of Lisbon, in 2004. He is a Professor with the Department of Electrical Engineering, NOVA University of Lisbon, and a Senior Researcher with the UNINOVA-Instituto de Desenvolvimento de Novas Tecnologias. He has participated in more than 15 international research projects involving different programs, including NMP, IST, ITEA, and ESPRIT. Since 2004, he has been leading the UNINOVA participation in EU projects, namely, EUPASS, self-learning, IDEAS, PRIME, RIVER-WATCH, ROBO-PARTNER, and PROSECO. In the last years, he has participated actively in researching SOA-based approaches for the implementation of intelligent manufacturing devices, such as within the Inlife Project. He has authored or coauthored over 100 original articles in international journals and international conferences. His main research interests include intelligent manufacturing, with an emphasis on complex adaptive systems, involving intelligent manufacturing devices. He is a member of the IEEE Technical Committee on Industrial Agents (IES), the Self-Organization and Cybernetics for Informatics (SMC), and the Education in Engineering and Industrial Technologies (IES).